
camel-cdr · 2026-05-25T12:51:56 1779713516

> Simplicity in the CPU hardware may reduce the probability of hardware bugs, but it increases the probability of software bugs, because the missing hardware features must be implemented at a much greater cost in software, like in the case with the missing integer overflow detection of RISC-V, which causes most RISC-V programs to omit overflow checks, increasing the chances of undetected bugs.

Since I've got a SpacemiT K3 board myself now, I thought I'd test it again:

I compiled microjs with both tinycc and chibicc, which were both compiled for the target platform with and without -ftrapv:

    Slowdown Zen1: tinycc: 1.34%, chibicc: -0.3% (slight speedup somehow?)
    Slowdown X100: tinycc:  0.1%, chibicc:  3.4%

Last time I did full clang: https://news.ycombinator.com/item?id=47328214#47342362 And there was minimal slowdown (sometimes speedup) on x86, Arm and RISC-V. It was pointed out that llvm mostly uses size_t, however chibicc and tinycc use int as their default type, so there should be lots of overflow checking.
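To make the measured cost concrete, here is a minimal sketch of what an overflow check on a single `int` addition amounts to. The function names are hypothetical, and `__builtin_add_overflow` is a GCC/Clang builtin, so its availability in other compilers is an assumption. On an ISA without an overflow flag, the check typically becomes a few extra compare and branch instructions per addition.

```c
#include <limits.h>
#include <stdbool.h>

/* Portable check: tests the operands before adding, so the signed
   overflow (undefined behaviour in C) never actually happens.
   Returns true and stores the sum on success, false on overflow. */
bool checked_add_portable(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}

/* Same contract, using the GCC/Clang builtin. The builtin itself
   returns true when the addition overflowed, hence the negation. */
bool checked_add_builtin(int a, int b, int *out)
{
    return !__builtin_add_overflow(a, b, out);
}
```

With -ftrapv the failure path would abort instead of returning false, but the comparison work per operation is similar.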

camel-cdr · 2026-05-24T13:01:57 1779627717

Porting this optimization to RISC-V Vector is pretty trivial.

camel-cdr · 2026-05-18T16:24:36 1779121476

> From a bystanderʼs POV it is excessively hard to memorize all the mess with multiple different extensions

It's the same for other ISAs.

> What Iʼm slightly confused for is that all these extensions, useful for a minor part of applications, arenʼt moved to longer instructions (6-byte).

Because these instructions don't need it. There will be future >4-byte instructions, for things that can't reasonably be done in 4 bytes, e.g. much larger immediates.

pclmulqdq · 2026-05-20T09:30:53 1779269453

It's way worse on RISC-V. There are maybe 5 x86 or ARM variants to care about at any given time, even if you want to hyper-optimize your code. RISC-V has a soup of literally 100s of extensions with non-uniform use and support.

camel-cdr · 2026-05-20T11:59:14 1779278354

There are a lot more ARM extensions than people are aware of. E.g. Debian uses ARMv8-A with FEAT_FP and FEAT_AdvSIMD as a base. Yes, floating-point and SIMD are optional in ARMv8-A, as are the following ISA extensions, only including ones that add instructions and excluding the AArch32 stuff: FEAT_CRC32, FEAT_AES, FEAT_PMULL, FEAT_SHA1, FEAT_SHA256, FEAT_RDM, FEAT_F32MM, FEAT_F64MM, FEAT_I8MM, FEAT_LSMAOC, FEAT_SHA3, FEAT_SHA512, FEAT_SM3, FEAT_SM4, FEAT_SVE, FEAT_EPAC, FEAT_FCMA, FEAT_JSCVT, FEAT_LRCPC, FEAT_DotProd, FEAT_FHM, FEAT_FlagM, FEAT_LRCPC2, FEAT_BTI, FEAT_FRINTTS, FEAT_FlagM2, FEAT_MTE, FEAT_MTE2, FEAT_RNG, FEAT_SB, FEAT_BF16, FEAT_DGH, FEAT_EBF16, FEAT_CSSC, ...

Also fun: FEAT_LittleEnd, FEAT_MixedEnd, FEAT_BigEnd

All of that was just 64-bit ARMv8.x-A; there is a lot more once you go to the R or M profiles, 32-bit, and previous versions.

The reason this is mostly not a problem is that distros converged on a minimum of 64-bit ARMv8-A + FP + SIMD, which will also happen with RVA23 for RISC-V.

Just for fun, here are the Zen4 ISA flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm

Compared to RVA23 written out: rv64imafdcbv_zicsr_zicntr_zihpm_ziccif_ziccrse_ziccamoa_zicclsm_zic64b_za64rs_zihintpause_zba_zbb_zbs_zicbom_zicbop_zicboz_zfhmin_zkt_zvfhmin_zvbb_zvkt_zihintntl_zicond_zimop_zcmop_zcb_zfa_zawrs_svbare_svade_ssccptr_sstvecd_sstvala_sscounterenw_svpbmt_svinval_svnapot_sstc_sscofpmf_ssnpm_ssu64xl_sha_supm_zifencei

pclmulqdq · 2026-05-20T14:07:20 1779286040

I will note that you listed out all of the RVA23 instruction extensions, not all of the blessed RISC-V instruction set extensions. Here's the list of every ratified RISC-V instruction set extension, to get parity with the list you gave for the other ISAs:

M, A, F, D, Q, C, B, H, Zicsr, Zifencei, Zicntr, Zihpm, Zihintpause, Zihintntl, Zicbom, Zicbop, Zicboz, Zicond, Zicfilp, Zicfiss, Zimop, Zca, Zcb, Zcd, Zce, Zcf, Zcmp, Zcmt, Zcmop, Zclsd, Zilsd, Zmmul, Zfh, Zfhmin, Zfa, Zfbfmin, Zfinx, Zdinx, Zhinx, Zhinxmin, Zaamo, Zalrsc, Zawrs, Zacas, Zabha, Zalasr, Zba, Zbb, Zbc, Zbs, Ztso, Zbkb, Zbkc, Zbkx, Zknd, Zkne, Zknh, Zksed, Zksh, Zkn, Zks, Zkt, Zk, Zkr, Zve32x, Zve32f, Zve64x, Zve64f, Zve64d, Zve, Zvl32b, Zvl64b, Zvl128b, Zvl256b, Zvl512b, Zvl1024b, Zvl, Zv, Zvfh, Zvfhmin, Zvfbfmin, Zvfbfwma, Zvbb, Zvbc, Zvkb, Zvkg, Zvkn, Zvknc, Zvkned, Zvkng, Zvknha, Zvknhb, Zvks, Zvksc, Zvksed, Zvksg, Zvksh, Zvkt, Sm1p11, Sm1p12, Sm1p13, Smaia, Smepmp, Smstateen, Smcdeleg, Smcsrind, Smcntrpmf, Smrnmi, Smdbltrp, Smmpm, Smnpm, Smctr, Ss1p11, Ss1p12, Ss1p13, Ssaia, Ssccfg, Sscsrind, Sscofpmf, Sstc, Ssqosid, Ssdbltrp, Ssnpm, Sspm, Ssctr, Supm, Sv32, Sv39, Sv48, Sv57, Svinval, Svnapot, Svpbmt, Svadu, Svvptc, Svrsw60t59b, Sdext, Sdtrig

That doesn't look very short to me.

These are grouped into profiles, like "Skylake" or "Cortex-M33" or "Neoverse-N1." The main issue for RISC-V isn't the number of instruction set extensions, it's the number of profiles. RVA23 is one single blessed profile, but many chips will add a few more instructions or include fewer than RVA23 based on age of the chip.

RetroTechie · 2026-05-20T19:09:28 1779304168

Common Linux distros will target one of the profiles, or a commonly supported subset like RV64GC.

Beyond that, which other extensions a particular board or chip supports doesn't affect regular uses like web browsing. Specific apps or software libraries may use an ISA extension if present. Same as for other ISAs.

Code for embedded systems is optimized for the exact cpu in there. Same thing for highly specialized jobs (scientific / datacenter type stuff).

In short: yes, fragmentation wrt ISA extensions, hardware & software support exists. In practice, it isn't as big a problem as some claim it to be.

dwattttt · 2026-05-20T12:23:17 1779279797

That sure is a long list. But written out like that it gets a bit misleading: does there exist anything with that same list, just missing pae? mmx? syscall? Just because they have individual names & flags, doesn't mean every combination of them exists.

jcranmer · 2026-05-20T15:02:58 1779289378

The Intel manuals list the set of features that are removed or planned to be removed from newer hardware versions: Sub-page write permissions for EPT, xAPIC mode, Key Locker, Uncore PMI. IA32_DEBUGCTL MSR, bit 13 (MSR address 1D9H), Intel® Memory Protection Extensions (Intel® MPX), MSR_TEST_CTRL, bit 31 (MSR address 33H), Hardware Lock Elision (HLE), VP2INTERSECT. AMD's manuals suggest that they view the ISA as purely additive, but I haven't read them in detail.

Basically, outside of MPX, and the confusing lineage of AVX-512 on client versus server parts, x86 is pretty strictly additive.

benj111 · 2026-05-20T10:33:26 1779273206

What are you imagining? If this is desktop then most of the extensions are going to be standard.

The only reason they're optional is because I'm using the same instruction set on my Pico, so no it doesn't have floating point, and I believe it has integer divide but I wouldn't be surprised if it didn't.

And the extensions are in groups, a good chunk of which are compressed instructions, which unless you're writing assembly, you don't need to worry about.

In fact, most of this you don't need to worry about unless you're writing assembly.

imtringued · 2026-05-20T10:57:49 1779274669

Electronics distributors' search engines tend to work extremely poorly, and if you try to overload them with an absurd variety of niche extensions, then nobody is going to find the right RISC-V MCU for their needs.

addaon · 2026-05-20T16:17:13 1779293833

> There are maybe 5 x86 or ARM variants to care about at any given time

What? There are individual chips with nearly that many ARM variants, including incompatible ISAs (M0 vs R52) and compatible-but-very-different-performance-characteristics implementations of the same ISA (M4 vs M7, say). Even figuring out what portion of code can be shared across which cores (and for those that distinguish between ARM and Thumb mode, what mode that code can be called in), vs what code needs duplicate versions for different cores for correctness, vs what code needs duplicate versions for performance but not correctness (which changes as the code usage pattern evolves) can be a challenge on a single chip; I can't imagine a world where you can think about only five across an entire industry.

wg0 · 2026-05-20T10:26:00 1779272760

> It's the same for other ISAs.

No they are not. See the Intel Software Programmer Volumes. Highly detailed, highly structured and highly specific.

Joker_vD · 2026-05-20T10:31:51 1779273111

You're joking, right?

wg0 · 2026-05-20T13:26:57 1779283617

No. Because I read about Intel in detail for a long time. Those volumes are part of my digital library.

Tried finding similar-quality documentation for ARM and RISC-V and came up empty-handed.

camel-cdr · 2026-05-17T15:40:27 1779032427

> Also, let's stop with the "vector length agnostic" types being the sole option for SVE extensions

They aren't, see the `arm_sve_vector_bits` attribute.

> I'm fine with recompiling my code, I do it every day

Then you can do that.

> If I have an algorithm that's truly vector length agnostic, I can make the vector length a constant in my code that can change based on the compile target.

You can do that, but why not simply write it in a vector-length-agnostic way?

IMO the better approach is to start thinking about SIMD optimizations in a VLA way, and specialize on the vector length when that becomes advantageous. Doing it this way is better even if you end up not writing VLA code, because you thought about the scalability problem.

Many libraries currently don't scale beyond 128-bit, not because they couldn't make efficient use of >128-bit, but because the library was architected around 128-bit and changing that amounts to almost a full rewrite. So now you are stuck wasting 3/4 of your ALUs running 128-bit SSE on Zen5.
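As an illustration of "thinking in a VLA way", here is a minimal plain-C sketch of a strip-mined loop. The function name and the `lanes` parameter are hypothetical: `lanes` stands in for whatever the hardware reports at run time (for example via SVE's `svcntw()` or RVV's `vsetvl`), and the inner loop models a single vector instruction. Nothing in the loop body assumes a particular width.

```c
#include <stddef.h>

/* y[i] += a * x[i], written without assuming any vector length.
   `lanes` is a hypothetical stand-in for the number of 32-bit lanes
   one vector operation processes; it must be greater than zero. */
void saxpy_vla(size_t n, float a, const float *x, float *y, size_t lanes)
{
    for (size_t i = 0; i < n; i += lanes) {
        /* The last strip may be partial. Real VLA ISAs handle this
           with predication (SVE) or a shortened vl (RVV). */
        size_t vl = (n - i < lanes) ? n - i : lanes;

        /* Models one vector instruction operating on vl lanes. */
        for (size_t j = 0; j < vl; j++)
            y[i + j] += a * x[i + j];
    }
}
```

Calling it with `lanes` set to 4 or to 16 produces the same result, which is the property that lets the same code scale from 128-bit to 512-bit hardware. Specializing on a fixed width later only means turning `lanes` into a compile-time constant.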

camel-cdr · 2026-05-17T14:37:07 1779028627

SIMD wider than 512 bits isn't relevant for regular general-purpose processors, either now or in the near future.

But for smaller more specialized CPUs in embedded or automotive usecases you can get more parallel compute, while keeping the software model simpler than having to dispatch to a GPU.

Specifically a design like https://saturn-vectors.org/#_short_vector_execution, which likes to use vectors 2x or 4x wider than the datapath length for more efficient chaining. I quite like that design, because you can get high utilization and limited out-of-order execution without vector register renaming.

camel-cdr · 2026-05-17T10:46:05 1779014765

On GPUs, GLSL-like types compile down to what is basically variable-length SIMD. A vec4 doesn't get compiled to a SIMD vector with four floats, but rather to four SIMD vectors, each containing N FP32 elements (usually 32 or 64).

Look at what this simple shader compiles down to on RGA: https://godbolt.org/z/4GrfY61vf
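The layout described above can be modelled in plain C. This is only a sketch: the type and function names are hypothetical, and `WAVE` is an assumed lane count of 32. Each vec4 component gets its own WAVE-wide array (one lane per shader invocation), so a dot product becomes a few per-component operations across all lanes rather than one operation on a four-float vector.

```c
#include <stddef.h>

#define WAVE 32  /* assumed lanes per hardware vector (32 or 64 are typical) */

/* What the shader author writes: one vec4 per invocation. */
typedef struct { float x, y, z, w; } vec4;

/* What the hardware effectively holds: each component is its own
   WAVE-wide vector register, with one lane per invocation. */
typedef struct { float x[WAVE], y[WAVE], z[WAVE], w[WAVE]; } vec4_wave;

/* dot(a, b) for all WAVE invocations at once. Each term of the sum
   corresponds to one vector operation across every lane. */
void dot4_wave(const vec4_wave *a, const vec4_wave *b, float out[WAVE])
{
    for (size_t lane = 0; lane < WAVE; lane++)
        out[lane] = a->x[lane] * b->x[lane] + a->y[lane] * b->y[lane]
                  + a->z[lane] * b->z[lane] + a->w[lane] * b->w[lane];
}
```

Changing `WAVE` leaves the shader-level `vec4` code untouched, which is why this model behaves like variable-length SIMD.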

whizzter · 2026-05-18T08:02:43 1779091363

Right, and AVX512 would thus be more relevant if ISPC-like features were mainstreamed in CPU-bound C++ compilers.

camel-cdr · 2026-05-17T10:34:25 1779014065

Looks like that isn't a portable SIMD abstraction, but more similar to adding architecture-specific SIMD intrinsics support to Go, with nicer syntax.

meling · 2026-05-17T14:05:16 1779026716

Sorry, I didn’t explicitly link to the issue for the portable layer.

Here is the issue discussing the portable simd package: https://github.com/golang/go/issues/78902

camel-cdr · 2026-05-17T10:29:44 1779013784

> This will take decades because you cannot change existing architectures/processors.

I think once AVX-512, SVE and RVV are widespread enough, you'll have a rather powerful baseline you can target. But this will take a lot of time.

SkiFire13 · 2026-05-17T10:50:52 1779015052

> AVX-512

Which subset though? Some of them are not supported by some recent CPUs (e.g. 2024).

Not to mention Alder Lake not supporting AVX512.

sgerenser · 2026-05-17T12:05:23 1779019523

Yeah AVX-512 is basically dead as a universal target for x86, the future is now AVX-10. But I believe there is a reasonable subset that will work on both.

Remnant44 · 2026-05-17T18:13:48 1779041628

It's a little dramatic to say avx512 is dead versus 10 - rather, I would say that avx10 finalizes a universally available set of avx512 extensions. For AVX 10.1, there's essentially no difference after Intel backed out of reducing the vector length.

For at least the next decade AVX 512 will be the high performance target, reaching all of the zen4/5/6 CPUs as well as whatever avx-10 enabled CPUs Intel produces.

camel-cdr · 2026-05-17T07:55:11 1779004511

Here is a highway example: https://gcc.godbolt.org/z/7sdPr61W6

There is a bit of boilerplate to get dynamic dispatch working, but apart from that it's quite simple to use.
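For readers unfamiliar with the pattern, here is a generic sketch of what dynamic dispatch boils down to: pick an implementation once, based on a runtime feature probe, and call it through a function pointer. This is not Highway's API (Highway generates the equivalent with its own macros). Every name below is hypothetical, and the "wide" variant is an unrolled scalar loop standing in for a SIMD-specialized build.

```c
#include <stddef.h>

typedef float (*sum_fn)(const float *, size_t);

/* Baseline implementation that runs everywhere. */
static float sum_scalar(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Stand-in for a SIMD-specialized build: four independent
   accumulators, then a scalar tail for the leftover elements. */
static float sum_wide(const float *v, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];     s1 += v[i + 1];
        s2 += v[i + 2]; s3 += v[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        s += v[i];
    return s;
}

/* `has_wide_simd` is a hypothetical result of a CPU feature probe.
   The selection is done once; callers keep the returned pointer. */
sum_fn select_sum(int has_wide_simd)
{
    return has_wide_simd ? sum_wide : sum_scalar;
}
```

A caller would do `sum_fn sum = select_sum(probe_result);` at startup and use `sum(data, n)` afterwards, so the per-call overhead is one indirect call.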

camel-cdr · 2026-05-17T07:41:39 1779003699

So you "just" write 4 assembly implementations?