
The masked variants of most operations are a killer AVX-512 feature for me. Vectorised conditional execution was/is the last piece of the puzzle.

It baffles me that clang in particular disregards them. Clang's intrinsics and builtins generally lower to the unmasked forms and fake masking with subsequent combining operations. This always benchmarks slower in my loops, and often demands an extra register. I haven't delved deeply, but it feels like either the cost model is mispredicting a potential k-register bottleneck, or it doesn't know about the masked AVX-512 instructions at all. GCC, in comparison, does use them, but it falls down (on my code at least) in needing more explicit vectorisation than clang.
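For concreteness, here's a minimal sketch (function name is my own) of the kind of conditional loop where this difference shows up: with AVX-512 enabled, a compiler can either emit a masked add directly, or an unmasked add followed by a separate blend/merge step, which is the codegen pattern being complained about.

```c
#include <stddef.h>

/* Conditionally accumulate: with e.g. -O3 -mavx512f a compiler may
 * vectorize this as one masked vaddps per chunk, or as an unmasked
 * add plus a blend to merge the unchanged lanes back in. */
void add_positive(float *restrict y, const float *restrict x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (x[i] > 0.0f)
            y[i] += x[i];
    }
}
```

Feeding variants of this loop to both compilers and diffing the assembly is an easy way to see which strategy each one picks.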



> This always benchmarks slower in loops

Really? A masked op is a forced read-write dependency on the destination register. That makes sense for cores with limited superscalar width, but for ops with >1 cycle latency or >1/cycle throughput, chained masks are likely to inhibit ILP and be slower...


A big benefit comes from having branchless code. For example, if you have an if/else statement where the then-branch acts on some elements of the vector and the else-branch acts on the others, you can perform them all with no branch: take the mask resulting from the condition for the then-branch's instructions, then complement the mask and apply the else-branch's instructions to the same registers. This also gives predictable performance, because all instructions from both branches are executed every time, and there are no branch prediction misses to worry about. It's very useful for timing-sensitive code (cryptography), and situations where you want a measurable WCET.
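The mask-and-complement trick above is the same one used in scalar constant-time code; a minimal scalar sketch (function name is illustrative, not from any library):

```c
#include <stdint.h>

/* Branchless select: build an all-ones/all-zeros mask from the
 * condition, keep then_v under the mask and else_v under its
 * complement. No branch, so execution time is data-independent. */
uint32_t select_branchless(uint32_t cond, uint32_t then_v, uint32_t else_v) {
    uint32_t mask = (uint32_t)0 - (cond != 0); /* all 1s if cond, else all 0s */
    return (then_v & mask) | (else_v & ~mask);
}
```

The AVX-512 version does the same thing per lane, with the k-register playing the role of `mask`.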


Masked instructions vs subsequent merging are both branchless and have no implicit data-dependent timing relative to each other.


Even masked, vectorised, branchless code may be susceptible to side-channel attacks based on power consumption, and I wouldn't rule out timing attacks if you can somehow get subnormal or exceptional values loaded. Is it a joy to code in this style? Perhaps. Is it a silver bullet? It is not.


Mask with zeroing would solve that. The EVEX prefix supports both merge and zero masking.
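A per-lane scalar model of the two EVEX masking modes being discussed (function names are illustrative): merge masking keeps the old destination value in inactive lanes, which is the read-write dependency on the destination; zero masking writes zero instead, which is what breaks that dependency chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Merge masking: inactive lanes keep the old dst value,
 * so the op must read dst before writing it. */
void add_merge_masked(int32_t *dst, const int32_t *a, const int32_t *b,
                      const uint8_t *k, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        dst[i] = k[i] ? a[i] + b[i] : dst[i];
}

/* Zero masking: inactive lanes become 0, so the old dst
 * contents are never read. */
void add_zero_masked(int32_t *dst, const int32_t *a, const int32_t *b,
                     const uint8_t *k, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        dst[i] = k[i] ? a[i] + b[i] : 0;
}
```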


All that solves is changing the merging instruction from a masked merge to a maskless OR.


Zero masking breaks the dependency chain on the destination; all masked-out lanes are set to zero. What do you mean by a "maskless OR"?


Zero masking doesn't merge. If you're discarding lanes with a separate merge op, it doesn't matter what the discarded value was.


I think GP's point is that zero-masked ops prevent/break false dependencies on the destination register, and that this becomes a more useful tool the more conditionally-executed-by-masking vectorized code you have in an algorithm body. It may also (caveat reader: I am now speculating) be a reason why AVX-512 came with so many damn registers: they're super useful for intermediate/partial results.

Unfortunately the SysV ABI interferes with compilers allocating the upper SIMD registers, since they're all call-clobbered. This motivates bigger functions: almost all my intentionally vectorized/vectorizable code is declared inline, and very occasionally I've resorted, reluctantly, to inline asm. Whether the ABI design is actually a mistake, and how/whether it might be remediated, remains a matter of opinion.

Digression:

The consequence of all this is that there's often More Than One Way To Do It. No matter how much mechanical sympathy you might innately possess, that still means punching lots of variants of your code into uiCA/IACA et al. to paint anything like a decent picture of the bottlenecks, as well as doing your damnedest to ensure that any benchmarking of loops/computation you perform during development actually corresponds to real execution. The holy grail, viz. writing C or another HLL that auto-vectorizes well on more than one compiler and more than one architecture (because you wanted to support NEON too, right?), becomes a near-bottomless programmer time sink.

There are real benefits to be had, but given the additional time-investment required to obtain those benefits, it's little wonder that AVX-512 is shortchanged on intentional adoption, and that's even before Intel started crippling Alder Lake. In the long run, only greater strides in compiler auto-vectorization capabilities will fix this for everyday code.


If it's a false dependency, then it doesn't matter what's in the inactive lanes, and a simple unpredicated instruction will break the dependency just as well as a zero-masking one. Which is exactly what you said the compiler already does.

AVX-512 has 32 registers because the Pentium core Larrabee was developed against was in-order. In a real sense, the P5 core dictated much of AVX-512's design.

There isn't a useful way to define a general ABI with callee-saved vector registers without saying something like "only bits [127:0] are saved".


I actually reached for these (_mm*_cmpeq_epi8_mask) this morning in my Rust code, only to find that they are still `unstable`-only and therefore unavailable to me, along with so many other SIMD things in Rust.

Portable SIMD aside (which has been sitting unstable and unavailable forever), the actual intrinsics, I feel, should not be. Quite frustrating, and together with the still-missing allocator_api it sometimes makes me feel like 'reverting' back to C++.

https://doc.rust-lang.org/stable/core/arch/x86_64/fn._mm256_...



