
The masked variants of most operations are a killer AVX-512 feature for me. Vectorised conditional execution was/is the last piece of the puzzle.

It baffles me that clang in particular disregards them. Clang's intrinsics and builtins generally lower to the unmasked forms and fake masking with subsequent combining operations. This always benchmarks slower in my loops, and often demands an extra register. I haven't delved deeply, but it feels like either the cost model is mispredicting a potential k-register bottleneck, or it doesn't know about the masked AVX-512 instructions at all. GCC, in comparison, does use them, but it falls down (on my code at least) in needing more explicit vectorisation than clang.
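For concreteness, here's a minimal sketch (function name is my own) of the kind of conditional loop where this difference shows up: with AVX-512 enabled, a compiler can either emit a masked add directly, or an unmasked add followed by a separate blend/merge step, which is the codegen pattern being complained about.

```c
#include <stddef.h>

/* Conditionally accumulate: with e.g. -O3 -mavx512f a compiler may
 * vectorize this as one masked vaddps per chunk, or as an unmasked
 * add plus a blend to merge the unchanged lanes back in. */
void add_positive(float *restrict y, const float *restrict x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (x[i] > 0.0f)
            y[i] += x[i];
    }
}
```

Feeding variants of this loop to both compilers and diffing the assembly is an easy way to see which strategy each one picks.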



> This always benchmarks slower in loops

Really? A masked op is a forced read-write dependency on the destination register. That makes sense for cores with limited superscalar width, but for ops with >1 cycle latency or >1/cycle throughput, chained masks are likely to inhibit ILP and be slower...


A big benefit comes from having branchless code. For example, if you have an if/else statement where the then-branch acts on some elements of the vector and the else-branch acts on the others, you can perform them all with no branch: take the mask resulting from the condition for the then-branch's instructions, then complement the mask and apply the else-branch's instructions to the same registers. This also gives predictable performance, because all instructions from both branches are executed every time, and there are no branch prediction misses to worry about. It's very useful for timing-sensitive code (cryptography), and situations where you want a measurable WCET.
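The mask-and-complement trick above is the same one used in scalar constant-time code; a minimal scalar sketch (function name is illustrative, not from any library):

```c
#include <stdint.h>

/* Branchless select: build an all-ones/all-zeros mask from the
 * condition, keep then_v under the mask and else_v under its
 * complement. No branch, so execution time is data-independent. */
uint32_t select_branchless(uint32_t cond, uint32_t then_v, uint32_t else_v) {
    uint32_t mask = (uint32_t)0 - (cond != 0); /* all 1s if cond, else all 0s */
    return (then_v & mask) | (else_v & ~mask);
}
```

The AVX-512 version does the same thing per lane, with the k-register playing the role of `mask`.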


Masked instructions vs subsequent merging are both branchless and have no implicit data-dependent timing relative to each other.


Even masked, vectorised, branchless code may be susceptible to side-channel attacks based on power consumption, and I wouldn't rule out timing attacks if you can somehow get subnormal or exceptional values loaded. Is it a joy to code in this style? Perhaps. Is it a silver bullet? It is not.


Mask with zeroing would solve that. The EVEX prefix supports both merge and zero masking.
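A per-lane scalar model of the two EVEX masking modes being discussed (function names are illustrative): merge masking keeps the old destination value in inactive lanes, which is the read-write dependency on the destination; zero masking writes zero instead, which is what breaks that dependency chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Merge masking: inactive lanes keep the old dst value,
 * so the op must read dst before writing it. */
void add_merge_masked(int32_t *dst, const int32_t *a, const int32_t *b,
                      const uint8_t *k, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        dst[i] = k[i] ? a[i] + b[i] : dst[i];
}

/* Zero masking: inactive lanes become 0, so the old dst
 * contents are never read. */
void add_zero_masked(int32_t *dst, const int32_t *a, const int32_t *b,
                     const uint8_t *k, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        dst[i] = k[i] ? a[i] + b[i] : 0;
}
```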


All that solves is changing the merging instruction from a masked merge to a maskless OR.


Zero masking breaks the dependency chain on the destination; all masked-out lanes are set to zero. What do you mean by a "maskless OR"?


Zero masking doesn't merge. If you're discarding lanes with a separate merge op, it doesn't matter what the discarded value was.


I think GP's point is that zero-masked ops prevent/break false dependencies on the destination register, and that this becomes a more useful tool the more conditionally-executed-by-masking vectorized code you have in an algorithm body. It may also (caveat reader: I am now speculating) be a reason why AVX-512 came with so many damn registers: they're super useful for intermediate/partial results.

Unfortunately the SysV ABI interferes with compilers allocating the upper SIMD registers, since they're all call-clobbered. This motivates bigger functions: almost all my intentionally vectorized/vectorizable code is declared inline, and very occasionally I've resorted, reluctantly, to inline asm. Whether the ABI design is actually a mistake, and how/whether it might be remediated, remains a matter of opinion.

Digression:

The consequence of all this is that there's often More Than One Way To Do It. No matter how much mechanical sympathy you might innately possess, that still means punching lots of variants of your code into uiCA/IACA et al. to paint anything like a decent picture of the bottlenecks, as well as doing your damnedest to ensure that any benchmarking of loops/computation you perform during development actually corresponds to real execution. The holy grail, viz. writing C or another HLL that auto-vectorizes well on more than one compiler and more than one architecture (because you wanted to support NEON too, right?), becomes a near-bottomless programmer time sink.

There are real benefits to be had, but given the additional time-investment required to obtain those benefits, it's little wonder that AVX-512 is shortchanged on intentional adoption, and that's even before Intel started crippling Alder Lake. In the long run, only greater strides in compiler auto-vectorization capabilities will fix this for everyday code.


If it's a false dependency, then it doesn't matter what's in the inactive lanes, and a simple unpredicated instruction will break the dependency just as well as a zero-masking one. Which is exactly what you said the compiler already does.

AVX-512 has 32 registers because the Pentium core Larrabee was developed against was in-order. In a real sense, the P5 core dictated much of AVX-512's design.

There isn't a useful way to define a general ABI with callee-saved vector registers without saying something like "only bits [127:0] are saved".


I actually reached for these (_mm*_cmpeq_epi8_mask) this morning in my Rust code, only to find that they are still `unstable`-only and therefore unavailable to me, along with so many other SIMD things in Rust.

Portable SIMD aside (which has been sitting unstable and unavailable forever), the actual intrinsics, I feel, should not be. Quite frustrating, and together with the still-missing allocator_api it sometimes makes me feel like 'reverting' back to C++.

https://doc.rust-lang.org/stable/core/arch/x86_64/fn._mm256_...



