No, it got lucky because it's emitting exactly what the author wrote; there's no...

wtallis · on March 23, 2013

The compiler was not explicitly told to use a temporary variable, and could just as easily and legally have put resVal in a register (if there had been another register available). The unconditional overwriting of a register's contents (the movsd instructions) that enables register renaming is not implied or required in any way by the high-level code. You can say that it's simple and obvious, but that's your level of experience getting in the way. The high-level code prescribes nothing but additions, and the smart compiler throws in 20% other instructions that make unnecessary memory accesses and comes out ahead due to register renaming and a sufficiently large reorder window. That's definitely much subtler than the still-largely-valid guidelines like preferring registers over memory, and using fewer instructions overall to be faster.

brigade · on March 23, 2013

No, the compiler couldn't have kept resVal in a register because it doesn't know resVal doesn't alias with any of val1..val7. Therefore, it must store resVal each iteration, and it must load each of val1..val7 each iteration. Otherwise, it would violate the C++ standard.
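To make the aliasing point concrete, here is a minimal sketch. It assumes resVal and the vals are reached through pointers, which the thread never actually shows; the function names and the two-operand signature are hypothetical, cut down from val1..val7 for brevity.

```cpp
// Hypothetical signatures: how resVal and val1..val7 are really declared
// is not shown in the thread, so the pointers here are an assumption.

// What the compiler must emit when it cannot rule out aliasing:
// every += stores to *res, and *v1 / *v2 are reloaded afterwards
// because that store may have changed them.
void accumulate(double* res, const double* v1, const double* v2, int iters) {
    for (int i = 0; i < iters; ++i) {
        *res += *v1;
        *res += *v2;
    }
}

// What it could do only if it could prove nothing aliases:
// load the operands once and keep the accumulator in a local
// (i.e. in a register), storing the result once at the end.
void accumulate_hoisted(double* res, const double* v1, const double* v2, int iters) {
    const double a = *v1;
    const double b = *v2;
    double r = *res;
    for (int i = 0; i < iters; ++i) {
        r += a;
        r += b;
    }
    *res = r;
}
```

With distinct objects both versions agree. If res and v1 point at the same double, they diverge after two iterations, which is why the hoisted form is not a legal transformation of the first one in general.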

Well, I'm assuming the programmer wrote it like that, because otherwise a smart compiler would have calculated val1+...+val7 before the loop and just added that precalculated value repeatedly.

Also, it doesn't matter: even if it did keep resVal (and val1..val7) in registers, it would be every bit as fast as it is now.

Similarly, the compiler has no leeway about the order of floating point additions. If the C++ were written to produce the same output as the asm, the compiler could not be faster than the asm without violating the C++ standard.
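The reason the order is fixed is that floating point addition is not associative, so regrouping can change the result. A small sketch (the function names are made up for illustration, and this assumes IEEE 754 doubles with no fast-math style flags):

```cpp
// Same three operands, two groupings.
double left_to_right(double a, double b, double c) {
    return (a + b) + c;
}

double right_first(double a, double b, double c) {
    return a + (b + c);
}
```

Called with (1.0, 1e20, -1e20), the first grouping loses the 1.0 when it is absorbed into 1e20 and returns 0.0, while the second cancels the large terms first and returns 1.0. A compiler that silently switched between the two would change the program's output.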

Again, OoOE plays absolutely positively no part whatsoever in making the asm slow. Neither does which values are in registers or memory.

It's slow because a chain of instructions that use the output of the previous as input cannot execute in fewer cycles than (latency)*(num instructions). Period.
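As back-of-the-envelope arithmetic, the bound looks like this. The 3-cycle add latency is an assumed figure for illustration only (the real number depends on the CPU), and chain_cycles is a hypothetical helper, not anything from the original code.

```cpp
// Lower bound on cycles for a chain in which each instruction
// consumes the previous instruction's result.
constexpr int chain_cycles(int latency, int dependent_ops) {
    return latency * dependent_ops;
}

// Assumed latency of one floating point add, in cycles (illustrative).
constexpr int kAssumedAddLatency = 3;

// Seven additions that each wait on the one before:
constexpr int kSevenAddChain = chain_cycles(kAssumedAddLatency, 7);

static_assert(kSevenAddChain == 21, "7 dependent adds at 3 cycles each");
```

No amount of reordering or renaming gets one iteration under that figure while all seven adds stay on the same chain; only shortening the chain does.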

It's the sort of thing that should have been written by the programmer as something like

   res += (val1+val2)+(val3+val4);
   res += (val5+val6)+(val7);

then we wouldn't be having this discussion...
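For comparison, here is a rough sketch of both forms side by side. The function names, the array parameter, and the exact shape of the serial version are assumptions; the thread only shows the regrouped statements.

```cpp
// Assumed shape of the original: every addition feeds the next one,
// so 7 dependent adds sit on the loop-carried chain through res.
// v must point at 7 doubles.
double serial_sum(const double* v, int iters) {
    double res = 0.0;
    for (int i = 0; i < iters; ++i) {
        res = res + v[0] + v[1] + v[2] + v[3] + v[4] + v[5] + v[6];
    }
    return res;
}

// Regrouped as suggested above: the parenthesized sums do not depend
// on res, so only 2 adds per iteration are on the chain through res.
double regrouped_sum(const double* v, int iters) {
    double res = 0.0;
    for (int i = 0; i < iters; ++i) {
        res += (v[0] + v[1]) + (v[2] + v[3]);
        res += (v[4] + v[5]) + (v[6]);
    }
    return res;
}
```

For small integer-valued inputs the two return the same value. For arbitrary doubles they can differ in the low bits because the additions happen in a different order, which is exactly why the programmer has to write the regrouping and the compiler can't do it on its own.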