No, it got lucky because it's emitting exactly what the author wrote; there's nothing complex going on. In fact, I'm sure the author is stunting where it got smart and hoisted val1+val2+val3+val4+val5+val6+val7 outside of the loop, and it can't do anything else (or even emit the hand-written version) because it would violate C++'s aliasing rules or the non-associativity of floating point.
The compiler was not explicitly told to use a temporary variable, and could have just as easily and legally put resVal in a register (if there had been another register available). The unconditional overwrites of a register's contents (the movsd instructions) that enable register renaming are not implied or required in any way by the high-level code. You can say that it's simple and obvious, but that's your level of experience getting in the way. The high-level code prescribes nothing but additions, and the smart compiler throws in 20% other instructions that make unnecessary memory accesses and comes out ahead due to register renaming and a sufficiently large reorder window. That's definitely much subtler than the still-largely-valid guidelines like preferring to use registers over memory, and using fewer instructions overall to be faster.
No, the compiler couldn't have kept resVal in a register because it doesn't know resVal doesn't alias with any of val1..val7. Therefore, it must store resVal each iteration, and it must load each of val1..val7 each iteration. Otherwise, it would violate the C++ standard.
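To make the aliasing point concrete, here's a minimal sketch. The function names and the reference parameters are my assumption about the general shape of such code, not the original benchmark, and I've cut it down to two values to keep it short.

```cpp
// Hypothetical stand-in for the loop under discussion. Everything is
// reached through references, so the compiler cannot prove that res is
// a different object from val1 or val2.
void accumulate(double& res, const double& val1, const double& val2, int n) {
    for (int i = 0; i < n; ++i) {
        res += val1 + val2;  // val1 and val2 have to be re-read each iteration
    }
}

// What hoisting would compute: the sum is taken once, before the loop.
void accumulate_hoisted(double& res, const double& val1, const double& val2, int n) {
    const double sum = val1 + val2;
    for (int i = 0; i < n; ++i) {
        res += sum;
    }
}
```

If res and val1 are the same object, starting at 1.0 with val2 = 1.0 and n = 3, accumulate goes 3, 7, 15 while accumulate_hoisted goes 3, 5, 7. Since a caller is allowed to do that, the compiler can't swap one for the other on its own.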
Well, I'm assuming the programmer wrote it like that, because otherwise a smart compiler would have calculated val1+...+val7 before the loop and just added that precalculated value repeatedly.
Also, it wouldn't matter if it did keep resVal (and val1..val7) in registers; it would be every bit as fast as it is now.
Similarly, the compiler has no leeway about the order of floating point additions. If the C++ were written to produce output identical to the asm's, the compiler could not be faster than the asm without violating the C++ standard.
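A quick sketch of why the order is off-limits. The two helper functions are made up for illustration, and this assumes ordinary IEEE doubles with default rounding and no -ffast-math style flags.

```cpp
// Same three values, two groupings.
double left_first(double a, double b, double c) {
    return (a + b) + c;
}

double right_first(double a, double b, double c) {
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, c = 1.0, left_first gives 1.0 (the big values cancel first), while right_first gives 0.0 (the 1.0 is lost when it's added to -1e20). A compiler that regrouped the additions by itself would change the program's output.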
Again, OoOE plays absolutely positively no part whatsoever in making the asm slow. Neither does which values are in registers or memory.
It's slow because a chain of instructions that use the output of the previous as input cannot execute in fewer cycles than (latency)*(num instructions). Period.
It's the sort of thing that should have been written by the programmer as something like
res += (val1+val2)+(val3+val4);
res += (val5+val6)+(val7);
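Here's a side-by-side sketch of the two shapes. The array parameter and the function names are mine, and chained is only my guess at what a single long dependency chain looks like in source form, not a transcription of the author's asm.

```cpp
// One long chain: every add needs the result of the add before it,
// so seven adds cost at least seven add latencies back to back.
double chained(double res, const double (&v)[7]) {
    res += v[0];
    res += v[1];
    res += v[2];
    res += v[3];
    res += v[4];
    res += v[5];
    res += v[6];
    return res;
}

// The regrouping suggested above: the parenthesised sums don't depend
// on res, so only two adds per call sit on the chain through res.
double regrouped(double res, const double (&v)[7]) {
    res += (v[0] + v[1]) + (v[2] + v[3]);
    res += (v[4] + v[5]) + v[6];
    return res;
}
```

For small whole numbers both give the same answer (1 through 7 starting from 0 gives 28 either way). For arbitrary doubles the last bits can differ, which is exactly why the programmer has to write the grouping and the compiler can't pick it.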
In effect, the compiler was asked to compile
whereas the author wrote asm doing
Which isn't the same thing. And if the author unrolled it with a hundred registers, it would still be every bit as slow.
And it should be obvious that it's slow not because it's messing up OoOE, it's slow because it's one incredibly long dependency chain with a stall between each add, due to it depending on the result of the previous add. OoOE happens to be able to fix the programmer's mistake for the first case, which is luck because the programmer obviously didn't intend for it to do so.
But now I'm repeating myself...