Huge block size (128 bytes). They are probably using Power7-like scheduling, i.e. the scheduler works on groups of instructions. That might explain the humorous 600+ entry ROB: the wake-up logic certainly can't handle that many entries one by one on such a low power budget. If you combine that with JIT and/or good compilers, you get this. I guess only Apple can pull off this trick: they control the whole stack (and some key power architects are working there).
Big cache lines and big pages together. 16 kB pages combined with 128-byte lines mean the L1 can be 8-way set associative and still take advantage of a VIPT structure.
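A rough sketch of the arithmetic behind that, in Python. It assumes the usual rule for an alias-free VIPT cache: the set index plus line offset must fit inside the page offset, so each way can span at most one page. The function names are made up for illustration.

```python
def max_vipt_cache_bytes(page_bytes: int, ways: int) -> int:
    """Largest alias-free VIPT cache: each way may span at most one page."""
    return page_bytes * ways


def sets_per_way(page_bytes: int, line_bytes: int) -> int:
    """Number of sets when one way covers exactly one page."""
    return page_bytes // line_bytes


# 16 kB pages with 8 ways allow up to 128 kB; 4 kB pages with 8 ways only 32 kB.
print(max_vipt_cache_bytes(16 * 1024, 8) // 1024)  # 128
print(max_vipt_cache_bytes(4 * 1024, 8) // 1024)   # 32
# With 128-byte lines, a 16 kB way holds 128 sets (7 index bits + 7 offset bits).
print(sets_per_way(16 * 1024, 128))                # 128
```

Note that the line size only sets how the 14 page-offset bits are split between index and offset; the size ceiling itself comes from page size times associativity.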
Larger pages mean that performance on memory-mapped small files will suffer... which is a use-case that Apple doesn't normally care about in its client computers.
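To put a rough number on one part of that cost (memory footprint only; fault and I/O granularity are a separate matter), here is a small Python sketch. It assumes a mapping always occupies whole pages, and `mapped_bytes` is a hypothetical helper, not an OS API.

```python
def mapped_bytes(file_bytes: int, page_bytes: int) -> int:
    """Memory occupied by mapping a file, rounded up to whole pages."""
    pages = -(-file_bytes // page_bytes)  # ceiling division
    return pages * page_bytes


# A 1000-byte file costs one full page either way, but the page is 4x bigger.
for page in (4 * 1024, 16 * 1024):
    print(page, mapped_bytes(1000, page))
```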
Larger cache lines mean that highly multithreaded server loads could suffer from false sharing more often. Again, it's a client computer, so who cares?
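A minimal Python sketch of why the line size matters here. The addresses are illustrative offsets, and `same_cache_line` is a made-up helper: it only checks whether two byte addresses land in the same line, which is the precondition for false sharing when different threads write to them.

```python
def same_cache_line(addr_a: int, addr_b: int, line_bytes: int) -> bool:
    """True if both byte addresses fall within the same cache line."""
    return addr_a // line_bytes == addr_b // line_bytes


# Two per-thread counters padded to 64 bytes each (offsets 0 and 64).
print(same_cache_line(0, 64, 64))   # False: separate lines with 64-byte lines
print(same_cache_line(0, 64, 128))  # True: they share one 128-byte line
```

So data structures padded for 64-byte lines can start to falsely share once the line grows to 128 bytes.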
Regarding the definition of "huge": A64FX uses 256-byte cache lines. Granted, it's a numerical computing vector machine, but still. "Huge" covers a lot of ground.
This, right here. It also helps that the L1D is a whopping 128 kB and has only 3 cycles of load-use latency.