> Assuming a random (but TLB friendly) pattern the M1 manages a latency of aroun...

jnwatson · on Nov 30, 2020

This is the first I've heard of this. This alone, plus unified memory in general, I bet explains 60% of the performance difference.

AlphaSite · on Nov 30, 2020

I wonder how they managed that.

xpuente · on Nov 30, 2020

Huge block size (128 bytes). They are probably using Power7-like scheduling, i.e. the scheduler works on packs of instructions. That might explain the humorous 600+ entry ROB; the wake-up logic certainly can't deal with that many entries one by one at such low power. If you combine that with JIT and/or good compilers, you get this. I guess only Apple can pull off this trick: they control the whole stack (and some key power architects are working there).

brandmeyer · on Dec 1, 2020

Big cache lines and big pages together. 16 kB pages combined with 128-byte lines means it can be 8-way set associative and still take advantage of a VIPT structure.
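A rough sketch of the arithmetic behind that claim, in Python. It checks whether the set index plus line offset bits fit inside the page offset, which is the usual condition for a VIPT cache to avoid aliasing. The 128 KiB L1D size is an assumption that is not stated in this comment, and the function names are made up for illustration.

```python
def log2_exact(n):
    """Return log2(n) for a power of two, else raise."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def vipt_alias_free(cache_bytes, ways, line_bytes, page_bytes):
    """True if index + line-offset bits all come from the page offset."""
    sets = cache_bytes // (ways * line_bytes)
    index_bits = log2_exact(sets)
    offset_bits = log2_exact(line_bytes)
    return index_bits + offset_bits <= log2_exact(page_bytes)


def max_alias_free_bytes(ways, page_bytes):
    """Largest cache that satisfies the condition: one page per way."""
    return ways * page_bytes


KIB = 1024
L1D_BYTES = 128 * KIB  # assumed L1D size, not taken from the comment

# 16 kB pages, 128-byte lines, 8 ways: 7 index bits + 7 offset bits = 14 bits
print(vipt_alias_free(L1D_BYTES, 8, 128, 16 * KIB))
# Same cache with 4 kB pages: only 12 page-offset bits are available
print(vipt_alias_free(L1D_BYTES, 8, 128, 4 * KIB))
print(max_alias_free_bytes(8, 16 * KIB) // KIB, "KiB")
```

Under these assumptions, the same 8-way design with 4 kB pages would top out at 32 KiB before needing more ways or some alias handling.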

Larger pages mean that performance on memory-mapped small files will suffer... which is a use-case that Apple doesn't normally care about in its client computers.

Larger cache lines mean that highly multithreaded server loads could suffer from false sharing more often. Again, it's a client computer, so who cares?
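To illustrate the false sharing point with a toy model (not a benchmark): two per-thread counters padded 64 bytes apart land on separate lines when lines are 64 bytes, but on the same line when lines are 128 bytes. The byte offsets and the helper name are hypothetical.

```python
def same_cache_line(addr_a, addr_b, line_bytes):
    """True if both byte addresses map to the same cache line."""
    return addr_a // line_bytes == addr_b // line_bytes


# Hypothetical layout: counters padded for a 64-byte line
counter_a = 0
counter_b = 64

print(same_cache_line(counter_a, counter_b, 64))   # separate lines
print(same_cache_line(counter_a, counter_b, 128))  # shared line
```

Code that pads to 64 bytes to keep threads apart would need 128-byte padding to get the same effect here.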

Regarding the definition of "huge": A64FX uses 256B cache lines. Granted, it's a numerical computing vector machine, but still. Huge covers a lot of ground.

my123 · on Nov 30, 2020

The NVIDIA Carmel cores on 12nm had a 64KB L1D cache with a 2-cycle latency.

throwaway_pdp09 · on Nov 30, 2020

Means nothing without saying what the clock goes at.

my123 · on Nov 30, 2020

2.26GHz, on a quite old process.
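For reference, a quick conversion of those two numbers into wall-clock time, as a small Python sketch. The function name is made up; the inputs are the 2 cycles and 2.26GHz quoted in this subthread.

```python
def cycles_to_ns(cycles, clock_ghz):
    """At f GHz one cycle lasts 1/f nanoseconds."""
    return cycles / clock_ghz


latency_ns = cycles_to_ns(2, 2.26)
print(f"{latency_ns:.3f} ns")
```

That works out to a bit under 0.9 ns for an L1D hit.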