
> Assuming a random (but TLB friendly) pattern the M1 manages a latency of around 30-33ns to main memory.

This, right here. It also helps that the L1D is a whopping 128 kB and only 3 cycles of load-use latency.



This is the first I've heard of this. This alone, plus unified memory in general, I bet explains 60% of the performance difference.


I wonder how they managed that.


Huge block size (128 bytes). They are probably using Power7-like scheduling, i.e. the scheduler works on packs of instructions. That might explain the humorous 600+ entry ROB; the wake-up logic certainly can't deal with that many entries one by one at such low power. If you combine that with JIT and/or good compilers, you get this. I guess only Apple can pull off this trick: they control the whole stack (and some key power architects are working there).


Big cache lines and big pages together. 16 kB pages combined with 128-byte lines means it can be 8-way set associative and still take advantage of a VIPT structure.
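The arithmetic behind that claim can be sketched as follows. This is a rough check, assuming the usual VIPT constraint that the set index plus line offset must fit inside the page offset so that indexing needs no translated bits; the function name `vipt_geometry` is made up for illustration.

```python
def vipt_geometry(page_bytes, line_bytes, ways):
    """Largest alias-free VIPT cache for a given page size, line size and associativity.

    Assumes power-of-two sizes and that index + line-offset bits
    must all come from the untranslated page-offset bits.
    """
    page_offset_bits = page_bytes.bit_length() - 1   # log2(page size)
    line_offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = page_offset_bits - line_offset_bits
    sets = 1 << index_bits
    capacity_bytes = sets * ways * line_bytes        # equals page_bytes * ways
    return sets, capacity_bytes


# 16 kB pages, 128-byte lines, 8 ways: the configuration described above
print(vipt_geometry(16 * 1024, 128, 8))
# 4 kB pages, 64-byte lines, 8 ways: a common alternative, for comparison
print(vipt_geometry(4 * 1024, 64, 8))
```

Under those assumptions the first case comes out at 128 sets and 128 kB, and the second at 64 sets and 32 kB, so the bigger pages are what let the L1D grow without adding ways.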

Larger pages mean that performance on memory-mapped small files will suffer... which is a use-case that Apple doesn't normally care about in its client computers.

Larger cache lines mean that highly multithreaded server loads could suffer from false sharing more often. Again, it's a client computer, so who cares?
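A small illustration of why line size matters for false sharing, assuming two per-thread counters that happen to sit 64 bytes apart in memory. The addresses and the helper name `same_cache_line` are hypothetical.

```python
def same_cache_line(addr_a, addr_b, line_bytes):
    """True if two byte addresses fall in the same cache line."""
    return addr_a // line_bytes == addr_b // line_bytes


# Hypothetical layout: thread 0's counter at offset 0, thread 1's at offset 64.
counter_a, counter_b = 0, 64

print(same_cache_line(counter_a, counter_b, 64))    # separate lines with 64-byte lines
print(same_cache_line(counter_a, counter_b, 128))   # one shared line with 128-byte lines
```

Padding that was sized for 64-byte lines no longer separates the two counters once the line is 128 bytes, so writes from the two threads would bounce the same line between cores.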

Regarding the definition of "huge": A64FX uses 256B cache lines. Granted, it's a numerical computing vector machine, but still. Huge covers a lot of ground.


The NVIDIA Carmel cores on 12nm had a 64KB L1D cache with a 2-cycle latency.


Means nothing without saying what the clock goes at.


2.26GHz, on a quite old process.
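To put the cycle counts in this thread on the same footing, a quick conversion to nanoseconds. The 2.26 GHz figure for Carmel is from the comment above; the 3.2 GHz figure for the M1's big cores is an assumption that nobody in this thread stated.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in cycles to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz


carmel_ns = cycles_to_ns(2, 2.26)   # 2-cycle L1D at 2.26 GHz
m1_ns = cycles_to_ns(3, 3.2)        # 3-cycle L1D at an ASSUMED 3.2 GHz

print(f"Carmel L1D: {carmel_ns:.2f} ns")
print(f"M1 L1D:     {m1_ns:.2f} ns")
```

With those inputs the two come out within about 0.05 ns of each other (roughly 0.88 ns vs 0.94 ns), with the M1's cache being twice the size.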



