
I wanted to see how 1 TiB/s compares to the actual theoretical limits of the hardware. So here is what I found:

The cluster has 68 nodes, each a Dell PowerEdge R6615 (https://www.delltechnologies.com/asset/en-us/products/server...). The R6615 configuration they run is the one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe gen4 lanes. Each PCIe gen4 lane runs at 16 GT/s, with negligible ~1.5% overhead thanks to 128b/130b encoding.

This means each U.2 link has a maximum link bandwidth of 16 * 4 = 64 Gbit/s, or 8 Gbyte/s. However, the U.2 NVMe drives they use are Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to be capable of 7 Gbyte/s read throughput (https://www.serversupply.com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not bottlenecked by the U.2 link (8 Gbyte/s).
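The link math above can be sanity-checked in a few lines (a sketch; the 7 Gbyte/s drive figure is the rated throughput quoted above):

```python
# Sanity check of the U.2 link math (PCIe gen4 x4).
GT_PER_LANE = 16e9       # 16 GT/s per PCIe gen4 lane
ENCODING_EFF = 128 / 130 # 128b/130b line encoding, ~1.5% overhead
LANES = 4                # a U.2 link carries 4 lanes

# Usable link bandwidth in Gbyte/s (GT/s -> bits/s -> bytes/s).
link_gbyte_s = GT_PER_LANE * ENCODING_EFF * LANES / 8 / 1e9
print(f"U.2 link: {link_gbyte_s:.1f} Gbyte/s")  # ~7.9 Gbyte/s

drive_gbyte_s = 7.0  # rated read throughput of the drives
# The drives, not the link, are the per-drive limit.
assert drive_gbyte_s < link_gbyte_s
```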

Each node has 10 U.2 drives, so each node can do local read I/O at a maximum of 10 * 7 = 70 Gbyte/s.

However, each node has a network bandwidth of only 200 Gbit/s (2 x 100GbE Mellanox ConnectX-6), which is only 25 Gbyte/s. This implies that remote reads under-utilize the drives (capable of 70 Gbyte/s). The network is the bottleneck.

Assuming no additional network bottlenecks (they don't describe the network architecture), the 68 nodes can together provide 68 * 25 = 1700 Gbyte/s of network reads. The benchmark actually reached 1025 GiB/s = 1101 Gbyte/s, which is 65% of the theoretical maximum of 1700 Gbyte/s. That's pretty decent, though in theory it could do a bit better, assuming all nodes can truly saturate their 200 Gbit/s network links concurrently.
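Putting the per-node and cluster-wide numbers together (same figures as above, nothing new assumed):

```python
# Back-of-envelope for the per-node and cluster-wide bottlenecks.
NODES = 68
DRIVES_PER_NODE = 10
DRIVE_GBYTE_S = 7.0   # per-drive read throughput
NET_GBIT_S = 200      # 2 x 100GbE per node

local_read = DRIVES_PER_NODE * DRIVE_GBYTE_S     # 70 Gbyte/s per node
net_read = NET_GBIT_S / 8                        # 25 Gbyte/s per node
cluster_max = NODES * min(local_read, net_read)  # network-bound: 1700

measured = 1025 * 1024**3 / 1e9                  # 1025 GiB/s in Gbyte/s
print(f"cluster max: {cluster_max:.0f} Gbyte/s")
print(f"measured: {measured:.0f} Gbyte/s, "
      f"{measured / cluster_max:.0%} of max")    # ~1101, ~65%
```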

Reading this whole blog post, I got the impression Ceph's complexity hits the CPU pretty hard. It's unexpected, for a pure I/O workload, that not compiling a module with -O2 ("Fix Three", linked by the author: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) can make performance "up to 5x slower with some workloads" (https://bugs.gentoo.org/733316). Also, what's up with the OSD's threads wasting CPU contending on the IOMMU spinlock? I agree with the conclusion that the OSD threading model is suboptimal: a relatively simple synthetic 100% read benchmark should not expose thread contention if that part of Ceph's software architecture were well designed. That's fixable, so I hope the Ceph devs prioritize it.



I wanted to chime in and mention that we've never seen any issues with IOMMU before in Ceph. We have a previous generation of the same 1U chassis from Dell with AMD Rome processors in the upstream ceph lab and they don't suffer from the same issue despite performing similarly at the same scale (~30 OSDs). The customer did say they've seen this in the past in their data center. I'm hoping we can work with AMD to figure out what's going on.

I did some work last summer kind of duct taping the OSD's existing threading model (double buffering the hand-off between async msgr and worker threads, adaptive thread wakeup, etc). I could achieve significant performance / efficiency gains under load, but at the expense of increased low-load latency (Ceph by default is very aggressive about waking up threads when new IO arrives for a given shard).

One of the other core developers and I discussed it and we both came to the conclusion that it probably makes sense to do a more thorough rewrite of the threading code.


They're benchmarking random IO though, and the disks can "only" do a bit over 1000k random 4k read IOPS, which translates to about 5 GiB/s. With 320 OSDs that's around 1.6 TiB/s.
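The IOPS-to-throughput conversion behind that estimate (the 1.3M IOPS figure is my reading of "a bit over 1000k", not a vendor spec):

```python
# Convert random 4k read IOPS into sequential-equivalent throughput.
iops = 1.3e6     # assumed: "a bit over 1000k" per drive
block = 4 * 1024 # 4 KiB per read

per_drive = iops * block / 1024**3  # GiB/s per drive
total_tib = 320 * per_drive / 1024  # TiB/s across 320 OSDs
print(f"{per_drive:.1f} GiB/s per drive, "
      f"{total_tib:.2f} TiB/s across 320 OSDs")
```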

At least that's the number I could find. Not exactly tons of reviews on these enterprise NVMe disks...

Still, that seems like a good match to the NICs. At this scale most workloads will likely appear as random IO at the storage layer anyway.


The benchmark where they achieve 1025 GiB/s is for sequential reads. For random reads they do 25.5M IOPS or ~100 GiB/s. See last table, column "630 OSDs (3x)".


Oh wow how did I miss that table, cheers.


I think PCIe TLP overhead and NVMe commands account for the difference between 7 and 8 GB/s.


You are probably right. Reading some old notes of mine from when I was fine-tuning PCIe bandwidth on my ZFS server, I had discovered back then that a PCIe Max_Payload_Size of 256 bytes limited usable bandwidth to about 74% of the link's theoretical max. I had calculated that 512 and 1024 bytes (the maximum) would raise it to about 86% and 93% respectively (but my SATA controllers didn't support a value greater than 256.)
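A simple model of why Max_Payload_Size matters: each TLP carries at most MPS bytes of payload plus a fixed per-packet cost (header, framing, plus DLLP traffic for ACKs and flow control). The ~26-byte overhead below is an assumption for illustration; measured efficiency, like the ~74% at 256 bytes noted above, is typically lower than this naive model because of completions and the read/write mix on the link:

```python
# Naive PCIe payload-efficiency model: payload / (payload + overhead).
OVERHEAD = 26  # assumed bytes of per-TLP cost (header, framing, DLLPs)

for mps in (128, 256, 512, 1024, 4096):
    eff = mps / (mps + OVERHEAD)
    print(f"MPS {mps:>4}: {eff:.0%} of link bandwidth usable")
```

Larger payloads amortize the fixed per-packet cost, which is why raising MPS (where the devices support it) recovers bandwidth.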


Mellanox recommends setting this from the default 512 to 4096 on their NICs.



