
I wanted to see how 1 TiB/s compares to the actual theoretical limits of the hardware. So here is what I found:

The cluster has 68 nodes, each a Dell PowerEdge R6615 (https://www.delltechnologies.com/asset/en-us/products/server...). The R6615 configuration they run is the one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe gen4 lanes. Each PCIe gen4 lane runs at 16 GT/s, with negligible ~1.5% overhead thanks to 128b/130b encoding.

This means each U.2 link has a maximum link bandwidth of 16 * 4 = 64 Gbit/s, or 8 Gbyte/s. However, the U.2 NVMe drives they use are Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to be capable of 7 Gbyte/s read throughput (https://www.serversupply.com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not bottlenecked by the U.2 link (8 Gbyte/s).
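The link math above can be sanity-checked in a few lines (a sketch; the 7 Gbyte/s drive figure is the rated throughput quoted above):

```python
# Sanity check of the U.2 link math (PCIe gen4 x4).
GT_PER_LANE = 16e9       # 16 GT/s per PCIe gen4 lane
ENCODING_EFF = 128 / 130 # 128b/130b line encoding, ~1.5% overhead
LANES = 4                # a U.2 link carries 4 lanes

# Usable link bandwidth in Gbyte/s (GT/s -> bits/s -> bytes/s).
link_gbyte_s = GT_PER_LANE * ENCODING_EFF * LANES / 8 / 1e9
print(f"U.2 link: {link_gbyte_s:.1f} Gbyte/s")  # ~7.9 Gbyte/s

drive_gbyte_s = 7.0  # rated read throughput of the drives
# The drives, not the link, are the per-drive limit.
assert drive_gbyte_s < link_gbyte_s
```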

Each node has 10 U.2 drives, so each node can do local read I/O at a maximum of 10 * 7 = 70 Gbyte/s.

However, each node has a network bandwidth of only 200 Gbit/s (2 x 100GbE Mellanox ConnectX-6), which is only 25 Gbyte/s. This implies that remote reads under-utilize the drives (capable of 70 Gbyte/s). The network is the bottleneck.

Assuming no additional network bottlenecks (they don't describe the network architecture), the 68 nodes can together provide 68 * 25 = 1700 Gbyte/s of network reads. The benchmark actually reached 1025 GiB/s = 1101 Gbyte/s, which is 65% of the theoretical maximum of 1700 Gbyte/s. That's pretty decent, though in theory it could do a bit better, assuming all nodes can truly saturate their 200 Gbit/s network links concurrently.
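Putting the per-node and cluster-wide numbers together (same figures as above, nothing new assumed):

```python
# Back-of-envelope for the per-node and cluster-wide bottlenecks.
NODES = 68
DRIVES_PER_NODE = 10
DRIVE_GBYTE_S = 7.0   # per-drive read throughput
NET_GBIT_S = 200      # 2 x 100GbE per node

local_read = DRIVES_PER_NODE * DRIVE_GBYTE_S     # 70 Gbyte/s per node
net_read = NET_GBIT_S / 8                        # 25 Gbyte/s per node
cluster_max = NODES * min(local_read, net_read)  # network-bound: 1700

measured = 1025 * 1024**3 / 1e9                  # 1025 GiB/s in Gbyte/s
print(f"cluster max: {cluster_max:.0f} Gbyte/s")
print(f"measured: {measured:.0f} Gbyte/s, "
      f"{measured / cluster_max:.0%} of max")    # ~1101, ~65%
```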

Reading this whole blog post, I got the impression Ceph's complexity hits the CPU pretty hard. It's unexpected, for a pure I/O workload, that not compiling a module with -O2 ("Fix Three", linked by the author: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) can make performance "up to 5x slower with some workloads" (https://bugs.gentoo.org/733316). Also, what's up with the OSD's threads wasting CPU contending on the IOMMU spinlock? I agree with the conclusion that the OSD threading model is suboptimal: a relatively simple synthetic 100% read benchmark should not expose thread contention if that part of Ceph's software architecture were well designed. That's fixable, so I hope the Ceph devs prioritize it.



I wanted to chime in and mention that we've never seen any issues with IOMMU before in Ceph. We have a previous generation of the same 1U chassis from Dell with AMD Rome processors in the upstream ceph lab and they don't suffer from the same issue despite performing similarly at the same scale (~30 OSDs). The customer did say they've seen this in the past in their data center. I'm hoping we can work with AMD to figure out what's going on.

I did some work last summer kind of duct taping the OSD's existing threading model (double buffering the hand-off between async msgr and worker threads, adaptive thread wakeup, etc). I could achieve significant performance / efficiency gains under load, but at the expense of increased low-load latency (Ceph by default is very aggressive about waking up threads when new IO arrives for a given shard).

One of the other core developers and I discussed it and we both came to the conclusion that it probably makes sense to do a more thorough rewrite of the threading code.


They're benchmarking random IO though, and the disks can "only" do a bit over 1000k random 4k read IOPS, which translates to about 5 GiB/s. With 320 OSDs that's around 1.6 TiB/s.
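The IOPS-to-throughput conversion behind that estimate (the 1.3M IOPS figure is my reading of "a bit over 1000k", not a vendor spec):

```python
# Convert random 4k read IOPS into sequential-equivalent throughput.
iops = 1.3e6     # assumed: "a bit over 1000k" per drive
block = 4 * 1024 # 4 KiB per read

per_drive = iops * block / 1024**3  # GiB/s per drive
total_tib = 320 * per_drive / 1024  # TiB/s across 320 OSDs
print(f"{per_drive:.1f} GiB/s per drive, "
      f"{total_tib:.2f} TiB/s across 320 OSDs")
```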

At least that's the number I could find. Not exactly tons of reviews on these enterprise NVMe disks...

Still, that seems like a good match to the NICs. At this scale most workloads will likely appear as random IO at the storage layer anyway.


The benchmark where they achieve 1025 GiB/s is for sequential reads. For random reads they do 25.5M IOPS or ~100 GiB/s. See last table, column "630 OSDs (3x)".


Oh wow how did I miss that table, cheers.


I think PCIe TLP overhead and NVMe commands account for the difference between 7 and 8 GB/s.


You are probably right. Reading some old notes of mine from when I was fine-tuning PCIe bandwidth on my ZFS server, I had discovered back then that a PCIe Max_Payload_Size of 256 bytes limited usable bandwidth to about 74% of the link's theoretical max. I had calculated that 512 and 1024 bytes (the maximum) would raise it to about 86% and 93% respectively (but my SATA controllers didn't support a value greater than 256.)
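A simple model of why Max_Payload_Size matters: each TLP carries at most MPS bytes of payload plus a fixed per-packet cost (header, framing, plus DLLP traffic for ACKs and flow control). The ~26-byte overhead below is an assumption for illustration; measured efficiency, like the ~74% at 256 bytes noted above, is typically lower than this naive model because of completions and the read/write mix on the link:

```python
# Naive PCIe payload-efficiency model: payload / (payload + overhead).
OVERHEAD = 26  # assumed bytes of per-TLP cost (header, framing, DLLPs)

for mps in (128, 256, 512, 1024, 4096):
    eff = mps / (mps + OVERHEAD)
    print(f"MPS {mps:>4}: {eff:.0%} of link bandwidth usable")
```

Larger payloads amortize the fixed per-packet cost, which is why raising MPS (where the devices support it) recovers bandwidth.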


Mellanox recommends setting this from the default 512 to 4096 on their NICs.



