GPU databases are brilliant for cases where the working set can live entirely within the GPU's memory; that is their traditional niche. For most applications, with much larger (or more dynamic) working sets, the PCIe bus becomes a significant performance bottleneck.
That said, I've heard anecdotes from people I trust that heavily optimized use of CPU vector instructions is competitive with GPUs for database use cases.
> GPU databases are brilliant for cases where the working set can live entirely within the GPU's memory.
Probably true for current software. But there are numerous algorithms that allow groups of nodes to work together on database joins, even when the data doesn't fit on one node.
Consider Table A (1,000 rows) and Table B (1,000,000 rows). Let's say you want to compute A join B, but B doesn't fit in your memory (say you only have room for 5,000 rows). Well, you can split Table B into 250 pieces, each with 4,000 rows.
Table A (1,000 rows) + Table B (4,000 rows) is 5,000 rows, which fits in memory. :-)
You then compute A join B[0:4000], then A join B[4000:8000], etc. etc. In fact, all 250 of these joins can be done in parallel.
----------
As such, it's theoretically possible to perform database joins on parallel systems, even if the data doesn't fit into any particular node's RAM.
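A minimal sketch of that chunked join in Python (the table contents, key layout, and the hash_join helper are invented for illustration; a real engine would stream the chunks and dispatch each one to a separate node or GPU rather than looping in one process):

```python
# Minimal sketch, assuming rows are simple (key, value) tuples.
table_a = [(k, f"a{k}") for k in range(1_000)]        # 1,000 rows
table_b = [(k, f"b{k}") for k in range(1_000_000)]    # 1,000,000 rows

CHUNK = 4_000  # rows of B that fit in memory alongside all of A

def hash_join(a_rows, b_rows):
    """Join two row lists on their first column via a hash index built on A."""
    index = {key: val for key, val in a_rows}
    return [(key, index[key], b_val) for key, b_val in b_rows if key in index]

# B is split into 250 chunks of 4,000 rows. Each chunk's join is independent
# of the others, so the 250 joins could run in parallel on separate nodes.
joined = []
for start in range(0, len(table_b), CHUNK):
    joined.extend(hash_join(table_a, table_b[start:start + CHUNK]))

assert len(joined) == 1_000  # only keys 0..999 exist in both tables
```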
If you can afford it, much of the penalty from the PCIe bus goes away if you have a system with NVLink. You still need to transfer the data back to the CPU for the final results, but most of the filtering and reduction operations across GPU memory can be done on NVLink only.
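To illustrate just the "keep the heavy work in GPU memory, ship only the small result back" part: a sketch using CuPy on a single GPU (it doesn't exercise NVLink itself, and the column data is invented):

```python
import cupy as cp  # requires a CUDA-capable GPU

# Hypothetical column already resident in GPU memory.
prices = cp.random.uniform(0, 100, size=10_000_000)

# Filter and reduce entirely on the GPU...
total_over_90 = prices[prices > 90.0].sum()

# ...then copy only the single reduced value back to the CPU.
print(float(total_over_90))
```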
I'm clueless about the amount of bandwidth needed for larger applications. Will the eventual release of PCIe 4 and 5 have a big impact on this, or will it still be too slow?
PCIe 3.0 x16, the current generation, has a bandwidth of 15.75 GB/s.
Assuming you have 16 GB of RAM on the GPU, it would theoretically take ~1 second to load the GPU with that amount of data. Unfortunately, once you take huge data sets into consideration, you'll be able to saturate that with 5 M.2 SSDs each running at 3200 MB/s, assuming Disk -> RAM -> GPU DMA.
Those would also require at least 5 PCIe 2.0 x8 ports on a pretty performant setup. RAM bandwidth is assumed to be around 40-60 GB/s, so hopefully no bottlenecks there.
This is assuming your GPU could swizzle through 16 GB of data in a second; GPUs have a theoretical memory bandwidth of between 450 and 970 GB/s.
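The arithmetic behind those figures, as a quick sanity check (all numbers are the rough values quoted above, not measurements):

```python
pcie3_x16_gbs = 15.75        # PCIe 3.0 x16 bandwidth, GB/s
gpu_ram_gb    = 16.0
ssd_gbs       = 3.2          # one NVMe M.2 SSD at 3200 MB/s
gpu_mem_gbs   = (450, 970)   # GPU memory bandwidth range, GB/s

print(gpu_ram_gb / pcie3_x16_gbs)   # ~1.0 s to fill 16 GB over PCIe 3.0 x16
print(5 * ssd_gbs)                  # 16 GB/s from five SSDs -> bus saturated
print(gpu_ram_gb / gpu_mem_gbs[1],
      gpu_ram_gb / gpu_mem_gbs[0])  # ~0.017-0.036 s for the GPU to scan 16 GB
```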
Now realistically, per vendor marketing material, the fastest DB I've seen allows one to ingest at 3 TB/hour ≈ 1 GB/s.
So there must be more to it than the theoretical 16 GB/s business. PCIe 4.0 x16 doubles the speed to ~32 GB/s, but at this time that looks pointless to me.
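For comparison, converting that quoted ingest rate:

```python
ingest_gb_per_s = 3 * 1000 / 3600     # 3 TB/hour in GB/s
print(ingest_gb_per_s)                # ~0.83 GB/s
print(15.75 / ingest_gb_per_s)        # PCIe 3.0 x16 is roughly 19x faster
```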