So this compiler is targeting the SIMD units of CPUs rather than GPUs. Can anyone contrast what the performance of this would be relative to CUDA or OpenCL for various applications, for example neural nets?
For large data and trivial algorithms, such as multiplying matrices (and therefore any problem you can express as a set of operations on large matrices), GPUs do really well, so it's hard to compete with something that has 1000 cores (Edit: "compute units", not "cores"). Neural nets are essentially matrix multiplication.
However, a lot of interesting problems are seemingly parallel but highly branching and nonlinear. Take path tracing as an example: it's very little code and highly parallel, as each ray/pixel is independent, yet it's not an easy problem for a GPU: each time a ray bounces it disperses and stops doing whatever the ray next to it is doing, in terms of which geometry it will hit, etc.
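To make the divergence cost concrete, here is a minimal sketch of a toy model, not a measurement. It assumes a group of lanes shares one instruction stream, so every distinct branch taken by any lane is executed once by the whole group with the other lanes masked off. The function name, the material names, and the per-branch costs are all made up for illustration.

```python
def lockstep_utilization(branch_per_lane, branch_cost):
    """Fraction of lane-cycles doing useful work in a lockstep group (toy model)."""
    width = len(branch_per_lane)
    # The group pays for every distinct branch once; lanes not on that branch sit idle.
    group_cycles = sum(branch_cost[b] for b in set(branch_per_lane))
    # Useful work is what each lane actually needed for its own branch.
    useful_lane_cycles = sum(branch_cost[b] for b in branch_per_lane)
    return useful_lane_cycles / (width * group_cycles)


# Hypothetical shading costs in cycles, chosen equal to keep the arithmetic obvious.
cost = {"diffuse": 10, "glass": 10, "mirror": 10, "sky": 10}

coherent = ["diffuse"] * 8                            # primary rays hitting one surface
scattered = ["diffuse", "glass", "mirror", "sky"] * 2  # rays after a few bounces

print(lockstep_utilization(coherent, cost))   # 1.0
print(lockstep_utilization(scattered, cost))  # 0.25
```

In this model eight rays that agree keep every lane busy, while eight rays spread over four materials leave three quarters of the lane-cycles idle. Real hardware is more subtle, but the trend is the point.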
It might seem like today if a problem can benefit from 8 CPU cores then it benefits 100x more from being run on a GPU, but this is far from true. A great machine for general computing could do well with a board with 100 x86 CPUs in addition to a big GPU with a thousand cores for brute forcing the "simpler" problems.
> A great machine for general computing could do well with a board with 100 x86 CPUs in addition to a big GPU with a thousand cores for brute forcing the "simpler" problems.
Which is the idea behind Intel's Xeon Phi "GPU" with 70+ Pentium/Atom cores, which this compiler specifically targets.
[Long before that, Intel showcased an 80-core CPU in 2007 (Polaris/Teraflops Research Chip), and then promptly shelved it to focus on building programming languages and compilers that can actually make use of it, before introducing the Xeon Phi half a decade later.]
The problem with having 100 x86 CPUs on one board is that NUMA becomes a bad problem. When memory accesses from one CPU to different regions of memory have quite different bandwidths and latencies, you're much better off acknowledging that upfront, designing a proper interconnect (InfiniBand), and not sharing memory between threads but rather communicating explicitly (MPI).
> ... so it's hard to compete with something that has 1000 cores.
Any references to what has "1000 cores"? Nvidia GPUs usually have about 12 or so cores that can be compared to x86 cores, meaning they can independently branch.
For example, the high-end Nvidia GTX 980 GPU has only 16 such comparable SIMD execution cores (SMXs, or whatever Nvidia calls them).
GPU marketing materials confusingly use "cores" for something more like x86 CPU SIMD lanes (and that's being very generous to GPUs), which artificially inflates the numbers.
Or put differently, one CUDA core can compute up to 1 FMA per cycle @1196-1300 (?) MHz. One recent Intel x86 core can compute up to 16 FMAs per cycle, if not more, @2800-4000 MHz.
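Plugging those figures in gives a rough per-"core" ratio. This is only a sketch of peak FMA throughput using the clock ranges quoted above; pairing the low clocks together and the high clocks together is my own assumption, and the helper name is made up.

```python
def fma_per_second(fma_per_cycle, clock_hz):
    """Peak FMA throughput of one 'core' under the stated assumptions."""
    return fma_per_cycle * clock_hz


# One CUDA core: 1 FMA per cycle at 1196-1300 MHz (figures from the comment above).
cuda_low = fma_per_second(1, 1.196e9)
cuda_high = fma_per_second(1, 1.3e9)

# One x86 core: 16 FMAs per cycle at 2800-4000 MHz (figures from the comment above).
x86_low = fma_per_second(16, 2.8e9)
x86_high = fma_per_second(16, 4.0e9)

ratio_low = x86_low / cuda_low
ratio_high = x86_high / cuda_high

print(round(ratio_low, 1), round(ratio_high, 1))
```

So on peak FMA rate alone, one such x86 core is worth somewhere around 37 to 49 "CUDA cores", which is why the headline core counts are not comparable.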
Sorry, should have written "compute units". The point, as you say, is that flops are one thing, but branch prediction and feeding those floating point units is another.
There has been a surge in the tractability of massive but simple linear algebra problems lately, such as deep learning, which might have given the impression that GPUs are the answer to any supercomputing.
I am not very familiar with NVidia hardware, but I imagine an SMX is not the smallest unit that can branch. A "warp" can branch independently and it's 32 lanes wide, so I figure an SMX core with 192 "CUDA cores" can run 6 warps. It's still hundreds of cores and not thousands, but much more than a dozen.
SMXs were the basic silicon unit being tiled in NVidia's Kepler generation, and SMMs are basically the same thing for Maxwell.
A "warp" is analogous to a hardware thread, and you'd have up to 64 of those being scheduled on each SMX or SMM. Each of those SMX/SMMs has four warp schedulers which issue instructions to execution units. In an SMX the schedulers can issue to any of the 192 execution lanes, but in an SMM each scheduler has its own set of execution lanes. If we call a core anything that can independently issue instructions, then I guess you'd call an SMX a core, but on an SMM each warp scheduler looks like its own core. But this is all further complicated by the fact that an instruction issued to one lane can be crossed over to a lane that's become idle due to predication. Which is maybe sort of like scheduling, but not really.
But yes, you can't compare "CUDA cores" to actual cores, and GPUs aren't equivalent to thousands of cores. The GM204 would have 64 core equivalents and most other chips would have fewer.
I think a warp is more like a hardware thread, and one SMX processes one particular warp per clock cycle. So on any given clock cycle you still have just as many independent simultaneous control paths as you have SMX units.
Not quite. All warps are running in parallel (otherwise you won't get the performance numbers) and each has its own control path (actually each has its own code) but, indeed, only one can execute control flow instructions at a time since the control unit is shared in the SMX.
Well, GPUs don't have any branch prediction or out of order capabilities, so you need to have a way to keep execution units (mainly floating point units) busy.
A warp is really nothing more than a way to have work for SMXs (and the computational units they control) on as many clock cycles as possible. You need some way of masking FPU pipeline and memory latency.
> All warps are running in parallel (otherwise you won't get the performance numbers) and each has its own control path (actually each has its own code)
It's not that different from x86 hyperthreading, just with more hardware threads. Pipelined execution units are fed each clock cycle by the core. Multiple FP operations are in flight in parallel, otherwise CPUs won't get the performance numbers either.
Sure, an SMX can also switch between warps in a manner similar to hyperthreading on x86, but that does not mean it executes a single warp at a time. Consider the Tesla K40, a GK110 with 15 SMXs. It runs at 750 MHz and has a peak performance of 4.29 Tflops. If each SMX could only execute one warp at a time, it could reach at most 15 (number of SMXs) x 32 (warp width) x 750M (frequency) x 2 (two flops per FMA) = 720 Gflops.
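The same arithmetic as a small sketch, using only figures quoted in this thread (15 SMXs, 32-wide warps, 192 lanes per SMX, 750 MHz, two flops per FMA). The names are mine, and note that with the rounded 750 MHz clock the full-width estimate comes out at about 4.32 Tflops rather than exactly the quoted 4.29.

```python
SMX_COUNT = 15            # SMXs on a GK110 (from the thread)
WARP_WIDTH = 32           # lanes in one warp
LANES_PER_SMX = 192       # "CUDA cores" per SMX (from the thread)
CLOCK_HZ = 750_000_000    # rounded clock quoted above
FLOPS_PER_FMA = 2         # one multiply plus one add


def peak_flops(lanes_issuing_per_smx):
    """Peak single-precision flops if each SMX keeps this many lanes busy every cycle."""
    return SMX_COUNT * lanes_issuing_per_smx * CLOCK_HZ * FLOPS_PER_FMA


one_warp = peak_flops(WARP_WIDTH)      # one warp per SMX per cycle
all_lanes = peak_flops(LANES_PER_SMX)  # all 192 lanes per SMX per cycle

print(one_warp / 1e9)    # 720.0  (Gflops)
print(all_lanes / 1e12)  # 4.32   (Tflops)
```

The advertised peak is six times the one-warp bound, which only works out if six warps' worth of lanes are executing per SMX on each cycle.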
The Tesla K40 has a peak double-precision performance of ~1.4 TFLOPS. It has 64 DP cores per SMX, and the warp scheduler can schedule four warps per SMX per cycle. It can therefore have two warps executing double-precision instructions at the same time. But that number is not very interesting; the memory bandwidth, on the other hand, is. A GK110 has 288 GB/s: take your code, work out its arithmetic intensity, and you have an upper bound for your performance, assuming you are memory bound of course.
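That bound is the usual roofline estimate: attainable flops are the smaller of the compute peak and bandwidth times arithmetic intensity. A minimal sketch with the two K40 figures quoted above; the function name and the example intensities are illustrative assumptions, and the DAXPY traffic count ignores caching.

```python
PEAK_DP_FLOPS = 1.4e12   # ~1.4 TFLOPS double precision (from the comment above)
MEM_BANDWIDTH = 288e9    # 288 GB/s (from the comment above)


def attainable_flops(arithmetic_intensity):
    """Roofline upper bound for a kernel doing this many flops per byte moved."""
    return min(PEAK_DP_FLOPS, MEM_BANDWIDTH * arithmetic_intensity)


# Intensity at which a kernel stops being memory bound on these numbers.
ridge_point = PEAK_DP_FLOPS / MEM_BANDWIDTH

# Illustrative kernel: double-precision y = a*x + y does 2 flops per element
# while reading x, reading y and writing y (3 * 8 bytes).
daxpy_intensity = 2 / 24

print(round(ridge_point, 2))                       # flops per byte
print(attainable_flops(daxpy_intensity) / 1e9)     # Gflops, memory bound
print(attainable_flops(100.0) / 1e12)              # Tflops, compute bound
```

On these numbers a streaming kernel like DAXPY tops out near 24 Gflops, under 2% of the compute peak, and you need roughly 5 flops per byte before the FPUs become the limit.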
It's true that problems with diverging control flow work better on a number of cores than on a GPU. But by the same token a GPU's SIMT execution model does better with ray tracing than the SIMD units on a CPU. And the article is about targeting SIMD units.
"Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU" is a good paper on this; obviously has an agenda, but rings true in my experience.
That's a good reference. However, the drawback of comparing a GTX 280 with an i7 from the same period is that now, 8 years later, GPUs have scaled quite well for the same set of (simple) problems, with more units/bandwidth, whereas CPU performance hasn't. The difference between today's biggest graphics cards and the GTX 280 is larger than the difference between the i7 from 2008 and the big desktop CPU from 2016. The "100x" is still a myth, but we are significantly closer today than in 2008.
A 2008 Bloomfield i7 can do 8 FP ops per clock cycle per core. Recent Intel CPUs can do 32 FP ops per clock per core.
Bloomfield era you could have 4 (?) cores per CPU socket. Now Broadwell EP has 22.
The only thing that hasn't scaled much on the CPU side is memory bandwidth. I think it's only a matter of time until Intel integrates HBM2 or something like it into the same package. They've already done that for eDRAM.
> Bloomfield era you could have 4 (?) cores per CPU socket.
4 per socket, and at most 2 sockets per board.
Broadwell-EX, to be released this quarter, has 24 cores and up to 8 sockets per board.
So 64 FP ops per machine and cycle versus… 6144.
Over the same period, GPUs went from 900 GFLOPS per card, 2 cards per machine (1800 GFLOPS total vs. 192 on CPU), to 9600 GFLOPS per card, 4 cards per machine (38400 vs. 12000). GPUs are still faster, but the advantage isn't that significant any more.
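Spelling that comparison out as a quick sketch: every input is a figure from the comments above, the 192 and 12000 GFLOPS CPU totals are taken as quoted rather than derived, and the names are mine.

```python
def fp_ops_per_cycle(cores_per_socket, sockets, ops_per_core_cycle):
    """FP ops per clock cycle for a whole machine."""
    return cores_per_socket * sockets * ops_per_core_cycle


cpu_2008_ops = fp_ops_per_cycle(4, 2, 8)     # Bloomfield era
cpu_2016_ops = fp_ops_per_cycle(24, 8, 32)   # Broadwell-EX

gpu_2008_gflops = 900 * 2     # 2 cards per machine
gpu_2016_gflops = 9600 * 4    # 4 cards per machine
cpu_2008_gflops = 192         # as quoted above
cpu_2016_gflops = 12000       # as quoted above

advantage_2008 = gpu_2008_gflops / cpu_2008_gflops
advantage_2016 = gpu_2016_gflops / cpu_2016_gflops

print(cpu_2008_ops, cpu_2016_ops)       # 64 6144
print(advantage_2008, advantage_2016)   # 9.375 3.2
```

On these peak numbers the GPU lead per machine shrinks from roughly 9x to roughly 3x, although it compares a two-socket box with an eight-socket one.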
32 FP ops per clock? I don't think that is true, and if it is, it is very misleading.
First with Haswell introducing the fused multiply add, suddenly all the 'peak flops' numbers doubled, which is technically true, but only if everything you do is a fused multiply add (with no cache misses of course).
Even so, only the Xeon Phi (and only the unreleased Silvermont cores?) has 16-wide vector units; even Skylake still has 8-wide AVX units, which would be 16 FMA operations.
Are you saying that AVX instructions are pipelined (or some other technique) and have a throughput greater than their width per cycle?
> First with Haswell introducing the fused multiply add, suddenly all the 'peak flops' numbers doubled, which is technically true, but only if everything you do is a fused multiply add (with no cache misses of course).
Yeah, FMA (fused multiply-adds).
For better or worse, it's the de facto standard to count one FMA as two FLOPs, because it's a very commonly combined operation.
> Are you saying that AVX instructions are pipelined (or some other technique) and have a throughput greater than their width per cycle?
Yeah, AFAIK, they're (mostly) pipelined and dual issue per clock.
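That is where the 32 comes from. A rough sketch of the peak count, assuming single precision, 8 lanes in a 256-bit AVX register, two FMA issue ports per core, and (for Bloomfield) 4-wide SSE with one add and one multiply issued per cycle; the function and parameter names are made up.

```python
def peak_flops_per_cycle(simd_width, issue_ports, flops_per_instruction):
    """Peak FP ops per core per cycle when every port issues a full-width op each cycle."""
    return simd_width * issue_ports * flops_per_instruction


# Recent core (assumed): 8-wide AVX, 2 FMA ports, each FMA counted as 2 flops.
recent_core = peak_flops_per_cycle(simd_width=8, issue_ports=2, flops_per_instruction=2)

# Bloomfield (assumed): 4-wide SSE, separate add and multiply ports, 1 flop each.
bloomfield_core = peak_flops_per_cycle(simd_width=4, issue_ports=2, flops_per_instruction=1)

print(recent_core, bloomfield_core)  # 32 8
```

Both are peak figures: they hold only if every cycle issues full-width FMAs (or a paired add and multiply) with no stalls, which is the objection raised above.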
One significant benefit of CPU SIMD is that you also do not need to manage temperamental GPUs and GPU drivers. The CPU programming experience is much nicer, the infrastructure more robust, and you can generally expect some sort of SIMD support everywhere. It is not too hard to support SIMD-with-fallback code (and SIMD will generally just work, if supported). GPU support will often require configuration on the part of the user, especially if the system has several GPUs.
(I say this as a GPU language developer. They are fast, but also a bit of a pain in the ass.)
I think that because Vulkan won't be nearly as niche, GPU driver support and consistency will need to be much better than they have been for OpenCL. Where that stands for flexibility of the compute side I can't say.
SPMD is conceptually more like a job system, so this generates code that is not necessarily all executing the same code in lockstep. SPMD vs. SIMD is "single program" vs. "single instruction". The compiler does also exploit SIMD, but that's orthogonal to the SPMD concept.
That would depend heavily on the hardware and how the program was written. It is an apples and oranges comparison.
I will say this though, a program written well in ISPC with cache locality taken into account together with SIMD can run 100x faster than a naive C program.