> What's more interesting is that the Rust implementation is just a factor of 1.23 slower (for large arrays) than just using Numpy
I suppose what is meant is faster (also follows from the diagram?). But it is still not a dramatic gain for many use cases. This shows how non-trivial the Python performance calculus is: pure Python, versus NumPy, versus compiled C/C++ or Rust. People who want to speed up Python should really check whether NumPy helps before complicating their codebase further.
But there are more benefits to those bindings besides performance, so it's really nice to see the options expanding.
NumPy is often faster because it uses highly optimized SIMD code, or delegates to BLAS/LAPACK implementations (Fortran, MKL, cuBLAS).
A pure Rust implementation will likely always be slower, by virtue of not using the same tightly designed, optimized code.
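A minimal sketch of the gap being described: the same reduction done as a pure Python loop versus NumPy's vectorized routine (the function and array names here are illustrative, not from any benchmark in the thread).

```python
import timeit

import numpy as np

N = 1_000_000
xs = list(range(N))
arr = np.arange(N, dtype=np.int64)


def pure_python_sum(values):
    """Interpreted loop: one bytecode dispatch per element."""
    total = 0
    for v in values:
        total += v
    return total


# Both paths compute the same result...
assert pure_python_sum(xs) == int(arr.sum())

# ...but NumPy's sum runs as a single compiled, SIMD-friendly loop.
loop_t = timeit.timeit(lambda: pure_python_sum(xs), number=5)
numpy_t = timeit.timeit(lambda: arr.sum(), number=5)
print(f"pure Python: {loop_t:.3f}s  NumPy: {numpy_t:.3f}s")
```

On typical hardware the NumPy call is one to two orders of magnitude faster, which is why trying NumPy first is usually cheaper than writing a native extension.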
Side note: a fun implementation detail of NumPy is that after you install it from PyPI, it does a user-side compile of some of the modules on first import, which means you need to be somewhat careful if you ever relocate an install of it to a new machine.
And note that, for example, on Apple M1 it's essentially impossible to beat an implementation that uses Apple's Accelerate library for things like matrix multiplication, because that library uses undocumented instructions that are unavailable to the public.
> Deprecated since version 1.20: The native libraries on macOS, provided by Accelerate, are not fit for use in NumPy since they have bugs that cause wrong output under easily reproducible conditions. If the vendor fixes those bugs, the library could be reinstated, but until then users compiling for themselves should use another linear algebra library or use the built-in (but slower) default, see the next section.
> With the release of macOS 11.3, several different issues that numpy was encountering when using Accelerate Framework’s implementation of BLAS and LAPACK should be resolved.
If NumPy uses runtime detection of available SIMD instructions while the Rust module is only compiled for the x86-64 baseline (which includes only SSE2), then compiling the module with `RUSTFLAGS=-Ctarget-cpu=native` might provide some additional performance gains on number-crunchy code.
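For illustration, here is a sketch of the runtime-dispatch pattern NumPy uses internally, written in Rust: detect the CPU's capabilities once, then route to a kernel compiled with wider SIMD enabled. (The function names are hypothetical; building the whole crate with `-Ctarget-cpu=native` avoids the dispatch entirely but ties the binary to the build machine.)

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f64]) -> f64 {
    // Same loop body; with AVX2 enabled the compiler is free to
    // autovectorize it with 256-bit instructions.
    xs.iter().sum()
}

fn sum(xs: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime check: safe to call the AVX2 kernel only if the
        // CPU we are actually running on supports it.
        if is_x86_feature_detected!("avx2") {
            return unsafe { sum_avx2(xs) };
        }
    }
    // Portable fallback, compiled for the baseline target (SSE2 on x86-64).
    xs.iter().sum()
}

fn main() {
    let v: Vec<f64> = (0..1_000).map(|i| i as f64).collect();
    println!("sum = {}", sum(&v));
}
```

With this pattern a single prebuilt wheel can still use AVX2 where it exists, which is how NumPy squares portability with per-machine performance.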
I had to do a double take. The Rust implementation is slower and harder to maintain. I recommend adding a Cythonized function and a Numba-jitted function to the benchmark for completeness.