I've been doing a bit of ML work in Rust lately and have come across a few different libraries that provide tensor operations and other building blocks. The three most popular:
And then we also have:
One of the things I ran into when I was working on gpt-rs was that the Rust tensor operations were dramatically slower than PyTorch's. Which, to be fair, is expected given that PyTorch has a decade of extremely low-level optimizations behind it, but it raised the question: which native Rust library has the most performant tensor operations?
So I decided to run a benchmark against Candle, Burn, and NDArray. I skipped tch-rs because it's not native Rust.
P.S. You can find all of the code and benchmarks here.
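For context, each measurement boils down to timing an operation over many iterations and averaging. Here's a minimal sketch of that kind of loop, using `std::time::Instant` and ndarray as the example; the actual benchmarks in the repo may be structured differently (e.g. with a harness like criterion):

```rust
use std::time::Instant;

use ndarray::Array2;
use ndarray_rand::{rand_distr::Uniform, RandomExt};

// Hypothetical helper: run `f` a fixed number of times and return the mean in μs.
fn bench<F: FnMut()>(mut f: F, iters: u32) -> f64 {
    f(); // warm-up so allocation/cache effects don't skew the first sample
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_secs_f64() * 1e6 / iters as f64
}

fn main() {
    let a = Array2::<f32>::random((512, 512), Uniform::new(0.0, 1.0));
    let b = Array2::<f32>::random((512, 512), Uniform::new(0.0, 1.0));

    // black_box keeps the compiler from optimizing away the unused result.
    let mean_us = bench(|| { std::hint::black_box(a.dot(&b)); }, 100);
    println!("ndarray 512x512 matmul: {mean_us:.1} μs/iter");
}
```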
Let's dig in.
First, a tl;dr: different frameworks were good at different things, and there was no one clear winner. Each framework also has its own pros and cons. For example, if you want to build a training pipeline, you're better off doing that with Burn than Candle, because Burn already has all of the building blocks in place; it probably doesn't make sense to rebuild backprop, SGD, and more just to use Candle.
Operation | Winner | Performance Advantage |
---|---|---|
Tensor Creation | NDArray | ~4.5x faster than Burn, ~8.2x faster than Candle |
Matrix Multiplication | Candle | ~1.7x faster than Burn, ~3.9x faster than NDArray |
Element-wise Operations | NDArray/Candle | Virtually identical performance |
Reduction Operations | Candle | ~1.7x faster than NDArray/Burn |
Vector Operations | NDArray | ~2.1x faster than Burn |
Now into the details.
Performance for creating random tensors:
Library | Mean Time (μs) | Std Dev (μs) | Relative Performance |
---|---|---|---|
NDArray | 317.3 | 63.2 | 1.00x (baseline) |
Burn | 1,435.9 | 172.0 | 4.53x slower |
Candle | 2,605.6 | 85.7 | 8.22x slower |
NDArray significantly outperforms both Burn and Candle for tensor creation: it's about 4.5x faster than Burn and an impressive 8.2x faster than Candle. I was actually pretty surprised by this. I would have expected Burn and/or Candle to have optimized BLAS routines and even some hand-written assembly, but it doesn't look like it, or at least it didn't make a difference here. The other potential cause is how the random numbers are generated: with NDArray I used the ndarray_rand crate, while Candle and Burn ship their own wrappers, if not their own implementations, of a random number generator.
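For reference, here's roughly what the creation calls look like in each library. The ndarray_rand usage matches what the benchmark used; the Candle and Burn calls are a sketch from memory of recent releases (`Tensor::randn` and `Tensor::random` respectively), so double-check them against the versions you're on:

```rust
use burn::backend::NdArray as BurnNdArray;
use burn::tensor::{Distribution, Tensor as BurnTensor};
use candle_core::{Device, Tensor};
use ndarray::Array2;
use ndarray_rand::{rand_distr::Uniform, RandomExt};

fn main() -> candle_core::Result<()> {
    // ndarray: the RandomExt trait from ndarray_rand adds ::random().
    let _nd = Array2::<f32>::random((512, 512), Uniform::new(0.0, 1.0));

    // Candle: randn(mean, std, shape, device) returns a Result.
    let _cd = Tensor::randn(0f32, 1.0f32, (512, 512), &Device::Cpu)?;

    // Burn (assumed recent API): random(shape, distribution, device) on the CPU ndarray backend.
    let device = Default::default();
    let _bn = BurnTensor::<BurnNdArray, 2>::random([512, 512], Distribution::Default, &device);

    Ok(())
}
```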
Performance for matrix multiplication:
Library | Mean Time (μs) | Std Dev (μs) | Relative Performance |
---|---|---|---|
Candle | 674.8 | 75.7 | 1.00x (baseline) |
Burn | 1,144.0 | 190.3 | 1.70x slower |
NDArray | 2,663.8 | 105.3 | 3.95x slower |
Candle dominates matrix multiplication, outperforming Burn by 1.7x and NDArray by nearly 4x. More importantly, a 512×512 matrix multiplication requires roughly 268 million floating-point operations (2·n³ for n = 512); at ~675 μs per run, Candle sustains roughly 398 GFLOPS, which points to a highly optimized GEMM (General Matrix Multiply) implementation, likely leveraging optimized BLAS libraries. Matmul is obviously crucial to deep learning workloads, so it's not surprising that Candle and Burn have invested in it (at least relative to NDArray).
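To make that concrete, here's a sketch of the 512×512 matmul in Candle and NDArray (the Candle calls are from memory of its CPU API, so treat them as an approximation):

```rust
use candle_core::{Device, Tensor};
use ndarray::Array2;
use ndarray_rand::{rand_distr::Uniform, RandomExt};

fn main() -> candle_core::Result<()> {
    // Candle: matmul on two 512x512 tensors, dispatched to its CPU GEMM path.
    let a = Tensor::randn(0f32, 1.0f32, (512, 512), &Device::Cpu)?;
    let b = Tensor::randn(0f32, 1.0f32, (512, 512), &Device::Cpu)?;
    let c = a.matmul(&b)?;
    assert_eq!(c.dims2()?, (512, 512));

    // NDArray: Array2::dot is the 2-D matrix product.
    let x = Array2::<f32>::random((512, 512), Uniform::new(0.0, 1.0));
    let y = Array2::<f32>::random((512, 512), Uniform::new(0.0, 1.0));
    let z = x.dot(&y);
    assert_eq!(z.dim(), (512, 512));

    Ok(())
}
```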
Performance for vector dot products (100K elements):
Library | Mean Time (μs) | Performance Notes |
---|---|---|
NDArray | ~11.2 | Optimized vector ops |
Burn | ~23.9 | 2.1x slower |
I only ran dot products for NDArray and Burn because Candle didn't support a native dot product, and it felt a little unfair to write my own and then compare it. Overall, Burn is much slower than NDArray here, which is interesting given that Burn is much faster than NDArray at matmul; I would have thought that some of the optimizations that improved matmul would have also improved the dot product.
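For completeness, the NDArray version is a one-liner over two 1-D arrays:

```rust
use ndarray::Array1;
use ndarray_rand::{rand_distr::Uniform, RandomExt};

fn main() {
    // Two 100K-element vectors, matching the benchmark size.
    let a = Array1::<f32>::random(100_000, Uniform::new(0.0, 1.0));
    let b = Array1::<f32>::random(100_000, Uniform::new(0.0, 1.0));

    // 1-D dot product; ndarray reduces this to a tight loop over both buffers.
    let d = a.dot(&b);
    println!("dot = {d}");
}
```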
Performance for element-wise addition:
Library | Mean Time (μs) | Std Dev (μs) | Relative Performance |
---|---|---|---|
Candle | 30.7 | 2.0 | 1.00x (baseline) |
NDArray | 30.9 | 0.9 | 1.01x slower |
Burn | 31.3 | 0.8 | 1.02x slower |
All three libraries show nearly identical performance for element-wise operations.
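That matches intuition: element-wise addition is memory-bound, so there's little room for one library to pull ahead. The call itself is trivial in all three (NDArray shown below; the Candle and Burn forms in the comments are assumed from their operator overloads, and the size here is illustrative):

```rust
use ndarray::Array2;
use ndarray_rand::{rand_distr::Uniform, RandomExt};

fn main() {
    let a = Array2::<f32>::random((512, 512), Uniform::new(0.0, 1.0));
    let b = Array2::<f32>::random((512, 512), Uniform::new(0.0, 1.0));

    // NDArray: operator overloading on references avoids consuming the inputs.
    let c = &a + &b;
    println!("sum of result: {}", c.sum());

    // Candle (assumed): let c = (&a + &b)?;   // Add on &Tensor returns a Result
    // Burn (assumed):   let c = a + b;        // tensor ops take their inputs by value
}
```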
Performance for tensor sum operations (256×256 tensors):
Library | Mean Time (μs) | Performance Notes |
---|---|---|
Candle | ~4.2 | Fastest reduction |
NDArray | ~7.3 | Moderate performance |
Burn | ~7.8 | Slowest reduction |
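The reduction benchmark is a full sum over a 256×256 tensor. Here's a sketch of the Candle version; `sum_all` is, to my knowledge, its whole-tensor reduction, and the NDArray and Burn forms in the comments are assumed:

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    let t = Tensor::randn(0f32, 1.0f32, (256, 256), &Device::Cpu)?;

    // Reduce every element to a single scalar tensor, then pull the value out.
    let total = t.sum_all()?;
    println!("sum = {}", total.to_scalar::<f32>()?);

    // NDArray equivalent: arr.sum()
    // Burn (assumed):     tensor.sum() returns a one-element tensor
    Ok(())
}
```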
The libraries also show different scaling characteristics. For element-wise operations, all three scale similarly, maintaining competitive performance across different tensor sizes.
For tensor creation, by contrast, throughput diverges sharply (elements created per second, consistent with the creation timings above):
Library | Elements/sec | Throughput Efficiency |
---|---|---|
NDArray | 831M elements/sec | Highest throughput |
Burn | 183M elements/sec | Moderate throughput |
Candle | 101M elements/sec | Lowest throughput |
A 512×512 matrix multiplication requires roughly 268M floating-point operations (2·n³). Dividing that by the mean times above gives each library's effective throughput:
Library | Effective Throughput | Efficiency |
---|---|---|
Candle | ~398 GFLOPS | Best efficiency |
Burn | ~235 GFLOPS | Moderate efficiency |
NDArray | ~101 GFLOPS | Poor efficiency |
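Those throughput figures fall straight out of the mean timings; as a quick sanity check:

```rust
// Effective throughput = (2 * n^3 FLOPs) / wall time, reported in GFLOPS.
fn gflops(n: f64, mean_us: f64) -> f64 {
    let flops = 2.0 * n * n * n;      // ~268.4M FLOPs for n = 512
    flops / (mean_us * 1e-6) / 1e9    // μs -> s, then FLOP/s -> GFLOPS
}

fn main() {
    println!("Candle:  {:.0} GFLOPS", gflops(512.0, 674.8));  // ~398
    println!("Burn:    {:.0} GFLOPS", gflops(512.0, 1144.0)); // ~235
    println!("NDArray: {:.0} GFLOPS", gflops(512.0, 2663.8)); // ~101
}
```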
Each library has distinct performance characteristics, and the choice really depends on your specific use case: Candle is ideal for ML inference, NDArray for data processing, and Burn for comprehensive ML training pipelines. It would be great to be able to interop between them easily, but each one has its own tensor implementation, so you would need a translation layer to convert Candle tensors into Burn tensors.
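As a rough sketch of what that glue might look like (hypothetical code, assuming a recent Burn release with `TensorData`/`from_data` and Candle's `to_vec1`; note that it pays a full host-side copy rather than being zero-cost interop):

```rust
use burn::backend::NdArray as BurnBackend;
use burn::tensor::backend::Backend;
use burn::tensor::{Tensor as BurnTensor, TensorData};
use candle_core::Tensor as CandleTensor;

/// Hypothetical translation layer: copy a 2-D Candle tensor into a Burn tensor
/// by way of a plain Vec<f32>.
fn candle_to_burn(
    t: &CandleTensor,
    device: &<BurnBackend as Backend>::Device,
) -> candle_core::Result<BurnTensor<BurnBackend, 2>> {
    let (rows, cols) = t.dims2()?;                    // source shape
    let values = t.flatten_all()?.to_vec1::<f32>()?;  // contiguous host copy
    let data = TensorData::new(values, [rows, cols]); // assumed Burn data container API
    Ok(BurnTensor::from_data(data, device))
}
```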
As the ecosystem matures, we'll likely see consolidation around fewer libraries, each with clearer use case definitions and wider support.