
Benchmarking tensor operations in Rust

I've been doing a bit of ML work in Rust lately and have come across a few different libraries that provide tensor operations and other ML building blocks. The three most popular:

  • Candle - Minimalist ML framework from Hugging Face. It supports low-level tensor operations and some higher-level operations like layernorm, softmax, etc.
  • Burn - More of a full-stack ML framework that you can use for training and inference.
  • tch-rs - Rust bindings on top of the C++ torch API.

And then we also have:

  • Ndarray - The standard Rust crate for N-dimensional arrays and general numerical computing.

One of the things I ran into when I was working on gpt-rs was that the Rust tensor operations were dramatically slower than PyTorch. That's fair enough, given that PyTorch has a decade of extremely low-level optimizations behind it, but it raised the question: which native Rust library has the most performant tensor operations?

So I decided to run a benchmark against Candle, Burn and Ndarray. I skipped tch-rs because it's not native Rust.

PS: You can find all of the code and benchmarks here.

Let's dig in.

Rust Tensor Libraries Benchmark Results

Test Environment

  • Hardware: MacBook M2 (CPU only, no GPU benchmarks included)
  • Data Type: f32
  • Optimization: Release mode with LTO enabled
  • Measurement: The Criterion benchmarking crate, reporting 95% confidence intervals (see the sketch after this list)
  • Limited Operations: Core operations only, no neural network layers
  • System Dependent: Results may vary across different hardware
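
Concretely, every measurement in this post boils down to a small Criterion benchmark compiled in release mode. A minimal sketch of what one looks like (the function and benchmark names here are mine, not copied from the repo, and crate versions may shift the exact `ndarray_rand` API):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use ndarray::Array2;
use ndarray_rand::{rand_distr::Uniform, RandomExt};
use std::hint::black_box;

// One benchmark: 512x512 matmul with ndarray, timed by Criterion.
fn bench_ndarray_matmul(c: &mut Criterion) {
    let a = Array2::<f32>::random((512, 512), Uniform::new(0.0f32, 1.0));
    let b = Array2::<f32>::random((512, 512), Uniform::new(0.0f32, 1.0));
    c.bench_function("ndarray_matmul_512", |bencher| {
        // black_box keeps the optimizer from discarding the result.
        bencher.iter(|| black_box(a.dot(&b)))
    });
}

criterion_group!(benches, bench_ndarray_matmul);
criterion_main!(benches);
```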

Overall Performance Overview

First, a tl;dr: different frameworks were good at different things, and there was no single clear winner. Each framework also has its own pros and cons. For example, if you want to build a training pipeline, you're better off doing that with Burn than Candle because Burn already has all of the building blocks in place; it probably doesn't make sense to rebuild backprop, SGD and more just to use Candle.

| Operation | Winner | Performance Advantage |
|---|---|---|
| Tensor Creation | NDArray | ~4.5x faster than Burn, ~8.2x faster than Candle |
| Matrix Multiplication | Candle | ~1.7x faster than Burn, ~3.9x faster than NDArray |
| Element-wise Operations | NDArray/Candle | Virtually identical performance |
| Reduction Operations | Candle | ~1.7x faster than NDArray/Burn |
| Vector Operations | NDArray | ~2.1x faster than Burn |

Now into the details.

1. Tensor Creation (512×512 Random Tensors)

Performance for creating random tensors:

| Library | Mean Time (μs) | Std Dev (μs) | Relative Performance |
|---|---|---|---|
| NDArray | 317.3 | 63.2 | 1.00x (baseline) |
| Burn | 1,435.9 | 172.0 | 4.53x slower |
| Candle | 2,605.6 | 85.7 | 8.22x slower |

NDArray significantly outperforms both Burn and Candle for tensor creation: it beats Burn by about 4.5x and Candle by an impressive 8.2x. I was actually pretty surprised by this. I would have expected Burn and/or Candle to use optimized BLAS routines and maybe even some hand-written assembly, but either they don't or it didn't make a difference here. The other potential cause is how the random numbers are generated: with NDArray I used the ndarray_rand crate, while Candle and Burn ship their own wrappers around (if not full implementations of) a random number generator.
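
For what it's worth, here is roughly what "make a 512×512 random tensor" looks like in each library. This is a sketch rather than the benchmark code itself: exact signatures vary by crate version, and using Burn's ndarray CPU backend here is my assumption.

```rust
use burn::tensor::{Distribution, Tensor as BurnTensor};
use burn_ndarray::{NdArray, NdArrayDevice};
use candle_core::{Device, Tensor as CandleTensor};
use ndarray::Array2;
use ndarray_rand::{rand_distr::Uniform, RandomExt};

fn main() -> candle_core::Result<()> {
    // ndarray: random creation comes from the ndarray_rand extension trait.
    let nd = Array2::<f32>::random((512, 512), Uniform::new(0.0f32, 1.0));

    // Candle: the tensor type has its own uniform RNG built in.
    let cd = CandleTensor::rand(0f32, 1f32, (512, 512), &Device::Cpu)?;

    // Burn: generic over a backend; the CPU ndarray backend is assumed here.
    let bn = BurnTensor::<NdArray, 2>::random(
        [512, 512],
        Distribution::Uniform(0.0, 1.0),
        &NdArrayDevice::Cpu,
    );

    println!("{:?} {:?} {:?}", nd.dim(), cd.shape(), bn.dims());
    Ok(())
}
```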

2. Matrix Multiplication (512×512 × 512×512)

Performance for matrix multiplication:

| Library | Mean Time (μs) | Std Dev (μs) | Relative Performance |
|---|---|---|---|
| Candle | 674.8 | 75.7 | 1.00x (baseline) |
| Burn | 1,144.0 | 190.3 | 1.70x slower |
| NDArray | 2,663.8 | 105.3 | 3.95x slower |

Candle dominates matrix multiplication, outperforming Burn by 1.7x and NDArray by nearly 4x. More interestingly, a 512×512 matmul requires roughly 268M floating-point operations (2 × 512³), so Candle's 674.8 μs mean works out to around 397 GFLOPS, which points to a highly optimized GEMM (General Matrix Multiply) implementation, likely backed by an optimized BLAS. Matmul is obviously crucial to deep learning workloads, so it's not surprising that Candle and Burn have invested in it (at least relative to NDArray).
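
The call sites themselves are tiny; the differences above come almost entirely from the GEMM kernels underneath. A sketch (Burn omitted for brevity; its call is `a.matmul(b)`, which consumes both operands):

```rust
use candle_core::{DType, Device, Tensor};
use ndarray::Array2;

fn main() -> candle_core::Result<()> {
    // ndarray: Array2::dot is the matmul entry point.
    let (a, b) = (Array2::<f32>::ones((512, 512)), Array2::<f32>::ones((512, 512)));
    let _nd = a.dot(&b);

    // Candle: Tensor::matmul, returning a Result.
    let a = Tensor::ones((512, 512), DType::F32, &Device::Cpu)?;
    let b = Tensor::ones((512, 512), DType::F32, &Device::Cpu)?;
    let _cd = a.matmul(&b)?;
    Ok(())
}
```

One caveat on the NDArray number: ndarray only calls into an external BLAS when its optional `blas` feature is enabled; otherwise `dot` falls back to the pure-Rust matrixmultiply crate, which could account for part of the gap here.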

3. Vector Operations (Dot Product)

Performance for vector dot products (100K elements):

| Library | Mean Time (μs) | Performance Notes |
|---|---|---|
| NDArray | ~11.2 | Optimized vector ops |
| Burn | ~23.9 | 2.1x slower |

I only ran dot products for NDArray and Burn because Candle doesn't support a native dot product, and writing my own and then comparing it felt a little unfair. Overall, Burn is much slower than NDArray here, which is interesting given that Burn is much faster than NDArray at matmul. I would have thought that some of the optimizations that improved matmul would also improve the dot product.
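
For reference, here's how a dot product can be expressed in each library, including the hand-rolled Candle composition I chose not to benchmark (this is a sketch of one way to write it, not the exact code in the repo):

```rust
use burn::tensor::Tensor as BurnTensor;
use burn_ndarray::{NdArray, NdArrayDevice};
use candle_core::{DType, Device, Tensor as CandleTensor};
use ndarray::Array1;

fn main() -> candle_core::Result<()> {
    // ndarray: Array1::dot is a true dot product for 1-D arrays.
    let (x, y) = (Array1::<f32>::ones(100_000), Array1::<f32>::ones(100_000));
    let _nd_dot: f32 = x.dot(&y);

    // Burn: one way to express it is an element-wise multiply followed by a reduce.
    let device = NdArrayDevice::Cpu;
    let x = BurnTensor::<NdArray, 1>::ones([100_000], &device);
    let y = BurnTensor::<NdArray, 1>::ones([100_000], &device);
    let _bn_dot = (x * y).sum(); // rank-1 tensor holding a single element

    // Candle: the hand-rolled version I chose not to benchmark would look like this.
    let x = CandleTensor::ones(100_000, DType::F32, &Device::Cpu)?;
    let y = CandleTensor::ones(100_000, DType::F32, &Device::Cpu)?;
    let _cd_dot = (x * y)?.sum_all()?; // element-wise mul, then sum to a scalar

    Ok(())
}
```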

4. Element-wise Addition (512×512 + 512×512)

Performance for element-wise addition:

| Library | Mean Time (μs) | Std Dev (μs) | Relative Performance |
|---|---|---|---|
| Candle | 30.7 | 2.0 | 1.00x (baseline) |
| NDArray | 30.9 | 0.9 | 1.01x slower |
| Burn | 31.3 | 0.8 | 1.02x slower |

All three libraries show nearly identical performance for element-wise operations.

5. Reduction Operations (Sum)

Performance for tensor sum operations (256×256 tensors):

| Library | Mean Time (μs) | Performance Notes |
|---|---|---|
| Candle | ~4.2 | Fastest reduction |
| NDArray | ~7.3 | Moderate performance |
| Burn | ~7.8 | Slowest reduction |

Performance Scaling Analysis

Matrix Multiplication Scaling (64×64 to 512×512)

The libraries show different scaling characteristics:

  • Candle: Excellent scaling, maintains performance advantage
  • Burn: Good scaling but consistently slower than Candle
  • NDArray: Poor scaling for larger matrices

Element-wise Operations Scaling

All libraries scale similarly for element-wise operations, maintaining competitive performance across different tensor sizes.

Memory and Throughput Analysis

Tensor Creation Throughput (512×512 matrices)

| Library | Elements/sec | Throughput Efficiency |
|---|---|---|
| NDArray | 831M elements/sec | Highest throughput |
| Burn | 183M elements/sec | Moderate throughput |
| Candle | 101M elements/sec | Lowest throughput |

Matrix Multiplication FLOPS (512×512 × 512×512)

Floating-point operations required for a 512×512 × 512×512 matrix multiplication: 2 × 512³ ≈ 268M FLOPs

| Library | Throughput (GFLOPS) | Efficiency |
|---|---|---|
| Candle | ~397 GFLOPS | Best efficiency |
| Burn | ~234 GFLOPS | Moderate efficiency |
| NDArray | ~101 GFLOPS | Poor efficiency |
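
The throughput numbers fall straight out of the measured times; a quick sanity check in plain Rust (no library code, just the figures from the tables above):

```rust
fn main() {
    // FLOPs for an n x n x n matmul: n^2 outputs, each needing n multiplies and n adds.
    let n: u64 = 512;
    let flops = 2 * n * n * n; // = 268,435,456, i.e. ~268M FLOPs

    // Mean times from the matmul table above, converted to seconds.
    let times = [("Candle", 674.8e-6), ("Burn", 1_144.0e-6), ("NDArray", 2_663.8e-6)];
    for (name, secs) in times {
        // Prints roughly 397.8, 234.6 and 100.8 GFLOPS, matching the table.
        println!("{name}: {:.1} GFLOPS", flops as f64 / secs / 1e9);
    }
}
```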

Conclusion

Each library has distinct performance characteristics:

  • Candle excels at compute-intensive operations like matrix multiplication
  • NDArray dominates memory-intensive operations like tensor creation
  • Burn provides consistent, balanced performance with additional safety features

The choice really depends on your specific use case: Candle is ideal for ML inference, NDArray for data processing, and Burn for full ML training pipelines. It would be great to be able to interoperate between them easily, but each one has its own tensor implementation, so you'd need a translation layer to convert, say, Candle Tensors into Burn Tensors.
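
To make the interop point concrete, a translation layer today would have to round-trip through plain buffers. A sketch of a hypothetical `candle_to_burn` helper, assuming recent candle_core and Burn APIs (names like `TensorData` have changed across Burn versions):

```rust
use burn::tensor::{Tensor as BurnTensor, TensorData};
use burn_ndarray::{NdArray, NdArrayDevice};
use candle_core::Tensor as CandleTensor;

// Hypothetical helper: copy a 2-D Candle tensor into a Burn tensor by
// flattening to a Vec<f32> and rebuilding it on the other side.
fn candle_to_burn(t: &CandleTensor) -> candle_core::Result<BurnTensor<NdArray, 2>> {
    let (rows, cols) = t.dims2()?;
    let data: Vec<f32> = t.flatten_all()?.to_vec1()?;
    let device = NdArrayDevice::Cpu;
    Ok(BurnTensor::from_data(TensorData::new(data, [rows, cols]), &device))
}
```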

As the ecosystem matures, we'll likely see consolidation around fewer libraries, each with clearer use case definitions and wider support.