NVIDIA Labs · 2026-06-16 · major
cuTile Rust v0.2.0 — NVIDIA Labs ships NVFP4 GPU kernels in safe Rust
NVIDIA Labs ships cuTile Rust v0.2.0, a safe tile-based GPU kernel DSL for Rust with NVFP4 packing and block-scaled GEMM on B200. A companion paper, Fearless Concurrency on the GPU, reports 7 TB/s element-wise and 2 PFlop/s GEMM throughput.
cuTile Rust is NVIDIA Labs' safe tile-based GPU kernel DSL for Rust, now with NVFP4 packing and a new performance paper.
Key specs
| Spec | Value |
|---|---|
| License | Apache-2.0 |
| GitHub stars | 494 |
| Element-wise throughput | 7 TB/s on B200 |
| GEMM throughput | 2 PFlop/s on B200 |
What is it?
cuTile Rust is an open-source Rust DSL from NVIDIA Labs for writing safe, data-race-free GPU kernels with a tile-based memory model. It gives Rust programmers ownership-checked tensor handles, async kernel launches, and a host API that keeps device pointers from leaking past their lifetime.
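To make the idea of ownership-checked tiles concrete, here is a CPU-only analogy in standard Rust. It does not use the cuTile Rust API; the function name `tiled_add` and the one-thread-per-tile layout are assumptions chosen for illustration. The point is only that when each worker receives a disjoint `&mut` tile, the borrow checker rejects any program in which two workers could write the same element.

```rust
use std::thread;

/// Hypothetical CPU analogy: add `rhs` into `out` tile by tile,
/// with one scoped thread standing in for one GPU tile program.
fn tiled_add(out: &mut [f32], rhs: &[f32], tile: usize) {
    assert_eq!(out.len(), rhs.len());
    assert!(tile > 0);
    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping mutable tiles, so no two
        // threads can ever hold a writable view of the same element.
        for (o, r) in out.chunks_mut(tile).zip(rhs.chunks(tile)) {
            s.spawn(move || {
                for (x, y) in o.iter_mut().zip(r) {
                    *x += *y;
                }
            });
        }
        // All threads are joined when the scope ends, so the borrowed
        // tiles cannot outlive `out` or `rhs`.
    });
}
```

Calling `tiled_add(&mut out, &rhs, 4)` on two ten-element slices splits the work into tiles of 4, 4, and 2 elements. Handing the same tile to two threads would be a compile error rather than a runtime race.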
How does it work?
cuTile Rust v0.2.0 adds CUDA 13.3 low-precision support (NVFP4 packing and unpacking, plus block-scaled matrix multiply) and a new cutile-kernels crate of reusable inference primitives. The release ships executable NVFP4 and MXFP8 examples, along with reproducibility artifacts for the Fearless Concurrency on the GPU paper (arXiv 2606.15991), which reports 7 TB/s element-wise and 2 PFlop/s GEMM on B200.
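For readers unfamiliar with what packing and block scaling involve, the following host-side sketch in plain Rust shows the arithmetic. It is not the cuTile Rust API, and every function name here is hypothetical. It assumes the commonly described NVFP4 layout: 4-bit E2M1 values in blocks of 16 that share one scale, two values per byte. Two simplifications are labeled in the comments: the block scale is kept as an `f32` rather than an FP8 value, and the low-nibble-first byte order is an assumption.

```rust
/// Magnitudes representable by a 4-bit E2M1 float; the index is the 3-bit code.
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];
/// Elements that share one scale factor.
const BLOCK: usize = 16;

/// Encode an already-scaled value: bit 3 is the sign, bits 0..=2 pick the magnitude.
/// Simplification: ties go to the smaller magnitude, not round-to-nearest-even.
fn encode_e2m1(x: f32) -> u8 {
    let sign = if x.is_sign_negative() { 0x8 } else { 0x0 };
    let mag = x.abs().min(6.0);
    let mut best = 0usize;
    for (i, m) in E2M1_MAGNITUDES.iter().enumerate() {
        if (mag - m).abs() < (mag - E2M1_MAGNITUDES[best]).abs() {
            best = i;
        }
    }
    sign | best as u8
}

fn decode_e2m1(code: u8) -> f32 {
    let mag = E2M1_MAGNITUDES[(code & 0x7) as usize];
    if code & 0x8 != 0 { -mag } else { mag }
}

/// Quantize one block and pack two codes per byte.
/// Assumptions: low nibble holds the even element; scale stays f32 for clarity.
fn pack_block(values: &[f32; BLOCK]) -> (f32, [u8; BLOCK / 2]) {
    let amax = values.iter().fold(0.0f32, |a, v| a.max(v.abs()));
    // Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    let scale = if amax > 0.0 { amax / 6.0 } else { 1.0 };
    let mut packed = [0u8; BLOCK / 2];
    for (i, pair) in values.chunks_exact(2).enumerate() {
        let lo = encode_e2m1(pair[0] / scale);
        let hi = encode_e2m1(pair[1] / scale);
        packed[i] = lo | (hi << 4);
    }
    (scale, packed)
}

fn unpack_block(scale: f32, packed: &[u8; BLOCK / 2]) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (i, &byte) in packed.iter().enumerate() {
        out[2 * i] = decode_e2m1(byte & 0x0F) * scale;
        out[2 * i + 1] = decode_e2m1(byte >> 4) * scale;
    }
    out
}

/// Block-scaled dot product: accumulate over the unscaled 4-bit values,
/// then apply both block scales once at the end.
fn block_scaled_dot(a: (f32, &[u8; BLOCK / 2]), b: (f32, &[u8; BLOCK / 2])) -> f32 {
    let mut acc = 0.0f32;
    for (&x, &y) in a.1.iter().zip(b.1.iter()) {
        acc += decode_e2m1(x & 0x0F) * decode_e2m1(y & 0x0F);
        acc += decode_e2m1(x >> 4) * decode_e2m1(y >> 4);
    }
    acc * a.0 * b.0
}
```

A block-scaled GEMM applies the same idea per tile: the inner loop multiplies the small 4-bit values, and each block's scale is applied once to the accumulated partial sum rather than to every element.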
Why does it matter?
cuTile Rust lets safety-critical Rust codebases compile their own GPU kernels without writing CUDA C++, using the same NVFP4 paths NVIDIA uses on Blackwell. For ML systems teams, this closes a practical gap: Rust LLM runtimes can now share low-precision inference kernels with Python frameworks instead of dispatching through a foreign-function boundary.
Who is it for?
ML systems engineers, Rust GPU kernel authors, NVFP4 inference researchers
Try it
cargo add cutile@0.2.0 --features kernels