NVIDIA Labs · 2026-06-16 · major

cuTile Rust v0.2.0 — NVIDIA Labs ships NVFP4 GPU kernels in safe Rust

Rating: 4
Author: AI/TLDR

NVIDIA Labs ships cuTile Rust v0.2.0, a safe tile-based GPU kernel DSL for Rust with NVFP4 packing and block-scaled GEMM on B200. A companion paper, Fearless Concurrency on the GPU, reports 7 TB/s element-wise and 2 PFlop/s GEMM throughput.

Repository: NVlabs/cutile-rs on GitHub

Key specs

License	Apache-2.0
GitHub stars	494
Element-wise throughput	7 TB/s on B200
GEMM throughput	2 PFlop/s on B200

What is it?

cuTile Rust is an open-source Rust DSL from NVIDIA Labs for writing safe, data-race-free GPU kernels with a tile-based memory model. It gives Rust programmers ownership-checked tensor handles, async kernel launches, and a host API that keeps device pointers from leaking past their lifetime.

How does it work?

cuTile Rust v0.2.0 adds CUDA 13.3 low-precision support (NVFP4 packing and unpacking, plus block-scaled matrix multiply) and a new cutile-kernels crate of reusable inference primitives. The release also ships executable NVFP4 and MXFP8 examples, along with reproducibility artifacts for the paper "Fearless Concurrency on the GPU" (arXiv 2606.15991), which reports 7 TB/s element-wise throughput and 2 PFlop/s GEMM throughput on B200.
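
To make NVFP4 packing and block scaling more concrete, here is a minimal CPU-side sketch in plain Rust. It does not use the cuTile Rust API; the function names (encode_e2m1, decode_e2m1, pack_block, unpack_block, block_scaled_dot) are hypothetical and exist only for this illustration. It assumes the commonly described NVFP4 layout: 4-bit E2M1 values stored two per byte, with one scale factor per 16-element block. As simplifications, the scale is kept as an f32 rather than an FP8 value, rounding ties go to the smaller magnitude, and the even-indexed element is assumed to sit in the low nibble.

```rust
/// Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit),
/// indexed by the low three bits of the 4-bit code.
const E2M1_MAGNITUDES: [f32; 8] = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0];

/// Number of elements sharing one scale factor (assumed NVFP4 block size).
const BLOCK: usize = 16;

/// Encode an already-scaled value as a 4-bit code, sign in bit 3.
/// Hypothetical helper: nearest magnitude, ties go to the smaller one.
fn encode_e2m1(x: f32) -> u8 {
    let sign: u8 = if x.is_sign_negative() { 0x8 } else { 0x0 };
    let mag = x.abs().min(6.0); // clamp to the largest representable value
    let mut best = 0usize;
    for (i, &m) in E2M1_MAGNITUDES.iter().enumerate() {
        if (mag - m).abs() < (mag - E2M1_MAGNITUDES[best]).abs() {
            best = i;
        }
    }
    sign | best as u8
}

/// Decode a 4-bit code back to f32 (still in scaled units).
fn decode_e2m1(code: u8) -> f32 {
    let mag = E2M1_MAGNITUDES[(code & 0x7) as usize];
    if code & 0x8 != 0 { -mag } else { mag }
}

/// Quantize one block: choose a scale so the largest magnitude maps to 6.0,
/// then pack two 4-bit codes per byte (even index in the low nibble).
fn pack_block(values: &[f32; BLOCK]) -> (f32, [u8; BLOCK / 2]) {
    let amax = values.iter().fold(0.0f32, |a, &v| a.max(v.abs()));
    // Simplification: an f32 scale; an all-zero block keeps scale 1.0.
    let scale = if amax > 0.0 { amax / 6.0 } else { 1.0 };
    let mut packed = [0u8; BLOCK / 2];
    for (i, pair) in values.chunks_exact(2).enumerate() {
        let lo = encode_e2m1(pair[0] / scale);
        let hi = encode_e2m1(pair[1] / scale);
        packed[i] = lo | (hi << 4);
    }
    (scale, packed)
}

/// Unpack one block and multiply by its scale to recover approximate values.
fn unpack_block(scale: f32, packed: &[u8; BLOCK / 2]) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (i, &byte) in packed.iter().enumerate() {
        out[2 * i] = decode_e2m1(byte & 0x0F) * scale;
        out[2 * i + 1] = decode_e2m1(byte >> 4) * scale;
    }
    out
}

/// Block-scaled dot product of two packed blocks: accumulate the products of
/// the 4-bit values, then apply both block scales once to the partial sum.
fn block_scaled_dot(
    a_scale: f32,
    a: &[u8; BLOCK / 2],
    b_scale: f32,
    b: &[u8; BLOCK / 2],
) -> f32 {
    let mut acc = 0.0f32;
    for (&x, &y) in a.iter().zip(b.iter()) {
        acc += decode_e2m1(x & 0x0F) * decode_e2m1(y & 0x0F);
        acc += decode_e2m1(x >> 4) * decode_e2m1(y >> 4);
    }
    acc * a_scale * b_scale
}
```

Calling pack_block on a 16-element array returns the block scale and eight packed bytes, and unpack_block reverses the step with rounding error wherever an input is not exactly representable. A block-scaled GEMM repeats the block_scaled_dot pattern along the reduction dimension, one scale pair per 16-element block; how cuTile Rust expresses this on the GPU is not shown here.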

Why does it matter?

cuTile Rust lets safety-critical Rust codebases compile their own GPU kernels without writing CUDA C++, using the same NVFP4 paths NVIDIA uses on Blackwell. For ML systems teams, that closes a real gap: Rust LLM runtimes can now share low-precision inference kernels with Python frameworks instead of dispatching through a foreign-function boundary.

Who is it for?

ML systems engineers, Rust GPU kernel authors, NVFP4 inference researchers

Try it

cargo add cutile@0.2.0 --features kernels