AI/TLDR

Cloudflare · 2026-04-17 · notable

Cloudflare Unweight — 22% Lossless LLM Compression via Huffman-Coded BF16 Exponents

Cloudflare's lossless LLM compression system achieves 22% smaller model bundles and ~3 GB VRAM savings on Llama 3.1 8B by Huffman-coding BF16 exponent bytes and decompressing directly in on-chip shared memory — no quality loss.

Cloudflare's lossless inference-time compression cuts LLM model bundles by 22% by exploiting a statistical property of trained BF16 weights.

Key specs

Distribution bundle reduction: 22%
Inference bundle reduction: 13%
VRAM savings (Llama 3.1 8B): ~3 GB
MLP weight compression: ~30%
Throughput overhead: 30–40%

What is it?

Unweight is Cloudflare's internal LLM compression system, now publicly described as part of Agents Week. It achieves lossless compression — weights decompress to exactly the original values — and runs at inference time rather than as a one-off quantization step. The system is deployed across Cloudflare's GPU network for Workers AI and is primarily a distribution and VRAM-saving technique.

How does it work?

Trained BF16 model weights have a strong statistical skew: the 16 most common exponent values (out of 256 possible) cover over 99% of all weights in a typical layer. Unweight applies Huffman coding to the 8 exponent bits of each BF16 value and leaves the remaining sign and mantissa bits (8 bits per value) uncompressed. Crucially, weights are decompressed directly in fast on-chip shared memory and fed immediately to the tensor cores, so only the compressed data has to cross the slower main-memory bus. An autotuner selects the best of four execution pipelines per layer based on measured hardware performance.
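The exponent-coding idea can be illustrated on the CPU with a small Python sketch. It is not Cloudflare's implementation: the weights are synthetic Gaussian samples (an assumption standing in for a trained layer), BF16 conversion is done by simple truncation of a float32, and the helper names (to_bf16_bits, split_bf16, huffman_code, and so on) are hypothetical. It shows only the split into exponent and sign-plus-mantissa, the Huffman code over exponents, and a lossless round trip; the shared-memory GPU decoding and the autotuner are out of scope.

```python
import heapq
import random
import struct
from collections import Counter


def to_bf16_bits(x):
    """Top 16 bits of the float32 encoding of x (BF16 by truncation, no rounding)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16


def split_bf16(bits):
    """Split a 16-bit BF16 pattern into (8-bit exponent, 8-bit sign+mantissa)."""
    exponent = (bits >> 7) & 0xFF
    sign_mantissa = ((bits >> 15) << 7) | (bits & 0x7F)
    return exponent, sign_mantissa


def join_bf16(exponent, sign_mantissa):
    """Inverse of split_bf16."""
    return ((sign_mantissa >> 7) << 15) | (exponent << 7) | (sign_mantissa & 0x7F)


def huffman_code(freqs):
    """Build a prefix code {symbol: bitstring} from a {symbol: count} mapping."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # Heap entries: (count, unique tiebreak, partial code table).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def encode(bf16_values):
    """Huffman-code the exponents; keep sign+mantissa bytes as they are."""
    pairs = [split_bf16(b) for b in bf16_values]
    exponents = [e for e, _ in pairs]
    rest = [r for _, r in pairs]
    code = huffman_code(Counter(exponents))
    bitstream = "".join(code[e] for e in exponents)
    return code, bitstream, rest


def decode(code, bitstream, rest):
    """Rebuild the exact original 16-bit patterns."""
    inverse = {bits: sym for sym, bits in code.items()}
    out, current = [], ""
    for bit in bitstream:
        current += bit
        if current in inverse:  # prefix code: the first match is the symbol
            out.append(join_bf16(inverse[current], rest[len(out)]))
            current = ""
    return out


# Synthetic stand-in for one layer's weights (assumption: zero-mean Gaussian).
random.seed(0)
weights = [to_bf16_bits(random.gauss(0.0, 0.02)) for _ in range(20000)]

code, bitstream, rest = encode(weights)
decoded = decode(code, bitstream, rest)

exponent_counts = Counter(split_bf16(b)[0] for b in weights)
top16 = sum(c for _, c in exponent_counts.most_common(16)) / len(weights)
avg_exp_bits = len(bitstream) / len(weights)
ratio = (8 + avg_exp_bits) / 16  # compressed size relative to raw BF16

print("lossless round trip:", decoded == weights)
print("top-16 exponent coverage:", round(top16, 4))
print("average bits per exponent:", round(avg_exp_bits, 2))
print("size relative to BF16:", round(ratio, 3))
```

Since the sign and mantissa bits are stored untouched, the per-weight size is (8 + b) / 16 of the original, where b is the average Huffman code length of an exponent. The numbers this sketch prints describe the synthetic distribution only and should not be read as Unweight's measured results; the code table overhead is also ignored here.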

Why does it matter?

A 22% reduction in distribution bundle size and ~3 GB less VRAM on an 8B model compounds quickly at Cloudflare's scale: fewer GPU memory swaps, faster cold starts, and lower per-inference bandwidth cost. Unlike quantization, the approach is lossless, so there is no benchmark regression, and it adapts per layer. The 30–40% throughput overhead is a real cost; it is highest at small batch sizes and narrows to ~30% at large batches. Because the system is already deployed internally, this is a production-grade technique, not a prototype.

Who is it for?

ML infrastructure engineers and researchers working on LLM inference efficiency.

Try it

https://blog.cloudflare.com/unweight-tensor-compression/

Tags

  • cloudflare
  • llm
  • compression
  • inference
  • bf16
  • huffman
  • vram
  • workers-ai
  • open-research