ggml-org · 2026-04-09 · major
llama.cpp Build b8738 — Vendor-Agnostic Tensor Parallelism, 1-Bit Quantization, AMD CDNA4
April 2026 brought backend-agnostic tensor parallelism (3–4× faster than layer splitting), Q1_0 1-bit quantization that fits 7B models in under 1 GB, Walsh-Hadamard KV cache rotation, and hardware support for AMD CDNA4 and Qualcomm Hexagon NPUs.
llama.cpp gains true multi-GPU tensor parallelism across CUDA, ROCm, and Metal — no vendor lock-in.
Key specs
| Metric | Value |
|---|---|
| GitHub stars | 108,661 |
| Builds shipped in April | 170 |
What is it?
llama.cpp is the most widely deployed local LLM inference engine, with 108k GitHub stars. Build b8738 (April 9) delivered the headline feature: tensor parallelism merged into mainline. It arrived alongside 1-bit quantization and a series of hardware backend expansions across the 170+ builds shipped in April.
How does it work?
Tensor parallelism splits each matrix operation across multiple GPUs, so all GPUs work on every token at the same time. Layer splitting, by contrast, assigns whole layers to each GPU, so the GPUs largely take turns while a token passes through the model. The new implementation is backend-agnostic: it runs on NVIDIA CUDA, AMD ROCm, and Apple Metal rather than being tied to one vendor. Benchmarks showed 3–4× throughput gains over layer splitting, with GPUs pegged at 100% utilization. Q1_0 is a 1-bit quantization format aimed at binary-weight models; it brings a 7B model below 1 GB. Walsh-Hadamard rotation of the KV cache makes Q4_0 KV quantization usable for reasoning tasks, where it was previously unusable and now scores 21.7% on AIME25.
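To make the tensor-parallel idea concrete, here is a minimal sketch of a column-split matrix multiply. It is a toy in plain Python and not llama.cpp's implementation: the "devices" are ordinary lists, and the helper names `split_columns` and `tensor_parallel_matmul` are hypothetical. Each device holds a slice of the weight matrix's columns, multiplies the same input by its slice, and the partial outputs are concatenated.

```python
# Toy illustration only: "devices" are plain Python lists, not GPUs.

def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, n_devices):
    """Give each device a contiguous slice of the weight matrix's columns."""
    cols = len(w[0])
    step = cols // n_devices  # assumes cols is divisible by n_devices
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(n_devices)]

def tensor_parallel_matmul(x, w, n_devices):
    """Every device multiplies the same input by its own column shard."""
    shards = split_columns(w, n_devices)
    # On real hardware these products would run concurrently, one per device.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: stitch the per-device output columns back together.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]

reference = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, n_devices=2)
print(sharded == reference)  # True
```

The sharded result matches the single-device product, which is the property that lets every GPU contribute to every token. The cost in a real system is the gather step, which requires communication between devices on each operation.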
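The intuition behind rotating the KV cache before quantizing it can also be sketched in a few lines. The example below is an assumption-laden toy, not the code path llama.cpp uses: it applies an orthonormal Walsh-Hadamard transform to a single vector and uses a simplified per-vector symmetric 4-bit quantizer rather than the real block-based Q4_0 layout. The function names `fwht` and `quantize_4bit` are hypothetical. The point is that one large outlier forces a coarse quantization step, while the rotation spreads that outlier's energy across all coordinates and allows a finer step.

```python
import math

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [x * scale for x in out]

def quantize_4bit(v):
    """Simplified symmetric 4-bit quantizer with one scale for the whole vector."""
    amax = max(abs(x) for x in v)
    scale = amax / 7.0 if amax > 0 else 1.0
    # Round to an integer level in [-8, 7], then map back to a float.
    return [max(-8, min(7, round(x / scale))) * scale for x in v]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Fifteen ordinary values plus one large outlier.
vec = [1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0, -1.0,
       1.0, 1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 20.0]

direct = quantize_4bit(vec)               # quantize as-is
rotated = fwht(quantize_4bit(fwht(vec)))  # rotate, quantize, rotate back

direct_err = mse(vec, direct)
rotated_err = mse(vec, rotated)
print(direct_err, rotated_err)  # the rotated error is the smaller of the two here
```

On this input the direct quantizer rounds every ordinary value to zero because the outlier sets the scale. After rotation no single coordinate dominates, so the same 4 bits cover the vector much more evenly.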
Why does it matter?
Consumer-grade multi-GPU setups can now run MoE models like Llama 4 and Qwen 3 MoE efficiently, without CUDA-only infrastructure. 1-bit quantization pushes 7B inference onto ultra-constrained hardware.
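A back-of-the-envelope sketch shows how a 1-bit format can land under 1 GB for a 7B model. The layout here is an assumption for illustration, not a description of the actual Q1_0 format: one sign bit per weight plus one 16-bit scale per block of 128 weights, applied to exactly 7 billion weights. The helper names are hypothetical. The second half shows why such a format suits binary-weight models: if every weight in a block is already plus or minus one scale value, the round trip is lossless.

```python
def packed_size_bytes(n_weights, block_size, scale_bits):
    """Bytes for 1 sign bit per weight plus one scale per block (assumed layout)."""
    n_blocks = -(-n_weights // block_size)  # ceiling division
    total_bits = n_weights + n_blocks * scale_bits
    return -(-total_bits // 8)

def quantize_block(block):
    """Store the mean magnitude as the scale and one sign bit per weight."""
    scale = sum(abs(w) for w in block) / len(block)
    bits = 0
    for i, w in enumerate(block):
        if w >= 0:
            bits |= 1 << i  # bit set means a non-negative weight
    return scale, bits

def dequantize_block(scale, bits, n):
    """Rebuild n weights as +scale or -scale from the packed sign bits."""
    return [scale if (bits >> i) & 1 else -scale for i in range(n)]

# Assumed: 7e9 weights, 128-weight blocks, 16-bit scales.
size = packed_size_bytes(7_000_000_000, 128, 16)
print(size / 1e9)  # 0.984375 (GB, decimal)

# A binary-weight block survives the round trip exactly.
block = [0.5, -0.5, 0.5, 0.5]
scale, bits = quantize_block(block)
print(dequantize_block(scale, bits, len(block)) == block)  # True
```

Under these assumptions the per-block scale adds 0.125 bits per weight. With a hypothetical 32-weight block the same arithmetic gives about 1.31 GB, so the block size matters for staying under the 1 GB mark.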
Who is it for?
Local AI developers with multi-GPU rigs, edge hardware deployers, and hobbyists running large models on consumer hardware.
Try it
Build from source: cmake -B build -DGGML_CUDA=ON -DGGML_TP=ON && cmake --build build -j