ggml-org · 2026-04-09 · major

llama.cpp Build b8738 — Vendor-Agnostic Tensor Parallelism, 1-Bit Quantization, AMD CDNA4

April 2026 brought backend-agnostic tensor parallelism (benchmarked at 3–4× the throughput of layer splitting), Q1_0 1-bit quantization that fits 7B models under 1 GB, Walsh-Hadamard KV cache rotation, and hardware support for AMD CDNA4 GPUs and Qualcomm Hexagon NPUs.

llama.cpp gains true multi-GPU tensor parallelism across CUDA, ROCm, and Metal — no vendor lock-in.

Key specs

GitHub stars	108,661
Builds shipped in April	170+

What is it?

llama.cpp is the most widely deployed local LLM inference engine with 108k GitHub stars. Build b8738 (April 9) delivered the headline feature: genuine tensor parallelism merged into mainline, alongside 1-bit quantization and a suite of hardware backend expansions across 170+ builds shipped in April.

How does it work?

Tensor parallelism splits each matrix operation across multiple GPUs, so all GPUs work on every token at the same time. Layer splitting, by contrast, assigns whole layers to each GPU, so for a single request the GPUs largely take turns. The new implementation runs on NVIDIA CUDA, AMD ROCm, and Apple Metal rather than being tied to one vendor. Benchmarks showed 3–4× throughput gains over layer splitting, with GPUs pegged at 100% utilization. Q1_0 1-bit quantization targets binary-weight models and compresses 7B models below 1 GB. Walsh-Hadamard rotation of the KV cache makes Q4_0 KV quantization usable for reasoning tasks: previously unusable there, it now scores 21.7% on AIME25.
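
The tensor-parallel idea can be sketched in a few lines of plain Python. This is a toy illustration, not llama.cpp's implementation: lists stand in for GPU tensors, the helper names are hypothetical, and column sharding is assumed as the split scheme because the release notes do not say how the real backend partitions each matrix.

```python
# Toy illustration only: plain Python lists stand in for GPU tensors,
# and each "device" is just a slice of the weight matrix.

def matvec(x, w):
    """Compute y = x @ w for a row vector x (length k) and a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, n_devices):
    """Give each device a contiguous block of output columns of w (assumed scheme)."""
    n_cols = len(w[0])
    assert n_cols % n_devices == 0, "toy example assumes an even split"
    step = n_cols // n_devices
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(n_devices)]

def tensor_parallel_matvec(x, w, n_devices):
    """Every device multiplies the same input by its own column shard.
    Concatenating the partial outputs reproduces the full result."""
    shards = split_columns(w, n_devices)
    # On real hardware these products would run concurrently, one per GPU.
    partials = [matvec(x, shard) for shard in shards]
    return [value for part in partials for value in part]

x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0]]

full = matvec(x, w)
sharded = tensor_parallel_matvec(x, w, n_devices=2)
print(full == sharded)  # the sharded result matches the single-device result
```

Because every shard needs the same input and contributes to the same output, all devices stay busy on each token. With layer splitting, a device only works while the token passes through its own layers.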
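
Why a Walsh-Hadamard rotation helps low-bit quantization can be shown with a small experiment. The sketch below is an assumption-laden toy: the 4-bit quantizer uses one absmax scale per vector and is not the actual Q4_0 block layout, the function names are hypothetical, and the release notes do not describe where or how llama.cpp applies the rotation. It only demonstrates the general effect: an orthonormal rotation spreads a single outlier across all coordinates, so the quantization grid is no longer dominated by it.

```python
import math

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two.
    With the 1/sqrt(n) scaling, the transform is its own inverse."""
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(v)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [value * scale for value in out]

def quantize_4bit(v):
    """Toy symmetric 4-bit quantizer: one absmax scale, integer levels in [-7, 7].
    Returns the dequantized values. Not the real Q4_0 format."""
    amax = max(abs(value) for value in v)
    if amax == 0.0:
        return list(v)
    scale = amax / 7.0
    return [max(-7, min(7, round(value / scale))) * scale for value in v]

def squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

# Seven small values and one large outlier.
v = [0.5, -0.5, 0.5, 0.5, -0.5, 0.5, -0.5, 8.0]

# Quantize directly: the outlier sets the scale and the small values round to zero.
err_plain = squared_error(v, quantize_4bit(v))

# Rotate, quantize, rotate back.
err_rotated = squared_error(v, fwht(quantize_4bit(fwht(v))))

print(err_plain, err_rotated)
```

In this toy case the squared error falls from about 1.75 to about 0.22. Since the rotation is orthonormal, dot products between rotated vectors are unchanged, which is what lets a rotated cache stand in for the original one.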

Why does it matter?

Consumer-grade multi-GPU setups can now run MoE models like Llama 4 and Qwen 3 MoE efficiently, without CUDA-only infrastructure. 1-bit quantization pushes 7B inference onto ultra-constrained hardware.
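
The sub-1 GB figure follows from simple arithmetic, sketched below. The block size and scale width are hypothetical parameters chosen for illustration, since the Q1_0 storage layout is not described here; the estimate also ignores any tensors kept at higher precision.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def effective_bits_per_weight(weight_bits, block_size, scale_bits=16):
    """Bits per weight when each block of weights shares one scale value.
    block_size and scale_bits are illustrative assumptions, not the Q1_0 spec."""
    return weight_bits + scale_bits / block_size

N_PARAMS = 7e9  # a "7B" model

size_fp16 = model_size_gb(N_PARAMS, 16)
size_4bit = model_size_gb(N_PARAMS, 4)
size_1bit = model_size_gb(N_PARAMS, 1)
# Hypothetical layout: 128 one-bit weights share one 16-bit scale (1.125 bits per weight).
size_1bit_with_scales = model_size_gb(N_PARAMS, effective_bits_per_weight(1, 128))

print(size_fp16, size_4bit, size_1bit, size_1bit_with_scales)
```

Under these assumptions a 7B model needs about 14 GB at 16 bits, 3.5 GB at 4 bits, and 0.875 GB at a pure 1 bit per weight. Per-block scales add overhead, so staying under 1 GB leaves room for only a small fraction of a bit per weight on top.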

Who is it for?

Local AI developers with multi-GPU rigs, edge hardware deployers, and hobbyists running large models on consumer hardware.

Try it

Build from source: cmake -B build -DGGML_CUDA=ON -DGGML_TP=ON && cmake --build build -j