AI/TLDR

PrismML · 2026-04-16 · notable

Ternary Bonsai — 1.58-Bit 8B Runs at 82 tok/s on M4 Pro in 1.75 GB

PrismML's Ternary Bonsai squeezes an 8B model into 1.75 GB using 1.58-bit ternary weights — 9× smaller than FP16, 82 tok/s on M4 Pro, 27 tok/s on iPhone 17 Pro Max, 75.5 avg benchmark. Apache 2.0, GGUF and MLX variants on HuggingFace.

PrismML-Eng/Bonsai-demo GitHub repository — 1.58-bit Ternary Bonsai language models for on-device inference

1.58-bit ternary weights fit an 8B model in 1.75 GB and run fast enough for interactive use on iPhones and MacBooks.

Key specs

Memory (8B model): 1.75 GB
Vs. FP16 footprint: 9× smaller
Throughput (M4 Pro): 82 tok/s
Throughput (iPhone 17 Pro Max): 27 tok/s
Avg benchmark score (8B): 75.5
Energy efficiency vs. FP16: 3–4×

What is it?

Ternary Bonsai is a family of language models from PrismML in which every weight is one of three values: -1, 0, or +1. The family comes in 8B, 4B, and 1.7B sizes; the 8B variant occupies 1.75 GB, about 9× smaller than a standard FP16 8B model (roughly 16 GB at 2 bytes per weight). All models are licensed under Apache 2.0 and distributed as GGUF and MLX (Apple Silicon) variants on HuggingFace. PrismML also ships a pre-built native runner (Bonsai-demo) supporting Metal, CUDA, ROCm, Vulkan, and CPU backends.
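The size figures above can be checked with a few lines of arithmetic. The sketch below assumes a nominal 8 billion parameters (the exact count is not stated here) and uses a hypothetical base-3 packing of five weights per byte to show how a sub-2-bit encoding is possible at all; it is not the on-disk layout of the GGUF or MLX files.

```python
import math

PARAMS = 8e9  # nominal "8B" parameter count (assumption; exact count not given)

# Information content of one weight drawn from {-1, 0, +1}.
bits_per_ternary_weight = math.log2(3)

fp16_gb = PARAMS * 16 / 8 / 1e9                       # 2 bytes per weight
ideal_ternary_gb = PARAMS * bits_per_ternary_weight / 8 / 1e9

print(f"bits per ternary weight: {bits_per_ternary_weight:.3f}")
print(f"FP16 size:               {fp16_gb:.1f} GB")
print(f"ideal ternary size:      {ideal_ternary_gb:.2f} GB")
print(f"FP16 / reported 1.75 GB: {fp16_gb / 1.75:.1f}x")


def pack_trits(trits):
    """Pack five values from {-1, 0, +1} into one byte as a base-3 number.

    Hypothetical layout for illustration: 3**5 = 243 fits in a byte,
    giving 8 / 5 = 1.6 bits per weight.
    """
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    value = 0
    for t in trits:
        value = value * 3 + (t + 1)   # shift {-1, 0, 1} to digits {0, 1, 2}
    return value                      # always in 0..242


def unpack_trits(byte):
    """Inverse of pack_trits: recover the five ternary values."""
    digits = []
    for _ in range(5):
        digits.append(byte % 3 - 1)
        byte //= 3
    return digits[::-1]               # digits were extracted least-significant first


packed = pack_trits([1, -1, 0, 0, 1])
print(packed, unpack_trits(packed))
```

Under these assumptions the ideal ternary size comes out near 1.58 GB and the FP16 ratio near 9.1×. The gap between the ideal figure and the reported 1.75 GB is presumably taken up by per-block scale factors and any tensors stored at higher precision, though the breakdown is not given here.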

How does it work?

Ternary quantization constrains weights to {-1, 0, +1}, so each weight carries log2(3) ≈ 1.58 bits of information (hence "1.58-bit"), and multiplying by a weight reduces to adding, subtracting, or skipping the activation. This cuts storage and can reduce compute cost on hardware with efficient integer pipelines, such as Apple Silicon. The 8B model scores 75.5 averaged across MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, and BFCLv3, competitive with other open 8B models. PrismML also ships a 1-bit Bonsai family (weights in {-1, +1}), which trades benchmark score for an even smaller footprint.
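A minimal sketch of the idea, in plain Python. It assumes absmean rounding with one scale per weight vector, the recipe popularized by BitNet b1.58; whether PrismML uses this exact scheme is not stated here, and the function names are illustrative only.

```python
def quantize_ternary(weights):
    """Map float weights to {-1, 0, +1} plus a single float scale.

    Assumption: absmean scaling (divide by mean |w|, round, clip to [-1, 1]).
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0  # avoid divide-by-zero
    trits = [max(-1, min(1, round(w / scale))) for w in weights]
    return trits, scale


def ternary_dot(trits, scale, activations):
    """Dot product with ternary weights: only adds and subtracts in the loop."""
    acc = 0.0
    for t, x in zip(trits, activations):
        if t == 1:
            acc += x        # weight +1: add the activation
        elif t == -1:
            acc -= x        # weight -1: subtract it
        # weight 0: skip the activation entirely
    return acc * scale      # one multiply per output, not one per weight


weights = [0.9, -0.05, -1.2, 0.4]
activations = [1.0, 2.0, 3.0, 4.0]

trits, scale = quantize_ternary(weights)
print(trits, scale)
print(ternary_dot(trits, scale, activations))
```

The inner loop contains no multiplications; the only multiply is the final rescale, which is why ternary kernels can be cheaper than floating-point ones when the hardware and kernel are built to exploit it.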

Why does it matter?

Fitting an 8B model in 1.75 GB means it runs in the RAM headroom of current iPhones and RAM-constrained laptops without any extra quantization step. At 82 tok/s on M4 Pro and 27 tok/s on iPhone 17 Pro Max, generation is fast enough for interactive use, comparable to what llama.cpp achieves with Q4 quants but at a smaller footprint. The Apache 2.0 license and ready-to-run GGUF files lower the barrier to entry for on-device deployment.
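To put the throughput numbers in user-facing terms, the sketch below converts tokens per second into wall-clock time for a reply. The 200-token reply length is a hypothetical figure chosen for illustration, and the estimate ignores prompt processing time, which is not reported here.

```python
def seconds_to_generate(tokens, tok_per_s):
    """Decode time for a reply, ignoring prompt processing and startup."""
    return tokens / tok_per_s


REPLY_TOKENS = 200  # hypothetical reply length, not from the release

# Throughput figures as reported for the 8B model.
reported = {"M4 Pro": 82, "iPhone 17 Pro Max": 27}

for device, rate in reported.items():
    secs = seconds_to_generate(REPLY_TOKENS, rate)
    print(f"{device}: {secs:.1f} s for {REPLY_TOKENS} tokens")
```

With these inputs a 200-token reply takes roughly 2.4 s on the M4 Pro and 7.4 s on the iPhone, and in both cases text is generated faster than most people read.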

Who is it for?

Developers targeting on-device inference on Apple Silicon or iPhones; self-hosters with memory-constrained hardware.

Try it

huggingface.co/collections/prism-ml/ternary-bonsai — GGUF and MLX variants

Tags

  • quantization
  • ternary-weights
  • on-device
  • apple-silicon
  • local-llm
  • mlx
  • gguf
  • mobile-inference
  • compression
  • self-hosting