Zyphra · 2026-05-06 · major
Zyphra ZAYA1-8B — AMD-Trained MoE Reasoning Model With <1B Active Parameters
Open-weight 8.4B mixture-of-experts with only 760M active parameters, trained end-to-end on 1,024 AMD MI300X GPUs. Hits 89.1 on AIME26 and 71.6 on HMMT, matching open models 10–100x larger. Apache 2.0.

Zyphra trained an 8.4B sparse MoE with under a billion active params on AMD MI300X — and it matches open models 10–100x larger on math and code.
Key specs
| Parameters | 8.4B |
|---|---|
| Active params | 760M |
| GPQA | 71.0 |
| Aime26 | 89.1 |
| Hmmt feb 26 | 71.6 |
| Live code bench v6 | 65.8 |
| Training gpus | 1024 MI300X |
What is it?
ZAYA1-8B is a small mixture-of-experts language model from Zyphra: 8.4B total parameters but only ~760M active per token. It's a reasoning-first open-weight model targeting math, code, and long-form analysis, trained end-to-end on AMD hardware rather than NVIDIA — a first at this scale for any well-known frontier-style release.
How does it work?
Zyphra trained on 1,024 AMD Instinct MI300X GPUs with Pensando Pollara networking on IBM Cloud. The architecture combines a Compressed Convolutional Attention (CCA) variant, an MLP-based expert router that improves routing stability, and learned residual scaling. Post-training adds a reasoning RL cascade and a new test-time compute method called Markovian RSA, which chunks parallel reasoning traces to keep memory constant during long deliberation.
Why does it matter?
Two things are notable. First, it shows AMD's MI300X stack is now production-ready for end-to-end frontier-style training, not just inference. Second, sub-1B active params getting 89 on AIME26 and 71 on HMMT keeps Zyphra's claim that you can win on intelligence-per-parameter without scaling totals to hundreds of billions — meaningful for on-device and edge deployment.
Who is it for?
open-weight researchers, anyone optimizing reasoning per FLOP, AMD-stack teams
Try it
huggingface.co/Zyphra/ZAYA1-8B or serverless via cloud.zyphra.com