Wafer · 2026-07-03 · notable

Wafer runs GLM-5.2 on AMD MI355X — 2x cheaper inference than Blackwell

Wafer quantized Z.ai's 753B GLM-5.2 to MXFP4 and ran it on AMD MI355X with sglang. Result: 2,626 tokens/sec per node, 213 tokens/sec single-stream, ~80% of Blackwell B200 throughput at 2.75x lower GPU cost.

Wafer blog header for GLM-5.2 on AMD MI355X performance-per-dollar benchmark.

GLM-5.2 in MXFP4 on AMD MI355X hits 2.6K tokens/sec per node at 80% of Blackwell speed — for about a third of the GPU price.

What is it?

Wafer, a Y Combinator-backed inference company, published a benchmark of Z.ai's 753B open-weight GLM-5.2 on AMD's MI355X GPU. The team quantized the model to MXFP4 with AMD Quark, then served it under sglang with speculative decoding, and compared throughput and cost against NVIDIA Blackwell B200 and B300 nodes.

How does it work?

GLM-5.2 was quantized from bf16 to MXFP4 via AMD Quark, then served with sglang using speculative decoding (5/1/6 configuration), an fp8_e4m3 KV-cache, and AITER all-reduce fusion. Accuracy was validated against the FP8 baseline on GSM8K (0.965 to 0.955), GPQA-Diamond (0.922 to 0.903), and tau2 macro (0.819 to 0.834).

Why does it matter?

MI355X reaches roughly 80% of Blackwell B200 throughput on GLM-5.2 while the GPU costs about 2.75x less, so tokens per dollar tilt firmly toward AMD. For teams that already run open weights, this is a concrete signal that CUDA lock-in on frontier inference is loosening — one of the first credible AMD numbers on a 700B+ model.

Who is it for?

inference-cost-sensitive teams, AMD hardware evaluators, GLM-5.2 deployers

Wafer runs GLM-5.2 on AMD MI355X — 2x cheaper inference than Blackwell

What is it?

How does it work?

Why does it matter?

Who is it for?

Sources · 2 outlets

Tags