Tongyi Lab / Tianjin University · 2026-04-21 · notable
TEMPO — EM-Based Test-Time Training That Scales Past the Plateau for Reasoning Models
TEMPO fixes the test-time training plateau for reasoning models: an EM framework alternates policy refinement on unlabeled data with critic recalibration on labeled examples. OLMO3-7B AIME 2024: 33.0%→51.1%; Qwen3-14B: 42.3%→65.8%.

What is it?
Test-time training (TTT) adapts a model's weights on unlabeled test inputs at inference time, letting the model specialize to the kind of problems it actually faces. TEMPO is a TTT framework for large reasoning models that addresses a known failure mode: existing methods improve quickly at first, then plateau because the self-generated reward signal drifts as the policy improves, eventually producing inaccurate advantages that stall learning and collapse solution diversity. TEMPO alternates between two EM steps — refining the policy on unlabeled questions (M-step) and recalibrating a value critic on a small labeled set (E-step) — to keep the reward signal grounded throughout training.
How does it work?
TEMPO frames TTT as Expectation-Maximization (EM). In the E-step, a value critic, kept separate from the policy, is updated on a labeled dataset so that it estimates solution quality accurately. In the M-step, the policy is trained on unlabeled test questions using advantages assigned by the freshly calibrated critic, with the labeled set acting as a periodic anchor. Prior methods like S1 and TTT-RL skip the E-step, which the paper shows is equivalent to an incomplete EM variant that omits the step that tightens the evidence lower bound. The recalibration step is what allows TEMPO to keep improving past the plateau where prior methods stall. The implementation builds on ByteDance's verl RL training framework.
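The alternation described above can be sketched on a toy problem. This is not the TEMPO code and does not use verl. It is a minimal illustration that assumes a policy choosing among three hypothetical "solution strategies", a tabular critic, and a plain policy-gradient update. Every name here (`TRUE_SUCCESS`, `e_step`, `m_step`, `run_toy_em_ttt`) is invented for the example.

```python
import math
import random

# Hypothetical hidden success rate of each toy "solution strategy".
# Only the labeled set can observe outcomes drawn from these rates.
TRUE_SUCCESS = [0.2, 0.5, 0.8]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample(probs, rng):
    """Draw one strategy index from the policy distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def e_step(critic, counts, logits, n_labeled, rng):
    """Recalibrate the critic on labeled questions (verified rewards).

    The critic is a per-strategy running mean of verified outcomes.
    In this toy the true rates are fixed, so the estimate only sharpens;
    a real critic would have to track a moving policy.
    """
    probs = softmax(logits)
    for _ in range(n_labeled):
        s = sample(probs, rng)
        reward = 1.0 if rng.random() < TRUE_SUCCESS[s] else 0.0
        counts[s] += 1
        critic[s] += (reward - critic[s]) / counts[s]


def m_step(logits, critic, n_unlabeled, rng, lr=1.0):
    """Update the policy on unlabeled questions using critic advantages.

    No ground-truth reward is consulted here: the advantage is the
    critic's value for the sampled strategy minus the policy-weighted
    baseline.
    """
    probs = softmax(logits)
    baseline = sum(p * v for p, v in zip(probs, critic))
    grads = [0.0] * len(logits)
    for _ in range(n_unlabeled):
        s = sample(probs, rng)
        advantage = critic[s] - baseline
        for k in range(len(logits)):
            # d log pi(s) / d logit_k = 1[k == s] - pi(k)
            grads[k] += advantage * ((1.0 if k == s else 0.0) - probs[k])
    for k in range(len(logits)):
        logits[k] += lr * grads[k] / n_unlabeled


def run_toy_em_ttt(rounds=50, n_labeled=32, n_unlabeled=64, seed=0):
    """Alternate E-steps and M-steps; return final policy probs and critic."""
    rng = random.Random(seed)
    k = len(TRUE_SUCCESS)
    logits, critic, counts = [0.0] * k, [0.0] * k, [0] * k
    for _ in range(rounds):
        e_step(critic, counts, logits, n_labeled, rng)   # ground the critic
        m_step(logits, critic, n_unlabeled, rng)         # refine the policy
    return softmax(logits), critic


if __name__ == "__main__":
    probs, critic = run_toy_em_ttt()
    print("policy:", [round(p, 3) for p in probs])
    print("critic:", [round(v, 3) for v in critic])
```

The point of the sketch is the division of labor: only `e_step` touches verified outcomes, while `m_step` relies entirely on the critic. Skipping the `e_step` call would leave the policy training against a critic that is never re-anchored, which is the failure mode the summary attributes to prior methods.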
Why does it matter?
For practitioners working on reasoning model fine-tuning, TEMPO means additional test-time compute keeps paying off rather than plateauing. The AIME 2024 improvements are concrete: OLMO3-7B jumps from 33.0% to 51.1%, Qwen3-14B from 42.3% to 65.8%. Unlike competing approaches, TEMPO achieves these gains without diversity collapse — models continue generating varied solution paths rather than converging on a single pattern.
Who is it for?
ML researchers and engineers fine-tuning reasoning models on domain-specific tasks.
Try it
github.com/QingyangZhang/TEMPO (requires verl and a small labeled calibration set)
Key numbers
- OLMO3-7B AIME 2024: 33.0% → 51.1% (+18.1 pp)
- Qwen3-14B AIME 2024: 42.3% → 65.8% (+23.5 pp)
- Built on: verl (ByteDance)