Shanghai AI Lab / SJTU / Rice · 2026-04-08 · notable

Rethinking Generalization in Reasoning SFT

Top-trending HF paper challenges the 'SFT memorizes, RL generalizes' narrative. Shows cross-domain reasoning gains from SFT are conditional on optimization, data quality, and base-model capability; releases code and datasets.

Plain SFT can generalize across domains in reasoning — once you fix optimization, data, and base-model strength.

Key specs

HF upvotes	306
Released models	33
Training examples (Math-CoT-44k)	44.4k

What is it?

A new Shanghai AI Lab / SJTU / Rice paper (arXiv 2604.06628) that pushes back on the popular claim 'SFT memorizes, RL generalizes' for reasoning models. It is currently the #1 trending paper on Hugging Face, with 306 upvotes. The authors release code, 33 models, and four datasets (Math-CoT-44k with token-level logprobs, Math-CoT-20k, DeepSeek-R1-20k, and Countdown-CoT-20k), so others can rerun the experiments and check the results independently.

How does it work?

The paper runs a conditional analysis over three axes: optimization dynamics, training data, and base-model capability. It finds that cross-domain generalization from long-CoT SFT follows a 'dip-and-recovery' pattern: during training the cross-domain score first drops, then recovers and improves, so short training runs underestimate how well SFT transfers. On the data axis, verified long-CoT traces consistently help, while low-quality solutions hurt. On the capability axis, stronger base models internalize transferable procedural patterns such as backtracking even from a toy arithmetic game (Countdown), while weaker models imitate surface verbosity. The authors also document an asymmetric side effect: reasoning improves while safety behaviour degrades.
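
To make the 'dip-and-recovery' pattern concrete, here is a minimal Python sketch of how one might check a series of checkpoint evaluations for it. This is not the authors' code: the function name `analyze_curve`, the `DipRecovery` record, and the scores are all hypothetical, chosen only to show why stopping early can make SFT look worse than it is.

```python
from typing import NamedTuple, Optional, Sequence


class DipRecovery(NamedTuple):
    dip_index: int                   # checkpoint with the lowest cross-domain score
    dip_depth: float                 # how far that score falls below the baseline (0 if no dip)
    recovered_index: Optional[int]   # first later checkpoint back at or above baseline
    final_gain: float                # last checkpoint score minus baseline


def analyze_curve(scores: Sequence[float], baseline: float) -> DipRecovery:
    """Summarize a cross-domain score curve relative to the pre-SFT baseline.

    `scores` holds one held-out (cross-domain) score per saved checkpoint,
    in training order. `baseline` is the base model's score before SFT.
    """
    if not scores:
        raise ValueError("scores must contain at least one checkpoint")

    # Locate the worst checkpoint and measure the dip below the baseline.
    dip_index = min(range(len(scores)), key=lambda i: scores[i])
    dip_depth = max(0.0, baseline - scores[dip_index])

    # Recovery only makes sense if the curve actually dipped below baseline.
    recovered_index = None
    if dip_depth > 0:
        recovered_index = next(
            (i for i in range(dip_index + 1, len(scores)) if scores[i] >= baseline),
            None,
        )

    return DipRecovery(dip_index, dip_depth, recovered_index, scores[-1] - baseline)


# Hypothetical numbers for illustration only (not taken from the paper).
baseline_score = 40.0
checkpoint_scores = [38.0, 35.5, 39.0, 42.5, 44.0]

full_run = analyze_curve(checkpoint_scores, baseline_score)
short_run = analyze_curve(checkpoint_scores[:2], baseline_score)
print(full_run)
print(short_run)
```

With these made-up scores, the run truncated after two checkpoints ends below the baseline and never recovers, while the full run recovers and finishes above it. That is the sense in which a short training budget would underestimate transfer.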

Why does it matter?

The 'RL is the only path to generalization' framing has shaped post-training pipelines over the past year. This paper argues the premise is partly an optimization artifact: with better data and longer training, plain SFT can match or beat the cross-domain generalization that RL papers credit to RL. That reopens a much cheaper, more stable recipe for building reasoning models, and the open datasets and logprobs make it easy for others to test the claim directly.

Who is it for?

Researchers and practitioners training reasoning models who were about to commit to an RL pipeline.

Try it

github.com/Nebularaid2000/rethink_sft_generalization