Hugging Face · 2026-04-21 · notable

HuggingFace ml-intern — Open-Source Agent That Autonomously Runs the LLM Post-Training Loop

Rating: 3
Author: AI/TLDR

HuggingFace releases ml-intern, an open-source smolagents agent that autonomously runs the LLM post-training loop: literature review, dataset discovery, training, and eval. On PostTrainBench it reached 32% on GPQA with Qwen3-1.7B, versus 22.99% for Claude Code on the same task, in 10 hours on one H100.


Open-source HuggingFace agent that autonomously runs the full LLM post-training loop — from literature review to training and eval.

Key specs

GitHub stars	1.2k
GPQA (PostTrainBench, Qwen3-1.7B)	32%
Claude Code GPQA (same task)	22.99%
Time on single H100	10 hours
Hugging Face Space likes	85

What is it?

ml-intern is an autonomous AI agent built on HuggingFace's smolagents framework that automates end-to-end LLM post-training. Given a model and a goal, it browses arXiv and HuggingFace Papers, discovers and quality-checks datasets on the Hub, launches training jobs, evaluates results, diagnoses failures (such as reward collapse in GRPO), and generates synthetic data — iterating until benchmark scores improve. Released April 21, 2026 by the HuggingFace smolagents team.
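
Reward collapse in GRPO can be made concrete with a small sketch. GRPO scores each sampled completion relative to the other completions for the same prompt, so when every completion in a group earns the same reward, the advantages are all zero and that prompt contributes no learning signal. The Python below is illustrative only and is not taken from ml-intern: the function names, the choice of population standard deviation, the tolerance, and the sample rewards are all assumptions.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    Population std is an assumption here; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def collapsed_fraction(groups, tol=1e-8):
    """Fraction of prompt groups whose rewards are all (nearly) identical."""
    if not groups:
        return 0.0
    flat = sum(1 for g in groups if pstdev(g) <= tol)
    return flat / len(groups)

# Hypothetical rewards: one list per prompt, one entry per sampled completion.
groups = [
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: useful signal
    [0.0, 0.0, 0.0, 0.0],  # every completion failed: no signal
    [1.0, 1.0, 1.0, 1.0],  # every completion passed: no signal
]

mixed_adv = group_advantages(groups[0])
flat_adv = group_advantages(groups[1])
frac = collapsed_fraction(groups)
```

A rising `collapsed_fraction` over training steps is one plausible symptom an agent could watch for before changing the reward function or the data mix; whether ml-intern uses this exact signal is not stated in the source.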

How does it work?

The agent runs a loop: search papers for relevant techniques, select datasets from the Hub, generate and execute training scripts via HuggingFace Jobs, monitor runs via Trackio, then analyze benchmark results and retry with modifications when metrics fall short. In its launch evaluation (PostTrainBench), ml-intern applied GRPO to improve Qwen3-1.7B from roughly 10% to 32% on GPQA in 10 hours on a single H100, surpassing Claude Code, which scored 22.99% on the same task.
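
The control flow of such a loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not ml-intern's code: `Attempt`, `post_training_loop`, `propose`, and `train_and_eval` are hypothetical names, the two callables stand in for the paper search, dataset selection, training, and evaluation steps, and the toy scores are invented for the demo.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Attempt:
    recipe: dict   # whatever the agent decided to try this round
    score: float   # benchmark score obtained with that recipe

def post_training_loop(
    propose: Callable[[List[Attempt]], dict],
    train_and_eval: Callable[[dict], float],
    target: float,
    max_iters: int,
) -> Tuple[Attempt, List[Attempt]]:
    """Propose a recipe, train and evaluate it, keep the best, retry until done."""
    history: List[Attempt] = []
    best = None
    for _ in range(max_iters):
        recipe = propose(history)          # uses past attempts to pick the next change
        score = train_and_eval(recipe)     # stand-in for a full training + eval run
        attempt = Attempt(recipe, score)
        history.append(attempt)
        if best is None or score > best.score:
            best = attempt
        if best.score >= target:           # stop once the goal is met
            break
    return best, history

# Toy stand-ins so the sketch runs; real steps would call out to training jobs.
def toy_propose(history):
    return {"iteration": len(history)}

def toy_train_and_eval(recipe):
    return 0.10 + 0.08 * recipe["iteration"]

best, history = post_training_loop(toy_propose, toy_train_and_eval,
                                   target=0.30, max_iters=10)
```

Passing the full history to `propose` is the design choice that matters here: it lets each retry depend on what already failed, which is what separates an iterating agent from a fixed hyperparameter sweep.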

Why does it matter?

LLM post-training is slow and iterative: teams manually tune data recipes, training scripts, and hyperparameters over days. ml-intern automates this loop on modest hardware; a single H100 is within reach for many teams and individuals. For practitioners fine-tuning smaller models for specialized tasks, that reduces both the cost and the time of each iteration.

Who is it for?

ML researchers and engineers doing LLM post-training and fine-tuning

Try it

github.com/huggingface/ml-intern (requires a HuggingFace account for Jobs and Hub access)