Hugging Face · 2026-06-18 · notable
agent-eval — Hugging Face harness benchmarks coding agents on your own library
Hugging Face shipped agent-eval, an open harness that measures how well coding agents such as Kimi-K2.6 and GLM-5.1 use a library. It tracks token cost, time, and error rate as well as task completion, across bare, clone, and skill access tiers.

Hugging Face's agent-eval scores libraries on how well agents can actually use them, not just on whether the agent eventually completes the task.
Key specs
| Spec | Value |
|---|---|
| Harness | agent-eval |
| Access tiers | bare / clone / skill |
| Models covered | Kimi-K2.6, GLM-5.1, MiniMax-M2.7, Qwen3-4B, Qwen3-14B |
What is it?
agent-eval is an open evaluation harness from Hugging Face for testing how well open coding agents work with a specific library. The blog post argues that 'if it isn't tested, then it doesn't work' should apply to agent usability, not just human usability. It was released on June 18, 2026 by Lysandre, Nathan Habib, Pedro Cuenca, and nine other contributors.
How does it work?
agent-eval runs each candidate model against the target library at three access tiers: 'bare' (raw API), 'clone' (the library copied into context), and 'skill' (a curated skill bundle). For every run the harness records token consumption, wall-clock time, error rates, and behavioural markers like retry loops — not just pass/fail. Reported runs cover Kimi-K2.6, GLM-5.1, MiniMax-M2.7, Qwen3-4B and Qwen3-14B.
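To make the per-run bookkeeping concrete, here is a minimal Python sketch of how such records could be grouped by model and access tier. It is an illustration only, not agent-eval's actual API: `RunRecord`, `summarize`, the field names, and the placeholder model `model-a` are all hypothetical, and the numbers are made up for the example rather than taken from the reported runs.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The three access tiers named in the post.
TIERS = ("bare", "clone", "skill")


@dataclass
class RunRecord:
    """One agent run against the target library (hypothetical schema)."""
    model: str
    tier: str
    passed: bool      # did the task complete?
    tokens: int       # total tokens consumed
    seconds: float    # wall-clock time
    errors: int       # errors raised during the run
    retries: int      # retry-loop iterations observed


def summarize(records):
    """Aggregate runs per (model, tier) into pass rate, cost, and error stats."""
    groups = defaultdict(list)
    for rec in records:
        if rec.tier not in TIERS:
            raise ValueError(f"unknown tier: {rec.tier!r}")
        groups[(rec.model, rec.tier)].append(rec)

    summary = {}
    for key, runs in groups.items():
        summary[key] = {
            "runs": len(runs),
            "pass_rate": mean([1.0 if r.passed else 0.0 for r in runs]),
            "mean_tokens": mean([r.tokens for r in runs]),
            "mean_seconds": mean([r.seconds for r in runs]),
            # Share of runs that hit at least one error.
            "error_rate": mean([1.0 if r.errors > 0 else 0.0 for r in runs]),
            "mean_retries": mean([r.retries for r in runs]),
        }
    return summary


# Made-up records for a placeholder model, purely to exercise the code.
records = [
    RunRecord("model-a", "bare", False, 1000, 10.0, 2, 3),
    RunRecord("model-a", "bare", True, 3000, 30.0, 0, 1),
    RunRecord("model-a", "skill", True, 1000, 8.0, 0, 0),
]

for (model, tier), stats in sorted(summarize(records).items()):
    print(model, tier, stats)
```

Keeping pass rate alongside token, time, and retry figures for each tier is what lets a maintainer see a run that passed but was expensive, which a pass/fail score alone would hide.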
Why does it matter?
Library maintainers can finally answer 'is our API agent-friendly?' with numbers. agent-eval surfaces where docs are missing or where APIs are too clever for agents to invoke; those gaps are the bottleneck for getting Claude Code, Cursor, and other agents to use a stack reliably. The harness is open, so anyone can plug in their own library.
Who is it for?
Library maintainers, agent-tooling developers, and ML engineers benchmarking small open models.
Try it
https://huggingface.co/blog/is-it-agentic-enough