AI/TLDR

Hugging Face · 2026-06-18 · notable

agent-eval — Hugging Face harness benchmarks coding agents on your own library

Hugging Face shipped agent-eval, an open harness that measures how well coding agents like Kimi-K2.6 and GLM-5.1 use a library — not just task completion, but token cost, time, and error rate across bare, clone, and skill access tiers.

Hugging Face's agent-eval scores libraries on how well agents can actually use them, not just on whether the agents complete the task.

Key specs

Harness: agent-eval
Access tiers: bare / clone / skill
Models covered: Kimi-K2.6, GLM-5.1, MiniMax-M2.7, Qwen3-4B, Qwen3-14B

What is it?

agent-eval is an open evaluation harness from Hugging Face for testing how well open coding agents work with a specific library. The blog post argues that 'if it isn't tested, then it doesn't work' should apply to agent usability, not just human usability. It was released on June 18, 2026 by Lysandre, Nathan Habib, Pedro Cuenca and nine other contributors.

How does it work?

agent-eval runs each candidate model against the target library at three access tiers: 'bare' (raw API), 'clone' (the library copied into context), and 'skill' (a curated skill bundle). For every run the harness records token consumption, wall-clock time, error rates, and behavioural markers like retry loops — not just pass/fail. Reported runs cover Kimi-K2.6, GLM-5.1, MiniMax-M2.7, Qwen3-4B and Qwen3-14B.
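To make the per-run bookkeeping concrete, here is a minimal Python sketch of how results like these could be grouped by model and access tier. It is not agent-eval's actual API or schema: the RunRecord fields, the summarize helper, and the placeholder model name are all assumptions made for illustration, and the numbers are dummy values rather than reported results.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


# Hypothetical record shape; agent-eval's real schema may differ.
@dataclass
class RunRecord:
    model: str
    tier: str       # "bare", "clone", or "skill"
    passed: bool    # did the agent complete the task?
    tokens: int     # total tokens consumed in the run
    seconds: float  # wall-clock time
    errors: int     # errors observed during the run
    retries: int    # retry-loop iterations observed


def summarize(records):
    """Group runs by (model, tier) and average each metric."""
    groups = defaultdict(list)
    for record in records:
        groups[(record.model, record.tier)].append(record)

    summary = {}
    for key, runs in groups.items():
        summary[key] = {
            "runs": len(runs),
            "pass_rate": mean(1.0 if r.passed else 0.0 for r in runs),
            "mean_tokens": mean(r.tokens for r in runs),
            "mean_seconds": mean(r.seconds for r in runs),
            "mean_errors": mean(r.errors for r in runs),
            "mean_retries": mean(r.retries for r in runs),
        }
    return summary


if __name__ == "__main__":
    # Dummy values for illustration only, not measured results.
    records = [
        RunRecord("model-a", "bare", False, 1000, 10.0, 2, 1),
        RunRecord("model-a", "bare", True, 3000, 20.0, 0, 0),
        RunRecord("model-a", "skill", True, 500, 5.0, 0, 0),
    ]
    for (model, tier), stats in summarize(records).items():
        print(model, tier, stats)
```

Keeping cost, time, and error counts next to the pass flag is what lets a maintainer see a tier that passes but burns far more tokens or retries than another.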

Why does it matter?

Library maintainers can finally answer 'is our API agent-friendly?' with numbers. agent-eval surfaces where docs are missing or APIs are too clever for agents to invoke, which is the bottleneck for getting Claude Code, Cursor, and other agents to use a stack reliably. The harness is open so anyone can plug in their own library.

Who is it for?

Library maintainers, agent-tooling developers, and ML engineers benchmarking small open models.

Try it

https://huggingface.co/blog/is-it-agentic-enough

Tags

  • agents
  • benchmark
  • evaluation
  • huggingface
  • coding-agents
  • agent-tooling
  • open-source
  • agent-eval
