ByteDance Seed · 2026-07-02 · major

EdgeBench — ByteDance's 134-task long-horizon agent benchmark

EdgeBench is a ByteDance Seed benchmark with 134 real-world tasks that each run 12+ hours to test how AI agents learn from executable environments over long horizons.

ByteDance Seed's new agent benchmark clocks 12+ hours per task to measure how fast models learn on the job.

Quick facts

Maker	ByteDance Seed
Tasks (total)	134
Tasks (public)	51
Task duration	12+ hours each
Human expert time	57.2 h average per task
License	CC BY 4.0 / Apache-2.0 code
Availability	Hugging Face + GitHub + paper

What is it?

EdgeBench is a benchmark of 134 real-world tasks spanning six areas — scientific ML, systems and software engineering, combinatorial optimization, professional knowledge work, formal theorem proving, and interactive games. ByteDance Seed released 51 of the tasks publicly on Hugging Face and GitHub on July 2, 2026, along with a technical paper.

How does it work?

Each task in EdgeBench runs 12 or more hours of continuous agent operation inside an executable environment, and some tasks extend past 72 hours. The benchmark scores the full learning trajectory rather than a single answer, then fits a log-sigmoid curve to model how performance improves with more interaction time. Sample tasks include gravitational-wave detection, RISC-V CPU design, vehicle routing, claim-ring fraud audits, and NetHack.

Why does it matter?

Most agent benchmarks reward single-shot correctness, which hides whether a model can actually improve on a hard task given more time. EdgeBench's long-horizon design and log-sigmoid fit give teams a way to compare how quickly agents learn from real environments. Its headline finding, that this learning speed roughly doubles every three months over the September 2025 to May 2026 window, is a concrete scaling curve for agent learning rather than for raw pretraining.
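To put the doubling rate in perspective, the arithmetic can be worked through in a few lines. This sketch assumes the window runs from the start of September 2025 to the start of May 2026 (8 whole months) and that the three-month doubling time holds constant across it. The helper names `months_between` and `growth_factor` are hypothetical.

```python
from datetime import date


def months_between(start, end):
    """Whole calendar months from start to end (day of month ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def growth_factor(months, doubling_months=3.0):
    """Cumulative multiplier after `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


window = months_between(date(2025, 9, 1), date(2026, 5, 1))
factor = growth_factor(window)
print(f"{window} months -> about {factor:.2f}x")  # prints "8 months -> about 6.35x"
```

Under these assumptions, the reported trend corresponds to roughly a sixfold increase in learning speed over the data window.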

Who is it for?

Agent researchers, evaluation teams, and builders of RL and agentic tooling.

Frequently asked questions

What does EdgeBench measure?

EdgeBench measures how quickly AI agents learn from executable real-world environments over 12+ hours of continuous interaction per task, rather than one-shot correctness. Its 134 tasks span scientific ML problems, systems and software engineering, combinatorial optimization, knowledge work, formal math, and interactive games.

How can I access EdgeBench?

EdgeBench ships 51 of its 134 tasks publicly on Hugging Face and GitHub under an open license, alongside a technical paper at edge-bench.org. Full-benchmark access requires contacting ByteDance Seed directly at zhongshu@bytedance.com.

What is EdgeBench's headline finding?

EdgeBench reports that AI agents' learning speed from real-world environments roughly doubles every three months over the September 2025 to May 2026 data window, with performance following a log-sigmoid scaling law as a function of interaction time.

How does EdgeBench compare to shorter agent benchmarks?

Where SWE-bench or GAIA test single-shot problem solving, EdgeBench pushes horizon length: individual tasks run over 12 hours with some exceeding 72 hours, and human experts average 57.2 hours per task. This makes it one of the longest-horizon publicly released agent benchmarks so far.

Try it

huggingface.co/datasets/ByteDance-Seed/EdgeBench
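
One possible way to fetch the public files is through the `huggingface_hub` client. This is a sketch under assumptions: the repository ID is read off the URL above, the local directory name and the `download_edgebench` helper are hypothetical, and the repository's file layout and any access gating are not described here and may differ.

```python
REPO_ID = "ByteDance-Seed/EdgeBench"  # taken from the dataset URL above


def download_edgebench(local_dir="edgebench"):
    """Download a snapshot of the dataset repository and return its local path.

    The import sits inside the function so the module loads even when
    huggingface_hub is not installed (pip install huggingface_hub).
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the URL is under /datasets/, so not a model repo
        local_dir=local_dir,  # hypothetical target directory
    )
```

Calling `download_edgebench()` requires network access and returns the directory containing the downloaded files. How the task files are organized inside that directory should be checked against the dataset card.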