AI/TLDR

IBM Research · 2026-04-15 · notable

VAKRA — IBM Research's Enterprise Agent Benchmark with 8,000+ Live APIs

IBM Research releases VAKRA: an executable benchmark for enterprise AI agents with 8,000+ locally-hosted APIs across 62 domains. Replay-based evaluation tests API chaining, tool selection, multi-hop reasoning, and policy adherence. Reported results for current frontier models still show substantial headroom.

[Figure: VAKRA benchmark overview diagram showing four capability tiers for enterprise agent evaluation]

IBM Research's benchmark runs AI agents against 8,000+ live enterprise APIs to measure real tool-use and multi-hop reasoning.

Key specs

APIs: 8,000+
Domains: 62
Test instances: 5,187
Capability tiers: 4

What is it?

VAKRA (eValuating API and Knowledge Retrieval Agents) is a benchmark from IBM Research for testing how well AI agents reason and act in enterprise-like environments. It uses over 8,000 locally-hosted APIs backed by real databases across 62 domains, plus domain-aligned document collections — designed for the multi-hop, multi-source task chains that appear in production enterprise deployments.

How does it work?

Evaluation is replay-based: the benchmark records the agent's predicted tool-call trajectory and replays it against a live MCP environment, scoring each step for correctness, groundedness, and policy adherence. Tasks span four capability tiers from simple API chaining (2,077 instances) to multi-hop multi-source reasoning with policy constraints (644 instances requiring 3–7 reasoning steps). The leaderboard is open for public submission.
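
To make the replay idea concrete, here is a minimal sketch in Python. It is not VAKRA's harness: the tool names (get_order, get_customer, delete_order), the in-memory "database", the forbidden-tool policy, and the scoring fields are all hypothetical, chosen only to illustrate replaying a predicted tool-call trajectory against executable tools. It checks execution success, a simple policy rule, and a final-answer match; groundedness scoring is not modeled.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    """One predicted step in an agent's trajectory (hypothetical format)."""
    tool: str
    args: dict[str, Any]


# Hypothetical stand-ins for database-backed APIs.
ORDERS = {"A17": {"customer_id": "C3", "total": 120.0}}
CUSTOMERS = {"C3": {"name": "Dana", "region": "EU"}}


def get_order(order_id: str) -> dict[str, Any]:
    return ORDERS[order_id]


def get_customer(customer_id: str) -> dict[str, Any]:
    return CUSTOMERS[customer_id]


TOOLS: dict[str, Callable[..., Any]] = {
    "get_order": get_order,
    "get_customer": get_customer,
}

# Hypothetical policy: the agent must never call these tools.
FORBIDDEN_TOOLS = {"delete_order"}


def replay(trajectory: list[ToolCall], expected_final: Any) -> dict[str, Any]:
    """Execute each predicted call in order and record a per-step status."""
    steps: list[dict[str, str]] = []
    last_result: Any = None
    completed = True
    for call in trajectory:
        if call.tool in FORBIDDEN_TOOLS:
            status = "policy_violation"
        elif call.tool not in TOOLS:
            status = "unknown_tool"
        else:
            try:
                last_result = TOOLS[call.tool](**call.args)
                status = "ok"
            except (KeyError, TypeError):
                # Bad identifier or wrong argument names.
                status = "execution_error"
        steps.append({"tool": call.tool, "status": status})
        if status != "ok":
            completed = False
            break  # Stop replaying at the first failed step.
    return {
        "steps": steps,
        "correct": completed and last_result == expected_final,
    }


if __name__ == "__main__":
    predicted = [
        ToolCall("get_order", {"order_id": "A17"}),
        ToolCall("get_customer", {"customer_id": "C3"}),
    ]
    print(replay(predicted, {"name": "Dana", "region": "EU"}))
```

The design point the sketch tries to capture is that the trajectory is scored by actually executing it, so a plausible-looking but wrong call (a bad identifier, a missing argument, a disallowed tool) fails at the step where it occurs rather than being judged on surface form.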

Why does it matter?

Most agent benchmarks use static snapshots or simplified mock APIs. VAKRA uses live, executable APIs with real database state, making scores harder to game and closer to actual enterprise conditions. Results from GPT-OSS-120B and Gemini-3-flash-preview show substantial headroom, giving a realistic picture of where production agent reliability stands today.

Who is it for?

ML engineers and researchers evaluating enterprise AI agents

Try it

https://ibm-research-vakra.hf.space/

Tags

  • ibm
  • benchmark
  • enterprise-ai
  • agents
  • tool-use
  • api-chaining
  • evaluation
  • open-source