New AI Datasets — Training & Eval Corpora

New training and evaluation datasets for AI — fresh corpora and benchmarks, with what's inside each one and how to use it.

3 releases tracked

ClawBench — 153 Real-World Browser-Agent Tasks Across 144 Live Websites; Best Model at 33.3%
ClawBench Team · 2026-04-09 · notable
Agents score 70% in sandboxes; ClawBench shows the real-world number is 33%
Nemotron-Personas-Korea — 7M Synthetic Personas from Official Korean Demographics
NVIDIA · 2026-04-21 · notable
NVIDIA's Korean-language persona dataset: 7 million synthetic personas grounded in official government statistics, CC BY 4.0.
OpenSpatial — 3M-sample spatial-intelligence data engine
HKU, Microsoft Research Asia · 2026-04-08 · notable
A 3 million-sample, open-source data engine for 3D spatial reasoning; fine-tuned models show a relative gain of around 19 percent on spatial benchmarks.