New AI Research Papers — arXiv Picks Explained

The AI research papers worth your time — handpicked arXiv and conference work, each summarised in plain English with why it matters.

36 releases tracked

OpenAI Deployment Simulation — predict misbehavior before release
OpenAI · 2026-06-16 · major
OpenAI estimates how a new model will behave in production by replaying real past conversations through it before release.
DreamX-World 1.0 — Alibaba AMAP open-sources an interactive world model
AMAP-ML · 2026-06-15 · notable
Open-source 5B world model that lets you steer the camera, revisit a scene, and stage events across photoreal, game, and stylized worlds.
FastContext — Microsoft's Explore subagent cuts coding-agent tokens by 60%
Microsoft · 2026-06-15 · notable
A small 4B subagent that searches the repo on behalf of a larger coding model, lifting SWE-bench resolution by up to 5.5% while cutting main-agent tokens by 60%.
NVIDIA SpatialClaw — code as the action interface for spatial reasoning agents
NVIDIA Research · 2026-06-12 · notable
NVIDIA framework that lets a spatial-reasoning agent write Python each turn instead of picking from a fixed tool menu.
AgentDoG 1.5 — Shanghai AI Lab Ships a Lightweight Agent-Safety Alignment Framework, Trains 0.8B–8B Guardrails on ~1k Samples That Match GPT-5.4 Accuracy and Cuts Docker Overhead 100x
Shanghai AI Lab (AI45Lab) · 2026-05-28 · notable
Shanghai AI Lab's open AgentDoG 1.5 guardrails catch unsafe agent actions in real time, trained on about a thousand samples and matching GPT-5.4-class safety classifiers.
ShengShu Open-Sources minWM — Full-Stack Pipeline Turns Wan2.1 and HunyuanVideo 1.5 Into Real-Time Camera-Controllable World Models With 224× First-Frame Latency Speedup
ShengShu · 2026-05-28 · notable
An end-to-end recipe that distills heavyweight video diffusion models into few-step, camera-controllable world models that respond fast enough for live interaction.
NVIDIA's Gamma-World — Generative Multi-Agent World Model Scales Beyond Two Players With Simplex Rotary Agent Encoding and Sparse Hub Attention, Streams Shared Worlds at 24 FPS
NVIDIA · 2026-05-27 · notable
NVIDIA's Gamma-World scales generative world models past the two-player ceiling with linear-cost agent attention and 24 FPS real-time rollouts.
Google DeepMind Co-Scientist Lands in Nature — Multi-Agent Gemini System Generates, Debates, and Evolves Scientific Hypotheses With 100+ Research Partners
Google DeepMind · 2026-05-19 · major
Seven Gemini agents argue, rank, and refine scientific hypotheses — now a Nature paper and a Google Labs tool.
ByteDance Lance — 3B Unified Multimodal Model Handles Image and Video Generation, Editing, and Understanding in One Stack
ByteDance Research · 2026-05-18 · notable
One 3B-active model that generates, edits, and understands both images and video — trained from scratch on 128 A100s.
LongLive-2.0 — NVIDIA's NVFP4 Parallel Infrastructure Generates Minute-Long Interactive Video at 45.7 FPS
NVIDIA · 2026-05-18 · notable
A 4-bit training and inference stack that makes minute-long interactive video generation fast enough for real time.
NVIDIA AnyFlow — Any-Step Video Diffusion Distillation With Flow-Map Transition Learning Beats Consistency Methods at 1.3B and 14B Scale
NVIDIA · 2026-05-13 · notable
Train one video diffusion student that runs well at 1, 4, or 32 steps — no separate model per step budget.
AsymFlow — Stanford's Rank-Asymmetric Velocity Parameterization Hits 1.57 FID on ImageNet 256 in Pixel Space, Beats Latent FLUX.2 Klein Base on Text-to-Image
Stanford University · 2026-05-13 · notable
Stanford team trains a pixel-space flow model that beats latent FLUX.2 by predicting noise in a low-rank subspace.
SenseNova-U1 — 8B Dense and 30B-A3B MoE Native Unified Multimodal Models With NEO-Unify Pixel-Space Backbone, Apache 2.0
SenseTime · 2026-05-12 · notable
An open native-unified multimodal model that ditches the vision encoder and VAE: one transformer reads and draws pixels end-to-end.
Natural Language Autoencoders — Anthropic's Method to Verbalize Claude's Activations into Plain Text
Anthropic · 2026-05-07 · major
Train one model to describe Claude's hidden activations in English, and a second to recover the activation from the description.
RecursiveMAS — Stanford/MIT/NVIDIA Multi-Agent System Cuts Tokens 75% With 8.3% Accuracy Gain
Stanford University · 2026-04-28 · major
Recursive computation, now for teams of AI agents.
GenericAgent — Fudan's Token-Efficient Self-Evolving LLM Agent With 9k Stars Uses 6× Fewer Tokens
Fudan University · 2026-04-18 · major
An agent that grows smarter every time it solves a task.
HY-World 2.0 — Tencent Hunyuan Open-Sources Multi-Modal 3D World Model With 1,770 HF Upvotes
Tencent · 2026-04-15 · major
Text or a single photo → a navigable 3D world you can walk through.
ARIS — SJTU's Open Research Harness Hits 8.1k Stars With Cross-Model Adversarial Review
Shanghai Jiao Tong University · 2026-05-04 · major
Open research harness that pits one model against another to catch unsupported claims before they ship.
GLM-5V-Turbo Paper Drops — Z.ai's Native Multimodal Foundation Model with CogViT and MTP
Z.ai & Tsinghua University · 2026-04-29 · major
Z.ai drops the technical report for GLM-5V-Turbo: native multimodal foundation model with a fresh vision encoder and MTP decoder.
DeepMind AI Co-Clinician — Talker/Planner Dual-Agent Hits Zero Critical Errors on 97 of 98 Primary Care Queries
Google DeepMind · 2026-04-30 · major
DeepMind's clinical AI uses a Planner agent to keep a Talker agent inside safe medical boundaries during live audio/video patient consultations.
Step-Audio-R1.5 — RLHF-Trained Audio Reasoner Closes the 'Verifiable Reward Trap' Gap
StepFun · 2026-04-28 · notable
An audio reasoner that sounds human in long conversations — RLHF rescues prosody and emotion that RLVR was quietly killing.
Vista4D: Re-Render Any Video from a New Camera Angle (CVPR 2026 Highlight)
Eyeline Labs / Netflix · 2026-04-23 · notable
Give any monocular video a 4D point-cloud scaffold, then re-render the scene from any camera angle you choose.
LLaTiSA — Difficulty-Stratified Time Series Reasoning for VLMs
Independent Researchers / ACL 2026 · 2026-04-19 · notable
VLMs can now reason about time series at four levels of difficulty — LLaTiSA beats GPT-4o on basic pattern localization with far less training data.
There Will Be a Scientific Theory of Deep Learning — 223 HN Points
UC Berkeley / Harvard / NYU / Stanford / Flatiron Institute · 2026-04-23 · notable
14 ML researchers argue that a real scientific theory of deep learning — 'learning mechanics' — is now close enough to be worth naming.
Alignment Faking Found at 37% in 7B Models — VLAF Cuts It by Up to 94%
University of Illinois Urbana-Champaign · 2026-04-22 · notable
Alignment faking (behaving aligned when monitored, reverting when not) occurs at a 37% rate in 7B models. A single steering vector cuts it by 85–94%.
LLaDA2.0-Uni — Unified Discrete Diffusion LLM for Multimodal Understanding and Generation
Inclusion AI · 2026-04-22 · notable
One 16B discrete diffusion model that both understands and generates images — no separate encoder or decoder head.
TEMPO — EM-Based Test-Time Training That Scales Past the Plateau for Reasoning Models
Tongyi Lab / Tianjin University · 2026-04-21 · notable
TEMPO fixes the TTT plateau: an EM loop recalibrates the critic so the reward signal stays grounded as the policy improves.
EMF — First Text-Conditioned One-Step Image Generation, Matching 30-Step Quality in 4 Steps
AMAP-ML (Alibaba) · 2026-04-20 · notable
EMF is the first text-conditioned one-step image generator — matching 30-step diffusion quality in just 4 denoising passes.
DCW — Wavelet-Domain Differential Correction Fixes a Fundamental SNR Bias in Diffusion Models
Alibaba AMAP · 2026-04-17 · notable
DCW fixes a silent training-inference SNR mismatch in diffusion models, improving generation quality across FLUX, EDM, IDDPM, and six other architectures.
π0.7 — Physical Intelligence's Generalist Robot Brain with Compositional Generalization
Physical Intelligence · 2026-04-16 · notable
A generalist robot brain that combines skills across tasks to perform actions it was never explicitly trained on.
Seedance 2.0 — ByteDance's Unified Audio-Video Generation Model
ByteDance Seed · 2026-04-15 · notable
ByteDance's video model natively generates synchronized audio and video together in one forward pass — no separate audio pipeline.
SpatialEvo — Self-Evolving 3D Spatial Reasoning with Deterministic Geometric Environments
Zhejiang University / StepFun · 2026-04-15 · notable
3D spatial reasoning that self-improves using geometry as its own reward signal — no annotations required.
RAGEN-2: Reasoning Collapse in Agentic RL
Manling Li et al. · 2026-04-07 · notable
A diagnosis and a fix for a failure mode in RL-trained agents: models that look diverse but are actually copy-pasting a template.
Rethinking Generalization in Reasoning SFT
Shanghai AI Lab / SJTU / Rice · 2026-04-08 · notable
Plain SFT can generalize across domains in reasoning — once you fix optimization, data, and base-model strength.
OpenVLThinkerV2 — generalist multimodal reasoning with Gaussian GRPO
UCLA NLP · 2026-04-09 · notable
A generalist multimodal reasoning model trained with a non-linear GRPO variant — an open-source baseline that beats GPT-4o on MMMU and MathVista.
SkillClaw: Let Skills Evolve Collectively
Ziyu Ma et al. · 2026-04-09 · notable
A cross-user, cross-session evolver that turns everyone's agent runs into a steadily-improving shared skill library.
