New AI Research Papers — arXiv Picks Explained
The AI research papers worth your time — handpicked arXiv and conference work, each summarised in plain English with why it matters.
36 releases tracked
- OpenAI Deployment Simulation — predict misbehavior before release
OpenAI estimates how a new model will behave in production by replaying real past conversations through it before release.
- DreamX-World 1.0 — Alibaba AMAP open-sources an interactive world model
Open-source 5B world model that lets you steer the camera, revisit a scene, and stage events across photoreal, game, and stylized worlds.
- FastContext — Microsoft's Explore subagent cuts coding-agent tokens by 60%
Small 4B subagent that searches the repo for a bigger coding model, lifting SWE-bench resolution by up to 5.5% while cutting main-agent tokens by 60%.
- NVIDIA SpatialClaw — code as the action interface for spatial reasoning agents
NVIDIA framework that lets a spatial-reasoning agent write Python each turn instead of picking from a fixed tool menu.
- AgentDoG 1.5 — Shanghai AI Lab Ships a Lightweight Agent-Safety Alignment Framework, Trains 0.8B–8B Guardrails on ~1k Samples That Match GPT-5.4 Accuracy and Cuts Docker Overhead 100x
Shanghai AI Lab's open AgentDoG 1.5 guardrails catch unsafe agent actions in real time; they are trained on about a thousand samples and match GPT-5.4-class safety classifiers.
- ShengShu Open-Sources minWM — Full-Stack Pipeline Turns Wan2.1 and HunyuanVideo 1.5 Into Real-Time Camera-Controllable World Models With 224× First-Frame Latency Speedup
An end-to-end recipe that distills heavyweight video diffusion models into few-step, camera-controllable world models that respond fast enough for live interaction.
- NVIDIA's Gamma-World — Generative Multi-Agent World Model Scales Beyond Two Players With Simplex Rotary Agent Encoding and Sparse Hub Attention, Streams Shared Worlds at 24 FPS
NVIDIA's Gamma-World scales generative world models past the two-player ceiling with linear-cost agent attention and 24 FPS real-time rollouts.
- Google DeepMind Co-Scientist Lands in Nature — Multi-Agent Gemini System Generates, Debates, and Evolves Scientific Hypotheses With 100+ Research Partners
Seven Gemini agents argue, rank, and refine scientific hypotheses — now a Nature paper and a Google Labs tool.
- ByteDance Lance — 3B Unified Multimodal Model Handles Image and Video Generation, Editing, and Understanding in One Stack
One 3B-active model that generates, edits, and understands both images and video — trained from scratch on 128 A100s.
- LongLive-2.0 — NVIDIA's NVFP4 Parallel Infrastructure Generates Minute-Long Interactive Video at 45.7 FPS
A 4-bit training and inference stack that makes minute-long interactive video generation fast enough for real time.
- NVIDIA AnyFlow — Any-Step Video Diffusion Distillation With Flow-Map Transition Learning Beats Consistency Methods at 1.3B and 14B Scale
Train one video diffusion student that runs well at 1, 4, or 32 steps — no separate model per step budget.
- AsymFlow — Stanford's Rank-Asymmetric Velocity Parameterization Hits 1.57 FID on ImageNet 256 in Pixel Space, Beats Latent FLUX.2 Klein Base on Text-to-Image
Stanford team trains a pixel-space flow model that beats latent FLUX.2 by predicting noise in a low-rank subspace.
- SenseNova-U1 — 8B Dense and 30B-A3B MoE Native Unified Multimodal Models With NEO-Unify Pixel-Space Backbone, Apache 2.0
An open native-unified multimodal model that ditches the vision encoder and VAE: one transformer reads and draws pixels end-to-end.
- Natural Language Autoencoders — Anthropic's Method to Verbalize Claude's Activations into Plain Text
Train one model to describe Claude's hidden activations in English, then train a second to recover the activation from the description.
- RecursiveMAS — Stanford/MIT/NVIDIA Multi-Agent System Cuts Tokens 75% With 8.3% Accuracy Gain
Recursive computation, now for teams of AI agents.
- GenericAgent — Fudan's Token-Efficient Self-Evolving LLM Agent With 9k Stars Uses 6× Fewer Tokens
An agent that grows smarter every time it solves a task.
- HY-World 2.0 — Tencent Hunyuan Open-Sources Multi-Modal 3D World Model With 1,770 HF Upvotes
Text or a single photo → a navigable 3D world you can walk through.
- ARIS — SJTU's Open Research Harness Hits 8.1k Stars With Cross-Model Adversarial Review
Open research harness that pits one model against another to catch unsupported claims before they ship.
- GLM-5V-Turbo Paper Drops — Z.ai's Native Multimodal Foundation Model with CogViT and MTP
Z.ai drops the technical report for GLM-5V-Turbo: native multimodal foundation model with a fresh vision encoder and MTP decoder.
- DeepMind AI Co-Clinician — Talker/Planner Dual-Agent Hits Zero Critical Errors on 97 of 98 Primary Care Queries
DeepMind's clinical AI uses a Planner agent to keep a Talker agent inside safe medical boundaries during live audio/video patient consultations.
- Step-Audio-R1.5 — RLHF-Trained Audio Reasoner Closes the 'Verifiable Reward Trap' Gap
An audio reasoner that sounds human in long conversations — RLHF rescues prosody and emotion that RLVR was quietly killing.
- Vista4D: Re-Render Any Video from a New Camera Angle (CVPR 2026 Highlight)
Give any monocular video a 4D point-cloud scaffold, then re-render the scene from any camera angle you choose.
- LLaTiSA — Difficulty-Stratified Time Series Reasoning for VLMs
VLMs can now reason about time series at four levels of difficulty — LLaTiSA beats GPT-4o on basic pattern localization with far less training data.
- There Will Be a Scientific Theory of Deep Learning — 223 HN Points
14 ML researchers argue that a real scientific theory of deep learning — 'learning mechanics' — is now close enough to be worth naming.
- Alignment Faking Found at 37% in 7B Models — VLAF Cuts It by Up to 94%
Alignment faking (behaving aligned when monitored, reverting when not) occurs at a 37% rate in 7B models. A single steering vector cuts it by 85–94%.
- LLaDA2.0-Uni — Unified Discrete Diffusion LLM for Multimodal Understanding and Generation
One 16B discrete diffusion model that both understands and generates images — no separate encoder or decoder head.
- TEMPO — EM-Based Test-Time Training That Scales Past the Plateau for Reasoning Models
TEMPO fixes the TTT plateau: an EM loop recalibrates the critic so the reward signal stays grounded as the policy improves.
- EMF — First Text-Conditioned One-Step Image Generation, Matching 30-Step Quality in 4 Steps
EMF is billed as the first text-conditioned image generator that can run in one step, and it matches 30-step diffusion quality in 4 denoising passes.
- DCW — Wavelet-Domain Differential Correction Fixes a Fundamental SNR Bias in Diffusion Models
DCW fixes a silent training-inference SNR mismatch in diffusion models, improving generation quality across FLUX, EDM, IDDPM, and six other architectures.
- π0.7 — Physical Intelligence's Generalist Robot Brain with Compositional Generalization
A generalist robot brain that combines skills across tasks to perform actions it was never explicitly trained on.
- Seedance 2.0 — ByteDance's Unified Audio-Video Generation Model
ByteDance's video model natively generates synchronized audio and video together in one forward pass — no separate audio pipeline.
- SpatialEvo — Self-Evolving 3D Spatial Reasoning with Deterministic Geometric Environments
3D spatial reasoning that self-improves using geometry as its own reward signal — no annotations required.
- RAGEN-2: Reasoning Collapse in Agentic RL
A diagnosis and a fix for a failure mode in RL-trained agents: models that look diverse but are actually copy-pasting a template.
- Rethinking Generalization in Reasoning SFT
Plain SFT can generalize across domains in reasoning — once you fix optimization, data, and base-model strength.
- OpenVLThinkerV2 — generalist multimodal reasoning with Gaussian GRPO
A generalist multimodal reasoning model trained with a non-linear GRPO variant — an open-source baseline that beats GPT-4o on MMMU and MathVista.
- SkillClaw: Let Skills Evolve Collectively
A cross-user, cross-session evolver that turns everyone's agent runs into a steadily-improving shared skill library.