AI/TLDR — every new AI model, tool, repo & paper
The latest AI releases, refreshed every 2 hours and explained in plain English.
What AI shipped today?
In the last 24 hours, AI/TLDR tracked 14 new AI releases, including David Rosenthal's 'AI's Affordability Crisis' (the 70x subsidy that can't hold), Latent Space's 'Red-Teaming after Mythos' with Gray Swan on AI security, and Claude Tag, Anthropic's @Claude Slack agent for shared teamwork. AI/TLDR is an AI release tracker that follows new AI models, open-source tools, papers, datasets, and benchmarks, refreshed every 2 hours from verified primary sources and explained in plain English.
- David Rosenthal: 'AI's Affordability Crisis' — the 70x subsidy that can't hold
David Rosenthal pulls together SemiAnalysis and Ed Zitron numbers to argue that AI tokens are sold far below cost (serving costs run up to 40x the price at Anthropic and up to 70x at OpenAI) and that billing at real cost would turn a $200 ChatGPT plan into a $14,000 bill.
- Latent Space: 'Red-Teaming after Mythos' — Gray Swan on AI security
Latent Space hosts Zico Kolter (OpenAI board, CMU) and Matt Fredrikson (Gray Swan CEO), who argue that AI security is not 'cybersecurity with AI' and say Gray Swan's Shade red-teaming model now beats human attackers at breaking frontier LLMs.
- Claude Tag — Anthropic's @Claude Slack agent for shared teamwork
Claude Tag is a Slack agent from Anthropic that anyone in a channel summons by tagging @Claude. The shared bot breaks a request into stages, runs the work with tools an admin scoped per channel, and posts results back to the thread.
- Fireship: 'Midjourney wants to delete 30% of all death…'
Fireship reacts to Midjourney Medical's pitch — a 60-second full-body ultrasonic CT scan planned for a San Francisco spa — and the company's bold claim that AI-driven early diagnostics could prevent a large share of premature deaths.
- Simon Willison: 'Prompt Injection as Role Confusion'
Simon Willison highlights a new paper by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell arguing prompt injection is really 'role confusion' — language models lean on style cues, not content, to tell trusted text from user input.
- Armin Ronacher: 'The Coming Loop' — why even skeptics end up looping
Armin Ronacher argues 'harness loops' — outer systems that re-run AI agents past their natural stopping point — work well for code porting and benchmark runs, but breed defensive, dependency-creating code when pointed at real codebases.
- Mistral OCR 4 — 170-language document model with bounding boxes and confidence scores
Mistral OCR 4 extracts text plus per-block bounding boxes, type labels, and confidence scores across 170 languages, scoring 85.20 on OlmOCRBench and 93.07 on OmniDocBench at $4 per 1,000 pages.
- Wes Roth: 'Cursor JUST beat EVERYONE…'
Wes Roth's new video argues Cursor has pulled ahead of rival AI coding agents, walking through the Cursor Compile 26 opening keynote and the Composer 2.5 in-house coding model.
- Baidu Unlimited-OCR — 3B vision model parses long documents in one pass
Baidu's Unlimited-OCR is a 3B vision-language model that introduces Reference Sliding Window Attention to keep the KV cache at a constant size, letting one forward pass transcribe dozens of document pages within a 32K context. Code and weights ship under the MIT license.
- PP-OCRv6 — PaddlePaddle ships 50-language OCR family from 1.5M to 34.5M params
PP-OCRv6 is the next PaddleOCR family with Tiny (1.5M), Small (7.7M), and Medium (34.5M) tiers covering 50 languages. The Medium tier lifts detection Hmean to 86.2% and recognition accuracy to 83.2%, gains of 4.6 and 5.1 points over PP-OCRv5_server.
- Oak — version control built for AI coding agents
Oak is a new version control system for AI coding agents that mounts repos lazily, runs branch-per-task, and benchmarks up to 95% faster than Git on snapshots, large binaries, and dirty trees.
- Anthropic-Cybersecurity-Skills v1.3.0 — 817 security skills across 6 frameworks
Mahipal Jangra's open Anthropic-Cybersecurity-Skills library jumps from 762 to 817 agent skills in v1.3.0, adding AI Security, Supply Chain, and Hardware/Firmware domains plus MITRE F3 as a sixth framework mapping.
- Simon Willison — porting Moebius image inpainting to the browser via Claude Code
Simon Willison shows how he used Claude Opus 4.8 to port the 0.22B Moebius image-inpainting model from PyTorch/CUDA to a browser-only WebGPU + ONNX demo, with the agent doing the framework conversion, weight upload, and UI work.
- Claude Code 2.1.186 — MCP login CLI plus auto-reply to bash commands
Claude Code v2.1.186 adds claude mcp login/logout for CLI-based MCP server auth, makes ! bash commands auto-prompt Claude to respond, and fixes 20+ background-agent and post-sleep streaming bugs.
- OpenAI Codex — SSD-burning SQLite log bug patched after 640 TB/year reports
OpenAI Codex CLI shipped two patches that cut about 85% of its SQLite log writes. Users had measured 37 TB written in 21 days, on track for 640 TB a year and full-drive SSD wear in months.
- Two Minute Papers: 'DeepSeek Just Solved AI's Billion Dollar Problem'
Two Minute Papers walks through the DualPath paper, which attacks the KV-cache I/O bottleneck behind agentic LLM serving costs and reports up to 1.96x higher online throughput.
- Sakana Fugu — multi-agent orchestration model that matches Fable 5 on quality
Sakana AI launched Fugu and Fugu Ultra, multi-agent orchestration models delivered through one OpenAI-compatible API. Fugu Ultra coordinates a pool of expert agents and is reported to match Fable 5 on coding, reasoning, science, and agentic benchmarks.
- Hermes Agent v0.17.0 — iMessage, WhatsApp, and async subagents from Nous
Hermes Agent v0.17.0 'The Reach Release' adds iMessage support via Photon Spectrum (no Mac relay), an official WhatsApp Business Cloud adapter, Raft agent-network integration, and background subagents that return handles. The release spans 1,475 commits from 245 contributors.
- Palmier Pro v0.3.5 — open-source macOS video editor for AI agents
Palmier Pro, the Swift-native macOS video editor that exposes every timeline action to AI agents over MCP, adds transcript-based cutting, ripple-insert trims with linked audio, folder imports, and a Claude Opus 4.8 upgrade in v0.3.5.
- Cloudflare Temporary Accounts — AI agents deploy live Workers in seconds, no signup
Cloudflare's new wrangler deploy --temporary command lets an AI agent provision a live Workers account in seconds without signup. The account stays usable for 60 minutes, after which a human can claim it permanently or let it auto-expire.
- Grok on Databricks — xAI models land in Agent Bricks via SpaceX deal
Grok on Databricks makes Grok 4.3 and Grok Build 0.1 natively callable from Databricks Agent Bricks, so enterprise teams can wire xAI models into governed Lakehouse data without external pipelines.
- John Jumper to Anthropic — Nobel laureate AlphaFold creator leaves DeepMind
John Jumper, the 2024 Nobel chemistry laureate and AlphaFold co-creator, posted on X that he is leaving Google DeepMind after nearly nine years to join Anthropic. He is the second senior Google AI departure this week after Noam Shazeer.
- Nathan Lambert: 'Banning Open Source AI Would Be A Mistake'
Nathan Lambert and Kevin Xu argue banning open-source AI would hurt US security, education, and competition. Their Interconnects post responds to an executive order to review AI models and a block on foreign access to Anthropic models.
- Moebius — 0.22B image inpainting matches FLUX.1-Fill-Dev's 11.9B
Moebius is a 0.22B image inpainting model from HUST and VIVO AI Lab that matches the 11.9B FLUX.1-Fill-Dev across six benchmarks while running over 15x faster, with code and weights now on GitHub.
- agent-eval — Hugging Face harness benchmarks coding agents on your own library
Hugging Face shipped agent-eval, an open harness that measures how well coding agents like Kimi-K2.6 and GLM-5.1 use a library — not just task completion, but token cost, time, and error rate across bare, clone, and skill access tiers.
- Beyond LoRA — Hugging Face benchmark shows OFT and BEFT can beat the default
Hugging Face benchmarked five PEFT methods on identical tasks and found OFT beats LoRA on image fine-tuning while BEFT and Lily trade memory for accuracy on math. PEFT also gained an adapter-to-LoRA converter for vLLM compatibility.
- Cursor 3.8 — /automate skill plus new GitHub and Slack automation triggers
Cursor 3.8 launches Cursor Automations: describe a task in plain English with /automate and it runs on its own when a GitHub event or Slack emoji react fires. Adds five GitHub triggers; cloud agents get computer use by default.
- Noam Shazeer to OpenAI — Gemini co-lead becomes Lead for Architecture Research
Noam Shazeer, Google VP and Gemini co-lead, is leaving Google to join OpenAI as Lead for Architecture Research, two years after Google paid $2.7B to bring him back from Character.AI.
- In the Weights — ex-OpenAI tool scores whether AI models remember your name
Joey Flynn and Thomas Dimson, both ex-OpenAI, launched a free site that types a name into multiple LLMs in parallel and returns a 0–996 'strength score' for how well models recall the person from training data alone.
- DeepMind AI Control Roadmap — defense-in-depth for misaligned AI agents
Google DeepMind publishes the AI Control Roadmap: a defense-in-depth framework that treats internal AI agents as insider threats, with trusted supervisor models monitoring their actions in real time.
- Two Minute Papers: 'Scientists Found A Better Language For AI Agents'
Two Minute Papers covers RecursiveMAS, a UIUC/Stanford/NVIDIA/MIT framework that lets multi-agent systems trade latent thoughts instead of text and reports 2.4x faster inference with 75.6% fewer tokens.
- MolmoMotion — Ai2's language-guided 3D motion forecasting models
Ai2 released MolmoMotion, two open models that predict where points on objects will move in 3D space from a video frame plus a text instruction. The drop bundles a 1.16M-video training set and the PointMotionBench eval, and lifts a robot pick-and-place baseline from 56.0% to 76.3%.
- MosaicLeaks — ServiceNow benchmark for research-agent privacy leaks
ServiceNow released MosaicLeaks, a 1,001-chain benchmark that measures how much private context a research agent leaks into its web queries, plus PA-DR, an RL recipe that drops leakage from 51.7% to 9.9% on Qwen3-4B with no loss of task success.
- Agentic Resource Discovery — HF, Microsoft, Google open spec for tool lookup
Hugging Face, Microsoft, Google, and GoDaddy launched Agentic Resource Discovery, an open spec that lets agents find MCP servers, skills, and A2A endpoints at runtime. ARD defines an ai-catalog.json manifest and a POST /search registry API, with a reference Discover Tool on HF.
- Sam Witteveen: VibeThinker 3B — taking on giant models
Sam Witteveen reviews Weibo's VibeThinker-3B, the new 3B-parameter reasoning model that scores 80.2% on LiveCodeBench v6. Witteveen walks through how a 3B open-weights model competes with much larger frontier models on code and math reasoning.
- MCP Enterprise-Managed Authorization — zero-touch OAuth for Claude, VS Code, Linear
Enterprise-Managed Authorization, an MCP extension, is now stable. Admins provision MCP servers once through Okta and users get every connector on first login with no per-app OAuth. Claude, VS Code, Linear, Figma, Asana, Atlassian and Supabase ship support; Ramp is live with 2,000 users.
- OpenAI Codex Record & Replay — demonstrate a macOS workflow, get a reusable skill
OpenAI Codex App 26.616 adds Record & Replay on macOS. Run a workflow once and Codex saves it as an editable, reusable skill that replays through Computer Use, browser tools, and plugins. Requires Computer Use; not in the EEA, UK, or Switzerland at launch.
- Simon Willison: GLM-5.2 is probably the most powerful text-only open weights LLM
Simon Willison calls Z.ai's GLM-5.2 today's strongest open-weights text LLM: it tops Artificial Analysis Intelligence Index v4.1 with a score of 51, ranks second on Code Arena WebDev behind Claude Fable 5, and costs about $1.40/$4.40 per 1M input/output tokens on OpenRouter versus $5/$30 for GPT-5.5.
- Wes Roth: Google's 'POST AGI' paper — DeepMind's AGI-to-ASI roadmap
Wes Roth breaks down 'From AGI to ASI', a Google DeepMind paper co-signed by Shane Legg, Marcus Hutter and Thore Graepel that maps four routes from human-level AI to superintelligence.
- LifeSciBench — OpenAI's 750-task benchmark for life-science research
OpenAI's LifeSciBench grades AI models on 750 expert-authored life-science tasks using rubrics. The strongest model, GPT-Rosalind, passes only 36.1%, with attached data files cited as the main bottleneck.
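The subsidy figures in the David Rosenthal item above follow from simple multiplication. The sketch below assumes each multiple is a cost-to-price ratio applied to the whole plan price, which is how $200 × 70 yields the quoted $14,000; the function name is illustrative, not from the post.

```python
# Back-of-envelope check of the subsidy figures quoted for Rosenthal's post.
# Assumption: "up to 70x" means serving cost is 70 times the price charged.

def implied_cost(price_usd: float, subsidy_multiple: float) -> float:
    """Return the bill if usage were charged at the implied full cost."""
    return price_usd * subsidy_multiple

chatgpt_plan = 200.0  # monthly ChatGPT plan price cited in the summary
print(implied_cost(chatgpt_plan, 70))  # 14000.0 (OpenAI's upper multiple)
print(implied_cost(chatgpt_plan, 40))  # 8000.0 (same $200 at the 40x multiple cited for Anthropic)
```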
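The 'harness loop' in the Armin Ronacher item above is an outer system that keeps re-invoking an agent after it would normally stop. A minimal generic sketch, assuming the harness owns a completion check and a round budget; `harness_loop`, `agent_step`, and `is_done` are hypothetical names, and this is not Ronacher's code.

```python
from typing import Callable

def harness_loop(agent_step: Callable[[str], str], is_done: Callable[[str], bool],
                 task: str, max_rounds: int = 5) -> tuple[str, int]:
    """Re-invoke an agent on its own output until a check passes or the budget runs out."""
    state = task
    for round_no in range(1, max_rounds + 1):
        state = agent_step(state)   # the agent may consider itself finished here...
        if is_done(state):          # ...but the outer harness decides when to stop
            return state, round_no
    return state, max_rounds

# Hypothetical toy agent: each round it appends one unit of "work".
def toy_agent(state: str) -> str:
    return state + "+"

result, rounds = harness_loop(toy_agent, lambda s: s.count("+") >= 3, "port")
print(result, rounds)  # port+++ 3
```

Putting the stopping decision in the harness rather than the agent is what lets the loop run past the agent's natural stopping point; the round budget is the only brake.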
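The Baidu Unlimited-OCR item above credits its Reference Sliding Window Attention with keeping the KV cache at a constant size. The summary does not describe RSWA's details, so the sketch below shows only plain sliding-window attention, a simpler technique that also bounds the cache: each new token attends to at most the last `window` key/value pairs. The class and its parameters are hypothetical.

```python
import math
from collections import deque

class SlidingWindowKVCache:
    """Plain sliding-window KV cache (not Baidu's RSWA): memory is bounded by `window`."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)    # oldest entries are evicted automatically
        self.values = deque(maxlen=window)

    def append(self, key: list[float], value: list[float]) -> None:
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query: list[float]) -> list[float]:
        """Scaled dot-product softmax attention over the cached window only."""
        scale = 1.0 / math.sqrt(len(query))
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in self.keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [sum(w * v[i] for w, v in zip(weights, self.values)) / total for i in range(dim)]

cache = SlidingWindowKVCache(window=4)
for t in range(1000):  # stream 1,000 tokens through the cache
    vec = [math.sin(t), math.cos(t)]
    cache.append(vec, vec)
    out = cache.attend(vec)

print(len(cache.keys))  # 4: cache size depends on the window, not on the 1,000 tokens seen
```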
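The OpenAI Codex item above extrapolates 37 TB written in 21 days to roughly 640 TB a year. The arithmetic below reproduces that (about 643 TB/year) and applies the reported 85% cut. The 600 TBW endurance rating is a hypothetical example value, not from the source, used only to show how "full-drive SSD wear in months" follows.

```python
def annualized_tb(tb_written: float, days: float) -> float:
    """Extrapolate a measured write volume to a 365-day year."""
    return tb_written / days * 365

def days_to_endurance(tbw_rating: float, tb_per_day: float) -> float:
    """Days until a drive's rated terabytes-written (TBW) budget is used up."""
    return tbw_rating / tb_per_day

measured_tb, measured_days = 37.0, 21.0
daily = measured_tb / measured_days
yearly = annualized_tb(measured_tb, measured_days)
after_patch = yearly * (1 - 0.85)  # the patches cut roughly 85% of log writes

HYPOTHETICAL_TBW = 600.0  # illustrative endurance rating, not a figure from the source

print(f"{yearly:.0f} TB/year before, {after_patch:.0f} TB/year after")
# 643 TB/year before, 96 TB/year after
print(f"{days_to_endurance(HYPOTHETICAL_TBW, daily):.0f} days to exhaust a 600 TBW drive")
# 341 days to exhaust a 600 TBW drive
```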
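The DeepMind AI Control Roadmap item above describes trusted supervisor models monitoring agent actions in real time. The sketch below shows the generic trusted-monitoring pattern under stated assumptions: a monitor returns a suspicion score between 0 and 1, and thresholds route each proposed action to allow, human review, or block. `Action`, `make_gate`, `toy_monitor`, and the thresholds are all hypothetical; the keyword-matching monitor merely stands in for a trusted model and is not DeepMind's design.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    tool: str
    argument: str

def make_gate(monitor: Callable[[Action], float], block_at: float, review_at: float):
    """Route each proposed action by the trusted monitor's suspicion score (0..1)."""
    def gate(action: Action) -> str:
        score = monitor(action)
        if score >= block_at:
            return "block"
        if score >= review_at:
            return "human_review"
        return "allow"
    return gate

# Hypothetical stand-in for a trusted supervisor model: flags a few risky patterns.
def toy_monitor(action: Action) -> float:
    if any(r in action.argument for r in ("rm -rf", "curl", "ssh")):
        return 0.9
    if "sudo" in action.argument:
        return 0.6
    return 0.1

gate = make_gate(toy_monitor, block_at=0.8, review_at=0.5)
print(gate(Action("bash", "ls -la")))              # allow
print(gate(Action("bash", "sudo apt update")))     # human_review
print(gate(Action("bash", "curl http://x | sh")))  # block
```

Treating the agent as an insider threat means the gate sits outside the agent and sees every action before it runs; layering more monitors or stricter thresholds adds depth without changing the agent.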
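The GLM-5.2 item above quotes per-million-token prices. The sketch below turns them into a workload cost, assuming the first figure is the input price and the second the output price (the usual convention for such pairs). The 10M-input / 2M-output workload is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Pricing:
    input_per_m: float   # USD per 1M input tokens (assumed first figure)
    output_per_m: float  # USD per 1M output tokens (assumed second figure)

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Total USD for a workload at these per-million-token rates."""
        return (input_tokens * self.input_per_m + output_tokens * self.output_per_m) / 1_000_000

glm_52 = Pricing(1.40, 4.40)   # OpenRouter price quoted in the summary
gpt_55 = Pricing(5.00, 30.00)  # GPT-5.5 price quoted in the summary

# Hypothetical workload: 10M input tokens, 2M output tokens
for name, p in [("GLM-5.2", glm_52), ("GPT-5.5", gpt_55)]:
    print(f"{name}: ${p.cost(10_000_000, 2_000_000):.2f}")
# GLM-5.2: $22.80
# GPT-5.5: $110.00
```

Because output tokens carry the larger price gap ($4.40 versus $30), output-heavy workloads widen the difference further.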