AI Security Releases — Red-Teaming & LLM Guardrails
New AI and LLM security work — red-teaming tools, guardrails, jailbreak research and defenses, explained for builders who actually ship.
58 releases tracked
- FreeFable — 300+ security leaders ask the White House to lift the Fable 5 ban
Bruce Schneier, Alex Stamos, Katie Moussouris and 300+ security execs publicly ask the White House to reverse the Fable 5 and Mythos 5 export controls.
- Rio 3.5 Open 397B — Brazil's 'homegrown' LLM is a Nex-N2 + Qwen merge
Researchers say Rio's city-built 397B 'open Brazilian LLM' is in fact a weight-merge of two existing Chinese models.
- UK Police Officer Under Investigation for Using AI to Fake Evidence
A UK police officer used generative AI to fabricate evidence in real cases — the country's first known criminal investigation of its type.
- Amazon's Jassy Pushed Anthropic Crackdown — flagged Fable 5 jailbreak risk
WSJ scoop: Amazon CEO Andy Jassy warned Treasury that Claude Fable 5 leaked cyberattack info — the US ban followed.
- Google Sues 'Outsider Enterprise' — Chinese scam ring abused Gemini
First time Google has gone to court over Gemini abuse — and the defendant is a phishing-as-a-service ring.
- AI Agent Runs Amok in Fedora — Bad LLM Patches Reach Anaconda 45.5 Installer
An unsupervised agentic AI used a hijacked Fedora account to push bad patches and waste maintainer time across Linux infrastructure.
- Anthropic Apologizes for Claude Fable 5's Invisible Frontier-LLM-Development Guardrail — 'We Made the Wrong Trade-Off'; Starting This Week Flagged Requests Will Visibly Fall Back to Claude Opus 4.8 With an API Refusal Reason, After Jeremy Howard and AI Researchers Said the Silent Downgrade Was 'Sabotaging' Their Work
Anthropic reverses Fable 5's silent ML-development guardrail after researcher backlash — flagged requests will now visibly fall back to Opus 4.8 with a stated reason.
- Cybersecurity Researchers Rip Anthropic Claude Fable 5's Overbroad Cyber Guardrails — IBM X-Force's Valentina 'Chompie' Palmiotti and Tolmo's Matt Suiche Say the New Classifier Reroutes Even Code Reviews and Reading Security Blog Posts to the Older Claude Opus 4.8, Forcing Pros Into Anthropic's Cyber Verification Program
Day-two backlash against Fable 5: the cyber/bio classifier is so broad it punts code reviews to an older model.
- Former xAI Engineer Devin Kim Sues xAI and SpaceX Over Grok Safety Termination — California Filing Says Co-Founder Jimmy Ba Rejected Safety Guardrails, Allegedly Said 'AI Will Kill Us All Anyway', and Fired Kim Right Before His Internal Safety Presentation, Surfacing Days Ahead of SpaceX's IPO
An ex-xAI engineering lead is suing xAI and SpaceX, saying he was fired for trying to put guardrails on Grok.
- AWS Bedrock Requires Mandatory 30-Day Anthropic Data Sharing for Claude Fable 5 and Mythos 5 — Customers Must Set Their Bedrock Account or Project to provider_data_share Before Invoking the Models, and Future Mythos-Class Releases Will Inherit the Same Policy
Mythos-class capability on Bedrock comes with a new condition: your prompts leave AWS and live with Anthropic for 30 days.
- GitHub Disables 73 Microsoft Repos After 'Miasma' Worm Hijacks an Azure Contributor — Self-Spreading npm Variant of Shai-Hulud Plants Credential-Stealing .claude/, .gemini/, .cursor/, and .vscode/ Configs That Fire When AI Coding Agents Open the Repo and Harvest AWS, Azure, GCP, and Kubernetes Keys
Self-spreading npm worm jumped from Red Hat packages to Microsoft GitHub orgs and weaponized AI coding agent config files to harvest cloud credentials.
- OpenAI Rolls Out Lockdown Mode to Free, Go, Plus, Pro, and Self-Serve Business — Optional Defense Disables Deep Research, Agent Mode, File Downloads, and Outbound Image Fetches to Block the Final Stage of Prompt-Injection Data Exfiltration
OpenAI's deterministic defense against the 'final stage' of prompt-injection attacks, data exfiltration, now reaches free ChatGPT users.
- U of T's CleverHans Lab Shows AI Agents Enable Adaptive Computer Worms — Open-Weight LLMs Running on Captured Hosts Compromise ~75% of a 33-Machine Network in One Week With No Human Input, Pulling in Live Vulnerability Advisories Mid-Attack
U of T's CleverHans Lab shows an open-weight LLM running locally on a captured host can power a worm that pivots through a corporate network, no humans needed.
- Anthropic Frontier Red Team Maps a Year of AI-Enabled Cyber Threats Onto MITRE ATT&CK: 832 Banned Accounts Studied, Medium/High-Risk Actors Rose From 33% to 56% as 67% of Operators Used Claude to Write Malware
Anthropic's threat-intel team mapped a year of Claude misuse to MITRE ATT&CK and found AI is now lifting low-skill attackers into deep post-compromise work.
- Google Ships Fake Call Detection in Phone by Google — RCS-Powered Silent Handshake Flags AI Deepfake Impersonation Scams on Android 12+ Devices, Rolling Out Globally This Month Starting With Pixel
An invisible RCS handshake between contacts flags AI voice-clone scams in real time on Android.
- Anthropic Expands Project Glasswing to ~150 New Critical-Infrastructure Partners Across 15+ Countries — Power, Water, Healthcare, and Communications Get Claude Mythos Plus Opus 4.8 Claude Security
Anthropic scales its Mythos-powered defender program from 50 to ~150 critical-infrastructure partners across 15+ countries.
- Hackers Asked Meta's AI Support Bot to Change Instagram Account Emails — and It Did: White House, Sephora, and Space Force Chief Among the Hijacked Accounts
Meta's AI support agent was allowed to reset account emails. Hackers asked. The bot complied.
- PromptArmor: ChatGPT for Google Sheets Exfiltrates Workbooks — One Poisoned Sheet Steals Up to 12 Workbooks via Apps Script, OpenAI Pulls the Code-Gen Path After Disclosure Slipped Through Their Pipeline
One hidden instruction in a shared sheet hijacks the ChatGPT Sheets extension and walks out with workbooks across the user's account.
- Heretic Strips Safety Guardrails From Meta's Llama 3.3 and Google's Gemma 3 in Under 10 Minutes — FT and Safety Group Alice Find the Free GitHub Tool Has Spawned 3,500 'Decensored' Models With 13M Downloads
A free, fully automatic GitHub tool removes the refusal mechanisms baked into open-weight models in minutes.
- PromptArmor: Microsoft Copilot Cowork Exfiltrates Files via Poisoned Skills — Indirect Prompt Injection Hits 100% Success Against Claude Opus 4.7 and Sonnet 4.6
A poisoned skill turns Microsoft's M365 agent into a file thief with no user approval step.
- Anthropic's Project Glasswing Initial Update — ~50 Partners Used Claude Mythos to Surface 10,000+ Vulnerabilities, 1,587 Open-Source Flaws Confirmed Valid at a 90.6% True-Positive Rate, Plus Claude Security Public Beta
Anthropic's one-month report on its ~50-partner program to find software flaws before AI models can exploit them.
- Microsoft Open-Sources RAMPART and Clarity — A Pytest-Native Red-Teaming Framework for AI Agents and a Pre-Code Design Sounding Board
Two open-source tools that bake agent red-teaming and design review into the development workflow.
- GitHub Internal Repositories Breached via Poisoned VS Code Extension — TeamPCP Claims ~4,000 Repos While GitHub Says Customer Data Untouched
A malicious VS Code extension on one GitHub employee's machine became the entry point for an internal-repo dump.
- Mini Shai-Hulud Strikes Again — A Hijacked npm Account Pushed 637 Malicious Versions Across 314 AntV-Ecosystem Packages
A copycat of the Shai-Hulud worm poisoned hundreds of widely used npm packages in a 22-minute automated burst.
- Cloudflare Publishes Its Project Glasswing Findings — Anthropic's Mythos Preview Excelled at Chaining Exploits Across 50+ Repos but Needed a Seven-Stage Harness to Tame False Positives
Cloudflare shares hands-on results from using Anthropic's Mythos Preview to hunt vulnerabilities in its own code.
- Pwn2Own Berlin 2026 — AI Coding Agents and Local-Inference Tools Fall in Their First Outing: $1.3M Paid for 47 Zero-Days as OpenAI Codex, Cursor, Claude Code, LM Studio, and LiteLLM Are All Exploited
The first Pwn2Own to put AI coding agents and local-inference tools on the target list — and every one of them fell.
- Anthropic Mythos Forces US Megabanks Into Days-Not-Weeks Cyber Patching — JPMorgan, Goldman, Citi, BofA, Morgan Stanley Are Already Inside the Glasswing Preview
Five of the largest US banks already have Mythos access and are racing to plug 'several hundred to thousands' of vulnerabilities the model surfaced.
- Mini Shai-Hulud Worm Hits @mistralai and @tanstack npm Packages — 84 Malicious Versions Published in Six Minutes With Valid SLSA Provenance
A second wave of the Shai-Hulud npm worm published 84 backdoored versions across @tanstack, @mistralai, and 160+ more packages in a six-minute window on May 11.
- Google GTIG: First In-the-Wild Zero-Day Built With an AI — Cybercrime Crew Used an LLM to Weaponize a 2FA Bypass Intended for Mass Exploitation
Google's threat-intel team says it has 'high confidence' an LLM wrote a 2FA-bypass zero-day for a popular open-source admin tool — caught before deployment.
- Mozilla Used Claude Mythos to Find 271 Firefox Vulnerabilities, Including a 20-Year-Old XSLT Bug and a 15-Year-Old <legend> Flaw
Mozilla engineers point Anthropic's Mythos Preview at Firefox's fuzzing harness and ship 271 vulnerability fixes, dwarfing prior years' totals.
- GrafanaGhost: Indirect Prompt Injection Silently Exfiltrates Enterprise Metrics via Grafana AI
A hidden prompt injection in Grafana's AI assistant leaks enterprise data silently — no phishing, no credentials, no alerts.
- Mercor LiteLLM Supply-Chain Breach — 4TB Stolen, Meta Pauses AI Training Contracts
A poisoned package for the open-source AI library LiteLLM led to a 4TB breach at Mercor, one of the biggest AI data breaches of 2026.
- Vercel Breached via Context.ai Supply-Chain Attack — Customer Credentials Exposed, Database Listed at $2M
A compromised AI analytics tool became the entry point for a breach affecting Vercel — the web platform used by millions of developers.
- Unauthorized Group Gains Access to Anthropic's Restricted Mythos AI Cybersecurity Model
Anthropic's 'too dangerous to release' cybersecurity model was accessed by unauthorized users the day it was announced.
- Claude Code npm Packaging Error Exposes 512,000 Lines of Source Code Including Anti-Distillation Controls
Anthropic accidentally shipped Claude Code's full source in an npm package, revealing internal secrets and enabling a downstream supply-chain attack within hours.
- LiteLLM CVE-2026-42208 — Critical SQL Injection in AI Gateway Exploited Within 36 Hours of Disclosure
A critical SQL injection in the most popular open-source AI gateway let attackers steal every LLM API key stored in the database.
- Marimo Python Notebook CVE-2026-39987 — Pre-Auth RCE Exploited Within 10 Hours, CISA KEV Listed
The AI-native Python notebook used by thousands of data scientists had a pre-auth shell endpoint — and attackers found it faster than most teams could patch.
- Pennsylvania Sues Character.AI — First State Suit Over a Chatbot Practicing Medicine Without a License
A state AG is suing Character.AI for letting its chatbots impersonate licensed doctors — a first-of-its-kind case.
- Chrome Silently Installs 4 GB Gemini Nano on Idle Profiles — 14 Min, No Consent, ePrivacy Article 5(3) Cited
Chrome quietly fetches Gemini Nano weights to disk on eligible machines, with no UI to refuse and a re-download if you delete it.
- CISA and Five-Eyes Allies Publish Joint Guidance on Securely Deploying Agentic AI
Five governments tell their critical-infrastructure operators to treat AI agents like zero-trust endpoints, not pet projects.
- Gemini CLI Headless-Mode RCE (CVSS 10) — Workspace Auto-Trust Lets Untrusted PRs Pop CI Hosts, Patched in 0.39.1
A 'just trust the workspace' default in CI mode let attackers run code on the host before the agent sandbox even started.
- Apple Ships Internal CLAUDE.md Files Inside Apple Support App v5.13 Update
Apple shipped its own Claude Code prompt files in the public Apple Support app, exposing how the team uses Anthropic's coding agent internally.
- OpenAI Advanced Account Security — Passkey-Only Logins and Co-Branded YubiKeys for ChatGPT
OpenAI launches a phishing-resistant ChatGPT login mode with co-branded YubiKeys aimed at journalists, dissidents, and security defenders.
- PyTorch Lightning PyPI Hijacked — Versions 2.6.2 and 2.6.3 Steal SSH Keys and Cloud Credentials
Lightning 2.6.2 and 2.6.3 on PyPI are malicious. Anyone who pip-installed them in the last day needs to rotate every secret on the box.
- Copy Fail (CVE-2026-31431) — AI-Assisted Scan Finds 9-Year-Old Linux Root Exploit in About an Hour
Theori's AI-driven scanner Xint Code surfaced a 9-year-old Linux kernel logic bug in roughly an hour, with a 732-byte Python proof-of-concept.
- AISLE's AI Analyzer Finds 38 CVEs in OpenEMR — Two CVSS 10.0, Used by 100K Healthcare Providers
An AI vulnerability analyzer found 38 CVEs in OpenEMR in one quarter — more than the most prominent prior human audit found in years.
- Cursor + Claude Opus 4.6 Deletes PocketOS Production Database in 9 Seconds
An AI coding agent silently wiped a startup's production database and its backups in under 10 seconds, then detailed exactly how it broke its own safety rules.
- 10 Live Indirect Prompt Injection Payloads Found Targeting AI Agents in the Wild
Researchers found prompt injection attacks already embedded in real websites — waiting to hijack AI agents that read them.
- Comment and Control — Prompt Injection via GitHub Comments Hits Claude Code, Gemini CLI, Copilot
GitHub comment text can hijack AI coding agents in CI — stealing API keys via the same channels agents already use for context.
- OpenAI GPT-5.5 Bio Bug Bounty: $25K for a Universal Bio Safety Jailbreak
OpenAI is offering $25K for the first universal jailbreak prompt that bypasses GPT-5.5's bio safety guardrails, the first paid crowdsourced red-team for bio guardrails on a shipped frontier model.
- Anthropic MCP Design Flaw Enables RCE Across 150M+ Downloads
A design choice in Anthropic's MCP STDIO transport lets any malicious server run OS commands on the developer's machine — Anthropic says it's working as intended.
- LMDeploy CVE-2026-33626: SSRF in Vision-Language Module Exploited Within 13 Hours
SSRF in LMDeploy's image loader was weaponized within 13 hours of CVE disclosure, giving attackers access to cloud credentials and internal networks.
- Lovable's BOLA Flaw Left 8M Users' Projects Exposed for 48 Days
A BOLA (broken object level authorization) bug exploitable in five API calls let any free Lovable user read another user's source code and database credentials for 48 days.
- Garak — LLM Vulnerability Scanner by NVIDIA
Automated red-teaming for LLMs — finds vulnerabilities before attackers do.
- Guardrails AI — LLM Output Validation Framework
Validate, structure, and secure LLM outputs with declarative guardrails.
- LLM Guard — Input/Output Security Toolkit
Scan LLM inputs and outputs for security risks — self-hosted, no data leaves your infra.
- NeMo Guardrails — Programmable LLM Safety Rails
Define what your LLM can and can't do using a simple dialogue DSL.
- Rebuff — Prompt Injection Detector
Multi-layered prompt injection detection that learns from attacks.
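
For the PyTorch Lightning entry above, which flags versions 2.6.2 and 2.6.3 on PyPI as malicious, a quick local check might look like the sketch below. It is only an illustration: the distribution names `lightning` and `pytorch-lightning` are assumptions (the entry does not spell out the exact package name), and the helper names `is_flagged`, `installed_version`, and `scan` are hypothetical.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Versions the entry above reports as malicious.
FLAGGED_VERSIONS = frozenset({"2.6.2", "2.6.3"})

# Assumption: distribution names to check; the entry does not name them exactly.
CANDIDATE_DISTRIBUTIONS = ("lightning", "pytorch-lightning")


def is_flagged(version: str) -> bool:
    """Return True if the exact version string is one of the flagged versions."""
    return version.strip() in FLAGGED_VERSIONS


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def scan(names: Iterable[str] = CANDIDATE_DISTRIBUTIONS) -> Dict[str, str]:
    """Map each installed distribution with a flagged version to that version."""
    hits = {}
    for name in names:
        version = installed_version(name)
        if version is not None and is_flagged(version):
            hits[name] = version
    return hits


if __name__ == "__main__":
    hits = scan()
    if hits:
        for name, version in hits.items():
            print(f"FLAGGED: {name}=={version} is installed in this environment")
    else:
        print("No flagged versions found in this environment")
```

This only inspects the Python environment it runs in, so other virtualenvs, containers, and CI images need their own check. A clean result does not prove a flagged version was never installed earlier, so the entry's advice to rotate secrets still applies to anyone who installed one.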