Google · 2026-04-15 · notable
Gemini 3.1 Flash TTS Preview — Google's Inline Audio-Tag Text-to-Speech
Google launches Gemini 3.1 Flash TTS Preview, a steerable TTS model with inline audio tags: embed [whispers], [excited], or [laughs] directly in text. It offers 30 named voices, 100+ languages, and up to 2 speakers.

Google's new TTS API model lets you control speech expression by embedding tags like [whispers] directly in text.
Key specs
| Spec | Value |
|---|---|
| Voices | 30 |
| Languages | 100+ |
| Max speakers | 2 |
| Model ID | gemini-3.1-flash-tts-preview |
What is it?
Gemini 3.1 Flash TTS Preview is Google's new text-to-speech model available via the Gemini API. Unlike traditional TTS systems that require SSML or separate prosody parameters, it accepts inline audio tags — [whispers], [laughs], [excited], [sarcastic], [crying] — embedded directly in the input text alongside scene-setting instructions.
How does it work?
The model takes directorial prompts as plain text: you describe the scene, set the tone, and embed expression tags wherever you need them. It supports multi-speaker audio (up to 2 speakers), 30 named voice presets, and automatic language detection across 100+ languages. Access is via the Gemini API using model ID gemini-3.1-flash-tts-preview.
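The directorial prompt described above can be assembled with ordinary string handling. The sketch below is a minimal illustration, not an official format: the "Speaker: line" layout, the helper names `build_prompt` and `find_tags`, and the sample scene are all assumptions. Only the bracketed tags and the two-speaker limit come from the description above.

```python
import re

MAX_SPEAKERS = 2  # speaker limit stated for this model

# Matches inline audio tags such as [whispers] or [excited].
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def find_tags(text):
    """Return the inline audio tags found in text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def build_prompt(scene, turns):
    """Join a scene description and (speaker, line) turns into one prompt.

    The "Speaker: line" layout is an assumed convention for illustration.
    """
    speakers = []
    for speaker, _ in turns:
        if speaker not in speakers:
            speakers.append(speaker)
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"at most {MAX_SPEAKERS} speakers are supported, got {len(speakers)}"
        )
    lines = [scene, ""]
    lines += [f"{speaker}: {line}" for speaker, line in turns]
    return "\n".join(lines)


# Hypothetical two-speaker scene with tags placed mid-line.
prompt = build_prompt(
    "Two friends talk quietly backstage before a show.",
    [
        ("Ana", "[whispers] Is the crowd big?"),
        ("Ben", "[excited] It's packed! [laughs] No pressure."),
    ],
)
print(prompt)
print(find_tags(prompt))
```

Keeping the scene, the speaker turns, and the tags in one plain string mirrors the idea that no separate markup layer is needed; the validation step simply guards the stated two-speaker limit before any request is made.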
Why does it matter?
Expressive TTS has typically required SSML markup or a separate prosody pipeline. Audio tags let developers control tone and delivery in plain text with no additional tooling, which is useful for voice agents, audiobooks, narration, and interactive dialogue where the delivery style changes mid-sentence.
Who is it for?
Developers building voice agents, audio content pipelines, or interactive narration.
Try it
import google.generativeai as genai
model = genai.GenerativeModel('gemini-3.1-flash-tts-preview')
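For a fuller end-to-end picture, here is a hedged sketch of requesting audio and saving it to disk. It uses the `google-genai` client (`from google import genai`) rather than the `GenerativeModel` class, and it assumes this preview model accepts the same request shape as earlier Gemini TTS models: an `AUDIO` response modality, a prebuilt voice, and raw 16-bit mono PCM at 24 kHz in the response. The voice name `Kore`, the audio format values, and the helper names `synthesize` and `save_pcm_as_wav` are assumptions, so check the current API reference before relying on them.

```python
import wave

MODEL_ID = "gemini-3.1-flash-tts-preview"


def save_pcm_as_wav(path, pcm, rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container (format defaults are assumed)."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)


def synthesize(text, voice_name="Kore"):
    """Request speech for text and return the audio bytes.

    Needs the google-genai package and an API key in the environment.
    The voice name and response layout are assumptions, not confirmed
    for this preview model.
    """
    # Imported here so the WAV helper works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()
    response = client.models.generate_content(
        model=MODEL_ID,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice_name
                    )
                )
            ),
        ),
    )
    # Assumed: audio arrives as inline data on the first response part.
    return response.candidates[0].content.parts[0].inline_data.data
```

With a valid key, `save_pcm_as_wav("out.wav", synthesize("[whispers] Welcome back. [excited] Let's begin!"))` would write a playable file if the assumptions above hold.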