AI/TLDR

What Is Text-to-Speech? Modern AI Voice Synthesis Explained

Learn how neural TTS generates natural-sounding voices, what makes them sound human, and which APIs and open models to start with.

BEGINNER · 12 MIN READ · UPDATED 2026-06-12

In plain English

Text-to-speech (TTS) is software that reads written text out loud. You hand it a string of words, and it hands back an audio file — or a stream of audio — that sounds like a human voice speaking those words. You've heard it everywhere: the voice that reads out turn-by-turn directions in your GPS app, the narrator on an audiobook, the assistant answering a phone call, the accessibility reader on a webpage.

A useful analogy: imagine hiring a professional voice actor. You email them a script, they record it in a sound booth, and they send you back the audio file. TTS does the same job, except the 'voice actor' is a neural network, the 'sound booth' is a GPU, and the whole process takes under a second. Modern AI-powered TTS has gotten so close to real human speech that in double-blind listening tests, many listeners can't reliably tell the difference.

The abbreviation TTS is used interchangeably with voice synthesis and speech synthesis. When people say neural TTS or AI TTS they specifically mean systems driven by deep learning, as opposed to the older robotic-sounding systems that stitched together pre-recorded phoneme fragments. The neural approach took off with DeepMind's WaveNet, published in 2016, and became mainstream around 2017 when Google started shipping WaveNet voices in its products. It is what makes today's voices sound natural.

Why it matters

Before neural TTS, turning text into audio required either hiring voice talent (expensive, slow) or using rule-based synthesizers that sounded unmistakably robotic. That made voice a second-class citizen in most software. Neural TTS changed the economics completely: you can now generate hours of natural-sounding audio for a few dollars, on demand, in dozens of languages.

For builders, this matters across a surprisingly wide range of products:

  • Voice agents and chatbots — your LLM-powered assistant needs a voice to speak back to users; TTS is the last mile.
  • Accessibility — screen readers help visually impaired users navigate the web; better TTS directly improves their experience.
  • Audiobooks and podcasts — publishers can narrate any written article or book without scheduling studio time.
  • E-learning — instructional content can be voiced in multiple languages without re-recording every lesson.
  • IVR and call centers — interactive phone menus and automated outbound calls rely entirely on TTS for every spoken line.
  • Real-time translation — spoken content can be synthesized in the target language and played back immediately.
  • Developer prototyping — you can mock up a voice interface in minutes before spending money on a professional voice actor.

How it works

Modern neural TTS pipelines break the job into stages. The exact architecture varies by model, but most share a common three-stage structure: text analysis, acoustic modeling, and waveform synthesis (vocoding).

Stage 1 — Text analysis

Before any audio can be synthesized, the raw text must be cleaned up and converted into a form the neural network understands. This stage handles text normalization (turning "$12.99" into "twelve dollars and ninety-nine cents", expanding abbreviations like "Dr." to "Doctor") and grapheme-to-phoneme (G2P) conversion, which maps written letters to the phonetic sounds they represent. English is notoriously hard here: 'read' is pronounced differently in 'I read the book' vs. 'I will read the book'. The model also predicts prosody — the rhythm, stress, and intonation pattern for the sentence.
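To make text normalization concrete, here is a minimal sketch in Python that handles only the two cases mentioned above: a small abbreviation table and dollar amounts. The function names, the abbreviation table, and the 0 to 999 number range are all assumptions made for illustration. Production normalizers cover far more cases and use context to resolve ambiguity (for example, "Dr." can also mean "Drive").

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Tiny illustrative table; a real system needs a much larger, context-aware one.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0-999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_dollars(match: re.Match) -> str:
    """Turn a matched '$12.99' into spoken words."""
    dollars = int(match.group(1))
    cents = int(match.group(2)) if match.group(2) else 0
    text = f"{number_to_words(dollars)} dollar{'' if dollars == 1 else 's'}"
    if cents:
        text += f" and {number_to_words(cents)} cent{'' if cents == 1 else 's'}"
    return text


def normalize(text: str) -> str:
    """Expand known abbreviations, then dollar amounts up to $999.99."""
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)
    return re.sub(r"\$(\d{1,3})(?:\.(\d{2}))?", expand_dollars, text)


print(normalize("Dr. Lee paid $12.99."))
# Doctor Lee paid twelve dollars and ninety-nine cents.
```

The output of this step is what the G2P converter sees, which is why normalization mistakes tend to surface as mispronunciations later in the pipeline.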

Stage 2 — Acoustic model

The acoustic model takes the phoneme sequence (and prosody targets) and predicts a mel spectrogram: a compact 2D representation of audio that shows how energy is distributed across frequency bands over time. Early neural acoustic models like Tacotron 2 used recurrent neural networks (RNNs) with attention and generated the spectrogram autoregressively, one frame at a time. Non-autoregressive Transformer models like FastSpeech 2 are much faster because they predict all frames in parallel. Diffusion-based acoustic models instead refine the whole spectrogram over a series of denoising steps. Some newer end-to-end models (like VITS) skip the explicit spectrogram stage and learn to go directly from text to waveform in a single model.
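To get a feel for what "compact 2D representation" means, the sketch below computes the shape of a mel spectrogram and the spacing of its frequency bands using only the standard library. The settings (22,050 Hz audio, a hop of 256 samples, 80 mel bins, an 8 kHz upper limit) are common choices but are assumptions here, not values taken from any particular model. The mel formula is the widely used HTK-style variant.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list:
    """Center frequencies in Hz of n_mels bands spaced evenly on the mel scale."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_mels + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_mels)]


def spectrogram_shape(duration_s: float, sample_rate: int,
                      hop_length: int, n_mels: int) -> tuple:
    """(mel bins, frames) for a centered short-time analysis."""
    n_samples = int(duration_s * sample_rate)
    return n_mels, 1 + n_samples // hop_length


# Two seconds of audio become an 80 x 173 grid instead of 44,100 raw samples.
print(spectrogram_shape(2.0, 22_050, 256, 80))   # (80, 173)

# Bands are narrow at low frequencies and wide at high ones.
centers = mel_band_centers(80, 0.0, 8_000.0)
print(round(centers[1] - centers[0]), round(centers[-1] - centers[-2]))
```

The second print shows the gap between the first two and the last two band centers. The gap widens toward high frequencies, mirroring how human hearing resolves pitch.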

Stage 3 — Neural vocoder

The vocoder converts the mel spectrogram back into a raw audio waveform. Early vocoders like WaveNet were autoregressive — they generated one audio sample at a time, making them slow. Modern GAN-based vocoders like HiFi-GAN generate the entire waveform in parallel at real-time or faster speeds while maintaining high audio fidelity. Many production TTS systems run the whole pipeline in well under 100 ms for a typical sentence.
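Two quick calculations show why sample-by-sample generation is slow and what "real-time or faster" means. The sketch below counts the sequential steps an autoregressive vocoder needs, and computes the real-time factor (synthesis time divided by audio duration, where values below 1.0 are faster than real time). The 24 kHz sample rate and the timings are hypothetical numbers chosen for illustration.

```python
def sequential_steps(audio_seconds: float, sample_rate: int) -> int:
    """Number of samples, and so sequential steps, for a sample-by-sample vocoder."""
    return round(audio_seconds * sample_rate)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical 3-second sentence at 24 kHz.
print(sequential_steps(3.0, 24_000))            # 72000

# Hypothetical parallel vocoder that needs 0.15 s for those 3 seconds.
print(f"{real_time_factor(0.15, 3.0):.2f}")     # 0.05
```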

Hosted API options

For most production use cases, the fastest path to good TTS is a hosted API. You send text over HTTPS and receive audio back. No GPU required, voices are ready to go, and pricing scales with usage. Here are the major options as of mid-2026:

| Provider | Notable models | Price (approx.) | Standout feature |
| --- | --- | --- | --- |
| OpenAI | tts-1, tts-1-hd, gpt-4o-mini-tts | $15–$30 / 1M chars (tts-1/hd); token-based for gpt-4o-mini-tts | Instructable delivery: tell it how to speak via a prompt |
| ElevenLabs | Multilingual v2/v3, Flash/Turbo | $60–$120 / 1M chars; free tier ~10K chars/mo | Ultra-realistic voices; instant voice cloning from ~1 min of audio |
| Google Cloud | WaveNet, Neural2, Studio | First 1M chars/mo free for WaveNet voices | Broad language support; SSML fine-tuning |
| Microsoft Azure | Neural TTS, DragonHD voices | 500K chars/mo free (neural); pay-as-you-go after | DragonHD voices with automatic emotion detection (2025) |
| Amazon Polly | Standard, Neural, Long-Form | 5M chars/mo free (first year); ~$16/1M chars neural | Long-form model for audiobook-length content |
| Cartesia / Deepgram Aura / Rime | Various | Sub-$10 / 1M chars on some tiers | Sub-200 ms streaming latency optimized for voice agents |
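Per-character pricing is easy to turn into a budget. The sketch below estimates the cost of a job from its character count, using the approximate per-million rates listed above. The dictionary keys, the helper name, and the 500,000-character book are hypothetical, and real invoices depend on plan tiers, so treat the results as ballpark figures.

```python
# Approximate USD per 1M characters, copied from the figures quoted above.
PRICE_PER_MILLION_CHARS = {
    "openai-tts-1": 15.0,
    "openai-tts-1-hd": 30.0,
    "elevenlabs-low-end": 60.0,
    "amazon-polly-neural": 16.0,
}


def estimate_cost(num_chars: int, price_per_million: float,
                  free_chars: int = 0) -> float:
    """Cost in USD after subtracting any free monthly allowance."""
    billable = max(0, num_chars - free_chars)
    return billable / 1_000_000 * price_per_million


book_chars = 500_000  # hypothetical book length in characters

for name, price in PRICE_PER_MILLION_CHARS.items():
    print(f"{name}: ${estimate_cost(book_chars, price):.2f}")
```

Passing `free_chars` lets you model a free tier, for example `estimate_cost(book_chars, 60.0, free_chars=10_000)` for a plan with 10,000 free characters per month.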

OpenAI's gpt-4o-mini-tts (released March 2025) introduced a standout feature: an instructions parameter that lets you tell the model how to deliver the speech — "speak warmly and slowly with occasional pauses" — without any SSML markup. It uses token-based pricing where text input costs $0.60 per million tokens and audio output costs $12 per million audio tokens.

ElevenLabs remains the quality benchmark for expressive, realistic voices. Its instant voice cloning feature creates a usable voice from as little as one minute of audio. Cloning itself is free; you pay only for the characters you synthesize with the cloned voice. Professional voice cloning (30+ minutes of audio, higher fidelity) is available from the Pro plan at $99/month.

OpenAI TTS: minimal example (Python)
from pathlib import Path
from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",           # alloy | echo | fable | onyx | nova | shimmer
    input="Hello! This is AI-generated speech.",
    response_format="mp3",
)

Path("output.mp3").write_bytes(response.content)
print("Saved output.mp3")
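The instructions parameter of gpt-4o-mini-tts can be tried with a plain HTTPS request as well. The sketch below only builds the request for the speech endpoint, using the standard library, and leaves the sending step commented out so it runs without an API key or network access. The helper name `build_speech_request` and the sample text are made up for illustration. The OpenAI SDK shown earlier accepts the same `instructions` field as a keyword argument.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, instructions: str,
                         voice: str = "alloy", api_key: str = "") -> urllib.request.Request:
    """Build (but do not send) a speech request with delivery instructions."""
    payload = {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": instructions,   # plain-language delivery guidance
        "response_format": "mp3",
    }
    key = api_key or os.environ.get("OPENAI_API_KEY", "")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request(
    "Your order has shipped.",
    "Speak warmly and slowly with occasional pauses.",
)
print(json.loads(request.data)["instructions"])

# To actually synthesize audio, send the request and save the bytes:
# with urllib.request.urlopen(request) as response:
#     with open("output.mp3", "wb") as audio_file:
#         audio_file.write(response.read())
```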

Open-source and self-hosted models

If you need to run TTS on your own infrastructure — for privacy reasons, cost at scale, or offline use — a new generation of open-weight models has made that viable without sacrificing much quality.

Kokoro-82M

Kokoro (by hexgrad, Apache 2.0 license) is the headline model of the open-source TTS world as of 2025. Despite having only 82 million parameters and fitting in under 2 GB of VRAM, it ranked first in the TTS Spaces Arena benchmark — above XTTS v2 (467M params), Fish Speech (~500M params), and MetaVoice (1.2B params). API-served Kokoro costs under $1 per million input characters. It produces highly natural English speech but does not natively support voice cloning.

Coqui XTTS v2

XTTS v2 (from Coqui) supports 17 languages and can clone a voice from as little as a 6-second audio sample — enabling cross-language voice cloning, where you clone someone's voice in English and synthesize their voice speaking Spanish. XTTS v2 is licensed under the Coqui Public Model License, which restricts commercial use without a negotiated agreement. The community fork at idiap/coqui-ai-TTS on GitHub continues active maintenance.

Bark

Bark (by Suno) is a generative audio model that treats speech synthesis more like a language model problem. It can generate not just speech but also music, sound effects, and non-verbal sounds like laughter or sighing. This makes it uniquely expressive, but also less predictable and slower than purpose-built TTS models. It's a good fit for creative or entertainment applications where character and personality matter more than latency.

| Model | Params | License | Voice cloning | Best for |
| --- | --- | --- | --- | --- |
| Kokoro-82M | 82M | Apache 2.0 | No | Fast, high-quality English TTS at low cost |
| Coqui XTTS v2 | 467M | Coqui PML (non-commercial) | Yes (6-sec sample) | Multilingual + voice cloning |
| Bark | ~900M | MIT | Limited | Expressive, creative audio with sound effects |
| StyleTTS 2 | ~100M | MIT | Yes | High naturalness, diffusion-based |
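
Kokoro: minimal local inference via the kokoro package (Python)
# pip install kokoro soundfile numpy
import numpy as np
import soundfile as sf
from kokoro import KPipeline

SAMPLE_RATE = 24_000  # Kokoro generates 24 kHz audio

pipeline = KPipeline(lang_code="a")   # 'a' = American English

# The pipeline returns a generator that yields one
# (graphemes, phonemes, audio) tuple per chunk of text.
chunks = [
    np.asarray(audio)
    for _, _, audio in pipeline("Hello, this is Kokoro TTS.", voice="af_heart")
]

# Join the chunks into a single waveform and save it.
sf.write("output.wav", np.concatenate(chunks), SAMPLE_RATE)
print("Saved output.wav")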

Going deeper

Once you have basic TTS working, the interesting engineering questions shift toward control, latency, and scale.

SSML — controlling prosody with markup

Speech Synthesis Markup Language (SSML) is an XML dialect supported by Google, Azure, and Amazon Polly that lets you embed fine-grained instructions directly in the text. You can insert <break> tags for pauses, set speaking rate and pitch with <prosody>, spell out words phonetically with <phoneme>, and switch between languages mid-sentence with <lang>. OpenAI's newer gpt-4o-mini-tts model takes a different approach — natural-language instructions instead of XML markup — but SSML remains the standard for the major cloud providers.
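Here is a small sketch of how you might assemble SSML in Python without hand-writing XML. The helper names (`speak`, `pause`, `prosody`) and the sample sentences are invented for this example. The tags and attributes follow the SSML standard, but each provider supports a slightly different subset, so check its documentation before relying on a given attribute.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def pause(milliseconds: int) -> str:
    """A silent break of the given length."""
    return f'<break time="{milliseconds}ms"/>'


def prosody(text: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Wrap text in a <prosody> element, escaping XML special characters."""
    return (f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
            f"{escape(text)}</prosody>")


def speak(*parts: str) -> str:
    """Join already-escaped fragments inside the <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Your total is $12.99."),
    pause(400),
    prosody("Thanks for shopping at R&D Supply!", rate="slow", pitch="+2st"),
)

ET.fromstring(ssml)   # raises ParseError if the markup is not well-formed
print(ssml)
```

Escaping matters because characters like & and < are legal in ordinary text but break XML. The parse call at the end is a cheap check that catches malformed markup before you pay for an API call.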

Streaming for voice agents

In a voice agent, the LLM generates the response token by token. If you wait for the full response before starting TTS, the user hears silence for several extra seconds. The solution is sentence-level streaming: as soon as the LLM produces a complete sentence, start synthesizing and playing it while the LLM continues generating the next one. Most major TTS APIs (OpenAI, ElevenLabs, Google) support chunked streaming, returning audio bytes as soon as they are ready. ElevenLabs advertises a model latency of around 75 ms for its Flash models, a figure that excludes network and application overhead.
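Sentence-level streaming comes down to buffering tokens until a sentence boundary appears. The sketch below shows one simple way to do that, assuming the LLM output arrives as an iterable of text fragments. The function name and the boundary rule (end punctuation followed by whitespace) are illustrative choices. A production splitter also needs to handle abbreviations such as "Dr." that end in a period.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? once whitespace follows it.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they can be detected."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_BOUNDARY.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left when the LLM stops generating.
    if buffer.strip():
        yield buffer.strip()


llm_tokens = ["Hello", " there", ".", " How", " are", " you", "?", " Fine", "."]
for sentence in sentences_from_tokens(llm_tokens):
    print(sentence)   # hand each sentence to your TTS call here
```

Each sentence is emitted only when the token after it arrives, so the first audio request can go out while the LLM is still producing the rest of the reply.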

Voice cloning and custom voices

Voice cloning is the ability to synthesize a specific person's voice from a small reference sample. ElevenLabs' instant voice cloning can produce a usable clone from about one minute of clean audio in seconds. A higher-fidelity professional voice clone requires 30+ minutes of recordings and a few hours of processing, and produces results that can be nearly indistinguishable from the original speaker. Azure's Custom Neural Voice goes further — a fully trained custom model that requires a longer data collection process but is optimized for a specific brand or product.

Ethical and legal considerations

Voice cloning is powerful enough to raise serious concerns. Most TTS providers require explicit consent from the voice owner before cloning, and many jurisdictions are introducing laws around synthetic voice disclosure. When building products with voice cloning, be explicit with users that they are hearing a synthetic voice, obtain consent when cloning real people, and never use cloned voices to impersonate someone in a deceptive context. OpenAI, ElevenLabs, and others publish acceptable-use policies that prohibit misuse.

Evaluating TTS quality

The standard metric is MOS (Mean Opinion Score) — human listeners rate samples from 1 to 5, and the average becomes the model's score. Human speech typically scores around 4.5. Top commercial APIs score 4.0–4.3 on standardized benchmarks. For your own evaluation, build a small test set of sentences that cover the specific edge cases your product cares about (numbers, proper nouns, emotional tone, your target language) and do a blind A/B listen with real users. Benchmark numbers on general datasets often don't predict quality on domain-specific text.
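If you run your own listening test, computing MOS takes a few lines. The sketch below averages 1 to 5 ratings and adds a rough 95% confidence interval using a normal approximation, which is a simplification for small panels. The ratings and system names are made-up sample data, not benchmark results.

```python
import math
import statistics


def mean_opinion_score(ratings: list) -> tuple:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= rating <= 5 for rating in ratings):
        raise ValueError("ratings must be between 1 and 5")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for two systems.
ratings_by_system = {
    "system_a": [4, 5, 4, 4, 3, 5, 4, 4],
    "system_b": [3, 4, 4, 3, 4, 3, 4, 4],
}

for name, ratings in ratings_by_system.items():
    mos, half_width = mean_opinion_score(ratings)
    print(f"{name}: MOS {mos:.2f} +/- {half_width:.2f}")
```

With only a handful of listeners the intervals are wide and often overlap, which is a good reason to collect more ratings before declaring one voice better than another.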

FAQ

What is the difference between text-to-speech and voice cloning?

Standard TTS converts text to speech using a pre-built voice — you pick from a library of voices the provider has already trained. Voice cloning creates a new voice that sounds like a specific person, by training on a sample of their speech. Once cloned, you can use that voice just like any other TTS voice. The cloning step is a one-time setup; synthesis is the same process afterward.

How realistic does AI text-to-speech sound in 2025?

Top commercial systems like ElevenLabs and OpenAI's gpt-4o-mini-tts can fool many listeners in short listening tests. They handle natural prosody, emotional inflection, and conversational rhythm well. Longer reads, unusual names, and code or domain-specific jargon are still edge cases where quality can drop. The gap between AI and a professional human voice actor narrows each year.

How much does AI text-to-speech cost?

Pricing is typically quoted per character, usually as a rate per million characters. OpenAI's tts-1 costs $15 per million characters; ElevenLabs Flash starts around $60 per million characters. Most providers offer a free tier (Google gives 1M WaveNet characters/month free; ElevenLabs gives ~10,000 characters/month free). Open-source models like Kokoro served via API cost under $1 per million characters, and running them locally brings the marginal cost close to zero.

What is the best text-to-speech API for a real-time voice agent?

For real-time conversation, prioritize streaming support and low time-to-first-audio over raw quality. ElevenLabs advertises around 75 ms of model latency for its Flash models, not counting network time. Cartesia, Deepgram Aura, and Rime are also built specifically for low-latency voice-agent use cases. Stream sentence by sentence from your LLM into TTS so that latency does not accumulate.

Can I run text-to-speech locally without an internet connection?

Yes. Open-source models like Kokoro-82M (Apache 2.0), XTTS v2, and Bark can run fully offline on a consumer GPU. Kokoro-82M requires under 2 GB of VRAM and runs at real-time speeds. This is useful for privacy-sensitive applications or offline scenarios, though setup takes more engineering than calling a hosted API.

What is SSML and do I need it?

SSML (Speech Synthesis Markup Language) is an XML dialect supported by Google, Azure, and Amazon Polly for fine-grained control over pronunciation, pauses, and pitch. Most developers don't need it for everyday use — the defaults are good enough. You need SSML when you need precise control: ensuring a product name is always pronounced correctly, adding dramatic pauses for effect, or mixing languages in a single phrase. OpenAI's newer gpt-4o-mini-tts offers a more intuitive alternative: natural-language instructions instead of XML tags.

Further reading