In plain English
Text-to-speech (TTS) is the opposite of speech-to-text: you hand it a sentence and it hands back an audio file of someone saying that sentence. You have heard AI TTS hundreds of times — navigation apps reading out directions, audiobook narrators, phone hold messages, and every voice assistant reply that isn't a recording.
Old TTS stitched together short pre-recorded sound fragments (roughly phoneme-sized units) from a single speaker, producing robotic, stilted speech you could always spot. Modern neural TTS learns the full shape of a voice from audio examples and then generates a new waveform from scratch. The result is speech with natural rhythm, emotion, and breathing: not a recording, but something the model made up on the spot.
The leap matters because it unlocks two things that older systems could not do. First, expressive quality: the voice rises in excitement, slows for emphasis, and pauses at the right moment without you scripting it. Second, voice cloning: give the model a short clip of anyone's voice and it can speak new text in that same voice. That is the capability behind personalized assistants, dubbed content, and the headline features of services like ElevenLabs.
Why it matters
Neural TTS crossed a perceptual threshold somewhere around 2023-2024: the best models now score above 4.0 on the five-point Mean Opinion Score (MOS) scale that evaluators use to rate naturalness — a score previously reserved for real human recordings. That quality jump opened a long list of use cases that were not practical before.
- Voice agents and customer support. Conversational AI products need a voice that sounds human enough that callers do not hang up. Low-latency TTS (under 300 ms to first audio byte) is what makes a spoken conversation feel like a real-time exchange rather than a slow phone menu.
- Content creation and accessibility. Authors, podcasters, and learners use TTS to turn written articles into audio. Publishers use it for instant audiobook narration. Accessibility tools use it so visually impaired users can have any document read aloud in a natural voice.
- Localization and dubbing. Instead of hiring voice actors for every language, a production team can clone a speaker's voice and generate that voice speaking the translated script, preserving vocal identity across languages.
- Video and game characters. Indie game developers and video creators who cannot afford a recording studio can generate voiced dialogue dynamically, with characters that respond to player choices in their own unique voice.
- Personalization. Apps can generate a custom voice from a short sample so users interact with their own voice, a loved one's voice, or a celebrity voice they licensed — all without pre-recording every possible phrase.
What did it replace? Rule-based and concatenative TTS systems like Festival and older SAPI engines relied on phoneme dictionaries and pitch-rule tables that linguists crafted by hand. Each voice was typically tied to one language and one speaker, and any word outside the dictionary sounded like a robot glitching. Neural TTS has made the old approach nearly obsolete for anything end-user-facing.
How it works
Neural TTS systems share a common two-step logic: first predict acoustic features from text (what the sound should look like), then synthesize the actual waveform from those features. The exact model architecture has evolved rapidly, but the pipeline below describes how leading systems work today.
Step 1: text analysis
Raw text is not ready to be spoken. Numbers must be expanded ("$42" → "forty-two dollars"), abbreviations resolved ("Dr." → "Doctor" or "Drive" depending on context), and punctuation converted into pause signals. A text front-end handles this normalization, producing a clean sequence of linguistic units the model can work with.
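The normalization step can be sketched in a few lines. This is a toy front-end under stated assumptions, not how any production system does it: the function names are hypothetical, numbers are only handled from 0 to 999, and the "Dr." rule is a crude heuristic (title before a capitalized word, street suffix otherwise).

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (toy range for this sketch)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head + (" " + number_to_words(rest) if rest else "")


def _expand_currency(match: re.Match) -> str:
    amount = int(match.group(1))
    unit = "dollar" if amount == 1 else "dollars"
    return f"{number_to_words(amount)} {unit}"


def normalize(text: str) -> str:
    """Expand dollar amounts, small bare numbers, and the abbreviation 'Dr.'."""
    # 1. Currency first, so "$42" is not picked up as a bare number later.
    text = re.sub(r"\$(\d{1,3})(?!\d)", _expand_currency, text)
    # 2. Remaining standalone numbers of up to three digits.
    text = re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)
    # 3. Heuristic: "Dr." before a capitalized word is a title...
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # ...and any other "Dr." is treated as a street suffix.
    text = re.sub(r"\bDr\.", "Drive", text)
    return text


print(normalize("Dr. Lee paid $42 at 7 Elm Dr."))
# Doctor Lee paid forty-two dollars at seven Elm Drive
```

Real front-ends use much larger rule sets or learned models to resolve context, which is exactly why this stage exists as its own step.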
Step 2: acoustic prediction
A neural encoder reads the text and predicts a spectrogram — the same heatmap of frequency-over-time that appears in speech-to-text, just generated instead of read. This stage also decides prosody: how long each syllable lasts (duration), how loud it is (energy), and how high the pitch is (F0 contour). Prosody is what separates expressive TTS from monotone TTS. Modern systems like ElevenLabs train on large corpora of annotated audio so the model learns, for example, that a question rises at the end and an exclamation uses a faster tempo.
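To make the duration part of prosody concrete, here is a minimal sketch of a "length regulator", the approach used by FastSpeech-style models. That architecture is an assumption for illustration; the text above does not say which design any given product uses. The phoneme symbols, durations, hop length, and sample rate are all hypothetical values.

```python
def length_regulate(phonemes: list[str], durations_frames: list[int]) -> list[str]:
    """Repeat each phoneme once per predicted spectrogram frame."""
    if len(phonemes) != len(durations_frames):
        raise ValueError("need exactly one duration per phoneme")
    frames: list[str] = []
    for phoneme, n_frames in zip(phonemes, durations_frames):
        frames.extend([phoneme] * n_frames)
    return frames


def frames_to_seconds(n_frames: int, hop_length: int = 256,
                      sample_rate: int = 22050) -> float:
    """Convert a frame count to audio duration for a given hop size."""
    return n_frames * hop_length / sample_rate


# "hello" as four phonemes with hypothetical predicted durations (in frames).
phonemes = ["HH", "AH", "L", "OW"]
durations = [3, 5, 4, 8]

frames = length_regulate(phonemes, durations)
print(len(frames))                               # 20
print(round(frames_to_seconds(len(frames)), 3))  # 0.232
```

Stretching or shrinking the durations changes speaking rate without touching the text, which is one reason duration is predicted explicitly.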
Step 3: waveform synthesis (the vocoder)
The predicted spectrogram is converted into an actual audio waveform by a neural vocoder. Early vocoders like WaveNet were slow (generating a single second of audio took seconds of compute) because they produced one sample at a time. Modern vocoders, HiFi-GAN and its descendants, are dramatically faster because they generate audio in parallel, enabling real-time synthesis. ElevenLabs reports around 75 ms time-to-first-audio-byte for its Flash v2.5 model on optimized hardware.
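The speed difference can be put in numbers with the real-time factor (compute time divided by audio duration; below 1.0 means faster than real time). The sketch below is only arithmetic: the timings and the 24 kHz sample rate are hypothetical, and the function names are made up for illustration.

```python
import math


def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """Compute time per second of audio; values below 1.0 are real-time capable."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return compute_seconds / audio_seconds


def sequential_steps(audio_seconds: float, sample_rate: int,
                     samples_per_step: int) -> int:
    """Number of dependent generation steps needed to produce the audio."""
    total_samples = int(audio_seconds * sample_rate)
    return math.ceil(total_samples / samples_per_step)


# One second of 24 kHz audio, generated one sample per step:
print(sequential_steps(1.0, 24_000, samples_per_step=1))       # 24000
# The same second produced in a single parallel pass:
print(sequential_steps(1.0, 24_000, samples_per_step=24_000))  # 1

# Hypothetical timings: 3 s of compute vs 0.05 s of compute per second of audio.
print(real_time_factor(3.0, 1.0))   # 3.0
print(real_time_factor(0.05, 1.0))  # 0.05
```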
Voice cloning
Voice cloning works by giving the model a reference embedding — a compact numerical fingerprint of a voice extracted from a short audio sample. The acoustic model is conditioned on this embedding, so it generates spectrogram frames that match the target voice's timbre and cadence instead of a generic one. Instant cloning needs as little as one minute of audio and works by computing the embedding at inference time. Professional cloning fine-tunes the model weights on 30+ minutes of clean audio to capture subtler traits — accent shading, emotional range, vocal quirks — and produces a higher-fidelity result.
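Structurally, conditioning on a reference embedding can be as simple as attaching the same voice vector to every position of the encoded text. The sketch below illustrates that idea only. It is an assumption about one common design, not a description of any vendor's model; real systems compute the embedding with a trained speaker-encoder network, whereas this toy version just averages per-frame vectors.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mean_embedding(frame_vectors: list[list[float]]) -> list[float]:
    """Average per-frame vectors into one fixed-size voice fingerprint."""
    if not frame_vectors:
        raise ValueError("need at least one frame vector")
    n = len(frame_vectors)
    dim = len(frame_vectors[0])
    mean = [sum(v[i] for v in frame_vectors) / n for i in range(dim)]
    return l2_normalize(mean)


def condition_on_speaker(text_encodings: list[list[float]],
                         speaker_embedding: list[float]) -> list[list[float]]:
    """Concatenate the same speaker embedding onto every text position."""
    return [encoding + speaker_embedding for encoding in text_encodings]


# Hypothetical 2-dimensional frame vectors from a reference clip.
voice = mean_embedding([[3.0, 0.0], [3.0, 8.0]])
# Hypothetical 1-dimensional text encodings for two positions.
conditioned = condition_on_speaker([[1.0], [2.0]], voice)
print(len(conditioned), len(conditioned[0]))  # 2 3
```

Because the embedding is computed once at inference time, swapping the voice means swapping one vector, with no retraining. Fine-tuning the weights, as in professional cloning, is a separate and heavier process.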
The model landscape: ElevenLabs, OpenAI, Kokoro
By mid-2026, three names dominate most TTS integration discussions: ElevenLabs for quality and cloning depth, OpenAI TTS for simplicity and value, and Kokoro for open-weight local deployment. Each solves a different version of the problem.
ElevenLabs
ElevenLabs is the quality leader. Its model lineup splits into two use cases. Eleven v3 (also called Multilingual v3) is the most expressive model for content creation: it handles natural prosody, emotional inflection, and audio tags like [laughs] or [sighs] that let you script a performance. Flash v2.5 trades some expressiveness for a 75 ms latency target and 32-language support, making it the recommended model for real-time voice agents. The Voice Library contains over 11,000 community voices; Instant Voice Cloning is available on paid plans from $22/month.
Pricing is per character: Flash v2.5 runs at roughly $103 per million characters (about half the cost of the premium Multilingual v3 tier). The subscription plans — Starter ($5/mo, 30k chars), Creator ($22/mo, 100k chars), Pro ($99/mo, 500k chars) — are aimed at solo creators; the API is priced separately for builders.
OpenAI TTS
OpenAI's TTS API offers six built-in voices (alloy, echo, fable, onyx, nova, shimmer) in two tiers: TTS-1 at $15 per million characters and TTS-1-HD at $30 per million characters. Quality is excellent for straightforward narration, latency is around 200 ms, and the main selling point is dead-simple integration — one extra endpoint in an application already using the OpenAI SDK. Voice cloning is not offered.
Kokoro-82M
Kokoro is an open-weight model with only 82 million parameters — tiny by neural TTS standards. Despite its size, Kokoro-82M climbed to #1 on the TTS Arena leaderboard in early 2026, outscoring much larger proprietary models for perceived naturalness. It processes most text lengths in under 300 ms on a consumer GPU, and it runs on modest hardware including laptops with dedicated graphics. It is free to self-host, meaning audio never leaves your machine, there is no per-character cost, and you can run it indefinitely. The tradeoff: you manage the infrastructure, it supports fewer languages than ElevenLabs Flash, and cloning requires additional fine-tuning work.
ElevenLabs Flash v2.5:
- 75 ms time-to-first-audio
- 32 languages
- Instant voice cloning
- ~$103 / 1M chars (API)
- Best for real-time agents

OpenAI TTS (TTS-1):
- ~200 ms latency
- Broad language support
- No voice cloning
- $15 / 1M chars
- Best for simple integration

Kokoro-82M:
- <300 ms on consumer GPU
- English focus (extendable)
- Self-hosted, free to run
- $0 per character
- Best for privacy / cost control
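To see what the per-character prices mean at volume, the arithmetic can be sketched as below. The prices are the approximate figures quoted in this section; the dictionary keys and the 2.5 million character workload are made up for illustration, and the Kokoro entry covers per-character fees only, not the cost of the hardware you run it on.

```python
# Approximate USD prices per one million characters, as quoted in this article.
PRICE_PER_MILLION_CHARS = {
    "elevenlabs_flash_v2_5": 103.0,
    "openai_tts_1": 15.0,
    "openai_tts_1_hd": 30.0,
    "kokoro_82m_self_hosted": 0.0,  # no per-character fee; infrastructure not included
}


def monthly_cost(chars_per_month: int, price_per_million: float) -> float:
    """Per-character API cost in USD for a given monthly character volume."""
    if chars_per_month < 0:
        raise ValueError("chars_per_month must be non-negative")
    return chars_per_month / 1_000_000 * price_per_million


# Hypothetical workload: 2.5 million characters per month.
for name, price in PRICE_PER_MILLION_CHARS.items():
    print(f"{name}: ${monthly_cost(2_500_000, price):.2f}")
```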
Streaming and latency
In a voice agent, the TTS stage is the last link before the user hears anything. Total perceived latency is the sum of: LLM time-to-first-token + TTS time-to-first-audio-byte + network round-trip. Shaving each stage matters, but TTS streaming is one of the easiest wins.
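The sum above is easy to turn into a small budget calculator. In this sketch the TTS figures are the ones quoted in this article (75 ms and roughly 200 ms), while the LLM and network numbers are hypothetical placeholders you would replace with your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Perceived latency components for one voice-agent turn, in milliseconds."""
    llm_first_token_ms: float
    tts_first_audio_ms: float
    network_round_trip_ms: float

    def total_ms(self) -> float:
        return (self.llm_first_token_ms
                + self.tts_first_audio_ms
                + self.network_round_trip_ms)


# Hypothetical LLM (350 ms) and network (60 ms) figures; TTS figures from the article.
low_latency_tts = LatencyBudget(350, 75, 60)
slower_tts = LatencyBudget(350, 200, 60)

print(low_latency_tts.total_ms())  # 485
print(slower_tts.total_ms())       # 610
```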
Instead of generating the entire audio file and then sending it, a streaming TTS API sends the first audio chunk the moment it is ready — typically after processing the first sentence or a few hundred milliseconds of speech. The client can start playing that chunk while the API is still generating the rest. This approach cuts the perceived wait time dramatically: the user hears the first word in under a second even for a long reply.
ElevenLabs exposes streaming via both a standard HTTP chunked endpoint and a WebSocket connection that accepts incremental text and returns incremental audio. The WebSocket mode is especially useful when the LLM is also streaming tokens: instead of waiting for the full LLM response, you forward tokens to the TTS as they arrive, and the TTS starts speaking before the LLM has finished writing.
```python
# ElevenLabs streaming TTS (requires: pip install elevenlabs)
# Local playback through stream() also needs the mpv player installed.
from elevenlabs import stream
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_KEY")

# Recent SDK releases name this method `stream`; older 1.x releases
# called it `convert_as_stream`. Check the version you have installed.
audio_stream = client.text_to_speech.stream(
    text="Streaming means the user hears the first word almost immediately.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # any ElevenLabs voice ID
    model_id="eleven_flash_v2_5",  # low-latency model
)

# stream() plays the audio as chunks arrive, with no wait for the full file
stream(audio_stream)
```

OpenAI's TTS endpoint also supports chunked streaming of the response body (in the Python SDK, through the with_streaming_response helper). Kokoro running locally can pipe audio to a speaker or buffer in real time using the sounddevice or pyaudio library as generation proceeds sentence by sentence.
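The token-forwarding idea described above can be sketched without any network code. The generator below buffers streamed LLM tokens and releases a sentence as soon as it is complete, so each sentence can be handed to a TTS call while the LLM is still writing. It is a minimal sketch: the function name is hypothetical, the token list stands in for a real LLM stream, and the sentence rule is naive (it would split on abbreviations such as "Dr.").

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? once following whitespace has arrived,
# which confirms the terminator is not still mid-token.
SENTENCE_END = re.compile(r"[.!?]+\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they can be cut from the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever is left when the LLM stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


# Stand-in for a streaming LLM response.
llm_tokens = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
for sentence in sentences_from_tokens(llm_tokens):
    print(sentence)  # each sentence would be sent to the TTS API here
```

Sentence-level flushing is a trade-off: smaller chunks start audio sooner but give the TTS model less context for prosody. A WebSocket endpoint that accepts incremental text handles this buffering on the server side instead.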
Going deeper
The core pipeline above is the cleaned-up version. Production TTS adds several complications worth understanding before you ship.
SSML and audio tags. Providers offer varying ways to control prosody beyond the text itself. SSML (Speech Synthesis Markup Language) is the XML-based standard for inserting pauses (<break time="500ms"/>), changing rate and pitch, and spelling out acronyms. ElevenLabs v3 goes further with natural-language audio tags embedded in square brackets, such as [sighs], [laughs], and [whispers], that the model interprets expressively rather than mapping to a fixed rule. Learning the expressive vocabulary of whichever API you use is the main lever for turning passable TTS output into compelling audio.
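As a small illustration of the SSML side, the helper below joins sentences with break tags and escapes characters that would otherwise produce invalid XML. The function name is hypothetical, and which SSML elements a provider honors varies, so treat this as a sketch to check against your provider's documentation.

```python
from xml.sax.saxutils import escape


def ssml_with_pauses(sentences: list[str], pause_ms: int = 500) -> str:
    """Wrap sentences in a <speak> root with a fixed pause between them."""
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    break_tag = f'<break time="{pause_ms}ms"/>'
    # escape() protects &, < and > so user text cannot break the markup.
    body = break_tag.join(escape(sentence) for sentence in sentences)
    return f"<speak>{body}</speak>"


print(ssml_with_pauses(["Tom & Jerry", "Done."], pause_ms=300))
# <speak>Tom &amp; Jerry<break time="300ms"/>Done.</speak>
```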
Multilingual and code-switching. ElevenLabs Flash v2.5 supports 32 languages and can handle mid-sentence language switches reasonably well. OpenAI TTS is multilingual but relies on the text itself to detect language; there is no explicit language parameter. For heavily accented English or non-Latin scripts, always test with real samples and listen to the output: a quality regression in one language will not show up in unit tests that only check for a successful response, and per-character pricing means you pay the same for bad audio as for good.
Audio format choices. Most TTS APIs output MP3 by default (compressed, small files, fine for non-real-time use). For real-time streaming you may prefer PCM (raw uncompressed samples) or ulaw (8-bit companded audio, the format telephony networks use) because they avoid codec startup overhead and, in the case of ulaw, work directly with telephony systems like Twilio. ElevenLabs supports MP3, PCM, ulaw, Opus, and FLAC. OpenAI TTS outputs MP3, AAC, FLAC, Opus, and PCM. Match the format to your delivery target: browser audio player, telephony trunk, or local speaker.
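For the uncompressed formats, the bandwidth and chunk sizes follow directly from sample rate and sample width. The sketch below works through that arithmetic; the 20 ms frame is a common telephony packet size used here as an assumption, and the function names are made up.

```python
def bytes_per_second(sample_rate_hz: int, bytes_per_sample: int,
                     channels: int = 1) -> int:
    """Raw data rate of an uncompressed audio stream."""
    return sample_rate_hz * bytes_per_sample * channels


def frame_bytes(sample_rate_hz: int, bytes_per_sample: int,
                frame_ms: int, channels: int = 1) -> int:
    """Size in bytes of one audio frame of the given duration."""
    return bytes_per_second(sample_rate_hz, bytes_per_sample, channels) * frame_ms // 1000


# 16-bit mono PCM at 24 kHz (2 bytes per sample).
print(bytes_per_second(24_000, 2))            # 48000
# 8-bit mono ulaw at 8 kHz (1 byte per sample).
print(bytes_per_second(8_000, 1))             # 8000
# One 20 ms frame in each format.
print(frame_bytes(24_000, 2, frame_ms=20))    # 960
print(frame_bytes(8_000, 1, frame_ms=20))     # 160
```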
Beyond the current generation. The TTS landscape is moving fast. Speech-native multimodal models — like the audio mode in GPT-4o — skip the TTS stage entirely by generating audio tokens directly alongside text tokens. This preserves prosodic cues the LLM inferred while reasoning, lets the model laugh, hesitate, or gasp as part of its output, and collapses the voice agent pipeline from three components to one. Expect this architecture to become standard as the models mature, though dedicated TTS APIs will remain relevant for voice cloning, custom voices, and cost control for high-volume production traffic.
Evaluation. Mean Opinion Score (MOS) is the standard subjective quality metric: human raters score audio samples from 1 (bad) to 5 (excellent) and the average is the MOS. Top neural TTS models score 4.0–4.5, overlapping the lower end of human recordings. Automated alternatives like UTMOS (a neural MOS predictor) let you run quality checks programmatically at scale. For voice cloning, a separate speaker similarity score measures how closely the clone matches the reference speaker, typically by comparing embeddings from a speaker-verification model.
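Both metrics reduce to short formulas. The sketch below computes a MOS with a normal-approximation 95% interval and a cosine-based speaker similarity. The ratings and embedding vectors are made-up illustration data; in practice the embeddings would come from a speaker-verification model, which is not shown here.

```python
import math
from statistics import mean, stdev


def mean_opinion_score(ratings: list[int]) -> float:
    """Average of listener ratings on the 1 (bad) to 5 (excellent) scale."""
    if not ratings or any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be non-empty and within 1..5")
    return mean(ratings)


def mos_half_width_95(ratings: list[int], z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval around the MOS."""
    return z * stdev(ratings) / math.sqrt(len(ratings))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


ratings = [4, 5, 4, 3, 5]  # hypothetical listener scores for one sample
print(mean_opinion_score(ratings))           # 4.2
print(round(mos_half_width_95(ratings), 3))  # 0.733

reference = [0.6, 0.8]  # hypothetical embedding of the real speaker
clone = [0.8, 0.6]      # hypothetical embedding of the cloned voice
print(round(cosine_similarity(reference, clone), 2))  # 0.96
```

With only five raters the interval is wide, which is why published MOS figures normally rest on many listeners and many samples.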
FAQ
What is the best AI text-to-speech for natural-sounding voices?
ElevenLabs leads on perceived naturalness and expressiveness, particularly its Eleven v3 (Multilingual v3) model, which handles emotional inflection, pacing, and audio-tag scripting. For real-time applications where latency matters more than maximum expressiveness, ElevenLabs Flash v2.5 achieves around 75 ms to first audio byte. OpenAI TTS-1-HD is close in quality and significantly cheaper. If you need self-hosted privacy, Kokoro-82M reaches similar quality scores on leaderboards and runs on a consumer GPU.
How does AI voice cloning work?
Voice cloning extracts a numerical fingerprint of a voice (a speaker embedding) from a reference audio clip. The TTS model is then conditioned on that embedding so it generates speech in the reference voice rather than a default one. Instant cloning needs roughly one minute of clean audio and works at inference time with no training. Professional cloning fine-tunes the model on 30 or more minutes of audio for higher fidelity, capturing accent and emotional range more accurately.
How do I reduce latency in a TTS voice agent?
Three approaches compound well. First, choose a low-latency model: ElevenLabs Flash v2.5 targets 75 ms, versus 200 ms for OpenAI TTS. Second, use streaming: start playing audio as soon as the first chunk arrives rather than waiting for the full response. Third, pipeline the stages: forward LLM output tokens to the TTS API as they stream in rather than waiting for the complete LLM reply. Together these can cut perceived latency from several seconds to under half a second.
Is ElevenLabs more expensive than OpenAI TTS?
At the API level, yes: ElevenLabs Flash v2.5 costs roughly $103 per million characters compared to $15 per million for OpenAI TTS-1. OpenAI TTS-1-HD is $30 per million characters, still well below ElevenLabs. If you need voice cloning or significantly higher expressiveness, ElevenLabs' premium is usually justified. For straightforward narration without cloning, OpenAI TTS is often the better value.
Can I run AI text-to-speech locally without an API?
Yes. Kokoro-82M is an open-weight model you can download and run on your own hardware — a laptop GPU handles it at real-time speed. There are no per-character costs and audio never leaves your machine, which matters for privacy-sensitive applications. Other capable open-weight options include F5-TTS and Fish Speech. The tradeoff versus cloud APIs is that you manage installation, dependencies, and scaling yourself.
What audio format should I use for real-time TTS streaming?
For browser playback, MP3 or Opus works well. For telephony systems (Twilio, Vonage), ulaw or 8 kHz PCM is typically required to avoid transcoding overhead. For local playback or high-quality recording, PCM at 24 kHz or higher gives you maximum fidelity. Most TTS APIs let you specify the format and sample rate, so always match the format to what your downstream system consumes natively.