What Is Speech-to-Text? How Whisper-Style ASR Models Work

Understand how modern ASR models like Whisper convert audio into text, and transcribe your first recording with a few lines of code.

BEGINNER · 11 MIN READ · UPDATED 2026-06-11

In plain English

Speech-to-text is exactly what it sounds like: you give a computer a recording of someone talking, and it hands back the words they said as plain text. The technical name is automatic speech recognition, or ASR, and you've used it a hundred times without thinking — every time you dictate a message, ask a voice assistant a question, or read auto-generated captions on a video.

Think of it like a court stenographer who never tires. A stenographer listens to a stream of sound and types out the matching words in real time. They don't need to understand the case to do their job — they just need to reliably map the noises coming out of a mouth onto the right letters on a page. ASR is that stenographer rebuilt in software: audio goes in, text comes out.

The trick is that human speech is messy. People mumble, talk over each other, use slang, switch languages mid-sentence, and do it all against the hum of an air conditioner or a noisy café. The whole story of speech-to-text is the slow, decades-long fight to make a machine that hears all that mess and still types the right words. Modern models like OpenAI's Whisper finally got good enough that the technology faded into the background — which is the surest sign something works.

Why it matters

Speech is how humans naturally communicate, but for most of computing history machines could only read what you typed. Speech-to-text removes the keyboard from the equation, and that unlocks a long list of things that were impossible or painful before.

Accessibility. Live captions let deaf and hard-of-hearing people follow a meeting, a lecture, or a livestream. Dictation lets people who can't comfortably type still write.
Voice interfaces. Every voice assistant, in-car command system, and smart speaker starts by turning your speech into text it can act on. No speech-to-text (STT), no voice control.
Turning audio into searchable data. Hours of podcasts, call-center recordings, interviews, and video footage are a black box until they're transcribed. Once they're text, you can search them, summarize them, or feed them to a retrieval system.
Productivity. Meeting note-takers, medical scribes that draft a doctor's notes from the consultation, and "just talk to it" coding and writing tools all sit on top of speech-to-text.

Who should care? Anyone building a voice agent, a transcription product, a captioning tool, or any app where users would rather talk than type — which, increasingly, is most apps. Speech-to-text is also the front door to multimodal AI: it's how spoken words reach a large language model that can then reason about them.

What did it replace? For decades, ASR was a brittle stack of hand-tuned components — a separate acoustic model, a pronunciation dictionary, and a language model, each trained and bolted together by specialists. Systems like Dragon NaturallySpeaking needed you to train them on your own voice by reading scripted passages for half an hour, and they still stumbled on accents and background noise. The shift to a single end-to-end neural network, and especially Whisper's 2022 release as a free open model that worked out of the box on dozens of languages, collapsed all of that into one component anyone could call.

How it works

A recording is just a wave — air pressure wiggling over time, captured as a long list of numbers (the samples). A typical model wants those numbers at 16,000 samples per second. But you don't feed that raw wave straight into the model. First you convert it into a picture of the sound.

Step 1: turn sound into a spectrogram

The audio is sliced into tiny overlapping windows (around 25 milliseconds each), and for every window a preprocessing step measures how much energy sits at each frequency, from low rumble up to high hiss. Stack those measurements side by side and you get a mel spectrogram: a heatmap where the horizontal axis is time, the vertical axis is frequency (spaced on the mel scale, which roughly follows how human hearing perceives pitch), and brightness is loudness. It looks a bit like sheet music, and it's far easier for a network to read than a raw waveform because the patterns of speech (vowels, consonants, syllables) show up as visible shapes.
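The windowing step can be sketched in plain Python. This is a minimal illustration, not Whisper's actual preprocessing: the 10 ms hop between windows is an assumption, the DFT is a naive one written out for clarity (a real pipeline would use an FFT library), and the mel filterbank and log scaling are left out. The names `frames` and `power_spectrum` are hypothetical helpers.

```python
import cmath
import math

SAMPLE_RATE = 16_000   # samples per second, as in the text
WINDOW = 400           # 25 ms at 16 kHz
HOP = 160              # 10 ms step between windows (assumed, not from the text)

def frames(samples, window=WINDOW, hop=HOP):
    """Slice a list of samples into overlapping windows."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, hop)]

def power_spectrum(frame):
    """Energy at each frequency bin, using a Hann window and a naive DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(total) ** 2)
    return spectrum

# Synthetic test signal: 0.1 s of a pure 1 kHz tone.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(1600)]
tone_frames = frames(tone)

# One column of the "picture": the spectrum of the first window.
spec = power_spectrum(tone_frames[0])
peak_bin = max(range(len(spec)), key=spec.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / WINDOW
print(len(tone_frames), "frames; loudest frequency:", peak_hz, "Hz")
```

Each call to `power_spectrum` produces one vertical slice of the heatmap. Computing it for every frame and stacking the results left to right gives a plain spectrogram; the mel version additionally pools the bins into a smaller number of perceptually spaced bands.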

Step 2: the neural network reads the picture and writes text

Whisper uses a transformer in an encoder-decoder design — the same family of architecture behind modern language models. The encoder reads the whole spectrogram and builds a rich internal summary of what was said. The decoder then writes out the transcript one token at a time, where each token is a piece of a word, exactly like an LLM generating text. At each step it predicts the next most likely token given the audio and everything it has written so far.
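The "one token at a time" loop can be sketched with a toy stand-in for the decoder. Everything here is illustrative: `TOY_SCORES` is a hard-coded, hypothetical table, whereas a real decoder computes these scores from the encoder's audio summary. The loop shown is greedy decoding, the simplest strategy; real systems often use beam search instead.

```python
END = "<|endoftext|>"

# Hypothetical scores: for each sequence written so far, how likely each
# candidate next token is. A real model would compute these from the audio.
TOY_SCORES = {
    (): {"hel": 0.9, "yel": 0.1},
    ("hel",): {"lo": 0.8, "p": 0.2},
    ("hel", "lo"): {" world": 0.7, END: 0.3},
    ("hel", "lo", " world"): {END: 0.95, "s": 0.05},
}

def next_token_scores(audio_summary, tokens):
    """Toy decoder step; ignores the audio and looks up the table."""
    return TOY_SCORES[tuple(tokens)]

def greedy_decode(audio_summary, max_tokens=10):
    """Repeatedly append the highest-scoring token until the end token."""
    tokens = []
    for _ in range(max_tokens):
        scores = next_token_scores(audio_summary, tokens)
        best = max(scores, key=scores.get)
        if best == END:
            break
        tokens.append(best)
    return "".join(tokens)

transcript = greedy_decode(audio_summary=None)
print(transcript)
```

Note that tokens are word pieces ("hel", "lo"), not whole words, which is why the transcript is built by joining them without extra spaces.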

// The speech-to-text pipeline

Audio in (raw waveform, 16 kHz) → Mel spectrogram (sound to picture) → Encoder (reads the whole clip) → Decoder (writes tokens) → Text out (the transcript)

The reason Whisper works so well isn't a clever architecture — encoder-decoder transformers were years old by 2022. It's the data. Whisper was trained on roughly 680,000 hours of audio paired with transcripts, scraped from across the internet. That's why it's robust to accents, background noise, and technical jargon out of the box: it heard a staggering variety of real-world audio during training. This approach is called weak supervision — the transcripts aren't perfect, but at that scale the sheer volume teaches the model to generalize.

More than one task

Whisper is also multitask. The same model can transcribe speech in its original language, translate non-English speech directly into English text, detect which language is being spoken, and add timestamps. It picks which job to do based on special control tokens the decoder is told to start with — a single network doing four things, no separate models required.
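As a rough sketch of how the control tokens select a task, the helper below builds the sequence the decoder is told to start with. The token spellings follow the format described in the Whisper paper; `decoder_prefix` itself is a hypothetical function for illustration, not part of any library.

```python
def decoder_prefix(language="en", task="transcribe", timestamps=True):
    """Build the control-token prefix that tells the decoder which job to do."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Opt out of timestamp prediction.
        prefix.append("<|notimestamps|>")
    return prefix

# French audio in, English text out, no timestamps.
print(decoder_prefix(language="fr", task="translate", timestamps=False))
```

Language detection fits the same scheme: instead of being given the language token, the model is asked to predict it right after the start token. With the openai-whisper package you normally don't build this prefix yourself; passing `task="translate"` to `model.transcribe(...)` is the usual way to switch jobs.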

Transcribe your first file

Talk is cheap; let's transcribe something. You have two broad options: run an open model like Whisper locally on your own machine, or call a hosted API and let someone else run the GPU. Here's the local route with the original openai-whisper package — no API key, no internet once the model is downloaded.

install (bash)

pip install -U openai-whisper
# ffmpeg handles audio decoding; Whisper needs it installed separately
# macOS: brew install ffmpeg   |   Ubuntu: sudo apt install ffmpeg

transcribe.py (python)

import whisper

# "base" is a good starting point. Bigger = more accurate, slower:
# tiny < base < small < medium < large. Use "large" for best quality.
model = whisper.load_model("base")

# Point it at any audio or video file (mp3, wav, m4a, mp4, ...).
result = model.transcribe("meeting.mp3")

print(result["text"])  # the full transcript as one string

# Each segment also carries start/end timestamps — handy for captions.
for seg in result["segments"][:3]:
    print(f"[{seg['start']:.1f}s -> {seg['end']:.1f}s] {seg['text']}")

That's the entire program. The first run downloads the model weights; after that it works offline. If you'd rather not manage GPUs and just want a transcript back over HTTP, a hosted speech-to-text API is typically just as short: you upload the file, you get text. See the LLM API basics for the general pattern of calling a model over the network.
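Since the segments carry timestamps, turning them into a caption file is mostly string formatting. The sketch below converts a list of segments into SubRip (.srt) text. It assumes each segment is a dict with `start`, `end`, and `text` keys, as in the script above; the example segments are made up, and `srt_timestamp` and `segments_to_srt` are hypothetical helper names.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments):
    """Turn transcript segments into numbered SRT caption blocks."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Made-up segments in the same shape as result["segments"].
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome, everyone."},
    {"start": 2.5, "end": 61.04, "text": " Let's get started."},
]
srt_text = segments_to_srt(example_segments)
print(srt_text)
```

In a real script you would pass `result["segments"]` instead of the example list and write `srt_text` to a file ending in `.srt`.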

Measuring quality: word error rate

How do you know if a speech-to-text system is any good? The industry-standard metric is word error rate (WER). You take the model's transcript, compare it word-by-word against a correct human reference, and count three kinds of mistakes.

Error type	What it means	Example
Substitution	Heard the wrong word	"speech" → "beach"
Deletion	Dropped a word entirely	skipped "the"
Insertion	Added a word that wasn't said	invented "um"

WER is the total number of those errors divided by the number of words in the reference. So a WER of 0.05 (5%) means one mistake in every twenty words. Lower is better; 0% is a perfect transcript. On clean English audio, strong modern models land in the low single digits — roughly human-level. WER climbs fast, though, with heavy accents, crosstalk, rare names, and noisy recordings, which is exactly where systems still differ.
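Written as a formula, WER = (substitutions + deletions + insertions) / words in the reference. Finding the smallest set of errors is a word-level edit distance problem, which the sketch below solves with the standard dynamic-programming approach. The lowercasing and whitespace splitting are simplifying assumptions; production evaluations usually normalize punctuation and numbers more carefully, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = fewest edits to turn the ref words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "please call the front desk before nine o'clock tomorrow morning"
hypothesis = "please call front desk before nine o'clock um tomorrow mourning"

# One deletion ("the"), one insertion ("um"), one substitution ("mourning").
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}")
```

Because insertions count as errors but do not add to the reference length, WER can exceed 100% on a very bad transcript.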

Choosing a speech-to-text model

There's no single "best" speech-to-text model — the right choice depends on whether you care most about accuracy, speed, cost, privacy, or live streaming. The big fork in the road is self-hosted open model vs hosted API.

// Self-hosted vs hosted API

Self-hosted (Whisper)

Free model weights, you run it
Audio never leaves your machine
Works fully offline
You manage GPUs and scaling
Great for privacy + bulk jobs

Hosted API

Pay per minute of audio
Audio sent to a provider
No infrastructure to run
Easiest to start, scales itself
Great for quick integration

A few real names you'll meet. Whisper (OpenAI) is the open-source baseline everyone compares against — free, multilingual, runs anywhere. NVIDIA NeMo and Meta's wav2vec 2.0 are other open toolkits/models, the latter especially common as a base for fine-tuning on a specific domain. On the commercial side, providers like Deepgram and AssemblyAI offer fast hosted APIs with extras like speaker labels and live streaming. The major cloud platforms (Google, Microsoft Azure, Amazon) all sell speech-to-text too.

Two distinctions worth knowing before you pick. Batch vs streaming: transcribing a finished file is batch; captioning a live call word-by-word as it's spoken is streaming, which is harder and not every model supports it. And model size: Whisper alone ships in sizes from tiny to large — the tiny one runs on a laptop CPU but makes more mistakes, the large one needs a GPU but rivals a human. Match the size to your accuracy and latency budget.

Going deeper

The pipeline above is the clean textbook version. Production speech-to-text adds several layers that the basics gloss over.

Hallucination on silence. Because Whisper's decoder is a generative language model, it can invent text when the audio gives it little to work with — long silences, music, or pure noise sometimes produce confident phantom sentences (a stray "thank you for watching" is the classic). Voice activity detection (VAD), which strips out non-speech before transcription, is the standard guardrail. It's the same failure mode as a text model that makes things up when it's unsure, just in audio form.
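To make the idea of VAD concrete, here is a deliberately naive sketch that marks a frame as speech when its loudness (RMS energy) crosses a threshold. This is an illustration only: production VADs are typically small trained models, and the frame length, the threshold, and the function name `speech_regions` are all assumptions chosen for the example.

```python
import math

def speech_regions(samples, sample_rate=16_000, frame_ms=30, threshold=0.02):
    """Return (start_sec, end_sec) spans whose RMS energy exceeds threshold."""
    frame_len = sample_rate * frame_ms // 1000
    regions = []
    start = None  # sample index where the current loud span began
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold and start is None:
            start = i
        elif rms < threshold and start is not None:
            regions.append((start / sample_rate, i / sample_rate))
            start = None
    if start is not None:  # audio ended while still loud
        regions.append((start / sample_rate, len(samples) / sample_rate))
    return regions

# Synthetic clip: 0.3 s silence, 0.6 s of a tone standing in for speech,
# then 0.3 s silence.
silence = [0.0] * 4800
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16_000) for t in range(9600)]
audio = silence + tone + silence

regions = speech_regions(audio)
print(regions)
```

Only the returned spans would be sent on for transcription, so the decoder never sees the long silences that tend to trigger phantom text. An energy threshold cannot tell speech from music or loud noise, which is one reason real systems use trained detectors.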

Speaker diarization. Plain ASR gives you words but not who said them. Diarization is the separate task of segmenting audio by speaker — labeling turns as Speaker 1, Speaker 2, and so on — and it's usually a distinct model layered on top of transcription. Getting both the words and the speakers right is what turns a raw transcript into a usable meeting record.
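One simple way to combine the two outputs, sketched below, is to give each transcript segment the speaker whose turn overlaps it the most in time. The data is invented for illustration, the dict shapes are assumptions (transcript segments as in the script earlier, speaker turns as `start`/`end`/`speaker`), and `label_segments` is a hypothetical helper rather than any diarization library's API.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Seconds shared by two time spans (0.0 if they don't touch)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, turns):
    """Attach to each transcript segment the speaker with the most overlap."""
    labeled = []
    for seg in segments:
        best = max(
            turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
        )
        labeled.append({**seg, "speaker": best["speaker"]})
    return labeled

# Made-up output from a transcription model...
segments = [
    {"start": 0.0, "end": 4.0, "text": "Shall we start?"},
    {"start": 4.0, "end": 9.0, "text": "Yes, go ahead."},
]
# ...and from a separate diarization model.
turns = [
    {"start": 0.0, "end": 4.2, "speaker": "Speaker 1"},
    {"start": 4.2, "end": 10.0, "speaker": "Speaker 2"},
]

labeled = label_segments(segments, turns)
for seg in labeled:
    print(f"{seg['speaker']}: {seg['text']}")
```

The boundaries from the two models rarely line up exactly, which is why the match is by overlap rather than by equal timestamps. A segment that spans a speaker change still gets a single label here; handling that well requires word-level timestamps.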

Streaming and latency. Live captioning can't wait for a sentence to finish. Streaming systems emit partial, revisable guesses as audio arrives and refine them a fraction of a second later. The engineering tension is between latency (show words fast) and accuracy (a little more context fixes mistakes), and tuning that tradeoff is most of the work in a real-time voice product.
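A rough sketch of "partial, revisable guesses": treat a word as final once two consecutive hypotheses agree on it, and show everything after that as tentative. This common-prefix heuristic is one possible approach, not a description of how any particular product works; the hypotheses are invented, and `stream_captions` is a hypothetical name.

```python
def common_prefix(a, b):
    """Longest run of leading words shared by two word lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix

def stream_captions(hypotheses):
    """Yield (committed, tentative) text after each new hypothesis."""
    committed = []
    previous = []
    for hyp in hypotheses:
        words = hyp.split()
        stable = common_prefix(previous, words)
        if len(stable) > len(committed):
            committed = stable  # two hypotheses in a row agree on these words
        yield " ".join(committed), " ".join(words[len(committed):])
        previous = words

# Successive guesses as more audio arrives; the first one gets revised.
hypotheses = [
    "I scream",
    "ice cream is",
    "ice cream is great",
    "ice cream is great today",
]
updates = list(stream_captions(hypotheses))
for committed, tentative in updates:
    print(f"{committed!r} + tentative {tentative!r}")
```

The tradeoff from the paragraph above shows up directly: waiting for agreement delays each committed word by one update, but it keeps the early misheard "I scream" from ever being shown as final.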

The voice agent stack. Speech-to-text is one stage of a larger loop. A modern voice agent chains STT → a language model → text-to-speech, often with tool use in the middle so it can actually do things. The frontier now is collapsing that chain: speech-native multimodal models take audio in and emit audio out directly, skipping the text round-trip entirely to cut latency and preserve tone, emotion, and timing that get lost when you flatten speech to plain text. That's part of the broader move toward multimodal AI that handles sound, images, and video as naturally as it handles words — and it pairs with the same shift on the visual side, where vision-language models let one model see as well as hear.

The honest open problems are stubborn. Code-switching (flipping languages mid-sentence) still trips models up. Low-resource languages with little training audio lag far behind English. Heavy accents, children's speech, and overlapping voices remain hard. And no metric fully captures "is this transcript actually useful?" — WER misses that a single wrong digit can ruin an otherwise flawless transcript. Speech-to-text feels solved on a clean podcast and very much isn't on a noisy three-person call.

FAQ

What is the difference between speech-to-text and ASR?

They're the same thing. "Speech-to-text" is the plain-English name and "automatic speech recognition" (ASR) is the technical one. Both describe software that takes spoken audio and returns the words as written text. Speech-to-text is just the more searchable, everyday label.

How does Whisper AI work?

Whisper converts your audio into a mel spectrogram — a picture of the sound's frequencies over time — then feeds that into an encoder-decoder transformer. The encoder summarizes the whole clip and the decoder writes out the transcript one token at a time, like a language model. It was trained on about 680,000 hours of audio, which is why it handles accents and noise well without any setup.

Is Whisper free to use?

The Whisper model is open source under the MIT license, so you can download the weights and run them on your own hardware for free, including commercially. You only pay if you use a hosted API that runs Whisper (or another model) for you, which is billed per minute of audio.

What is the best speech-to-text model?

There's no single best — it depends on your priorities. Whisper (large size) is the strong free baseline and great for privacy and offline use. Hosted services like Deepgram and AssemblyAI often lead on speed, live streaming, and extras like speaker labels. Match the choice to whether you care most about accuracy, latency, cost, or keeping audio on your own machine.

How accurate is speech-to-text?

On clean English audio, top models reach a word error rate in the low single digits (roughly one mistake per twenty to fifty words), near human level. Accuracy drops with background noise, strong accents, crosstalk, and rare names or jargon. Those harder conditions are where models still differ most.

Can speech-to-text run offline on my own computer?

Yes. Open models like Whisper download once and then run fully offline, with no audio ever leaving your machine. The smaller sizes (tiny, base) run on a laptop CPU; the large size is much more accurate but wants a GPU. Tools like faster-whisper make local transcription faster and lighter on memory.
