In plain English
Speaker diarization is the process of answering one deceptively simple question: who spoke when? You feed it a recording that contains multiple people talking — a podcast, a meeting, a court hearing, a doctor's consultation — and it divides the audio into segments, one per speaker turn, and labels each segment consistently: Speaker 1 said this, Speaker 2 said that.

Think of it like a stage play with no stage directions. A raw transcript is just a wall of dialogue: you know what was said but not who said it. Diarization adds the speaker cues back in, turning an undifferentiated block of text into a readable script. The word itself derives from "diarize", to record events in a diary, which traces back to the Latin diarium (a daily record). A diarized recording is one annotated with who spoke at each moment.
It is not the same as transcription. Transcription (done by models like Whisper) converts audio into words. Diarization figures out whose words they are. Most real applications need both: you run transcription to get the text, then diarization to attach speaker labels to each chunk of that text. The two pipelines are usually separate systems stitched together.
Why it matters
A transcript without speaker labels is only half-useful. Once you add diarization, an entirely different class of applications becomes possible.
- Meeting intelligence. Tools like Otter.ai and Fireflies attribute action items, questions, and commitments to the right person. Without diarization, you get a wall of words with no accountability.
- Call-center analytics. Customer service teams measure agent talk time vs. customer talk time, detect interruptions, and audit whether agents followed scripts. All of this requires knowing which side of the call is speaking at every moment.
- Legal and medical transcription. Court reporters and medical scribes need speaker-attributed records. A surgical note that does not distinguish surgeon from nurse from patient is a compliance risk.
- Podcast and media production. Editors search for a specific host's lines, auto-generate chapter summaries by speaker, or clip individual speakers for highlight reels.
- Voice AI evaluation. When you are building a voice agent, diarization lets you separate the agent's own TTS output from the user's live speech — crucial for measuring interruptions and turn-taking quality.
For builders, diarization is the connective tissue that makes an audio pipeline actually usable downstream. A language model asked to "summarize who agreed with what" needs attributed text, not a raw transcript. Diarization is what creates that attribution.
How it works
A modern diarization system is not a single model — it is a pipeline of four distinct stages, each solving a sub-problem. Understanding each stage helps you reason about where errors come from and which knob to turn when something goes wrong.
Stage 1 — Voice Activity Detection (VAD)
Before anything else, the system finds out which parts of the audio actually contain speech. This step, called VAD, filters out silence, music, background noise, and non-speech sounds. Getting VAD wrong — missing a sentence, or including noise as speech — cascades errors into every later stage. Libraries like Silero VAD or pyannote's own VAD module are commonly used here.
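To make the idea concrete, here is a minimal sketch of the simplest possible VAD: a fixed energy threshold over short frames. It is a toy, not how Silero VAD or pyannote work (both use neural networks), and the function name `energy_vad`, the 30 ms frame size, and the -35 dB threshold are illustrative assumptions.

```python
import math

def energy_vad(samples, sample_rate, frame_ms=30, threshold_db=-35.0):
    """Return (start_s, end_s) regions whose frame energy exceeds a fixed threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    regions, region_start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20 * math.log10(rms) if rms > 0 else -120.0
        is_speech = level_db > threshold_db
        t = i * frame_len / sample_rate
        if is_speech and region_start is None:
            region_start = t                      # a speech region opens
        elif not is_speech and region_start is not None:
            regions.append((region_start, t))     # the region closes
            region_start = None
    if region_start is not None:                  # audio ended mid-region
        regions.append((region_start, n_frames * frame_len / sample_rate))
    return regions

# Synthetic check: 1 s of silence, 1 s of a 220 Hz tone, 1 s of silence.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
signal = [0.0] * sr + tone + [0.0] * sr
regions = energy_vad(signal, sr)
print(regions)
```

A fixed threshold like this breaks down as soon as background noise is louder than quiet speech, which is why production systems use trained models instead.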
Stage 2 — Segmentation
The speech regions are then chopped into short segments — typically 0.5 to 10 seconds each — with cuts placed at likely speaker-change boundaries. Simple systems cut on silence; better ones detect acoustic change points (a shift in pitch, timbre, or energy pattern that signals a new speaker has started talking). Each segment is assumed to contain only one speaker.
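As a rough illustration, the sketch below uses uniform sliding windows, a common and simpler alternative to change-point detection: every speech region is cut into short overlapping windows, on the assumption that a window this short rarely spans a speaker change. The function name and the 1.5 s / 0.75 s / 0.5 s defaults are hypothetical choices for the example.

```python
def split_into_windows(regions, window_s=1.5, hop_s=0.75, min_s=0.5):
    """Cut each (start, end) speech region into overlapping fixed-length windows."""
    windows = []
    for start, end in regions:
        t = start
        while t < end:
            w_end = min(t + window_s, end)
            if w_end - t >= min_s:          # drop fragments too short to embed reliably
                windows.append((round(t, 3), round(w_end, 3)))
            if w_end >= end:
                break
            t += hop_s
    return windows

windows = split_into_windows([(0.0, 3.0), (5.0, 5.3)])
print(windows)
```

The second region is only 0.3 s long, so it yields no window at all, which is one way very short utterances get lost before they ever reach the embedding stage.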
Stage 3 — Speaker Embedding
This is the neural-network heart of modern diarization. Each segment is passed through a deep model — historically an x-vector network (a Time Delay Neural Network, or TDNN) and more recently ECAPA-TDNN — that compresses the segment's acoustic characteristics into a fixed-length vector of typically 192 or 256 numbers. This vector, called a speaker embedding, acts as a voice fingerprint: if two segments were spoken by the same person, their embeddings should be close together in vector space. The model is trained on thousands of speakers to learn what makes a voice distinctive, independent of what words were said.
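"Close together in vector space" is usually measured with cosine similarity. The sketch below shows the comparison on made-up 4-dimensional vectors; real embeddings have a few hundred dimensions and come from a trained model, so the numbers here are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (hypothetical values, not produced by a real model).
alice_1 = [0.9, 0.1, 0.2, 0.0]
alice_2 = [0.8, 0.2, 0.1, 0.1]
bob_1 = [0.1, 0.9, 0.0, 0.3]

same_speaker = cosine_similarity(alice_1, alice_2)
different_speaker = cosine_similarity(alice_1, bob_1)
print(f"same: {same_speaker:.2f}  different: {different_speaker:.2f}")
```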
Stage 4 — Clustering
With embeddings in hand, the system now groups them. Agglomerative hierarchical clustering (AHC) is the classic choice: it starts with every segment in its own cluster, then iteratively merges the two most similar clusters until a stopping criterion is met. Spectral clustering and Bayesian HMM clustering (VBx) are also widely used. The number of resulting clusters is the system's best guess for the number of speakers in the recording — and getting that count wrong is one of the most common failure modes. The clusters are then relabeled SPEAKER_00, SPEAKER_01, and so on.
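The merge loop described above can be sketched in a few lines. This is a minimal pure-Python version using average linkage on cosine distance with a distance threshold as the stopping criterion; the function names, the toy embeddings, and the 0.5 threshold are assumptions for illustration, and real toolkits use optimised implementations.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def agglomerative_cluster(embeddings, threshold=0.5):
    """Merge the two closest clusters until the closest pair is farther than threshold."""
    clusters = [[i] for i in range(len(embeddings))]   # every segment starts alone

    def linkage(c1, c2):                               # average pairwise distance
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:                           # stopping criterion
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [None] * len(embeddings)
    for k, members in enumerate(sorted(clusters, key=min)):
        for m in members:
            labels[m] = f"SPEAKER_{k:02d}"
    return labels

# Toy embeddings for four segments: two voices, interleaved.
segments = [
    [0.9, 0.1, 0.2, 0.0],
    [0.1, 0.9, 0.0, 0.3],
    [0.8, 0.2, 0.1, 0.1],
    [0.2, 0.8, 0.1, 0.2],
]
labels = agglomerative_cluster(segments, threshold=0.5)
merged = agglomerative_cluster(segments, threshold=0.7)
print(labels)
print(merged)
```

With the looser threshold the two voices collapse into a single label, which shows how directly the stopping criterion controls the estimated speaker count.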
Why diarization is hard
Diarization sounds straightforward — group the voice segments by who is speaking — but several real-world conditions routinely trip up even state-of-the-art systems.
Overlapping speech
The foundational assumption of clustering-based diarization is that each segment contains exactly one speaker. In real conversations, people interrupt each other constantly. Meetings typically have 10–30% overlap — two or more people talking at the same time. The standard pipeline simply cannot assign overlapping audio to two speakers simultaneously, so overlapping intervals are misassigned to one speaker or split incorrectly. Modern systems add an Overlap Speech Detection (OSD) module to flag these regions and handle them separately, but it remains a hard, unsolved problem in realistic multi-party settings.
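If you have a labelled recording, it is worth measuring how much overlap your own data contains before choosing a pipeline. The following sketch computes the share of speech time with two or more simultaneous speakers from a list of `(start, end, speaker)` tuples; the function name and input format are assumptions, and it presumes a speaker never overlaps with themselves.

```python
def overlap_ratio(segments):
    """Fraction of total speech time during which two or more speakers are active."""
    events = []
    for start, end, _speaker in segments:
        events.append((start, 1))    # a speaker becomes active
        events.append((end, -1))     # a speaker stops
    events.sort()                    # at equal times, stops sort before starts

    active, prev_t = 0, None
    speech, overlap = 0.0, 0.0
    for t, delta in events:
        if prev_t is not None:
            dt = t - prev_t
            if active >= 1:
                speech += dt
            if active >= 2:
                overlap += dt
        active += delta
        prev_t = t
    return overlap / speech if speech else 0.0

reference = [(0, 10, "A"), (8, 12, "B"), (12, 20, "A")]
ratio = overlap_ratio(reference)
print(f"{ratio:.0%} of speech time is overlapped")
```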
Speaker count uncertainty
The clustering stage needs to decide how many speakers there are. Too few clusters and speakers get merged into one label. Too many and the same speaker gets split across multiple labels. Auto-estimating the count is brittle: a speaker who says only two sentences produces very few embeddings, and the system may not recognise them as distinct.
Short utterances and similar voices
A two-word interjection like "right, yeah" produces a tiny, noisy embedding. Family members, colleagues who are in the same age group, or conference participants with similar accents can have embeddings so close together in vector space that clustering merges them into one label.
Noisy, far-field, or telephone audio
Background music, room echo, codec compression, and microphone distance all corrupt the acoustic signal. The embedding model was typically trained on cleaner audio, so it generalises less well to a phone call with a lot of background noise — or a podcast recorded outdoors.
| Problem | Impact | Mitigation |
|---|---|---|
| Overlapping speech | Segments misattributed or lost | Add OSD module; use EEND-based pipeline |
| Unknown speaker count | Speakers merged or split | Pass known num_speakers if available |
| Short utterances | Weak embeddings, wrong cluster | Tune min-segment length; use ECAPA-TDNN |
| Noisy audio | Embedding quality drops | De-noise pre-processing; fine-tune on domain audio |
| Similar-sounding voices | Wrong cluster assignment | Use finer clustering threshold; more training data |
Tools and APIs
The ecosystem splits cleanly into open-source libraries you host yourself and commercial APIs that handle the infrastructure for you.
Open-source: pyannote.audio
pyannote.audio (version 3.1 as of 2025) is the dominant open-source diarization toolkit. Developed at CNRS in France, it provides pre-trained segmentation and speaker-embedding models packaged into a single Pipeline.from_pretrained() call (the 3.1 pipeline ships a WeSpeaker ResNet-based embedding model; earlier 2.x releases used ECAPA-TDNN). The models are hosted on Hugging Face, and downloading them requires accepting a usage agreement and supplying an access token. pyannote is widely treated as the open-source reference point, and other tools, including WhisperX, build on top of it.
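A minimal usage sketch follows, assuming pyannote.audio 3.x, the `pyannote/speaker-diarization-3.1` checkpoint, and a valid Hugging Face token (argument names have changed in later major versions, so check the release you install). The wrapper `diarize_file` and the helper `format_turns` are hypothetical names; the import sits inside the function so that the formatting helper can be tried without the library installed.

```python
def diarize_file(audio_path, hf_token, num_speakers=None):
    """Run the pretrained pipeline and return plain dicts, one per speaker turn."""
    from pyannote.audio import Pipeline  # lazy import: needs pyannote.audio 3.x

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    # Passing num_speakers skips automatic speaker-count estimation when it is known.
    diarization = pipeline(audio_path, num_speakers=num_speakers)
    return [
        {"start": turn.start, "end": turn.end, "speaker": speaker}
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]

def format_turns(turns):
    """Render turns as 'start-end SPEAKER' lines for quick inspection."""
    return [f"{t['start']:.1f}-{t['end']:.1f} {t['speaker']}" for t in turns]

# Example with hand-written turns; with real audio you would call
# diarize_file("meeting.wav", "YOUR_HF_TOKEN", num_speakers=2) instead.
example_turns = [
    {"start": 0.5, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.4, "end": 7.9, "speaker": "SPEAKER_01"},
]
lines = format_turns(example_turns)
print("\n".join(lines))
```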
Open-source: WhisperX
WhisperX (GitHub: m-bain/whisperX) is the easiest way to get Whisper transcription plus speaker labels in one shot. It runs Whisper with word-level timestamps, then calls pyannote's pipeline to get speaker segments, and finally aligns word timestamps to speaker segments so every word in the transcript carries a SPEAKER_XX label. If you are building a transcription pipeline and want speaker attribution without writing the glue code yourself, WhisperX is the place to start.
import whisperx

# 1. Transcribe (segment-level timestamps at this stage)
model = whisperx.load_model("large-v2", device="cuda")
audio = whisperx.load_audio("meeting.wav")
result = model.transcribe(audio, batch_size=16)

# 2. Align to get precise word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device="cuda"
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device="cuda")

# 3. Diarize and assign speaker labels
# (in some newer WhisperX releases this class is imported from whisperx.diarize instead)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device="cuda")
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# 4. Print attributed transcript
for segment in result["segments"]:
    # a segment that overlaps no diarized turn may carry no "speaker" key
    print(f"[{segment.get('speaker', 'UNKNOWN')}] {segment['text']}")
Commercial APIs
If you need production reliability without managing GPU infrastructure, several APIs bundle diarization into their transcription endpoint.
| Provider | Key strength (vendor-reported) | Diarization flag |
|---|---|---|
| AssemblyAI | Lowest price ($0.17/hr), up to 30 speakers, 30% improvement in noisy environments (2025) | speaker_labels: true |
| Deepgram | Speed-first, 45+ languages, streaming diarization, processing time of roughly 10-20% of the audio duration | diarize: true |
| Gladia | Solaria-1 model, up to 3x lower DER than competing APIs on conversational speech | diarization: true |
| pyannote.ai | Hosted pyannote with phone-call and meeting specialisations | REST API |
NVIDIA NeMo (Sortformer)
NVIDIA NeMo introduced Sortformer, an end-to-end diarization system built on an 18-layer Transformer that replaces the traditional modular pipeline entirely. Rather than running VAD, then embedding, then clustering as separate steps, Sortformer learns to predict which speakers are active in each frame directly from the audio. Because several speakers can be marked active in the same frame, this approach handles overlapping speech more naturally, and it is a preview of where the field is heading.
Diarization vs. transcription vs. speaker identification
These three terms are often conflated. They solve different problems and are usually implemented as separate systems.
| Task | What it does | Output | Limitation or requirement | Examples |
|---|---|---|---|---|
| Transcription (ASR) | Converts audio to text | Raw words + timestamps | Does NOT identify speakers | Whisper, Deepgram Nova |
| Diarization | Segments audio by speaker | SPEAKER_00, SPEAKER_01... | Does NOT produce text | pyannote, WhisperX |
| Speaker identification | Matches a voice to a known person | 'This is Alice (92% confidence)' | Requires a reference voice sample | Azure Speaker Recognition |
A complete "who said what" pipeline chains all three: ASR produces the words, diarization assigns anonymous labels to voice turns, and speaker identification (optionally) resolves the labels to real names. Most tools like WhisperX combine the first two. The third step is almost always separate, and for privacy-sensitive applications like healthcare, it is often intentionally omitted.
Going deeper
Once you have the basics working, several advanced topics let you push quality further or handle specialised domains.
End-to-End Neural Diarization (EEND)
Traditional clustering-based pipelines are modular: each stage is a separate model trained independently. EEND (End-to-End Neural Diarization) replaces all stages with a single neural network that takes raw audio and directly outputs, for every frame, the probability that each speaker is active. Because it is trained jointly to minimise the diarization error, and because more than one speaker can be active in a frame, it can explicitly learn to handle overlapping speech, which the traditional pipeline cannot. The trade-off is that EEND scales poorly to large numbers of speakers and very long recordings. DiariZen (2025) is a leading open-source hybrid that combines a pruned WavLM encoder with powerset EEND classification and VBx clustering, aiming to combine the strengths of both approaches.
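To show what "per-frame probability that each speaker is active" means in practice, the sketch below turns such an output matrix into speaker segments by thresholding each speaker's column independently. The probabilities are invented, and the function name, the 0.1 s frame step, and the 0.5 threshold are illustrative assumptions rather than settings from any specific EEND model.

```python
def frames_to_segments(frame_probs, frame_s=0.1, threshold=0.5):
    """frame_probs[t][k] is the probability that speaker k is active in frame t."""
    n_speakers = len(frame_probs[0]) if frame_probs else 0
    segments = []
    for k in range(n_speakers):
        label = f"SPEAKER_{k:02d}"
        start = None
        for t, probs in enumerate(frame_probs):
            active = probs[k] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                segments.append((round(start * frame_s, 3), round(t * frame_s, 3), label))
                start = None
        if start is not None:  # speaker still active at the end of the audio
            end_s = round(len(frame_probs) * frame_s, 3)
            segments.append((round(start * frame_s, 3), end_s, label))
    return sorted(segments)

# Five frames, two speakers; both are active in the third frame (overlap).
probs = [
    [0.9, 0.1],
    [0.9, 0.2],
    [0.8, 0.7],
    [0.2, 0.9],
    [0.1, 0.9],
]
eend_segments = frames_to_segments(probs)
print(eend_segments)
```

Because each speaker is decoded independently, the two resulting segments are free to overlap in time, something a one-label-per-segment clustering pipeline cannot express.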
Speaker embeddings: from x-vectors to WavLM
The quality of your diarization depends heavily on the speaker embedding model. The evolution goes: i-vectors (2010s, statistical) → x-vectors (TDNN, 2018) → ECAPA-TDNN (2020, with squeeze-excitation and Res2Net blocks, used by pyannote 2.x pipelines) → WavLM-based embeddings (2024–2025, self-supervised pre-training followed by fine-tuning, highest accuracy on hard benchmarks). If you are fine-tuning a diarization system on your own domain data, the embedding model is often one of the most effective levers to pull.
Handling overlapping speech in production
For applications where overlap is frequent (panel discussions, family dinners, heated negotiations), consider:
- Using an OSD (Overlap Speech Detection) module to flag overlapping regions and mark them as ambiguous rather than forcing a wrong attribution.
- Using a multi-channel recording setup where each speaker has their own microphone channel, which largely sidesteps the acoustic overlap problem at the cost of hardware complexity.
- Using an end-to-end pipeline such as Sortformer, which is trained to label overlapping speakers directly.
Domain adaptation
Out-of-the-box diarization models are trained on a mix of datasets (telephone speech, broadcast, meetings). If your application has a distinct acoustic profile — say, surgical theatre audio with background beeping, or a courtroom with specific acoustics — you will likely see DERs significantly higher than the published benchmarks. Fine-tuning pyannote's segmentation model on even a small number of labelled examples from your domain can cut errors substantially. The pyannote documentation describes how to run this fine-tuning loop.
Real-time / streaming diarization
Standard pipelines are batch-mode: they process a complete file and look at the whole recording before outputting labels. Streaming diarization must assign speaker labels to short incoming chunks without seeing the future. This is harder, because you cannot cluster across the whole recording, but providers like Deepgram expose streaming diarization endpoints. For self-hosted use, the open-source diart library builds on pyannote models to process audio in overlapping windows, trading some accuracy for latency.
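One simple way to cluster without seeing the future is to keep a running centroid per speaker and assign each new embedding to the nearest one, opening a new speaker when nothing is similar enough. The sketch below illustrates that idea only; it is not how Deepgram or any particular library implements streaming, and the class name, toy embeddings, and 0.6 threshold are assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class OnlineSpeakerTracker:
    """Assign each incoming embedding to a running speaker centroid."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.centroids = []   # one mean embedding per speaker seen so far
        self.counts = []      # number of embeddings folded into each centroid

    def assign(self, embedding):
        best_k, best_sim = None, -1.0
        for k, centroid in enumerate(self.centroids):
            sim = cosine_similarity(embedding, centroid)
            if sim > best_sim:
                best_k, best_sim = k, sim
        if best_k is None or best_sim < self.threshold:
            # Nothing similar enough: open a new speaker.
            self.centroids.append(list(embedding))
            self.counts.append(1)
            best_k = len(self.centroids) - 1
        else:
            # Fold the new embedding into the running mean.
            n = self.counts[best_k]
            self.centroids[best_k] = [
                (c * n + x) / (n + 1) for c, x in zip(self.centroids[best_k], embedding)
            ]
            self.counts[best_k] = n + 1
        return f"SPEAKER_{best_k:02d}"

tracker = OnlineSpeakerTracker()
stream = [
    [0.9, 0.1, 0.2, 0.0],
    [0.1, 0.9, 0.0, 0.3],
    [0.8, 0.2, 0.1, 0.1],
    [0.2, 0.8, 0.1, 0.2],
]
stream_labels = [tracker.assign(e) for e in stream]
print(stream_labels)
```

Unlike batch clustering, an early wrong decision here is never revisited, which is one source of the accuracy gap between streaming and offline diarization.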
FAQ
Does Whisper do speaker diarization on its own?
No. OpenAI's Whisper is a transcription (ASR) model only — it outputs text with timestamps but has no mechanism for identifying or separating speakers. To add speaker labels, you need to combine Whisper with a separate diarization library. WhisperX is the most popular way to do this: it wraps Whisper for word-level timestamps and calls pyannote.audio under the hood for diarization, giving you a fully attributed transcript in one library.
What does a diarization output actually look like?
Most libraries return a list of segments, each with a start time, end time, and speaker label, like {start: 0.5, end: 3.2, speaker: 'SPEAKER_00'}. When combined with a transcript, each word or sentence is annotated with the matching speaker label. The labels are anonymous identifiers — the system does not know the speakers' names, only that they are distinct voices.
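The step that combines the two outputs is usually a timestamp join: each word is given the label of the speaker turn it overlaps most. Here is a minimal sketch of that join, assuming words and turns are plain dicts with `start` and `end` in seconds; the function name and data are hypothetical, and libraries such as WhisperX ship their own version of this logic.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the longest."""
    labelled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            overlap = min(word["end"], turn["end"]) - max(word["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labelled.append({**word, "speaker": best_speaker})  # None if no turn overlaps
    return labelled

turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 4.0, "speaker": "SPEAKER_01"},
]
words = [
    {"word": "hello", "start": 0.2, "end": 0.6},
    {"word": "there", "start": 1.8, "end": 2.4},   # straddles the turn boundary
    {"word": "ok", "start": 5.0, "end": 5.2},      # falls outside every turn
]
attributed = assign_speakers(words, turns)
for w in attributed:
    print(w["speaker"], w["word"])
```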
How many speakers can diarization handle?
It depends on the tool and the recording conditions. AssemblyAI supports up to 30 speakers per file. pyannote and WhisperX work best with 2–8 speakers; performance degrades significantly above 10–12, partly because short utterances from many speakers produce noisy embeddings, and partly because clustering algorithms struggle to find the right number of clusters. Very large meeting diarization (20+ participants) remains an active research area.
What is diarization error rate (DER) and what is a good score?
DER is the total duration of diarization errors (missed speech, false-alarm speech, and speech attributed to the wrong speaker) divided by the total duration of reference speech. Lower is better, and because false alarms count against the score, DER can exceed 100%. State-of-the-art systems score around 11% DER on the AMI meeting benchmark and 14.5% on DIHARD III (the hardest public benchmark). In controlled two-speaker recordings like phone calls, commercial systems regularly achieve DERs below 5%. For most production applications with 2–4 speakers and reasonable audio quality, expect DERs in the 5–15% range.
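A simplified frame-level version of the metric can be sketched as follows. It assumes equal-length frames, at most one speaker per frame (no overlap), and no forgiveness collar around boundaries, and it finds the best mapping between hypothesis and reference labels by brute force, which is only practical for a handful of speakers. The function name is hypothetical; for real evaluation, use an established scoring tool such as pyannote.metrics.

```python
from itertools import permutations

def diarization_error_rate(reference, hypothesis):
    """Frame-level DER; each list holds one speaker label (or None) per frame."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")
    total_ref = sum(r is not None for r in reference)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    pairs = list(zip(reference, hypothesis))
    missed = sum(1 for r, h in pairs if r is not None and h is None)
    false_alarm = sum(1 for r, h in pairs if r is None and h is not None)

    # Hypothesis labels are anonymous, so score under the most favourable
    # one-to-one mapping onto reference speakers.
    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})
    candidates = ref_speakers + [None] * max(0, len(hyp_speakers) - len(ref_speakers))
    confusion = min(
        sum(1 for r, h in pairs if r is not None and h is not None and mapping[h] != r)
        for mapping in (
            dict(zip(hyp_speakers, perm))
            for perm in permutations(candidates, len(hyp_speakers))
        )
    )
    return (missed + false_alarm + confusion) / total_ref

ref = ["A", "A", "A", "A", "A", None, None, "B", "B", "B"]
hyp = ["x", "x", "x", "x", None, None, "y", "y", "y", "x"]
score = diarization_error_rate(ref, hyp)
print(f"DER = {score:.1%}")
```

In this example one frame is missed, one is a false alarm, and one is assigned to the wrong speaker, giving three error frames against eight frames of reference speech.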
Can diarization identify who the speakers are, not just that they are different?
Not by itself. Diarization only tells you that SPEAKER_00 and SPEAKER_01 are different people. To attach real names, you need an additional speaker identification or speaker verification step: you provide a reference audio clip of each person, extract their embedding, and compare it against the embedding of each diarized speaker cluster. Azure Cognitive Services Speaker Recognition and NVIDIA NeMo both support this. Note that speaker identification raises significant privacy considerations and in many jurisdictions requires explicit user consent.
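The matching step can be sketched as a nearest-neighbour lookup with a rejection threshold: each anonymous cluster takes the name of the most similar enrolled voice, or keeps its anonymous label if nothing is similar enough. The function name, toy embeddings, and 0.7 threshold below are assumptions for illustration and do not reflect any specific vendor's API.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def identify_speakers(cluster_centroids, enrolled, threshold=0.7):
    """Map each anonymous label to an enrolled name, or keep the label if no match."""
    names = {}
    for label, centroid in cluster_centroids.items():
        best_name, best_sim = None, threshold
        for name, reference in enrolled.items():
            sim = cosine_similarity(centroid, reference)
            if sim >= best_sim:
                best_name, best_sim = name, sim
        names[label] = best_name if best_name is not None else label
    return names

# Enrolled reference embeddings (toy values standing in for real voice samples).
enrolled = {
    "Alice": [0.9, 0.1, 0.2, 0.0],
    "Bob": [0.1, 0.9, 0.0, 0.3],
}
# Mean embedding of each diarized cluster.
cluster_centroids = {
    "SPEAKER_00": [0.8, 0.2, 0.1, 0.1],
    "SPEAKER_01": [0.2, 0.8, 0.1, 0.2],
    "SPEAKER_02": [0.0, 0.0, 1.0, 0.0],   # an unenrolled guest
}
names = identify_speakers(cluster_centroids, enrolled)
print(names)
```

Keeping unmatched clusters anonymous, rather than forcing the closest name, avoids attributing a guest's words to an enrolled person.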
Why does diarization fail badly when people talk over each other?
Classical diarization pipelines assume each audio segment contains exactly one speaker. When two people talk simultaneously, the acoustic signal is a physical mix of both voices, so the extracted embedding is a blend of both voice fingerprints. The clustering stage cannot assign that blend cleanly to either speaker. Modern overlap-aware systems add an Overlap Speech Detection module that flags these regions, and end-to-end approaches like EEND model overlapping speech explicitly during training, but even these are imperfect when overlap exceeds 30% of total speech time.