Whisper

General-purpose speech recognition that transcribes and translates audio in many languages

github.com/openai/whisper (★ 103k) · openai.com/research/whisper

Overview

Whisper is an open-source speech recognition model from OpenAI. It is trained on a large, diverse audio dataset and handles several tasks at once: multilingual speech recognition, speech translation, and spoken language identification. Under the hood it is a Transformer sequence-to-sequence model that treats these tasks as one sequence of tokens to predict, so a single model replaces many stages of a traditional speech pipeline.
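
To make the "one sequence of tokens" idea concrete, the following is a minimal sketch of how a task can be written as a short prefix of special tokens that the decoder then continues with text. The token spellings are believed to match Whisper's multitask format but should be treated as an assumption, and the helper name build_task_prefix is hypothetical. It only builds strings and never calls the model.

```python
from typing import List

TASKS = ("transcribe", "translate")


def build_task_prefix(language: str, task: str, timestamps: bool = False) -> List[str]:
    """Return the special-token prefix that tells the decoder what to do.

    `language` is a short language code such as "en" or "ja" (assumed format).
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask for plain text without timestamp tokens.
        prefix.append("<|notimestamps|>")
    return prefix
```

Switching from transcription to translation changes only one token in this prefix, which is why a single model can serve both tasks.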

It is aimed at developers and researchers who need to turn audio into text without calling a hosted API. You install it as a Python package and run it locally, choosing from six model sizes that trade off speed against accuracy and VRAM. The tiny, base, small, and medium sizes also come in English-only variants (.en) alongside the multilingual ones, and the turbo model is optimized for faster transcription.

In the speech and audio category, Whisper covers two common needs in one tool: transcribing speech to text in its original language, and translating non-English speech into English. The default turbo model is optimized for fast transcription and works well for English, but it is not trained for translation; for that, use one of the other multilingual models.

What it does

Multitasking model: multilingual speech recognition, speech translation, and language identification in one model
Six model sizes (tiny, base, small, medium, large, turbo) trading off speed, accuracy, and VRAM (~1 GB to ~10 GB)
English-only .en variants that tend to perform better for the smaller tiny and base sizes
Command-line tool that transcribes common audio formats like .flac, .mp3, and .wav
Translation of non-English speech into English with --task translate on a multilingual model other than turbo
Optional language hint via --language for audio whose input language is known
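
The size trade-off in the list above can be sketched as a small lookup. The per-model VRAM figures are approximate values recalled from the project's README and may change, so verify them against the repo. The helper pick_model and its "largest model that fits" rule are hypothetical simplifications that ignore speed.

```python
from typing import Optional

# (name, approximate VRAM in GB), ordered from smallest to largest footprint.
# Figures are assumptions taken from the project's README; check the repo.
MODELS = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("turbo", 6), ("large", 10)]

# Sizes that also ship as English-only ".en" variants.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def pick_model(vram_gb: float, english_only: bool = False) -> Optional[str]:
    """Return the last model in MODELS that fits in `vram_gb`, or None."""
    choice = None
    for name, need_gb in MODELS:
        if english_only and name not in ENGLISH_ONLY_SIZES:
            continue
        if need_gb <= vram_gb:
            choice = name  # later entries overwrite earlier, smaller ones
    if choice is None:
        return None
    return f"{choice}.en" if english_only else choice
```

For example, pick_model(8) returns "turbo", while pick_model(24, english_only=True) returns "medium.en" because no larger English-only variant exists.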

Getting started

Install the package with pip, make sure ffmpeg is available, then run the whisper command on an audio file.

Install Whisper

Install the latest release from PyPI. The codebase is expected to be compatible with Python 3.8-3.11 and recent PyTorch versions.

bash

pip install -U openai-whisper

Install ffmpeg

Whisper requires the ffmpeg command-line tool, available from most package managers.

bash

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on macOS using Homebrew
brew install ffmpeg

# on Windows using Chocolatey
choco install ffmpeg
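
Before the first run it can help to confirm both prerequisites. This is a minimal sketch using only the standard library: it checks the interpreter against the Python range stated under Install Whisper and looks for ffmpeg on PATH. The function names python_in_range and preflight are hypothetical, and a version outside the range may still work.

```python
import shutil
import sys
from typing import List, Tuple

# Inclusive (major, minor) range mentioned in the install notes.
SUPPORTED_PYTHON = ((3, 8), (3, 11))


def python_in_range(version: Tuple[int, int]) -> bool:
    """True if (major, minor) falls inside SUPPORTED_PYTHON."""
    low, high = SUPPORTED_PYTHON
    return low <= version <= high


def preflight() -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    current = sys.version_info[:2]
    if not python_in_range(current):
        problems.append(f"Python {current[0]}.{current[1]} is outside 3.8-3.11")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems
```

Calling preflight() and printing each returned string gives a quick checklist of what still needs installing.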

Transcribe an audio file

Run whisper on one or more audio files. This example uses the turbo model, which works well for English transcription.

bash

whisper audio.flac audio.mp3 audio.wav --model turbo
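
The same transcription can also be driven from Python. The sketch below relies on whisper.load_model and model.transcribe returning a dict with a "text" key, as shown in the project's README. The wrapper names transcribe_files and load_turbo are hypothetical, and the import is deferred so the wrapper can be tried with any stand-in object.

```python
from typing import Dict, Iterable


def transcribe_files(model, paths: Iterable[str]) -> Dict[str, str]:
    """Map each audio path to its transcribed text.

    `model` is anything with a `transcribe(path)` method that returns a
    dict containing a "text" key, which is the shape Whisper models return.
    """
    return {path: model.transcribe(path)["text"].strip() for path in paths}


def load_turbo():
    """Load the turbo model (needs openai-whisper; downloads weights on first use)."""
    import whisper  # imported lazily so transcribe_files has no hard dependency

    return whisper.load_model("turbo")
```

With the package installed, transcribe_files(load_turbo(), ["audio.flac", "audio.mp3", "audio.wav"]) mirrors the command above and loads the model only once for all three files.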

Translate non-English speech to English

The turbo model is not trained for translation. Use one of the other multilingual models (medium in this example) with --task translate, and specify the input language with --language.

bash

whisper japanese.wav --model medium --language Japanese --task translate
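
When scripting these commands, the model restriction is easy to forget. The following is a minimal sketch of a command builder that uses only the --model, --language, and --task flags shown above and refuses a translate request with turbo. Treating .en models as unsuitable for translation is an assumption based on them being English-only, and both function names are hypothetical.

```python
from typing import List, Optional, Sequence


def supports_translation(model: str) -> bool:
    """turbo and English-only (.en) models are treated as unsuitable (assumption)."""
    return model != "turbo" and not model.endswith(".en")


def build_whisper_command(
    files: Sequence[str],
    model: str = "turbo",
    language: Optional[str] = None,
    task: str = "transcribe",
) -> List[str]:
    """Build the argument list for the whisper CLI without running it."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    if task == "translate" and not supports_translation(model):
        raise ValueError(f"model {model!r} should not be used for translation")
    command = ["whisper", *files, "--model", model]
    if language:
        command += ["--language", language]
    if task != "transcribe":
        command += ["--task", task]
    return command
```

The returned list can be passed to subprocess.run(command, check=True). Building a list rather than a shell string avoids quoting problems with file names that contain spaces.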

The pip, ffmpeg, and whisper commands in this section are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Transcribe podcasts, interviews, or meeting recordings into text locally without a paid API
Translate non-English audio into English text using the multilingual models
Add captions or subtitles to videos by transcribing their audio tracks
Identify the spoken language of an audio clip before further processing
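
For the captioning use case, a transcription result can be turned into a subtitle file. This minimal sketch assumes each segment is a mapping with "start" and "end" times in seconds and a "text" string, which is the shape of the "segments" list in Whisper's Python result. The function names are hypothetical, and the output follows the common SubRip (SRT) layout.

```python
from typing import Iterable, Mapping


def srt_timestamp(seconds: float) -> str:
    """Format non-negative seconds as HH:MM:SS,mmm for an SRT cue."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Mapping]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n"
            f"{segment['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

Given a result from model.transcribe, writing segments_to_srt(result["segments"]) to a .srt file next to the video is usually enough for a media player to pick it up.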

How Whisper compares

Whisper alongside other open-source audio, music, and voice tools tracked by AI/TLDR, ranked by GitHub stars.

Tool	Stars	What it does
Whisper	★ 103k	General-purpose speech recognition that transcribes and translates audio in many languages
GPT-SoVITS	★ 58.9k	An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice	★ 49.5k	Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS	★ 45.6k	A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS	★ 39.5k	ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird	★ 36.9k	An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice	★ 36.7k	OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
VoxCPM	★ 31k	An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip.
