Overview
Whisper is an open-source speech recognition model from OpenAI. It is trained on a large, diverse audio dataset and handles several tasks at once: multilingual speech recognition, speech translation, and spoken language identification. Under the hood it is a Transformer sequence-to-sequence model that treats these tasks as one sequence of tokens to predict, so a single model replaces many stages of a traditional speech pipeline.
It is aimed at developers and researchers who need to turn audio into text without calling a hosted API. You install it as a Python package and run it locally, choosing from six model sizes that trade off speed against accuracy and VRAM. There are English-only variants (.en) and multilingual variants, plus a turbo model optimized for faster transcription.
In the speech and audio category, Whisper covers two common needs in one tool: transcribing speech to text in its original language, and translating non-English speech into English. Translation requires one of the multilingual models other than turbo; the default turbo model is optimized for fast transcription and is not trained for translation.
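To make the "one sequence of tokens" idea concrete, here is a minimal sketch of how a task could be encoded as a prefix of special tokens that the decoder is conditioned on. The token spellings follow the description in the Whisper paper; the `task_prefix` helper itself is hypothetical, written for illustration, and is not the library's tokenizer API.

```python
from typing import List, Optional


def task_prefix(language: Optional[str],
                task: str = "transcribe",
                timestamps: bool = True) -> List[str]:
    """Build an illustrative special-token prefix for a multilingual model.

    Hypothetical helper: it only mimics the token layout, it does not
    produce real token ids.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")

    tokens = ["<|startoftranscript|>"]
    # When the language is unknown, the model itself predicts this token,
    # which is how language identification falls out of the same setup.
    if language is not None:
        tokens.append(f"<|{language}|>")
    tokens.append(f"<|{task}|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


print(task_prefix("ja", "translate", timestamps=False))
# ['<|startoftranscript|>', '<|ja|>', '<|translate|>', '<|notimestamps|>']
```

Switching from transcription to translation changes a single token in the prefix, which is why one model can stand in for several pipeline stages.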
What it does
- Multitasking model: multilingual speech recognition, speech translation, and language identification in one model
- Six model sizes (tiny, base, small, medium, large, turbo) trading off speed, accuracy, and VRAM (~1 GB to ~10 GB)
- English-only .en variants (offered for the tiny, base, small, and medium sizes) that tend to perform better on English audio, especially at the tiny and base sizes
- Command-line tool that transcribes common audio formats like .flac, .mp3, and .wav
- Speech translation from non-English audio into English, using the multilingual models with --task translate
- Optional language hint via --language so you can target a known input language
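The size trade-off in the list above can be turned into a small selection helper. The sketch below assumes the approximate per-model VRAM figures from the upstream README at the time of writing (they may change), and it uses VRAM need as a rough proxy for accuracy, which is a simplification. `pick_model` is a hypothetical helper, not part of Whisper.

```python
from typing import List, Tuple

# (model name, approximate VRAM needed in GB, usable for --task translate)
# Figures are approximate and taken from the project README; treat them
# as assumptions and re-check against the official repo.
MODELS: List[Tuple[str, float, bool]] = [
    ("tiny", 1, True),
    ("base", 1, True),
    ("small", 2, True),
    ("medium", 5, True),
    ("turbo", 6, False),  # turbo is not trained for translation
    ("large", 10, True),
]


def pick_model(vram_gb: float, translate: bool = False) -> str:
    """Return the largest listed model that fits in the given VRAM budget."""
    candidates = [
        name
        for name, needed_gb, can_translate in MODELS
        if needed_gb <= vram_gb and (can_translate or not translate)
    ]
    if not candidates:
        raise ValueError(f"no model fits in {vram_gb} GB of VRAM")
    # MODELS is ordered by VRAM need, so the last fit is the largest one.
    return candidates[-1]


print(pick_model(6))                  # turbo
print(pick_model(6, translate=True))  # medium
```

The returned name can be passed straight to the CLI's --model flag.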
Getting started
Install the package with pip, make sure ffmpeg is available, then run the whisper command on an audio file.
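If you are unsure whether these prerequisites are already present, a quick check from Python can help. This is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, and finding a module named `whisper` only shows that some package with that name is importable, not that it is the OpenAI one.

```python
import importlib.util
import shutil
from typing import Dict


def check_prerequisites() -> Dict[str, bool]:
    """Report which prerequisites are visible from this Python environment."""
    return {
        # openai-whisper installs an importable module named "whisper"
        "whisper_package": importlib.util.find_spec("whisper") is not None,
        # the "whisper" console command should be on PATH after installation
        "whisper_cli": shutil.which("whisper") is not None,
        # Whisper relies on an "ffmpeg" executable on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


for name, found in check_prerequisites().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Anything reported as missing can be installed with the steps that follow.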
Install Whisper
Install the latest release from PyPI. The codebase targets Python 3.8-3.11 and recent PyTorch versions.
pip install -U openai-whisper
Install ffmpeg
Whisper requires the ffmpeg command-line tool, available from most package managers.
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on macOS using Homebrew
brew install ffmpeg
# on Windows using Chocolatey
choco install ffmpeg
Transcribe an audio file
Run whisper on one or more audio files. This example uses the turbo model, which works well for English transcription.
whisper audio.flac audio.mp3 audio.wav --model turbo
Translate non-English speech to English
The turbo model is not trained for translation. Use a different multilingual model, such as medium, with --task translate, and specify the input language with --language.
whisper japanese.wav --model medium --language Japanese --task translate
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
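The same transcribe and translate steps can also be driven from Python. The sketch below wraps the package's `whisper.load_model` and `model.transcribe` calls; the helper names `decode_options` and `run_whisper` are hypothetical, and the guard against using turbo for translation simply encodes the note above. Running `run_whisper` requires openai-whisper and ffmpeg to be installed.

```python
from typing import Any, Dict, Optional


def decode_options(task: str = "transcribe",
                   language: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments mirroring the CLI's --task and --language."""
    if task not in {"transcribe", "translate"}:
        raise ValueError(f"unknown task: {task!r}")
    options: Dict[str, Any] = {"task": task}
    if language is not None:
        options["language"] = language  # e.g. the language code "ja"
    return options


def run_whisper(path: str, model_name: str = "turbo", **options: Any) -> str:
    """Transcribe (or translate) one audio file and return the text."""
    # Checked before the heavy import so the mistake is caught early.
    if options.get("task") == "translate" and model_name == "turbo":
        raise ValueError("turbo is not trained for translation; "
                         "pick another multilingual model such as medium")

    import whisper  # imported lazily; needs openai-whisper and ffmpeg

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **options)
    return result["text"]
```

With the package installed, `run_whisper("japanese.wav", model_name="medium", **decode_options("translate", "ja"))` would correspond to the translation command above, and `run_whisper("audio.flac")` to a plain transcription with the turbo model.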
When to use it
- Transcribe podcasts, interviews, or meeting recordings into text locally without a paid API
- Translate non-English audio into English text using the multilingual models
- Add captions or subtitles to videos by transcribing their audio tracks
- Identify the spoken language of an audio clip before further processing
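For the captioning use case, the timed segments in a transcription result can be converted into SubRip (.srt) text. The sketch below assumes each segment is a dict with `start` and `end` in seconds plus a `text` field, which is the shape of the entries in `result["segments"]` returned by `model.transcribe`; the helper names are hypothetical. The CLI can also write subtitle files itself (for example via its --output_format option; see `whisper --help`), so this is mainly useful when you post-process segments in Python.

```python
from typing import Dict, Iterable, Union

Segment = Dict[str, Union[float, str]]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(float(segment["start"]))
        end = srt_timestamp(float(segment["end"]))
        text = str(segment["text"]).strip()
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hand-written sample segments standing in for a real transcription result.
sample = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the show."},
    {"start": 2.5, "end": 4.0, "text": " Let's get started."},
]
print(segments_to_srt(sample))
```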
How Whisper compares
Whisper alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Whisper | ★ 103k | General-purpose speech recognition that transcribes and translates audio in many languages |
| GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning. |
| VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts. |
| Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model. |
| ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody. |
| MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line. |
| OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation. |
| VoxCPM | ★ 31k | An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip. |