AI/TLDR

Whisper

General-purpose speech recognition that transcribes and translates audio in many languages

Overview

Whisper is an open-source speech recognition model from OpenAI. It is trained on a large, diverse audio dataset and handles several tasks at once: multilingual speech recognition, speech translation, and spoken language identification. Under the hood it is a Transformer sequence-to-sequence model that treats these tasks as one sequence of tokens to predict, so a single model replaces many stages of a traditional speech pipeline.

It is aimed at developers and researchers who need to turn audio into text without calling a hosted API. You install it as a Python package and run it locally, choosing from six model sizes (tiny, base, small, medium, large, and turbo) that trade off speed against accuracy and VRAM. The tiny, base, small, and medium sizes also come in English-only variants (.en); turbo is an optimized version of the large model built for faster transcription.

In the speech and audio category, Whisper covers two common needs in one tool: transcribing speech to text in its original language, and translating non-English speech into English. Translation requires one of the multilingual models other than turbo. The default turbo model is built for fast transcription and is not trained for translation.
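The language identification task mentioned in the overview can be sketched with the Python package as follows. This is a minimal sketch that follows the pattern in the project's README; the function names `top_languages` and `detect_language`, and the choice of the multilingual `base` model, are assumptions made for illustration.

```python
def top_languages(probs, k=3):
    """Return the k most likely (language_code, probability) pairs."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


def detect_language(audio_path, model_name="base"):
    """Guess the spoken language of a clip (hypothetical helper)."""
    # Imported lazily so top_languages() works without Whisper installed.
    import whisper

    # Assumption: a multilingual model; the .en models cannot detect languages.
    model = whisper.load_model(model_name)
    # Load the audio and pad or trim it to the 30-second window Whisper uses.
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    # detect_language returns the language tokens and a {code: probability} dict.
    _, probs = model.detect_language(mel)
    return top_languages(probs)
```

Calling `detect_language("audio.mp3")` with Whisper and ffmpeg installed should return the most likely language codes first.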

What it does

  • Multitasking model: multilingual speech recognition, speech translation, and language identification in one model
  • Six model sizes (tiny, base, small, medium, large, turbo) trading off speed, accuracy, and VRAM (~1 GB to ~10 GB)
  • English-only .en variants (tiny.en through medium.en) that tend to perform better on English audio, especially at the tiny and base sizes
  • Command-line tool that transcribes common audio formats like .flac, .mp3, and .wav
  • Translate non-English speech into English using the multilingual models with --task translate
  • Optional language hint via --language so you can target a known input language
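The speed, accuracy, and VRAM trade-off above can be turned into a small selection helper. This is only a sketch: the per-model VRAM figures are the approximate values from the project's README, and `pick_model` and its preference order are hypothetical, not part of Whisper.

```python
# Approximate VRAM needs in GB per model size, as listed in the project's README.
# Treat these as rough guidance, not guarantees for a particular GPU.
APPROX_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
    "turbo": 6,
}

# Assumed preference order for this sketch: larger models first.
PREFERENCE = ["large", "turbo", "medium", "small", "base", "tiny"]


def pick_model(vram_gb, need_translation=False):
    """Return the first preferred model that fits in the given VRAM budget."""
    for name in PREFERENCE:
        # turbo is not trained for translation, so skip it for that task.
        if need_translation and name == "turbo":
            continue
        if APPROX_VRAM_GB[name] <= vram_gb:
            return name
    raise ValueError(f"no model fits in {vram_gb} GB of VRAM")
```

For example, an 8 GB budget selects turbo for transcription but falls back to medium when translation is needed.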

Getting started

Install the package with pip, make sure ffmpeg is available, then run the whisper command on an audio file.

Install Whisper

Install the latest release from PyPI. The codebase is expected to be compatible with Python 3.8-3.11 and recent PyTorch versions.

```bash
pip install -U openai-whisper
```

Install ffmpeg

Whisper requires the ffmpeg command-line tool, available from most package managers.

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on macOS using Homebrew
brew install ffmpeg

# on Windows using Chocolatey
choco install ffmpeg
```
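Before running Whisper, it can help to confirm that ffmpeg is actually reachable. The sketch below uses only the Python standard library; `find_tool` and `require_ffmpeg` are hypothetical helper names, not part of Whisper.

```python
import shutil


def find_tool(name="ffmpeg"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def require_ffmpeg():
    """Return the ffmpeg path, or raise a clear error if it is not installed."""
    path = find_tool("ffmpeg")
    if path is None:
        raise RuntimeError("ffmpeg was not found on PATH; install it before running whisper")
    return path
```

Calling `require_ffmpeg()` at the start of a script surfaces a missing install early instead of failing partway through a transcription.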

Transcribe an audio file

Run whisper on one or more audio files. This example uses the turbo model, which works well for English transcription.

```bash
whisper audio.flac audio.mp3 audio.wav --model turbo
```
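The same transcription can be done from Python. The sketch below assumes a model already loaded with `whisper.load_model`, whose `transcribe` method returns a dictionary with a `"text"` key; `transcribe_files` is a hypothetical wrapper written for this example.

```python
def transcribe_files(model, paths):
    """Transcribe each path with an already-loaded model and return {path: text}."""
    results = {}
    for path in paths:
        # model.transcribe reads the file and returns a dict that includes "text".
        result = model.transcribe(path)
        results[path] = result["text"].strip()
    return results
```

With Whisper installed, `transcribe_files(whisper.load_model("turbo"), ["audio.flac", "audio.mp3"])` would load the model once and reuse it for every file, which avoids reloading the weights per file.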

Translate non-English speech to English

The turbo model is not trained for translation. Use a multilingual model with --task translate and specify the input language.

```bash
whisper japanese.wav --model medium --language Japanese --task translate
```
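In Python, the equivalent is to pass `task="translate"` to `transcribe`. This is a sketch under the same constraints described above: `can_translate` is a hypothetical guard that rejects turbo and the English-only .en models, and `translate_to_english` is a hypothetical wrapper, not a Whisper function.

```python
def can_translate(model_name):
    """Return False for models that cannot translate (turbo and .en variants)."""
    # turbo is not trained for translation; .en models are English-only.
    return "turbo" not in model_name and not model_name.endswith(".en")


def translate_to_english(model, audio_path, language):
    """Translate non-English speech into English text with a loaded model."""
    # language is the language spoken in the input, for example "ja" for Japanese.
    result = model.transcribe(audio_path, language=language, task="translate")
    return result["text"].strip()
```

A typical call, assuming Whisper is installed, would be `translate_to_english(whisper.load_model("medium"), "japanese.wav", "ja")` after checking `can_translate("medium")`.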

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Transcribe podcasts, interviews, or meeting recordings into text locally without a paid API
  • Translate non-English audio into English text using the multilingual models
  • Add captions or subtitles to videos by transcribing their audio tracks
  • Identify the spoken language of an audio clip before further processing
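For the captioning use case, the transcription result includes a list of segments, each with `start` and `end` times in seconds and a `text` field. The sketch below turns such segments into SRT subtitle text; `format_timestamp` and `segments_to_srt` are hypothetical helpers written for illustration. The command-line tool can also write subtitle files itself through its `--output_format` option, so this is mainly useful when post-processing segments in Python.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT text from dicts that have "start", "end", and "text" keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # Each SRT cue is a counter, a time range, and the caption text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

Given a result from `model.transcribe(...)`, `segments_to_srt(result["segments"])` would produce text that can be saved with an .srt extension.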

How Whisper compares

Whisper alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Whisper | ★ 103k | General-purpose speech recognition that transcribes and translates audio in many languages
GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
VoxCPM | ★ 31k | An open-source text-to-speech system that generates natural multilingual speech, designs voices from text descriptions, and clones any voice from a short clip.