AI/TLDR

NVIDIA NeMo

NVIDIA's PyTorch framework for speech recognition, text-to-speech, and speech LLMs

Overview

NVIDIA NeMo Speech is an open-source framework for building speech AI models in PyTorch. It covers automatic speech recognition (ASR), text-to-speech (TTS), and speech LLMs, and ships with pre-trained checkpoints, such as the Parakeet and Canary model families, that you can download from Hugging Face and run directly.

It is aimed at researchers and PyTorch developers who want to create, customize, or deploy speech models without starting from scratch. You can use the released checkpoints for inference out of the box, or fine-tune and train your own models using the framework's building blocks.

Within the speech and audio space, NeMo is a full framework for building, training, and customizing models rather than a lightweight inference-only library. It installs on top of your existing Python, PyTorch, and CUDA stack. An NVIDIA GPU is recommended for inference and required for training.

What it does

  • Pre-trained ASR checkpoints including the Parakeet and Canary families, with multilingual recognition and translation support
  • Text-to-speech models such as MagpieTTS, with multilingual voice synthesis
  • Streaming ASR options with controllable latency for real-time transcription
  • Simple model loading from Hugging Face via from_pretrained, plus a one-call transcribe API
  • Works on top of your own Python, PyTorch, and CUDA versions instead of replacing them
  • Open source under Apache 2.0, built for both inference and custom training
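
To make the "controllable latency" point in the streaming bullet concrete, here is a minimal sketch of how chunk size bounds latency in any chunked streaming setup. It is not NeMo's streaming API: the helper name chunk_samples, the 16 kHz sample rate, and the 80 ms chunk size are all assumptions chosen for illustration.

```python
def chunk_samples(samples, sample_rate, chunk_ms):
    """Yield consecutive chunks of roughly chunk_ms milliseconds each.

    Hypothetical helper for illustration only; not part of NeMo.
    """
    chunk_len = max(1, sample_rate * chunk_ms // 1000)
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# One second of silence at 16 kHz (an assumed sample rate).
audio = [0.0] * 16000
chunks = list(chunk_samples(audio, sample_rate=16000, chunk_ms=80))

# Smaller chunks mean the recognizer can emit text sooner,
# at the cost of more model calls per second of audio.
print(len(chunks), len(chunks[0]))  # 13 1280
```

A recognizer fed this way cannot respond faster than one chunk duration, so shrinking chunk_ms lowers the latency floor while increasing the number of inference steps.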

Getting started

Install NeMo Speech, then load a pre-trained model and transcribe an audio file. A recent NVIDIA GPU with CUDA is recommended for inference.

Install from source with uv (recommended)

Clone the repo and let uv reproduce the actively tested stack from the committed lockfile. Use the CUDA extra that matches your setup.

bash
git clone https://github.com/NVIDIA-NeMo/NeMo.git
cd NeMo
uv sync --extra all --extra cu13     # CUDA 13.x

Or install with pip (bring your own environment)

If you already have a Python/PyTorch/CUDA stack, install NeMo over it with pip and the matching PyTorch index.

bash
pip install 'nemo-toolkit[asr,tts,cu13]' --extra-index-url https://download.pytorch.org/whl/cu132
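
After either install path, a quick sanity check of the stack can save a confusing first run. The following is a minimal sketch, assuming the packages are importable as torch and nemo (the latter matches the import used in the transcription example); the function name check_stack is hypothetical.

```python
import importlib.util
import sys


def check_stack():
    """Report which parts of the Python/PyTorch/CUDA stack are visible.

    Hypothetical helper for illustration; not part of NeMo.
    """
    report = {"python": sys.version.split()[0]}
    # find_spec returns None when a top-level package is not installed.
    for name in ("torch", "nemo"):
        report[name] = importlib.util.find_spec(name) is not None
    if report["torch"]:
        import torch  # imported lazily so the check also runs without PyTorch

        report["cuda_available"] = torch.cuda.is_available()
    else:
        report["cuda_available"] = False
    return report


print(check_stack())
```

If cuda_available comes back False on a machine with an NVIDIA GPU, the installed PyTorch build and the CUDA extra probably do not match.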

Transcribe audio with a pre-trained model

Load an ASR checkpoint from Hugging Face and pass a list of audio file paths to transcribe.

python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v3"
)

output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)
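
Because transcribe takes a list of paths, the same call extends naturally to a whole folder. The sketch below assumes a folder of .wav files and that each result exposes a .text attribute, as in the example above; the helper names find_audio_files and transcribe_folder are hypothetical, and NeMo is imported lazily so the file-collection step can be tried without a GPU.

```python
from pathlib import Path

# Assumption: only .wav files are collected; extend the set if needed.
AUDIO_SUFFIXES = {".wav"}


def find_audio_files(folder):
    """Return sorted paths of audio files directly inside folder."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def transcribe_folder(folder, model_name="nvidia/parakeet-tdt-0.6b-v3"):
    """Map each audio path in folder to its transcript (illustrative helper)."""
    # Lazy import: loading NeMo is slow and requires the full install.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    paths = find_audio_files(folder)
    outputs = asr_model.transcribe(paths)
    return {path: out.text for path, out in zip(paths, outputs)}
```

Calling transcribe_folder("recordings") would then return a dictionary keyed by file path, where "recordings" stands in for your own directory.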

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Transcribe audio or video into text using a pre-trained Parakeet or Canary model
  • Add real-time, low-latency streaming speech recognition to an application
  • Generate multilingual speech from text with a TTS model like MagpieTTS
  • Fine-tune or train custom ASR/TTS models on your own data

How NVIDIA NeMo compares

NVIDIA NeMo alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Whisper | ★ 103k | OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS | ★ 58.9k | An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice | ★ 49.5k | Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS | ★ 45.6k | A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS | ★ 39.5k | ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird | ★ 36.9k | An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice | ★ 36.7k | OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
NVIDIA NeMo | ★ 17.4k | NVIDIA's PyTorch framework for speech recognition, text-to-speech, and speech LLMs.