NVIDIA NeMo

NVIDIA's PyTorch framework for speech recognition, text-to-speech, and speech LLMs

github.com/NVIDIA-NeMo/NeMo ★ 17.4k | docs.nvidia.com/nemo-framework

Overview

NVIDIA NeMo Speech is an open-source framework for building speech AI models in PyTorch. It covers automatic speech recognition (ASR), text-to-speech (TTS), and speech LLMs, and ships with pre-trained checkpoints such as the Parakeet and Canary model families that you can download from HuggingFace and run directly.

It is aimed at researchers and PyTorch developers who want to create, customize, or deploy speech models without starting from scratch. You can use the released checkpoints for inference out of the box, or fine-tune and train your own models using the framework's building blocks.

Within the speech and audio space, NeMo is a full modeling framework that covers training, fine-tuning, and inference, rather than a lightweight inference-only library. It installs on top of your existing Python, PyTorch, and CUDA stack. An NVIDIA GPU is recommended for inference and required for training.

What it does

Pre-trained ASR checkpoints including the Parakeet and Canary families, with multilingual recognition and translation support
Text-to-speech models such as MagpieTTS, with multilingual voice synthesis
Streaming ASR options with controllable latency for real-time transcription
Simple model loading from HuggingFace via from_pretrained and a one-call transcribe API
Works on top of your own Python, PyTorch, and CUDA versions instead of replacing them
Open source under Apache 2.0, built for both inference and custom training

Getting started

Install NeMo Speech, then load a pre-trained model and transcribe an audio file. A recent NVIDIA GPU with CUDA is recommended for inference.

Install from source with uv (recommended)

Clone the repo and let uv reproduce the actively tested stack from the committed lockfile. Use the CUDA extra that matches your setup.

bash

git clone https://github.com/NVIDIA-NeMo/NeMo.git
cd NeMo
uv sync --extra all --extra cu13     # CUDA 13.x

Or install with pip (bring your own environment)

If you already have a Python/PyTorch/CUDA stack, install NeMo over it with pip and the matching PyTorch index.

bash

pip install 'nemo-toolkit[asr,tts,cu13]' --extra-index-url https://download.pytorch.org/whl/cu132
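
After either install path, it can help to confirm that the pieces NeMo sits on top of are actually visible from Python. The sketch below is not part of NeMo: `check_speech_stack` is a hypothetical helper name, and it assumes only that the toolkit installs an importable `nemo` package (as the import in the transcription example suggests) and that PyTorch exposes `torch.cuda.is_available()`.

```python
import importlib.util
import sys


def check_speech_stack() -> dict:
    """Report which parts of the Python/PyTorch/CUDA stack are visible.

    Hypothetical helper for illustration; not a NeMo API.
    """
    report = {
        "python": sys.version.split()[0],
        # find_spec only looks the package up, it does not import it
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "nemo_installed": importlib.util.find_spec("nemo") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        # Imported lazily so the check also runs on a machine without PyTorch
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in check_speech_stack().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back `False`, inference may still run on CPU, but training will need an NVIDIA GPU as noted in the overview.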

Transcribe audio with a pre-trained model

Load an ASR checkpoint from HuggingFace and pass a list of audio file paths to transcribe.

python

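import nemo.collections.asr as nemo_asr

# Download the checkpoint from HuggingFace (cached after the first run)
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v3"
)

# transcribe takes a list of audio file paths and returns one result per file
output = asr_model.transcribe(['2086-149220-0033.wav'])
print(output[0].text)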
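
Because `transcribe` takes a list of paths, the same call extends naturally to several files. The following is a minimal sketch rather than a NeMo API: `transcribe_to_dict` is a hypothetical helper, and it assumes that results come back in the same order as the input paths and that each result exposes a `.text` attribute, as the example above implies.

```python
from pathlib import Path
from typing import Iterable


def transcribe_to_dict(asr_model, audio_paths: Iterable[str]) -> dict:
    """Map each audio path to its transcript text.

    Hypothetical helper: `asr_model` is any object with a
    `transcribe(list_of_paths)` method whose results have `.text`.
    """
    paths = [str(p) for p in audio_paths]

    # Fail early with a clear message instead of partway through a batch
    missing = [p for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"Audio files not found: {missing}")

    # Assumption: one result per input path, in input order
    results = asr_model.transcribe(paths)
    return {path: result.text for path, result in zip(paths, results)}


# Usage with the checkpoint loaded above (downloads the model on first run):
# import nemo.collections.asr as nemo_asr
# asr_model = nemo_asr.models.ASRModel.from_pretrained(
#     model_name="nvidia/parakeet-tdt-0.6b-v3"
# )
# print(transcribe_to_dict(asr_model, ["2086-149220-0033.wav"]))
```

Checking the paths up front is a design choice for readability of errors; the helper otherwise adds nothing beyond the single `transcribe` call.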

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Transcribe audio or video into text using a pre-trained Parakeet or Canary model
Add real-time, low-latency streaming speech recognition to an application
Generate multilingual speech from text with a TTS model like MagpieTTS
Fine-tune or train custom ASR/TTS models on your own data

How NVIDIA NeMo compares

NVIDIA NeMo alongside the other open-source audio, music, and voice tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Whisper	★ 103k	OpenAI's speech recognition model that transcribes and translates audio across many languages.
GPT-SoVITS	★ 58.9k	An open-source WebUI that clones a voice from a short audio sample and turns text into speech, with zero-shot and few-shot fine-tuning.
VibeVoice	★ 49.5k	Microsoft's text-to-speech model for generating long, expressive multi-speaker audio like podcasts.
Coqui TTS	★ 45.6k	A library of text-to-speech models including the multilingual XTTS voice-cloning model.
ChatTTS	★ 39.5k	ChatTTS is an open-source text-to-speech model tuned for dialogue, with multi-speaker support and fine-grained control over laughter, pauses, and prosody.
MockingBird	★ 36.9k	An open-source PyTorch toolbox that clones a voice from a short sample and generates Mandarin Chinese speech, with a web app, desktop toolbox, and command line.
OpenVoice	★ 36.7k	OpenVoice clones a voice from a short reference clip and speaks in multiple languages, with control over emotion, accent, rhythm, and intonation.
NVIDIA NeMo	★ 17.4k	NVIDIA's PyTorch framework for speech recognition, text-to-speech, and speech LLMs
