Overview
mistral-inference is the official, minimal codebase from Mistral AI for running their open-weight models. It covers the well-known family of releases, including Mistral 7B, the Mixtral 8x7B and 8x22B mixture-of-experts models, Codestral, Mathstral, Nemo, Mistral Large 2, and the multimodal Pixtral and Mistral Small 3.1 models.
Because it is the reference implementation, the code stays small and close to how Mistral itself runs the models. You download the model weights, then either test and chat from the command line or call the generation API directly from Python. Smaller models run on a single GPU, while the larger Mixtral models use torchrun across several GPUs.
What it does
- Official reference code maintained by Mistral AI for running their open-weight models
- Covers the full lineup: Mistral 7B, Mixtral 8x7B/8x22B, Codestral, Mathstral, Nemo, Mistral Large 2, Pixtral, and Mistral Small 3.1
- mistral-demo command to quickly check a model works in your setup
- mistral-chat command for interactive chat, with flags like --instruct, --max_tokens, and --temperature
- Python API (Transformer + generate) for instruction following and multimodal image-and-text prompts
- Multi-GPU runs for large mixture-of-experts models via torchrun --nproc-per-node
Getting started
Install the package, download a model, then run it from the CLI or Python. Note that installation needs a GPU because it relies on xformers.
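Since the install depends on a GPU stack, a quick check of the environment can save a failed setup. The helper below is a hypothetical sketch, not part of mistral-inference: `preflight` is an illustrative name, and it only reports whether `torch` and `xformers` are importable and whether PyTorch can see a CUDA device.

```python
import importlib.util


def preflight() -> dict:
    """Report which GPU-related prerequisites this interpreter can see."""
    report = {
        "torch": importlib.util.find_spec("torch") is not None,
        "xformers": importlib.util.find_spec("xformers") is not None,
        "cuda": False,
    }
    if report["torch"]:
        try:
            # Imported lazily so the check also runs where PyTorch is absent.
            import torch

            report["cuda"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is treated the same as "no CUDA".
            report["cuda"] = False
    return report


for name, ok in preflight().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

Before installation, `xformers` will normally show as missing because pip pulls it in as a dependency; the `cuda` entry is the one worth checking first.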
Install from PyPI
Install the package with pip. A GPU is required for installation.
pip install mistral-inference
Download a model
Create a folder for your models, download a model archive, and extract it. Here we download the 12B Mistral Nemo instruct model.
export MISTRAL_MODEL=$HOME/mistral_models
mkdir -p "$MISTRAL_MODEL"
export NEMO_12B_DIR=$MISTRAL_MODEL/12B_Nemo
wget https://models.mistralcdn.com/mistral-nemo-2407/mistral-nemo-instruct-2407.tar
mkdir -p "$NEMO_12B_DIR"
tar -xf mistral-nemo-instruct-2407.tar -C "$NEMO_12B_DIR"
Test and chat from the CLI
Run mistral-demo to confirm the model loads, then start an interactive chat with mistral-chat.
mistral-demo "$NEMO_12B_DIR"
mistral-chat "$NEMO_12B_DIR" --instruct --max_tokens 1024 --temperature 0.35
Generate from Python
Load the tokenizer and model, build a chat request, then call generate to get a completion.
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Replace with the folder you extracted the model into
model_dir = "./mistral-nemo-instruct-v0.1"
tokenizer = MistralTokenizer.from_file(f"{model_dir}/tekken.json")
model = Transformer.from_folder(model_dir)

# Build a single-turn chat request and tokenize it
prompt = "How expensive would it be to clean all windows in Paris? Make a reasonable guess in US Dollar."
completion_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])
tokens = tokenizer.encode_chat_completion(completion_request).tokens

# Generate up to 1024 new tokens, stopping at the end-of-sequence token
out_tokens, _ = generate([tokens], model, max_tokens=1024, temperature=0.35, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)

# Decode the generated token IDs back into text
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
print(result)
Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.
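For the multimodal models, the same generate call can also carry images. The sketch below follows the shape of the upstream Pixtral example as best recalled, so treat the details as assumptions to verify against the official repo: `describe_image` is a hypothetical wrapper, the `tekken.json` file name is assumed to match the model folder, and the `images=` argument and the `.images` field should be checked against your installed versions of mistral-inference and mistral-common.

```python
def describe_image(model_dir: str, image_url: str, prompt: str, max_tokens: int = 256) -> str:
    """Ask a multimodal model (for example Pixtral) about the image at image_url."""
    # Imports live inside the function so this module still loads on machines
    # where mistral-inference is not installed.
    from mistral_inference.transformer import Transformer
    from mistral_inference.generate import generate
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
    from mistral_common.protocol.instruct.messages import ImageURLChunk, TextChunk, UserMessage
    from mistral_common.protocol.instruct.request import ChatCompletionRequest

    # Assumption: the model folder ships a tekken.json tokenizer file.
    tokenizer = MistralTokenizer.from_file(f"{model_dir}/tekken.json")
    model = Transformer.from_folder(model_dir)

    # A user message whose content mixes an image chunk and a text chunk.
    request = ChatCompletionRequest(
        messages=[UserMessage(content=[ImageURLChunk(image_url=image_url), TextChunk(text=prompt)])]
    )
    encoded = tokenizer.encode_chat_completion(request)

    # One prompt in the batch, so both tokens and images are wrapped in a list.
    out_tokens, _ = generate(
        [encoded.tokens],
        model,
        images=[encoded.images],
        max_tokens=max_tokens,
        temperature=0.35,
        eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
    )
    return tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
```

Calling `describe_image(model_dir, url, "Describe the image.")` loads the model on every call; for repeated prompts, load the tokenizer and model once and reuse them.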
When to use it
- Run Mistral open-weight models locally on your own GPU instead of calling a hosted API
- Use Codestral or Codestral-Mamba as a local coding assistant via mistral-chat
- Send image-and-text prompts to multimodal models like Pixtral and Mistral Small 3.1
- Serve large Mixtral 8x7B and 8x22B mixture-of-experts models across multiple GPUs with torchrun
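For the multi-GPU case, the launch comes down to wrapping one of the CLI commands in torchrun. The sketch below builds that command line from Python; `torchrun_command` and `launch` are hypothetical helper names, and the use of `--no-python` (so torchrun starts the console script directly) is an assumption based on how the upstream README launches Mixtral, so check it against the official repo.

```python
import shlex
import subprocess


def torchrun_command(model_dir: str, num_gpus: int, tool: str = "mistral-demo") -> list[str]:
    """Build the argv for launching a mistral-inference CLI under torchrun."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun",
        "--nproc-per-node", str(num_gpus),  # one worker process per GPU
        "--no-python",  # the target is a console script, not a .py file
        tool,
        model_dir,
    ]


def launch(model_dir: str, num_gpus: int) -> int:
    """Run the command and return its exit code (needs torchrun on PATH)."""
    return subprocess.run(torchrun_command(model_dir, num_gpus), check=False).returncode


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(torchrun_command("/models/8x7B", 2)))
# prints: torchrun --nproc-per-node 2 --no-python mistral-demo /models/8x7B
```

The path `/models/8x7B` is a placeholder; point it at the folder holding the extracted Mixtral weights and set `num_gpus` to the number of GPUs on the machine.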
How mistral-inference compares
mistral-inference alongside the other open-source local runtime tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use. |
| GPT4All | ★ 77.4k | A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware. |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS. |
| MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation. |
| mistral-inference | ★ 10.8k | Official reference code for running Mistral open-weight models |