mistral-inference

Official reference code for running Mistral open-weight models

github.com/mistralai/mistral-inference | ★ 10.8k | mistral.ai

Overview

mistral-inference is the official, minimal codebase from Mistral AI for running their open-weight models. It supports Mistral 7B, the Mixtral 8x7B and 8x22B mixture-of-experts models, Codestral, Mathstral, Nemo, Mistral Large 2, and the multimodal Pixtral and Mistral Small 3.1 models.

Because it is the reference implementation, the code stays small and close to how Mistral itself runs the models. You download the model weights, then either test and chat from the command line or call the generation API directly from Python. Smaller models run on a single GPU, while the larger Mixtral models use torchrun across several GPUs.

What it does

Official reference code maintained by Mistral AI for running their open-weight models
Covers the full lineup: Mistral 7B, Mixtral 8x7B/8x22B, Codestral, Mathstral, Nemo, Mistral Large 2, Pixtral, and Mistral Small 3.1
mistral-demo command to quickly check a model works in your setup
mistral-chat command for interactive chat, with flags like --instruct, --max_tokens, and --temperature
Python API (Transformer + generate) for instruction following and multimodal image-and-text prompts
Multi-GPU runs for large mixture-of-experts models via torchrun --nproc-per-node

Getting started

Install the package, download a model, then run it from the CLI or Python. Note that installation needs a GPU because it relies on xformers.

Install from PyPI

Install the package with pip. A GPU is required for installation.

bash

pip install mistral-inference
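
After installing, a quick preflight check can confirm that the pieces are in place. The sketch below is a minimal illustration, not part of mistral-inference: the `preflight` helper is hypothetical, and it assumes `torch` and `xformers` are the relevant top-level module names. It only reports whether each package can be found and whether `nvidia-smi` is on the PATH, which is a rough hint of a usable GPU driver rather than proof.

```python
import importlib.util
import shutil


def preflight(packages=("torch", "xformers", "mistral_inference", "mistral_common")):
    """Report which packages are importable and whether nvidia-smi is on PATH.

    Hypothetical helper for illustration; it does not import the packages,
    it only asks the import system whether they can be located.
    """
    report = {name: importlib.util.find_spec(name) is not None for name in packages}
    # Finding nvidia-smi is only a rough hint that an NVIDIA driver is installed.
    report["nvidia-smi"] = shutil.which("nvidia-smi") is not None
    return report


if __name__ == "__main__":
    for name, found in preflight().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If any entry is reported as missing, revisit the install step before downloading multi-gigabyte model weights.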

Download a model

Create a folder for your models, download a model archive, and extract it. Here we download the 12B Mistral Nemo instruct model.

bash

export MISTRAL_MODEL=$HOME/mistral_models
mkdir -p $MISTRAL_MODEL
# Download the archive and extract it into its own subfolder
mkdir -p "$MISTRAL_MODEL/12B_Nemo"
wget https://models.mistralcdn.com/mistral-nemo-2407/mistral-nemo-instruct-2407.tar
tar -xf mistral-nemo-instruct-2407.tar -C "$MISTRAL_MODEL/12B_Nemo"
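
The same download-and-extract step can be scripted in Python, which may be convenient on machines without wget. This is a sketch using only the standard library. The `download` and `extract_archive` helpers are hypothetical names introduced here, and the URL is the one from the commands above.

```python
import tarfile
import urllib.request
from pathlib import Path

NEMO_URL = "https://models.mistralcdn.com/mistral-nemo-2407/mistral-nemo-instruct-2407.tar"


def download(url, archive_path):
    """Fetch url to archive_path unless the file already exists."""
    archive = Path(archive_path).expanduser()
    archive.parent.mkdir(parents=True, exist_ok=True)
    if not archive.exists():  # skip the download when the archive is already present
        urllib.request.urlretrieve(url, archive)
    return archive


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive into dest_dir and return the sorted top-level names."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that reject unsafe members.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Example usage (downloads a large archive, so it is left commented out):
# models = Path.home() / "mistral_models"
# archive = download(NEMO_URL, models / "mistral-nemo-instruct-2407.tar")
# print(extract_archive(archive, models / "12B_Nemo"))
```

Keeping the download and the extraction separate means an interrupted extraction can be retried without fetching the archive again.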

Test and chat from the CLI

Run mistral-demo to confirm the model loads, then start an interactive chat with mistral-chat.

bash

mistral-demo "$MISTRAL_MODEL/12B_Nemo"
mistral-chat "$MISTRAL_MODEL/12B_Nemo" --instruct --max_tokens 1024 --temperature 0.35
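
When the CLI is driven from a script, it can help to build the command line in one place, including the torchrun prefix used for the larger Mixtral models. The sketch below only assembles argument lists and does not run anything. `demo_command`, `chat_command`, and the example path are hypothetical, and the `--no-python` flag is an assumption drawn from torchrun's options for launching a console script directly, so check the official repo for the exact multi-GPU invocation.

```python
import shlex


def _launcher(nproc):
    """Return the torchrun prefix for multi-GPU runs, or an empty list for one GPU."""
    if nproc < 1:
        raise ValueError("nproc must be at least 1")
    if nproc == 1:
        return []
    # Assumption: --no-python makes torchrun execute the console script directly.
    return ["torchrun", "--nproc-per-node", str(nproc), "--no-python"]


def demo_command(model_dir, nproc=1):
    """Argument list for a quick mistral-demo smoke test."""
    return _launcher(nproc) + ["mistral-demo", str(model_dir)]


def chat_command(model_dir, nproc=1, instruct=True, max_tokens=1024, temperature=0.35):
    """Argument list for an interactive mistral-chat session."""
    cmd = _launcher(nproc) + ["mistral-chat", str(model_dir)]
    if instruct:
        cmd.append("--instruct")
    cmd += ["--max_tokens", str(max_tokens), "--temperature", str(temperature)]
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them; pass a list to subprocess.run to execute.
    print(shlex.join(demo_command("/models/12B_Nemo")))
    print(shlex.join(chat_command("/models/8x7B", nproc=2)))
```

Returning a list rather than a single string avoids shell quoting problems when a model path contains spaces.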

Generate from Python

Load the tokenizer and model, build a chat request, then call generate to get a completion.

python

from pathlib import Path

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Folder the Nemo archive was extracted into during the download step; adjust if yours differs
model_dir = Path.home() / "mistral_models" / "12B_Nemo"
tokenizer = MistralTokenizer.from_file(str(model_dir / "tekken.json"))
model = Transformer.from_folder(model_dir)

# Wrap the prompt in a chat request and encode it to token IDs
prompt = "How expensive would it be to clean all windows in Paris? Make a reasonable guess in US Dollar."
completion_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])
tokens = tokenizer.encode_chat_completion(completion_request).tokens

# Generate until the end-of-sequence token or max_tokens, then decode the first completion
eos_id = tokenizer.instruct_tokenizer.tokenizer.eos_id
out_tokens, _ = generate([tokens], model, max_tokens=1024, temperature=0.35, eos_id=eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
print(result)
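
Loading a multi-gigabyte model only to fail on a wrong path is slow, so a cheap check of the model folder beforehand can save time. The sketch below is a hypothetical helper, not part of mistral-inference. `tekken.json` is the tokenizer file used in the snippet above, while `params.json` is an assumption about the archive layout, so adjust the list to match what your extracted folder actually contains.

```python
from pathlib import Path


def missing_files(model_dir, required=("params.json", "tekken.json")):
    """Return the names from required that are not regular files in model_dir.

    The default file list is an assumption for illustration; pass your own
    tuple if your model archive uses different file names.
    """
    folder = Path(model_dir).expanduser()
    return [name for name in required if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("~/mistral_models/12B_Nemo")
    if missing:
        print("Missing from model folder:", ", ".join(missing))
    else:
        print("Model folder looks complete.")
```

An empty list means every expected file was found; anything else names the files to look for before calling the loader.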

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Run Mistral open-weight models locally on your own GPU instead of calling a hosted API
Use Codestral or Codestral-Mamba as a local coding assistant via mistral-chat
Send image-and-text prompts to multimodal models like Pixtral and Mistral Small 3.1
Serve large Mixtral 8x7B and 8x22B mixture-of-experts models across multiple GPUs with torchrun

How mistral-inference compares

mistral-inference alongside the other open-source local-runtime tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
mistral-inference	★ 10.8k	Official reference code for running Mistral open-weight models
