AI/TLDR

mistral-inference

Official reference code for running Mistral open-weight models

Overview

mistral-inference is the official, minimal codebase from Mistral AI for running their open-weight models. It covers the well-known family of releases, including Mistral 7B, the Mixtral 8x7B and 8x22B mixture-of-experts models, Codestral, Mathstral, Nemo, Mistral Large 2, and the multimodal Pixtral and Mistral Small 3.1 models.

Because it is the reference implementation, the code stays small and close to how Mistral itself runs the models. You download the model weights, then either test and chat from the command line or call the generation API directly from Python. Smaller models run on a single GPU, while the larger Mixtral models use torchrun across several GPUs.

What it does

  • Official reference code maintained by Mistral AI for running their open-weight models
  • Covers the full lineup: Mistral 7B, Mixtral 8x7B/8x22B, Codestral, Mathstral, Nemo, Mistral Large 2, Pixtral, and Mistral Small 3.1
  • mistral-demo command to quickly check a model works in your setup
  • mistral-chat command for interactive chat, with flags like --instruct, --max_tokens, and --temperature
  • Python API (Transformer + generate) for instruction following and multimodal image-and-text prompts
  • Multi-GPU runs for large mixture-of-experts models via torchrun --nproc-per-node
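
The multi-GPU launch in the last bullet can be sketched as a small command builder. This is an illustration, not part of mistral-inference: `torchrun_command` is a hypothetical helper, the `--no-python` flag follows the launch pattern shown in the upstream README, and the model path is a placeholder.

```python
import shlex


def torchrun_command(model_dir: str, gpus: int, entrypoint: str = "mistral-demo") -> list[str]:
    """Build a torchrun invocation that starts one process per GPU."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return [
        "torchrun",
        "--nproc-per-node", str(gpus),  # one worker process per GPU
        "--no-python",                  # the entrypoint is a console script, not a .py file
        entrypoint,
        model_dir,
    ]


# Placeholder path; point it at your extracted Mixtral folder.
command = torchrun_command("/models/8x7b_instruct", gpus=2)
print(shlex.join(command))
```

To launch for real, the list could be passed to `subprocess.run(command, check=True)` on a machine with the matching number of GPUs.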

Getting started

Install the package, download a model, then run it from the CLI or Python. Installation requires a GPU because the package depends on xformers, which itself needs a GPU to install.

Install from PyPI

Install the package with pip.

bash
pip install mistral-inference

Download a model

Create a folder for your models, download a model archive, and extract it. Here we download the 12B Mistral Nemo instruct model.

bash
export MISTRAL_MODEL=$HOME/mistral_models
# Create the models folder and a subfolder for the 12B Nemo weights
mkdir -p "$MISTRAL_MODEL/12B_Nemo"
wget https://models.mistralcdn.com/mistral-nemo-2407/mistral-nemo-instruct-2407.tar
# Extract the archive into that subfolder
tar -xf mistral-nemo-instruct-2407.tar -C "$MISTRAL_MODEL/12B_Nemo"
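
Before loading the model, it can help to confirm that the archive extracted where you expect. The sketch below is a minimal check, not a tool shipped with mistral-inference. `missing_model_files` is a hypothetical helper. Only `tekken.json` is named in this guide; `params.json` and `consolidated.safetensors` are assumptions about the archive layout, so adjust the list to match what you actually see in the folder.

```python
from pathlib import Path

# Assumed file names: tekken.json appears in this guide; the other two are
# assumptions about the archive layout, so edit the tuple as needed.
EXPECTED_FILES = ("params.json", "consolidated.safetensors", "tekken.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    folder = Path(model_dir).expanduser()
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    model_dir = Path.home() / "mistral_models" / "12B_Nemo"
    missing = missing_model_files(model_dir)
    print("missing:", missing if missing else "none")
```

An empty result means every listed file is present; anything else points at an incomplete download or a wrong extraction path.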

Test and chat from the CLI

Run mistral-demo to confirm the model loads, then start an interactive chat with mistral-chat.

bash
mistral-demo "$MISTRAL_MODEL/12B_Nemo"
mistral-chat "$MISTRAL_MODEL/12B_Nemo" --instruct --max_tokens 1024 --temperature 0.35
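
To give a feel for what the --temperature flag changes, here is a generic sketch of temperature-scaled softmax in plain Python. It is not code from mistral-inference, and the logits are made-up numbers. It only illustrates the common technique of dividing scores by the temperature before normalizing, which is assumed to approximate how sampling temperature behaves.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is usually handled as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.35):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower values concentrate probability on the top-scoring token, so a setting like 0.35 gives more predictable replies than 1.0.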

Generate from Python

Load the tokenizer and model, build a chat request, then call generate to get a completion.

python
from pathlib import Path

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Folder the archive was extracted into; change it if yours differs.
model_dir = Path.home() / "mistral_models" / "12B_Nemo"
tokenizer = MistralTokenizer.from_file(str(model_dir / "tekken.json"))
model = Transformer.from_folder(model_dir)

prompt = "How expensive would it be to clean all windows in Paris? Make a reasonable guess in US Dollar."
completion_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])
tokens = tokenizer.encode_chat_completion(completion_request).tokens

# generate takes a batch of token lists and returns one output list per prompt.
out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=1024,
    temperature=0.35,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
print(result)
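
If you want to ask several questions without reloading the weights, the encode, generate, and decode steps above can be wrapped in one function. This is a sketch rather than an API provided by mistral-inference: `complete` and `build_request` are hypothetical names, and the tokenizer, model, and generate function are passed in as arguments so the helper itself has no imports.

```python
def complete(prompt, tokenizer, model, generate_fn, build_request,
             max_tokens=1024, temperature=0.35):
    """Encode one prompt, generate, and decode the first completion."""
    request = build_request(prompt)
    tokens = tokenizer.encode_chat_completion(request).tokens
    raw_tokenizer = tokenizer.instruct_tokenizer.tokenizer
    out_tokens, _ = generate_fn(
        [tokens],  # a batch holding a single prompt
        model,
        max_tokens=max_tokens,
        temperature=temperature,
        eos_id=raw_tokenizer.eos_id,
    )
    return raw_tokenizer.decode(out_tokens[0])


# Usage with the objects from the example above (commented out because it
# needs the model weights and a GPU):
# def build_request(text):
#     return ChatCompletionRequest(messages=[UserMessage(content=text)])
# print(complete("Name three French cheeses.", tokenizer, model, generate, build_request))
```

Loading the tokenizer and model once and calling the helper repeatedly avoids paying the model load time for every prompt.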

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Run Mistral open-weight models locally on your own GPU instead of calling a hosted API
  • Use Codestral or Codestral-Mamba as a local coding assistant via mistral-chat
  • Send image-and-text prompts to multimodal models like Pixtral and Mistral Small 3.1
  • Serve large Mixtral 8x7B and 8x22B mixture-of-experts models across multiple GPUs with torchrun

How mistral-inference compares

mistral-inference alongside the other open-source local runtime tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All | ★ 77.4k | GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
mistral-inference | ★ 10.8k | Official reference code for running Mistral open-weight models.