Overview
MLC LLM is a machine-learning compiler and deployment engine for large language models. It compiles models down to native code with TVM and runs them on MLCEngine, a single inference engine that targets many platforms from one codebase.
It is aimed at developers who need to run LLMs outside a typical cloud GPU server. The same engine and compiler run on NVIDIA, AMD, Apple, and Intel GPUs, in the web browser through WebGPU and WASM, and on iOS and Android devices.
As a local runtime, it exposes an OpenAI-compatible API through a REST server, Python, and JavaScript, so you can swap it in for hosted APIs while keeping inference on the hardware you control.
What it does
- Runs on NVIDIA, AMD, Apple, and Intel GPUs via CUDA, ROCm, Metal, Vulkan, and OpenCL
- Deploys to the web browser using WebGPU and WASM, plus iOS and Android
- Provides MLCEngine, one unified inference engine, and a single compiler across every target platform
- Exposes an OpenAI-compatible API through a REST server, Python, and JavaScript
- Built on Apache TVM machine-learning compilation for native code generation
- Ships a CLI for chat and for launching a local OpenAI-style server
Getting started
Install the package, then either talk to a model from Python or launch a local OpenAI-compatible server. Models are pulled directly from Hugging Face with the HF:// prefix.
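To make the HF:// prefix concrete, here is a minimal sketch of how such an identifier lines up with a Hugging Face repository page. The helper name hf_uri_to_url is hypothetical and is not part of the mlc_llm package. It assumes the text after the prefix is an ordinary owner/repo path on huggingface.co.

```python
HF_PREFIX = "HF://"


def hf_uri_to_url(model_uri):
    """Translate an HF:// model identifier into a Hugging Face repo URL.

    Illustrative helper only; MLC LLM resolves the prefix internally.
    """
    if not model_uri.startswith(HF_PREFIX):
        raise ValueError(f"expected an identifier starting with {HF_PREFIX!r}")
    # Everything after the prefix is treated as "<owner>/<repo>".
    repo_id = model_uri[len(HF_PREFIX):].strip("/")
    owner, _, name = repo_id.partition("/")
    if not owner or not name:
        raise ValueError("expected the form HF://<owner>/<repo>")
    return f"https://huggingface.co/{owner}/{name}"


url = hf_uri_to_url("HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC")
print(url)
```

Opening that URL in a browser is a quick way to check a model's files and license before downloading it.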
Install MLC LLM
Install the Python package with pip.
    pip install mlc-llm
Chat with a model from Python
Create an MLCEngine, stream a chat completion, then terminate the engine. This downloads and runs a quantized Llama 3 8B model (about 6 GB of free VRAM is recommended).
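    from mlc_llm import MLCEngine

    # Create the engine; the HF:// prefix pulls the model from Hugging Face
    model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
    engine = MLCEngine(model)

    # Stream an OpenAI-style chat completion and print each piece as it arrives
    for response in engine.chat.completions.create(
        messages=[{"role": "user", "content": "What is the meaning of life?"}],
        model=model,
        stream=True,
    ):
        for choice in response.choices:
            print(choice.delta.content, end="", flush=True)

    # Shut the engine down once the response is complete
    engine.terminate()
Or run an OpenAI-compatible server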
Launch a REST server at http://127.0.0.1:8000 that accepts OpenAI-style requests, so existing clients can point at it.
    mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
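Once the server is running, any HTTP client can send it an OpenAI-style request. The sketch below uses only the Python standard library. It assumes the server is listening on the address given above and accepts chat requests at the OpenAI-style path /v1/chat/completions. The helper names build_chat_request and send_chat_request are illustrative and are not part of MLC LLM.

```python
import json
import urllib.request

# Assumed defaults: the address from the serve step and the model it loaded.
BASE_URL = "http://127.0.0.1:8000"
MODEL = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the meaning of life?")
# Uncomment once the server is running:
# print(send_chat_request(request))
```

Keeping request construction separate from sending makes it easy to inspect the payload, or to point base_url at a different OpenAI-compatible endpoint, without changing the rest of the code.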
When to use it
- Run an LLM locally on a laptop or workstation GPU instead of a hosted API
- Ship in-browser AI features that run client-side with WebGPU and WASM
- Embed on-device language models in iOS or Android apps
- Drop in an OpenAI-compatible local endpoint for development and testing
How MLC LLM compares
MLC LLM alongside the other open-source local-runtime tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use. |
| GPT4All | ★ 77.4k | A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware. |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS. |
| MLC LLM | ★ 22.8k | A compiler and inference engine that compiles LLMs to native code and runs them on GPUs, in browsers, and on phones. |
| KTransformers | ★ 17.3k | A framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM. |