MLC LLM

Compile and run LLMs natively on any GPU, browser, or phone

github.com/mlc-ai/mlc-llm · ★ 22.8k · llm.mlc.ai

Overview

MLC LLM is a machine-learning compiler and deployment engine for large language models. It compiles models down to native code with TVM and runs them on MLCEngine, a single inference engine that targets many platforms from one codebase.

It is aimed at developers who need to run LLMs outside a typical cloud GPU server. The same engine and compiler run on NVIDIA, AMD, Apple, and Intel GPUs, in the web browser through WebGPU and WASM, and on iOS and Android devices.

As a local runtime, it exposes an OpenAI-compatible API through a REST server, Python, and JavaScript, so you can swap it in for hosted APIs while keeping inference on the hardware you control.

What it does

Runs on NVIDIA, AMD, Apple, and Intel GPUs via CUDA, ROCm, Metal, Vulkan, and OpenCL
Deploys to the web browser using WebGPU and WASM, plus iOS and Android
Provides MLCEngine, one unified inference engine, and a shared compiler across every target platform
OpenAI-compatible API through a REST server, Python, and JavaScript
Built on Apache TVM machine-learning compilation for native code generation
Ships a CLI for chat and for launching a local OpenAI-style server

Getting started

Install the package, then either talk to a model from Python or launch a local OpenAI-compatible server. Models are pulled directly from Hugging Face with the HF:// prefix.
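
To make the HF:// convention concrete, the sketch below expands such a model string into the Hugging Face repository URL it appears to stand for. The helper name hf_repo_url is hypothetical and not part of MLC LLM, and the mapping to https://huggingface.co/ is an assumption based on how the prefix is used on this page, so treat it as an illustration rather than the library's own resolution logic.

```python
HF_PREFIX = "HF://"
HF_BASE_URL = "https://huggingface.co/"  # assumed target of the HF:// shorthand


def hf_repo_url(model: str) -> str:
    """Expand an 'HF://<org>/<repo>' model string into a repository URL.

    Hypothetical helper for illustration only; MLC LLM resolves the
    prefix internally and does not expose this function.
    """
    if not model.startswith(HF_PREFIX):
        raise ValueError(f"expected a model string starting with {HF_PREFIX!r}: {model!r}")
    repo_id = model[len(HF_PREFIX):].strip("/")
    # A Hugging Face repository id has exactly two parts: organization and name.
    if repo_id.count("/") != 1:
        raise ValueError(f"expected 'HF://<org>/<repo>', got {model!r}")
    return HF_BASE_URL + repo_id


print(hf_repo_url("HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"))
```

Opening the printed URL in a browser is a quick way to check that a model repository exists before asking the engine to download it.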

Install MLC LLM

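Install the Python package with pip. The project publishes prebuilt wheels per platform (for example CUDA, ROCm, Metal, and Vulkan builds), so the exact package name and index URL can differ from the generic command below; check the official install guide for your hardware.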

bash

pip install mlc-llm

Chat with a model from Python

Create an MLCEngine, stream a chat completion, then terminate the engine. The first run downloads a 4-bit quantized Llama 3 8B model before running it (about 6 GB of free VRAM recommended).

python

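from mlc_llm import MLCEngine

# Create the engine; the HF:// model is downloaded on first use.
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Stream a chat completion through the OpenAI-style interface.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    # Each streamed chunk carries incremental text in choice.delta.content.
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)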

engine.terminate()
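
If you want the full reply as a single string instead of printing it piece by piece, the streamed chunks can be joined. The sketch below is a minimal illustration, not part of MLC LLM: collect_stream and fake_chunk are hypothetical names, and the only assumption about the chunks is the attribute layout used in the example above (response.choices, then choice.delta.content). It is exercised with stand-in objects so it runs without a GPU or a model download.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the incremental text of a streamed chat completion.

    Assumes each chunk exposes .choices and each choice exposes
    .delta.content, as in the streaming loop above. Text from all
    choices is concatenated, so this is meant for a single choice.
    """
    parts = []
    for chunk in chunks:
        for choice in chunk.choices:
            text = choice.delta.content
            if text:  # skip None or empty deltas
                parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk with the same attribute layout (demo only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hello"), fake_chunk(None), fake_chunk(", world")]
print(collect_stream(demo))  # prints "Hello, world"
```

With a real engine, the iterator returned by engine.chat.completions.create(..., stream=True) would take the place of demo.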

Or run an OpenAI-compatible server

Launch a REST server at http://127.0.0.1:8000 that accepts OpenAI-style requests, so existing clients can point at it.

bash

mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
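
Once the server is running, any HTTP client can talk to it. The sketch below uses only the Python standard library to build and send one chat request. It assumes the server follows the usual OpenAI route /v1/chat/completions and response layout (choices[0].message.content); build_chat_request and ask are hypothetical helper names, not part of MLC LLM.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # address the server listens on, per the text above
MODEL = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


def build_chat_request(
    prompt: str, model: str = MODEL, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (nothing is sent yet)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",  # assumed OpenAI-style route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the reply text; needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

Calling ask("What is the meaning of life?") while the server is up should return the model's answer as a string; without a server it raises urllib.error.URLError.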

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Run an LLM locally on a laptop or workstation GPU instead of a hosted API
Ship in-browser AI features that run client-side with WebGPU and WASM
Embed on-device language models in iOS or Android apps
Drop in an OpenAI-compatible local endpoint for development and testing

How MLC LLM compares

MLC LLM alongside the other open-source local-runtime tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A compiler and inference engine that runs LLMs natively on GPUs, in the browser, and on phones.
KTransformers	★ 17.3k	A framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM.
