AI/TLDR

MLC LLM

Compile and run LLMs natively on any GPU, browser, or phone

Overview

MLC LLM is a machine-learning compiler and deployment engine for large language models. It compiles models down to native code with TVM and runs them on MLCEngine, a single inference engine that targets many platforms from one codebase.

It is aimed at developers who need to run LLMs outside a typical cloud GPU server. The same engine and compiler run on NVIDIA, AMD, Apple, and Intel GPUs, in the web browser through WebGPU and WASM, and on iOS and Android devices.

As a local runtime, it exposes an OpenAI-compatible API through a REST server, Python, and JavaScript, so you can swap it in for hosted APIs while keeping inference on the hardware you control.

What it does

  • Runs on NVIDIA, AMD, Apple, and Intel GPUs via CUDA, ROCm, Metal, Vulkan, and OpenCL
  • Deploys to the web browser using WebGPU and WASM, plus iOS and Android
  • Uses MLCEngine as a single inference engine, backed by one compiler, across every target platform
  • Exposes an OpenAI-compatible API through a REST server, Python, and JavaScript
  • Built on Apache TVM machine-learning compilation for native code generation
  • Ships a CLI for chat and for launching a local OpenAI-style server

Getting started

Install the package, then either talk to a model from Python or launch a local OpenAI-compatible server. Models are pulled directly from Hugging Face with the HF:// prefix.
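
To make the HF:// convention concrete, here is a minimal sketch that splits such a model string into its parts. The helper name `parse_model_ref` is hypothetical and not part of MLC LLM. The mapping to a huggingface.co URL and the reading of `q4f16_1` as a quantization code (weight bits, then activation bits, then a variant number) are assumptions based on how the project names its prebuilt models.

```python
import re
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    repo_id: str                 # e.g. "mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
    url: str                     # assumed Hugging Face page for that repo
    quantization: Optional[str]  # e.g. "q4f16_1", or None if no code is found


def parse_model_ref(model: str) -> ModelRef:
    """Split an HF://-prefixed model string into its parts (hypothetical helper)."""
    prefix = "HF://"
    if not model.startswith(prefix):
        raise ValueError(f"expected a string starting with {prefix!r}, got {model!r}")
    repo_id = model[len(prefix):].strip("/")
    if repo_id.count("/") != 1:
        raise ValueError(f"expected '<org>/<repo>' after the prefix, got {repo_id!r}")
    # Assumption: codes look like q<weight bits>f<activation bits>[_<variant>].
    match = re.search(r"q\d+f\d+(?:_\d+)?", repo_id)
    return ModelRef(
        repo_id=repo_id,
        url=f"https://huggingface.co/{repo_id}",
        quantization=match.group(0) if match else None,
    )


ref = parse_model_ref("HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC")
print(ref.repo_id)
print(ref.url)
print(ref.quantization)
```

A check like this can be handy for validating a model string before handing it to the engine, since a typo in the prefix or repository name would otherwise surface only at download time.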

Install MLC LLM

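Install the Python package with pip. The project's install guide publishes prebuilt wheels per platform and GPU backend, so the exact command for your machine may differ from the generic one below.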

bash
pip install mlc-llm

Chat with a model from Python

Create an MLCEngine, stream a chat completion, then terminate the engine. The example downloads and runs a 4-bit quantized Llama 3 8B model; about 6 GB of free VRAM is recommended.

python
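from mlc_llm import MLCEngine

# Model ID; the HF:// prefix pulls the weights from Hugging Face
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Stream the reply and print each text delta as it arrives
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print()  # finish the streamed line with a newline

# Shut down the engine when finished
engine.terminate()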
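
If you want the full reply as one string instead of printing it piece by piece, the streamed chunks can be accumulated. The sketch below is duck-typed: it assumes only that each chunk has the `choices[...].delta.content` shape used in the loop above. `collect_stream` and `fake_chunk` are hypothetical names, and the guard against empty or `None` deltas is a defensive assumption, not documented MLCEngine behavior.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas from a stream of chat-completion chunks."""
    parts = []
    for chunk in chunks:
        for choice in chunk.choices:
            text = choice.delta.content
            if text:  # skip None or empty deltas
                parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in chunk with the same attribute shape (illustration only)."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


# Stand-in data so the sketch runs without a GPU or a downloaded model.
demo = [fake_chunk("Hello"), fake_chunk(", "), fake_chunk("world"), fake_chunk(None)]
print(collect_stream(demo))
```

With a real engine, the same function would presumably be called as `collect_stream(engine.chat.completions.create(messages=..., model=model, stream=True))` before `engine.terminate()`.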

Or run an OpenAI-compatible server

Launch a REST server at http://127.0.0.1:8000 that accepts OpenAI-style requests, so existing clients can point at it.

bash
mlc_llm serve HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
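
Once the server is running, any HTTP client can talk to it. The following is a minimal sketch using only the Python standard library. It assumes the server exposes the usual OpenAI-style `/v1/chat/completions` path and returns the standard `choices[0].message.content` response shape. The helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # address from the serve step above
MODEL = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


def build_chat_request(
    prompt: str, model: str = MODEL, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",  # assumed OpenAI-style path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's text (server must be running)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With the server from the previous step running, uncomment to try it:
# print(ask("What is the meaning of life?"))
```

Separating request construction from sending keeps the payload easy to inspect and makes it simple to point the same code at a different base URL.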

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Run an LLM locally on a laptop or workstation GPU instead of a hosted API
  • Ship in-browser AI features that run client-side with WebGPU and WASM
  • Embed on-device language models in iOS or Android apps
  • Drop in an OpenAI-compatible local endpoint for development and testing

How MLC LLM compares

MLC LLM alongside the other open-source local-runtime tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All | ★ 77.4k | GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM | ★ 22.8k | Compile and run LLMs natively on any GPU, browser, or phone
KTransformers | ★ 17.3k | A framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM.