Overview
llamafile is a Mozilla project that turns an open large language model and the code to run it into a single executable file. It combines llama.cpp with Cosmopolitan Libc, so one file can run locally on most operating systems and CPU architectures without any installation step.
It is aimed at developers and end users who want to try or ship a local model without setting up a Python environment, downloading separate weights, or managing platform-specific builds. You download one file, make it executable, and run it.
As a local runtime, it fits the same niche as other on-device inference tools, but its focus is portability and zero-install distribution. The project also includes whisperfile, a single-file speech-to-text tool built on whisper.cpp and the same packaging.
What it does
- Packages a model plus its runtime into one executable that needs no installation
- Built on llama.cpp and Cosmopolitan Libc to run across most operating systems and CPU architectures
- Pre-built llamafiles are available for download, ranging from small CPU-friendly models to larger ones
- Includes whisperfile for single-file audio transcription and translation built on whisper.cpp
- Runs on CPU out of the box, with support for larger models on more capable hardware and GPUs
- Apache 2.0 licensed, with changes to llama.cpp and whisper.cpp kept under MIT for upstream compatibility
Getting started
Download a pre-built llamafile, make it executable, and run it locally. On Windows, rename the file to add a .exe extension; Windows only runs executables smaller than 4 GB.
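The Windows note above can be checked before handing a llamafile to Windows users. The following is a minimal sketch, not part of llamafile itself: `windows_name_for` and `WINDOWS_MAX_BYTES` are hypothetical names, and reading "4 GB" as 4 GiB (4294967296 bytes) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the .exe name a llamafile would need on Windows,
# but only if the file is under the size limit.
# Assumption: the "4 GB" limit is treated as 4 GiB.
WINDOWS_MAX_BYTES=4294967296

windows_name_for() {
  file=$1
  # Return 2 if the path is not a regular file.
  [ -f "$file" ] || return 2
  # wc -c prints the byte count; strip any padding some platforms add.
  size=$(wc -c < "$file" | tr -d '[:space:]')
  if [ "$size" -lt "$WINDOWS_MAX_BYTES" ]; then
    printf '%s.exe\n' "$file"
  else
    printf 'too large for Windows: %s bytes\n' "$size" >&2
    return 1
  fi
}

# Example use (commented out so sourcing this file has no side effects):
# new_name=$(windows_name_for model.llamafile) && cp model.llamafile "$new_name"
```

The function only prints the suggested name and leaves the copy or rename to the caller, so it can be used as a dry-run check.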
Download an example model
Fetch a small pre-built llamafile (Qwen3.5 0.8B) that works out of the box on most hardware.
curl -LO https://huggingface.co/mozilla-ai/llamafile_0.10/resolve/main/Qwen3.5-0.8B-Q8_0.llamafile
Make it executable
On macOS, Linux, or BSD, mark the downloaded file as executable.
chmod +x Qwen3.5-0.8B-Q8_0.llamafile
Run it
Start the llamafile. It launches the bundled server locally with no separate install.
./Qwen3.5-0.8B-Q8_0.llamafile
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
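For repeated use, the download and chmod steps above could be wrapped in one small script that skips the download when the file is already present. This is a sketch under assumptions: the function names `model_file` and `fetch_llamafile` are hypothetical, and the only URL used is the one from the download step. It uses standard curl options (`-f` to fail on HTTP errors, `-L` to follow redirects, `-O` to keep the remote filename).

```shell
#!/bin/sh
# Sketch: idempotent wrapper around the download and chmod steps.
MODEL_URL="https://huggingface.co/mozilla-ai/llamafile_0.10/resolve/main/Qwen3.5-0.8B-Q8_0.llamafile"

# Derive the local filename from the URL (everything after the last slash).
model_file() {
  printf '%s\n' "${1##*/}"
}

# Download the file only if it is missing, mark it executable,
# and print its name on stdout so the caller can run it.
fetch_llamafile() {
  url=$1
  file=$(model_file "$url")
  if [ ! -f "$file" ]; then
    # curl writes progress to stderr, so stdout stays clean.
    curl -fLO "$url" || return 1
  fi
  chmod +x "$file"
  printf '%s\n' "$file"
}

# Example use (commented out so sourcing this file has no side effects):
# file=$(fetch_llamafile "$MODEL_URL") && "./$file"
```

Keeping the run step outside the function means the script can be sourced or tested without starting the model.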
When to use it
- Run a local LLM on a laptop without setting up Python, CUDA, or a package manager
- Distribute a model to non-technical users as one file they can download and run
- Test or demo open models across macOS, Linux, BSD, and Windows from the same binary
- Transcribe or translate audio offline using whisperfile, the single-file speech-to-text companion
How llamafile compares
llamafile alongside the other open-source local runtime tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use. |
| GPT4All | ★ 77.4k | A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware. |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A tool that distributes and runs a local LLM as a single executable file. |
| MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation. |
| KTransformers | ★ 17.3k | A framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM. |