llamafile

Distribute and run a local LLM as a single executable file

Overview

llamafile is a Mozilla project that turns an open large language model and the code to run it into a single executable file. It combines llama.cpp with Cosmopolitan Libc, so one file can run locally on most operating systems and CPU architectures without any installation step.

It is aimed at developers and end users who want to try or ship a local model without setting up a Python environment, downloading separate weights, or managing platform-specific builds. You download one file, make it executable, and run it.

As a local runtime, it fits the same niche as other on-device inference tools, but its focus is portability and zero-install distribution. The project also includes whisperfile, a single-file speech-to-text tool built on whisper.cpp and the same packaging.

What it does

Packages a model plus its runtime into one executable that needs no installation
Built on llama.cpp and Cosmopolitan Libc to run across most operating systems and CPU architectures
Pre-built llamafiles are available for download, ranging from small CPU-friendly models to larger ones
Includes whisperfile for single-file audio transcription and translation built on whisper.cpp
Runs on CPU out of the box, with support for larger models on more capable hardware and GPUs
Apache 2.0 licensed, with changes to llama.cpp and whisper.cpp kept under MIT for upstream compatibility

Getting started

Download a pre-built llamafile, make it executable, and run it locally. On Windows, rename the file to add a .exe extension (only executables under 4GB run on Windows).

Download an example model

Fetch a small pre-built llamafile (Qwen3.5 0.8B) so it works out of the box on most hardware.

bashbash

curl -LO https://huggingface.co/mozilla-ai/llamafile_0.10/resolve/main/Qwen3.5-0.8B-Q8_0.llamafile

Make it executable

On macOS, Linux, or BSD, mark the downloaded file as executable.

bashbash

chmod +x Qwen3.5-0.8B-Q8_0.llamafile

Run it

Start the llamafile. It launches the bundled server locally with no separate install.

bashbash

./Qwen3.5-0.8B-Q8_0.llamafile

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Run a local LLM on a laptop without setting up Python, CUDA, or a package manager
Distribute a model to non-technical users as one file they can download and run
Test or demo open models across macOS, Linux, BSD, and Windows from the same binary
Transcribe or translate audio offline using whisperfile, the single-file speech-to-text companion

How llamafile compares

llamafile alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	Distribute and run a local LLM as a single executable file
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
KTransformers	★ 17.3k	A framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM.

// Overview

// What it does

// Getting started

Download an example model

Make it executable

Run it

// When to use it

// How llamafile compares

Overview

What it does

Getting started

When to use it

How llamafile compares