AI/TLDR

llamafile

Distribute and run a local LLM as a single executable file

Overview

llamafile is a Mozilla project that turns an open large language model and the code to run it into a single executable file. It combines llama.cpp with Cosmopolitan Libc, so one file can run locally on most operating systems and CPU architectures without any installation step.

It is aimed at developers and end users who want to try or ship a local model without setting up a Python environment, downloading separate weights, or managing platform-specific builds. You download one file, make it executable, and run it.

As a local runtime, it fits the same niche as other on-device inference tools, but its focus is portability and zero-install distribution. The project also includes whisperfile, a single-file speech-to-text tool built on whisper.cpp and the same packaging.

What it does

  • Packages a model plus its runtime into one executable that needs no installation
  • Built on llama.cpp and Cosmopolitan Libc to run across most operating systems and CPU architectures
  • Pre-built llamafiles are available for download, ranging from small CPU-friendly models to larger ones
  • Includes whisperfile for single-file audio transcription and translation built on whisper.cpp
  • Runs on CPU out of the box, with support for larger models on more capable hardware and GPUs
  • Apache 2.0 licensed, with changes to llama.cpp and whisper.cpp kept under MIT for upstream compatibility

Getting started

Download a pre-built llamafile, make it executable, and run it locally. On Windows, rename the file to add a .exe extension (only executables under 4GB run on Windows).

Download an example model

Fetch a small pre-built llamafile (Qwen3.5 0.8B) so it works out of the box on most hardware.

bashbash
curl -LO https://huggingface.co/mozilla-ai/llamafile_0.10/resolve/main/Qwen3.5-0.8B-Q8_0.llamafile

Make it executable

On macOS, Linux, or BSD, mark the downloaded file as executable.

bashbash
chmod +x Qwen3.5-0.8B-Q8_0.llamafile

Run it

Start the llamafile. It launches the bundled server locally with no separate install.

bashbash
./Qwen3.5-0.8B-Q8_0.llamafile

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Run a local LLM on a laptop without setting up Python, CUDA, or a package manager
  • Distribute a model to non-technical users as one file they can download and run
  • Test or demo open models across macOS, Linux, BSD, and Windows from the same binary
  • Transcribe or translate audio offline using whisperfile, the single-file speech-to-text companion

How llamafile compares

llamafile alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.

ToolStarsWhat it does
Ollama★ 175kA developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp★ 117kA C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All★ 77.4kGPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI★ 47kA self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan★ 43.1kAn open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile★ 25kDistribute and run a local LLM as a single executable file
MLC LLM★ 22.8kA machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
KTransformers★ 17.3kA framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM.