Overview
LocalAI is an open-source inference server you run on your own machines. It exposes a drop-in OpenAI-compatible API (it also supports Anthropic and ElevenLabs API shapes), so existing clients and SDKs can talk to it by changing only the base URL. It can serve text, vision, voice, image, and video models, and it runs on CPU-only hardware as well as NVIDIA, AMD, Intel, Apple Silicon, and Vulkan GPUs.
Rather than shipping one large bundle, LocalAI keeps a small core and pulls each backend on demand. Every backend wraps a focused engine such as llama.cpp, vLLM, whisper.cpp, or stable-diffusion in its own image, so you only download what a given model needs. It detects your GPU capabilities and fetches the matching backend automatically.
It fits the local-runtime category for teams and developers who want to keep data on their own infrastructure. Built-in API key auth, per-user quotas, and role-based access make it usable beyond a single developer, and it ships features like agents with tool use, RAG, and MCP support.
What it does
- Drop-in OpenAI-compatible API (also Anthropic and ElevenLabs) across every backend
- One API for many modalities: LLMs, vision, voice, image, and video
- Composable backends (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX) pulled only when a model needs them
- Runs on CPU-only or NVIDIA, AMD, Intel, Apple Silicon, and Vulkan hardware with automatic backend detection
- Multi-user controls: API key auth, per-user quotas, and role-based access
- Load models from a built-in gallery, Hugging Face, Ollama registry, YAML config, or an OCI registry
Getting started
The quickest way to start is the official Docker image, then load a model from the gallery and chat with it from a second terminal.
Run the server (CPU only)
Start the container and expose the API on port 8080. For NVIDIA, AMD, Intel, or Vulkan GPUs, use the matching tagged image from the README instead.
docker run -ti --name local-ai -p 8080:8080 localai/localai:latestLoad a model
Pull and run a model from the gallery. You can also load from Hugging Face, the Ollama registry, a YAML config URL, or an OCI registry.
local-ai run llama-3.2-1b-instruct:q4_k_mChat with the model
From another shell, open an interactive chat session against the running server. Inside the prompt, /models lists installed models and /model <name> switches between them.
local-ai chat --model llama-3.2-1b-instruct:q4_k_mCommands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Give an existing OpenAI-SDK app a private, self-hosted backend by changing only the base URL
- Run LLMs, transcription, text-to-speech, and image generation behind a single local API
- Serve models on CPU-only or mixed-vendor GPU hardware where you can't rely on a cloud provider
- Stand up a shared internal inference endpoint with API keys, quotas, and role-based access for a team
How LocalAI compares
LocalAI alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use. |
| GPT4All | ★ 77.4k | GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | Self-hosted, OpenAI-compatible API for running LLMs, vision, voice, and image models locally |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS. |
| MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation. |
| KTransformers | ★ 17.3k | A framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM. |