LocalAI

Self-hosted, OpenAI-compatible API for running LLMs, vision, voice, and image models locally

github.com/mudler/LocalAI★ 47k localai.io

Overview

LocalAI is an open-source inference server you run on your own machines. It exposes a drop-in OpenAI-compatible API (it also supports Anthropic and ElevenLabs API shapes), so existing clients and SDKs can talk to it by changing only the base URL. It can serve text, vision, voice, image, and video models, and it runs on CPU-only hardware as well as NVIDIA, AMD, Intel, Apple Silicon, and Vulkan GPUs.

Rather than shipping one large bundle, LocalAI keeps a small core and pulls each backend on demand. Every backend wraps a focused engine such as llama.cpp, vLLM, whisper.cpp, or stable-diffusion in its own image, so you only download what a given model needs. It detects your GPU capabilities and fetches the matching backend automatically.

It fits the local-runtime category for teams and developers who want to keep data on their own infrastructure. Built-in API key auth, per-user quotas, and role-based access make it usable beyond a single developer, and it ships features like agents with tool use, RAG, and MCP support.

What it does

Drop-in OpenAI-compatible API (also Anthropic and ElevenLabs) across every backend
One API for many modalities: LLMs, vision, voice, image, and video
Composable backends (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX) pulled only when a model needs them
Runs on CPU-only or NVIDIA, AMD, Intel, Apple Silicon, and Vulkan hardware with automatic backend detection
Multi-user controls: API key auth, per-user quotas, and role-based access
Load models from a built-in gallery, Hugging Face, Ollama registry, YAML config, or an OCI registry

Getting started

The quickest way to start is the official Docker image, then load a model from the gallery and chat with it from a second terminal.

Run the server (CPU only)

Start the container and expose the API on port 8080. For NVIDIA, AMD, Intel, or Vulkan GPUs, use the matching tagged image from the README instead.

bashbash

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest

Load a model

Pull and run a model from the gallery. You can also load from Hugging Face, the Ollama registry, a YAML config URL, or an OCI registry.

bashbash

local-ai run llama-3.2-1b-instruct:q4_k_m

Chat with the model

From another shell, open an interactive chat session against the running server. Inside the prompt, /models lists installed models and /model <name> switches between them.

bashbash

local-ai chat --model llama-3.2-1b-instruct:q4_k_m

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Give an existing OpenAI-SDK app a private, self-hosted backend by changing only the base URL
Run LLMs, transcription, text-to-speech, and image generation behind a single local API
Serve models on CPU-only or mixed-vendor GPU hardware where you can't rely on a cloud provider
Stand up a shared internal inference endpoint with API keys, quotas, and role-based access for a team

How LocalAI compares

LocalAI alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	Self-hosted, OpenAI-compatible API for running LLMs, vision, voice, and image models locally
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
KTransformers	★ 17.3k	A framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM.

// Overview

// What it does

// Getting started

Run the server (CPU only)

Load a model

Chat with the model

// When to use it

// How LocalAI compares

Overview

What it does

Getting started

When to use it

How LocalAI compares