AI/TLDR

LocalAI

Self-hosted, OpenAI-compatible API for running LLMs, vision, voice, and image models locally

Overview

LocalAI is an open-source inference server you run on your own machines. It exposes a drop-in OpenAI-compatible API (it also supports Anthropic and ElevenLabs API shapes), so existing clients and SDKs can talk to it by changing only the base URL. It can serve text, vision, voice, image, and video models, and it runs on CPU-only hardware as well as NVIDIA, AMD, Intel, Apple Silicon, and Vulkan GPUs.

Rather than shipping one large bundle, LocalAI keeps a small core and pulls each backend on demand. Every backend wraps a focused engine such as llama.cpp, vLLM, whisper.cpp, or stable-diffusion in its own image, so you only download what a given model needs. It detects your GPU capabilities and fetches the matching backend automatically.

It fits the local-runtime category for teams and developers who want to keep data on their own infrastructure. Built-in API key auth, per-user quotas, and role-based access make it usable beyond a single developer, and it ships features like agents with tool use, RAG, and MCP support.

What it does

  • Drop-in OpenAI-compatible API (also Anthropic and ElevenLabs) across every backend
  • One API for many modalities: LLMs, vision, voice, image, and video
  • Composable backends (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX) pulled only when a model needs them
  • Runs on CPU-only or NVIDIA, AMD, Intel, Apple Silicon, and Vulkan hardware with automatic backend detection
  • Multi-user controls: API key auth, per-user quotas, and role-based access
  • Load models from a built-in gallery, Hugging Face, Ollama registry, YAML config, or an OCI registry

Getting started

The quickest way to start is the official Docker image, then load a model from the gallery and chat with it from a second terminal.

Run the server (CPU only)

Start the container and expose the API on port 8080. For NVIDIA, AMD, Intel, or Vulkan GPUs, use the matching tagged image from the README instead.

bashbash
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest

Load a model

Pull and run a model from the gallery. You can also load from Hugging Face, the Ollama registry, a YAML config URL, or an OCI registry.

bashbash
local-ai run llama-3.2-1b-instruct:q4_k_m

Chat with the model

From another shell, open an interactive chat session against the running server. Inside the prompt, /models lists installed models and /model <name> switches between them.

bashbash
local-ai chat --model llama-3.2-1b-instruct:q4_k_m

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Give an existing OpenAI-SDK app a private, self-hosted backend by changing only the base URL
  • Run LLMs, transcription, text-to-speech, and image generation behind a single local API
  • Serve models on CPU-only or mixed-vendor GPU hardware where you can't rely on a cloud provider
  • Stand up a shared internal inference endpoint with API keys, quotas, and role-based access for a team

How LocalAI compares

LocalAI alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.

ToolStarsWhat it does
Ollama★ 175kA developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp★ 117kA C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All★ 77.4kGPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI★ 47kSelf-hosted, OpenAI-compatible API for running LLMs, vision, voice, and image models locally
Jan★ 43.1kAn open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile★ 25kA Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM★ 22.8kA machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
KTransformers★ 17.3kA framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM.