OpenLLM

Run any open-source LLM as an OpenAI-compatible API with one command

Overview

OpenLLM is a command-line tool from BentoML for self-hosting large language models. It lets you run open-source models such as Llama 3.3, Qwen2.5, Gemma, Mistral, and Phi4, or your own custom models, and exposes them through OpenAI-compatible API endpoints. Because the API matches OpenAI's, you can point existing clients and frameworks at your own server with little code change.

It is aimed at developers and teams who want to keep models and data on their own infrastructure instead of calling a hosted API. You start a server with one command, and OpenLLM handles the inference backend and serving for you. It also ships a built-in chat UI and a CLI chat mode for quick testing.

Within the model-serving and deployment space, OpenLLM focuses on turning a model into a running, callable endpoint. The same workflow extends from local runs to cloud deployment with Docker, Kubernetes, and BentoCloud.

What it does

Runs open-source LLMs (Llama 3.3, Qwen2.5, Gemma3, Mistral, Phi4, and more) or custom models with a single command
Exposes OpenAI-compatible API endpoints, so OpenAI clients and frameworks like LlamaIndex work against your own server
Built-in chat UI served at the /chat endpoint of the running server
Interactive CLI chat with the openllm run command
Model repository commands to list, inspect, and update available models
Path to cloud deployment with Docker, Kubernetes, and BentoCloud

Getting started

Install OpenLLM with pip, then start a model server and call it like the OpenAI API.

Install OpenLLM

Install the package with pip and run the interactive hello command to explore it.

bash

pip install openllm  # or pip3 install openllm
openllm hello

Start a model server

Serve a model by name and version. The server runs at http://localhost:3000 with OpenAI-compatible APIs. Gated models need a Hugging Face token set as HF_TOKEN.

bash

openllm serve llama3.2:1b
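
Before sending requests, it can help to confirm that the server is accepting connections. The following is a minimal sketch using only the Python standard library; the helper name wait_for_port and its timeout values are assumptions made for illustration and are not part of OpenLLM. An open port only shows that the server process is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; not an OpenLLM API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

Calling wait_for_port("localhost", 3000) checks the address given above for the running server.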

Call it with the OpenAI client

Point the OpenAI Python client at the local server's /v1 base URL. The server does not need an API key, but the OpenAI client requires a non-empty value, so the example passes a placeholder.

python

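from openai import OpenAI

# Point the client at the local OpenLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

# Request a streamed chat completion from the served model.
chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)
# Print each text delta as it arrives; content can be None, hence `or ""`.
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")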
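
If you need the full reply as one string instead of printing it piece by piece, the streamed deltas can be joined. The sketch below assumes each chunk has the same attribute shape as in the example above (chunk.choices[0].delta.content). The names collect_stream and fake_chunk are hypothetical, and the fake chunks stand in for a real response so the snippet runs without a server.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            # Defensive guard: skip any chunk that carries no choices.
            continue
        # delta.content may be None, so fall back to an empty string.
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in with the attribute shape of a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Super"), fake_chunk(None), fake_chunk("conductors")]
print(collect_stream(demo))  # prints "Superconductors"
```

With a live server, passing the chat_completion iterator from the example above in place of demo should work the same way.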

Chat from the CLI

Start an interactive chat in the terminal with a chosen model version.

bash

openllm run llama3:8b

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

Self-host an open-source LLM on your own hardware to keep data in-house instead of using a hosted API
Swap a paid OpenAI endpoint for a local one by changing the base URL in existing OpenAI-client code
Test and compare open models quickly using the built-in chat UI or the CLI run command
Take a model from a local run to a cloud deployment with Docker, Kubernetes, or BentoCloud
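
The base-URL swap in the second item can be kept out of application code by reading the endpoint from the environment. This is a rough sketch under stated assumptions: LLM_BASE_URL and LLM_API_KEY are hypothetical variable names chosen for illustration, and the defaults reuse the local address and placeholder key from the client example.

```python
import os
from typing import Mapping, Optional

# Local OpenLLM address used in the client example on this page.
LOCAL_BASE_URL = "http://localhost:3000/v1"


def client_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick base_url and api_key for an OpenAI-style client.

    LLM_BASE_URL and LLM_API_KEY are hypothetical names for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", LOCAL_BASE_URL),
        # Placeholder key, matching the 'na' used in the client example.
        "api_key": env.get("LLM_API_KEY", "na"),
    }


print(client_settings({}))
```

The returned dict can be unpacked into the client constructor, for example OpenAI(**client_settings()), so switching between a hosted endpoint and the local server becomes a configuration change.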

How OpenLLM compares

OpenLLM alongside other open-source serving and deployment tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	Run any open-source LLM as an OpenAI-compatible API with one command.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache	★ 9.4k	A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation.
