AI/TLDR

OpenLLM

Run any open-source LLM as an OpenAI-compatible API with one command

Overview

OpenLLM is a command-line tool from BentoML for self-hosting large language models. It lets you run open-source models such as Llama 3.3, Qwen2.5, Gemma, Mistral, and Phi4, or your own custom models, and exposes them through OpenAI-compatible API endpoints. Because the API matches OpenAI's, you can point existing clients and frameworks at your own server with little code change.

It is aimed at developers and teams who want to keep models and data on their own infrastructure instead of calling a hosted API. You start a server with one command, and OpenLLM handles the inference backend and serving for you. It also ships a built-in chat UI and a CLI chat mode for quick testing.

Within the model-serving and deployment space, OpenLLM focuses on turning a model into a running, callable endpoint. The same workflow extends from local runs to cloud deployment with Docker, Kubernetes, and BentoCloud.

What it does

  • Runs open-source LLMs (Llama 3.3, Qwen2.5, Gemma3, Mistral, Phi4, and more) or custom models with a single command
  • Exposes OpenAI-compatible API endpoints, so OpenAI clients and frameworks like LlamaIndex work against your own server
  • Built-in chat UI served at the /chat endpoint of the running server
  • Interactive CLI chat with the openllm run command
  • Model repository commands to list, inspect, and update available models
  • Path to cloud deployment with Docker, Kubernetes, and BentoCloud

Getting started

Install OpenLLM with pip, then start a model server and call it like the OpenAI API.

Install OpenLLM

Install the package with pip and run the interactive hello command to explore it.

```bash
pip install openllm  # or pip3 install openllm
openllm hello
```

Start a model server

Serve a model by name and version. The server runs at http://localhost:3000 with OpenAI-compatible APIs. Gated models need a Hugging Face token set as HF_TOKEN.

```bash
openllm serve llama3.2:1b
```
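
A model can take a while to download and load, so a script that starts the server and calls it immediately may hit a refused connection. The sketch below polls until the server answers. It assumes the server exposes the standard OpenAI model-listing route at /v1/models, which OpenAI compatibility implies but which is worth confirming against your OpenLLM version. The names `http_probe` and `wait_until_ready` are hypothetical helpers introduced for illustration.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or non-2xx: the server is not ready yet.
        return False


def wait_until_ready(base_url="http://localhost:3000", timeout=300.0,
                     interval=2.0, probe=http_probe, sleep=time.sleep):
    """Poll the assumed /v1/models route until it responds or `timeout` elapses."""
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_until_ready()` after launching the server returns True once the endpoint responds and False if the timeout passes first. The `probe` and `sleep` parameters exist so the loop can be exercised without a running server.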

Call it with the OpenAI client

Point the OpenAI Python client at the local server's /v1 base URL. The client requires a non-empty API key, so pass any placeholder value such as 'na'.

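```python
from openai import OpenAI

# The client requires a non-empty api_key, so pass a placeholder.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

# Request a streamed completion from the locally served model.
chat_completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old",
        }
    ],
    stream=True,
)

# Print each text fragment as it arrives; delta.content can be None.
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```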
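
Because the server speaks the OpenAI wire format, the same call can be made without the openai package. The sketch below builds the JSON body for the /v1/chat/completions route and decodes the lines of a streamed response. The helper names are hypothetical, and the stream framing (`data: {...}` lines ending with `data: [DONE]`) is the OpenAI convention, assumed here to be what OpenLLM emits.

```python
import json
import urllib.request


def build_chat_request(prompt, model="meta-llama/Llama-3.2-1B-Instruct",
                       base_url="http://localhost:3000/v1", stream=True):
    """Build (but do not send) a POST request for the chat completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_stream_line(line):
    """Return the text carried by one streamed line, or None if it has none."""
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines and comments carry no text
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream sentinel
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def stream_chat(prompt, **kwargs):
    """Send the request and yield text fragments as they arrive."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        for raw_line in response:
            text = parse_stream_line(raw_line)
            if text:
                yield text
```

With a server running, `for piece in stream_chat("Hello"): print(piece, end="")` prints the reply incrementally. Keeping request building and line parsing separate from the network call makes both easy to check offline.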

Chat from the CLI

Start an interactive chat in the terminal with a chosen model version.

```bash
openllm run llama3:8b
```

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

  • Self-host an open-source LLM on your own hardware to keep data in-house instead of using a hosted API
  • Swap a paid OpenAI endpoint for a local one by changing the base URL in existing OpenAI-client code
  • Test and compare open models quickly using the built-in chat UI or the CLI run command
  • Take a model from a local run to a cloud deployment with Docker, Kubernetes, or BentoCloud
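
The base-URL swap in the second use case can be kept out of application code by reading the endpoint choice from the environment. A minimal sketch follows. The `LLM_BACKEND` variable and the `client_kwargs` helper are hypothetical, and the local URL assumes the default port shown in Getting started.

```python
import os

LOCAL_BASE_URL = "http://localhost:3000/v1"  # local OpenLLM server, default port


def client_kwargs(env=None):
    """Return keyword arguments for OpenAI(...) from a hypothetical LLM_BACKEND switch."""
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "local":
        # The OpenAI client needs a non-empty key, so pass a placeholder.
        return {"base_url": LOCAL_BASE_URL, "api_key": "na"}
    # Hosted path: omit base_url so the client falls back to its default endpoint.
    return {"api_key": env["OPENAI_API_KEY"]}
```

Application code then constructs the client with `OpenAI(**client_kwargs())`, and switching between the local server and the hosted API becomes a deployment setting instead of a code change.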

How OpenLLM compares

OpenLLM alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Transformers | ★ 162k | Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training. |
| vLLM | ★ 83.4k | A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once. |
| SGLang | ★ 29.3k | A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests. |
| TensorRT-LLM | ★ 13.9k | NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs. |
| OpenLLM | ★ 12.4k | Run any open-source LLM as an OpenAI-compatible API with one command |
| NVIDIA Triton Inference Server | ★ 10.8k | A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution. |
| OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware. |
| LMCache | ★ 9.4k | A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation. |