RouteLLM

Route easy queries to cheap models and hard ones to strong models to cut LLM cost

github.com/lm-sys/RouteLLM (★ 5k) | lmsys.org/blog/2024-07-01-routellm

Overview

RouteLLM is a framework from LMSYS for serving and evaluating LLM routers. It looks at each incoming query and decides whether to send it to a stronger, more expensive model or to a cheaper, weaker one. The goal is to keep response quality high while spending less on inference.

It works as a drop-in replacement for the OpenAI client, or you can launch an OpenAI-compatible server that any existing OpenAI client can talk to. Trained routers ship out of the box, and LMSYS reports they can reduce costs by up to 85% while keeping about 95% of GPT-4 performance on benchmarks like MT Bench. Model calls go through LiteLLM, so you can pair models from many open and closed providers.

It fits in the LLM gateway and routing space: instead of hard-coding one model for every request, you put RouteLLM in front of two models and let a cost threshold control the cost-quality tradeoff. It is aimed at teams running production LLM traffic who want to trim spend without rewriting their app.
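
The threshold idea can be sketched in a few lines. This is an illustration only, not RouteLLM's code: `route`, `toy_score`, and the placeholder model names are hypothetical, and the scoring function here is a crude word-count heuristic standing in for a trained router. It assumes that a higher score means the query benefits more from the strong model.

```python
from typing import Callable


def route(
    query: str,
    score_fn: Callable[[str], float],
    threshold: float,
    strong_model: str = "strong-model",  # placeholder name
    weak_model: str = "weak-model",      # placeholder name
) -> str:
    """Return the model to call: strong if the score reaches the threshold."""
    score = score_fn(query)
    return strong_model if score >= threshold else weak_model


def toy_score(query: str) -> float:
    # Hypothetical stand-in for a trained router: treat longer queries as harder.
    return min(len(query.split()) / 20.0, 1.0)


print(route("Hello!", toy_score, threshold=0.5))
```

Raising the threshold sends fewer queries to the strong model and lowers cost; lowering it does the opposite.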

What it does

Drop-in replacement for the OpenAI client, plus an OpenAI-compatible server mode that works with existing OpenAI clients.
Trained routers (such as the matrix-factorization `mf` router) included out of the box, no training required to start.
A per-request cost threshold that tunes the tradeoff between cost and response quality.
Threshold calibration tool that sets the cutoff for a target percentage of strong-model calls using Chatbot Arena data.
Wide model support through LiteLLM, including any OpenAI-compatible endpoint via the `openai/` prefix with `--base-url` and `--api-key`.
Extensible benchmarking so you can add new routers and compare them across benchmarks.

Getting started

Install RouteLLM from PyPI, then swap your OpenAI client for the RouteLLM controller and route between a strong and weak model.

Install

Install from PyPI with the serve and eval extras.

bash

pip install "routellm[serve,eval]"

Initialize the controller

Replace your OpenAI client with the RouteLLM controller using the `mf` router, and set a strong model and a weak model. Set the API keys for whichever providers you use.

python

import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-XXXXXX"
# Replace with your model provider, we use Anyscale's Mixtral here.
os.environ["ANYSCALE_API_KEY"] = "esecret_XXXXXX"

client = Controller(
  routers=["mf"],
  strong_model="gpt-4-1106-preview",
  weak_model="anyscale/mistralai/Mixtral-8x7B-Instruct-v0.1",
)

Calibrate a cost threshold

Pick a threshold for the share of queries you want sent to the strong model. For example, calibrate for 50% GPT-4 calls using Chatbot Arena data.

bash

python -m routellm.calibrate_threshold --routers mf --strong-model-pct 0.5 --config config.example.yaml
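
Conceptually, calibrating for a target strong-model share amounts to picking a quantile of router scores over a sample of queries. The sketch below shows that idea under stated assumptions: `threshold_for_strong_pct` is a hypothetical helper, the scores are made-up numbers, and the actual tool may compute its cutoff differently.

```python
from typing import Sequence


def threshold_for_strong_pct(scores: Sequence[float], strong_pct: float) -> float:
    """Pick a cutoff so that roughly `strong_pct` of the scores are >= the cutoff."""
    if not scores:
        raise ValueError("need at least one score")
    if not 0.0 < strong_pct <= 1.0:
        raise ValueError("strong_pct must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of queries that should land on the strong model (at least one).
    k = max(1, round(strong_pct * len(ranked)))
    return ranked[k - 1]


# Made-up router scores for six sample queries.
sample_scores = [0.05, 0.10, 0.20, 0.40, 0.60, 0.90]
print(threshold_for_strong_pct(sample_scores, strong_pct=0.5))
```

A cutoff chosen this way only holds for traffic that resembles the calibration sample, which is why recalibrating on your own queries can matter.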

Route completions

Set the `model` field to the router and threshold, then call chat completions as usual. RouteLLM routes each request between the strong and weak model.

python

response = client.chat.completions.create(
  # Use the MF router with a cost threshold of 0.11593
  model="router-mf-0.11593",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)
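
If you want to try several thresholds, building the `model` string programmatically avoids typos. The helper below is a hypothetical convenience, not part of RouteLLM; it only reproduces the `router-<name>-<threshold>` pattern shown above.

```python
def router_model_name(router: str, threshold: float) -> str:
    """Build a model string in the 'router-<name>-<threshold>' pattern."""
    return f"router-{router}-{threshold}"


# Candidate thresholds to compare; the values are arbitrary examples.
for t in (0.1, 0.2, 0.3):
    print(router_model_name("mf", t))
```

Each resulting string can be passed as the `model` argument to `client.chat.completions.create`.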

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Cut inference spend on an existing OpenAI-based app by routing only the hard queries to a top-tier model while sending the rest to a cheaper one.
Stand up an OpenAI-compatible routing proxy in front of two models so multiple services can share the same cost-aware endpoint.
Tune the strong-model call rate for your own traffic by calibrating the cost threshold against real query patterns.
Evaluate and compare different routers across benchmarks before choosing one for production.
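
To judge whether routing is worth it for your traffic, a rough blended-cost estimate helps. The sketch below is simple arithmetic, not a RouteLLM feature: the function names are hypothetical, the prices are invented placeholders, and it assumes both models process the same number of tokens per query.

```python
def blended_cost(strong_pct: float, strong_cost: float, weak_cost: float) -> float:
    """Average cost per unit of traffic when `strong_pct` goes to the strong model."""
    return strong_pct * strong_cost + (1.0 - strong_pct) * weak_cost


def savings_vs_all_strong(strong_pct: float, strong_cost: float, weak_cost: float) -> float:
    """Fraction of spend saved compared with sending everything to the strong model."""
    return 1.0 - blended_cost(strong_pct, strong_cost, weak_cost) / strong_cost


# Invented prices per million tokens, for illustration only.
STRONG_PRICE = 10.0
WEAK_PRICE = 0.5

for pct in (0.1, 0.25, 0.5):
    saved = savings_vs_all_strong(pct, STRONG_PRICE, WEAK_PRICE)
    print(f"strong share {pct:.0%}: saves {saved:.1%}")
```

The estimate covers cost only; whether quality holds at a given strong-model share still has to be checked on a benchmark or on your own evaluation set.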

How RouteLLM compares

RouteLLM alongside the other open-source gateway and routing tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LiteLLM	★ 50.9k	A Python SDK and proxy server that gives one OpenAI-compatible API to 100+ LLM providers, with cost tracking, budgets, fallbacks, rate limiting, and an admin UI.
Apache APISIX	★ 16.8k	A cloud-native API gateway whose AI plugins add multi-provider LLM proxying, load balancing, retries and fallbacks, token-based rate limiting, and content moderation.
Portkey AI Gateway	★ 12.1k	An LLM gateway that routes calls to 100+ providers through one API and adds logging, tracing, caching, and fallbacks for production AI traffic.
Higress	★ 8.7k	An AI-native API gateway built on Istio and Envoy that proxies and governs traffic to many LLM providers, with token rate limiting, caching, and MCP server hosting.
Plano (formerly Arch Gateway)	★ 6.6k	An Envoy-based proxy and data plane for agentic apps that handles prompt routing between agents, guardrails, unified access to LLMs, and observability.
Bifrost	★ 5.9k	A high-throughput LLM gateway written in Go that gives a single OpenAI-compatible API to many providers, with failover, load balancing, semantic caching, and very low overhead at high request rates.
RouteLLM	★ 5k	A framework for serving and evaluating LLM routers that sends easy queries to cheap models and hard ones to strong models to cut LLM cost.
vLLM Semantic Router	★ 4.5k	An intelligent router that inspects each request and sends it to the most suitable model in a mixture-of-models setup across cloud, data center, and edge.
