TensorZero

Open-source LLMOps stack: one gateway plus observability, evals, and optimization

github.com/tensorzero/tensorzero · ★ 11.7k · tensorzero.com

Overview

TensorZero is an open-source LLMOps platform that brings several pieces of the LLM workflow together. It pairs a fast gateway that reaches every major model provider through one unified API with observability, evaluation, optimization, and built-in experimentation. You can adopt only the parts you need and add the rest over time.

The gateway is written in Rust for very low overhead, and it works with the OpenAI SDK, OpenTelemetry, and providers like Anthropic, OpenAI, Google, Mistral, and many more. It stores your inferences and feedback in your own database, so production data and human feedback can feed back into better prompts and models.

What it does

Unified gateway that calls any major LLM provider (API or self-hosted) through one OpenAI-compatible API, with sub-millisecond p99 latency overhead
High availability features built in: routing, retries, fallbacks, load balancing, granular timeouts, plus tool use, structured JSON outputs, batching, embeddings, and multimodal inputs
Observability that stores inferences and feedback in your own database, viewable in the open-source UI or programmatically, with OpenTelemetry (OTLP) and Prometheus export
Evaluation of individual inferences with heuristics or LLM judges, and end-to-end workflow evaluations, runnable from the UI or the CLI
Optimization of prompts, models, and inference strategies using production metrics and human feedback, including supervised fine-tuning and automated prompt engineering
Built-in experimentation with adaptive A/B testing so you can compare prompts and models and ship with confidence
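
To make the retry and fallback behaviour in the list above concrete, here is a minimal client-side sketch of the general pattern. It is not TensorZero's implementation: the gateway handles this for you, and the function and provider names below (call_with_fallbacks, flaky_primary, stable_backup) are hypothetical stand-ins.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_fallbacks(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    retries_per_provider: int = 2,
    backoff_seconds: float = 0.0,
) -> str:
    """Try each provider in order, retrying each one before falling back to the next."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return provider(prompt)
            except Exception as exc:  # a real client would catch narrower error types
                last_error = exc
                # Exponential backoff between attempts (zero by default for the demo)
                time.sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error


# Hypothetical providers: the first always times out, the second succeeds.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def stable_backup(prompt: str) -> str:
    return f"backup answered: {prompt}"


print(call_with_fallbacks([flaky_primary, stable_backup], "hello"))
# prints "backup answered: hello"
```

With a gateway in front of your providers, this logic lives outside the application, which is why switching or adding a fallback does not require code changes.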

Getting started

TensorZero ships as a single Docker container and speaks the OpenAI API, so you can point any OpenAI-compatible client at it and start routing inference through the gateway.

Deploy the gateway

Run the TensorZero Gateway with Docker. The deployment guide covers the full Docker Compose setup, and the quickstart walks through it in about five minutes.

Point your OpenAI client at the gateway

Set the base_url in your OpenAI-compatible client to the local gateway. The client's API key can be any placeholder value, because the gateway holds the real provider keys.

python

from openai import OpenAI

# Point the client to the TensorZero Gateway
client = OpenAI(base_url="http://localhost:3000/openai/v1", api_key="not-used")

Run inference against any provider

Pass a TensorZero model name to call any provider through the unified API. You can swap providers by changing the model string.

python

response = client.chat.completions.create(
    model="tensorzero::model_name::anthropic::claude-sonnet-4-6",
    messages=[
        {"role": "user", "content": "Share a fun fact about TensorZero."}
    ],
)
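
Swapping providers by changing the model string can be wrapped in a small helper. The sketch below assumes the `tensorzero::model_name::<provider>::<model>` form shown above and the standard OpenAI SDK response shape; the helper names tensorzero_model and ask are hypothetical, and the client is passed in so the functions work with any OpenAI-compatible client.

```python
from typing import Any


def tensorzero_model(provider: str, model: str) -> str:
    """Build a model string in the tensorzero::model_name::<provider>::<model> form."""
    return f"tensorzero::model_name::{provider}::{model}"


def ask(client: Any, provider: str, model: str, prompt: str) -> str:
    """Send one user message through the gateway and return the reply text."""
    response = client.chat.completions.create(
        model=tensorzero_model(provider, model),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Usage against a running gateway (commented out because it needs the gateway):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:3000/openai/v1", api_key="not-used")
# print(ask(client, "anthropic", "claude-sonnet-4-6", "Share a fun fact about TensorZero."))
```

Only the provider and model arguments change between calls, so comparing two providers on the same prompt is a matter of calling ask twice.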

Run evaluations from the CLI

Once you have datasets and variants set up, run evaluations to compare prompts, models, and inference strategies using heuristics or LLM judges.

bash

docker compose run --rm evaluations \
  --evaluation-name extract_data \
  --dataset-name hard_test_cases \
  --variant-name gpt_4o \
  --concurrency 5
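
To compare several variants, the same command can be generated once per variant. The sketch below only reuses the flags shown above; the helper name evaluation_command and the second variant name (claude_sonnet) are hypothetical, and it prints the commands rather than running them.

```python
import shlex
from typing import List


def evaluation_command(
    evaluation_name: str,
    dataset_name: str,
    variant_name: str,
    concurrency: int = 5,
) -> List[str]:
    """Build the argv list for one evaluation run, using the flags from the command above."""
    return [
        "docker", "compose", "run", "--rm", "evaluations",
        "--evaluation-name", evaluation_name,
        "--dataset-name", dataset_name,
        "--variant-name", variant_name,
        "--concurrency", str(concurrency),
    ]


# "claude_sonnet" is a hypothetical variant name used only for illustration.
for variant in ["gpt_4o", "claude_sonnet"]:
    command = evaluation_command("extract_data", "hard_test_cases", variant)
    print(shlex.join(command))
    # To execute for real: subprocess.run(command, check=True)
```

Building the arguments as a list avoids shell-quoting problems if a dataset or variant name ever contains spaces.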

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Put a single, fast gateway in front of many LLM providers so apps can switch models and add retries, fallbacks, and load balancing without code changes
Capture every inference and piece of feedback in your own database to debug individual calls and watch aggregate metrics across models and prompts over time
Turn production metrics and human feedback into better prompts and fine-tuned models through a data and learning flywheel
Run evaluations and adaptive A/B tests to compare prompts, models, and inference strategies before shipping changes to production

How TensorZero compares

TensorZero alongside the other open-source observability and LLMOps tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Langfuse	★ 29.4k	A self-hostable platform for tracing LLM and agent calls, managing prompts, and running evaluations to debug and improve AI applications.
Opik	★ 19.7k	An open-source platform from Comet for tracing, evaluating, and monitoring LLM applications, RAG systems, and agent workflows with dashboards and LLM-as-judge metrics.
TensorZero	★ 11.7k	Open-source LLMOps stack: one gateway plus observability, evals, and optimization
Evidently	★ 7.6k	A monitoring and evaluation framework for ML and LLM systems that tracks output quality, drift, and test results over time with reports and dashboards.
OpenLLMetry	★ 7.2k	An OpenTelemetry-based SDK that auto-instruments LLM providers, vector databases, and frameworks so traces flow into any existing observability backend.
Helicone	★ 5.8k	A proxy-based observability platform that logs, monitors, and evaluates LLM API calls by routing requests through its endpoint with one line of code.
AgentOps	★ 5.6k	An SDK for monitoring AI agents that tracks LLM cost, session replays, and performance across frameworks like CrewAI, LangChain, and the OpenAI Agents SDK.
Pydantic Logfire	★ 4.3k	An observability platform from the Pydantic team that records LLM calls, agent runs, and tool invocations with tokens, cost, and latency attached.
