AI/TLDR

Opik

Open-source tracing, evaluation, and monitoring for LLM and agent apps

Overview

Opik is an open-source platform from Comet for building, testing, and monitoring applications built on large language models. It records the calls your app makes to an LLM, stores them as traces, and lets you inspect what went in and what came back, from early prototyping through production.

It is aimed at developers and teams working on RAG chatbots, code assistants, and agent systems who need to see why an app behaves the way it does. Alongside tracing, it offers evaluation tools, including LLM-as-a-judge metrics for things like hallucination detection, moderation, and RAG answer relevance.

As an LLM observability tool, Opik pairs a self-hostable server (run locally with Docker Compose) with client SDKs and many framework integrations, so you can collect traces, score them, and watch feedback scores, trace counts, and token usage over time in a dashboard.

What it does

  • Tracing of LLM calls, conversations, and agent activity, viewable in a UI
  • LLM-as-a-judge metrics for hallucination detection, moderation, and RAG assessment (answer relevance, context precision)
  • Datasets and experiments to automate evaluation, plus a PyTest integration for CI/CD
  • Production monitoring dashboards with online evaluation rules; built to handle high trace volumes
  • Many third-party framework integrations, including Google ADK, AutoGen, and Flowise AI
  • Self-hostable server via Docker Compose, plus the Opik Agent Optimizer and Guardrails

Getting started

Install the Python SDK, optionally run the server locally, then wrap your LLM function with a decorator to start logging traces.

Install the SDK

Install the Opik client with pip (or uv).

bash
pip install opik

Run the server locally (optional)

Clone the repo and start the self-hosted server with the install script; the UI runs at http://localhost:5173.

bash
git clone https://github.com/comet-ml/opik.git
cd opik
./opik.sh

Configure

Point the SDK at your local server (or Comet cloud).

bash
opik configure

Trace a function

Add the @opik.track decorator to log calls as traces.

python
import opik

opik.configure(use_local=True)

@opik.track
def my_llm_function(user_question: str) -> str:
    # Your LLM code here
    return "Hello"
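
To make concrete what a tracing decorator captures, here is a minimal standard-library sketch. It is not Opik's implementation and does not use the Opik SDK: `track_sketch` and the in-memory `TRACES` list are hypothetical names standing in for the SDK's decorator and the server-side trace store.

```python
import functools
import time

# Hypothetical in-memory store; a real SDK would send records to a server.
TRACES = []


def track_sketch(func):
    """Record each call's inputs, output, and duration (illustration only)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        record = {"name": func.__name__, "input": {"args": args, "kwargs": kwargs}}
        try:
            result = func(*args, **kwargs)
            record["output"] = result
            return result
        except Exception as exc:
            # Keep the failure in the trace, then let it propagate unchanged.
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            TRACES.append(record)

    return wrapper


@track_sketch
def my_llm_function(user_question: str) -> str:
    # Placeholder for a real LLM call
    return "Hello"


my_llm_function("What is Opik?")
```

After the call, `TRACES` holds one record with the function name, its arguments, the returned string, and the elapsed time. A tracing platform stores and displays this same kind of information for each call.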

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

  • Debugging a RAG chatbot by inspecting each LLM call and its retrieved context in traces
  • Scoring outputs for hallucinations or moderation issues using LLM-as-a-judge metrics
  • Adding evaluation checks to a CI/CD pipeline with the PyTest integration
  • Monitoring trace counts, token usage, and feedback scores for an agent app in production
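
The CI/CD use case above can be sketched as an ordinary test that scores outputs against a small dataset and fails when quality drops. This is a standard-library illustration only: it does not use Opik's PyTest integration or its metrics, and `DATASET`, `fake_llm`, `contains_score`, and `evaluate` are all hypothetical names.

```python
from statistics import mean

# Hypothetical evaluation dataset: a question and text the answer should contain.
DATASET = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2 + 2?", "expected": "4"},
]


def fake_llm(question: str) -> str:
    """Stand-in for the app under test; replace with a real LLM call."""
    answers = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return answers.get(question, "")


def contains_score(output: str, expected: str) -> float:
    """Simple heuristic metric: 1.0 if the expected text appears, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def evaluate(dataset, task, metric) -> float:
    """Run the task over every dataset item and return the mean score."""
    return mean(metric(task(item["input"]), item["expected"]) for item in dataset)


def test_answers_contain_expected_text():
    # A test runner such as pytest would fail the CI run below this threshold.
    assert evaluate(DATASET, fake_llm, contains_score) >= 0.9
```

A substring check is a deliberately simple stand-in here. An LLM-as-a-judge metric would fill the same role as `contains_score`, returning a score per item that the test then compares against a threshold.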

How Opik compares

Opik alongside the other open-source observability and LLMOps tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Langfuse | ★ 29.4k | A self-hostable platform for tracing LLM and agent calls, managing prompts, and running evaluations to debug and improve AI applications.
Opik | ★ 19.7k | Open-source tracing, evaluation, and monitoring for LLM and agent apps
TensorZero | ★ 11.7k | An open-source LLMOps platform that puts a single gateway in front of every major LLM provider and adds observability, evaluation, optimization, and A/B testing.
Evidently | ★ 7.6k | A monitoring and evaluation framework for ML and LLM systems that tracks output quality, drift, and test results over time with reports and dashboards.
OpenLLMetry | ★ 7.2k | An OpenTelemetry-based SDK that auto-instruments LLM providers, vector databases, and frameworks so traces flow into any existing observability backend.
Helicone | ★ 5.8k | A proxy-based observability platform that logs, monitors, and evaluates LLM API calls by routing requests through its endpoint with one line of code.
AgentOps | ★ 5.6k | An SDK for monitoring AI agents that tracks LLM cost, session replays, and performance across frameworks like CrewAI, LangChain, and the OpenAI Agents SDK.
Pydantic Logfire | ★ 4.3k | An observability platform from the Pydantic team that records LLM calls, agent runs, and tool invocations with tokens, cost, and latency attached.