Overview
Opik is an open-source platform from Comet for building, testing, and monitoring applications built on large language models. It records the calls your app makes to an LLM, stores them as traces, and lets you inspect what went in and what came back, from early prototyping through production.
It is aimed at developers and teams working on RAG chatbots, code assistants, and agent systems who need to see why an app behaves the way it does. Alongside tracing, it offers evaluation tools, including LLM-as-a-judge metrics for things like hallucination detection, moderation, and RAG answer relevance.
As an LLM observability tool, Opik pairs a self-hostable server (run locally with Docker Compose) with client SDKs and many framework integrations, so you can collect traces, score them, and watch feedback scores, trace counts, and token usage over time in a dashboard.
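To make those dashboard quantities concrete, here is a minimal sketch of how trace counts, token usage, and a mean feedback score could be rolled up per day. The `TraceRecord` shape and the `daily_summary` helper are hypothetical names chosen for illustration; they are not Opik's data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass
class TraceRecord:
    # Hypothetical record shape for illustration; not Opik's schema.
    day: date
    total_tokens: int
    feedback_score: float  # e.g. a 0.0-1.0 score attached to the trace


def daily_summary(records):
    """Group records by day and compute count, token sum, and mean score."""
    by_day = defaultdict(list)
    for record in records:
        by_day[record.day].append(record)
    return {
        day: {
            "trace_count": len(group),
            "total_tokens": sum(r.total_tokens for r in group),
            "mean_feedback": mean(r.feedback_score for r in group),
        }
        for day, group in sorted(by_day.items())
    }


records = [
    TraceRecord(date(2025, 1, 1), 120, 1.0),
    TraceRecord(date(2025, 1, 1), 80, 0.5),
    TraceRecord(date(2025, 1, 2), 200, 0.0),
]
summary = daily_summary(records)
```

A real deployment would read these values from the server rather than compute them by hand; the sketch only shows what the three metrics mean.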
What it does
- Tracing of LLM calls, conversations, and agent activity, viewable in a UI
- LLM-as-a-judge metrics for hallucination detection, moderation, and RAG assessment (answer relevance, context precision)
- Datasets and experiments to automate evaluation, plus a PyTest integration for CI/CD
- Production monitoring dashboards with online evaluation rules; built to handle high trace volumes
- Many third-party framework integrations, including Google ADK, Autogen, and Flowise AI
- Self-hostable server via Docker Compose, plus the Opik Agent Optimizer and Guardrails
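The LLM-as-a-judge idea in the list above can be sketched in a few lines: format a grading prompt, send it to a judge model, and parse a structured verdict. This is an illustrative sketch only, not Opik's metric implementation. The prompt text, `hallucination_score`, and `fake_judge` are all hypothetical, and the judge is stubbed so the example runs offline.

```python
import json
from typing import Callable

JUDGE_PROMPT = (
    "You are grading an answer for hallucination.\n"
    "Context: {context}\n"
    "Answer: {answer}\n"
    'Reply with JSON like {{"score": 0 or 1, "reason": "..."}}, '
    "where 1 means the answer contains unsupported claims."
)


def hallucination_score(answer: str, context: str, judge: Callable[[str], str]) -> dict:
    """Ask a judge model to grade the answer and parse its JSON verdict."""
    prompt = JUDGE_PROMPT.format(context=context, answer=answer)
    verdict = json.loads(judge(prompt))
    return {"score": float(verdict["score"]), "reason": verdict["reason"]}


def fake_judge(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    # It flags the prompt when the text after "Answer: " mentions "Mars".
    flagged = "Mars" in prompt.split("Answer: ", 1)[1]
    return json.dumps({"score": 1 if flagged else 0, "reason": "stubbed verdict"})


result = hallucination_score(
    answer="The capital of France is Paris.",
    context="France's capital city is Paris.",
    judge=fake_judge,
)
```

Passing the judge in as a callable keeps the metric testable: swapping `fake_judge` for a function that calls a real model leaves the scoring code unchanged.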
Getting started
Install the Python SDK, optionally run the server locally, then wrap your LLM function with a decorator to start logging traces.
Install the SDK
Install the Opik client with pip (or uv).
pip install opik
Run the server locally (optional)
Clone the repo and start the self-hosted server with the install script; the UI runs at http://localhost:5173.
git clone https://github.com/comet-ml/opik.git
cd opik
./opik.sh
Configure
Point the SDK at your local server (or Comet cloud).
opik configure
Trace a function
Add the @opik.track decorator to log calls as traces.
import opik
opik.configure(use_local=True)
@opik.track
def my_llm_function(user_question: str) -> str:
    # Your LLM code here
    return "Hello"
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
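For a rough sense of what a tracking decorator does, here is a toy sketch that uses only the standard library. It is not Opik's implementation: the `track` function and `TRACES` list are hypothetical stand-ins, and the real `@opik.track` logs to the configured Opik server rather than to an in-memory list.

```python
import functools
import time

TRACES = []  # In-memory stand-in for a trace backend.


def track(func):
    """Record the inputs, output, and duration of each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = func(*args, **kwargs)
        TRACES.append({
            "name": func.__name__,
            "input": {"args": args, "kwargs": kwargs},
            "output": output,
            "duration_s": time.perf_counter() - start,
        })
        return output
    return wrapper


@track
def my_llm_function(user_question: str) -> str:
    # Placeholder for a real LLM call.
    return "Hello"


answer = my_llm_function("What is Opik?")
```

After the call, `TRACES` holds one entry with the function name, its arguments, its return value, and how long it took, which is the kind of information a trace view displays.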
When to use it
- Debugging a RAG chatbot by inspecting each LLM call and its retrieved context in traces
- Scoring outputs for hallucinations or moderation issues using LLM-as-a-judge metrics
- Adding evaluation checks to a CI/CD pipeline with the PyTest integration
- Monitoring trace counts, token usage, and feedback scores for an agent app in production
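The CI/CD use case can be pictured as a test that scores an app against a small dataset and fails the build when quality drops below a threshold. The sketch below uses plain pytest-style conventions and does not use Opik's PyTest integration. The dataset, `answer_question`, `exact_match`, and the 0.6 threshold are all hypothetical.

```python
def exact_match(output: str, expected: str) -> float:
    """Simple heuristic metric: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def answer_question(question: str) -> str:
    # Hypothetical app under test; a real one would call an LLM.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "I don't know")


DATASET = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "paris"},
    {"input": "Capital of Peru?", "expected": "Lima"},
]


def evaluate(dataset) -> float:
    """Return the mean metric score across all dataset items."""
    scores = [
        exact_match(answer_question(item["input"]), item["expected"])
        for item in dataset
    ]
    return sum(scores) / len(scores)


def test_accuracy_meets_threshold():
    # pytest collects functions named test_*; a failed assert fails the CI job.
    assert evaluate(DATASET) >= 0.6
```

Saved in a file such as `test_eval.py`, this would run under `pytest` like any other test, so an evaluation regression blocks a merge the same way a unit-test failure does.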
How Opik compares
Opik alongside the other open-source observability and LLMOps tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Langfuse | ★ 29.4k | A self-hostable platform for tracing LLM and agent calls, managing prompts, and running evaluations to debug and improve AI applications. |
| Opik | ★ 19.7k | Open-source tracing, evaluation, and monitoring for LLM and agent apps. |
| TensorZero | ★ 11.7k | An open-source LLMOps platform that puts a single gateway in front of every major LLM provider and adds observability, evaluation, optimization, and A/B testing. |
| Evidently | ★ 7.6k | A monitoring and evaluation framework for ML and LLM systems that tracks output quality, drift, and test results over time with reports and dashboards. |
| OpenLLMetry | ★ 7.2k | An OpenTelemetry-based SDK that auto-instruments LLM providers, vector databases, and frameworks so traces flow into any existing observability backend. |
| Helicone | ★ 5.8k | A proxy-based observability platform that logs, monitors, and evaluates LLM API calls by routing requests through its endpoint with one line of code. |
| AgentOps | ★ 5.6k | An SDK for monitoring AI agents that tracks LLM cost, session replays, and performance across frameworks like CrewAI, LangChain, and the OpenAI Agents SDK. |
| Pydantic Logfire | ★ 4.3k | An observability platform from the Pydantic team that records LLM calls, agent runs, and tool invocations with tokens, cost, and latency attached. |