Overview
Supervision is a Python library from Roboflow that gives you reusable building blocks for computer vision tasks. It handles the work that surrounds a model: drawing detections on frames, loading and converting datasets, counting objects in zones, and more, so you can focus on your application instead of plumbing.
It is designed to be model agnostic. You plug in any classification, detection, or segmentation model, including Ultralytics, Transformers, MMDetection, RF-DETR, and Roboflow Inference, and convert their output into a common `sv.Detections` format that the rest of the toolkit understands.
As a computer vision tool, it sits between your model and your output. It is a good fit for developers who build real-time video pipelines, dataset workflows, or annotated visualizations and want consistent, customizable utilities instead of rewriting the same glue code for each model.
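The common-format idea can be illustrated with a small standard-library sketch. This is not Supervision's implementation or API: the `Detection` dataclass, the two adapter functions, and both model output shapes are hypothetical, chosen only to show why converting every model's output into one structure lets downstream code stay the same.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """Hypothetical stand-in for one entry of a common detection format."""
    xyxy: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    confidence: float
    class_id: int


def from_xywh_dicts(raw: List[dict]) -> List[Detection]:
    """Adapter for a hypothetical model that returns top-left x, y, width, height dicts."""
    return [
        Detection(
            xyxy=(r["x"], r["y"], r["x"] + r["w"], r["y"] + r["h"]),
            confidence=r["score"],
            class_id=r["label"],
        )
        for r in raw
    ]


def from_xyxy_tuples(raw: List[tuple]) -> List[Detection]:
    """Adapter for a hypothetical model that returns (x1, y1, x2, y2, score, label) tuples."""
    return [Detection(xyxy=tuple(r[:4]), confidence=r[4], class_id=r[5]) for r in raw]


def keep_confident(detections: List[Detection], threshold: float) -> List[Detection]:
    """Downstream code only needs to understand the common format."""
    return [d for d in detections if d.confidence >= threshold]


# Made-up outputs from two differently shaped models
model_a_output = [{"x": 10, "y": 20, "w": 30, "h": 40, "score": 0.9, "label": 0}]
model_b_output = [(5, 5, 25, 45, 0.4, 1)]

detections = from_xywh_dicts(model_a_output) + from_xyxy_tuples(model_b_output)
confident = keep_confident(detections, threshold=0.5)
```

Once both outputs share one shape, `keep_confident` works on either model without changes, which is the role the text assigns to `sv.Detections`.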
What it does
- Model-agnostic `sv.Detections` format with connectors for Ultralytics, Transformers, MMDetection, RF-DETR, and Roboflow Inference
- Customizable annotators (such as `BoxAnnotator`) for composing detection and segmentation visualizations
- Dataset utilities to load, split, merge, save, and convert between COCO, YOLO, and Pascal VOC formats
- On-demand image loading when iterating over a `DetectionDataset`
- Real-time video helpers including zone counting and stream processing for tasks like dwell-time analysis
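To make the dataset conversion item concrete, here is a minimal sketch of how a single bounding box maps between the three formats listed above. It uses only the Python standard library and is not Supervision's converter. It assumes the usual conventions: COCO stores `[x_min, y_min, width, height]` in pixels, Pascal VOC stores `[x_min, y_min, x_max, y_max]` in pixels, and YOLO stores `[x_center, y_center, width, height]` normalized by the image size. The function names and sample values are hypothetical.

```python
def coco_to_voc(box):
    """COCO (x_min, y_min, width, height) -> Pascal VOC (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def voc_to_yolo(box, image_w, image_h):
    """Pascal VOC pixel corners -> YOLO normalized center and size."""
    x_min, y_min, x_max, y_max = box
    return (
        (x_min + x_max) / 2 / image_w,
        (y_min + y_max) / 2 / image_h,
        (x_max - x_min) / image_w,
        (y_max - y_min) / image_h,
    )


def yolo_to_voc(box, image_w, image_h):
    """YOLO normalized center and size -> Pascal VOC pixel corners."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * image_w,
        (cy - h / 2) * image_h,
        (cx + w / 2) * image_w,
        (cy + h / 2) * image_h,
    )


# Made-up box in a 640x480 image
coco_box = (100, 50, 200, 100)
voc_box = coco_to_voc(coco_box)
yolo_box = voc_to_yolo(voc_box, image_w=640, image_h=480)
round_trip = yolo_to_voc(yolo_box, image_w=640, image_h=480)
```

YOLO coordinates depend on the image size, so a converter needs the image dimensions as well as the box, which is one reason format conversion is easier to leave to a dataset utility.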
Getting started
Install the package into a Python 3.9 or newer environment, then plug in a model and start annotating detections.
Install Supervision
Install the core package with pip. Requires Python 3.9 or newer.
```bash
pip install supervision
```
Run a model and inspect detections
Supervision is model agnostic. Some integrations, such as `rfdetr`, return `sv.Detections` directly. Install the optional dependencies for this example first.
```bash
pip install pillow rfdetr
```
Get detections from an image
Load an image and run the model to get an `sv.Detections` object you can measure and process.
```python
import supervision as sv
from PIL import Image
from rfdetr import RFDETRSmall

image = Image.open(...)  # path to your image
model = RFDETRSmall()

# Run inference, discarding low-confidence predictions
detections = model.predict(image, threshold=0.5)

len(detections)
# 5
```
Annotate the frame
Use an annotator to draw the detections onto a copy of the image.
```python
import cv2
import supervision as sv

image = cv2.imread(...)  # path to your image
detections = sv.Detections(...)  # detections from your model

box_annotator = sv.BoxAnnotator()
# Draw on a copy so the original image stays unmodified
annotated_frame = box_annotator.annotate(scene=image.copy(), detections=detections)
```
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Draw boxes, masks, and labels on images or video frames from any detection model with a consistent API
- Convert and merge object-detection datasets between COCO, YOLO, and Pascal VOC formats
- Build real-time video analytics like zone counting and dwell-time analysis on a live stream
- Standardize output from different models (Ultralytics, Transformers, RF-DETR, Inference) into one Detections format
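The zone counting and dwell-time use case can be sketched with the standard library alone. This only illustrates the underlying idea and is not Supervision's zone or tracking API. It assumes a tracker has already assigned an id to each box, tests the bottom-center point of each box against the zone polygon with a ray-casting check, and converts frame counts to seconds using an assumed frame rate. All names and values are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bottom_center(box):
    """Anchor point of an (x_min, y_min, x_max, y_max) box, roughly where an object touches the ground."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, y_max)


def update_dwell(frames_in_zone, tracked_boxes, zone):
    """Add one frame of dwell for every tracked box whose anchor is inside the zone."""
    for tracker_id, box in tracked_boxes.items():
        if point_in_polygon(bottom_center(box), zone):
            frames_in_zone[tracker_id] = frames_in_zone.get(tracker_id, 0) + 1
    return frames_in_zone


# Made-up square zone and three frames of tracker_id -> box
zone = [(0, 0), (100, 0), (100, 100), (0, 100)]
frames = [
    {1: (10, 10, 30, 50), 2: (150, 10, 170, 50)},
    {1: (20, 10, 40, 50), 2: (80, 10, 100, 60)},
    {1: (120, 10, 140, 50), 2: (60, 10, 80, 60)},
]
FPS = 30  # assumed frame rate of the stream

frames_in_zone = {}
for tracked_boxes in frames:
    update_dwell(frames_in_zone, tracked_boxes, zone)

dwell_seconds = {tid: count / FPS for tid, count in frames_in_zone.items()}
```

Counting objects in the zone for a single frame is the same check without the accumulation: sum the boxes whose anchor point falls inside the polygon.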
How Supervision compares
Supervision alongside the other open-source vision and understanding tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| PaddleOCR | ★ 83.1k | A toolkit for detecting and recognizing text in images across many languages, plus document parsing. |
| Ultralytics YOLO | ★ 58.6k | A framework for training and running YOLO models for real-time object detection, segmentation, and tracking. |
| Supervision | ★ 44.7k | A model-agnostic Python toolkit for processing and visualizing computer vision detections. |
| MMDetection | ★ 32.8k | An OpenMMLab toolbox with many object detection and instance segmentation algorithms for research and production. |
| Segment Anything 2 (SAM 2) | ★ 19.4k | Meta's model for segmenting and tracking any object across images and video frames from clicks or boxes. |
| Grounded-SAM | ★ 17.6k | A pipeline that combines Grounding DINO and Segment Anything to detect and segment objects from text prompts. |
| DINOv3 | ★ 10.7k | Meta's self-supervised vision backbone that produces general-purpose image features for many downstream tasks. |
| Segment Anything 3 (SAM 3) | ★ 10.6k | Meta's segmentation model that detects, segments, and tracks objects in images and video from text or visual prompts. |