Supervision

A model-agnostic Python toolkit for processing and visualizing computer vision detections

github.com/roboflow/supervision ★ 44.7k supervision.roboflow.com

Overview

Supervision is a Python library from Roboflow that gives you reusable building blocks for computer vision tasks. It handles the work that surrounds a model: drawing detections on frames, loading and converting datasets, counting objects in zones, and more, so you can focus on your application instead of plumbing.

It is designed to be model agnostic. You plug in any classification, detection, or segmentation model, including Ultralytics, Transformers, MMDetection, RF-DETR, and Roboflow Inference, and convert their output into a common `sv.Detections` format that the rest of the toolkit understands.
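The idea of a common format can be sketched in plain Python. This is only an illustration of the adapter pattern, not Supervision's implementation: `CommonDetections`, `from_center_format`, and `from_corner_dicts` are hypothetical names, and the two model output shapes are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]


@dataclass
class CommonDetections:
    """Hypothetical stand-in for a shared detections container."""
    xyxy: List[Box]
    confidence: List[float]
    class_id: List[int]

    def __len__(self) -> int:
        return len(self.xyxy)


def from_center_format(rows) -> CommonDetections:
    """Adapter for a made-up model emitting (cx, cy, w, h, score, class_id) rows."""
    boxes, scores, classes = [], [], []
    for cx, cy, w, h, score, cls in rows:
        # Convert center/size to corner coordinates.
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
        scores.append(float(score))
        classes.append(int(cls))
    return CommonDetections(boxes, scores, classes)


def from_corner_dicts(items) -> CommonDetections:
    """Adapter for a made-up model emitting dicts with 'box', 'score', 'label'."""
    return CommonDetections(
        xyxy=[tuple(float(v) for v in item["box"]) for item in items],
        confidence=[float(item["score"]) for item in items],
        class_id=[int(item["label"]) for item in items],
    )


# Two different output shapes describing the same box end up identical.
a = from_center_format([(50, 40, 20, 10, 0.9, 0)])
b = from_corner_dicts([{"box": [40, 35, 60, 45], "score": 0.8, "label": 0}])
```

Once every model is reduced to the same fields, downstream steps such as drawing or counting only need to be written once.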

Supervision sits between your model and your output. It is a good fit for developers building real-time video pipelines, dataset workflows, or annotated visualizations who want consistent, customizable utilities instead of rewriting the same glue code for each model.

What it does

Model-agnostic `sv.Detections` format with connectors for Ultralytics, Transformers, MMDetection, RF-DETR, and Roboflow Inference
Customizable annotators (such as `BoxAnnotator`) for composing detection and segmentation visualizations
Dataset utilities to load, split, merge, save, and convert between COCO, YOLO, and Pascal VOC formats
On-demand image loading when iterating over a `DetectionDataset`
Real-time video helpers including zone counting and stream processing for tasks like dwell-time analysis
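Converting between the dataset formats listed above largely comes down to their bounding-box conventions: COCO stores `[x_min, y_min, width, height]` in pixels, Pascal VOC stores `[x_min, y_min, x_max, y_max]` in pixels, and YOLO stores `[center_x, center_y, width, height]` normalized by image size. The sketch below shows only that coordinate arithmetic. The function names are hypothetical and this is not how Supervision's dataset utilities are implemented.

```python
def coco_to_voc(box):
    """COCO [x_min, y_min, width, height] -> VOC [x_min, y_min, x_max, y_max]."""
    x_min, y_min, width, height = box
    return [x_min, y_min, x_min + width, y_min + height]


def voc_to_yolo(box, image_width, image_height):
    """VOC pixel corners -> YOLO normalized [center_x, center_y, width, height]."""
    x_min, y_min, x_max, y_max = box
    return [
        (x_min + x_max) / 2 / image_width,
        (y_min + y_max) / 2 / image_height,
        (x_max - x_min) / image_width,
        (y_max - y_min) / image_height,
    ]


def yolo_to_voc(box, image_width, image_height):
    """YOLO normalized [center_x, center_y, width, height] -> VOC pixel corners."""
    center_x, center_y, width, height = box
    half_w = width * image_width / 2
    half_h = height * image_height / 2
    return [
        center_x * image_width - half_w,
        center_y * image_height - half_h,
        center_x * image_width + half_w,
        center_y * image_height + half_h,
    ]


# Round trip for one box in a 400x200 image.
voc_box = coco_to_voc([100, 50, 200, 100])
yolo_box = voc_to_yolo(voc_box, image_width=400, image_height=200)
restored = yolo_to_voc(yolo_box, image_width=400, image_height=200)
```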

Getting started

Install the package into a Python 3.9 or newer environment, then plug in a model and start annotating detections.

Install Supervision

Install the core package with pip. Requires Python 3.9 or newer.

bash

pip install supervision

Run a model and inspect detections

Supervision is model agnostic. Some integrations, such as `rfdetr`, return `sv.Detections` directly. Install the optional dependencies for this example first.

bash

pip install pillow rfdetr

Get detections from an image

Load an image, run the model, and you get a Detections object you can measure and process.

python

import supervision as sv
from PIL import Image
from rfdetr import RFDETRSmall

image = Image.open(...)
model = RFDETRSmall()
detections = model.predict(image, threshold=0.5)

len(detections)
# 5
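To make "measure and process" concrete, here is a minimal sketch of two common operations, computing box areas and filtering by confidence or class. It uses plain dictionaries as a stand-in for a detections object, so the field layout and the helper names `box_area` and `keep` are assumptions for illustration, not Supervision's API.

```python
def box_area(xyxy):
    """Area in square pixels of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = xyxy
    return max(0, x_max - x_min) * max(0, y_max - y_min)


def keep(detections, min_confidence=0.0, class_ids=None):
    """Return detections at or above min_confidence, optionally limited to class_ids."""
    return [
        d for d in detections
        if d["confidence"] >= min_confidence
        and (class_ids is None or d["class_id"] in class_ids)
    ]


# Made-up detections for the example.
detections = [
    {"xyxy": (10, 10, 110, 60), "confidence": 0.92, "class_id": 0},
    {"xyxy": (200, 40, 260, 100), "confidence": 0.55, "class_id": 2},
    {"xyxy": (30, 120, 80, 220), "confidence": 0.71, "class_id": 0},
]

confident = keep(detections, min_confidence=0.7)
class_zero = keep(detections, class_ids={0})
areas = [box_area(d["xyxy"]) for d in detections]
```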

Annotate the frame

Use an annotator to draw the detections onto a copy of the image.

python

import cv2
import supervision as sv

image = cv2.imread(...)
detections = sv.Detections(...)

box_annotator = sv.BoxAnnotator()
annotated_frame = box_annotator.annotate(scene=image.copy(), detections=detections)

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Draw boxes, masks, and labels on images or video frames from any detection model with a consistent API
Convert and merge object-detection datasets between COCO, YOLO, and Pascal VOC formats
Build real-time video analytics like zone counting and dwell-time analysis on a live stream
Standardize output from different models (Ultralytics, Transformers, RF-DETR, Inference) into one Detections format
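Zone counting and dwell-time analysis can be sketched as a point-in-polygon test applied to each box per frame. The version below is a simplified illustration under stated assumptions: each box is represented by its bottom-center point, tracker ids come from some upstream tracker, and all function names are hypothetical rather than part of Supervision.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line at height y can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bottom_center(xyxy):
    """Anchor point of a box: the middle of its bottom edge."""
    x_min, y_min, x_max, y_max = xyxy
    return ((x_min + x_max) / 2, y_max)


def count_in_zone(boxes, polygon):
    """Number of boxes whose anchor point falls inside the zone."""
    return sum(point_in_polygon(bottom_center(b), polygon) for b in boxes)


def update_dwell(dwell_frames, tracked_boxes, polygon):
    """Add one frame of dwell for every tracked id currently inside the zone."""
    for tracker_id, box in tracked_boxes.items():
        if point_in_polygon(bottom_center(box), polygon):
            dwell_frames[tracker_id] = dwell_frames.get(tracker_id, 0) + 1
    return dwell_frames


zone = [(0, 0), (100, 0), (100, 100), (0, 100)]
# Two made-up frames mapping tracker id -> box.
frames = [
    {1: (10, 10, 30, 50), 2: (150, 10, 170, 50)},
    {1: (20, 10, 40, 50), 2: (80, 10, 100, 50)},
]

counts = [count_in_zone(list(f.values()), zone) for f in frames]
dwell = {}
for frame in frames:
    update_dwell(dwell, frame, zone)

fps = 30  # assumed frame rate for converting frames to seconds
dwell_seconds = {tid: n / fps for tid, n in dwell.items()}
```

Dividing the per-id frame count by the frame rate gives an approximate dwell time in seconds. On a live stream with an irregular frame rate, timestamps would be a more reliable basis than frame counts.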

How Supervision compares

Supervision alongside the other open-source vision and understanding tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
PaddleOCR	★ 83.1k	A toolkit for detecting and recognizing text in images across many languages, plus document parsing.
Ultralytics YOLO	★ 58.6k	A framework for training and running YOLO models for real-time object detection, segmentation, and tracking.
Supervision	★ 44.7k	A model-agnostic Python toolkit for processing and visualizing computer vision detections
MMDetection	★ 32.8k	An OpenMMLab toolbox with many object detection and instance segmentation algorithms for research and production.
Segment Anything 2 (SAM 2)	★ 19.4k	Meta's model for segmenting and tracking any object across images and video frames from clicks or boxes.
Grounded-SAM	★ 17.6k	A pipeline that combines Grounding DINO and Segment Anything to detect and segment objects from text prompts.
DINOv3	★ 10.7k	Meta's self-supervised vision backbone that produces general-purpose image features for many downstream tasks.
Segment Anything 3 (SAM 3)	★ 10.6k	Meta's segmentation model that detects, segments, and tracks objects in images and video from text or visual prompts.
