AI/TLDR

Segment Anything 2 (SAM 2)

Prompt, segment, and track any object across images and video frames

Overview

SAM 2 (Segment Anything Model 2) is a model from Meta's FAIR team for promptable visual segmentation. You give it a prompt — a click, a set of points, or a box — and it returns a mask for the object you pointed at. It extends the original Segment Anything model from single images to video by treating an image as a one-frame video.

The main advance over SAM is video tracking. SAM 2 uses a transformer with streaming memory, so once you mark an object in one frame it can follow that object through the rest of the clip. The SAM 2.1 checkpoints come in four sizes (tiny, small, base-plus, large) so you can trade accuracy for speed.

SAM 2 is aimed at developers and researchers who need pixel-level object masks rather than just bounding boxes, for example people building annotation tools, video editing features, or datasets. It ships with an image predictor and a video predictor, both called from Python.

What it does

  • Promptable segmentation from clicks, point sets, or boxes on static images
  • Object tracking across video frames using a streaming memory architecture
  • Multi-object tracking with per-object inference, including adding new objects after tracking has started
  • Four SAM 2.1 checkpoint sizes (tiny, small, base-plus, large) to balance speed and accuracy
  • torch.compile support for the full model on video, for faster video object segmentation (VOS) inference (set vos_optimized=True when building the video predictor)
  • Separate SAM2ImagePredictor and video predictor APIs, plus released training and fine-tuning code

Getting started

SAM 2 needs Python 3.10+, PyTorch 2.5.1+, and torchvision 0.20.1+, and is best run on a GPU machine. Clone the repo, install it, download a checkpoint, then call the predictor.
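Before installing, it can help to confirm that the environment meets the minimums listed above. The following is a minimal sketch of such a check, not something shipped with SAM 2: the helper names `parse_version`, `meets_minimum`, and `check_environment` are made up for illustration, and the comparison only looks at the leading numeric release components, so local suffixes such as `+cu121` and pre-release tags are ignored.

```python
import re
import sys
from importlib import metadata

# Minimum versions quoted from the requirements above.
MINIMUMS = {"torch": "2.5.1", "torchvision": "0.20.1"}


def parse_version(text):
    """Return the leading numeric release components as a tuple of ints.

    Example: "2.5.1+cu121" becomes (2, 5, 1). Suffixes are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, required):
    """True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def check_environment():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required")
    for package, required in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if not meets_minimum(installed, required):
            problems.append(f"{package} {installed} is older than {required}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```

This only inspects installed package metadata; it does not verify that a GPU is visible to PyTorch.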

Install from source

Clone the repository and install it in editable mode. On Windows, the README recommends using WSL with Ubuntu.

bash
git clone https://github.com/facebookresearch/sam2.git && cd sam2

pip install -e .

Download a checkpoint

Fetch the SAM 2.1 model checkpoints with the provided script (or download a single .pt file individually).

bash
cd checkpoints && \
./download_ckpts.sh && \
cd ..
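Each of the four SAM 2.1 sizes pairs a checkpoint file with a config file. The sketch below shows one way to keep that pairing in a single place. The `large` entry matches the paths used in the examples on this page; the other three follow the naming pattern in the official repo at the time of writing and should be checked against your checkout. `SAM21_VARIANTS` and `resolve_variant` are hypothetical names, not part of the sam2 package.

```python
from pathlib import Path

# Size name -> (checkpoint file, config path).
# Assumption: names follow the SAM 2.1 pattern in the official repo;
# verify them against the files in your own checkout.
SAM21_VARIANTS = {
    "tiny": ("sam2.1_hiera_tiny.pt", "configs/sam2.1/sam2.1_hiera_t.yaml"),
    "small": ("sam2.1_hiera_small.pt", "configs/sam2.1/sam2.1_hiera_s.yaml"),
    "base_plus": ("sam2.1_hiera_base_plus.pt", "configs/sam2.1/sam2.1_hiera_b+.yaml"),
    "large": ("sam2.1_hiera_large.pt", "configs/sam2.1/sam2.1_hiera_l.yaml"),
}


def resolve_variant(size, checkpoint_dir="./checkpoints"):
    """Return (checkpoint_path, config_path) as strings for a size name."""
    try:
        checkpoint_name, config_path = SAM21_VARIANTS[size]
    except KeyError:
        choices = ", ".join(sorted(SAM21_VARIANTS))
        raise ValueError(f"unknown size {size!r}; choose from: {choices}") from None
    return str(Path(checkpoint_dir) / checkpoint_name), config_path


checkpoint, model_cfg = resolve_variant("small")
print(checkpoint, model_cfg)
```

The returned pair can be passed to `build_sam2` or `build_sam2_video_predictor` in place of the hard-coded paths used below.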

Segment an image

Load the image predictor, set your image, then pass your prompts to predict() to get masks.

python
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
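The `<input_prompts>` placeholder above stands for the prompt arguments. As a rough sketch, point prompts are usually passed as pixel coordinates in (x, y) order plus a label per point, with 1 for foreground and 0 for background. The helpers below are hypothetical and use only plain Python, so they run without SAM 2 installed. The click coordinates and scores are made-up sample values.

```python
def build_point_prompt(clicks):
    """Split (x, y, is_foreground) clicks into coordinate and label lists.

    Coordinates are (x, y) in pixels; labels are 1 for foreground clicks
    and 0 for background clicks.
    """
    if not clicks:
        raise ValueError("at least one click is required")
    coords = [[float(x), float(y)] for x, y, _ in clicks]
    labels = [1 if is_foreground else 0 for _, _, is_foreground in clicks]
    return coords, labels


def best_mask_index(scores):
    """Return the index of the highest-scoring candidate mask."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    return max(range(len(scores)), key=scores.__getitem__)


# One click on the object and one on the background (sample values).
coords, labels = build_point_prompt([(500, 375, True), (120, 80, False)])
print(coords, labels)

# Stand-in for the per-mask quality scores a predictor might return.
print(best_mask_index([0.62, 0.91, 0.88]))
```

With a real predictor, and assuming the `predict()` signature in the repo at the time of writing, the lists would be converted to NumPy arrays and passed as `point_coords=` and `point_labels=`. The second return value of `predict()` holds the per-mask scores that `best_mask_index` would rank when several candidate masks are requested.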

Track objects in a video

Use the video predictor: initialize state, add prompts on a frame, then propagate the masks across the whole video.

python
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)

    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>)

    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
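The body of the propagation loop is left open above. A common pattern is to collect the results into a dictionary keyed by frame index and object ID. The sketch below illustrates that bookkeeping with a stand-in generator and tiny nested lists instead of a real predictor and tensors, so it runs anywhere. `collect_masklets` and `fake_propagation` are hypothetical names, and thresholding the logits at 0.0 is an assumption about how the raw outputs are turned into binary masks.

```python
def collect_masklets(propagation, threshold=0.0):
    """Build {frame_idx: {object_id: binary_mask}} from propagation output.

    `propagation` is any iterable of (frame_idx, object_ids, mask_logits)
    tuples, where mask_logits[i] is a 2D grid of logits for object_ids[i].
    """
    segments = {}
    for frame_idx, object_ids, mask_logits in propagation:
        segments[frame_idx] = {
            object_id: [[value > threshold for value in row] for row in mask_logits[i]]
            for i, object_id in enumerate(object_ids)
        }
    return segments


def fake_propagation():
    """Stand-in for predictor.propagate_in_video(state) with 2x2 'masks'."""
    yield 0, [1, 2], [[[3.0, -1.0], [0.5, -2.0]], [[-4.0, 2.0], [-0.1, 1.5]]]
    yield 1, [1, 2], [[[2.0, 1.0], [-0.5, -2.0]], [[-4.0, -2.0], [0.3, 1.5]]]


segments = collect_masklets(fake_propagation())
print(segments[0][1])  # binary mask for object 1 on frame 0
```

With the real video predictor the masks are tensors, so the per-object thresholding would be done with tensor operations rather than nested list comprehensions, but the frame-and-object indexing stays the same.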

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Speed up dataset annotation by letting annotators click an object and get a clean mask instead of drawing by hand
  • Track a selected object across the frames of a video for editing, effects, or analysis
  • Build interactive segmentation into a tool where a user clicks or drags a box to select things
  • Fine-tune the model on a custom domain using the released training code

How Segment Anything 2 (SAM 2) compares

Segment Anything 2 (SAM 2) alongside other open-source vision & understanding tools AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
PaddleOCR | ★ 83.1k | A toolkit for detecting and recognizing text in images across many languages, plus document parsing.
Ultralytics YOLO | ★ 58.6k | A framework for training and running YOLO models for real-time object detection, segmentation, and tracking.
Supervision | ★ 44.7k | A Python toolkit for processing, annotating, and visualizing detections and segmentations from many vision models.
MMDetection | ★ 32.8k | An OpenMMLab toolbox with many object detection and instance segmentation algorithms for research and production.
Segment Anything 2 (SAM 2) | ★ 19.4k | Prompt, segment, and track any object across images and video frames.
Grounded-SAM | ★ 17.6k | A pipeline that combines Grounding DINO and Segment Anything to detect and segment objects from text prompts.
DINOv3 | ★ 10.7k | Meta's self-supervised vision backbone that produces general-purpose image features for many downstream tasks.
Segment Anything 3 (SAM 3) | ★ 10.6k | Meta's segmentation model that detects, segments, and tracks objects in images and video from text or visual prompts.