Overview
DINOv3 is a family of vision foundation models from Meta AI Research (FAIR). It is trained with self-supervised learning, so it learns from large amounts of unlabeled images instead of hand-labeled datasets. The result is a backbone that turns any image into general-purpose feature vectors you can reuse across many downstream tasks.
It is meant for computer vision developers and researchers who need strong image representations without training a model from scratch. The released backbones range from a 21M-parameter ViT-S/16 up to a 6.7B-parameter ViT-7B/16, plus ConvNeXt variants, so you can trade off accuracy against compute. The same frozen features can drive image classification, semantic segmentation, monocular depth estimation, retrieval, and dense matching.
As a multimodal/computer-vision tool, DINOv3 sits at the feature-extraction layer of a vision pipeline. You load a pretrained backbone, run images through it to get patch-level or global features, and attach a lightweight head (often a linear probe) for your specific task. The models are also available through the Hugging Face Hub, Transformers, and timm.
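The "frozen backbone plus lightweight head" pattern described above can be sketched as follows. This is a minimal illustration, not the project's training code: `StandInBackbone` and `LinearProbe` are hypothetical names, and the stand-in assumes the real backbone returns one global feature vector of size `embed_dim` per image.

```python
import torch
from torch import nn


class StandInBackbone(nn.Module):
    """Hypothetical placeholder for a pretrained backbone.

    Assumption: the real model maps a batch of images (B, 3, H, W)
    to one global feature vector per image (B, embed_dim).
    """

    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.proj = nn.Linear(3, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Average over the spatial dimensions, then project the 3 channel means.
        return self.proj(images.mean(dim=(2, 3)))


class LinearProbe(nn.Module):
    """Frozen backbone followed by a single trainable linear layer."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone.eval()
        for param in self.backbone.parameters():
            param.requires_grad = False  # freeze: only the head is trained
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # no gradients flow into the backbone
            features = self.backbone(images)
        return self.head(features)


probe = LinearProbe(StandInBackbone(embed_dim=32), embed_dim=32, num_classes=5)
logits = probe(torch.randn(2, 3, 64, 64))
trainable = [name for name, p in probe.named_parameters() if p.requires_grad]
```

Only `head.weight` and `head.bias` end up trainable, which is what keeps this kind of probe cheap to fit compared with fine-tuning the whole backbone.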
What it does
- Self-supervised training on the LVD-1689M web image dataset, so no labels are needed to learn the features
- High-quality dense, patch-level features usable for segmentation, depth, and dense correspondence, not just whole-image classification
- Multiple backbone sizes: ViT-S/S+/B/L/H+ distilled models, a 6.7B-parameter ViT-7B/16, and ConvNeXt variants
- Loadable through torch.hub.load() from a local clone of the repo, with the weights passed as a download URL or a local file path
- Also supported by Hugging Face Transformers (>=4.56.0) and PyTorch Image Models / timm (>=1.0.20)
- Reference code and configs for linear segmentation (ADE20K) and linear depth estimation (NYUv2-Depth)
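To make "patch-level features" more concrete: the `/16` in names like ViT-S/16 is the patch size in pixels, so the image is split into a grid of 16x16 patches, each yielding one feature vector. The helper below is a hypothetical illustration of that arithmetic only. It rejects sizes that are not a multiple of the patch size, which is a simplifying assumption of this sketch rather than documented model behavior.

```python
def patch_grid(height: int, width: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch_size, width // patch_size


# A 256x256 input with 16-pixel patches gives a 16x16 grid.
rows, cols = patch_grid(256, 256)
num_patches = rows * cols  # one dense feature vector per patch
```

For dense tasks such as segmentation or depth, these per-patch vectors are typically reshaped back into the `rows x cols` grid before a head is applied.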
Getting started
Set up the environment, request the pretrained weights from Meta, then load a backbone and extract features from an image. Note that the reference repo targets a Linux environment with PyTorch >= 2.7.1.
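Once an environment exists, the version floors mentioned in this guide (PyTorch >= 2.7.1, Transformers >= 4.56.0, timm >= 1.0.20) can be checked with a few lines of standard-library Python. This is a rough sketch: `parse_version` and `meets_minimum` are hypothetical helpers that compare only the leading numeric release part, so pre-release and local suffixes are ignored.

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Keep the leading numeric release part, e.g. '2.7.1+cu126' -> (2, 7, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed release is at least the minimum release."""
    return parse_version(installed) >= parse_version(minimum)


# Minimum versions as stated in this guide.
MINIMUMS = {"torch": "2.7.1", "transformers": "4.56.0", "timm": "1.0.20"}
```

In a real environment the installed string can be read with `importlib.metadata.version("torch")` and passed to `meets_minimum` together with the matching entry in `MINIMUMS`.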
Create the environment
Clone the repo and create the provided conda/micromamba environment, which installs PyTorch and the other dependencies.
micromamba env create -f conda.yaml
micromamba activate dinov3
Get the model weights
Request access on the DINOv3 downloads page. Once your request is accepted, you receive URLs for the backbone and adapter weights, which you pass to torch.hub.load() either directly as a URL or as the path to a downloaded file. Download the files with wget rather than a web browser.
Load a backbone and extract features
Load a backbone from your local clone, then run a normalized image through it to get features.
import torch
from PIL import Image
from torchvision.transforms import v2
REPO_DIR = "<PATH/TO/DINOV3/REPO>"
# Load a DINOv3 ViT-L/16 backbone from the local clone
dinov3_vitl16 = torch.hub.load(
    REPO_DIR,
    'dinov3_vitl16',
    source='local',
    weights="<CHECKPOINT/URL/OR/PATH>",
)
# Open the image; convert to RGB so the 3-channel normalization below applies
img = Image.open("image.jpg").convert("RGB")
# Resize, scale pixel values to [0, 1], then normalize with ImageNet mean/std
transform = v2.Compose([
    v2.ToImage(),
    v2.Resize((256, 256), antialias=True),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=(0.485, 0.456, 0.406),
                 std=(0.229, 0.224, 0.225)),
])
# Extract features; [None] adds the batch dimension
with torch.inference_mode():
    features = dinov3_vitl16(transform(img)[None])
Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.
When to use it
- Extract reusable image embeddings as input to a custom classifier or downstream model without training a vision backbone yourself
- Build semantic segmentation or monocular depth estimation by adding a linear probe on top of frozen DINOv3 features
- Power image retrieval and similarity search using the global feature vectors
- Use dense patch features for correspondence and matching, including specialized domains like satellite or microscopy imagery via the metadata-guided recipes
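The retrieval use case in the list above usually comes down to comparing global feature vectors by cosine similarity. Here is a minimal pure-Python sketch of that ranking step. The three-dimensional vectors and file names are made-up toy data standing in for real backbone outputs, and `rank_gallery` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy stand-ins for global feature vectors (real ones have far more dimensions).
gallery = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "dog_1.jpg": [0.1, 0.9, 0.1],
    "car_1.jpg": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]
ranking = rank_gallery(query, gallery)
```

With large galleries the same idea is normally run as a batched matrix product over L2-normalized features, or delegated to a vector index, rather than a Python loop.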
How DINOv3 compares
DINOv3 alongside the other open-source vision and understanding tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| PaddleOCR | ★ 83.1k | A toolkit for detecting and recognizing text in images across many languages, plus document parsing. |
| Ultralytics YOLO | ★ 58.6k | A framework for training and running YOLO models for real-time object detection, segmentation, and tracking. |
| Supervision | ★ 44.7k | A Python toolkit for processing, annotating, and visualizing detections and segmentations from many vision models. |
| MMDetection | ★ 32.8k | An OpenMMLab toolbox with many object detection and instance segmentation algorithms for research and production. |
| Segment Anything 2 (SAM 2) | ★ 19.4k | Meta's model for segmenting and tracking any object across images and video frames from clicks or boxes. |
| Grounded-SAM | ★ 17.6k | A pipeline that combines Grounding DINO and Segment Anything to detect and segment objects from text prompts. |
| DINOv3 | ★ 10.7k | Self-supervised vision backbones that produce general-purpose image features without fine-tuning. |
| Segment Anything 3 (SAM 3) | ★ 10.6k | Meta's segmentation model that detects, segments, and tracks objects in images and video from text or visual prompts. |