OpenVINO

Optimize and deploy deep learning models across Intel CPU, GPU, and NPU

github.com/openvinotoolkit/openvino · ★ 10.4k · docs.openvino.ai

Overview

OpenVINO is an open-source toolkit from Intel for optimizing and deploying deep learning models. You bring a trained model, convert it into the OpenVINO format, and run inference faster on the hardware you already have.

It works with models from popular frameworks and formats such as PyTorch, TensorFlow, ONNX, Keras, PaddlePaddle, and JAX/Flax, and you can pull models straight from the Hugging Face Hub using Optimum Intel.

The same model can run on x86 and ARM CPUs, integrated and discrete Intel GPUs, and Intel NPUs, so you can move from edge devices to the cloud without rewriting your pipeline.

What it does

Inference optimization for computer vision, speech recognition, generative AI, and natural language tasks
Flexible model support from PyTorch, TensorFlow, ONNX, Keras, PaddlePaddle, and JAX/Flax
Runs inference on CPU (x86, ARM), integrated and discrete Intel GPU, and Intel NPU accelerators
APIs in C++, Python, C, and Node.js, plus a GenAI API for optimized model pipelines
Direct integration with Hugging Face models through Optimum Intel, and with PyTorch through torch.compile
A broad ecosystem including NNCF for model compression and the OpenVINO Model Server for serving

Getting started

OpenVINO installs with a single pip command. After installation, you load a trained model, convert it to the OpenVINO format, compile it for a device, and run inference.

Install OpenVINO

Install the Python package with pip. This is the quickest way to get started.

sh

pip install -U openvino
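
To confirm the install worked, a quick check along these lines may help. This is a minimal sketch, not part of the project's quick start: the helper name `openvino_installed` is made up for illustration, and the devices reported depend on the drivers present on your machine.

```python
import importlib.util


def openvino_installed() -> bool:
    """Hypothetical helper: True if the `openvino` package can be found."""
    return importlib.util.find_spec("openvino") is not None


if openvino_installed():
    import openvino as ov

    # Report the installed version and the devices the runtime can see.
    print("OpenVINO version:", ov.get_version())
    print("Available devices:", ov.Core().available_devices)
else:
    print("OpenVINO is not installed; run: pip install -U openvino")
```

If only `CPU` is listed on a machine that has an Intel GPU or NPU, the device drivers are the likely place to look first.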

Convert and run a PyTorch model

Load a PyTorch model, convert it to an OpenVINO model, compile it for the CPU, and run inference on it.

python

import openvino as ov
import torch
import torchvision

# load PyTorch model into memory
model = torch.hub.load("pytorch/vision", "shufflenet_v2_x1_0", weights="DEFAULT")

# convert the model into OpenVINO model
example = torch.randn(1, 3, 224, 224)
ov_model = ov.convert_model(model, example_input=(example,))

# compile the model for CPU device
core = ov.Core()
compiled_model = core.compile_model(ov_model, 'CPU')

# infer the model on random data
output = compiled_model({0: example.numpy()})

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
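
The PyTorch example stops at `output`. The result of calling a compiled model can be indexed by output position, so `output[0]` would give a NumPy array. The sketch below shows one way to turn such an array into top-k class predictions. It assumes the array holds raw classification logits of shape (1, 1000), which is an assumption about this ShuffleNet model rather than something stated above. The helper `top_k` and the stand-in array `fake_logits` are illustrative names only.

```python
import numpy as np


def top_k(logits: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (class_index, probability) pairs for the k highest-scoring classes."""
    scores = logits.reshape(-1).astype(np.float64)
    # Numerically stable softmax: subtract the max before exponentiating.
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    # Indices of the k largest probabilities, highest first.
    best = np.argsort(probs)[::-1][:k]
    return [(int(i), float(probs[i])) for i in best]


# Stand-in for `output[0]` from the example above (assumed shape: 1 x 1000).
fake_logits = np.zeros((1, 1000), dtype=np.float32)
fake_logits[0, 42] = 10.0
fake_logits[0, 7] = 5.0

print(top_k(fake_logits, k=2))
```

With a real model you would pass `output[0]` instead of `fake_logits` and map the class indices to labels using the label file that matches the model's training data.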

When to use it

Speeding up inference for computer vision, speech, and language models on Intel hardware
Deploying the same model across CPU, GPU, and NPU from edge devices to the cloud
Running LLMs and generative AI pipelines locally using the OpenVINO GenAI API
Optimizing and serving Hugging Face models through Optimum Intel without the original framework
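
Deploying one model across CPU, GPU, and NPU usually comes down to choosing a device name at compile time. The sketch below shows one possible selection policy. The helper `pick_device` and its preference order are assumptions made for illustration, not part of the OpenVINO API. In real use, the `available` list would come from `ov.Core().available_devices`.

```python
def pick_device(available, preferred=("NPU", "GPU", "CPU")):
    """Hypothetical helper: return the first available device in preference order.

    Device names may carry an index suffix such as "GPU.0", so a name matches
    a family if it equals the family or starts with the family plus a dot.
    """
    for family in preferred:
        for name in available:
            if name == family or name.startswith(family + "."):
                return name
    raise RuntimeError(f"None of {preferred} found in {list(available)}")


# Example device lists, written by hand for illustration.
print(pick_device(["CPU", "GPU.0", "GPU.1"]))      # picks the first GPU
print(pick_device(["CPU"]))                        # falls back to CPU
print(pick_device(["CPU", "GPU"], ("CPU", "GPU"))) # custom preference order
```

The returned name can be passed as the second argument to `core.compile_model`. OpenVINO also accepts the device name `"AUTO"`, which leaves the choice to the runtime; a hand-written policy like this one is mainly useful when you want the selection to be explicit.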

How OpenVINO compares

OpenVINO alongside the other open-source serving and deployment tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	Optimize and deploy deep learning models across Intel CPU, GPU, and NPU
LMCache	★ 9.4k	A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation.
