KServe

Deploy and autoscale AI model inference on Kubernetes across many frameworks

github.com/kserve/kserve★ 5.6k kserve.github.io/website

Overview

KServe is an open-source platform for running AI model inference on Kubernetes. It defines a custom resource called InferenceService that wraps a model and handles deployment, networking, and autoscaling, so you describe what you want in YAML instead of wiring up servers by hand. It is a Cloud Native Computing Foundation (CNCF) incubating project.

It covers two kinds of workloads in one place. For generative AI it serves large language models through optimized backends like vLLM and exposes an OpenAI-compatible API. For predictive AI it serves models from TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, and others, with routing between a predictor, transformer, and explainer.

It fits the high-throughput serving category for teams that already run Kubernetes and want a standard way to ship many models. Request-based autoscaling and scale-to-zero help keep GPU and other costs down when traffic is low.

What it does

Serves LLMs through optimized backends such as vLLM and llm-d, with an OpenAI-compatible inference protocol
Native Hugging Face model support, with model caching and KV cache offloading to CPU or disk for longer sequences
Multi-framework predictive serving: TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, and more
Request-based autoscaling with scale-to-zero to reduce cost on idle GPUs and other resources
Canary rollouts, inference pipelines, and ensembles via InferenceGraph, with routing across predictor, transformer, and explainer
Built-in model explainability plus payload logging, outlier, adversarial, and drift detection

Getting started

KServe runs on a Kubernetes cluster (version 1.32 or higher). The quick install script sets up CRDs, controllers, and serving runtimes; you then apply an InferenceService to deploy a model.

Install KServe (standard mode)

Run the quick install script against your cluster. This is the lightweight install; use the Knative mode script instead if you need scale-to-zero and canary deployments.

bashbash

curl -sL "https://github.com/kserve/kserve/releases/download/v0.18.0/kserve-standard-mode-full-install-with-manifests.sh" | bash

Deploy a model with an InferenceService

Apply a YAML manifest describing your model. This example serves a Hugging Face LLM from a Qwen model on a GPU.

yamlyaml

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "qwen-llm"
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=qwen
      storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      resources:
        limits:
          cpu: "2"
          memory: 6Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"

Query the endpoint

Once the service is ready, resolve its hostname and send a request to the OpenAI-compatible chat completions endpoint.

bashbash

SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-llm -n kserve-test -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions" -d @./chat-input.json

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Serving open-weight LLMs (such as Hugging Face models) behind an OpenAI-compatible API on your own Kubernetes cluster
Deploying many predictive ML models from mixed frameworks (PyTorch, scikit-learn, XGBoost, ONNX) under one consistent interface
Cutting GPU cost for spiky or low-traffic models with request-based autoscaling and scale-to-zero
Rolling out new model versions safely using canary traffic splitting, and building multi-step inference pipelines with InferenceGraph

How KServe compares

KServe alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
KServe	★ 5.6k	Deploy and autoscale AI model inference on Kubernetes across many frameworks

// Overview

// What it does

// Getting started