AI/TLDR

KServe

Deploy and autoscale AI model inference on Kubernetes across many frameworks

Overview

KServe is an open-source platform for running AI model inference on Kubernetes. It defines a custom resource called InferenceService that wraps a model and handles deployment, networking, and autoscaling, so you describe what you want in YAML instead of wiring up servers by hand. It is a Cloud Native Computing Foundation (CNCF) incubating project.

It covers two kinds of workloads in one place. For generative AI it serves large language models through optimized backends like vLLM and exposes an OpenAI-compatible API. For predictive AI it serves models from TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, and others, with routing between a predictor, transformer, and explainer.

It fits the high-throughput serving category for teams that already run Kubernetes and want a standard way to ship many models. Request-based autoscaling and scale-to-zero help keep GPU and other costs down when traffic is low.

What it does

  • Serves LLMs through optimized backends such as vLLM and llm-d, with an OpenAI-compatible inference protocol
  • Native Hugging Face model support, with model caching and KV cache offloading to CPU or disk for longer sequences
  • Multi-framework predictive serving: TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, and more
  • Request-based autoscaling with scale-to-zero to reduce cost on idle GPUs and other resources
  • Canary rollouts, inference pipelines, and ensembles via InferenceGraph, with routing across predictor, transformer, and explainer
  • Built-in model explainability plus payload logging, outlier, adversarial, and drift detection

Getting started

KServe runs on a Kubernetes cluster (version 1.32 or higher). The quick install script sets up CRDs, controllers, and serving runtimes; you then apply an InferenceService to deploy a model.

Install KServe (standard mode)

Run the quick install script against your cluster. This is the lightweight install; use the Knative mode script instead if you need scale-to-zero and canary deployments.

bashbash
curl -sL "https://github.com/kserve/kserve/releases/download/v0.18.0/kserve-standard-mode-full-install-with-manifests.sh" | bash

Deploy a model with an InferenceService

Apply a YAML manifest describing your model. This example serves a Hugging Face LLM from a Qwen model on a GPU.

yamlyaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "qwen-llm"
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=qwen
      storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct"
      resources:
        limits:
          cpu: "2"
          memory: 6Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"

Query the endpoint

Once the service is ready, resolve its hostname and send a request to the OpenAI-compatible chat completions endpoint.

bashbash
SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-llm -n kserve-test -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions" -d @./chat-input.json

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Serving open-weight LLMs (such as Hugging Face models) behind an OpenAI-compatible API on your own Kubernetes cluster
  • Deploying many predictive ML models from mixed frameworks (PyTorch, scikit-learn, XGBoost, ONNX) under one consistent interface
  • Cutting GPU cost for spiky or low-traffic models with request-based autoscaling and scale-to-zero
  • Rolling out new model versions safely using canary traffic splitting, and building multi-step inference pipelines with InferenceGraph

How KServe compares

KServe alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

ToolStarsWhat it does
Transformers★ 162kHugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM★ 83.4kA high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang★ 29.3kA serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM★ 13.9kNVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM★ 12.4kA tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server★ 10.8kA multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO★ 10.4kAn open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
KServe★ 5.6kDeploy and autoscale AI model inference on Kubernetes across many frameworks