KTransformers

Run large Mixture-of-Experts models locally by splitting work across CPU and GPU

github.com/kvcache-ai/ktransformers (★ 17.3k) | kvcache-ai.github.io/ktransformers

Overview

KTransformers is a framework for efficient inference and fine-tuning of large language models using CPU-GPU heterogeneous computing. Instead of requiring all model weights to sit in GPU memory, it splits the work so that some experts and layers run on the CPU while the GPU handles the rest. This lets you run very large Mixture-of-Experts (MoE) models on machines with limited VRAM.

It is aimed at developers and researchers who want to run or fine-tune large MoE models, such as DeepSeek-V3/R1 and Kimi-K2, on a single workstation or server rather than a large GPU cluster. The project exposes two main capabilities: high-performance inference, built from its kt-kernel source tree, and supervised fine-tuning (SFT) through an integration with LLaMA-Factory.

As a local runtime, it focuses on the CPU side of inference: AMX and AVX-optimized kernels, NUMA-aware memory management, and INT4/INT8 quantization. It also provides a Python API so it can plug into serving frameworks like SGLang.
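To make the quantization idea concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python. It is only an illustration of the general technique (store small integers plus one scale factor, then multiply back at compute time); the function names are hypothetical and this is not KTransformers' kernel code, which runs in optimized AMX/AVX routines.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric INT8 quantization of one weight group: w ~= q * scale.

    Hypothetical helper for illustration; not part of the kt-kernel API.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero group.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Each restored weight differs from the original by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

INT4 follows the same pattern with a narrower integer range, trading more rounding error for half the memory per weight.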

What it does

CPU-GPU heterogeneous inference: places hot experts on the GPU and cold experts on the CPU so that large MoE models fit in limited VRAM
AMX/AVX acceleration: Intel AMX and AVX512/AVX2 optimized kernels for INT4/INT8 quantized inference
MoE optimization with NUMA-aware memory management for multi-socket CPU systems
Quantization support: CPU-side INT4/INT8 weights and GPU-side GPTQ
Clean Python API for integration with SGLang and other serving frameworks
Supervised fine-tuning (SFT) of ultra-large MoE models via LLaMA-Factory integration
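The hot/cold expert split in the first item can be sketched as a simple ranking problem. The example below assumes you already have per-expert activation counts and a fixed number of expert slots that fit in VRAM; both inputs, and the `place_experts` name, are hypothetical. KTransformers' actual placement policy may differ, so treat this as an illustration of the idea only.

```python
from typing import Dict, List, Tuple


def place_experts(
    activation_counts: Dict[int, int], gpu_slots: int
) -> Tuple[List[int], List[int]]:
    """Put the most frequently activated experts on the GPU, the rest on the CPU.

    Hypothetical helper: activation_counts maps expert id -> how often it was
    routed to; gpu_slots is how many experts fit in the available VRAM.
    """
    gpu_slots = max(0, gpu_slots)
    # Rank by descending activation count; break ties by expert id so the
    # result is deterministic.
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    gpu_experts = sorted(ranked[:gpu_slots])
    cpu_experts = sorted(ranked[gpu_slots:])
    return gpu_experts, cpu_experts


counts = {0: 120, 1: 5, 2: 300, 3: 42, 4: 7, 5: 260}
gpu_experts, cpu_experts = place_experts(counts, gpu_slots=2)
print("GPU:", gpu_experts)
print("CPU:", cpu_experts)
```

With these made-up counts, experts 2 and 5 are the hottest and land on the GPU, while the remaining four stay in CPU memory.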

Getting started

The inference capability lives in the kt-kernel directory of the repository, which you build and install with pip.

Get the repository

Clone the KTransformers repository so you have the kt-kernel source tree locally.


git clone https://github.com/kvcache-ai/ktransformers.git

Build and install kt-kernel

Enter the kt-kernel directory and install it with pip, which compiles the CPU-optimized kernels.


cd ktransformers/kt-kernel
pip install .

Pick your model and backend

Follow the kt-kernel documentation and the per-model tutorials (for example DeepSeek-R1/V3 or Kimi-K2) to choose quantized weights and a CPU backend (AMX or AVX2) that matches your hardware.
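If you are unsure which CPU backend your machine supports, a quick check of the CPU feature flags can help. The sketch below assumes a Linux x86 host, where `/proc/cpuinfo` lists flags such as `amx_tile`, `amx_int8`, and `avx2`. The helper names and the returned labels are hypothetical and do not correspond to kt-kernel options; consult the kt-kernel documentation for how a backend is actually selected.

```python
from pathlib import Path
from typing import Set


def read_cpu_flags(cpuinfo_path: str = "/proc/cpuinfo") -> Set[str]:
    """Return the CPU feature flags from a Linux cpuinfo file (empty set if absent)."""
    path = Path(cpuinfo_path)
    if not path.exists():
        return set()
    for line in path.read_text().splitlines():
        # On x86 Linux the line looks like "flags : fpu vme ... avx2 ...".
        if line.startswith("flags"):
            return set(line.partition(":")[2].split())
    return set()


def pick_backend(flags: Set[str]) -> str:
    """Suggest a backend label from CPU flags (labels are illustrative only)."""
    if {"amx_tile", "amx_int8"} <= flags:
        return "AMX"
    if "avx2" in flags:
        return "AVX2"
    return "unsupported"


print(pick_backend(read_cpu_flags()))
```

On machines without `/proc/cpuinfo`, or on non-x86 CPUs, this prints `unsupported`, which only means the check could not confirm either instruction set.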

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Running large MoE models like DeepSeek-V3/R1 or Kimi-K2 on a workstation with limited GPU VRAM
Serving large models in production by integrating kt-kernel with SGLang
Splitting experts across CPU and GPU (hot experts on GPU, cold experts on CPU) to balance memory and speed
Fine-tuning ultra-large MoE models on local hardware via the LLaMA-Factory integration

How KTransformers compares

KTransformers alongside the other open-source local-runtime tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
KTransformers	★ 17.3k	Run large Mixture-of-Experts models locally by splitting work across CPU and GPU
