Overview
KTransformers is a framework for efficient inference and fine-tuning of large language models using CPU-GPU heterogeneous computing. Instead of requiring all model weights to sit in GPU memory, it splits the work so that some experts and layers run on the CPU while the GPU handles the rest. This lets you run very large Mixture-of-Experts (MoE) models on machines with limited VRAM.
It is aimed at developers and researchers who want to run or fine-tune large MoE models, such as DeepSeek-V3/R1 and Kimi-K2, on a single workstation or server rather than a large GPU cluster. The project offers two main capabilities: high-performance inference, built from its kt-kernel source tree, and supervised fine-tuning (SFT) through an integration with LLaMA-Factory.
As a local runtime, it focuses on the CPU side of inference: AMX- and AVX-optimized kernels, NUMA-aware memory management, and INT4/INT8 quantization. It also provides a Python API so it can plug into serving frameworks like SGLang.
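To make the INT8 quantization mentioned above concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It illustrates the general technique only and is not kt-kernel's implementation; the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT8 quantization, so that w ~= q * scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude onto 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the quantized integers."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.4, -1.0, 0.2, 0.0]
    q, scale = quantize_int8(weights)
    print("quantized:", q)
    print("restored: ", dequantize_int8(q, scale))
```

Each restored weight differs from the original by at most about half of `scale`, which is the trade made for storing one byte per weight instead of two or four.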
What it does
- CPU-GPU heterogeneous inference that places hot experts on the GPU and cold experts on the CPU to fit large MoE models in limited VRAM
- AMX/AVX acceleration: Intel AMX and AVX512/AVX2 optimized kernels for INT4/INT8 quantized inference
- MoE optimization with NUMA-aware memory management for multi-socket CPU systems
- Quantization support: CPU-side INT4/INT8 weights and GPU-side GPTQ
- Clean Python API for integration with SGLang and other serving frameworks
- Supervised fine-tuning (SFT) of ultra-large MoE models via LLaMA-Factory integration
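The hot/cold expert split in the first bullet can be sketched as a ranking problem: given how often each expert is activated and how many experts fit in VRAM, send the most-used ones to the GPU. This is an illustrative sketch, not KTransformers' actual placement logic; `split_hot_cold`, `activation_counts`, and `gpu_slots` are hypothetical names.

```python
from typing import Dict, List, Tuple


def split_hot_cold(
    activation_counts: Dict[int, int], gpu_slots: int
) -> Tuple[List[int], List[int]]:
    """Return (gpu_experts, cpu_experts); the most-activated experts go to the GPU."""
    # Rank by activation count, descending; break ties by expert id for determinism.
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    slots = max(0, gpu_slots)
    return sorted(ranked[:slots]), sorted(ranked[slots:])


if __name__ == "__main__":
    # Hypothetical activation counts collected from a profiling run.
    counts = {0: 5, 1: 90, 2: 40, 3: 1}
    gpu, cpu = split_hot_cold(counts, gpu_slots=2)
    print("GPU experts:", gpu)
    print("CPU experts:", cpu)
```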
Getting started
The inference capability lives in the kt-kernel directory of the repository, which you build and install with pip.
Get the repository
Clone the KTransformers repository so you have the kt-kernel source tree locally.
git clone https://github.com/kvcache-ai/ktransformers.git
Build and install kt-kernel
Enter the kt-kernel directory and install it with pip, which compiles the CPU-optimized kernels.
cd ktransformers/kt-kernel
pip install .
Pick your model and backend
Follow the kt-kernel documentation and the per-model tutorials (for example DeepSeek-R1/V3 or Kimi-K2) to choose quantized weights and a CPU backend (AMX or AVX2) that matches your hardware.
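To see which CPU backend your machine could support, one option on x86 Linux is to read the feature flags in `/proc/cpuinfo`. The sketch below assumes the standard Linux flag names (`amx_tile`, `amx_int8`, `avx512f`, `avx2`); the helper names and the returned labels are hypothetical and are not kt-kernel configuration values.

```python
from typing import Iterable, Set


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Collect CPU feature flags from /proc/cpuinfo-style text (x86 Linux)."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def pick_cpu_backend(flags: Iterable[str]) -> str:
    """Pick the most capable instruction set present, preferring AMX."""
    present = set(flags)
    if {"amx_tile", "amx_int8"} <= present:
        return "AMX"
    if "avx512f" in present:
        return "AVX512"
    if "avx2" in present:
        return "AVX2"
    return "unsupported"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(pick_cpu_backend(parse_cpu_flags(f.read())))
    except FileNotFoundError:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

Treat the result as a starting point and confirm the choice against the kt-kernel documentation for your model.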
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Running large MoE models like DeepSeek-V3/R1 or Kimi-K2 on a workstation with limited GPU VRAM
- Serving large models in production by integrating kt-kernel with SGLang
- Splitting experts across CPU and GPU (hot experts on GPU, cold experts on CPU) to balance memory and speed
- Fine-tuning ultra-large MoE models on local hardware via the LLaMA-Factory integration
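A quick way to judge whether a model falls into the limited-VRAM case is to estimate its weight footprint at different bit-widths. The sketch below is back-of-the-envelope arithmetic only: it ignores quantization scales, the KV cache, and activations. `weight_gib` is a hypothetical helper, and the 671 billion figure is an assumption based on the total parameter count reported for DeepSeek-V3.

```python
def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given parameter count and bit-width."""
    return num_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    # Assumption: roughly the total parameter count reported for DeepSeek-V3.
    total_params = 671e9
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: about {weight_gib(total_params, bits):.0f} GiB")
```

Comparing these totals with your GPU's VRAM and your system RAM shows how much of the model has to live on the CPU side.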
How KTransformers compares
KTransformers alongside other open-source local runtime tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use. |
| GPT4All | ★ 77.4k | A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware. |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS. |
| MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation. |
| KTransformers | ★ 17.3k | A framework that runs large Mixture-of-Experts models locally by splitting work across CPU and GPU. |