ggml

Low-level C tensor library for machine learning on local hardware

Overview

ggml is a tensor library for machine learning written in C. It provides the low-level building blocks (tensors, operations, automatic differentiation, and optimizers) needed to run and train models without pulling in third-party dependencies. It is the engine behind well-known local-inference projects such as llama.cpp and whisper.cpp.

It targets developers who want to run models on their own machines across a range of platforms. Integer quantization lets you fit larger models into limited memory, broad hardware support covers CPUs and GPU backends, and the design avoids memory allocations at runtime to keep inference predictable.
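To make the memory saving from integer quantization concrete, here is a minimal sketch of block-wise 8-bit quantization in plain C. It is an illustration only: the names `block_q8`, `quantize_block`, and `dequantize_block`, the block size of 32, and the float scale are assumptions of this sketch, and ggml's real quantization formats differ in detail and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 32 /* values per block; an assumption chosen for illustration */

/* One quantized block: a shared scale plus BLOCK_SIZE signed 8-bit values. */
typedef struct {
    float  scale;
    int8_t q[BLOCK_SIZE];
} block_q8;

/* Quantize BLOCK_SIZE floats into one block using symmetric scaling. */
void quantize_block(const float *x, block_q8 *out) {
    assert(x != NULL && out != NULL);

    /* Find the largest magnitude so it maps to +/-127. */
    float amax = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }

    out->scale = amax / 127.0f;
    float inv = amax > 0.0f ? 127.0f / amax : 0.0f;

    for (int i = 0; i < BLOCK_SIZE; i++) {
        float v = x[i] * inv;
        /* Round to the nearest integer, then clamp to the int8 range used here. */
        int r = (int) (v >= 0.0f ? v + 0.5f : v - 0.5f);
        if (r > 127) r = 127;
        if (r < -127) r = -127;
        out->q[i] = (int8_t) r;
    }
}

/* Recover approximate floats from a quantized block. */
void dequantize_block(const block_q8 *in, float *y) {
    assert(in != NULL && y != NULL);
    for (int i = 0; i < BLOCK_SIZE; i++) {
        y[i] = in->scale * (float) in->q[i];
    }
}
```

With 4-byte floats, 32 weights take 128 bytes unquantized, while one `block_q8` takes 36 bytes on typical platforms, roughly 3.5 times smaller. The price is a rounding error of up to about half of `scale` per value.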

Within the local-runtimes category, ggml sits at the bottom layer. Rather than a ready-made server, it is the foundation that higher-level runtimes build on, so you reach for it when you need direct control over tensor computation or want to add model support yourself.

What it does

Low-level, cross-platform implementation in plain C
Integer quantization support to shrink model memory use
Broad hardware support across CPUs and GPU backends
Automatic differentiation with ADAM and L-BFGS optimizers
No third-party dependencies
Zero memory allocations during runtime
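One common way to avoid allocations at runtime is to reserve a single buffer up front and hand out pieces of it with a bump allocator. The sketch below shows that general technique in plain C. The `arena` type and its functions are hypothetical names invented for this example, and the sketch makes no claim about how ggml implements this internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A fixed-size arena: all memory is reserved once, before inference starts. */
typedef struct {
    uint8_t *base; /* start of the caller-provided buffer */
    size_t   size; /* total bytes available */
    size_t   used; /* bytes handed out so far */
} arena;

/* Wrap a caller-provided buffer; no heap allocation happens here. */
void arena_init(arena *a, void *buffer, size_t size) {
    assert(a != NULL && buffer != NULL);
    a->base = (uint8_t *) buffer;
    a->size = size;
    a->used = 0;
}

/* Hand out n bytes aligned to align (a power of two), or NULL when full. */
void *arena_alloc(arena *a, size_t n, size_t align) {
    assert(a != NULL);
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t start   = (uintptr_t) a->base + a->used;
    uintptr_t aligned = (start + (align - 1)) & ~(uintptr_t) (align - 1);
    size_t    offset  = (size_t) (aligned - (uintptr_t) a->base);

    if (offset > a->size || n > a->size - offset) {
        return NULL; /* out of space: the caller sized the buffer too small */
    }
    a->used = offset + n;
    return a->base + offset;
}

/* Reuse the whole buffer for the next run without freeing anything. */
void arena_reset(arena *a) {
    assert(a != NULL);
    a->used = 0;
}
```

Because every request is served from the same preallocated buffer, memory use is bounded by the buffer size chosen at startup, and running out of space shows up as a `NULL` return instead of an unexpected allocation in the middle of inference.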

Getting started

Clone the repository, set up the Python tooling, then build the example programs with CMake.

Clone and set up dependencies

Clone the repo and install the Python requirements inside a virtual environment.


git clone https://github.com/ggml-org/ggml
cd ggml

# install python dependencies in a virtual environment
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Build the examples

Configure and compile the bundled example programs in release mode.


mkdir build && cd build
cmake ..
cmake --build . --config Release -j 8

Run GPT-2 inference

Download the GPT-2 small 117M model and run a prompt through the backend example.


# run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2-backend -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

Commands and code are distilled from the project's own documentation; always check the official repository for the latest version.

When to use it

Running language or speech models locally on CPU or GPU without cloud services
Building a custom runtime or adding support for a new model architecture
Shrinking models with integer quantization to fit limited memory
Experimenting with on-device training using the built-in autodiff and optimizers
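For readers new to automatic differentiation, the idea behind the last item can be sketched with a tiny scalar computation graph and a reverse-mode backward pass. This is a teaching sketch in plain C, not ggml's API: the `graph` and `node` types and the `graph_leaf`, `graph_binary`, and `graph_backward` functions are hypothetical names, only addition and multiplication are supported, and values are scalars instead of tensors.

```c
#include <assert.h>
#include <stddef.h>

enum op { OP_LEAF, OP_ADD, OP_MUL };

/* One scalar node in the computation graph. */
typedef struct {
    enum op op;
    int     a, b;  /* indices of the input nodes (-1 for leaves) */
    float   value; /* result of the forward computation */
    float   grad;  /* d(output)/d(this node), filled by graph_backward */
} node;

#define MAX_NODES 16

/* Nodes are stored in creation order, so inputs always precede their users. */
typedef struct {
    node nodes[MAX_NODES];
    int  count;
} graph;

/* Add an input or parameter node holding a constant value. */
int graph_leaf(graph *g, float value) {
    assert(g != NULL && g->count < MAX_NODES);
    node n = { OP_LEAF, -1, -1, value, 0.0f };
    g->nodes[g->count] = n;
    return g->count++;
}

/* Add a node computing a + b or a * b; the value is computed immediately. */
int graph_binary(graph *g, enum op op, int a, int b) {
    assert(g != NULL && g->count < MAX_NODES);
    assert(op == OP_ADD || op == OP_MUL);
    assert(a >= 0 && a < g->count && b >= 0 && b < g->count);
    float va = g->nodes[a].value;
    float vb = g->nodes[b].value;
    node n = { op, a, b, op == OP_ADD ? va + vb : va * vb, 0.0f };
    g->nodes[g->count] = n;
    return g->count++;
}

/* Reverse-mode pass: walk from out back to the leaves, applying the chain rule. */
void graph_backward(graph *g, int out) {
    assert(g != NULL && out >= 0 && out < g->count);
    for (int i = 0; i < g->count; i++) g->nodes[i].grad = 0.0f;
    g->nodes[out].grad = 1.0f;
    for (int i = out; i >= 0; i--) {
        node *n = &g->nodes[i];
        if (n->op == OP_ADD) {
            g->nodes[n->a].grad += n->grad;
            g->nodes[n->b].grad += n->grad;
        } else if (n->op == OP_MUL) {
            g->nodes[n->a].grad += n->grad * g->nodes[n->b].value;
            g->nodes[n->b].grad += n->grad * g->nodes[n->a].value;
        }
    }
}
```

Building `y = w * x + b` from three leaves and calling `graph_backward` on `y` leaves the gradient of `y` with respect to each leaf in its `grad` field. An optimizer then uses those gradients to update the parameters; the update rule itself (such as the ADAM and L-BFGS optimizers listed above) is left out of this sketch.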

How ggml compares

ggml alongside the other open-source local-runtime tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
ggml	★ 14.8k	A low-level C tensor library for machine learning on local hardware.
