Overview
ggml is a tensor library for machine learning written in C. It provides the low-level building blocks (tensors, operations, automatic differentiation, and optimizers) needed to run and train models without pulling in third-party dependencies. It is the engine behind well-known local projects such as llama.cpp and whisper.cpp.
It targets developers who want to run models on their own machines across a range of platforms. Integer quantization and broad hardware support let you fit larger models into limited memory, while the design avoids memory allocations during runtime to keep inference predictable.
Within the local-runtimes category, ggml sits at the bottom layer. Rather than a ready-made server, it is the foundation that higher-level runtimes build on, so you reach for it when you need direct control over tensor computation or want to add model support yourself.
What it does
- Low-level, cross-platform implementation in plain C
- Integer quantization support to shrink model memory use
- Broad hardware support across CPUs and GPU backends
- Automatic differentiation with ADAM and L-BFGS optimizers
- No third-party dependencies
- Zero memory allocations during runtime
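To make the quantization bullet concrete, here is a minimal sketch of block-wise 8-bit integer quantization in plain Python. It is an illustration of the general idea, not ggml's actual code or file format: the block size of 32, the 2-byte scale, and every function name below are assumptions chosen for this example.

```python
BLOCK_SIZE = 32  # values per block (assumption for this sketch)


def quantize_block(values):
    """Quantize one block of floats to int8 values that share a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp into the signed 8-bit range.
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def quantize(values):
    """Split a flat list of floats into blocks and quantize each block."""
    return [
        quantize_block(values[start:start + BLOCK_SIZE])
        for start in range(0, len(values), BLOCK_SIZE)
    ]


def dequantize(blocks):
    """Recover approximate floats by multiplying each int8 by its block scale."""
    out = []
    for scale, quants in blocks:
        out.extend(scale * q for q in quants)
    return out


def packed_size_bytes(blocks):
    """Storage estimate: a 2-byte scale per block plus 1 byte per value."""
    return sum(2 + len(quants) for _, quants in blocks)


# Deterministic toy "weights" in the range [-1, 1].
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
blocks = quantize(weights)
restored = dequantize(blocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"float32 size:   {4 * len(weights)} bytes")
print(f"quantized size: {packed_size_bytes(blocks)} bytes")
print(f"max round-trip error: {max_error:.4f}")
```

Under these assumptions, 64 weights take 256 bytes as 32-bit floats and 68 bytes once quantized, at the cost of a small rounding error per value. Keeping one scale per small block rather than one for the whole tensor limits how far a single outlier can degrade precision elsewhere.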
Getting started
Clone the repository, set up the Python tooling, then build the example programs with CMake.
Clone and set up dependencies
Clone the repo and install the Python requirements inside a virtual environment.
git clone https://github.com/ggml-org/ggml
cd ggml
# install python dependencies in a virtual environment
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Build the examples
Configure and compile the bundled example programs in release mode.
mkdir build && cd build
cmake ..
cmake --build . --config Release -j 8
Run GPT-2 inference
Download the GPT-2 small 117M model and run a prompt through the backend example.
# run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2-backend -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Running language or speech models locally on CPU or GPU without cloud services
- Building a custom runtime or adding support for a new model architecture
- Shrinking models with integer quantization to fit limited memory
- Experimenting with on-device training using the built-in autodiff and optimizers
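For readers new to the training use case, the sketch below shows what "autodiff plus an optimizer" means in the smallest possible form: a scalar computation graph with reverse-mode differentiation, driven by an Adam update loop. It is a conceptual illustration only and does not use or mirror ggml's API; the `Value` class, `adam_minimize`, and the hyperparameters are all hypothetical names and choices made for this example.

```python
import math


class Value:
    """A scalar graph node that remembers how to pass gradients to its inputs."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = None  # set by the operation that created this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        """Run reverse-mode differentiation from this node back to the leaves."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Reversed post-order visits every node before any of its inputs.
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()


def loss_fn(w):
    """Squared distance from the target value 3.0, built from graph operations."""
    return (w + Value(-3.0)) * (w + Value(-3.0))


def adam_minimize(fn, w0, steps=300, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize fn over one scalar parameter using the Adam update rule."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        param = Value(w)
        fn(param).backward()
        g = param.grad
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


fitted = adam_minimize(loss_fn, w0=0.0)
print(f"fitted parameter: {fitted:.4f}")
```

A tensor library applies the same two ideas at scale: each operation records its inputs while the graph is built, a backward pass accumulates gradients in reverse order, and the optimizer consumes those gradients to update the parameters.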
How ggml compares
How ggml compares with the other open-source local-runtime tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use. |
| GPT4All | ★ 77.4k | A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware. |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS. |
| MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation. |
| ggml | ★ 14.8k | A low-level C tensor library for machine learning on local hardware. |