Antirez · 2026-05-06 · major
ds4 — Antirez Ships C/Metal Inference Engine for DeepSeek V4 Flash on Apple Silicon
Salvatore Sanfilippo (Redis creator) released a single-purpose Metal graph executor for DeepSeek V4 Flash, with disk-persisted KV cache, OpenAI/Anthropic-compatible APIs, and 1M-token context on a Mac Studio.
Antirez's first AI repo: a DeepSeek V4 Flash-only inference engine in C, built for Apple Silicon with a compressed disk-backed KV cache.
Key specs
| License | MIT |
|---|---|
| Context window | 1M tokens |
| GitHub stars | 397 |
| Min ram for2bit | 128GB |
| Min ram for4bit | 256GB |
What is it?
ds4 is a from-scratch inference engine written in C that runs only DeepSeek V4 Flash on Apple Metal GPUs. It is intentionally narrow — not a generic GGUF runner — so it can ship a Metal graph tuned to that one model. It exposes both a CLI and a server with OpenAI- and Anthropic-compatible HTTP APIs.
How does it work?
The author wrote a Metal graph executor that targets DeepSeek V4 Flash's hybrid Compressed-Sparse / Heavily-Compressed attention layout, then layered a compressed KV cache that spills to disk so 1M-token sessions fit in unified memory. The server serializes Metal calls under concurrent HTTP requests and persists the cache between turns. Tool/function calling is wired up to drive Claude Code, Pi, and OpenCode locally.
Why does it matter?
It pushes the new DeepSeek V4 Flash from a HuggingFace upload into something a single Mac Studio can host as a coding-agent backend — no cloud, no GPU rental. Coming from the creator of Redis, the project has immediate credibility, and the Anthropic/OpenAI API parity means existing agent harnesses point at it with one config line.
Who is it for?
Mac developers running coding agents locally; people experimenting with long-context inference on Apple Silicon.
Try it
git clone https://github.com/antirez/ds4 && make