Antirez · 2026-05-06 · major

ds4 — Antirez Ships C/Metal Inference Engine for DeepSeek V4 Flash on Apple Silicon

Salvatore Sanfilippo (Redis creator) released a single-purpose Metal graph executor for DeepSeek V4 Flash, with disk-persisted KV cache, OpenAI/Anthropic-compatible APIs, and 1M-token context on a Mac Studio.

GitHub social preview of antirez/ds4 — DeepSeek 4 Flash local inference engine for Metal

Antirez's first AI repo: a DeepSeek V4 Flash-only inference engine in C, built for Apple Silicon with a compressed disk-backed KV cache.

Key specs

License	MIT
Context window	1M tokens
GitHub stars	397
Min ram for2bit	128GB
Min ram for4bit	256GB

What is it?

ds4 is a from-scratch inference engine written in C that runs only DeepSeek V4 Flash on Apple Metal GPUs. It is intentionally narrow — not a generic GGUF runner — so it can ship a Metal graph tuned to that one model. It exposes both a CLI and a server with OpenAI- and Anthropic-compatible HTTP APIs.

How does it work?

The author wrote a Metal graph executor that targets DeepSeek V4 Flash's hybrid Compressed-Sparse / Heavily-Compressed attention layout, then layered a compressed KV cache that spills to disk so 1M-token sessions fit in unified memory. The server serializes Metal calls under concurrent HTTP requests and persists the cache between turns. Tool/function calling is wired up to drive Claude Code, Pi, and OpenCode locally.

Why does it matter?

It pushes the new DeepSeek V4 Flash from a HuggingFace upload into something a single Mac Studio can host as a coding-agent backend — no cloud, no GPU rental. Coming from the creator of Redis, the project has immediate credibility, and the Anthropic/OpenAI API parity means existing agent harnesses point at it with one config line.

Who is it for?

Mac developers running coding agents locally; people experimenting with long-context inference on Apple Silicon.

Try it

git clone https://github.com/antirez/ds4 && make