Luce-Org · 2026-04-20 · notable
LuceBox Hub — Hand-Tuned LLM Inference Reaching 207 tok/s on an RTX 3090
Open-source LLM inference toolkit that hand-tunes CUDA kernels for one GPU at a time. DFlash speculative decoding hits 207 tok/s on Qwen3.5-27B on an RTX 3090 — 3.43× faster than autoregressive, 2.8× faster than SGLang AWQ.
LuceBox rewrites LLM inference from scratch for one GPU at a time, achieving 207 tok/s on Qwen3.5-27B on a consumer RTX 3090.
Key specs
| Metric | Value |
|---|---|
| Throughput (Qwen3.5-27B, RTX 3090) | 207 tok/s |
| Speedup vs. autoregressive | 3.43× |
| Speedup vs. SGLang AWQ | 2.8× |
| Megakernel efficiency (Qwen3.5-0.8B) | 1.87 tok/J |
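The speedup ratios imply baseline throughputs that the summary does not state directly. The short sketch below derives them by dividing the headline figure by each ratio. It assumes both ratios were measured against the same 207 tok/s result on the same hardware; the resulting numbers (roughly 60 tok/s and 74 tok/s) are derived here, not reported by the project, and the function name is illustrative.

```python
# Figures quoted in the key specs.
DFLASH_TOK_S = 207.0
SPEEDUP_VS_AUTOREGRESSIVE = 3.43
SPEEDUP_VS_SGLANG_AWQ = 2.8


def implied_baseline(throughput: float, speedup: float) -> float:
    """Return the baseline throughput implied by a speedup ratio."""
    return throughput / speedup


# Derived values (assumption: both ratios refer to the same 207 tok/s run).
autoregressive_tok_s = implied_baseline(DFLASH_TOK_S, SPEEDUP_VS_AUTOREGRESSIVE)
sglang_awq_tok_s = implied_baseline(DFLASH_TOK_S, SPEEDUP_VS_SGLANG_AWQ)

print(f"Implied autoregressive baseline: {autoregressive_tok_s:.1f} tok/s")
print(f"Implied SGLang AWQ baseline:     {sglang_awq_tok_s:.1f} tok/s")
```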
What is it?
LuceBox Hub is an open-source inference optimization project that hand-tunes CUDA kernels specifically for individual GPU architectures rather than writing generic cross-hardware code. The current target is the NVIDIA RTX 3090. It has two main components: DFlash (speculative decoding port for Qwen3.5-27B) and Megakernel (a fused forward-pass implementation for Qwen3.5-0.8B).
How does it work?
DFlash implements speculative decoding — a technique where a smaller draft model proposes tokens that a larger verifier model checks in parallel, allowing multiple tokens to be confirmed per forward pass. On the RTX 3090, this yields 207 tok/s on Qwen3.5-27B, which is 3.43× faster than standard autoregressive inference and 2.8× faster than SGLang AWQ. The Megakernel fuses the entire forward pass into a single kernel for the 0.8B model, reaching 1.87 tok/J — reportedly matching Apple Silicon efficiency at twice the throughput.
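To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding in plain Python. It is not LuceBox's or DFlash's code: `target_model` and `draft_model` are hypothetical toy functions standing in for the large and small models, and the verification loop only simulates what a real system would score in one batched forward pass on the GPU.

```python
from typing import Callable, List, Tuple

# A "model" here maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical large model: a deterministic toy rule, not a real LLM.
    return (prefix[-1] * 3 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical small model: agrees with the target except when the
    # prefix length is a multiple of 4, where it guesses wrong.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 != 0 else (guess + 1) % 50


def autoregressive(model: Model, prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Plain greedy decoding: one model pass per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens, n_new


def speculative(target: Model, draft: Model, prompt: List[int],
                n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns the tokens and the target pass count."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real system scores
        #    all of them in a single forward pass, so count one pass here.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong token
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the same pass yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline, baseline_passes = autoregressive(target_model, prompt, 20)
    output, spec_passes = speculative(target_model, draft_model, prompt, 20, k=4)
    print("same output as plain decoding:", output == baseline)
    print("target passes:", spec_passes, "vs", baseline_passes)
```

Because every accepted token is one the target would have produced itself, the output matches plain greedy decoding. The gain comes only from needing fewer target passes, and it shrinks as the draft model agrees with the target less often.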
Why does it matter?
Most inference frameworks (vLLM, SGLang, llama.cpp) target a broad set of hardware. LuceBox trades portability for raw per-chip performance. For developers running large models locally on a single RTX 3090, 207 tok/s is fast enough for interactive use with a 27B model, which is a meaningful threshold. The post's 131 HN points within hours reflect strong interest from the self-hosting crowd.
Who is it for?
Self-hosters running 27B+ models on consumer NVIDIA GPUs.
Try it
git clone https://github.com/Luce-Org/lucebox-hub