AI/TLDR

Luce-Org · 2026-04-20 · notable

LuceBox Hub — Hand-Tuned LLM Inference Reaching 207 tok/s on an RTX 3090

Open-source LLM inference toolkit that hand-tunes CUDA kernels for one GPU at a time. DFlash speculative decoding hits 207 tok/s on Qwen3.5-27B on an RTX 3090 — 3.43× faster than autoregressive, 2.8× faster than SGLang AWQ.

LuceBox rewrites LLM inference from scratch for one GPU at a time, achieving 207 tok/s on Qwen3.5-27B on a consumer RTX 3090.

Key specs

Throughput (Qwen3.5-27B, RTX 3090): 207 tok/s
Speedup vs. autoregressive decoding: 3.43×
Speedup vs. SGLang AWQ: 2.8×
Megakernel efficiency (Qwen3.5-0.8B): 1.87 tok/J
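As a rough sanity check on these figures, the baselines implied by the speedup ratios can be worked out with simple arithmetic. The sketch below assumes that "N× faster" means N× the throughput on the same workload; the derived baseline numbers are not reported by the source and are only what the ratios imply.

```python
# Figures quoted in the key specs.
dflash_tok_s = 207.0          # DFlash throughput, Qwen3.5-27B on RTX 3090
speedup_vs_ar = 3.43          # vs. autoregressive decoding
speedup_vs_sglang = 2.8       # vs. SGLang AWQ
megakernel_tok_per_j = 1.87   # Megakernel efficiency, Qwen3.5-0.8B

# Derived values (assumption: ratios are throughput ratios on one workload).
implied_ar_tok_s = dflash_tok_s / speedup_vs_ar
implied_sglang_tok_s = dflash_tok_s / speedup_vs_sglang
joules_per_token = 1.0 / megakernel_tok_per_j

print(f"implied autoregressive baseline: {implied_ar_tok_s:.1f} tok/s")   # 60.3
print(f"implied SGLang AWQ baseline:     {implied_sglang_tok_s:.1f} tok/s")  # 73.9
print(f"energy per token (0.8B model):   {joules_per_token:.2f} J")       # 0.53
```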

What is it?

LuceBox Hub is an open-source inference optimization project that hand-tunes CUDA kernels for individual GPU architectures rather than writing generic cross-hardware code. The current target is the NVIDIA RTX 3090. It has two main components: DFlash (a speculative decoding port for Qwen3.5-27B) and Megakernel (a fused forward-pass implementation for Qwen3.5-0.8B).

How does it work?

DFlash implements speculative decoding, a technique where a smaller draft model proposes several tokens that a larger verifier model checks in parallel, so that more than one token can be confirmed per forward pass of the large model. On the RTX 3090, this yields 207 tok/s on Qwen3.5-27B, which is 3.43× faster than standard autoregressive inference and 2.8× faster than SGLang AWQ. The Megakernel fuses the entire forward pass into a single kernel for the 0.8B model, reaching 1.87 tok/J, which reportedly matches Apple Silicon efficiency at twice the throughput.
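The draft-then-verify loop can be illustrated with a minimal greedy sketch. This is not DFlash's implementation: the `target` and `draft` functions below are hypothetical toy stand-ins for real models, and the verification step calls the target once per position, where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a context to its greedy next token

def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain autoregressive decoding: one target call per new token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, draft_len: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + n
    while len(out) < goal:
        # 1. The cheap draft model proposes draft_len tokens autoregressively.
        ctx, proposal = list(out), []
        for _ in range(draft_len):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each proposed position (batched in a real system).
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # Keep the agreed prefix, then the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every drafted token was accepted; the target adds one bonus token.
            out.extend(proposal)
            out.append(target(out))
    return out[:goal]

# Hypothetical toy models: the draft agrees with the target most of the time.
def target(ctx: List[int]) -> int:
    return (sum(ctx) * 3 + 1) % 17

def draft(ctx: List[int]) -> int:
    return 0 if len(ctx) % 5 == 0 else target(ctx)

reference = greedy_decode(target, [1, 2, 3], 20)
speculative = speculative_decode(target, draft, [1, 2, 3], 20)
print(speculative == reference)
```

Because every accepted token is one the target would have produced anyway, the speculative output is identical to plain greedy decoding; the gain comes only from confirming several tokens per verification step when the draft guesses well.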

Why does it matter?

Most inference frameworks (vLLM, SGLang, llama.cpp) target a broad set of hardware. LuceBox trades portability for raw per-chip performance. For developers running large models locally on a single RTX 3090, 207 tok/s is fast enough for interactive use with a 27B model. The project drew 131 Hacker News points within hours of posting, a sign of interest among self-hosters.

Who is it for?

Self-hosters running 27B+ models on consumer NVIDIA GPUs.

Try it

git clone https://github.com/Luce-Org/lucebox-hub

Tags

  • inference
  • optimization
  • speculative-decoding
  • cuda
  • rtx-3090
  • qwen
  • self-hosting
  • local-llm
  • kernel-optimization