AI/TLDR

SGLang Project · 2026-04-09 · major

SGLang v0.5.10 — Native Apple Silicon Backend, Elastic MoE Fault Tolerance, sglang-kernel 0.4.1

Adds a native MLX backend for Apple Silicon, Elastic EP partial-failure tolerance for DeepSeek MoE, HiSparse long-context attention, a Transformers v5 upgrade, and a renamed sglang-kernel package.

SGLang GitHub repository

SGLang adds Apple Silicon inference, resilient MoE failover, and a roughly 1000× reduction in RDMA request count on GQA models for large-scale clusters.

Key specs

GitHub stars: 27,141

What is it?

SGLang is a high-performance serving framework for large language models and multimodal models, with about 27k GitHub stars. Version 0.5.10 shipped on April 6, 2026, with several cross-platform and reliability improvements.

How does it work?

The native MLX backend lets SGLang run inference directly on Apple Silicon Macs without CUDA. Elastic EP integrates NIXL-based partial-failure tolerance for DeepSeek MoE deployments, so a single failed node does not take down the whole cluster. GPU staging buffers gather scattered head slices into contiguous memory before RDMA transfer, which reduces the RDMA request count by approximately 1000× on GQA models. HiSparse is a sparse attention mechanism that reduces compute for long-context inference.
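To see why gathering head slices into a staging buffer cuts the request count, here is a toy model in Python. It is only a sketch under stated assumptions, not SGLang's implementation: the token-major `[token, head, dim]` layout, the `KVLayout` type, the helper names, and the sizes (1024 tokens, 8 KV heads, head dimension 128, 2-byte elements) are all hypothetical and chosen for illustration. Each contiguous byte region is counted as one transfer request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVLayout:
    """Hypothetical token-major KV buffer: [token, head, dim]."""
    num_tokens: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # assumption: 16-bit elements


def head_slice_regions(layout, head_start, head_end):
    """Return (offset, length) byte regions covering heads
    [head_start, head_end) of every token in the buffer."""
    elem = layout.head_dim * layout.bytes_per_elem
    row_bytes = layout.num_kv_heads * elem
    slice_offset = head_start * elem
    slice_bytes = (head_end - head_start) * elem
    return [(t * row_bytes + slice_offset, slice_bytes)
            for t in range(layout.num_tokens)]


def coalesce(regions):
    """Merge byte-adjacent regions; each merged region is one request."""
    merged = []
    for offset, length in sorted(regions):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((offset, length))
    return merged


def stage(regions):
    """Model a gather into a staging buffer: the same slices,
    packed back to back starting at offset 0."""
    staged, cursor = [], 0
    for _, length in regions:
        staged.append((cursor, length))
        cursor += length
    return staged


layout = KVLayout(num_tokens=1024, num_kv_heads=8, head_dim=128)
scattered = head_slice_regions(layout, head_start=2, head_end=4)

direct_requests = len(coalesce(scattered))         # one per token
staged_requests = len(coalesce(stage(scattered)))  # one contiguous block

print(direct_requests, staged_requests)
```

In this toy model the slices for a subset of heads are separated by the other heads' data, so a direct transfer needs one request per token, while the staged copy collapses to a single contiguous region. The reduction factor here simply equals the number of tokens; the approximately 1000× figure quoted above comes from the release notes, not from this sketch.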

Why does it matter?

Native Apple Silicon support opens up local inference for Mac-based developers. Elastic EP and GPU staging buffers directly address pain points in running trillion-parameter MoE models such as DeepSeek V4 at scale.

Who is it for?

Production ML infrastructure teams serving large MoE models, and developers who want local SGLang inference on Apple hardware.

Try it

pip install sglang==0.5.10
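After installing, one way to confirm that the pinned version is the one actually present in the environment is to query the package metadata with Python's standard library. This is a minimal sketch: the helper names `installed_version` and `matches_pin` are hypothetical, and it assumes the distribution is registered under the name `sglang`, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(package, expected):
    """True only if the package is installed at exactly `expected`."""
    return installed_version(package) == expected


# Prints the installed version, or None when sglang is absent.
print(installed_version("sglang"))
print(matches_pin("sglang", "0.5.10"))
```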

Tags

  • inference
  • llm-serving
  • apple-silicon
  • moe
  • open-source