AI/TLDR

Nexa SDK

Run LLMs and multimodal models on-device across CPU, GPU, and NPU

Overview

Nexa SDK is an on-device inference framework for running language and multimodal AI models locally. It targets CPU, GPU, and Qualcomm NPU hardware and runs on Android, Windows, and Linux devices. It loads models in GGUF and the project's own NEXA format.

It is aimed at developers who want to run models on the device itself rather than call a hosted API, which helps when latency, offline operation, or keeping data on the machine matters. You can drive it from a command-line tool, a Python package, or an Android SDK.

As a local runtime, it sits alongside other on-device options but adds NPU support and a model range that goes past chat: it also covers ASR, OCR, reranking, object detection, image generation, and embeddings.

What it does

  • Runs models on CPU, GPU, and Qualcomm NPU across Android, Windows, and Linux
  • Three ways to integrate: a CLI, a Python SDK (pip install nexaai), and an Android SDK
  • Supports many model types: LLM, multimodal, ASR, OCR, rerank, object detection, image generation, and embeddings
  • Loads models in GGUF and the project's NEXA format
  • Streaming text generation via generate_stream in Python and Flow-based streaming on Android
  • Multimodal chat from the CLI, including dragging images directly into the prompt

Getting started

You can start from the Python SDK for a quick hello-world, or use the CLI to chat with a model in one command.

Install the Python SDK

Install the nexaai package from PyPI.

bash
pip install nexaai

Run a model in Python

Load a GGUF model, build a chat prompt, and stream tokens as they generate.

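python
from nexaai import LLM, GenerationConfig, ModelConfig, LlmChatMessage

# Load the model by name with the default model configuration.
llm = LLM.from_(model="NexaAI/Qwen3-0.6B-GGUF", config=ModelConfig())

# A conversation is a list of chat messages; here, a single user turn.
conversation = [
    LlmChatMessage(role="user", content="Hello, tell me a joke")
]
# Turn the conversation into a prompt using the model's chat template.
prompt = llm.apply_chat_template(conversation)
# Print tokens as they are generated, capped at 100 tokens.
for token in llm.generate_stream(prompt, GenerationConfig(max_tokens=100)):
    print(token, end="", flush=True)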
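
If you also need the full reply once streaming finishes (for logging, or to append it to the conversation), one option is to accumulate tokens while printing them. The sketch below is a minimal illustration and not part of the Nexa SDK: collect_stream is a hypothetical helper, and it assumes only that the stream yields string tokens, as the loop above suggests. A stand-in generator replaces the model so the snippet runs on its own.

```python
from typing import Iterable, Iterator


def collect_stream(tokens: Iterable[str], echo: bool = True) -> str:
    """Print tokens as they arrive and return the concatenated text.

    Hypothetical helper: assumes each item yielded by the stream is a str.
    """
    parts = []
    for token in tokens:
        if echo:
            print(token, end="", flush=True)
        parts.append(token)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)


def fake_stream() -> Iterator[str]:
    """Stand-in for a token stream so the example runs without a model."""
    yield from ["Why ", "did ", "the ", "chicken ", "cross?"]


reply = collect_stream(fake_stream())
```

With a loaded model, the same helper could wrap the llm.generate_stream(...) call from the example above, provided that call yields strings.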

Or chat from the CLI

After installing the CLI for your platform, run a model directly. NPU models on Windows arm64 require setting the NEXA_TOKEN environment variable first (see the README).

bash
# Chat with Qwen3
nexa infer ggml-org/Qwen3-1.7B-GGUF

# Multimodal: drag images into the CLI
nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
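
Because NPU models on Windows arm64 need NEXA_TOKEN set first, a script that launches the CLI can check for the variable up front. This is a sketch under stated assumptions, not something taken from the project's documentation: build_infer_command, require_token, and run_infer are hypothetical helper names, the only CLI syntax used is the nexa infer <model> form shown above, and the check confirms only that the variable is non-empty, not that the token is valid.

```python
import os
import subprocess
from typing import List, Mapping


def build_infer_command(model: str) -> List[str]:
    """Build the argv for `nexa infer <model>` (the form shown above)."""
    if not model.strip():
        raise ValueError("model name must not be empty")
    return ["nexa", "infer", model]


def require_token(env: Mapping[str, str]) -> str:
    """Return NEXA_TOKEN from env, or raise if it is missing or blank."""
    token = env.get("NEXA_TOKEN", "").strip()
    if not token:
        raise RuntimeError("NEXA_TOKEN is not set; see the README")
    return token


def run_infer(model: str, needs_token: bool = False) -> int:
    """Launch the CLI in the current terminal and return its exit code."""
    if needs_token:
        require_token(os.environ)
    return subprocess.run(build_infer_command(model)).returncode


# Example call (requires the nexa CLI on PATH):
# run_infer("ggml-org/Qwen3-1.7B-GGUF")
```

Keeping the command builder and the token check as separate functions means both can be tested without the CLI installed.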

When to use it

  • Add an offline chat or assistant feature to a desktop or mobile app without a hosted API
  • Run on-device multimodal tasks such as OCR, ASR, or vision-language chat on laptops or phones
  • Take advantage of a Qualcomm Snapdragon NPU for low-power local inference on Windows arm64 or Android
  • Build embeddings, reranking, or object detection into a local pipeline that keeps data on the device

How Nexa SDK compares

Nexa SDK alongside the other open-source local runtime tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All | ★ 77.4k | GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
Nexa SDK | ★ 8.1k | Run LLMs and multimodal models on-device across CPU, GPU, and NPU