Nexa SDK

Run LLMs and multimodal models on-device across CPU, GPU, and NPU

github.com/qualcomm/nexa-sdk · ★ 8.1k · nexa.ai

Overview

Nexa SDK is an on-device inference framework for running language and multimodal AI models locally. It targets CPU, GPU, and Qualcomm NPU hardware and runs across Android, Windows, and Linux devices. Models load from GGUF and the project's own NEXA format.

It is aimed at developers who want to run models on the device itself rather than calling a hosted API, which matters when you care about latency, offline operation, or keeping data on the machine. You can drive it from a command-line tool, a Python package, or an Android SDK.

Compared with other local runtimes, it adds Qualcomm NPU support and a model range that goes beyond chat: it also covers ASR, OCR, reranking, object detection, image generation, and embeddings.

What it does

Runs models on CPU, GPU, and Qualcomm NPU across Android, Windows, and Linux
Three ways to integrate: a CLI, a Python SDK (pip install nexaai), and an Android SDK
Supports many model types: LLM, multimodal, ASR, OCR, rerank, object detection, image generation, and embeddings
Loads models in GGUF and the project's NEXA format
Streaming text generation via generate_stream in Python and Flow-based streaming on Android
Multimodal chat from the CLI, including dragging images directly into the prompt
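
Before handing a local file to a runtime, it can help to confirm that it really is a GGUF file. The sketch below is an illustration, not part of Nexa SDK: `gguf_version` is a hypothetical helper name, and the header layout it relies on (a 4-byte `GGUF` magic followed by a little-endian 32-bit version) comes from the GGUF specification rather than from this page. The NEXA format is not described here, so it is not checked.

```python
import struct
import tempfile
from pathlib import Path
from typing import Optional

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def gguf_version(path: Path) -> Optional[int]:
    """Return the GGUF version if the file starts with the GGUF magic, else None."""
    with path.open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The version is an unsigned 32-bit little-endian integer after the magic.
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    # Exercise the check against a fake 8-byte header, not a real model.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.gguf"
        sample.write_bytes(GGUF_MAGIC + struct.pack("<I", 3))
        print(gguf_version(sample))
```

The check reads only eight bytes, so it is cheap to run on multi-gigabyte model files.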

Getting started

You can start from the Python SDK for a quick hello-world, or use the CLI to chat with a model in one command.

Install the Python SDK

Install the nexaai package from PyPI.

bash

pip install nexaai

Run a model in Python

Load a GGUF model, build a chat prompt, and stream tokens as they generate.

python

from nexaai import LLM, GenerationConfig, ModelConfig, LlmChatMessage

llm = LLM.from_(model="NexaAI/Qwen3-0.6B-GGUF", config=ModelConfig())

conversation = [
    LlmChatMessage(role="user", content="Hello, tell me a joke")
]
prompt = llm.apply_chat_template(conversation)
for token in llm.generate_stream(prompt, GenerationConfig(max_tokens=100)):
    print(token, end="", flush=True)
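
If you want the full reply as a string as well as live output, the streaming loop above can be wrapped in a small helper. This is a sketch, not part of the nexaai API: `collect_stream` and `fake_stream` are hypothetical names, and the helper assumes only that `generate_stream` yields string chunks, as the loop above suggests. A stand-in generator replaces the model so the snippet runs without nexaai installed.

```python
from typing import Iterable, Iterator


def collect_stream(tokens: Iterable[str], echo: bool = True) -> str:
    """Optionally print each streamed chunk as it arrives; return the full text.

    Hypothetical helper: assumes the iterable yields plain strings.
    """
    parts = []
    for token in tokens:
        if echo:
            print(token, end="", flush=True)
        parts.append(token)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)


def fake_stream() -> Iterator[str]:
    """Stand-in for llm.generate_stream(...) so the sketch runs offline."""
    yield from ["Why did ", "the model ", "cross the road?"]


if __name__ == "__main__":
    reply = collect_stream(fake_stream())
    # With a real model, reusing the names from the example above:
    # reply = collect_stream(
    #     llm.generate_stream(prompt, GenerationConfig(max_tokens=100))
    # )
```

Returning the joined text makes it easy to append the reply to the conversation or log it once streaming finishes.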

Or chat from the CLI

After installing the CLI for your platform, run a model directly. NPU models on Windows arm64 require setting the NEXA_TOKEN environment variable first (see the README).

bash

# Chat with Qwen3
nexa infer ggml-org/Qwen3-1.7B-GGUF

# Multimodal: drag images into the CLI
nexa infer NexaAI/Qwen3-VL-4B-Instruct-GGUF
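
To script the commands above, for example from a setup tool, one option is to check for NEXA_TOKEN before launching the CLI. The sketch below is only an illustration: `build_infer_command`, `missing_token`, `run_infer`, and the `needs_npu` flag are hypothetical names. The only details taken from the text are the `nexa infer <model>` command form and the NEXA_TOKEN variable name. Deciding whether a given model actually needs the NPU is left to the caller.

```python
import os
import shutil
import subprocess
from typing import List, Mapping


def build_infer_command(model: str) -> List[str]:
    """Return the argv for `nexa infer <model>`, as in the commands above."""
    if not model.strip():
        raise ValueError("model name must not be empty")
    return ["nexa", "infer", model]


def missing_token(env: Mapping[str, str], needs_npu: bool) -> bool:
    """True when an NPU run is requested but NEXA_TOKEN is unset or empty."""
    return needs_npu and not env.get("NEXA_TOKEN")


def run_infer(model: str, needs_npu: bool = False) -> int:
    """Launch an interactive session and return the CLI's exit code."""
    if missing_token(os.environ, needs_npu):
        raise RuntimeError("Set NEXA_TOKEN before running NPU models (see the README)")
    if shutil.which("nexa") is None:
        raise FileNotFoundError("nexa CLI not found on PATH")
    return subprocess.call(build_infer_command(model))


if __name__ == "__main__":
    # Only prints the command; call run_infer(...) to start a chat.
    print(build_infer_command("ggml-org/Qwen3-1.7B-GGUF"))
```

Passing the command as a list avoids shell quoting problems with model names, and failing early on a missing token gives a clearer error than a failed model load.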

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Add an offline chat or assistant feature to a desktop or mobile app without a hosted API
Run on-device multimodal tasks such as OCR, ASR, or vision-language chat on laptops or phones
Take advantage of a Qualcomm Snapdragon NPU for low-power local inference on Windows arm64 or Android
Build embeddings, reranking, or object detection into a local pipeline that keeps data on the device

How Nexa SDK compares

Nexa SDK alongside the other open-source local-runtime tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
Nexa SDK	★ 8.1k	Run LLMs and multimodal models on-device across CPU, GPU, and NPU
