Baidu · 2026-06-22 · major
Baidu Unlimited-OCR — 3B vision model parses long documents in one pass
Baidu's Unlimited-OCR is a 3B vision-language model that introduces Reference Sliding Window Attention (R-SWA) to hold the KV cache at a constant size, letting one forward pass transcribe dozens of document pages within a 32K context. Code and weights ship under the MIT license.

Baidu's open 3B OCR model replaces standard decoder attention with Reference Sliding Window Attention (R-SWA), so it can transcribe dozens of pages without the KV cache growing with every generated token.
Key specs
| Spec | Value |
|---|---|
| Parameters | 3B |
| Context window | 32K tokens |
Quick facts
| Field | Value |
|---|---|
| Maker | Baidu |
| License | MIT |
| Parameters | 3B |
| Max context | 32,768 tokens |
| Innovation | Reference Sliding Window Attention (R-SWA) |
| Availability | Hugging Face, GitHub, ModelScope |
| Inference | Hugging Face Transformers, SGLang, OpenAI-compatible API |
What is it?
Unlimited-OCR is a 3B open-weight vision-language model from Baidu, released under MIT with code and weights on Hugging Face. The model parses single images, multi-page documents, and PDFs as one job and supports outputs up to 32,768 tokens.
How does it work?
Reference Sliding Window Attention (R-SWA) replaces the decoder's standard attention so the KV cache stays a constant size as the output grows, instead of gaining one entry per generated token. Two inference modes ship out of the box: a 'gundam' setting that crops dense images and a 'base' setting tuned for multi-page documents. Both can be driven from Hugging Face Transformers or SGLang.
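To make the constant-size cache concrete, here is a minimal Python sketch. This write-up does not spell out how R-SWA is defined, so the sketch rests on an assumption: that "reference" means a fixed set of entries (for example, image tokens) that stay visible at every step, while a sliding window covers only the most recent generated tokens. The class name `BoundedKVCache` and its parameters are hypothetical and are not taken from the Unlimited-OCR code.

```python
from collections import deque


class BoundedKVCache:
    """Toy cache: fixed reference entries plus a sliding window of recent ones.

    Assumption for illustration only; not the Unlimited-OCR implementation.
    """

    def __init__(self, reference, window):
        self.reference = list(reference)    # always kept (assumed: e.g. image tokens)
        self.recent = deque(maxlen=window)  # oldest entry is evicted automatically

    def append(self, entry):
        """Record the key/value entry for one newly generated token."""
        self.recent.append(entry)

    def visible(self):
        """Entries the next decoding step may attend to."""
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


cache = BoundedKVCache(reference=["img0", "img1", "img2"], window=4)
sizes = []
for step in range(10):
    cache.append(f"tok{step}")
    sizes.append(len(cache))

print(sizes)            # grows until the window fills, then stays flat
print(cache.visible())  # reference entries plus the four newest tokens
```

Under this assumption the cache never exceeds `len(reference) + window` entries, however long the output gets. A standard cache would instead hold every generated token, so its size would grow linearly with output length.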
Why does it matter?
Most OCR stacks force callers to slice a document into pages, run inference per page, then stitch the results back together. Unlimited-OCR turns long-document transcription into a single forward pass, which simplifies pipelines for archives, contracts, invoices, and any workload where chunking loses cross-page context.
Who is it for?
OCR engineers, document-AI teams, researchers building long-output decoders
Frequently asked questions
- How is Baidu's Unlimited-OCR different from other OCR models?
- Unlimited-OCR replaces the decoder's standard attention with Reference Sliding Window Attention, which keeps the KV cache constant as the output grows. That lets the 3B vision-language model transcribe dozens of document pages in one forward pass, where typical OCR models stall on KV memory after a few pages.
- Is Baidu Unlimited-OCR open source?
- Yes. Baidu released Unlimited-OCR under the MIT license, with code on GitHub at baidu/Unlimited-OCR and weights mirrored to Hugging Face and ModelScope. The repo ships inference paths for Hugging Face Transformers, SGLang, and an OpenAI-compatible API, plus a custom no-repeat n-gram sampler.
- Can Unlimited-OCR's attention trick help non-OCR tasks?
- The Unlimited-OCR paper argues that Reference Sliding Window Attention generalizes to other long-output decoders, such as automatic speech recognition and machine translation, where a growing KV cache caps how much a model can emit before memory runs out. The current 3B release is OCR-tuned, but the attention mechanism itself is task-agnostic.
- What does 'one-shot long-horizon parsing' actually mean here?
- Unlimited-OCR processes a single image, a multi-page document, or a PDF in one forward pass instead of chunking it into per-page calls. Two modes are exposed: a 'gundam' setting with configurable crop windows for dense single images, and a 'base' setting tuned for multi-page documents up to the 32,768-token output budget.
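The FAQ above mentions a custom no-repeat n-gram sampler but gives no details. As a rough illustration of the general technique only (the same idea as the `no_repeat_ngram_size` option in Hugging Face Transformers generation, and not the repo's own sampler), the Python sketch below bans any next token that would complete an n-gram already present in the output. The function names `banned_next_tokens` and `pick_next` are hypothetical.

```python
def banned_next_tokens(generated, n):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = generated[i:i + n]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


def pick_next(scores, generated, n):
    """Greedy choice over `scores` (token -> score) with repeated n-grams blocked."""
    banned = banned_next_tokens(generated, n)
    allowed = {tok: s for tok, s in scores.items() if tok not in banned}
    if not allowed:  # every candidate is banned: fall back to the raw scores
        allowed = scores
    return max(allowed, key=allowed.get)


history = [5, 6, 7, 5, 6]                 # "5 6 7" already occurred once
print(banned_next_tokens(history, n=3))   # token 7 would repeat the trigram
print(pick_next({7: 0.9, 8: 0.1}, history, n=3))
```

Blocking of this kind is a common guard against a long-output decoder falling into a loop, which is more likely when one pass emits tens of thousands of tokens.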
Try it
pip install transformers && git clone https://github.com/baidu/Unlimited-OCR