How is Baidu's Unlimited-OCR different from other OCR models?

Unlimited-OCR replaces the decoder's standard attention with Reference Sliding Window Attention, which keeps the KV cache constant as the output grows. That lets the 3B vision-language model transcribe dozens of document pages in one forward pass, where typical OCR models stall on KV memory after a few pages.

Is Baidu Unlimited-OCR open source?

Yes. Baidu released Unlimited-OCR under the MIT license, with code on GitHub at baidu/Unlimited-OCR and weights mirrored to Hugging Face and ModelScope. The repo ships inference paths for Hugging Face Transformers, SGLang, and an OpenAI-compatible API, plus a custom no-repeat n-gram sampler.

Can Unlimited-OCR's attention trick help non-OCR tasks?

The Unlimited-OCR paper argues Reference Sliding Window Attention generalizes to other long-output decoders such as automatic speech recognition and machine translation, anywhere a growing KV cache caps how much a model can emit before memory runs out. The current 3B release is OCR-tuned, but the operator is task-agnostic.

What does 'one-shot long-horizon parsing' actually mean here?

Unlimited-OCR processes a single image, a multi-page document, or a PDF in one forward pass instead of chunking it into per-page calls. Two modes are exposed: a 'gundam' setting with configurable crop windows for dense single images, and a 'base' setting tuned for multi-page documents up to the 32,768-token output budget.

Baidu · 2026-06-22 · major

Baidu Unlimited-OCR — 3B vision model parses long documents in one pass

Baidu's Unlimited-OCR is a 3B vision-language model that introduces Reference Sliding Window Attention to keep a constant KV cache, letting one forward pass transcribe dozens of document pages within a 32K context. Code and weights ship under MIT.

Hugging Face model card hero for Baidu Unlimited-OCR

Baidu's open 3B OCR model swaps standard attention for R-SWA so it can transcribe dozens of pages without the usual KV-cache blowup.

Key specs

Parameters	3B
Context window	32K tokens

Quick facts

Maker	Baidu
License	MIT
Parameters	3B
Max context	32,768 tokens
Innovation	Reference Sliding Window Attention (R-SWA)
Availability	Hugging Face, GitHub, ModelScope
Inference	Hugging Face Transformers, SGLang, OpenAI-compatible API

What is it?

Unlimited-OCR is a 3B open-weight vision-language model from Baidu, released under MIT with code and weights on Hugging Face. The model parses single images, multi-page documents, and PDFs as one job and supports outputs up to 32,768 tokens.

How does it work?

Reference Sliding Window Attention replaces the decoder's standard attention so the KV cache stays a constant size as the output grows. Two inference modes ship out of the box — a 'gundam' setting that crops dense images and a 'base' setting tuned for multi-page documents — both driven from Hugging Face Transformers or SGLang.

Why does it matter?

Most OCR stacks force callers to slice a document into pages, run inference per page, then stitch the results back together. Unlimited-OCR turns long-document transcription into a single forward pass, which simplifies pipelines for archives, contracts, invoices, and any workload where chunking loses cross-page context.

Who is it for?

OCR engineers, document-AI teams, researchers building long-output decoders

Frequently asked questions

How is Baidu's Unlimited-OCR different from other OCR models?: Unlimited-OCR replaces the decoder's standard attention with Reference Sliding Window Attention, which keeps the KV cache constant as the output grows. That lets the 3B vision-language model transcribe dozens of document pages in one forward pass, where typical OCR models stall on KV memory after a few pages.
Is Baidu Unlimited-OCR open source?: Yes. Baidu released Unlimited-OCR under the MIT license, with code on GitHub at baidu/Unlimited-OCR and weights mirrored to Hugging Face and ModelScope. The repo ships inference paths for Hugging Face Transformers, SGLang, and an OpenAI-compatible API, plus a custom no-repeat n-gram sampler.
Can Unlimited-OCR's attention trick help non-OCR tasks?: The Unlimited-OCR paper argues Reference Sliding Window Attention generalizes to other long-output decoders such as automatic speech recognition and machine translation, anywhere a growing KV cache caps how much a model can emit before memory runs out. The current 3B release is OCR-tuned, but the operator is task-agnostic.
What does 'one-shot long-horizon parsing' actually mean here?: Unlimited-OCR processes a single image, a multi-page document, or a PDF in one forward pass instead of chunking it into per-page calls. Two modes are exposed: a 'gundam' setting with configurable crop windows for dense single images, and a 'base' setting tuned for multi-page documents up to the 32,768-token output budget.

Try it

pip install transformers && git clone https://github.com/baidu/Unlimited-OCR