UCLA NLP · 2026-04-09 · notable
OpenVLThinkerV2 — generalist multimodal reasoning with Gaussian GRPO
Open-source vision-language reasoning model trained with Gaussian GRPO (G²RPO), a non-linear RL objective that replaces standard linear scaling with distributional matching — reported to outperform comparable open and proprietary VLMs across 18 benchmarks.
A generalist multimodal reasoning model trained with a non-linear GRPO variant: an open-source baseline reported to beat GPT-4o on MMMU and MathVista.
What is it?
OpenVLThinkerV2 is UCLA NLP's vision-language reasoning model released with training code, evaluation code, and a full paper. The headline contribution is G²RPO (Gaussian GRPO), a new reinforcement learning objective designed to balance perception and reasoning across very different visual tasks.
How does it work?
Standard GRPO scales advantages linearly, which over-weights outlier rewards on tasks with noisy visual signals. G²RPO replaces that with a Gaussian distributional-matching objective: the policy update is shaped by where a trajectory's reward falls within the reward distribution, not by the reward's raw magnitude. The authors apply it to multi-domain visual tasks spanning document understanding, math, and spatial reasoning, and report gains across 18 benchmarks, including 71.6% on MMMU and 79.5% on MathVista.
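To make the contrast concrete, here is a minimal Python sketch. The first function is the usual group-normalized advantage used in GRPO. The second is an illustrative stand-in only, not the paper's G²RPO objective: it uses a rank-based inverse normal transform, a common statistical technique, to show what it means for an advantage to depend on position in the reward distribution rather than on raw magnitude. The function names and the sample rewards are assumptions made for this example.

```python
from statistics import NormalDist, mean, pstdev


def linear_advantages(rewards, eps=1e-8):
    """Group-normalized advantages as in standard GRPO: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gaussian_rank_advantages(rewards):
    """Illustrative stand-in (NOT the paper's G2RPO objective).

    Maps each reward's rank within the group to a standard-normal quantile,
    so the advantage depends on where the reward sits in the distribution,
    not on how large it is. Ties are broken by position, a simplification.
    """
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    advantages = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        advantages[i] = std_normal.inv_cdf((rank + 0.5) / n)
    return advantages


# Hypothetical group of rewards with one outlier.
rewards = [0.1, 0.2, 0.15, 0.25, 5.0]
print("linear:  ", linear_advantages(rewards))
print("gaussian:", gaussian_rank_advantages(rewards))
```

With linear normalization the single outlier pulls the group mean and standard deviation toward itself, so it receives a large advantage while the other four samples are compressed together. In the rank-based version the outlier's advantage is capped by the largest normal quantile for the group size, whatever its raw value.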
Why does it matter?
Most open multimodal reasoning models either specialize in one axis (math, charts, documents) or regress on others when trained jointly. OpenVLThinkerV2 is a reproducible recipe for a single model that holds up across domains, and the G²RPO objective is a drop-in replacement for vanilla GRPO in any VLM RL pipeline.
Who is it for?
Multimodal RL researchers and anyone fine-tuning open VLMs.
Try it
git clone https://github.com/uclanlp/openvlthinker
Key numbers
- Benchmarks: 18
- MMMU: 71.6%
- MathVista: 79.5%
- arXiv: 2604.08539
- License: Apache 2.0