UCLA NLP · 2026-04-09 · notable

OpenVLThinkerV2 — generalist multimodal reasoning with Gaussian GRPO

Open-source vision-language reasoning model trained with Gaussian GRPO (G²RPO), a non-linear RL objective that replaces GRPO's linear advantage scaling with distributional matching. It is reported to outperform comparable open and proprietary VLMs across 18 benchmarks.

A generalist multimodal reasoning model trained with a non-linear GRPO variant: an open-source baseline reported to beat GPT-4o on MMMU and MathVista.

What is it?

OpenVLThinkerV2 is UCLA NLP's vision-language reasoning model released with training code, evaluation code, and a full paper. The headline contribution is G²RPO (Gaussian GRPO), a new reinforcement learning objective designed to balance perception and reasoning across very different visual tasks.

How does it work?

Standard GRPO scales advantages linearly (each reward is normalized by the mean and standard deviation of its sampled group), which over-weights outlier rewards on tasks with noisy visual signal. G²RPO replaces that with a Gaussian distributional-matching objective: the policy update is shaped by where a trajectory falls in the reward distribution, not by the raw magnitude of its reward. The authors apply it to multi-domain visual tasks spanning document understanding, math, and spatial reasoning, and report gains across 18 benchmarks, including 71.6% on MMMU and 79.5% on MathVista.
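The contrast between linear scaling and a distribution-based advantage can be illustrated with a small sketch. The first function is the usual group-normalized GRPO advantage. The second is NOT the G²RPO formula from the paper: it is a hypothetical stand-in (a rank-based inverse-normal transform) chosen only to show what it means for an advantage to depend on a trajectory's position in the reward distribution. The function names, the eps value, and the sample rewards are all assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO: linear z-score of each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gaussian_rank_advantages(rewards):
    """Hypothetical illustration, not the paper's objective.

    Each reward is replaced by the standard-normal quantile of its rank
    within the group, so only its position in the distribution matters.
    """
    n = len(rewards)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    advantages = []
    for r in rewards:
        below = sum(1 for x in rewards if x < r)
        equal = sum(1 for x in rewards if x == r)
        mid_rank = below + (equal + 1) / 2   # 1-based rank, ties averaged
        quantile = (mid_rank - 0.5) / n      # always strictly inside (0, 1)
        advantages.append(std_normal.inv_cdf(quantile))
    return advantages


if __name__ == "__main__":
    # Assumed toy group: four similar rewards and one outlier.
    rewards = [0.1, 0.2, 0.3, 0.4, 10.0]
    print("linear :", [round(a, 3) for a in grpo_advantages(rewards)])
    print("ranked :", [round(a, 3) for a in gaussian_rank_advantages(rewards)])
```

In the linear version the single outlier sets the scale, so the four ordinary trajectories receive nearly identical advantages. In the rank-based version the outlier's magnitude has no effect: replacing 10.0 with 1000.0 leaves every advantage unchanged. The released training code is the authoritative source for how G²RPO computes its update.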

Why does it matter?

Most open multimodal reasoning models either specialize in one axis (math, charts, documents) or regress on the others when trained jointly. OpenVLThinkerV2 offers a reproducible recipe for a single model that holds up across domains, and the G²RPO objective is positioned as a drop-in replacement for vanilla GRPO in VLM RL pipelines.

Who is it for?

Multimodal RL researchers and anyone fine-tuning open VLMs.

Try it

git clone https://github.com/uclanlp/openvlthinker

Key numbers

Benchmarks: 18
MMMU: 71.6%
MathVista: 79.5%
arXiv: 2604.08539
License: Apache 2.0