What Is Chatbot Arena (LMSYS)?

Understand how Chatbot Arena turns millions of human votes into a live AI leaderboard — and why that is both more realistic and more gameable than a fixed test.

Beginner · 10 min read · Updated 2026-06-12

In plain English

Chatbot Arena is a website where you type a question, get two anonymous AI responses side by side, and vote for the one you prefer. Nobody tells you which model wrote which reply. Do that millions of times across hundreds of thousands of real users, collect all the votes, and you can rank AI models by which ones people actually like — no fixed answer key required.

Think of it like an endless blind taste test for AI, run in public. Two models get the same drink recipe (your prompt), you taste both without seeing the label, and you pick the better one. The rankings that come out of millions of these blind picks are called the Arena leaderboard.

The project was built in 2023 by LMSYS Org, a research group at UC Berkeley. By 2025 it had collected over 6 million votes, making it the most widely cited human-preference leaderboard in AI, and it had grown fast enough to spin out as an independent company, raising $100 million. In January 2026 it rebranded from LMArena (the name it adopted in 2024) to simply Arena (at arena.ai), but most people still call it "Chatbot Arena" or "the Arena."

Why it matters

Static benchmarks measure how well a model scores on someone else's exam. That is useful, but it has two big blind spots: the exam questions are fixed (so labs can optimize for them), and the "correct" answer is determined by the benchmark writers — not by whether a real user found the response helpful.

Arena sidesteps both problems. Because prompts come from real users and change every day, there is no fixed test set to memorize or "teach to." And because a real human picks the winner of each matchup, the scores reflect actual human preference rather than an academic answer key.

What it is actually useful for

Picking a general-purpose model — the leaderboard is the fastest sanity-check for "which frontier model is best right now for everyday tasks."
Catching the vibes benchmarks miss — a model can score top on MMLU but feel robotic. Arena captures fluency, helpfulness, and tone that multiple-choice tests ignore.
Comparing models without a budget for evals — running your own evals takes time. Arena gives you a quick, human-grounded baseline.
Tracking progress over time — because the Arena runs continuously, a new model's score reflects how it compares to every other model currently live, not a snapshot from launch day.

How it works

The mechanics are simple on the surface and statistically careful underneath. Here is the full loop:

// One Arena battle, start to finish

1. User enters a prompt (any topic, any language)
2. Two models respond (identities hidden: Model A / B)
3. User picks a winner (or declares a tie)
4. Identities revealed (then ratings update)
5. Scores recalculated (Bradley-Terry over all votes)

Blind, pairwise voting

The key design choice is anonymity before voting. You see "Model A" and "Model B" — no brand names, no version numbers. Only after you submit your vote does the interface reveal which models you just compared. This prevents brand loyalty from contaminating the result: you can't unconsciously favor the model you already pay for if you don't know which one it is.

You can vote "A is better," "B is better," "Tie (both good)," or "Tie (both bad)." Ties are real data — they count toward each model's statistics. You can also continue the conversation for multiple turns before voting, which is why Arena captures multi-turn quality that single-answer benchmarks miss.

From votes to Elo scores: the Bradley-Terry model

Raw win/loss counts are not enough to rank models fairly. A model that only ever faces weak opponents can rack up wins without being great. The Arena uses the Bradley-Terry (BT) model, a statistical method for paired comparisons published in 1952 and long used for sports-style rankings, to convert millions of pairwise votes into a single score per model.

The intuition: if Model X wins 64% of its head-to-head votes against Model Y, the BT model says X is roughly 100 "Elo points" better than Y. A model with an Arena score of 1300 versus one with 1200 is expected to win about 64% of their direct matchups. The scores are not percentages or absolute quality measures — they are relative ratings that only make sense when compared to each other.
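The 100-points-to-64% relationship can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the conventional chess-style Elo scale (base 10, divisor 400), which is consistent with the figures quoted above; the function names are illustrative and not part of any Arena API.

```python
import math


def expected_win_rate(score_a: float, score_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the Elo-style logistic curve.

    The base-10 / 400-point scale is the conventional chess choice and is
    an assumption of this sketch.
    """
    return 1.0 / (1.0 + 10.0 ** ((score_b - score_a) / scale))


def rating_gap(win_rate: float, scale: float = 400.0) -> float:
    """Inverse mapping: rating gap implied by a head-to-head win rate (0 < win_rate < 1)."""
    return scale * math.log10(win_rate / (1.0 - win_rate))


print(round(expected_win_rate(1300, 1200), 3))  # 0.64
print(round(expected_win_rate(1200, 1200), 3))  # 0.5
print(round(rating_gap(0.64)))                  # 100
```

Only the difference between two scores enters the formula, which is why a score is meaningless on its own and only informative relative to another model on the same leaderboard.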

The Arena originally used the classic Elo system (the same algorithm that ranks chess players), then switched to BT because BT handles hundreds of simultaneously competing models more robustly. Rather than updating scores one battle at a time, BT computes a maximum-likelihood estimate of each model's underlying strength from all matchups at once, so the result does not depend on the order in which votes arrived. The practical difference: BT ratings are more stable and come with explicit statistical confidence intervals, so you can see whether a model's lead over another is statistically meaningful or just noise.
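To make "all matchups at once" concrete, here is a small sketch that fits Bradley-Terry strengths to a batch of votes using a standard iterative (minorize-maximize) update. It is an illustration under stated assumptions, not Arena's actual pipeline: the function name, the vote format, counting a tie as half a win for each side, and the 1000-point baseline are all choices made for this example.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200):
    """Fit Bradley-Terry ratings from (model_a, model_b, outcome) votes.

    outcome is "a", "b", or "tie". Treating a tie as half a win for each
    side is an assumption of this sketch. model_a and model_b must differ.
    """
    wins = defaultdict(float)   # fractional wins per model
    games = defaultdict(float)  # number of battles per unordered pair
    for a, b, outcome in votes:
        games[frozenset((a, b))] += 1.0
        wins[a] += {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
        wins[b] += {"a": 0.0, "b": 1.0, "tie": 0.5}[outcome]
    models = sorted(wins)
    if any(wins[m] == 0 for m in models):
        raise ValueError("every model needs at least one (partial) win")

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Sum over every pairing that involves m, weighted by battle count.
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Strengths are only defined up to a constant factor, so pin the
        # geometric mean to 1 after each sweep.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}

    # Map strengths onto an Elo-like scale (baseline of 1000 is illustrative).
    return {m: 1000.0 + 400.0 * math.log10(s) for m, s in strength.items()}


# X beats Y 64% of the time, Y beats Z 64% of the time; X and Z never meet.
votes = (
    [("X", "Y", "a")] * 64 + [("X", "Y", "b")] * 36
    + [("Y", "Z", "a")] * 64 + [("Y", "Z", "b")] * 36
)
ratings = fit_bradley_terry(votes)
for name in sorted(ratings, key=ratings.get, reverse=True):
    print(name, round(ratings[name]))
```

In this toy data X ends up about 100 points above Y and about 200 above Z even though X and Z were never compared directly. That is the property that lets a leaderboard rank many models without every pair having to face each other.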

// Arena (human-preference) vs static benchmarks

Chatbot Arena

Real user prompts that change daily
Blind: no brand bias before voting
Captures preference, tone, helpfulness
Multi-turn conversations counted
No fixed answer key to memorize
Hard to game prompt distribution

Static benchmarks

Fixed question set — public forever
Lab sets prompting style
Captures factual accuracy / reasoning
Usually single-turn
Answer key exists → contamination risk
Can be "taught to" with targeted fine-tuning

Limitations and gaming concerns

The Arena is popular precisely because it is hard to game in the ways static benchmarks can be gamed. But it has its own weaknesses, and several of them became public knowledge in 2025.

Who is actually voting?

Arena voters are self-selected: people who found the site, chose to participate, and typed whatever prompt they felt like. That population skews English-speaking, technically literate, and curious about AI. A model that is excellent at English creative writing but weak in Japanese will score higher than its true multilingual quality justifies. The prompts people choose to submit are also not representative of professional use cases — low-frequency but high-stakes tasks (legal drafting, medical explanation, code debugging at scale) get far fewer votes than general chat.

Vote rigging

A 2025 academic paper showed that without strict safeguards, it is possible to de-anonymize model outputs with over 95% accuracy — meaning an attacker can identify which response came from which model before voting. Injecting biased votes for a target model at scale can produce multi-rank gains on the leaderboard. Omnipresent rigging (manipulating votes across all battles) is especially effective, requiring only hundreds of injected votes to produce a visible ranking change.

Arena responded by adding CAPTCHA, login requirements, bot detection, and anomaly monitoring. But the vulnerability illustrates a structural tension: the more prominent the leaderboard becomes, the stronger the incentive to game it.
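A rough way to see why a few hundred biased votes can matter is to look at a single head-to-head pairing. The sketch below is a toy calculation, not the method from the paper: it assumes just two models, the chess-style 400-point scale, and that every injected vote goes to the target model. All names and numbers are hypothetical.

```python
import math


def rating_gap(win_rate: float) -> float:
    """Rating gap implied by a head-to-head win rate (requires 0 < win_rate < 1)."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def gap_after_injection(honest_wins: int, honest_losses: int, injected_votes: int) -> float:
    """Apparent rating gap once injected votes for the target model are added."""
    wins = honest_wins + injected_votes
    total = honest_wins + honest_losses + injected_votes
    return rating_gap(wins / total)


# Two evenly matched models with 1,000 honest votes between them.
print(round(gap_after_injection(500, 500, 0), 1))    # 0.0
print(round(gap_after_injection(500, 500, 200), 1))  # 58.5
```

In this toy setting, 200 injected votes turn a dead heat into a lead of nearly 60 points. On a crowded leaderboard where neighboring models sit only a few points apart, a shift of that size would be enough to move a model several places.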

Provider asymmetries

Large AI companies can test model versions on Arena privately before making them public, selectively withdraw poorly-performing versions, and run many more test configurations than a small lab can afford. This creates an asymmetry: well-resourced providers can effectively cherry-pick which version of a model gets public exposure on the leaderboard. A 2025 analysis called this the $100M bias problem, noting that Arena's major funders are the same companies whose models compete on the leaderboard.

Preference is not correctness

The most fundamental limitation: Arena measures what users prefer, which is not the same as what is accurate or safe. Longer, more confident-sounding answers often win votes even when they are less accurate. A model trained to produce fluent, agreeable text can outscore a more careful model that hedges appropriately. This is why Arena ranks should never be used alone to evaluate a model for high-stakes domains — you still need factual accuracy checks and red-teaming.

Going deeper

Once you understand how the basic Arena loop works, there are a few harder questions worth sitting with.

How many votes does a stable score actually require?

The Arena publishes confidence intervals alongside each model's score. A model with 50 battles has a very wide interval — its apparent rank could jump by a dozen places with a few dozen more votes. A model with 10,000 battles has a tight interval — the rank is stable. When you see a brand-new model near the top of the leaderboard, always check the confidence interval before concluding it is better than an older, more battle-tested model. The number of votes is printed in the leaderboard table.
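The effect of battle count on interval width can be illustrated with a textbook normal approximation for a win rate, converted to rating points. This is a simplified sketch and not Arena's own procedure: it looks at one model against one opponent, assumes the chess-style 400-point scale, and uses hypothetical function names.

```python
import math


def win_rate_interval(wins: int, battles: int, z: float = 1.96):
    """Approximate 95% interval for a win rate (normal approximation)."""
    p = wins / battles
    half_width = z * math.sqrt(p * (1.0 - p) / battles)
    return p - half_width, p + half_width


def rating_gap(win_rate: float) -> float:
    """Rating gap implied by a win rate (requires 0 < win_rate < 1)."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def rating_interval_width(wins: int, battles: int) -> float:
    """Width, in rating points, of the interval implied by the win-rate interval."""
    low, high = win_rate_interval(wins, battles)
    return rating_gap(high) - rating_gap(low)


# Same 64% observed win rate, very different amounts of evidence.
print(round(rating_interval_width(32, 50)))
print(round(rating_interval_width(6400, 10_000)))
```

With the same 64% win rate, the 50-battle interval spans on the order of 200 rating points while the 10,000-battle interval spans fewer than 20. The point estimate is identical in both cases; only the amount of evidence behind it differs.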

Specialized arenas

The main leaderboard covers general conversation, but Arena also runs domain-specific leaderboards: Coding Arena (compare models on code generation), Vision Arena (compare on image understanding), and text-to-video leaderboards. Each uses the same blind pairwise mechanism but routes prompts to domain-relevant models. A model can rank very differently across these sub-arenas, which is the right data to look at if you have a specific use case.

How Arena and static benchmarks complement each other

The mature way to evaluate a model is to use both. Static benchmarks like GPQA or SWE-bench tell you how the model performs on carefully constructed test cases with ground-truth answers — useful for measuring factual accuracy and reasoning. Arena tells you how the model's responses feel to real users asking real questions. A model that ranks highly on both is a stronger choice than one that dominates only one dimension.

Arena as a model of human feedback at scale

Beyond ranking, the data Arena collects (millions of labeled preference pairs) is the same kind of signal used in RLHF (Reinforcement Learning from Human Feedback), the technique used to align LLMs to human preferences during training. Some researchers argue that the Arena's public preference data is as valuable for research as it is for ranking, because it captures genuine human judgments at a scale that no internal lab can easily replicate.

FAQ

What is Chatbot Arena and how does it work?

Chatbot Arena is a website where you submit a prompt and see two anonymous AI model responses side by side. You vote for the better one without knowing which model wrote it. After millions of such votes, a statistical model (Bradley-Terry) converts the win/loss data into a ranked leaderboard. It was built by LMSYS at UC Berkeley in 2023 and rebranded to Arena (arena.ai) in 2026.

What does an Arena Elo score actually mean?

Arena scores are relative ratings — they only mean something compared to other models on the same leaderboard. A model with a score of 1300 vs one at 1200 is expected to win about 64% of head-to-head votes. The scores are not percentages of correct answers; they are derived from the Bradley-Terry statistical model applied to millions of pairwise human votes.

Why is Chatbot Arena better than regular benchmarks?

Regular benchmarks use a fixed question set with a known answer key, which means labs can optimize models for those specific questions and the questions can leak into training data. Arena uses real user prompts that change every day and measures human preference rather than matching a fixed answer key, making it harder to game in the same ways. However, it measures preference, not factual correctness, which static benchmarks can capture.

Can Chatbot Arena be gamed or manipulated?

Yes. A 2025 paper showed that model outputs can be de-anonymized with over 95% accuracy, making targeted vote injection possible. Injecting hundreds of biased votes can shift a model's rank measurably. Arena added CAPTCHA, login requirements, and bot detection in response. There are also structural concerns: large AI companies can test model versions privately before release and withdraw poorly-performing variants, giving them an advantage smaller labs lack.

What are the main limitations of the Chatbot Arena leaderboard?

Four main limitations: (1) Voter bias — the user base skews English-speaking and tech-savvy, so multilingual and domain-specific models are underrated. (2) Preference vs. correctness — fluent, confident answers win votes even when less accurate. (3) Provider asymmetry — well-funded labs can run private tests and selectively expose their best versions. (4) Sampling bias — the prompts people submit are not representative of professional or high-stakes tasks.

Is LMArena the same as Chatbot Arena?

Yes. The platform launched as Chatbot Arena by LMSYS Org at UC Berkeley in May 2023. In September 2024 it moved to the domain lmarena.ai and adopted the LMArena name. In April 2025 it incorporated as Arena Intelligence Inc. and raised $100 million. In January 2026 it rebranded again to simply Arena and moved to arena.ai. Most people still refer to it as Chatbot Arena.
