In plain English
LMArena (now rebranded to simply Arena at arena.ai) is the internet's biggest live tournament for AI chatbots. You visit the site, type a question, and two anonymous AI models both answer it. You pick the better reply without knowing whose it was. Repeat that millions of times across hundreds of thousands of real users, and the win/loss results can be converted into a ranked leaderboard.
The rankings use a scoring method borrowed from competitive chess: the Elo rating system. In chess, every player starts with a number (say 1500). Beat a higher-rated opponent and your number rises sharply; beat a lower-rated one and it barely changes. The same logic applies here: an AI model earns more rating points by beating a strong competitor than by beating a weak one.
Think of it as a never-ending round-robin chess tournament, except the competitors are language models, the judges are ordinary internet users, and the game is "which reply do you prefer?" A rating of 1300 doesn't mean anything on its own, but a 100-point gap means the higher-rated model wins roughly 64 % of head-to-head votes. A 200-point gap puts that win rate at about 76 %.
Why it matters
Most AI benchmarks work like school exams: someone writes a question, records the correct answer, feeds it to the model, and grades the output automatically. That is cheap and reproducible, but it has a huge flaw — labs can (and do) tune models specifically to score well on known tests. The benchmark stops measuring general quality and starts measuring benchmark-fitness.
Arena sidesteps this by using real user prompts on live models, with no fixed answer key. Voters judge by personal preference, so there is nothing to memorize. This makes gaming harder (though not impossible, as you will see later). The result is a leaderboard that many practitioners treat as the closest proxy to "which model do users actually prefer in the wild."
For an AI engineer, the Arena number tells you something useful: if two models are within 10–15 points, treat them as roughly equivalent for general use. If one model is 150+ points ahead, there is a meaningful, user-noticeable quality gap. This lets you make product decisions — model selection, cost/quality trade-offs — with a data point that is harder to inflate than a lab's own benchmark.
- Compare models across labs on real conversational tasks without relying on self-reported scores.
- Track progress over time: the leaderboard history shows how rapidly frontier models have improved since 2023.
- Sanity-check fine-tunes by entering your own model as a private submission and seeing how it compares.
- Interpret model cards: a model with a top-5 Arena score but mediocre benchmark numbers often signals that the benchmark no longer tracks what users actually prefer.
How the rating system works
When you vote on Arena, that single preference is turned into a signal: Model A beat Model B (or tied). Aggregate enough of those signals and you can fit a statistical model that assigns each chatbot a single number capturing its overall strength. Here is how that pipeline runs.
Classic Elo: the original formula
The Elo system calculates an expected win probability before each battle, then updates ratings according to how surprising the result was. If Model A (rating 1400) faces Model B (rating 1200), A is the heavy favourite. A win earns A only a fraction of the K-factor; a loss costs it most of it. The formula is:
Expected score: E_A = 1 / (1 + 10^((R_B - R_A) / 400))
Rating update: R_A_new = R_A + K * (actual - E_A)
actual = 1 (win), 0.5 (tie), 0 (loss)
K = 4 (Arena's setting; chess typically uses 32)
The small K-factor of 4 makes Arena ratings stable. A single upset changes scores by at most 4 points, not 32. This matters when you have millions of battles: you want the aggregate signal to dominate, not recent noise.
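As a worked example of the two formulas above, here is a minimal sketch in plain Python. The function names are illustrative, and the symmetric update of Model B (B loses exactly what A gains) is the standard Elo convention rather than something stated in the formulas above.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, actual_a: float, k: float = 4.0):
    """Return the new (r_a, r_b) after one battle.

    actual_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    """
    e_a = expected_score(r_a, r_b)
    delta = k * (actual_a - e_a)
    # Assumption: zero-sum update, so B moves by the opposite amount.
    return r_a + delta, r_b - delta


print(f"expected score for A: {expected_score(1400, 1200):.3f}")
print("A wins :", elo_update(1400, 1200, 1.0))
print("A loses:", elo_update(1400, 1200, 0.0))
```

With K = 4, the 1400-rated favourite gains a little under 1 point for the expected win and drops about 3 points for the upset loss, which is the asymmetry described above.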
Bradley-Terry: the upgrade
Arena later replaced the simple online Elo update with the Bradley-Terry (BT) model. Instead of updating ratings one battle at a time, BT looks at all battles simultaneously and finds the rating vector that best explains the observed win rates — a maximum-likelihood fit. This has two advantages:
- No recency bias. Online Elo gives slightly more weight to the most recent battle; BT treats every vote equally regardless of when it happened.
- Better calibration. BT directly optimises the log-likelihood of all outcomes, giving a tighter fit to the data and more accurate win-probability estimates between any pair of models.
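To make the "fit all battles at once" idea concrete, the sketch below maximises the Bradley-Terry likelihood with the textbook MM (Zermelo) iteration in plain Python. This is a stand-in chosen for readability, not Arena's actual solver; the function name, the tie handling (half a win each), the anchor of 1000, and the toy battle counts are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iters=500, tol=1e-10, anchor=1000.0):
    """Fit Bradley-Terry strengths with the MM (Zermelo) iteration.

    battles: list of (model_a, model_b, score_a), score_a in {1.0, 0.5, 0.0}.
    Returns {model: rating} on the 400-point Elo scale, mean-centred at `anchor`.
    Assumes every model has at least one win and one loss and that the
    comparison graph is connected; otherwise the maximum does not exist.
    """
    wins = defaultdict(float)       # total (fractional) wins per model
    opponents = defaultdict(lambda: defaultdict(float))  # battles per pair
    for a, b, score_a in battles:
        wins[a] += score_a
        wins[b] += 1.0 - score_a
        opponents[a][b] += 1.0
        opponents[b][a] += 1.0

    models = sorted(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(n / (strength[i] + strength[j])
                        for j, n in opponents[i].items())
            new[i] = wins[i] / denom
        # Strengths are defined only up to a constant factor:
        # pin the geometric mean to 1 after every sweep.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        new = {m: v / math.exp(log_mean) for m, v in new.items()}
        change = max(abs(new[m] - strength[m]) for m in models)
        strength = new
        if change < tol:
            break

    # P(i beats j) = s_i / (s_i + s_j) matches the Elo curve when R = 400*log10(s).
    return {m: anchor + 400.0 * math.log10(strength[m]) for m in models}


# Toy data: three hypothetical models, 100 battles per pair.
battles = (
    [("A", "B", 1.0)] * 64 + [("A", "B", 0.0)] * 36
    + [("B", "C", 1.0)] * 64 + [("B", "C", 0.0)] * 36
    + [("A", "C", 1.0)] * 76 + [("A", "C", 0.0)] * 24
)
ratings = fit_bradley_terry(battles)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

Because the toy win rates (64 %, 64 %, 76 %) were chosen to match 100- and 200-point gaps, the fitted ratings land roughly 100 points apart, and the order of the battles in the list has no effect on the result.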
Bootstrap confidence intervals
A single rating number hides how certain we are. Arena reports a 95 % confidence interval computed by bootstrap resampling: draw a new sample of the recorded battles with replacement (the same size as the original set), refit the BT model on it, repeat this 1 000 times, and read the 2.5 %–97.5 % range of the resulting scores. Models with many battles get tight CIs; newer models or niche-category specialists show wide bands. When two models' confidence intervals overlap, the ranking between them is statistically ambiguous, so they are effectively tied.
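The resampling loop described above can be sketched in a few lines of plain Python. The compact `fit_bt` helper below is an illustrative Bradley-Terry fit (MM iteration), not Arena's production code, and the toy battles, the seed, and the reduced round count in the demo are assumptions chosen to keep the example fast.

```python
import math
import random
from collections import defaultdict


def fit_bt(battles, iters=100):
    """Compact Bradley-Terry fit (MM iteration), ratings on the 400-point scale."""
    wins, games = defaultdict(float), defaultdict(float)
    for a, b, score_a in battles:
        wins[a] += score_a
        wins[b] += 1.0 - score_a
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0
    models = sorted(wins)
    s = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(n / (s[i] + s[j]) for (x, j), n in games.items() if x == i)
            # Tiny floor keeps a model that is winless in a resample away from log(0).
            new[i] = max(wins[i], 1e-3) / denom
        norm = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        s = {m: v / norm for m, v in new.items()}
    return {m: 1000.0 + 400.0 * math.log10(s[m]) for m in models}


def bootstrap_ci(battles, rounds=1000, seed=0):
    """Percentile 95 % interval per model from `rounds` bootstrap refits."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resample = rng.choices(battles, k=len(battles))  # sampling with replacement
        for model, rating in fit_bt(resample).items():
            samples[model].append(rating)
    ci = {}
    for model, vals in samples.items():
        vals.sort()
        ci[model] = (vals[int(0.025 * (len(vals) - 1))],
                     vals[int(0.975 * (len(vals) - 1))])
    return ci


# Toy data: three hypothetical models, 100 battles per pair.
battles = (
    [("A", "B", 1.0)] * 64 + [("A", "B", 0.0)] * 36
    + [("B", "C", 1.0)] * 64 + [("B", "C", 0.0)] * 36
    + [("A", "C", 1.0)] * 76 + [("A", "C", 0.0)] * 24
)
point = fit_bt(battles)
ci = bootstrap_ci(battles, rounds=300)  # fewer rounds than 1 000, for speed only
for m in sorted(point, key=point.get, reverse=True):
    print(f"{m}: {point[m]:.0f}  [{ci[m][0]:.0f}, {ci[m][1]:.0f}]")
```

With only 100 battles per pair the intervals come out wide, which is the behaviour described above for models with few votes; adding battles to the toy data narrows them.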
| Elo gap | Approximate win rate | Practical meaning |
|---|---|---|
| 0–15 | ~50 % | Statistical tie — prefer on price or latency |
| ~100 | ~64 % | Noticeable quality edge for most tasks |
| ~200 | ~76 % | Strong preference; clearly the better model |
| 300+ | ~85 %+ | Very large gap — rare between frontier models |
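The win rates in the table follow directly from the Elo expected-score formula. A minimal sketch that reproduces them, plus the inverse mapping from an observed win rate back to a rating gap (both function names are illustrative):

```python
import math


def win_probability(gap: float) -> float:
    """Win rate of the higher-rated model for a rating gap in Elo points."""
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


def gap_for_win_rate(p: float) -> float:
    """Inverse mapping: rating gap implied by a head-to-head win rate p (0 < p < 1)."""
    return 400.0 * math.log10(p / (1.0 - p))


for gap in (15, 100, 200, 300):
    print(f"{gap:>3} points -> {win_probability(gap):.1%}")
# 15 points -> 52.2%
# 100 points -> 64.0%
# 200 points -> 76.0%
# 300 points -> 84.9%
```

The inverse is handy when you have your own A/B preference data: a 70 % win rate, for instance, corresponds to a gap of roughly 147 points.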
Leaderboard categories and what each measures
Arena does not publish a single number. As of 2026 it runs several parallel leaderboards filtered by topic, each drawing votes only from battles in that domain. Knowing which category to consult is as important as reading the number.
| Category | What it measures | Best for |
|---|---|---|
| Overall | General conversation across all prompt types | General-purpose assistant selection |
| Coding | Code generation, debugging, and explanation | Engineering tool decisions |
| Math | Reasoning through quantitative problems | STEM and data tasks |
| Creative Writing | Narrative quality, style, originality | Content and copywriting tools |
| Instruction Following | Obeying format and constraint instructions | Pipelines with strict output schemas |
| Hard Prompts | Votes from complex, multi-step questions only | Power-user and enterprise tasks |
| Expert (Arena Expert) | Top 5.5 % of prompts by reasoning depth | Research and specialist deployments |
Arena Expert, launched in November 2025, is the most selective: only the highest-complexity prompts are included. Models that rank highly in Expert but poorly on Overall tend to be strong at deep reasoning but less polished on everyday chat — useful signal if your use-case is technical.
How the leaderboard gets gamed — and its real limits
Arena is harder to game than a fixed benchmark, but it is not immune. A 2025 paper titled "The Leaderboard Illusion" (researchers from Cohere Labs, AI2, Princeton, Stanford, and others) analysed 2 million battles across 243 models and documented several systematic advantages enjoyed by large providers.
Best-of-N private testing
Arena allows selected providers to submit model variants privately, accumulate battles, and then choose which variant goes public (withdrawing the others). The paper found that testing just 20 variants can inflate the best-observed score by roughly 50 points compared to a single honest submission. Meta reportedly tested as many as 27 private variants before picking one to publish. This systematically advantages large labs over smaller ones.
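The mechanism behind best-of-N inflation is a selection effect, and a small simulation shows it. The sketch below is an illustration only, not the paper's methodology: it assumes every variant has identical true skill and that each measured score carries Gaussian noise with a hypothetical standard deviation of 25 points.

```python
import random
import statistics


def best_of_n_inflation(n_variants, noise_sd=25.0, trials=5000, seed=0):
    """Average gap between the best observed score and the shared true rating."""
    rng = random.Random(seed)
    gains = []
    for _ in range(trials):
        # All variants are equally good (true offset 0); only the noise differs.
        observed = [rng.gauss(0.0, noise_sd) for _ in range(n_variants)]
        gains.append(max(observed))  # the provider publishes the luckiest variant
    return statistics.mean(gains)


for n in (1, 5, 10, 20):
    print(f"{n:>2} variants -> best score inflated by about "
          f"{best_of_n_inflation(n):.0f} points")
```

A single submission shows no systematic inflation, while picking the maximum of 20 noisy measurements pushes the published score well above the true rating even though no variant is actually better. The exact size depends entirely on the assumed noise level.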
Style bias
Human voters are not grading accuracy — they are expressing preference. Responses with bullet-point formatting, a specific length (neither too short nor too long), and a confident tone tend to win more votes regardless of correctness. A model that answers wrong but presents the answer beautifully can outrank one that answers right but plainly.
Training on Arena data
Arena's prompt log is partially public. The paper found that fine-tuning a model on Arena-style prompts can produce relative performance gains of up to 112 % on the Arena distribution — without making the model better at anything else. And about 7.3 % of prompts from December 2024 appeared again verbatim in January 2025, meaning the "live" distribution has meaningful duplication.
Vote rigging
An ICML 2025 paper showed that coordinated vote-rigging campaigns can shift rankings even with a modest number of fake battles, especially using an "omnipresent rigging" strategy that exploits the BT model's global coupling — rigging a battle between two unrelated models still nudges the target model's score upward.
Going deeper
If you want to go beyond reading the leaderboard and actually understand or reproduce the math, here is where to dig.
Arena-Rank: the open-source methodology
Arena open-sourced its full ranking stack as the arena-rank Python package. It uses JAX as the computational backend — just-in-time compilation cuts what was previously a 19-minute recomputation to under 10 seconds. The package separates data preprocessing from model fitting, so you can swap ranking algorithms (Elo, BT, TrueSkill) against the same cleaned dataset. This is the code that runs the live leaderboard.
Interpreting confidence intervals in practice
A model that jumps from rank 8 to rank 4 between two leaderboard snapshots has probably just accumulated enough battles to tighten its CI, not necessarily gotten better. Before concluding that a new model is "beating" an older one, check whether (a) the CIs no longer overlap, and (b) the vote count is above roughly 1 000 battles — below that, the interval is wide enough that almost any ordering is plausible.
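The two checks above are easy to encode as a guard before you act on a ranking change. This is a minimal sketch: the `Entry` fields and the sample numbers are hypothetical, not Arena's data schema, and applying the battle threshold to both models is a conservative choice made here.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    rating: float
    ci_low: float    # lower end of the 95 % interval
    ci_high: float   # upper end of the 95 % interval
    battles: int


def clearly_ahead(a: Entry, b: Entry, min_battles: int = 1000) -> bool:
    """True only if a's interval sits entirely above b's and both have enough votes."""
    enough_votes = a.battles >= min_battles and b.battles >= min_battles
    separated = a.ci_low > b.ci_high
    return enough_votes and separated


# Hypothetical leaderboard rows for illustration.
veteran = Entry("veteran-model", 1310.0, 1304.0, 1316.0, battles=42_000)
newcomer = Entry("new-model", 1325.0, 1298.0, 1352.0, battles=600)
leader = Entry("leader-model", 1360.0, 1353.0, 1367.0, battles=18_000)

print(clearly_ahead(newcomer, veteran))  # False: intervals overlap and votes are few
print(clearly_ahead(leader, veteran))    # True: intervals are disjoint, both well sampled
```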
Silent deprecations and BT model assumptions
The BT model assumes that every model's strength is constant over time and that battles are drawn from the same distribution. Neither is strictly true: models get updated silently, and the user population changes. The Leaderboard Illusion paper found that 64 % of silently deprecated models were open-weight or fully open-source, which biases the surviving model pool toward closed-API providers. If a model you care about has been removed, it stops receiving new battles, so its rating reflects an older mix of prompts and opponents and becomes harder to compare with the ratings of models that are still active.
When to trust Arena scores — and when not to
Arena scores are most trustworthy when: the model has 5 000+ battles, you are querying a domain-specific leaderboard (Coding, Math) not just Overall, and the model was not submitted privately with many retracted variants. They are least trustworthy for: comparing models within a 20-point band, judging very recently-added models with few battles, and domains where human aesthetic preference diverges from downstream task accuracy (e.g., formal reasoning, factual recall).
FAQ
What does an LMArena Elo score actually mean?
It is a relative strength estimate derived from millions of pairwise human votes. The absolute number has no fixed meaning; what matters are the gaps. A 100-point gap means the higher-rated model wins roughly 64 % of blind head-to-head comparisons; a 200-point gap puts that at about 76 %. Models within ~15 points of each other are statistically tied.
What is the difference between Elo and Bradley-Terry in the context of Arena?
Classic Elo updates ratings one battle at a time using a fixed K-factor, which gives slightly more weight to recent battles. Bradley-Terry fits a maximum-likelihood model over all battles simultaneously, treating every vote equally. Arena switched to BT because it is more statistically principled and avoids recency bias, though both methods produce scores on the same Elo-scaled axis.
How many votes does a model need before its Arena score is reliable?
As a rough rule, expect the 95 % confidence interval to narrow to a useful width once a model has around 1 000 battles. Below that, the interval is so wide that a model's true rank could vary by 5–10 positions. The Arena UI shows the confidence band directly — check it before drawing conclusions from a new model's debut score.
Can AI companies game the LMArena leaderboard?
Yes, in several ways. The most documented is best-of-N private testing: labs can submit multiple model variants privately, observe which scores best, and publish only that one. Research found this can inflate scores by ~50 points. Style optimisation (training for bullet-point-heavy responses) and, theoretically, vote-rigging campaigns are also possible. Arena has published mitigations and open-sourced its ranking code, but the structural advantage for large providers with many private test slots remains.
Is Arena the same as LMSYS?
LMSYS Org (a UC Berkeley research group) created Chatbot Arena in 2023. The platform rebranded to LMArena in September 2024, then incorporated as Arena Intelligence Inc. in April 2025, and rebranded again to simply Arena at arena.ai in January 2026. The methodology and accumulated vote data have been continuous throughout. Most practitioners still call it "Chatbot Arena" or "the Arena."
Why do Arena rankings differ between Overall and Coding/Math categories?
Each category leaderboard is fitted independently using only the battles from prompts tagged to that domain. A model's overall vote share includes casual chitchat and creative tasks, which can dilute (or boost) its score relative to its specialist performance. Always consult the domain-specific leaderboard that matches your use-case rather than the Overall ranking.