AI benchmarking platform is helping top companies rig their model performances, study claims

LMArena, a popular benchmark for large language models, has been accused of giving preferential treatment to AIs made by big tech firms, potentially enabling them to game their results.

A robot caught underneath a spotlight.
The study's claims raise concerns about how AI models can be tested in a fair and consistent manner.
(Image credit: Getty Images)

The go-to benchmark for artificial intelligence (AI) chatbots is facing scrutiny from researchers who claim that its tests favor proprietary AI models from big tech companies.

LMArena pits two anonymous large language models (LLMs) against each other to see which better tackles a prompt, with users of the benchmark voting for the output they prefer. The results are then fed into a leaderboard that tracks which models perform best and how they have improved.
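
As a rough illustration of how head-to-head votes like these could be turned into a leaderboard, the sketch below applies a simple Elo-style rating update. This is an assumption made for illustration only: the article does not describe the platform's actual scoring method, and the model names, the starting rating of 1000 and the K-factor of 32 are all hypothetical.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def build_leaderboard(votes, k=32.0, start=1000.0):
    """Turn (winner, loser) votes into a ranked list of (model, rating).

    k and start are illustrative defaults, not values taken from the article.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        # The less expected the win, the more rating points change hands.
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
leaderboard = build_leaderboard(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because every update moves the same number of points from the loser to the winner, the total rating in the pool stays constant. A model's position therefore depends on which battles it takes part in and how many votes it receives, as well as on how good its answers are.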

Roland Moore-Colyer

Roland Moore-Colyer is a freelance writer for Live Science and managing editor at consumer tech publication TechRadar, running the Mobile Computing vertical. At TechRadar, one of the largest consumer technology websites in the U.K. and U.S., he focuses on smartphones and tablets. Beyond that, he draws on more than a decade of writing experience to cover electric vehicles (EVs), the evolution and practical use of artificial intelligence (AI), mixed reality products and use cases, and the evolution of computing both on a macro level and from a consumer angle.
