Millions of people use artificial intelligence (AI) tools like ChatGPT daily to do everything from generating code to drawing images to creating business ideas.
Those AI tools appear to be getting better. When ChatGPT launched in November 2022, it was powered by GPT-3.5, at the time the most powerful model OpenAI offered. Yet GPT-3.5 was eclipsed by GPT-4 just a few months later. GPT-4 crushed GPT-3.5 on a range of benchmarks, including the bar exam (GPT-4 scored in the 90th percentile; GPT-3.5 in the 10th). In short order, GPT-4 itself was overtaken by GPT-4o, now OpenAI’s most powerful model by a wide margin.
At the same time, companies like Google, Anthropic, and Microsoft have developed increasingly capable AI models that seem to blow past the top scores of previous models on a range of tests and evaluations. That makes the trajectory of AI improvement look clear: in line with scaling laws, models grow dramatically more capable as they are given more data and compute.
Today, the top models all appear to be comparably powerful, and leaps and bounds better than their predecessors of just a couple years ago. However, directly comparing the relative strengths of each model is still difficult. Is GPT-4o or Claude 3.5 Sonnet better at coding? Is Gemini 1.5 Pro or Mistral Large the better choice for document analysis? Which model is more efficient or effective for a specific task or use case?
Those are not easy questions to answer because we lack effective standardized evaluation methods to tell us how good a particular AI model is at a particular thing. There are plenty of tests and benchmarks to measure different outcomes, but they’re still inadequate if you want to definitively understand which model to use for a particular task.
It’s like knowing you’re looking at a group of Olympic gold medalists, but not knowing which sports they won.
A feature or a bug?
To start, the problem lies in the nature of today’s AI systems. The difficulty of measuring them is in many ways a feature, not a bug.
Generative AI, the type of AI that powers OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Gemini, and others, is a general-purpose technology that is good at a lot of different things, not narrowly great and purpose-built for one specific thing. This inherently complicates attempts to measure it consistently.
“Generative AI is designed for versatility, developed through large-scale pre-training to support open-ended applications,” said Matthias Bethge, an AI researcher at Germany’s University of Tübingen. “Traditional evaluation methods, which rely on static tests for predefined abilities, fall short in capturing this versatility.”
As an example, you can use the exact same model to do two wildly different things like, say, generate art in the style of the Dutch Masters and code in Python. You know the model you use to do this is generally state-of-the-art, but you don’t always know if the model is state-of-the-art at the specific task you want to do, chosen from a menu of thousands of tasks the model is capable of doing.
In other words, this incredible AI model may suck at art or code, or both, and be world-class at something else. That presents an obvious problem when trying to pick which tool to use to get stuff done, and it does not have an obvious solution. The best AI companies can do right now is measure their technology using a range of different methods, said Kathy McKeown, an expert in natural language processing at Columbia University.
“Companies measure the quality of generative AI systems for the most part by evaluation on a suite of task-specific benchmarks,” she said.
Common benchmarks used to measure models include graduate-level, “Google-proof” question answering (GPQA), performance across a wide range of academic subjects (Massive Multitask Language Understanding, or MMLU), and how well a model handles multimodal inputs (Massive Multi-discipline Multimodal Understanding, or MMMU). These benchmarks are commonly cited in technical reports by AI companies when they release a new model. Companies also often cite how well models perform on standardized tests designed to evaluate human competency.
In the GPT-4 technical report, for instance, benchmarks such as question answering, story cloze (asking the model to come up with the end of a story), natural language inference, and summarization were used to evaluate how well the model does certain tasks, said McKeown. But we can only guess at the full range of benchmarks companies might be using, as they don’t reveal everything, she cautioned.
While the benchmarks are helpful, they also have drawbacks.
MMLU, for instance, one of the more common benchmarks, consists of roughly 16,000 multiple-choice questions spanning 57 academic subjects. The idea is that a chatbot that answers more of these questions correctly is smarter than one that answers fewer.
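In practice, scoring a model on a benchmark like MMLU mostly comes down to counting correct answers. The sketch below shows roughly what that looks like in Python; ask_model is a hypothetical stand-in for whatever API call returns the model's chosen option, and the question format is simplified.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# `ask_model` is a hypothetical stand-in for a real model API call.

def ask_model(question: str, choices: list[str]) -> str:
    """Return the model's chosen option letter, e.g. 'A'."""
    raise NotImplementedError  # call the model being evaluated here

def accuracy(benchmark: list[dict]) -> float:
    """Each item looks like {'question': ..., 'choices': [...], 'answer': 'C'}."""
    correct = 0
    for item in benchmark:
        prediction = ask_model(item["question"], item["choices"])
        if prediction.strip().upper() == item["answer"]:
            correct += 1
    return correct / len(benchmark)  # reported as accuracy, e.g. 0.86
```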
However, it’s just not that simple.
If an AI model’s training data contains questions and answers from the MMLU test, it can effectively cheat: it is more likely to know the answers than a model that never saw them. And, spoiler, there is no outside teacher or organization grading AI’s work. Companies grade and report their own performance.
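One common way to look for that kind of leakage is to search for long word sequences that appear in both the test questions and the training corpus. The toy sketch below illustrates the idea of an n-gram overlap check; it is not any particular company's decontamination pipeline.

```python
# Toy n-gram overlap check for benchmark contamination.
# Real decontamination runs at corpus scale; this only illustrates the idea.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_question: str, training_docs: list[str], n: int = 13) -> bool:
    """Flag a test item if any long n-gram also appears in the training data."""
    test_grams = ngrams(test_question, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)
```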
Benchmark datasets themselves can also be flawed in other ways, said McKeown.
For example, some benchmarks use a common metric called ROUGE to evaluate how well a model does on summarization tasks using a common dataset called XSum.
“But we know there are a lot of problems with the dataset,” said McKeown, because it has reference summaries that are unfaithful to the input. “And a metric like ROUGE will reward a model, then, for generating unfaithful summaries for XSum.”
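ROUGE only measures how much a generated summary's wording overlaps with a reference summary, so if the reference is unfaithful, a summary that copies its errors still scores well. A small illustration using the open-source rouge_score package (the strings here are invented for the example):

```python
# ROUGE rewards word overlap with the reference summary, faithful or not.
from rouge_score import rouge_scorer

reference = "The company announced record profits and plans to expand overseas."
# A summary that repeats the reference's claims scores highly even if those
# claims never appeared in the source document being summarized.
candidate = "The company announced record profits and will expand overseas."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```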
There are also plenty of languages, she noted, that are not well-represented in training data, further reducing the effectiveness of certain benchmarks, depending on what you’re trying to evaluate.
In other words, even the quantitative measures of AI are only as good as the data underpinning them.
The challenge of evaluating narrow tasks
Make no mistake, broad benchmarks can be useful. They will start to tell you, with some reliability, how good a new AI model is at a broad category of tasks. Yet they don’t really tell you if a particular model will be best to use for a tangible task you’re trying to do in your business or life.
“What we do much less of is evaluate whether AI helps people complete real-world tasks,” said Lydia Chilton, a professor of computer science at Columbia University. “It can write a cold email that looks and sounds like a good cold email should, but did that cold email get you the response you wanted?”
To answer that today, we have to rely on far less scientific methods of measurement, and right now a lot of that measurement is qualitative: you try out a particular model on a particular use case and see whether its output is superior to that of another model you might use. In fact, this is part of how one of the top measurement sites evaluates AI.
The Large Model Systems Organization (LMSYS) is an open research organization that, among other projects, maintains the Chatbot Arena Leaderboard, an industry-standard scoreboard that rates how effective current AI models are. The leaderboard uses a combination of human feedback (users rate the outputs of two different models) and Elo ratings, a method used in the chess world to rank the relative skill level of players in zero-sum games. At any given time, you can see which models reign supreme on the leaderboard—a fleeting honor that AI companies love to brag about.
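The Elo part of that system is a running update rule: when a user prefers model A's answer over model B's, A's rating rises in proportion to how surprising the win was. Below is a stripped-down version of the classic chess-style update; the leaderboard's production statistics are more involved, but the intuition is the same.

```python
# Classic Elo update, the chess-style rating scheme the leaderboard draws on.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new (rating_a, rating_b) after one human preference vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return rating_a + k * (s_a - e_a), rating_b - k * (s_a - e_a)

# An upset win by a lower-rated model moves ratings the most.
print(update(1000, 1200, a_won=True))  # roughly (1024, 1176)
```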
This gets closer to a granularly useful measurement system: for the top models, enough users across enough use cases have rated the outputs highly, which helps narrow down which models to use. But it still doesn’t tell you exactly which model is good at which task, and how reliably that can be judged varies widely from task to task.
One area that does appear to be doing well, said Chilton, is evaluating AI for code generation.
“A nice property of the code synthesis problem is that the outputs are relatively easy to evaluate: you can run the code and test if it did the right thing,” she said. This makes it much easier to determine if one model is better at coding than another.
“For this problem, you can more easily close the feedback loop to test whether the output solved the problem, and that’s why code from generative AI is typically quite useful.”
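That feedback loop can be written down almost literally: generate the code, run it against test cases, and count how many pass, the pattern used by benchmarks in the spirit of HumanEval. The sketch below shows the idea; note that executing untrusted model output this way needs a sandbox in any real harness.

```python
# Minimal sketch of functional-correctness checking for generated code.
# WARNING: exec() on untrusted model output must be sandboxed in practice.

def passes_tests(candidate_code: str, tests: list[str]) -> bool:
    """Run the model's code, then each assert-style test, in one namespace."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the generated function(s)
        for test in tests:
            exec(test, namespace)         # raises AssertionError on failure
        return True
    except Exception:
        return False

generated = "def add(a, b):\n    return a + b\n"
tests = ["assert add(2, 2) == 4", "assert add(-1, 1) == 0"]
print(passes_tests(generated, tests))  # True: the output 'did the right thing'
```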
The same is true of relatively simple questions and answers, she said. For a single question with relatively little context, it is fairly straightforward to evaluate whether the answer is correct and free of hallucinations (instances of the AI confidently making things up). It also is relatively easy to judge whether you like the output’s length, tone, and clarity.
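For short factual answers, that check can be as blunt as normalizing both the model's answer and a known gold answer and comparing them, in the style of classic question-answering metrics. A hypothetical sketch:

```python
# Toy exact-match check for short-answer questions.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(model_answer: str, gold_answer: str) -> bool:
    return normalize(model_answer) == normalize(gold_answer)

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
```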
But for everything else in between?
“Current methods for evaluating generative AI require a significant upgrade to encompass the full range of capabilities these models offer,” said Bethge. His research group is attempting to pioneer one such method, which they call “infinity benchmarks.”
“[This is] a more dynamic approach that leverages a constantly evolving, open pool of data drawn from both existing and emerging test sets,” Bethge said. “This enables more flexible and comprehensive assessments.”
By reusing and aggregating diverse data samples, the method can evaluate models under different testing conditions and dynamically update prior evaluations, he said. That gets around the problem of static benchmarks, essentially continuously tracking and measuring model performance against ever-expanding tests and requirements.
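To make the idea concrete, a toy version of such a dynamic benchmark might look like the sketch below: an ever-growing pool of test items, with each model's running score updated as new items arrive. This is an illustration of the concept only, not the group's actual "infinity benchmarks" implementation, and all names are hypothetical.

```python
# Toy illustration of an evolving evaluation pool with running scores.
# Concept sketch only -- not the actual "infinity benchmarks" method.

class DynamicBenchmark:
    def __init__(self):
        self.pool: list[dict] = []                 # {'item': ..., 'answer': ...}
        self.results: dict[str, list[bool]] = {}   # model name -> pass/fail history

    def add_items(self, items: list[dict]) -> None:
        """Fold newly collected or emerging test sets into the shared pool."""
        self.pool.extend(items)

    def evaluate(self, model_name: str, predict) -> float:
        """Score `predict` on any items it hasn't seen; return running accuracy."""
        history = self.results.setdefault(model_name, [])
        for item in self.pool[len(history):]:
            history.append(predict(item["item"]) == item["answer"])
        return sum(history) / len(history) if history else 0.0
```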
Further Reading
- Achiam, J., et al. GPT-4 Technical Report, arXiv, Mar. 13, 2023, https://arxiv.org/abs/2303.08774
- LMSYS Chatbot Arena Leaderboard, Large Model Systems Organization, https://lmarena.ai
- McKeown, K., et al. Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers, arXiv, Mar. 2, 2024, https://arxiv.org/abs/2403.01061