BLOG@CACM
Artificial Intelligence and Machine Learning

Stop Judging AI Using Human Exams

Jeremy Roschelle, Executive Director of Learning Sciences Research, Digital Promise

About 25 years ago, I sat next to a housepainter from Seattle on a coast-to-coast flight. He was chatty, especially about physics—he was an amateur scientist and was proud to share all he had discovered in encyclopedias. At the time, I was studying how people learn physics, and I really enjoyed the high-altitude conversation because the housepainter was passionate about the subject. Yet his knowledge was oddly proportioned: about a mile wide but only an inch deep. From reading and cross-referencing encyclopedias, he had figured out a lot, but he still had many questions and gaps in his knowledge—understandable, given such superficial yet highly cross-indexed learning material. While he was passionate, smart, and engaging, no one would mistake the housepainter from Seattle for a university physicist.

These days I notice a frequent type of headline:

  • ChatGPT can pass every high school test. 
  • ChatGPT bests university exams.
  • ChatGPT beats students on law school tests.

I believe people are excited by these headlines because they believe passing the test shows that the AI is smart. Or perhaps that AI is as competent as a high school student, university student, or law student. But that’s faulty reasoning. To explain why, I offer a lightning tour of a pair of concepts important to human exams: discrimination and validity.

Human tests are designed using psychometrics, most often Item Response Theory (IRT). With IRT, test makers build large banks of candidate items and then ask a sample of human students to try them. Based on how those pools of human test takers respond, IRT empirically measures how well each item discriminates among people of lower and higher ability in the subject of the exam. Items that do not discriminate are tossed out; items that work well are retained. Thus the validity of the exam as a measure of human ability is empirically calibrated.
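To make discrimination concrete, here is a minimal sketch in Python of one common IRT formulation, the two-parameter logistic (2PL) model; the ability values and item parameters are invented for illustration, not drawn from any real exam. The discrimination parameter a controls how sharply an item separates lower- and higher-ability human test takers.

    import numpy as np

    def p_correct(theta, a, b):
        # 2PL item response function: probability that a person with ability theta
        # answers an item with discrimination a and difficulty b correctly.
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    abilities = np.array([-1.0, 0.0, 1.0])  # lower, average, higher human ability

    # A highly discriminating item separates these test takers sharply...
    print(p_correct(abilities, a=2.0, b=0.0))  # roughly [0.12, 0.50, 0.88]

    # ...while a weakly discriminating item barely separates them and would be tossed out.
    print(p_correct(abilities, a=0.2, b=0.0))  # roughly [0.45, 0.50, 0.55]

Note that in practice the parameters a and b are estimated entirely from human response data.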

Here’s the important point: IRT yields no guarantee that this validity holds for non-human test takers, such as AI algorithms or aliens from another planet. Because AI models answer human test items in different ways than human test takers do, we cannot assume that a high test score means a smarter AI model. The IRT model was never given the data it would need to discriminate reliably between smart and shallow AI models.

Second, all tests seek to use a very limited type and number of items to make an inference from specific to general: from the specific items on the test to a person’s strengths on a wider variety of naturalistic tasks in a general domain of knowledge, skills, and abilities. To back up these inferences for humans, psychometricians apply a bag of techniques that establish validity. These, too, are empirically calibrated among people, for example, by comparing performance on a new test to other measures of performance. If the new test and the other measures correlate, the inference from specific to general is more valid. Again, there is no guarantee that for non-human test takers, the inference from a specific set of items (a law test) to a domain of knowledge, skills, and abilities (a successful law student) is warranted.
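To illustrate the kind of validity evidence involved, here is a small Python sketch using hypothetical scores (invented for illustration) for ten students on a new exam and on an existing, trusted measure of the same domain, such as course grades. A strong correlation between the two is one piece of evidence that the specific-to-general inference holds for the human population being tested.

    import numpy as np

    # Hypothetical scores for the same ten students (illustrative only).
    new_test  = np.array([52, 61, 68, 70, 74, 77, 81, 85, 90, 95])
    criterion = np.array([55, 58, 65, 72, 70, 80, 79, 88, 92, 97])

    # Criterion-related validity evidence: correlate the new test with the trusted measure.
    r = np.corrcoef(new_test, criterion)[0, 1]
    print(f"correlation between new test and criterion: {r:.2f}")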

Now, let’s think about what this looks like for AI algorithms. First, it seems pretty obvious that taking these tests would do little to discriminate between worse and better algorithms. Tests use a pretty narrow set of task formats, and it is easy to over-optimize for those formats. I’m frankly more impressed by ChatGPT’s performance in ill-structured chatbot interactions with humans than by its ability to nail a standardized test. They are called standardized tests for a reason: they are very well-defined and predictable in structure and content. Why would anyone believe that an AI algorithm or model that performs well on standardized tests is a more powerful or better model than one that can handle a wide variety of challenging yet non-standardized tasks?

And as for the specific-test-to-general-knowledge inference, what I termed the validity problem, it seems that AI is more like the housepainter from Seattle than we care to admit. He was a great conversationalist on physics, yet no one would mistake him for a university physicist. My mentor in graduate school, Andy DiSessa, described expert physics knowledge as having a very precise structure; true experts can traverse the connections from more superficial answers back to foundational principles. They have a well-organized topology of a very large number of pieces of knowledge, and the principles that organize the nature and types of connections within their knowledge base mirror the epistemology of physics. These connections allow the expert to confirm that a specific application of physics is well grounded in foundational principles, and to provide deeper and deeper explanations. This is where the housepainter from Seattle quickly came up short.

Today’s large language models have a different epistemology, one based on the sequential connectivity of word frequencies in sentences. I have no idea how we go from sequential word frequencies (a horizontal topology) to the applied-to-axiomatic structure that characterizes expertise (a vertical topology). But I am sure we are not there yet. And thus passing a human test is not a valid indication that the AI model or algorithm is developing knowledge like that of a human expert. Today’s generative AI is more like the housepainter from Seattle than we care to admit, and the news stories about AI models passing tests do a disservice by trivializing the nature of expertise in a domain.

My recommendations:

  1. Computer scientists should take the lead in informing people that human exams are not good ways to measure the strengths of AI algorithms or models.
  2. Computer scientists must also take the lead in informing people that the comparison of AI to people by way of score rankings on human exams is a faulty comparison.
  3. Finally, learning scientists like me should take the lead in helping people think about how to improve human assessment now that we have more powerful machines. Old tests are not sacred cows; they were simply the best available proxies we had for a person’s general knowledge, skills, and abilities in a domain. Although educators use long-standing formats for exams, the science of building assessments of human skill has advanced massively in the past decades, and we’re well poised to help educators create new exams that measure the abilities of humans as they work with powerful tools to solve challenging problems.

Jeremy Roschelle is Executive Director of Learning Sciences Research at Digital Promise and a Fellow of the International Society of the Learning Sciences.
