Beyond Turing: Testing LLMs for Intelligence

Modern large language models appear to be able to pass some public Turing tests. How do we measure if they are as intelligent as people?


In the nearly two years since its release, ChatGPT has shown some remarkably human-like behavior, from trying to seduce a journalist to acing the bar exam. That has left some people wondering whether computers are approaching human levels of intelligence. Most computer scientists do not think machines are the intellectual equals of people yet, but they have not developed a consensus on how to measure intelligence, or what exactly to measure.

The canonical experiment to check for machine intelligence is the Turing test, proposed by Alan Turing in his 1950 paper “Computing Machinery and Intelligence.” Turing argued that if a computer could convince a person having a typed conversation with it that it was human, that might be a sign of intelligence. Large language models (LLMs) such as GPT excel at speaking like people, but have not yet convincingly passed the Turing test.

In 2023, researchers at the University of California San Diego (UCSD) conducted a public Turing test designed to compare the performance of the most recent LLMs with that of Eliza, a chatbot developed in the 1960s. GPT-4, the version that scored high on the bar exam, did fairly well, passing as human in the judges’ estimation in 41% of the games it played. Its predecessor, GPT-3.5, passed in only 14% of its games, falling short of Eliza’s 27%. Humans passed for human 63% of the time.

That the human score is so low is not surprising, says Cameron Jones, a Ph.D. student in cognitive science at UCSD, who ran the experiment. Because players expect the models to do well, they are more likely to mistake a real human for a convincingly human-sounding model. Jones says it is unclear what score a chatbot would have to achieve to win the game.

The Turing test could be useful for checking whether a customer service chatbot, for instance, is interacting with people in a way that those people are comfortable with, demonstrating what Jones calls a flexible social intelligence. Whether it can identify more general intelligence, however, is difficult to say. “We have a really poor understanding of what intelligence is in humans,” Jones said. “And I’d be surprised if we’re quicker at resolving that question in the case of models.”

“I think the whole idea of the Turing test has been taken too literally,” said Melanie Mitchell, a professor of complexity at the Santa Fe Institute, who argues that Turing proposed his Imitation Game as a way to think about what machine intelligence might look like, and not as a well-defined test. “People use the term glibly to say, ‘large language models have passed the Turing test,’ which in fact they haven’t.”

New exams

If the Turing test does not reliably assess machine intelligence, though, that raises the question of what might. In a November 2023 paper in the journal Intelligent Computing, Philip Johnson-Laird, a psychologist at Princeton University, and Marco Ragni, a professor of predictive analytics at Germany’s Chemnitz University of Technology, proposed a different exam: they suggested treating a model as if it were a participant in a psychological experiment, and seeing whether it can understand its own reasoning process.

For instance, they would ask a model a question such as, “If Ann is intelligent, does it follow that she is intelligent or she is rich, or both?” While it is valid under the rules of logic to infer that Ann is intelligent or rich, or both, since a disjunction follows from either of its parts, most humans would reject the inference because nothing in the set-up suggests she might be rich. If the model also rejects it, it is behaving like a human, and the researchers move on to the next step, asking the machine to explain its reasoning. If it gives a reason similar to one a person would give, the third step is to examine the source code for components that simulate human performance. Those might include a system for making rapid inferences, another for more deliberative reasoning, and a system for changing the interpretation of words like “or” based on their context. If it passes all these tests, the researchers argue, the model can be thought of as emulating human intelligence.
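The first of those steps can be pictured as a short script: pose the disjunction question to the model and check whether it rejects the logically valid but humanly odd inference. The sketch below is illustrative only; `ask_model` is a hypothetical stand-in for whatever chat interface the model under test exposes, and the keyword check is a deliberately crude proxy for the judgment a human experimenter would make.

```python
# Sketch of step one of Johnson-Laird and Ragni's proposed exam, under the
# assumptions stated above. Not from their paper; the helper names are invented.

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real chat-API call to the model under test.
    return "No, it does not follow. Nothing suggests that Ann is rich."

def rejects_inference(answer: str) -> bool:
    # Crude proxy for judging that the model rejected the inference,
    # as most human participants do.
    text = answer.lower()
    return any(p in text for p in ("does not follow", "doesn't follow", "not necessarily"))

prompt = ("If Ann is intelligent, does it follow that she is intelligent "
          "or she is rich, or both? Explain your reasoning.")
reply = ask_model(prompt)

if rejects_inference(reply):
    # Step two compares the model's explanation with typical human explanations;
    # step three examines the source code for components that simulate human reasoning.
    print("Model answers like a human participant; proceed to step two.")
else:
    print("Model accepts the valid but humanly odd inference; it diverges from people here.")
```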

Huma Shah, a computing professor at the U.K.’s Coventry University who studies machine intelligence and has run Turing tests, says Johnson-Laird and Ragni’s approach might provide some interesting insights, but points out that questioning a model about its reasoning is not new. “The Turing test allows for that kind of logical questioning,” she said.

The trouble with trying to test for intelligence is that it depends on how one defines intelligence in the first place, Shah said. Is it pattern recognition, the ability to innovate, the capacity to produce something creative like music or comedy? “So the ‘I’ in ‘AI,’” she said. “If we don’t have an agreement on ‘I,’ then how can we be building artificial general intelligence (AGI)?”

For his part, Francois Chollet, a software engineer and AI specialist at Google, does not find the Turing test particularly useful. A good test should have an exact, formalized goal, he said, and should measure how close a system is to that goal. “That’s not really what the Turing test does,” he pointed out. “It’s about deceiving people into believing they’re talking to a human.”

The performance of LLMs on Turing tests demonstrates only that they are good at using language, a skill that comes entirely from memorizing large amounts of data, Chollet said. Real intelligence is not about mastering an individual skill, he argued, but about taking what has been learned and applying it to a new, different situation. “LLMs are 100% about memorization. They have no intelligence. They have no ability to adapt,” Chollet said.

In his view, intelligence is the ability to efficiently acquire new skills that training did not prepare for, with the goal of accomplishing tasks that are sufficiently different from those a system has seen before. Humans spend their lives interacting with the world, essentially running experiments that allow them to build a model of how the world works, so that when they come up against new situations, they can learn to handle them. The wider the scope of the new skills, the closer the computer comes to achieving artificial general intelligence.

“If you can make the learning process as information-efficient as a human mind, then you’ve got AGI,” Chollet said. So far, machines lag far behind; by his estimate, they are roughly 10,000 times less efficient than human brains. For instance, it took millions of images to teach computers to recognize pictures of cats, whereas humans learn to identify them from only one or two examples.

To test for intelligence under his definition, Chollet developed the Abstraction and Reasoning Corpus (ARC). The ARC tasks are built from elementary building blocks: simple concepts like shape or size. Those building blocks are combined into tasks, such as sorting objects by size or completing a symmetrical pattern. The test subject is shown three examples and should be able to identify the goal and reproduce the task. The best AIs achieve the goal roughly 30% of the time, Chollet said, whereas people are right about 80% of the time.
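To make the format concrete, here is a toy task in the spirit of ARC, not an item from the actual corpus: grids are small arrays of color codes, a few demonstration pairs instantiate a simple rule (here, mirroring a grid left to right), and the solver must infer the rule and apply it to a test input.

```python
# A toy ARC-style task for illustration only; the real ARC tasks are more varied.
# Grids are lists of rows of color codes, with 0 as the background color.

demonstrations = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
    ([[3, 3, 0],
      [0, 0, 4]],
     [[0, 3, 3],
      [4, 0, 0]]),
    ([[0, 7, 0],
      [8, 0, 0]],
     [[0, 7, 0],
      [0, 0, 8]]),
]
test_input = [[5, 0, 0],
              [5, 0, 6]]

def mirror_left_right(grid):
    # The rule a solver is meant to infer from the three demonstrations:
    # reflect each row left to right.
    return [list(reversed(row)) for row in grid]

# A solver is scored on whether its inferred rule reproduces the demonstration
# outputs and then yields the correct output for the unseen test input.
assert all(mirror_left_right(inp) == out for inp, out in demonstrations)
print(mirror_left_right(test_input))
```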

Each task is unlike any the subject has seen before, so memorization does not help. “It’s a game that you cannot practice for,” he said. When GPT-4 passed the bar exam, that was likely because it had seen enough examples that looked like the exam that it could produce reasonable answers, without any intrinsic understanding of the law.

“It’s not a perfect test. It has many limitations, many flaws,” Chollet said. For instance, there is enough redundancy in the tasks that after enough examples, the test subject may be able to make an educated guess at an answer. The basic idea, though, is solid, he said.

Mitchell agreed that human-like general intelligence requires the ability to accomplish tasks far outside the training data. She and her group came up with a revised version of ARC, called ConceptARC, that organizes the tasks around basic concepts, such as one thing being on top of another, or one thing being inside another. The idea is to test how robust the computer’s solutions are by having it derive a rule for a concept and then apply that concept to new tasks.

For instance, she might show the AI a grid in which a yellow square on top of a blue square is replaced by the blue square on top of the yellow one, followed by a red circle on top of a green circle that is replaced by a green circle on top of a red one. The concept, which humans pick up easily, is that the colors are swapping vertical positions. The computer would then have to apply that rule to a new pair of shapes. The tasks are easy for people, but still seem to be very challenging for machines, Mitchell said.
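As a rough illustration of that swapping concept, the sketch below encodes two demonstration scenes and a new pair of shapes as tiny grids. The color codes and shapes are invented for this example and do not come from the ConceptARC dataset.

```python
# Toy illustration of the ConceptARC idea: several demonstrations instantiate one
# concept (two stacked objects swap vertical positions), and the solver must apply
# it to a new pair of shapes. Invented example; not from the ConceptARC dataset.

def swap_vertical_positions(grid):
    # The underlying concept: the top and bottom halves of the scene trade places.
    half = len(grid) // 2
    return grid[half:] + grid[:half]

demonstrations = [
    # yellow square (4) above blue square (1)  ->  blue above yellow
    ([[4, 4], [1, 1]], [[1, 1], [4, 4]]),
    # red circle (2) above green circle (3)    ->  green above red
    ([[2, 2], [3, 3]], [[3, 3], [2, 2]]),
]
test_input = [[6, 6], [7, 7]]  # a new pair of shapes the solver has not seen

assert all(swap_vertical_positions(inp) == out for inp, out in demonstrations)
print(swap_vertical_positions(test_input))  # expected: [[7, 7], [6, 6]]
```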

There may be instances, such as trying to make a discovery in vast amounts of data, where a computer having its own way of making abstractions would be desirable, Mitchell said. In cases where computers interact with people, though, such as driving a car, understanding the world in a human way is important.

“I don’t think intelligence is all or nothing. It’s a spectrum and there’s certain things that computers do that are intelligent,” Mitchell said. “If you want to talk about full, human-level intelligence, I would say we are somewhat far away, because there’s many aspects of human-level intelligence that are invisible to us.”
