Opinion
Artificial Intelligence and Machine Learning

‘Not on the Best Path’

Gary Marcus discusses AI's technical problems, and why he thinks Large Language Models have reached a “point of diminishing returns.”

Gary Marcus

In an age of breathless predictions and sky-high valuations, cognitive scientist Gary Marcus has emerged as one of the best-known skeptics of generative artificial intelligence (AI). In fact, he recently wrote a book about his concerns, Taming Silicon Valley, in which he made the case that “we are not on the best path right now, either technically or morally.” Marcus—who has spent his career examining both natural and artificial intelligence—explained his reasoning in a recent conversation with Leah Hoffmann.

You’ve written about neural networks in everything from your 1992 monograph on language acquisition to, most recently, your book Taming Silicon Valley. Your thoughts about how AI companies and policies fall short have been well covered in your U.S. Senate testimony and other outlets (including your own Substack). Let’s talk here about your technical criticisms.

Technically speaking, neural networks, as they are usually used, are function approximators, and Large Language Models (LLMs) are basically approximating the function of how humans use language. And they’re extremely good at that. But approximating a function is not the same thing as learning a function.

In 1998, I pointed out several examples of what people now call the problem of distribution shift. For instance, I trained the one-hidden-layer neural networks that were popular at the time on the identity function, f(x) = x, using even numbers represented as binary digits, and I showed that these systems could generalize to some new even numbers. But if I tested them on odd numbers, they would systematically fail. So I made, roughly, a distinction between interpolation and extrapolation, and I concluded that these tools are good at interpolating functions, but they’re not very good at extrapolating them.
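
A minimal sketch of the kind of experiment Marcus describes, not his original code, might look like the following: a one-hidden-layer network is trained to reproduce binary encodings of even numbers and then tested on odd numbers. The bit width, hidden-layer size, and scikit-learn setup are illustrative assumptions.

```python
# Illustrative sketch (not Marcus's original code) of the 1998-style
# identity-function experiment: train a one-hidden-layer network on
# f(x) = x over binary encodings of even numbers, then test on odd numbers.
import numpy as np
from sklearn.neural_network import MLPRegressor

BITS = 8

def to_bits(n):
    """Binary encoding, least-significant bit last."""
    return np.array([int(b) for b in format(n, f"0{BITS}b")])

evens = np.array([to_bits(n) for n in range(0, 128, 2)])  # training set
odds  = np.array([to_bits(n) for n in range(1, 128, 2)])  # test set

# One hidden layer, trained to reproduce its input (the identity function).
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
net.fit(evens, evens)

train_pred = net.predict(evens).round()
test_pred  = net.predict(odds).round()

print("even-number accuracy:", (train_pred == evens).all(axis=1).mean())
print("odd-number accuracy: ", (test_pred == odds).all(axis=1).mean())

# The rightmost bit is 0 in every training example, so the network typically
# keeps predicting 0 there and misses every odd number: interpolation works,
# extrapolation outside the training distribution does not.
```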

And in your view, the multilayer neural networks we have now still do not address that issue.

In fact, there was a paper published in October by six Apple researchers basically showing the same thing. If something is in the training set or close to something in the training set, these systems work pretty well. But if it’s far enough away from the training set, they break down.

In philosophy, they make a distinction between intension and extension. The intension of something is basically the abstract meaning, like “even number.” The extension is a list of all the even numbers. And neural networks basically work at the extensional level, but they don’t work at the intensional level. They are not getting the abstract meaning of anything.
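
A few lines of Python can make the distinction concrete; the example below is purely illustrative and uses “even number,” as Marcus does.

```python
# An intensional definition captures the rule itself; an extensional one
# enumerates (some of) the members.
def is_even(n):                 # intension: the abstract rule
    return n % 2 == 0

evens_seen = {0, 2, 4, 6, 8}    # extension: a finite list of examples

print(is_even(1042))            # the rule covers numbers never listed -> True
print(1042 in evens_seen)       # the list of examples does not -> False
```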

You’ve called attention to one way this distinction manifests in river-crossing problems, where generative AI systems propose solutions that resemble the right answer, but with absurdly illogical twists or random elements that were not present in the original question.

These models don’t really have a representation of what a man is, what a woman is, or what a boat is; as a result, they often make really boneheaded mistakes. And there are other consequences, like the fact that you can’t give them an instruction and expect them to reliably follow it. You can’t say, “Don’t lie,” or “don’t hallucinate,” or “don’t use copyrighted materials.” These systems are trained on copyrighted materials, and they can’t judge what is copyrighted; they can’t do basic fact-checking, either. You also can’t get them to follow principles like, “Don’t discriminate on the basis of race or age or sex,” because LLMs trained on real-world data tend to perpetuate past stereotypes rather than follow abstract principles.

So you wind up with all of these technical problems, many of which spill over into the moral and ethical domain.

You’ve argued that to fix the moral and technical problems with AI, we need a new approach, not just more training data.

Generative AI only works for certain things. It works for pattern recognition, but it doesn’t work for the type of formal reasoning you need in chess. It doesn’t work for everyday formal reasoning about the world, and it doesn’t even reliably generate accurate summaries.

If you think about it abstractly, there’s a huge number of possible AI models, and we’re stuck in one corner. So one of my proposals is that we should consider integrating neural networks with classical AI. I make an analogy in my book to Daniel Kahneman’s System One and System Two. System One is fast, reflexive, and automatic—kind of like LLMs—while System Two is more deliberative reasoning, like classical AI. Our human mind combines both and gets results that are not perfect, but that are much better, in many dimensions, than current AI, so I think exploring that would be a really good idea. It won’t be sufficient for developing systems that can observe something and build a structured set of representations about how that thing works, but it might get us part of the way there.
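
As a toy illustration of that division of labor, the sketch below pairs a fast, pattern-like proposer with a slow, rule-based verifier. The random guesser and the particular constraints are stand-ins of our own invention, not a description of any real system.

```python
# Toy sketch of the System One / System Two split: a fast, associative
# proposer guesses candidates, and a deliberate, symbolic verifier accepts
# only those that satisfy explicit rules.
import random

def system_one_propose(n_candidates=20):
    """Fast, pattern-like guesser: emits (x, y) pairs with no reasoning."""
    return [(random.randint(0, 20), random.randint(0, 20))
            for _ in range(n_candidates)]

def system_two_verify(candidate):
    """Slow, rule-based checker: here, require x + y == 10 and x even."""
    x, y = candidate
    return x + y == 10 and x % 2 == 0

def hybrid_solve():
    """Keep proposing until a candidate survives symbolic verification."""
    while True:
        for candidate in system_one_propose():
            if system_two_verify(candidate):
                return candidate

print(hybrid_solve())
```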

At the time of this interview, several people in the field seem to agree that we’re hitting a period of diminishing returns with respect to LLMs.

That is a phrase that I coined in a 2022 essay called “Deep Learning Is Hitting a Wall,” which was about why scaling wouldn’t get us to AGI (Artificial General Intelligence). And when I coined it, everybody dismissed me and said, “No, we’re not reaching diminishing returns. We have these scaling laws. We’ll just get more data.” But what people have observed in the last couple of months is that adding more data does not actually solve the core underlying problems on the technical side. The big companies that are doing big training runs are not getting the results they expected.

Do you think that will be enough to change the atmosphere and shift the industry’s focus?

I hope that the atmosphere will change. In fact, I know it will change, I just don’t know when. A lot of this is crowd psychology. DeepMind does hybrid AI. AlphaFold is a neurosymbolic system, and it just won the Nobel Prize. So there are some efforts, but for the time being, venture capitalists only want to invest in LLMs. There’s no oxygen left for anything else.

That said, different things could happen, maybe even by the time we go to print. The market might crash. If you can’t eliminate hallucinations, it limits your commercial potential. I think people are starting to see that, and if enough of them do, then it’s just a psychology thing. Maybe someone will come up with a new and better idea. At some point, they will. It could come tomorrow or it might take a decade or more.

People have proposed a number of different benchmarks for evaluating progress in AI. What do you make of them?

Here’s a benchmark I proposed in 2014 that I think is still beyond current AI. I call it the comprehension challenge. The idea is that an AI system should be able to watch a movie, build a cognitive model of what is going on, and answer questions. Why did the characters do this? Why is that line funny? What’s the irony in this scene?

Right now, LLMs might get it sort of right some of the time, but nowhere near as reliably as the average person. If a character says at the end of the movie, “I see dead people,” everybody in the cinema has this “Oh, my god” moment. Everybody in the cinema has followed the world of the movie and suddenly realized that a principle they thought was true does not apply. When we have AI that can do that with new movies that are not in the training data, I’ll be genuinely impressed.
