What to Expect When You’re Expecting … GPT-4

As 2022 came to a close, OpenAI released an automatic writing system called ChatGPT that rapidly became an Internet sensation; less than two weeks after its release, more than a million people had signed up to try it online. As every reader surely knows by now, you type in text, and immediately get back paragraphs and paragraphs of uncannily human-like writing, stories, poems and more. Some of what it writes is so good that some people are using it to pick up dates on Tinder (“Do you mind if I take a seat? Because watching you do those hip thrusts is making my legs feel a little weak.”) Other, to the considerable consternation of educators everywhere, are using it write term papers. Still others are using it to try to reinvent search engines . I have never seen anything like this much buzz.

Still, we should not be entirely impressed.

Although ChatGPT can write about anything, it is also easily confused. As I told NYT columnist Farhad Manjoo, ChatGPT, like earlier, related systems is “still not reliable, still doesn’t understand the physical world, still doesn’t understand the psychological world and still hallucinates.”

Which is to say it makes stuff up, regularly; a fair bit of what it is says simply isn’t true.

ChatGPT claims, for example, that churros were good for surgery, because “their compact size allows for greater precision and control during surgery, reducing the risk of complications and improving the overall outcome of the procedure”; the hallucination problem, whereby the machine makes up fluent nonsense, is so dire, that at least one prominent website, Stack Overflow, has put a temporary ban on computer-generated submissions.

And the blunders are endless. The system is frequently retuned based on feedback, but weeks later I still keep getting sent tweets like these:

And sure, there are guardrails in place now, but sometimes those guardrails just suck. Here’s one of my own first experiments with ChatGPT:

Fails like these are so easy to find that a Sam Altman, CEO of OpenAI, and former lampooner of me, was eventually forced to concede to reality:

In short, ChatGPT might sound like the Star Trek computer, but for now, you still can’t trust it.

But all that is so 2022. What about 2023?

What Silicon Valley, and indeed the world, is waiting for, is GPT-4.

I guarantee that minds will be blown. I know several people who have actually tried GPT-4, and all were impressed. It truly is coming soon (Spring of 2023, according to some rumors). When it comes out, it will totally eclipse ChatGPT; it’s safe bet that even more people will be talking about it..

In many quarters, expectations are really, really high:

In technical terms, GPT-4 will have more parameters inside of it, requiring more processors and memory to be tied together, and be trained on more data. GPT-1 was trained on 4.6 gigabytes of data, GPT-2 was trained on 46 gigabytes, GPT-3 was trained on 750. GPT-4 will be trained on considerably more, a significant fraction of the internet as a whole. As OpenAI has learned, bigger in many ways means better, with outputs more and more humanlike with each iteration. GPT-4 is going to be a monster.

But will it solve the problems we have seen before? I am not so sure.

Although GPT-4 will definitely seem smarter than its predecessors, its internal architecture remains problematic. I suspect that what we will see is a familiar pattern: immense initial buzz, followed by a more careful scientific inspection, followed by a recognition that many problems remain.

As far as I can tell from rumors, GPT-4 is architecturally essentially the same as GPT-3. If so, we can expect that approach to still be marred by something fundamental: an inability to construct internal models of how the world works, and in consequence we should anticipate an inability to understand things at an abstract level. GPT-4 may be better at faking term papers, but if it follows the same playbook as it predecessors, it still won’t really understand the world, and seams will eventually show.

And so, against the tremendous optimism for GPT-4 that I have heard from much the AI community, here are seven dark predictions:

GPT-4 will still, like its predecessors, be a bull in a china shop, reckless and hard to control. It will still make a significant number of shake-your-head stupid errors, in ways that are hard to fully predict. It will often do what you want, sometimes not — and it will remain difficult to anticipate which in advance..
Reasoning about physical, psychological and mathematical world will still be unreliable, GPT-3 was challenged in theory of mind, medical and physical reasoning. GPT-4 will solve many of the individual specific items used in prior benchmarks, but still get tripped up, particularly in longer and more complex scenarios. When queried on medicine it will either resist answering (if there are aggressive guardrails) or occasionally spout plausible-sounding but dangerous nonsense. It will not be trustworthy and complete enough to give reliable medical advice, despite devouring a large fraction of the Internet.
Fluent hallucinations will still be common, and easily induced, continuing — and in in fact escalating — the risk of large language models being used as a tool for creating plausible-sounding yet false misinformation. Guardrails (a la ChatGPT) may be in place, but the guardrails will teeter between being too weak (beaten by “jailbreaks”) and too strong (rejecting some perfectly reasonable requests). Bad actors will in any case eventually be able to replicate much of GPT-4, dispensing with whatever guardrails are in place, and using knock-off systems to create whatever narratives they wish.
Its natural language output still won’t be something that one can reliably hook up to downstream programs; it won’t be something, for example, that you can simply and directly hook up to a database or virtual assistant, with predictable results. GPT-4 will not have reliable models of the things that it talks about that are accessible to external programmers in a way that reliably feeds downstream processes. People building things like virtual assistants will find that they cannot reliably enough map user language onto user intentions.
GPT-4 by itself won’t be a general purpose artificial general intelligence capable of taking on arbitrary tasks. Without external aids it won’t be able beat Meta’s Cicero in Diplomacy; it won’t be able to drive a car reliably; it won’t be able to reliably guide a robot like Optimus to be anything like as versatile as Rosie the Robot. It will remain turbocharged pastiche generator, and a fine tool for brainstorming, and for first drafts, but not trustworthy general intelligence.
“Alignment” between what humans want and what machines do will continue to be a critical, unsolved problem. The system will still not be able to restrict its output to reliably following a shared set of human values around helpfulness, harmlessness, and truthfulness. Examples of concealed bias will be discovered within days or months. Some of its advice will be head-scratchingly bad.
When AGI (artificial intelligence) comes, large language models like GPT-4 may be seen in hindsight as part of the eventual solution, but only as part of the solution. “Scaling” alone—building bigger and models until they absorb the entire internet — will prove useful, but only to a point. Trustworthy, general artificial intelligence, aligned with human values, will come, when it does, from systems that are more structured, with more built-in knowledge, and will incorporate at least some degree of explicit tools for reasoning and planning, as well as explicit it knowledge, that are lacking in systems like GPT. Within a decade, maybe much less, the focus of AI will move from a pure focus on scaling large language models to a focus on integrating them with a wide range of other techniques. In retrospectives written in 2043, intellectual historians will conclude that there was an initial overemphasis on large language models, and a gradual but critical shift of the pendulum back to more structured systems with deeper comprehension.

If all seven predictions prove correct, I hope that the field will finally realize that it is time to move on.

Shiny things are always fun to play with, and I fully expect GPT-4 to be the shiniest so far, but that doesn’t mean that it is a critical step on the optimal path to AI that we can trust. For that, we will, I predict, need genuinely new architectures that incorporate explicit knowledge and world models at their very core.

Gary Marcus (@garymarcus) is a scientist, best-selling author, and entrepreneur. His most recent book, co-authored with Ernest Davis, Rebooting AI, is one of Forbes’s 7 Must Read Books in AI.