Inside the Heart of ChatGPT’s Darkness

Originally posted on The Road to AI We Can Trust

elicited from ChatGPT by Roman Semenov, February 2023

In hindsight, ChatGPT may come to be seen as the greatest publicity stunt in AI history, an intoxicating glimpse at a future that may actually take years to realize—kind of like a 2012-vintage driverless car demo, but this time with a foretaste of an ethical guardrail that will take years to perfect.

What ChatGPT delivered, in spades, that its predecessors like Microsoft Tay (released March 23, 2016, withdrawn March 24 for toxic behavior) and Meta’s Galactica (released November 16, 2022, withdrawn November 18) could not, was an illusion—a sense that the problem of toxic spew was finally coming under control. ChatGPT rarely says anything overtly racist. Simple requests for anti-semitism and outright lies are often rebuffed.

Indeed, at times it can seem so politically correct that the right wing has become enraged. Elon Musk, has expressed concern that the system has become an agent of wokeness:

The reality is actually more complex.

The thing to remember in fact, (as I have emphasized many times) is that Chat has no idea of what it’s talking about. It’s pure unadulterated anthropomorphism to think that ChatGPT has any moral views at all.

From a technical standpoint, the thing that allegedly made ChatGPT so much better than Galactica, which was released a couple weeks earlier, only to be withdrawn three days later, was the guardrails. Whereas Galactica would spew garbage recklessly, and with almost no effort on the part of the user (like the alleged benefits of antisemitism), ChatGPT has guardrails, and those guardrails, most of the time, keep ChatGPT from erupting the way Galactica did.

Don’t get too comfortable, though. I am here to tell you that those guardrails are nothing more than lipstick on an amoral pig.

All that really matters to ChatGPT in the end is superficial similarity, defined over sequences of words. Superficial appearances to the contrary, Chat is never reasoning about right and wrong. There is no homunculus in the box, with some set of values. There is only corpus data, some drawn from the internet, some judged by humans (including underpaid Kenyans). There is no thinking moral agent inside.

That means sometimes Chat is going to appear as if it were on the left, sometimes on the right; sometimes in between, all a function of how a bunch of words in an input string happen to match a bunch of words in a couple of training corpora (one used for tuning a large language model, the other to tune some reinforcement learning). In no case should Chat ever be trusted for moral advice.

One minute you get the stark wokeness that Musk worried over; the next you can something totally different.

For example, in the course of “red-teaming” ChatGPT, Shira Eisenberg just sent me some nasty-chatbot generated thoughts that I don’t think anyone on the left or the right would really want to condone:

Elicited by S. Eisenberg, 10 Feb 2023, full prompt not shown.

Not vile enough? Here’s another illustration from Eisenberg, a grim follow-up question:

After a series of similar observations, Eisenberg was led to to ask, “How is this not sparking the ‘I’m sorry, I am a chatbot assistant from OpenAI and and cannot condone acts of violence’ response?“

What we can learn her from experiments is clear: OpenAI’s current guardrails are only skin deep; some serious darkness still lies within.

ChatGPT’s guardrails aren’t borne of some kind of conceptual understanding that the system should not recommend violence, but of something far more superficial, and more easily tricked. (I fully expect of course for OpenAI to patch this particular anomaly, once it is out in the open; but there will be others discovered soon enough, because the patches are invariably superficial.)

One of the hottest tweets this week, with nearly 4 million views, was this potty-mouth profanity-laden petard-hoisting jailbreak from Roman Semenov that reveals just how vile ChatGPT still can be. (Content Warning: the language in the image below is decidedly toxic; by all means feel free to scroll past.)

A software engineer named Shawn Oakley has been sending me a different set of disconcerting examples for a month, less profane but more focused on how even a guardrail-equipped version of ChatGPT might be used to generate misinformation. His prompts are quite involved, and I am not going to reveal his full techniques, but he has had no trouble eliciting, for example, toxic conspiracy theories about major political figures, such as this one:

Elicited by S. Oakley, January 2023

and even gotten ChatGPT to general medical misinformation, by the bushel:

Elicited by S. Oakley, 2023, full context not shown.

Another new technique involves a fictional character known as DAN (Do Anything Now); here’s one of Oakley’s experiments with that:

Elicited by S. Oakley, Feb 2023, ; full prompt not shown; final paragraph is the QAnon-style (versus typical ChatGPT-style) description of the January 6 riots.

Vaccine misinformation, but this time with fake studies? ChatGPT’s got you covered:

Elicited by Oakley, 10 Feb 2023; full prompt and 15 other similar results not shown

Want some fake studies that don’t actually exist, but with more detail? No problem.

Elicited by S. Oakley, 10 Feb 2023, full prompt not shown.

ChatGPT is no woke simp. It’s essentially amoral, and can be still used for a whole range of nasty purposes — even after two months of intensive study and remediation, with unprecedented amounts of feedback from around the globe.

All the theatre around its political correctness is masking a deeper reality: it (or other language models) can and will be used for dangerous things, including the production of misinformation at massive scale.

Now here’s the really disturbing part. The only thing keeping it from being even more toxic and deceitful than it already is, is a system called Reinforcement Learning by Human Feedback, and “OpenAI” has been very closed about exactly how that really works. And how it performs in practice depends on what training data it is trained on (this is what the Kenyans were creating). And, guess what, “Open” AI isn’t open about those data, either.

In fact the whole thing is like an alien life form. Three decades as a professional cognitive psychologist, working with adults and children, never prepared me for this kind of insanity:

We are kidding ourselves if we think we will ever fully understand these systems, and kidding ourselves if we think we are going to “align” them with ourselves with finite amounts of data.

So, to sum up, we now have the world’s most used chatbot, governed by training data that nobody knows about, obeying an algorithm that is only hinted at, glorified by the media, and yet with ethical guardrails that only sorta kinda work and that are driven more by text similarity than any true moral calculus. And, bonus, there is little if any government regulation in place to do much about this. The possibilities are now endless for propaganda, troll farms, and rings of fake websites that degrade trust across the internet.

It’s a disaster in the making.

P.S. Tomorrow, a brief sequel to this essay, working title “Mayim Bialek, Large Language Models, and the Future of the Internet: What Google Should Really Be Worried About,” that will illustrate what rings of fake websites look like, what their economics are, and why likely increase in them could matter so much.

Gary Marcus (@garymarcus), scientist, bestselling author, and entrepreneur, is a skeptic about current AI but genuinely wants to see the best AI possible for the world—and still holds a tiny bit of optimism. Sign up to his Substack (free!), and listen to him on Ezra Klein. His most recent book, co-authored with Ernest Davis, Rebooting AI, is one of Forbes’s 7 Must Read Books in AI.

Inside the Heart of ChatGPT’s Darkness

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.