BLOG@CACM
Artificial Intelligence and Machine Learning

AI Does Not Help Programmers

Posted
Bertrand Meyer of Constructor Institute and Eiffel Software

Everyone is blown away by the new AI-based assistants. (Myself included: see an earlier article on this blog which, by the way, I would write differently today.) They pass bar exams and write songs. They also produce programs. Starting with Matt Welsh’s article in Communications of the ACM, many people now pronounce programming dead, most recently The New York Times.

I have tried to understand how I could use ChatGPT for programming and, unlike Welsh, found almost nothing. If the idea is to write some sort of program from scratch, well, then yes. I am willing to believe the experiment reported on Twitter in which a beginner using Copilot beat a professional programmer hands-down in the from-scratch development of a Minimum Viable Product program, from “Figma screens and a set of specs.” I have also seen people who know next to nothing about programming get a useful program prototype by just typing in a general specification. But I am talking about something else, the kind of use that Welsh touts: a professional programmer using an AI assistant to do a better job. It doesn’t work.

Precautionary observations:

  • Caveat 1: We are in the early days of the technology and it is easy to mistake teething problems for fundamental limitations. (PC Magazine’s initial review of the iPhone: “it’s just a plain lousy phone, and although it makes some exciting advances in handheld Web browsing it is not the Internet in your pocket.”) Still, we have to assess what we have, not what we could get.
  • Caveat 2: I am using ChatGPT (version 4). Other tools may perform better.
  • Caveat 3: It has become fair game to goad ChatGPT, Bard, and the like into giving wrong answers. We all have great fun when they tell us that Famous Computer Scientist X has received the Turing Award and next (equally wrongly) that X is dead. Such exercises have their use, but here I am doing something different: not trying to trick an AI assistant by pushing it to the limits of its knowledge, but genuinely trying to get help from it for my key purpose, programming. I would love to get correct answers and, when I started, thought I would. What I found through honest, open-minded enquiry is at complete odds with the hype.
  • Caveat 4: The title of this article is rather assertive. Take it as a proposition to be debated (“This house believes that…”). I would be interested to be proven wrong. The main immediate goal is not to decree an inflexible opinion (there is enough of that on social networks), but to spur a fruitful discussion to advance our understanding beyond the “Wow!” effect.

Here is my experience so far. As a programmer, I know where to go to solve a problem. But I am fallible; I would love to have an assistant who keeps me in check, alerting me to pitfalls and correcting me when I err. An effective pair-programmer. But that is not what I get. Instead, I have the equivalent of a cocky graduate student, smart and widely read, also polite and quick to apologize, but thoroughly, invariably, sloppy and unreliable. I have little use for such supposed help.

It is easy to see how generative AI tools can perform an excellent job and outperform people in many areas: those where we need a result that comes very quickly, is convincing, resembles what a top expert would produce, and is almost right on substance. Marketing brochures. Translations of Web sites. Actually, translations in general (I would not encourage anyone to embrace a career as interpreter right now). Medical image analysis. There are undoubtedly many more. But programming has a distinctive requirement: programs must be right. We tolerate bugs, but the core functionality must be correct. If the customer’s order is to buy 100 shares of Microsoft and sell 50 of Amazon, the program should not do the reverse because an object was shared rather than replicated. That is the kind of serious error professional programmers make and for which they need help.

AI in its modern form, however, does not generate correct programs: it generates programs inferred from many earlier programs it has seen. These programs look correct but have no guarantee of correctness. (I am talking about “modern” AI to distinguish it from the earlier kind—largely considered to have failed—which tried to reproduce human logical thinking, for example through expert systems. Today’s AI works by statistical inference.)

Fascinating as they are, AI assistants are not works of logic; they are works of words. Large language models: smooth talkers (like the ones who got all the dates in high school). They have become incredibly good at producing text that looks right. For many applications that is enough. Not for programming.

Some time ago, I published on this blog a sequence of articles that tackled the (supposedly) elementary problem of binary search, each looking good and each proposing a version which, up to the last installments, was wrong. (The first article is here; it links to its successor, as all items in the series do. There is also a version on my personal blog as a single article, which may be more convenient to read.)
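
To give readers who have not followed that series an idea of the trap, here is an illustrative sketch, in Python rather than the code of those articles, of a version that looks entirely reasonable yet fails on an array of just two elements:

    # Illustrative only (not the code from the original articles): a plausible-
    # looking binary search for "does x occur in the sorted array t?".
    # On t = [1, 2] with x = 2, low never advances and the loop never terminates.
    def has(t, x):
        low, high = 0, len(t) - 1
        while low < high:
            mid = (low + high) // 2
            if t[mid] < x:
                low = mid        # wrong: should be mid + 1
            else:
                high = mid
        return len(t) > 0 and t[low] == x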

I submitted the initial version to ChatGPT. (The interaction took place in late May; I have not run it again since.)

The answer begins with a useful description of the problem:

Good analysis; similar in fact to the debunking of the first version in my own follow-up. The problem can actually arise with any number of elements, not just two, but to prove a program incorrect it suffices to exhibit a single counterexample. (To prove it correct, you have to show that it works for all examples.) But here is what ChatGPT comes up with next, even though all I had actually asked was whether the program was correct, not how to fix it:

 

 (Please examine this code now!) It includes helpful comments:

 

All this is very good, but if you did examine the proposed replacement code, you may have found something fishy, as I did.

I report it:

Indeed, in trying to fix my bug, ChatGPT produced another buggy version, although the bug is a new one. There is an eerie similarity with my own original sequence of binary search posts, where each attempt introduced a version that seemed to correct the mistake in the preceding one, only to reveal another problem.

The difference, of course, is that my articles were pedagogical, whereas ChatGPT asserts with undaunted assurance that its latest version is the correct fix!

One thing ChatGPT is very good at is apologizing:

Well, for my part, when looking for an assistant I am all for him/her/it being polite and apologizing, but what I really want is for the assistant to be right. Am I asking too much? ChatGPT volunteers, as usual, the corrected version that I had not even (or not yet) requested:

(Do you also find that the tool doth apologize too much? I know I am being unfair, but I cannot help thinking of the French phrase trop poli pour être honnête, too polite to be honest.)

At this point, I did not even try to determine whether that newest version is correct; any competent programmer knows that spotting cases that do not work and adding a specific fix for each is not the best path to a correct program.
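
To see concretely where that style of repair leads, here is a hypothetical sketch (my own illustration, not ChatGPT’s actual output): every reported counterexample gets its own patch, while the flawed general logic stays in place and fails on the next input nobody thought of.

    # Hypothetical illustration of case-by-case patching. Each special case
    # silences one known counterexample; the underlying loop is still wrong and
    # still loops forever on other inputs (for example t = [1, 2, 3], x = 3).
    def has_patched(t, x):
        if len(t) == 0:              # patch 1: the empty-array failure
            return False
        if len(t) == 2:              # patch 2: the two-element counterexample
            return x in t
        low, high = 0, len(t) - 1    # the original flawed search, unchanged
        while low < high:
            mid = (low + high) // 2
            if t[mid] < x:
                low = mid            # the real bug is still here
            else:
                high = mid
        return t[low] == x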

I, too, remain (fairly) polite:

Now I am in for a good case of touché: ChatGPT is about to lecture me on the concept of loop invariant!
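
For the record, here is my own minimal sketch, not ChatGPT’s text, of what reasoning from a loop invariant looks like for this problem: the invariant states exactly what remains true throughout the loop, and the final answer follows from it.

    # Minimal sketch: binary search guided by an explicit loop invariant.
    # Invariant: if x occurs in t at all, it occurs in the slice t[low:high].
    def has(t, x):
        low, high = 0, len(t)
        while low < high:
            mid = (low + high) // 2
            if t[mid] < x:
                low = mid + 1        # everything before mid + 1 is < x: invariant kept
            elif t[mid] > x:
                high = mid           # everything from mid on is > x: invariant kept
            else:
                return True
        return False                 # slice t[low:high] is empty, so x is absent

The slice shrinks at every iteration, so the loop terminates; when it becomes empty, the invariant tells us that x does not occur. That is the kind of reasoning a correct version rests on, rather than a pile of special cases.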