Self-Correction in Large Language Models

Researchers are taking a closer look at the effectiveness of self-correcting LLMs.

The latest generation of chatbots includes valuable tools to help with a variety of tasks, from providing information about a topic to generating computer code. Yet despite their impressive abilities, anyone who uses them regularly will have noticed that they make various types of mistakes. A recent round-up of the failures of OpenAI’s ChatGPT groups its mistakes into 10 different categories; some examples demonstrate difficulty in reasoning, such as working out the sequence of events in a simple story, multiplying large numbers, or solving riddles.

Self-correction is one approach that could improve the responses generated by the large language models (LLMs) that power chatbots. Although so far it has largely been of interest as a way to fix reasoning errors, self-correction also is being investigated for other tasks. It takes inspiration from the way humans correct themselves.

“When we solve a problem, we might first get an initial solution, and then we try to get some feedback either by self-reflection or from some other sources and then revise our initial answer iteratively until we get the right solution,” said Liangming Pan, an assistant professor at the University of Arizona whose research focuses on how to build LLMs that are logical, truthful, and safe. “People are interested in whether large language models have a similar ability.”

Different self-correction approaches already are being used in LLMs. The first step is typically to get feedback. In some approaches, models are harnessed to evaluate the accuracy of their own output; in others, external tools, such as a search engine used to check facts, provide the feedback. Responses can be improved at different stages of the process: during training by using specialized fine-tuning methods, while an answer is being generated, or afterwards.
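
In its simplest form, this generate-feedback-revise loop can be sketched in a few lines of code. The sketch below is illustrative only: `llm` stands in for any function that sends a prompt to a model and returns its reply, and the prompts are hypothetical rather than drawn from a specific paper.

```python
from typing import Callable

def self_correct(llm: Callable[[str], str], question: str, max_rounds: int = 3) -> str:
    """Generate an answer, then iteratively critique and revise it."""
    answer = llm(f"Answer the question:\n{question}")
    for _ in range(max_rounds):
        # Feedback step: here the model critiques its own output; an external
        # tool such as a search engine or code executor could be used instead.
        feedback = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "List any factual or reasoning errors. Reply NONE if the answer is correct."
        )
        if feedback.strip().upper() == "NONE":
            break
        # Revision step: rewrite the answer in light of the feedback.
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {feedback}\nWrite a corrected answer."
        )
    return answer
```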

Many papers cite specific cases in which an LLM has been able to improve itself, but it is not clear how effective self-correction is overall. Some researchers are now taking a closer look at the big picture. “The common belief was that large language models can correct their own mistakes by themselves,” said Pan.

Ryo Kamoi, a Ph.D. student at Penn State University, and his colleagues were keen to survey existing research on self-correction, since they had come across contradictory findings. Some recent papers suggest it is difficult for LLMs to detect their own mistakes, while others claim they are good at correcting themselves and can do so without using external tools. 

After taking a closer look, the team found that many of the papers in which LLMs appeared able to fix their mistakes used a simple or suboptimal prompt, which produced mediocre initial responses that were easy to improve upon. In other cases, favorable results were not generalizable, due to the nature of the task.

“[LLMs] could only do self-correction on very specific tasks,” said Rui Zhang, an assistant professor at Penn State University and one of Kamoi’s co-authors.

Kamoi added that for an LLM to be able to correct its mistakes, it must first be able to detect an error, which is the bottleneck in the process. The tasks described as success stories in research papers often had an obvious answer, which made it easy to gauge whether a response was wrong.

The team also found that using external tools typically helped LLMs fix their mistakes. Models also performed better when they were fine-tuned during training on datasets specifically designed for self-correction.

Kamoi said that using human-annotated data as feedback has also been proven to help with self-correction. However, it often is not a viable approach, since it is time-consuming and costly. A popular alternative is to use reinforcement learning (RL), a paradigm centered on improving from self-generated feedback through trial and error. For example, researchers from Google DeepMind recently developed a high-performing, two-stage approach called SCoRe that uses RL and a reward method to guide a model to self-correct effectively.
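
The core idea behind rewarding self-correction can be illustrated with a toy reward function that scores a model's second attempt and gives extra credit for a genuine fix. This is a loose, simplified sketch of the general idea, not DeepMind's actual SCoRe implementation; `is_correct` is an assumed task-specific checker.

```python
from typing import Callable

def correction_reward(first_attempt: str, second_attempt: str, reference: str,
                      is_correct: Callable[[str, str], bool],
                      bonus: float = 0.5) -> float:
    """Score a revised attempt, with extra credit for a genuine correction."""
    reward = 1.0 if is_correct(second_attempt, reference) else 0.0
    # The bonus rewards turning a wrong first attempt into a right one,
    # discouraging the model from merely repeating its initial answer.
    if reward == 1.0 and not is_correct(first_attempt, reference):
        reward += bonus
    return reward
```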

In follow-up work, Kamoi and his colleagues are exploring a novel avenue for LLM self-correction that involves supervised fine-tuning using synthetic datasets.

“The idea is that if we target specific tasks where we can detect mistakes in large language models relatively easily, we can automatically create datasets with error annotations on the responses of large language models,” said Kamoi. “We are just starting a project to explore whether we can improve the error detection performance of large language models on general tasks by training large language models on [these datasets].”
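
A rough illustration of that idea, using toy arithmetic as the easily checkable task: because the ground truth can be computed directly, each model response can be labeled as correct or erroneous without human annotation. The `llm` helper and the answer parsing below are hypothetical simplifications, not the team's actual pipeline.

```python
import random
import re
from typing import Callable

def build_error_annotated_dataset(llm: Callable[[str], str], n: int = 100) -> list[dict]:
    """Label an LLM's answers to checkable arithmetic questions, no humans needed."""
    dataset = []
    for _ in range(n):
        a, b = random.randint(100, 999), random.randint(100, 999)
        prompt = f"What is {a} * {b}? Answer with a number only."
        response = llm(prompt)
        match = re.search(r"-?\d+", response.replace(",", ""))
        predicted = int(match.group()) if match else None
        # The task has a computable ground truth, so the error label is free.
        dataset.append({
            "prompt": prompt,
            "response": response,
            "has_error": predicted != a * b,
        })
    return dataset
```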

Pan and his colleagues also examined studies related to LLM self-correction in a survey paper published earlier this year that focuses on recent approaches using automated feedback. Echoing the findings of Kamoi’s team, they found more examples of LLMs failing to self-correct when relying on their own feedback than success stories.

“The hardest part is how to get very accurate feedback, especially with reasoning tasks when you want intermediate feedback (before the entire response is generated),” said Pan.

Although some studies gave examples of specific tasks where LLMs were able to improve their performance from self-generated feedback, there were also many cases where their responses got worse. Pan and his colleagues suspect models exhibit something akin to narcissism, favoring what they themselves generate, a phenomenon called self-bias.

The team followed up with another study to investigate the suspected self-bias. They analyzed six different LLMs, testing how they behaved in four different languages while performing three tasks: machine translation, text generation, and mathematical reasoning. Using a correction method called self-refinement, in which the quality of a response is assessed through self-feedback and improved over a pre-defined number of iterations, they attempted to quantify self-bias in each model.
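
One simple way to illustrate such a quantification is to compare the scores a model assigns to its own outputs with scores from an independent judge or reference metric; a positive gap suggests inflated self-evaluation. This is only an illustrative proxy, not the exact measure used in the study.

```python
from statistics import mean

def estimate_self_bias(self_scores: list[float], external_scores: list[float]) -> float:
    """Average gap between a model's self-assigned scores and external scores.
    Positive values mean the model rates its own outputs more favorably than
    an independent judge does."""
    assert len(self_scores) == len(external_scores)
    return mean(s - e for s, e in zip(self_scores, external_scores))

# Toy example: self-assessed quality of 0.9, 0.8, 0.95 vs. reference scores.
print(estimate_self_bias([0.9, 0.8, 0.95], [0.6, 0.7, 0.5]))  # ~0.28
```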

They found that all models exhibited self-bias, regardless of the task and language, which affected their ability to optimize their responses. Although a model’s output often improved in terms of its wording, becoming easier to understand, the quality of the answer itself frequently was no better. In some cases, generated text was rated more favorably if it mirrored the LLM’s style.

Furthermore, an LLM’s partiality was exacerbated during the optimization process. “[Self-bias] becomes stronger and stronger as you do more rounds of self-correction,” said Pan.

Pan and his colleagues have proposed ways of mitigating self-bias and, in turn, improving self-correction. In some of their experiments, they showed that larger models were less partial to their own output and were better at fixing their own mistakes, which suggests increasing the size of an LLM is one potential solution.

Pan thinks a better theoretical understanding of self-correction is needed to develop more effective approaches; for example, probing what is happening to different parameters in an LLM while it is trying to self-correct could reveal new details about the process.

More in-depth knowledge should help uncover the limits of self-correction and whether it is impossible in certain scenarios. In addition, it could allow self-correction to be used in tasks such as generating open-ended dialogue, where it is hard to define what is a mistake and provide objective feedback. Until now, LLM self-correction has focused on tasks with well-defined answers, such as reasoning problems. “Application-wise, there is a lot of future work we can do,” said Pan.

Sandrine Ceurstemont is a freelance science writer based in London, U.K.
