Generative artificial intelligence (GAI) started making waves a few years ago with the release of systems such as ChatGPT and DALL-E, which can produce sophisticated, human-like text, code, or images after the models powering them are trained on large quantities of data. However, it soon became apparent that the specific phrasing of the question or statement a user types in, known as a prompt, has a marked impact on the quality of the resulting output.
“It’s a way of unlocking different capabilities from these models,” says Andrei Muresanu, an AI researcher at Vector Institute in Toronto, Canada. “If you tell ChatGPT to pretend that it’s a professor of mathematics, it will do better on math questions than if you just say, ‘answer this question’ or ‘pretend you’re a student’.”
Coming up with prompts that steer a model towards a desired output has emerged as a relatively new profession, called prompt engineering, to help achieve more relevant and accurate results. This is becoming increasingly important as more companies use generative AI systems, such as chatbots that can help with customer service or tools that can detect fraud. A pioneering technique is chain-of-thought (CoT) prompting, which involves telling a model to break down a complex task, such as a math problem, into intermediate steps as a way of generating a more-accurate result.
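To make the idea concrete, the sketch below contrasts a direct prompt with a chain-of-thought prompt for a simple math word problem. It assumes a hypothetical generate() helper wrapping whatever LLM API is in use, and the exact wording is illustrative rather than taken from any particular paper.

```python
# Minimal sketch of chain-of-thought (CoT) prompting, assuming a hypothetical
# generate(prompt) helper that wraps whatever LLM API is being used.

question = (
    "A bakery sells 12 muffins per tray. If it bakes 7 trays and sells "
    "80 muffins, how many muffins are left?"
)

# Direct prompt: the model is asked for the answer in one step.
direct_prompt = f"Question: {question}\nAnswer:"

# Chain-of-thought prompt: the model is told to work through intermediate steps.
cot_prompt = (
    f"Question: {question}\n"
    "Let's think step by step, showing each intermediate calculation, "
    "then state the final answer."
)

# answer = generate(cot_prompt)  # typically more accurate than generate(direct_prompt)
```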
However, instead of tasking humans with prompt engineering, researchers soon hit on the idea of automating the process by harnessing large language models (LLMs), like those that power ChatGPT, to craft the written input themselves.
“It’s very time-consuming (for humans) to think of prompts and then test them all out,” said Muresanu. He added that prompt engineering is also often unintuitive to people: two queries may seem to have similar wording yet produce results that vary significantly in accuracy.
In the last few years, automated prompting techniques and tools have evolved and are starting to see wider use; they can generate, select, or optimize prompts to improve the performance of models. In 2023, researchers at Stanford University developed an algorithmic framework for optimizing LLM prompts, called Declarative Self-improving Python (DSPy), that was seen as a breakthrough; it can automatically determine whether a new prompt outperforms the initial one. Stanford researchers then followed up with another system, called TextGrad, that uses backpropagation and text-based feedback: it evaluates an LLM's output to home in on its weaknesses, then uses that critique to improve the prompt.
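The following is a conceptual sketch of that kind of feedback-driven refinement loop, written as generic Python rather than DSPy's or TextGrad's actual API; the llm() helper, the round count, and the prompt wording are all assumptions of the sketch.

```python
# Conceptual sketch of feedback-driven prompt refinement in the spirit of
# text-feedback optimizers. This is NOT any library's real API; llm() is a
# hypothetical wrapper around a chat model that takes a string and returns one.

def refine_prompt(prompt, examples, llm, rounds=3):
    """Iteratively critique and rewrite a prompt using held-out (input, expected) pairs."""
    for _ in range(rounds):
        # Run the current prompt on a few held-out examples.
        outputs = [llm(f"{prompt}\n\nInput: {x}") for x, _ in examples]

        # Ask the model for text-based feedback on where the outputs fall short.
        critique = llm(
            "Here are a prompt, some inputs, the model's outputs, and the "
            "expected answers. Describe the prompt's weaknesses.\n"
            f"Prompt: {prompt}\n"
            + "\n".join(
                f"Input: {x}\nOutput: {o}\nExpected: {y}"
                for (x, y), o in zip(examples, outputs)
            )
        )

        # Rewrite the prompt to address the critique.
        prompt = llm(
            "Rewrite this prompt to fix the weaknesses described.\n"
            f"Prompt: {prompt}\nWeaknesses: {critique}\nImproved prompt:"
        )
    return prompt
```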
Many automated prompt engineering tools are now available for public use; for example, AI company OpenAI recently released an AI prompt generator for ChatGPT, while Anthropic made a similar tool available earlier this year that formulates prompts based on best practices for its family of LLMs.
Researchers and developers are now taking a closer look at some of these approaches and evaluating their effectiveness.
Some research has shown that LLMs can come up with prompts that are superior to those created by humans. Recent work by researchers at Google DeepMind, for example, showed that LLMs are better at optimizing their own prompts, outperforming human-generated input by 8% on one benchmark test and by up to 50% on another.
One popular tool developed by the research laboratory, called Optimization by Prompting (OPRO), uses an innovative approach described as ‘meta-prompting’. An LLM optimizes a prompt for a task by first considering previous prompts alongside their training accuracy. It is also presented with a description of the task incorporating illustrative examples from the training set, allowing it to progressively craft new prompts that increase the accuracy of its output.
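A rough sketch of one such meta-prompting step, based only on the description above, might look like the following; the llm() and score() helpers are hypothetical placeholders, not code from the OPRO paper.

```python
# Sketch of an OPRO-style meta-prompting step, as described in the article:
# the optimizer LLM sees earlier prompts paired with their training accuracy,
# plus a task description with a few examples, and proposes a new prompt.
# llm() is a hypothetical text-in/text-out helper.

def opro_step(history, task_description, examples, llm):
    # history: list of (prompt, accuracy) pairs from earlier iterations,
    # shown to the optimizer sorted from worst to best.
    trajectory = "\n".join(
        f"Prompt: {p}\nAccuracy: {acc:.1%}"
        for p, acc in sorted(history, key=lambda pair: pair[1])
    )
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)

    meta_prompt = (
        "Below are previous instructions for a task and their training accuracy.\n"
        f"{trajectory}\n\n"
        f"The task: {task_description}\nExamples:\n{demos}\n\n"
        "Write a new instruction that is different from the ones above and "
        "achieves higher accuracy."
    )
    return llm(meta_prompt)

# A full run would repeat opro_step, measure each new prompt's accuracy on the
# training set with a (hypothetical) score() function, append the result to
# history, and keep the best-scoring prompt found.
```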
In light of the DeepMind findings, Rick Battle, a machine learning researcher and engineer at VMware, a cloud computing company in California, and his team wanted to further compare automated and human-crafted prompts. In work published last year, they first evaluated the performance of human-derived input that included encouraging messages such as ‘This will be fun’, a technique called positive thinking that can often improve a model’s performance. Sixty different prompt variations, some incorporating CoT reasoning, were generated for up to 100 questions from a publicly available dataset of grade school math problems and tested on three open-source language models.
The same language models, which vary in size, were then used to come up with prompts for the same tasks. Previous research had mostly focused on massive, commercial LLMs such as those behind ChatGPT, so Battle and his colleagues wanted to see how smaller models, such as those often developed for use by a single company, compared.
In line with previous work, Battle and his team found that the automated approach was the best way to enhance a model’s results, even when smaller models were used: it produced higher-performing prompts than the most effective ones generated by humans using the positive thinking technique. Battle, however, was surprised at how absurd some of the best model-generated prompts were. One of them, which helped their paper go viral, was a Star Trek-influenced command.
“I would never have written something like that as a person, but a model trying to optimize the prompt tried it and it worked the best,” said Battle.
He thinks language models have an edge because they consider a wider range of possibilities than humans would. People should leave prompt engineering to automated systems and focus instead on developing high-quality test examples for evaluating the success of different inputs, he adds.
“The take-home (message) is to go back to traditional machine learning approaches,” said Battle. “You need to have broad test sets that cover the entirety of your use case.”
Another team also found that LLM-generated prompts can rival those written by humans. Muresanu and his colleagues developed an automated prompting tool called Automatic Prompt Engineer (APE), described in a conference paper at the International Conference on Learning Representations (ICLR) 2023. Given just a few pairs of correct inputs and outputs, APE uses an LLM to search for an optimal prompt: it first comes up with a set of candidate prompts, then tests them and selects the best one.
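In rough outline, and using hypothetical llm() and accuracy() helpers rather than the released tool's own code, that generate-then-select loop looks something like this:

```python
# Sketch of the generate-then-select idea behind APE-style prompt search,
# using hypothetical llm() and accuracy() helpers: an LLM proposes candidate
# instructions from a few input-output pairs, and the candidate that scores
# best on those pairs is kept.

def propose_and_select(pairs, llm, n_candidates=10):
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in pairs)

    # Step 1: ask the LLM to infer instructions that could explain the
    # demonstrations. Sampling with a non-zero temperature is assumed, so
    # repeated calls yield different candidate instructions.
    candidates = [
        llm(
            "I gave a friend an instruction. Following it, they produced "
            f"these input-output pairs:\n{demos}\nThe instruction was:"
        )
        for _ in range(n_candidates)
    ]

    # Step 2: keep the candidate prompt that performs best on the same pairs.
    return max(candidates, key=lambda prompt: accuracy(prompt, pairs, llm))

def accuracy(prompt, pairs, llm):
    # Fraction of pairs for which the prompted model reproduces the expected output.
    hits = sum(llm(f"{prompt}\nInput: {x}\nOutput:").strip() == y for x, y in pairs)
    return hits / len(pairs)
```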
In one experiment, the researchers focused on CoT optimization to see if their tool could generate a better-worded prompt than humans. Using an LLM called InstructGPT to solve a set of arithmetic questions step by step, the best-performing human-devised prompt included the well-known phrase ‘Let’s think step by step’ in the input, which has been shown to dramatically improve a model’s reasoning abilities. Muresanu and his colleagues then used APE with the same dataset of math questions to see if it could come up with an even better CoT message to append to prompts.
The tool came up with the prompt ‘Let’s think step by step to be sure we have the right answer’, which was able to improve on the best human-generated CoT prompt by 3%. The team has now made APE freely available online, so anyone can use it to produce a prompt that is likely to boost results on a task.
“Given just a small set of input-output pairs, you can generate a prompt which is as good as, or better than, teams of humans who are actively trying to find the best prompt,” said Muresanu. “(Using our tool) is a very cheap and easy way to improve the performance of a model.”
Muresanu thinks the wording of the prompt is interesting because it shows that a model may introduce some errors if it is not explicitly told to generate a correct answer. He suspects it is because models are trained on data produced by humans, which inadvertently contains some mistakes. If they are trying to replicate what they have learned, it would mean occasionally including errors.
Automated prompt engineering may not always be more effective than human-crafted efforts, though. In a paper posted to arXiv last year, Ilia Shumailov, a junior research fellow at the University of Oxford in the U.K., and his colleagues set out to examine whether AI-based methods could consistently outperform manual prompting. They compared how an LLM called RoBERTa performed on six different tasks when using two different automated prompting tools and a human-engineered method, varying the number of examples in each case from 8 to 1,000.
Shumailov and his colleagues were surprised to find that the prompts written by humans actually produced better results on most tasks compared to those generated by the automated tools they tested. “It suggests that our models were performing rather well even for less-optimized prompts,” said Shumailov.
However, although these results provide insights into how other LLMs might perform with the different prompting methods tested, they may not always be generalizable. If a new model is used, Shumailov and his colleagues recommend a complete reassessment.
“It is important to redesign the prompt by rerunning the automatic prompting method,” said Shumailov. “(This) inherently comes with an increased cost, especially given the rate at which new models come out.”
The team also found that manual prompt engineering had a more consistent effect on a model’s results regardless of the amount of test data available. The effect of AI-generated instructions on the accuracy of results was more variable; they tended to result in an improvement when a lot of examples were used, but a model would also occasionally fail catastrophically on some tasks. The researchers think human-designed prompts are more robust since they are based on common knowledge from life experience, whereas automated prompts come about more randomly.
Shumailov and his colleagues suggest using LLM-based methods to optimize human-composed prompts. Shumailov thinks automated tools that use a CoT style, for example by decomposing a person’s query into multiple prompts, are promising and will begin to replace the formulation of prompts by humans.
“This approach would necessitate less human oversight and enhance the general usability of LLMs,” he said. “Nonetheless, the evaluation of these techniques must be meticulously crafted.”
However, whether prompt engineering tools will be used in the long run is up for debate. Muresanu said that at the moment, some type of prompt engineering is usually beneficial only because current generative AI systems are imperfect. Eventually, that may not be the case.
“A perfect AI should be able to follow (an instruction) no matter how it is phrased, the same way you or I can,” he said. “So once that happens, prompt engineering shouldn’t even be required at all.”
Further Reading
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, arXiv, 2023. https://arxiv.org/abs/2201.11903
- Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q.V., Zhou, D., and Chen, X. Large Language Models as Optimizers, arXiv, 2024. https://arxiv.org/abs/2309.03409
- Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T., Moazam, H., Miller, H., Zaharia, M., and Potts, C. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, arXiv, 2023. https://arxiv.org/abs/2310.03714
- Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., and Zou, J. TextGrad: Automatic ‘Differentiation’ via Text, arXiv, 2024. https://arxiv.org/abs/2406.07496
- Battle, R. and Gollapudi, T. The Unreasonable Effectiveness of Eccentric Automatic Prompts, arXiv, 2024. https://arxiv.org/abs/2402.10949
- Zhou, Y., Muresanu, A.I., Han, Z., Paster, K., Pitis, S., Chan, H., and Ba, J. Large Language Models are Human-Level Prompt Engineers, arXiv, 2023. https://arxiv.org/abs/2211.01910
- Zhou, Y., Zhao, Y., Shumailov, I., Mullins, R., and Gal, Y. Revisiting Automated Prompting: Are We Actually Doing Better?, arXiv, 2023. https://arxiv.org/abs/2304.0360