As a computer scientist with one foot in artificial intelligence (AI) research and the other in human-computer interaction (HCI) research, I have become increasingly concerned that prompting has transitioned from what was essentially a test and debugging interface for machine-learning (ML) engineers into the de facto interaction paradigm for end users of large language models (LLMs) and their multimodal generative AI counterparts. It is my professional opinion that prompting is a poor user interface for generative AI systems, one that should be phased out as quickly as possible.
My concerns about prompting are twofold. First, prompt-based interfaces are confusing and non-optimal for end users (and ought not to be conflated with true natural-language interactions). Second, prompt-based interfaces are also risky for AI experts—we risk building a body of apps and research atop a shaky foundation of prompt engineering. I will discuss each of these issues in turn, below.
Limitations of Prompting as an End-User Interface
Prompting is not the same as natural language. When people converse with each other, they work together to communicate, forming mental models of a conversation partner’s communicative intent based not only on words but also on paralinguistic and other contextual cues, theory-of-mind abilities, and by requesting clarification as needed.4 By contrast, while some prompts resemble natural language, many of the most “successful” prompts do not—for instance, image generation is a domain where arcane prompts tend to produce better results than those in plain language.1 Further, prompts are surprisingly sensitive to variations in wording, spelling, and punctuation in ways that lead to substantial changes in model outputs, whereas these same permutations would be unlikely to impact human interpretation of intent—for example, jailbreak prompts using suffix attacks10 or word-repetition commands.7
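To make this sensitivity concrete, consider the following minimal sketch (my own illustration, not drawn from the cited work) of how one might probe it: apply surface perturbations that a human reader would shrug off, then count how many distinct outputs result. The `query_model` callable is a hypothetical stand-in for whatever LLM API is under study.

```python
# Minimal sketch of a prompt-sensitivity probe. `query_model` is a
# hypothetical stand-in for an LLM API call (prompt in, text out).
from typing import Callable

def surface_variants(prompt: str) -> list[str]:
    """Perturbations a human reader would treat as equivalent."""
    return [
        prompt,                      # original wording
        prompt.rstrip(".") + "!",    # punctuation change
        prompt.lower(),              # casing change
        prompt.replace(" ", "  "),   # whitespace change
        prompt + " Please.",         # politeness marker
    ]

def count_distinct_outputs(prompt: str,
                           query_model: Callable[[str], str]) -> int:
    outputs = [query_model(v) for v in surface_variants(prompt)]
    return len(set(outputs))
```

A human interlocutor would yield the equivalent of a count of one; counts above one quantify exactly the fragility described above.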
Indeed, the subtle differences between prompting and true natural-language interaction lead to confusion for typical end users of AI systems9 and result in the need for specially trained “prompt engineers,” as well as prompt marketplaces such as PromptBase, where customers can pay to copy prompts that purport to achieve a given result (I say “purport” because the stochastic nature of generative AI models means the same input may not reliably yield the same output, an issue further exacerbated by frequent updates to the underlying models). As further evidence of the challenges many end users face in crafting prompts, systems such as DALL-E 3 and Gemini sometimes rewrite users’ submitted prompts; that is, they perform behind-the-scenes AI-assisted prompt engineering that may or may not be transparent to, or controllable by, the end user.
A few years from now, I expect we will look back on prompt-based interfaces to generative AI models as a fad of the early 2020s: a flash in the pan on the evolution toward more natural interactions with increasingly powerful AI systems. Indeed, true natural-language interfaces may be one desirable way to interact with such systems, since they require no learning curve and are extremely expressive. Other high-bandwidth “natural” interfaces to AI systems might include gesture interfaces, affective interfaces (that is, mediated by emotional states), direct-manipulation interfaces (that is, directly manipulating content on a screen, in mixed reality, or in the physical world), non-invasive brain-computer interfaces (that is, thought-based interactions), or multimodal combinations of all of these.
Alternatively, there may be situations where free-form “natural” interactions are non-optimal. For example, the open-ended nature of true natural language or other similarly expressive interactions may in practice create barriers to interaction; such paradigms do not help novice end users understand the affordances of a system. Constraint-based graphical user interfaces (for example, menus, templates) might be more suitable for some user groups or application scenarios by revealing affordances, scaffolding user knowledge of available interactions, and supporting recognition over recall. It is also worth considering interaction designs that shift more of the burden for interaction onto the system rather than the end user, such as implicit interactions that infer the user’s intent from contextual clues and mixed-initiative systems that are more proactive than today’s chatbots about eliciting users’ preferences.4 Modality influences “naturalness” as well; for instance, sketching or other direct-manipulation interactions2 might be faster and more intuitive for generative image creation and editing than text-based prompting.
Ultimately, the goal of any AI interface is to allow the user to express their intent and to know the system understood their meaning and will carry out their intent in a safe and correct fashion. Careful consideration of appropriate human-AI interaction paradigms is an important component of a multifaceted approach to AI safety, particularly as models progress in capability.6
There is an urgent need for the fields of AI and HCI to combine their skills not only in developing improved interfaces for status quo systems, but in developing strategic programs of research on user experience for frontier models. It is also vital to innovate in educational programs that will train a new generation of computing professionals fluent in the methods and values of both fields. We should not be complacent and assume that artificial general intelligence (AGI) will obviate the need to consider user interfaces (since such hypothetical, powerful systems would by definition understand all inputs perfectly). Progress toward AGI is a journey, not a single endpoint6—investment in user experience along the path to AGI will improve the utility of status quo and near-term systems while also improving alignment for more speculative future models.
Limitations of Prompting as an Expert Interface for ML Researchers and Engineers
I fear that we are in the midst of a “replication crisis” in AI research. Psychology and related social sciences have been experiencing a crisis in which a substantial number of published results do not replicate, often due to p-hacking to obtain statistically significant findings.8 I am increasingly concerned that a non-negligible portion of recent AI research findings may not stand the test of time due to a different phenomenon, which I call prompt-hacking.
Much like the p-hacking crisis in the social sciences, prompt-hacking does not imply nefarious intent or active wrongdoing on the part of a researcher. Indeed, researchers may be entirely unaware they are engaging in this behavior. Prompt-hacking might include any of the following research practices:
Carefully crafting dozens or even hundreds of prompts (manually, programmatically, or via generative AI tools) to obtain a desired result, but not reporting how many of the attempted prompts failed to produce that result, or whether the prompt(s) that succeeded had any properties that systematically differentiated them from those that failed.
Not checking whether slight variations in a successful prompt alter the research results.
Not checking whether a prompt is robust across multiple models, multiple generations of the same model, or even the same model when repeated several times (a minimal version of such a check is sketched below).
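As one illustration, the sketch below (assuming a hypothetical `query_model(model_name, prompt)` API) shows the kind of robustness check a methods section could report: the same prompt run against several models with repeated trials, summarizing per-model agreement rather than presenting a single favorable output.

```python
# Minimal robustness check. `query_model(model_name, prompt)` is a
# hypothetical stand-in for calling a named model and returning text.
from collections import Counter
from typing import Callable

def robustness_check(prompt: str,
                     models: list[str],
                     trials: int,
                     query_model: Callable[[str, str], str]) -> dict[str, float]:
    stability = {}
    for model in models:
        outputs = [query_model(model, prompt) for _ in range(trials)]
        modal_count = Counter(outputs).most_common(1)[0][1]
        # 1.0 means every trial agreed; lower values flag fragile
        # results that a paper should report rather than discard.
        stability[model] = modal_count / trials
    return stability
```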
How can we prevent prompt-hacking and an associated AI research replication crisis? In addition to investing in developing the next generation of interfaces for generative and general AI systems, conference and journal committees could set clear standards for reporting exact prompts used, the method by which they were generated, and any prompts that were tried and discarded (and an explanation of why) as part of the methods sections of research papers. We can also seek to replicate key results to understand whether prompt-hacking is prevalent in the AI research community, and to what extent. Perhaps we even need a system for “prompt pre-registration,” analogous to the pre-registration of hypotheses that is the standard for quality social science research.
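By way of illustration only, here is one possible shape for such a pre-registration record; the field names are my own hypothetical suggestion, not an established schema. The point is that the exact prompts, their provenance, and all discarded attempts are committed before any results are collected.

```python
# Illustrative shape for a "prompt pre-registration" record; the
# fields are hypothetical, not an established standard.
from dataclasses import dataclass, field

@dataclass
class PromptRegistration:
    study_id: str
    model_and_version: str           # e.g., model name plus snapshot date
    generation_method: str           # manual, programmatic, or AI-assisted
    registered_prompts: list[str]    # exact strings, frozen before evaluation
    discarded_prompts: list[str] = field(default_factory=list)
    discard_rationale: str = ""      # why discarded prompts were set aside
```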
The fact that variations in prompting that would be irrelevant to a human interlocutor (for example, swapping synonyms, minor re-phrasings, changes in spacing, punctuation, or spelling) result in major changes in model behavior should give us all pause,3 and serve as a further reminder that prompts are still quite far from being a natural-language interface. Even research that does not engage in prompt-hacking still rests on a shaky foundation: the sensitivity of models to the precise form of their prompts.
In addition to a replication crisis, another risk of current prompting approaches lies in our methods for evaluating models. A critique of status quo evaluation of frontier models is that, while models are ostensibly tested on the same set of benchmarks, in practice these metrics may not be comparable due to variations in how each organization operationalizes the benchmarking (that is, the format of the prompts used to present the test items to the model).5 This is cause for concern: Accurate measurement is key to responsibly and safely monitoring our progress toward advanced AI capabilities.6
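To illustrate with a made-up item, the two templates below present the same multiple-choice question in formats that different organizations might plausibly use; an accuracy figure measured under one format is not automatically comparable to one measured under the other.

```python
# Two hypothetical operationalizations of the same benchmark item.
question = "Which planet in our solar system is the largest?"
choices = ["Mercury", "Jupiter", "Mars", "Venus"]

# Format A: lettered options; the model must emit a single letter.
format_a = (question + "\n"
            + "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
            + "\nAnswer with a single letter:")

# Format B: inline options; the model must emit the option text.
format_b = (f"Question: {question}\n"
            f"Options: {', '.join(choices)}\n"
            "Respond with the exact text of the correct option.")
```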
In sum, we are at a critical juncture in AI research and development. However, our acceptance of prompting as a “good enough” simulacrum of a natural interface is hindering progress. Moving beyond prompting is vital for successful end-user adoption of AI. As systems graduate from labs to the open world, improvements in human-AI interaction paradigms are central to ensuring that AI is useful, usable, and safe. Further, moving beyond prompting (or at least openly acknowledging and compensating for its shortfalls) is vital even for experts such as AI developers and researchers, to ensure that we can trust the results of our research and evaluations, and that future systems are built upon a sturdy foundation of trustworthy knowledge.