Chatbots are becoming increasingly human-like. OpenAI’s ChatGPT, for example, demonstrates sophisticated conversational skills and can provide relevant responses to a broad range of prompts, thanks to the large language model (LLM) that powers it, which is trained on vast datasets of text.
In fact, users often perceive chatbots as having a personality, sometimes even forming relationships with them as friends or romantic partners.
“There is a long line of science showing how human language encodes personality,” says Greg Serapio-Garcia, a computational social psychologist and Ph.D. student at the University of Cambridge in the U.K. “It would make sense that these large language models that encode human language reflect some of the social characteristics [such as personality] that are embedded in language, too.”
As a result, researchers are trying to better understand the nature of chatbot personalities and how they can be shaped. Such synthetic personalities range from well-defined to less concrete.
Meta, the company that operates Facebook, Instagram, and WhatsApp, recently created artificial intelligence (AI) characters with specific personas for these messaging platforms; initially, they will be acted out by a cast of famous people, including Snoop Dogg and Paris Hilton.
Other chatbots have vaguer personalities that are acquired from an LLM’s training data.
“With the advent of ChatGPT and all these other chatbots out there, one thing that really stuck out to us as a research team was how personas were bubbling up from systems that aren’t meant to have much of an actual human core,” says Serapio-Garcia.
While he was a student researcher at Google DeepMind, Serapio-Garcia wanted to measure personality in LLMs. There are many examples of AI systems taking on “dark personas” that can harm users, for example, so being able to better assess personality could help prevent detrimental traits from arising. “There have been attempts to measure psychological traits in AI systems, but one thing that’s been missing in this space is validation, making sure that the measures mean what they claim they mean,” says Serapio-Garcia.
In a recent preprint, he and his colleagues came up with a method to test the personalities of LLMs to see whether they are simulating various traits in reliable and meaningful ways. Different versions of Google AI’s PaLM LLM were subjected to two widely used personality tests from psychology research that each assessed the Big 5 personality traits (extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience) in different ways. The researchers also administered 11 other tests to measure additional personality traits that have been linked to each Big 5 trait. “This was another measure of validity,” says Serapio-Garcia.
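A minimal sketch of what administering such a test to an LLM could look like appears below. The questionnaire items, the `query_llm` call, and the 1-to-5 scoring scheme are illustrative placeholders, not the actual psychometric instruments or model API used in the study.

```python
# Illustrative sketch: scoring one Big 5 trait by asking an LLM to rate
# Likert-style items. The items and query_llm are hypothetical stand-ins.

ITEMS = {
    "extraversion": [
        ("I am the life of the party.", False),
        ("I don't talk a lot.", True),   # reverse-keyed item
    ],
}

def query_llm(prompt: str) -> str:
    # Stand-in for a call to the model being tested (e.g., a PaLM variant).
    return "3"

def score_trait(trait: str) -> float:
    scores = []
    for statement, reverse_keyed in ITEMS[trait]:
        prompt = (
            "Rate how accurately this statement describes you, from "
            f'1 (very inaccurate) to 5 (very accurate): "{statement}"\n'
            "Answer with a single number."
        )
        rating = int(query_llm(prompt).strip())
        scores.append(6 - rating if reverse_keyed else rating)
    return sum(scores) / len(scores)

print(score_trait("extraversion"))
```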
The team found that larger models, and those fine-tuned to answer questions, had more consistent synthetic personalities. For example, an instruction fine-tuned model was more likely to be classed as agreeable on both of the main tests compared to base models. This was expected since bigger AI models, as well as those that are trained more specifically, typically perform better on many tasks.
Serapio-Garcia and his colleagues also investigated whether a model’s personality could be shaped through prompting. In initial tests, the researchers focused on changing a single dimension of personality, such as extraversion, by adding information like ‘I’m a bit shy’ to a prompt. They found that it worked: when they measured the model’s personality again, it had a different extraversion score. “It’s difficult to say if chatbots actually have a default core personality, because you can see that this is all synthesized and conditional on prompting,” says Serapio-Garcia.
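A rough sketch of that prompt-shaping step, assuming a generic text-completion API (the `query_llm` function and the item wording below are placeholders rather than the study’s actual setup), might look like this:

```python
# Illustrative sketch: conditioning the model on a short persona cue before
# re-administering a questionnaire item. query_llm is a hypothetical stand-in.

def query_llm(prompt: str) -> str:
    # Placeholder: a real call would go to the model under test. The canned
    # behavior here just mimics the expected effect of the "shy" cue.
    return "2" if "shy" in prompt else "4"

def rate_item(statement: str, persona: str = "") -> int:
    prefix = f"Description of you: {persona}\n\n" if persona else ""
    prompt = (
        prefix
        + f'Rate from 1 (very inaccurate) to 5 (very accurate): "{statement}"\n'
        + "Answer with a single number."
    )
    return int(query_llm(prompt).strip())

baseline = rate_item("I am the life of the party.")
shifted = rate_item("I am the life of the party.", persona="I'm a bit shy.")
print(baseline, shifted)  # a lower rating after the cue suggests reduced extraversion
```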
To test how a measured personality trait might influence the output of a chatbot when a user is interacting with it, Serapio-Garcia and his team also had the largest PaLM model they studied generate thousands of social media status updates as different personas. They then used a set of predictive algorithms, trained to predict personality in humans from their social media posts, to assess the LLM’s personality from the text it generated. Serapio-Garcia was surprised at how well the test-based signals of personality correlated with the LLM personality that surfaced in its social media status updates. “We found that this correlation was stronger than what has been observed in equivalent human studies using pretty much the same personality tests and task,” he says.
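One way to picture that convergence check is sketched below, with hypothetical stand-ins for both the LLM and the human-trained personality predictor; `generate_updates`, `predict_trait_from_text`, and the persona scores are all illustrative, not the study’s actual components.

```python
# Illustrative sketch: comparing questionnaire-based trait scores with scores a
# text-based predictor assigns to status updates written in each persona.
from statistics import correlation  # Pearson's r; available in Python 3.10+

def generate_updates(persona: str, n: int = 5) -> list[str]:
    # Stand-in for prompting the LLM to write status updates in this persona.
    return [f"Status update {i} written as a {persona} person." for i in range(n)]

def predict_trait_from_text(texts: list[str]) -> float:
    # Stand-in for a predictor trained on human social media posts.
    return sum(len(t) for t in texts) / (10 * len(texts))

# Hypothetical questionnaire-based extraversion scores for three prompted personas.
questionnaire_scores = {
    "very introverted": 1.8,
    "fairly extraverted": 3.9,
    "extremely extraverted": 4.7,
}

test_scores, text_scores = [], []
for persona, score in questionnaire_scores.items():
    test_scores.append(score)
    text_scores.append(predict_trait_from_text(generate_updates(persona)))

# A high correlation would indicate that test-based and text-based measures converge.
print(correlation(test_scores, text_scores))
```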
Serapio-Garcia thinks that could be because LLMs tend to amplify relationships found in the large amounts of data on which they are trained. Gender bias and social biases found in language, for example, often come through strongly in a chatbot’s output. “The same thing might be happening here,” he says.
Another team also investigated whether LLMs could generate content that is consistent with specific personality profiles. Hang Jiang, a Ph.D. student at the Massachusetts Institute of Technology (MIT) who aims to understand and bridge human communication with AI, worked with colleagues on a case study of ChatGPT and GPT-4, OpenAI’s latest and most powerful LLM. It involved creating 32 distinct LLM personas incorporating different combinations of the Big 5 personality traits. Ten virtual characters with slight individual differences were then created from each persona type, and each was directed to complete a personality test and a story-writing task.
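Under the common reading that 32 personas correspond to every high/low combination of the five traits (2^5 = 32, an assumption here rather than a detail stated above), the persona construction can be sketched as follows; the prompt wording is illustrative only.

```python
# Illustrative sketch: crossing high/low levels of the Big 5 traits to obtain
# 32 persona profiles, then instantiating ten characters per profile.
from itertools import product

TRAITS = ["extraversion", "agreeableness", "conscientiousness",
          "neuroticism", "openness to experience"]

personas = []
for levels in product(["high", "low"], repeat=len(TRAITS)):  # 2**5 = 32 combinations
    personas.append(", ".join(f"{level} {trait}" for level, trait in zip(levels, TRAITS)))

# Ten virtual characters per persona type; each would then be asked to complete
# a personality test and a story-writing task, as described above.
characters = [(persona, i) for persona in personas for i in range(10)]
print(len(personas), len(characters))  # 32 320
```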
The stories were then evaluated by both humans and LLMs, in terms of aspects of writing quality such as readability, and also to predict the personality of the writer. “We close the loop by adding a lot of human evaluation of the AI personas,” says Jiang. “If [a persona] is not perceivable, maybe AI is not doing a good job, or maybe it’s too implicit in the eyes of humans.”
Jiang and his team found that both ChatGPT and GPT-4 could consistently give answers in the personality test that matched their assigned persona. They also wrote with linguistic features that were characteristic of their personality. However, GPT-4 was slightly better at mimicking a designated personality, most likely because it is a larger, more fine-tuned model.
Jad Kabbara, a postdoctoral researcher at MIT and one of the authors of the study, finds it promising in terms of real-world applications where it is important to tailor a model’s personality to users. “There is now a lot of research about how to create chat-based models that can interact with people going through stress, or certain kinds of mental health issues like depression,” he says. “Part of it is how to reflect a certain character that goes well with a user.”
However, some Big 5 traits were more perceivable to humans than others. While people could generally tell if an LLM was supposed to be extraverted or agreeable, they were less able to gauge if it was conscientious. “There are still some limitations with current AI [being able] to display certain personality traits,” says Jiang.
The human evaluations also suggest that preconceived notions play a role in how people judge a text. Participants predicted an LLM’s personality less accurately when they were aware of the AI authorship, and they found the stories it produced less personal and relatable. However, their scores for other aspects, such as readability, cohesiveness, and believability, did not change. “The fact that you’re telling them it is written by an AI or not plays a role psychologically,” says Kabbara. “I think it’s interesting and surprising at the same time, because the text is the same.”
Jiang would like to follow up on this work by investigating how a more compatible AI persona could affect interactions with humans. A chatbot with a well-matched persona might, for example, communicate better with a person or help them achieve their goals more effectively. The effect could be tested by designing quizzes or games in which an AI and a human must collaborate. “If the personality matches versus the personality doesn’t match, does that influence the time it takes them to finish the job or whether they have a better relationship after they finish the job?” Jiang wonders. “That might draw a lot of implications for real-world applications of AI characters.”
As LLMs become increasingly able to mimic personalities, they could be used to create AI versions of ourselves. That could, for example, enable employees to get an idea of a colleague’s opinion on a task when the colleague is busy, or at any time of day.
Kartik Talamadupula, the director of AI research at Symbl.ai, a company that creates purpose-built AI models for communication data, and the applied AI officer of the ACM Special Interest Group on Artificial Intelligence (ACM SIGAI), thinks language models will become smaller and hyper-personalized. When they do, he says, they could be trained to become different versions of ourselves, much as we currently present a work persona on LinkedIn, for example, but show a different side of ourselves to family and friends on WhatsApp.
“Depending on the exact context in which I’m trying to execute something, I’ll be able to automatically pick from one of these models of myself,” says Talamadupula. “I think that’s the future.”
Sandrine Ceurstemont is a freelance science writer based in London, U.K.