
Teaching Robots Manners

Roboticists have been attempting to program "expressive behaviors" into social robots for decades.

Illustration: A humanoid robot wearing a tuxedo extends a welcoming hand in a spacious room. Credit: Shutterstock

Robots may be getting ever more dexterous and mobile, as witnessed by the humanoids that Figure AI, Tesla, Agility Robotics, and others are developing. Yet the way robots deal with people in shared spaces and cooperative tasks tends to be characterized by brusque, offhand behaviors that leave a lot to be desired. That matters when human buy-in is needed to make a robot's task run smoothly.

At issue here is the fact that it does not take much for a robot to offend people. Simply failing to hold a door open for someone, to acknowledge somebody's presence by making eye contact, or to say "excuse me" while squeezing past somebody in a corridor can leave people feeling deflated at a robot's dead-eyed indifference.

At the joint ACM/IEEE Human-Robot Interaction (HRI ’24) conference in Boulder, CO, in March, a surprising answer to this behavioral problem was revealed: large language models (LLMs), researchers at Google DeepMind in Silicon Valley told delegates, should be able to provide the “rich social context” that allows robots to express themselves with appropriate human-friendly behaviors. As a corollary of that, they said, people will be more accepting of the robot, making its mission all the more effective.

These “expressive behaviors,” as DeepMind called them, are something that roboticists have actually been attempting to program into social robots for decades. Those attempts have largely failed, either because they tried to use exhaustive rules-based methods to predict and code up templates for every type of human-robot interaction, or because they tried to implement application-specific datasets for every kind of social situation the robot could be expected to encounter.

Neither approach has worked well, said Fei Xia, senior research scientist at DeepMind's Mountain View, CA, lab. Previously, he said, "Robot behaviors weren't exactly trained; rather, they were determined by predefined rules from professional animators or specialized datasets. These methods were extremely limiting, as the rules and data used to program robot behaviors couldn't be effectively transferred across environments and therefore required significant manual effort for each new environment."

So if a robot entered a new type of social situation, such as a noisy, crowded room where verbal communication did not work, the code could not generalize and "scale" to the new situation, and new code had to be written. LLMs, however, trained on vast amounts of human knowledge from the Internet, hold the potential to work around that problem. DeepMind's idea is to "leverage" the social context available from large language models, using it not only to generate appropriately expressive robot behaviors, but also to make those behaviors adaptive to new conditions.

Said Xia, “Our approach, called Generative Expressive Motion, or GenEM, uses language models to translate high-level human instructions, such as ‘nod your head to acknowledge a person,’ into the corresponding robot actions. To do this, we provide the LLM with a small number of representative examples of instructions, social reasonings, and the corresponding robot code snippets.

“From these training inputs, the LLM can then generalize across new instructions to produce robot control codes guided by how a human would act on that instruction in our training set. So, with LLMs, we’re able to actually train robots so they can adapt to new contexts and environments.”
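The few-shot pattern Xia describes can be sketched as follows. This is a minimal illustration of the prompting idea, not DeepMind's actual GenEM implementation: the example instructions, social reasonings, and `robot.*` primitive names are all hypothetical.

```python
# Hypothetical few-shot examples pairing an instruction with social
# reasoning and a robot code snippet, as described for GenEM.
FEW_SHOT_EXAMPLES = [
    {
        "instruction": "Nod your head to acknowledge the person.",
        "reasoning": "A nod is a brief downward-then-upward head tilt "
                     "that signals acknowledgment without speaking.",
        "code": "robot.tilt_head(-15)\nrobot.tilt_head(0)",
    },
    {
        "instruction": "Show that you are confused.",
        "reasoning": "People often tilt their head sideways and pause "
                     "when confused, so a slow pan conveys uncertainty.",
        "code": "robot.pan_head(20)\nrobot.wait(1.0)\nrobot.pan_head(0)",
    },
]

def build_prompt(new_instruction: str) -> str:
    """Assemble a few-shot prompt: instruction -> social reasoning ->
    robot code for each example, then the unseen instruction."""
    parts = ["Translate instructions into robot code via social reasoning.\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Instruction: {ex['instruction']}")
        parts.append(f"Reasoning: {ex['reasoning']}")
        parts.append(f"Code:\n{ex['code']}\n")
    # The LLM is asked to continue from "Reasoning:", producing its own
    # social reasoning and then a code snippet for the new instruction.
    parts.append(f"Instruction: {new_instruction}")
    parts.append("Reasoning:")
    return "\n".join(parts)

prompt = build_prompt("Shake your head to politely decline.")
```

The assembled prompt would then be sent to a language model such as GPT-4; the model's continuation supplies both the reasoning and the control-code snippet.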

Xia added that the approach can adapt to new types of robots, too.

However, as LLMs are well known for ‘hallucinating’ and providing error-ridden output, could LLM-trained robots undertake unsafe or even dangerous moves? DeepMind said they could not, because the LLM’s instructions cannot make a robot do anything that its underlying operating system already prevents. “The generated robot code leverages predefined primitive skills, such as moving the head, so the robot can only perform safe behaviors,” said Xia. “The key aspect is that the social reasoning needed to process instructions is offloaded to the high-capability language model, while the robot code is constrained to safe predefined skills.”  
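One simple way to realize the constraint Xia describes, checking that generated code only invokes predefined primitive skills, is an allow-list over the calls in the generated snippet. The sketch below uses Python's `ast` module; the primitive names are hypothetical and this is an illustration of the principle, not DeepMind's safety layer.

```python
import ast

# Hypothetical predefined primitive skills the robot is allowed to run.
SAFE_PRIMITIVES = {"tilt_head", "pan_head", "set_light_ring", "say", "wait"}

def uses_only_safe_primitives(generated_code: str) -> bool:
    """Return True only if every call in the generated snippet has the
    form robot.<primitive>(...) with <primitive> on the allow-list."""
    tree = ast.parse(generated_code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if not (isinstance(func, ast.Attribute)
                    and isinstance(func.value, ast.Name)
                    and func.value.id == "robot"
                    and func.attr in SAFE_PRIMITIVES):
                return False
    return True

ok = uses_only_safe_primitives("robot.tilt_head(-15)\nrobot.say('Hi')")   # True
bad = uses_only_safe_primitives("robot.drive_at_speed(5.0)")              # False
```

Any snippet that reaches outside the predefined skill set is rejected before it ever runs, which is the sense in which the social reasoning is offloaded to the LLM while execution stays constrained.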

To test whether LLMs could indeed generate expressive behaviors that people both understand and appreciate, the DeepMind researchers exposed paid volunteers to a series of experimental behaviors running on a tall, highly mobile, single-armed, wheeled robot. Capable of speech and other audible utterances, plus the ability to pan and tilt its head, the behavior-rich robot also had a multicolored LED light ring around its face.

In their HRI ’24 paper, DeepMind reported that the LLM used to acquire the social context for their actions—OpenAI’s GPT-4—did indeed let them “quickly produce expressive behaviors,” which “reduces the need for curated datasets to generate specific robot behaviors or carefully crafted rules as in prior work.”

That could deem it a qualified success. However, just because it cannot produce unsafe actions does not mean the robot’s behaviors did not become annoying at times, said Leila Takayama, a human-robot interaction specialist on the DeepMind team. “There have been a couple instances of funky behaviors; for example, this robot likes to show off spinning colors on its light ring, which we jokingly called the ‘party on its face.’ It’s delightful when you see it once, but it gets annoying when it does that too often.

“Because that was a basic behavior that it had in its skill base, it would often revert to the ‘party on its face’ behavior when we didn’t give it persistent feedback on taming its lights,” she said.

This amounts to a promising start, said Kartik Talamadupula, an AI research director and the Applied AI Officer of the ACM Special Interest Group on Artificial Intelligence (SIGAI). "This is a promising area of research that could significantly address the bottleneck of knowledge acquisition for deployed robots in the real world, but it will need a lot more work and evaluation before it is ready to be deployed in real applications."

For its part, DeepMind is in no hurry to deploy. Explained DeepMind research scientist Dorsa Sadigh, who is also an assistant professor of computer science at Stanford University, “For this research, we exclusively leveraged robots developed in house. We don’t have any immediate plans to integrate our work into commercially available robots.

“Right now, this is research that shows how we can leverage knowledge from foundation models and apply it towards robotic settings, enabling more natural ways to interact with robots and making them more helpful in human-centered environments. It’s too early to say how this kind of technology might be used in products and there’s an array of deep technical challenges that must be overcome before robots can safely, autonomously, and proficiently perform complex tasks in everyday environments,” she said.

Beyond expressive behaviors, DeepMind engineering director Carolina Parada gave a much broader talk on the potentially profound impact of LLM foundation models in robotics at HRI ’24, prompting this post on X (formerly Twitter) by social robotics specialist Tony Belpaeme of the University of Ghent, in Belgium:

“Fantastic talk by @carolina_parada from Google Deepmind on using LLMs to control and teach robots. LLMs seem to be the hammer we’ve been looking for in personal robotics.”

Asked what he meant by "the hammer," Belpaeme said, "I was very impressed with Google Deepmind's work. They've been using LLMs for half a dozen different things, all tasks that up till now were beyond the grasp of robots. It rests on two properties of LLMs: first, that LLMs encode 'common sense', something we have been trying to build into AI for many decades but which we have largely failed to achieve because there are just too many properties and interactions that need encoding.

“And second, LLMs speak many languages, not only natural languages but also programming languages. Just think how powerful that is: you give spoken instructions to the robot, and it translates those into computer code.”

What particularly impressed Belpaeme, also a visiting professor at the University of Plymouth in the U.K., was the way DeepMind demonstrated how it had used LLMs to iteratively teach robots new skills (not just the aforementioned expressive behaviors) using plain English prompts.

“For example, a quadruped robot had to learn things such as ‘sit’, ‘fetch’, and ‘high five.’ DeepMind describes ‘sit’ as meaning ‘that you bend your hind legs and stretch your front legs,’ and the robot then translates that into a piece of code called a ‘reward function,’ which gives it a reward the closer it gets to this pose, and then the robot learns to achieve this. But you can also correct the robot as it goes along, saying: ‘No, bend your legs a little more.’ It’s very exciting.”
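The "sit" example Belpaeme describes can be made concrete with a toy reward function: the reward grows as the quadruped's pose approaches bent hind legs and stretched front legs. The joint names, target angles, and scaling below are all hypothetical, a minimal sketch of the language-to-reward idea rather than DeepMind's actual reward code.

```python
import math

def sit_reward(joint_angles: dict) -> float:
    """Reward approaches 1.0 as the quadruped nears a 'sit' pose:
    hind knees bent (~90 degrees), front knees stretched (~0 degrees).
    Angles are in degrees; targets are illustrative."""
    targets = {
        "hind_left_knee": 90.0, "hind_right_knee": 90.0,
        "front_left_knee": 0.0, "front_right_knee": 0.0,
    }
    # Squared-error distance from the target pose, mapped through a
    # smooth exponential so the reward is bounded in (0, 1].
    error = sum((joint_angles[j] - t) ** 2 for j, t in targets.items())
    return math.exp(-error / 1000.0)

standing = {"hind_left_knee": 10, "hind_right_knee": 10,
            "front_left_knee": 10, "front_right_knee": 10}
sitting  = {"hind_left_knee": 85, "hind_right_knee": 88,
            "front_left_knee": 5, "front_right_knee": 3}
```

A correction such as "bend your legs a little more" would, in this framing, amount to the LLM adjusting the target angles and regenerating the reward function, after which the robot resumes learning against the updated objective.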

Paul Marks is a technology journalist, writer, and editor based in London, U.K.
