Making Conversation a Robot’s Command

Combining LLMs with Vision-Language Models to help make “conversation the command.”

Illustration: a robot holding a hand to its ear.

By combining large language models (LLMs) with vision-language models (VLMs, which label images), autonomous robots can accomplish their tasks by conversing with humans in natural language. This makes “conversation the command,” instead of requiring operators to issue explicit commands in a robot’s own structured language, according to research by professor Elmar Rueckert and researcher Linus Nwankwo of the Chair of Cyber-Physical Systems at Austria’s University of Leoben (Montanuniversität Leoben).

Said Kartik Talamadupula, director of AI Research at AI solution provider Symbl.ai (which was not involved in this research), “Recent advances in large language models and vision language models can enable new progress on one of the fundamental problems of robotics: resolving the gap between high-level specifications of behavior and goals, and low-level implementations of robotic platforms and systems. Used with appropriate guardrails and sufficient context, this can significantly scale up the real world applicability of robotic applications.”

Many command sets for autonomous robots have been created over the years, but most are limited to structured commands and responses in specific work environments. The advantage of large language models and vision-language models is that they are trained on vast collections of possible language statements (even slang, in the case of some LLMs) and of visual scenes paired with text (in the case of VLMs), opening the possibility of widening structured “commands” into natural language “conversations” between a human and an LLM embodied in a robot.

Nwankwo said LLMs “have the capability to interpret natural language commands issued by humans and convert them into structured commands that robots can understand and execute directly (using the direct-text/speech-to-action method). However, left to themselves, LLMs can sometimes ‘hallucinate’—that is, generate inconsistent data. If such happens, the LLM could introduce stochastic behavior or randomness into the robot’s actions.”
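
To make the direct-text/speech-to-action idea concrete, the sketch below shows how an LLM might be prompted to turn a typed request into a structured command, and how a malformed or hallucinated reply can be caught before it reaches a robot. It is a minimal illustration only; the JSON schema, prompt, model name, and use of a chat-style API are assumptions, not the researchers’ implementation.

```python
# Minimal sketch (not the authors' code): prompt an LLM to convert a natural
# language request into a structured command, and treat unparsable replies as
# "no command" so hallucinated output never reaches the robot.
import json
from openai import OpenAI  # assumes a chat-style LLM API; placeholder choice

client = OpenAI()

SYSTEM_PROMPT = (
    'Convert the user request into JSON with the fields "action" '
    '(one of: navigate, describe, stop) and "target" (a place or object). '
    'If the request is not a robot command, answer {"action": "none"}.'
)

def extract_intent(user_text: str) -> dict:
    """Ask the LLM for a structured command and parse its JSON reply."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": user_text}],
    )
    try:
        return json.loads(reply.choices[0].message.content)
    except json.JSONDecodeError:
        return {"action": "none"}  # malformed output is simply ignored

print(extract_intent("Could you head over to the secretary's office?"))
```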

Rather than relying completely on the LLMs to plan and execute commands directly on the robot, thus risking random behaviors, Nwankwo said “we programmed a robot execution mechanism [REM] that employs a robot operating system [ROS] navigation planner to translate the intents extracted from the LLMs into actionable tasks for physical execution by the robot. Thus, instead of the human learning a structured command set for a robot trained for a specific visual environment, using LLMs integrated into VLMs presents the possibility of carrying on conversations with robots—which, in turn, carry out the human’s commands without the need for the human to learn the idiosyncrasies of a robot’s structured command set.”

To demonstrate this possibility, the University of Leoben researchers presented their “conversation is the command” system in a paper at the ACM/IEEE International Conference on Human-Robot Interaction (HRI ’24) in March, and have made their code and resources publicly available on GitHub so interested parties can reproduce their results.

The entire software system the researchers programmed around the pre-trained LLM and VLM was called TCC-IRoNL (The Conversation is the Command: Interacting with Real-world autonomous Robots through Natural Language). TCC-IRoNL accomplished its goal by first integrating various readily available pre-trained LLMs (the demonstration used OpenAI’s second-generation Generative Pre-trained Transformer, GPT-2, in English). The researchers also used OpenAI’s VLM, named Contrastive Language-Image Pretraining (CLIP), which was trained on over 400 million image-text pairs.
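
To illustrate how a pre-trained VLM such as CLIP labels what a robot sees, the sketch below scores a single camera frame against a handful of candidate labels using the Hugging Face transformers interface to the openai/clip-vit-base-patch32 checkpoint. The label list and image file are placeholders; this shows CLIP’s general zero-shot labeling, not code from the TCC-IRoNL repository.

```python
# Sketch of zero-shot image labeling with CLIP (via Hugging Face transformers).
# Candidate labels and the image path are placeholders for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["an office chair", "a desk", "a person", "a doorway"]
image = Image.open("camera_frame.jpg")  # e.g., one frame from the robot's camera

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability means CLIP judges that label a better match for the frame.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.2f}")
```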

In more detail, the researchers programmed the various subsystems of TCC-IRoNL, including a ChatGUI (graphical user interface) that accepted typed conversations and submitted them to the LLM. The LLM extracted the structured commands implicit in the user’s “conversation” and generated the robot’s “conversational responses,” resulting in a natural language dialog between human and machine: conversation as the command.

The software robot execution mechanism (REM) used the pre-trained LLM and VLM to turn natural language “conversations” into actions on the open-source robot ROMR (Robot-operating-system-based Open-source Mobile Robot). To keep the errors inevitable in any LLM’s output stream from becoming robotic command errors, TCC-IRoNL used the LLM solely as a linguistic decoder for its REM, which used a traditional navigation planner to issue ROS commands to the robot. (The REM also allowed nonsense commands from the LLM to be ignored, rather than passed on to the ROS navigation planner.)
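
One way to picture the REM’s guarding role is as a thin validation layer between the LLM and the navigation stack. The snippet below is an illustration under that reading, with invented action and place names; only intents the robot can actually execute are forwarded, and everything else is dropped.

```python
# Illustrative guard (not the authors' code): forward only intents that match
# known actions and known map locations; discard anything else from the LLM.
KNOWN_ACTIONS = {"navigate", "describe", "stop"}
KNOWN_PLACES = {"secretary's office", "lab", "charging station"}  # placeholder labels

def is_executable(intent: dict) -> bool:
    """Return True only for commands the robot is actually able to carry out."""
    if intent.get("action") not in KNOWN_ACTIONS:
        return False  # nonsense or hallucinated output is ignored
    if intent.get("action") == "navigate" and intent.get("target") not in KNOWN_PLACES:
        return False
    return True
```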

Explained Nwankwo, “The ROS navigation planner—which included mapping, path planning, and obstacle avoidance algorithms—takes the output commands from the LLM and turns them into custom movement or navigational commands to this specific robot. For example, if the human instruction is a description of a goal or destination, such as ‘navigate to the secretary’s office,’ our framework translates this into precise goal coordinates within the robot’s operational environment via a mapping process that correlates the VLM’s descriptive label of the destination with its corresponding spatial coordinates.”
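
That label-to-coordinate step can be pictured as a lookup table feeding a standard ROS navigation goal. The sketch below uses the ROS 1 move_base action interface; the place names and coordinates are invented for illustration, and the researchers’ actual planner configuration lives in their repository.

```python
#!/usr/bin/env python
# Sketch: map a destination label to coordinates and hand the goal to the
# ROS 1 navigation stack (move_base). Place names and coordinates are invented.
import actionlib
import rospy
from actionlib_msgs.msg import GoalStatus
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

# Hypothetical mapping from descriptive labels to (x, y) positions on the map.
PLACE_COORDS = {"secretary's office": (4.2, -1.5), "lab": (0.8, 3.0)}

def navigate_to(place: str) -> bool:
    x, y = PLACE_COORDS[place]
    client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
    client.wait_for_server()

    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = "map"
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = x
    goal.target_pose.pose.position.y = y
    goal.target_pose.pose.orientation.w = 1.0  # no particular final heading requested

    client.send_goal(goal)       # path planning and obstacle avoidance are
    client.wait_for_result()     # handled by the navigation stack itself
    return client.get_state() == GoalStatus.SUCCEEDED

if __name__ == "__main__":
    rospy.init_node("rem_navigation_sketch")
    navigate_to("secretary's office")
```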

The REM was considered a success: according to the researchers, the robot carried out the intent of its natural language conversations with human operators with more than 99% command recognition accuracy.

Nwankwo continued, “Our approach used the LLMs to interpret the natural language commands—goal destination commands, queries, and custom commands such as move right, move forward, turn in a circular pattern, etc.—and at the same time, used the ROS navigation planner to execute the robot’s physical actions accordingly. Our 99.13% command recognition accuracy measured the robot’s execution mechanism’s ability to translate a high level of understanding from the LLMs to the actual robot’s detailed navigation actions.”
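
Custom motion commands of the kind Nwankwo mentions (‘move forward,’ ‘turn in a circular pattern’) typically bottom out in ROS velocity messages. The snippet below is a generic illustration of that final step on the /cmd_vel topic, with made-up speeds and command names, not code from the project.

```python
# Illustration: executing simple custom motion commands as ROS velocity messages.
# Speeds, durations, and the command names are made up for this sketch.
import rospy
from geometry_msgs.msg import Twist

def run_motion(command: str, duration_s: float = 2.0) -> None:
    """Publish a velocity pattern on /cmd_vel for a short time, then stop."""
    pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    twist = Twist()
    if command == "move forward":
        twist.linear.x = 0.2            # m/s straight ahead
    elif command == "move right":
        twist.angular.z = -0.5          # rad/s clockwise turn toward the right
    elif command == "turn in a circular pattern":
        twist.linear.x = 0.2
        twist.angular.z = 0.5           # forward motion plus yaw traces a circle

    rate = rospy.Rate(10)               # stream commands at 10 Hz
    end_time = rospy.Time.now() + rospy.Duration(duration_s)
    while rospy.Time.now() < end_time and not rospy.is_shutdown():
        pub.publish(twist)
        rate.sleep()
    pub.publish(Twist())                # zero velocities stop the robot

if __name__ == "__main__":
    rospy.init_node("custom_motion_sketch")
    run_motion("turn in a circular pattern")
```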

In summary, the command recognition accuracy measurement assessed how accurately the complete TCC-IRoNL software system executed natural language instructions from the user. In testing, the researchers computed the accuracy from the logged interaction data of 21 test users as the percentage of LLM-predicted labels that matched the true commands required to accomplish the task, “as reflected in how effectively the robot execution mechanism successfully planned and executed the robot’s many individual tasks such as navigating the robot to the user’s instructed location,” Nwankwo said.
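
As described, the reported figure reduces to a simple proportion over the logged interactions. The toy calculation below illustrates that bookkeeping with made-up logs; the command strings are not from the study.

```python
# Toy illustration of the command recognition accuracy calculation: the share
# of logged interactions where the LLM-predicted command matches the true one.
predicted = ["navigate:office", "describe:scene", "navigate:lab", "stop"]
true_cmds = ["navigate:office", "describe:scene", "navigate:kitchen", "stop"]

matches = sum(p == t for p, t in zip(predicted, true_cmds))
accuracy = 100.0 * matches / len(true_cmds)
print(f"command recognition accuracy: {accuracy:.2f}%")  # 75.00% for this toy log
```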

In terms of response time, Nwankwo said, “Our framework took an average latency of 0.45 seconds from receiving the user’s chat command to initiating the robot’s movement. This indicates a relatively quick response time for our framework compared to the traditional neural network reinforcement and ‘imitation learning’ frameworks of the past, which are also more computationally expensive in terms of the high costs associated with reward specification, task-specific training, and fine-tuning.”

Using a video stream from an Intel RealSense imager, which includes both depth sensing and inertial measurement units, allowed the robot to converse with the human about its environment. Overlaid text identifying objects in the scene, drawn from the VLM’s database of millions of image-text pairs, enabled human/robot conversations about the robot’s surroundings. A human operator thus could direct the robot to navigate to identified objects, ask it to conversationally list the items within its view, and request information about objects in view, including their precise locations. Nevertheless, the robot’s object identification accuracy was judged to be just 55.2%, showing “room for improvement,” according to Rueckert.

“We achieved relatively low performance in terms of objects’ identification and localization within the robot’s task environment. We attribute this performance to the current implementation of the CLIP vision-language model,” said Nwankwo, which is why, he said, the researchers “plan to further refine the VLM implementation by optimizing the alignment between its textual descriptions and the visual data.”

The researchers also plan to explore other available VLMs, starting with the integration of Microsoft’s Grounded Language-Image Pre-training (GLIP) model as a baseline for comparison with OpenAI’s CLIP, to determine which model, or combination of models, best enhances object identification accuracy and localization capabilities.

R. Colin Johnson is a Kyoto Prize Fellow who has worked as a technology journalist for two decades.
