
Images Give Robots a Sharper Focus

Images and videos could train robots to mimic human actions in real-world settings.

[Illustration: a robot reading a book]

Teaching robots to navigate the world the way humans do is a formidable challenge. Over time, it has become clear that human-in-the-loop (HITL) and large language model (LLM) training methods demand enormous time and resources, yet still leave significant performance gaps.

It is nearly impossible to prepare a robot for every situation using HITL or an LLM. For example, “A large language model is useful for directing a robot to pursue a high-level activity like cooking food or folding laundry, but it isn’t particularly good with the fine-grained motor and sensory actions you need to complete the task,” said Mohit Shridhar, a robotics research scientist with Google DeepMind.

For this reason, Shridhar and other researchers are taking an entirely different tack: they’re turning to image data to train robots. The idea is to imbue robots with a more complete understanding of how humans approach various tasks—whether it’s preparing scrambled eggs or folding a stack of laundry.

This research could fundamentally change the way mechanical arms, humanoid robots, and other devices motor through activities. “Conventional data required to train robots is limited and expensive—while visual images are plentiful and highly effective. It’s a way to scale up learning,” said Ruoshi Liu, a fourth-year computer science Ph.D. student at Columbia University who has explored the topic.

Added Anirudha Majumdar, an associate professor in the Mechanical & Aerospace Engineering Department at Princeton University, “Image generation and video models represent a huge opportunity for robotics.”

A Robotic Vision

Unlike fields like computer vision and natural language processing, robotics faces a data scarcity problem. Oftentimes, researchers rely on limited and carefully curated datasets. Much of the learning takes place in a lab, where a robot is connected to a human who performs a task. Software captures the motion, which is incorporated into the robot’s programming.

More recently, researchers have also turned to LLMs. However, by training a robot with videos or simulations, researchers can generate synthetic training data that mimics real-world actions and behaviors—without the overhead of HITL. Researchers typically fine-tune a pre-trained model like Stable Diffusion using methods similar to LLM training. Eventually, “The robot learns how to follow the trajectory of a task. It can understand how an arm, elbow, wrist, hand, and fingers work together,” Shridhar said.
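To make the idea concrete, the following is a minimal sketch of what such fine-tuning might look like: the standard denoising-diffusion training objective applied to pairs of robot observations and target frames (for example, the scene a moment later). The model signature and data layout are assumptions for illustration; this is not code from any of the systems described here.

```python
# Minimal sketch of diffusion fine-tuning on robot data (illustrative only).
# Assumes `model(noisy_target, t, cond=obs)` predicts the injected noise and
# that images are batched as (B, C, H, W) float tensors.
import torch
import torch.nn.functional as F

def diffusion_finetune_step(model, obs, target, num_timesteps=1000):
    """One training step: learn to denoise the target frame (e.g., the scene
    one second later) conditioned on the current observation."""
    t = torch.randint(0, num_timesteps, (target.shape[0],), device=target.device)
    noise = torch.randn_like(target)
    # Simple linear beta schedule; alpha_bar is the cumulative product at step t.
    betas = torch.linspace(1e-4, 0.02, num_timesteps, device=target.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1, 1)
    noisy_target = alpha_bar.sqrt() * target + (1 - alpha_bar).sqrt() * noise
    pred_noise = model(noisy_target, t, cond=obs)  # conditioned on the observation
    return F.mse_loss(pred_noise, noise)
```

In practice, researchers start from pre-trained weights such as Stable Diffusion’s rather than training from scratch, which is what gives the robot its prior knowledge of everyday objects.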

In practical terms, a robot can receive a command—”cook scrambled eggs,” for instance—and proceed through all the steps required to prepare them. This includes retrieving eggs from the refrigerator, removing them from the carton, cracking them into a bowl, beating them, pouring them into a pan, cooking them, and then plating them.

For a human, the task doesn’t require much thinking. However, for a robot that lacks a complete understanding of the task, there are plenty of things that can go wrong. “It’s very difficult to explain to another person how to crack an egg,” Shridhar said. “So, while an LLM is good at directing the robot to do something, it falls short when you try to use it to handle the nitty-gritty details.”

Diffusion methods are appealing because they enable robots to gain generalized knowledge, making them adaptable across diverse environments and situations. For example, in manufacturing, the approach could help train a system to handle a complex assembly process. In the home, visual data could help a robot learn how to clean, organize spaces, and prepare meals.

Motion Matters

The same diffusion models that have revolutionized image editing are coming to robotics. Shridhar and a group of researchers from the former Stephen James Robot Learning Lab at the U.K.’s Imperial College London have developed a framework called Genima. Essentially, the system serves as a behavior-cloning agent, mapping sequences of movements depicted in images into visuomotor controls.

The team turned to the AI image generation platform Stable Diffusion—which broadly understands how objects appear in the real world—to “draw” actions for robots. They fed the model a series of images and introduced colored spheres to indicate where specific joints of the robot should be one second into the future. An ACT-based controller mapped the spheres and translated the motions into real-world robotic movements.
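As a rough illustration of what “drawing” an action means, the sketch below overlays a colored circle for each joint at the pixel location where that joint should be one second later. The joint names, colors, and function are hypothetical simplifications rather than Genima’s actual implementation; in the real system, the fine-tuned image model generates such target images itself, and the controller converts them back into joint motions.

```python
# Illustrative construction of a target "action image" (not Genima's code):
# colored circles mark where each joint should be one second in the future.
import cv2
import numpy as np

# Hypothetical fixed color per joint (BGR).
JOINT_COLORS = {
    "shoulder": (255, 0, 0),
    "elbow": (0, 255, 0),
    "wrist": (0, 0, 255),
    "gripper": (0, 255, 255),
}

def draw_action_image(frame: np.ndarray, future_joint_pixels: dict) -> np.ndarray:
    """frame: HxWx3 uint8 camera image.
    future_joint_pixels: {joint_name: (u, v)} pixel coordinates one second ahead."""
    target = frame.copy()
    for joint, (u, v) in future_joint_pixels.items():
        # Filled circle (thickness = -1) of radius 8 in the joint's color.
        cv2.circle(target, (int(u), int(v)), 8, JOINT_COLORS[joint], -1)
    return target
```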

Researchers tested the approach on 25 simulated tasks and nine real-world tasks. The system achieved a success rate as high as 79.3% on a specific task, though the average rate hovered between 50% and 64%. The takeaway, Shridhar said, is that the approach allows robots to adapt to new objects by exploiting the prior knowledge of a model pre-trained on Internet data. It also highlights the method’s potential to create interactive agents that can take actions in the physical world.

Meanwhile, researchers from Columbia University and several other groups have devised a system called Dreamitate. “There are numerous videos of humans folding clothes and doing other chores,” Liu said. “It’s possible to use these videos to train a robot how to imitate humans.” Added Junbang Liang, a second-year master’s student at Columbia who participated in the research: “High-resolution video creates a shortcut for training robots.”

The researchers fed 300 video clips depicting specific tasks into a pre-trained diffusion model. The resulting motion data—collected from diverse environments—allowed an AI model to use an initial image to predict actions three seconds into the future—even in unfamiliar settings. Using object tracking software, the robot could then execute the assigned task.
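In outline, the loop looks something like the sketch below: a video model imagines the next few seconds from the current frame, a tracker estimates the tool’s pose in each imagined frame, and the robot follows that trajectory. The callables here are hypothetical placeholders, not the authors’ API.

```python
# High-level sketch of "imagine, track, then execute" (illustrative only).
from typing import Callable, List, Sequence
import numpy as np

def imitate_from_video_model(
    initial_frame: np.ndarray,
    predict_future_frames: Callable[[np.ndarray], List[np.ndarray]],  # video model rollout
    track_tool_pose: Callable[[np.ndarray], np.ndarray],              # pose from object tracking
    execute_pose: Callable[[np.ndarray], None],                       # robot controller
) -> Sequence[np.ndarray]:
    """Predict roughly three seconds of future frames, then follow the tracked poses."""
    frames = predict_future_frames(initial_frame)
    trajectory = [track_tool_pose(frame) for frame in frames]
    for pose in trajectory:
        execute_pose(pose)
    return trajectory
```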

The visuomotor policy learning framework achieved improvements that ranged from about 55% to 640% over more conventional training methods. “The robot’s ability to generalize activity boosts its performance,” Liu said. “What’s more, the research showed that the method could work with a real robot in the real world.”

Framing Progress

For now, the idea of using visual data to train robots remains in its infancy—and the field is currently limited by tools, GPU constraints, and other issues. One significant challenge, Majumdar said, is that images aren’t always reliable. “Video and image models are trained on internet data, which often looks very different from visual observations that the robot collects using its sensors,” he said.

Noisy data is also a problem. This includes irrelevant objects and backgrounds that can make it difficult for the AI system to focus on the crucial visual data and extract the essential information. As a result, Majumdar has developed a video editing software tool called Bring Your Own VLA, which removes irrelevant visual regions from the robot’s observation. Other research groups are also exploring ways to improve training methods, including TinyVLA, ViLa, and ReVLA.
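Independent of any particular tool, the underlying intuition can be shown in a few lines: suppress pixels that are not relevant to the task before the observation reaches the policy. The sketch below assumes a binary relevance mask is already available (producing that mask is the hard part, and is not shown); it is a generic illustration, not the approach used by any of the tools named above.

```python
# Generic illustration: replace task-irrelevant pixels with neutral gray
# before an observation is passed to the robot's policy.
import numpy as np

def suppress_irrelevant_regions(frame: np.ndarray,
                                relevance_mask: np.ndarray,
                                fill_value: int = 128) -> np.ndarray:
    """frame: HxWx3 uint8 image; relevance_mask: HxW bool, True = keep pixel."""
    cleaned = frame.copy()
    cleaned[~relevance_mask] = fill_value  # broadcast gray over masked-out pixels
    return cleaned
```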

Nevertheless, it’s clear that video diffusion methods represent a promising avenue for developing smarter and more versatile robots. Combined with LLMs and HITL training, it’s possible to achieve the best of all worlds: simpler interactions with humans along with more human-like performance.

Concluded Liu: “The goal is to build a future prediction model that gives the robot enough information that it can handle tasks, even in unfamiliar situations and circumstances. That’s a key to building a robot that will function well in the real world.”

Samuel Greengard is an author and journalist based in West Linn, OR, USA.
