Artificial Intelligence and Machine Learning Last Byte

How Many Ways Can You Teach a Robot?

ACM Prize recipient Pieter Abbeel is working to help robots learn, and learn to improve.

[Photo: UC Berkeley Professor Pieter Abbeel at a laptop computer]

The human brain is wired to be able to learn new things—and in all kinds of different ways, from imitating others to watching online explainer videos. What if robots could do the same thing? It is a question that ACM Prize recipient Pieter Abbeel, professor at the University of California, Berkeley, and director of the Berkeley Robot Learning Lab, has spent his career researching. Here, we speak with Abbeel about his work and about the techniques he has developed to make it easier to teach robots.

Let’s start with deep reinforcement learning and the method you developed called Trust Region Policy Optimization. How does that method work, and how did you come to develop it?

Let’s zoom out for a moment.

In the past, to put a robot somewhere—let’s say a car or electronics factory—you set up the environment around the robot such that everything repeats, over and over, in exactly the same way. Then, you programmed the robot with some kind of fixed-motion sequence, and that was enough to get things done. It works great for structured environments, but it doesn’t enable us to do anything in environments where things are a little less predictable.

My thinking has always been that the big change is going to happen when robots can adapt to their situations. And to do that, they’re going to have to be able to learn.

So how do we make robots learn?

That is what I’ve been working on ever since my Ph.D. Fundamentally, there are two main approaches, and they complement each other. One is called imitation or apprenticeship learning, and the other is reinforcement learning.

In imitation learning, you show the robot what to do, and from your examples, the robot learns to do it. That’s great, because when you want a robot to do something, you often know exactly what you want it to do. But the challenge is that you need to give the robot a wide range of examples so it can generalize when faced with a new scenario and still complete the task. That can become time-consuming, and once the environment changes, there’s almost always something you didn’t cover well enough in your demonstrations.
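At its core, imitation learning reduces to supervised learning: fit a policy to (state, action) pairs demonstrated by an expert. A minimal sketch of that idea, with an entirely invented linear "expert" and toy data (not Abbeel's actual methods):

```python
import numpy as np

# Toy behavior cloning: imitation learning as supervised learning.
# A hypothetical expert maps a 2-D state to an action via a fixed linear
# rule; the learner never sees the rule, only (state, action) demonstrations.
rng = np.random.default_rng(0)
true_w = np.array([0.8, -0.3])        # the expert's hidden policy (invented)

states = rng.normal(size=(50, 2))     # 50 demonstrated states
actions = states @ true_w             # the expert's actions in those states

# Fit a linear policy to the demonstrations by least squares.
learned_w, *_ = np.linalg.lstsq(states, actions, rcond=None)

# The cloned policy generalizes to a state it was never shown.
new_state = np.array([1.0, 2.0])
predicted = new_state @ learned_w
expert_action = new_state @ true_w
```

The catch the interview describes shows up immediately in a sketch like this: the fit is only trustworthy near the demonstrated states, so coverage of the demonstrations determines how well the policy generalizes.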

What about reinforcement learning?

Reinforcement learning is about trial and error. Here, the robot is not shown what to do; the robot just tries and tries, and there’s a system that tells it whether or not it succeeded. So, in principle, you would want first to show the robot what to do through imitation learning, then have the robot keep learning from its own trial and error.
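The trial-and-error loop can be sketched in a few lines. Here a toy learner picks between two actions, receives only a success/failure signal, and gradually learns which action works; the success probabilities are invented for illustration:

```python
import random

# Toy trial-and-error learning: the learner is never shown what to do.
# It tries actions, and a signal reports success (1) or failure (0).
random.seed(0)
success_prob = [0.8, 0.2]        # hidden from the learner (invented)
value = [0.0, 0.0]               # running estimate of each action's success
counts = [0, 0]

for trial in range(2000):
    # Mostly exploit the best current estimate, sometimes explore.
    if random.random() < 0.1:
        a = random.randrange(2)
    else:
        a = 0 if value[0] >= value[1] else 1
    reward = 1 if random.random() < success_prob[a] else 0
    counts[a] += 1
    value[a] += (reward - value[a]) / counts[a]   # incremental average

best_action = 0 if value[0] > value[1] else 1
```

No demonstrations are involved: all the learner ever observes is its own attempts and the success signal, which is exactly the setting the answer describes.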

In 2012, ACM A.M. Turing Award recipient Geoff Hinton showed that, with enough visual data, deep neural networks can be trained to perform unprecedented levels of pattern recognition. I gather that inspired you to try to make the pattern recognition in your reinforcement learning framework more powerful.

In reinforcement learning, the robot does something on its own, but it still needs to recognize how the patterns in good runs are different from the patterns in bad runs. My student John Schulman and I began experimenting with deep neural nets to see if we could improve the pattern recognizer inside our reinforcement learning algorithms. But it turns out that reinforcement learning algorithms are much more brittle than standard supervised learning. In supervised learning, there’s an input, and the output is a label, and you just need to recognize the pattern. In reinforcement learning, the robot needs to learn to run when it’s never run before. There’s not a lot of signal, and there’s an awful lot of noise.

So you tried both to improve pattern recognition and to make the algorithm more stable.

We need to be able to guarantee that the robot is improving. If it looks at a recent experience, it will update the pattern recognizer, which is the neural network policy that takes current sensor input and generates motor commands. We knew that if we could come up with a way to enable the robot to make consistent improvement at every step, then we’d have a real foundation to do reinforcement learning with these massive neural nets.

And that is where Trust Region Policy Optimization comes in.

In traditional reinforcement learning, we take a bunch of trials and compute a gradient to find the direction of most improvement. Trust Region Policy Optimization defines a trust region: a region in which we know we can trust that gradient. A gradient is a first-order, linear approximation of the landscape. We know the landscape’s not linear, but locally it can be approximated that way. So, my student John Schulman and I came up with a way to quantify the region in which you can trust this linear approximation. Afterwards, we just have to take a step that stays within that region, and it’s a guaranteed improvement.
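The trust-region idea can be illustrated on a toy problem. This is a hedged sketch of the concept, not the actual TRPO algorithm: a softmax policy over two actions (action 0 always succeeds, action 1 never does), where we follow the gradient but shrink the step until the KL divergence between old and new policies stays inside a trust region of size delta. Both the problem and the choice of delta are invented for illustration:

```python
import math

# Trust-region sketch (conceptual, not full TRPO): a softmax policy over
# two actions. Expected reward J(theta) is the probability of action 0.
# We step along the gradient, but only accept a step whose KL divergence
# from the old policy fits inside the trust region.

def softmax(t0, t1):
    z = math.exp(t0) + math.exp(t1)
    return math.exp(t0) / z, math.exp(t1) / z

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

delta = 0.05                      # trust-region size (illustrative choice)
theta = [0.0, 0.0]
history = []                      # J(theta) at each iteration

for it in range(20):
    p0, p1 = softmax(*theta)
    history.append(p0)
    # Exact gradient of J = p0 with respect to the two logits.
    grad = [p0 * (1 - p0), -p0 * p1]
    # Backtracking line search: take the largest step that stays inside
    # the trust region AND improves the objective.
    step = 1.0
    while True:
        new = [theta[0] + step * grad[0], theta[1] + step * grad[1]]
        q = softmax(*new)
        if kl((p0, p1), q) <= delta and q[0] >= p0:
            theta = new
            break
        step *= 0.5

final_p0, _ = softmax(*theta)
```

Because a step is accepted only when it both stays inside the region and improves the objective, the objective never decreases from one iteration to the next, which is the "guaranteed improvement" property the answer describes (the real TRPO derives the region size from the policy's curvature rather than fixing it by hand).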

“Our goal is to come up with a methodology that enables robots to be very general in what they can learn, and also in how they learn.”

Repeat that, and you have a trustworthy foundation for reinforcement learning.

Right! We have a video on YouTube (https://bit.ly/3nZrQhs) where you can see this whole process in action. The robot is just trying, trying, trying, and falling over. But over time, it’s actually starting to run. And part of the beauty of learning is that once you have a learning algorithm, you don’t need to reprogram it to work with a four-legged robot—you just run the learning algorithm again, and now it’ll learn what’s needed in the new situation.

The company you founded, Covariant, is trying to commercialize this idea by building a general robot brain.

Our goal is to come up with a methodology that enables robots to be very general in what they can learn, and also in how they can learn. Of course, you can’t learn to fly just from having learned to manipulate Lego blocks (https://bit.ly/3bXcPty)—it’s not the right data set. But the code can be the same. Humans, too, use the same principles to learn how to ride a bike or drive a car. The algorithm in our brains isn’t changing.

Covariant is also building robots for commercial applications, notably pick-and-place warehousing.

We can give robots new skills that go well beyond pre-programmed fixed-motion sequences, even if they aren’t fully general. We knew that was possible from our academic research, and we started Covariant with it in mind. Part of our thinking was that the robots should be useful right now. Also, our product development is data-driven, and if we want to collect a lot of data, we need to build robots that people actually want to buy.

When you founded Covariant, in 2017, self-driving cars were getting tons of funding. What drew you to pick-and-place?

We wanted to find a domain where, in the rare times when we might still need human backup, it didn’t have to be a real-time intervention. Real-time human intervention is costly, and it takes away most of the value of having a robot do things. With robotic manipulation, you still need to have very high accuracy, but the one time the robot doesn’t perform as expected, someone can step in later with a quick fix.

We looked at many different companies and industries and applications, but we converged on warehousing because it seemed like a natural starting point, for two reasons. One, because pick-and-place, if you think about it, is the general foundation of almost anything else a manipulation robot might do. Two, it’s a fast-growing industry with a real need for automation to support all our online deliveries. There’s no automation in pick-and-place, and it’s a task that causes humans a lot of injuries.

You co-teach a class about the business of AI. What have you learned from teaching AI to non-specialists?

One of the reasons I decided to teach the course is because I think a very basic understanding of AI is important for making business decisions. Many companies will use AI in some way or other, whether they develop it internally or buy some sort of service. Business students must be able to understand what’s possible today, what might become possible in the near future, and how to evaluate different systems.

It’s also a lot of fun, because for people who have never really studied AI, it’s kind of like explaining a magic trick. At its core, AI is very explainable. You’ll need a lot of training if you want to push it to the next technological frontier, but understanding basic concepts isn’t something that requires years and years of studying.

