People have been interested in building “robots” in the form of humans for thousands of years. There were baked clay figures of humans in both Europe and China 7,000 years ago. At the height of Egyptian civilization 3,000 years ago, articulated statues could be controlled by hidden operators. At Thebes, the new king was chosen by an articulated statue of Ammon, one of the chief Egyptian gods (depicted as a human with a ram’s head). Priests secretly controlled it as male members of the royal family paraded before it.
Leonardo da Vinci, the leading student of human anatomy in his time, designed a mechanical equivalent of a human—a humanoid robot—early in the 16th century; unfortunately, the design has still not been constructed.
Frenchman Jacques de Vaucanson early in the 18th century built three clockwork humanoids. One was a mandolin player that sang and tapped its foot as it played. Another was a piano player that simulated breathing and moved its head. A third was a flute player. All were reported to be very lifelike, though none could sense the environment; all were simple playback mechanisms.
Similar humanoids soon followed. In the 18th century Pierre Jaquet-Droz, a Swiss watchmaker, and his son Henri-Louis built a number of humanoids, including a female organ player that simulated breathing and gaze direction, looking at the audience, her hands, and the music. Henri Maillardet, also a Swiss watchmaker, built a boy robot in 1815 that could write script in both French and English and draw a variety of landscapes.
Modern Humanoids
The modern era of humanoid robots was ushered in during the early 1970s by Ichiro Kato, a professor at Waseda University in Tokyo; he oversaw the building of Wabot-1, a robot that could walk a few steps on two legs, grasp simple objects with its two hands, and carry out some primitive speech interaction with people. But, as with the early humanoids, Wabot-1 was still essentially a playback mechanism.
Kato’s next robot, Wabot-2, built in 1984, was much more than a playback mechanism. Like Wabot-1 it had two legs and two arms. Unlike Wabot-1, it could not stand but rather sat on a piano bench. Its feet were used to press the pedals of an organ, and its arms and hands were restricted to playing the organ’s keyboard. It had five fingers on each hand and could move its arms from side to side when playing the keys. Its head was a large TV camera; when sheet music was placed on the music stand above the keyboard, it would read the music and play the piece. In some sense it, too, was a playback mechanism, but it played back standard musical notation, perceiving such notation through its vision system and responding appropriately.
By the mid-1990s many humanoid robot projects were under way, most notably in Japan, Germany, and the U.S. Today, more than 100 researchers work in humanoid robotics at Waseda University alone and a similar number at Honda Corp. just outside Tokyo. There are also large humanoid projects at Tokyo University, the Electrotechnical Laboratory (ETL) in Tsukuba, Advanced Telecommunications Research (ATR) in Kyoto, and at other Japanese locations. Germany’s Bundeswehr University of Munich and the Technical University of Munich have hosted humanoid robot projects. The major projects in the U.S. have been at the University of Utah, Vanderbilt University, NASA-Houston, and MIT.
There have been many different motivations for building humanoid robots. Some formally announced ones include: investigating bipedal locomotion; building teleoperated robots to directly take the place of people (such as in spacewalks outside the International Space Station); building robots to maneuver in houses built to be convenient for people; investigating hand-eye coordination for tasks usually done by people; entertaining people; and functioning as a tool to study how people do what they do in the world.
MIT Humanoids
The humanoid robotics group at MIT (one of two groups in the Artificial Intelligence Laboratory working on humanoid robotics, the other concentrating on bipedal locomotion) started out developing humanoid robots as a tool for understanding humans’ use of representations of the world around them [6]. Early plans were based on the work of the linguist George Lakoff and the philosopher Mark Johnson (best summarized in [8]), who posited that all of our understanding of the world builds upon the embodied experiences we have when we are young. For instance, they argued that the concept of affection uses warmth as a metaphor because children are exposed to the warmth of their parents’ bodies when shown affection. Thus we might say, “They greeted me warmly.” Likewise, we tend to use bigness as a metaphor for importance, as in “tomorrow is a big day,” because parents are important, big, and indeed dominate our visual experience when we are young. Higher-level concepts are built as metaphors less direct than these primary ones but nevertheless rely on our bodily experience in the world. For instance, for time, we use the metaphor of moving forward, walking or running in a straight line. Thus the future is ahead of us, the present is where we are, and the past is behind us.
As the first humanoid robot, called Cog, was being developed in the mid-1990s, many aspects of perception and motor control had yet to be solved [5] (see Figure 1). Its developers realized there were important precursors to explicit representations of metaphors, as had been argued in earlier work on situated and embodied robots [4]. In the case of robots with humanoid form, intended to act in the world as people do, these precursors are social interactions [2], which are themselves based on emotional systems [7], facial expressions, and eye movements. The eye movements are driven by perceptual demands imposed by the underlying architecture of the eye [10]; in turn, they have been hijacked by evolution as significant components of human social interactions.
This realization prompted development of the robot Kismet in the late 1990s to study how social cues can be elicited from people by robots (see Figure 2). Today, both robots are used for researching aspects of social interaction.
Active Vision
Vision systems with steerable cameras that move in purposeful ways as part of the perception process are called active vision systems [1]. A humanoid vision system that shares the basic mechanical structure of human and other mammalian eyes, and that follows the same motion primitives humans use, appears animate and lifelike.
The human eye has a central fovea covering about 5 degrees, both vertically and horizontally, of the roughly 160-degree field the eye can see. The brightness and color receptors are much more densely packed in this area; more than half of the region of the brain that first processes signals from the eye is dedicated to the central 2% of the field of view. Humans move their eyes around rapidly, up to four times per second, to aim this high-resolution part of their eyes at whatever they are interested in. These rapid motions are called saccades and occur ballistically, without feedback about their accuracy during the motion. They are under voluntary control, in that a person can consciously choose to saccade to a particular location, though most saccades are made completely involuntarily by some sort of attention mechanism. Typically, something interesting appears in the low-resolution periphery of human perception, and the eye saccades to that target to see it with higher resolution.
Humans can also scan their eyes to follow something moving in their field of view. Called smooth pursuit, such scanning cannot be done voluntarily; people cannot scan their eyes smoothly from, say, left to right, unless there is a moving object they can lock onto and follow. Lastly, humans use their inner ears to detect head motion and feed that signal forward to compensate with eye motion, far more quickly than the vision system could detect the apparent slip of the world and correct for it. This is known as the vestibulo-ocular reflex.
These three capabilities—saccades, smooth pursuit, and the vestibulo-ocular reflex—have been implemented repeatedly in both Cog and Kismet [5], operating with performance comparable to that of humans, though their cameras have much lower resolution overall than the human eye.
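To make the distinction among these three eye-motion modes concrete, the following is a minimal sketch of how they might be combined in a single gaze controller. The class, method, and gain names are illustrative assumptions, not taken from the Cog or Kismet software.

```python
# Illustrative sketch of combining saccades, smooth pursuit, and the
# vestibulo-ocular reflex (VOR) in one eye controller. All names and
# gains are hypothetical; they do not come from the Cog/Kismet code.

class GazeController:
    def __init__(self, pursuit_gain=0.8, vor_gain=1.0):
        self.eye_angle = 0.0           # current horizontal eye angle (degrees)
        self.pursuit_gain = pursuit_gain
        self.vor_gain = vor_gain

    def saccade(self, target_angle):
        # Ballistic, open-loop move: jump to the target with no visual
        # feedback during the motion.
        self.eye_angle = target_angle

    def smooth_pursuit(self, target_velocity, dt):
        # Velocity servo: match the eye's angular velocity to the
        # target's, so a moving object stays near the fovea.
        self.eye_angle += self.pursuit_gain * target_velocity * dt

    def vor(self, head_velocity, dt):
        # Feedforward compensation: counter-rotate the eye using the
        # inner-ear (gyro) signal, without waiting for vision.
        self.eye_angle -= self.vor_gain * head_velocity * dt


# Example: hold gaze on a target moving at 10 deg/s while the head
# rotates at 5 deg/s, over 100 steps of 10 ms each.
gaze = GazeController()
gaze.saccade(target_angle=20.0)   # acquire the target ballistically
for _ in range(100):
    gaze.smooth_pursuit(target_velocity=10.0, dt=0.01)
    gaze.vor(head_velocity=5.0, dt=0.01)
print(round(gaze.eye_angle, 2))
```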
Humans also verge their eyes toward a target and estimate the gross depth by how far off parallel their eyes have to move to see the same point in space. Comparisons are then made between the images in the eyes to get a local relative depth map—the process of stereo vision. Cog and Kismet also have these capabilities and so are able to perceive 3D aspects of the world.
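As a rough illustration of how gross depth falls out of vergence, the distance to the fixated point can be estimated from the spacing between the two cameras and how far off parallel they have rotated. The sketch below assumes symmetric fixation; the baseline and angle values are illustrative and do not describe Cog’s or Kismet’s actual geometry.

```python
import math

def depth_from_vergence(baseline_m, vergence_deg):
    """Estimate distance to the fixated point from the camera baseline
    and the total vergence angle (how far off parallel the two eyes
    rotate to fixate the same point). Symmetric fixation assumed."""
    half_angle = math.radians(vergence_deg) / 2.0
    return (baseline_m / 2.0) / math.tan(half_angle)

# Example: eyes 10 cm apart, verged a total of 5 degrees off parallel.
print(round(depth_from_vergence(0.10, 5.0), 2))  # roughly 1.15 meters
```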
Cog and Kismet are able to detect human faces through a variety of methods [9] and estimate the gaze direction of a person by determining the direction their eyes are pointing. The robots cannot do as good a job as the human visual system, but estimates accurate to within 3 to 5 degrees are useful for social interactions.
Cog and Kismet each have their perception and control systems running on more than a dozen computers. There is no central executive and indeed no central locus of control for the robots. Nevertheless, they appear to be operating in a coherent manner. The low-level trick that allows this coherence to happen is the visual attention mechanism (see Figure 3), which determines where the robot is looking; where it is looking determines what all the low-level perceptual processes will see. That in turn determines which of the robot’s behaviors are active. The robot’s coherence of behavior is not determined by some internal locking mechanism but by its direction of gaze out into the world.
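The following sketch gives the flavor of such an attention mechanism, in the spirit of Figure 3. The particular feature maps, weights, and habituation constants are assumptions made for illustration; they do not reproduce the Kismet or Cog implementation.

```python
import numpy as np

def most_salient_point(skin_map, color_map, motion_map, habituation,
                       weights=(1.0, 1.0, 1.0)):
    """Combine low-level feature maps into one saliency map and return
    the pixel the eyes should saccade to. Each map is a 2D array of
    per-pixel responses in [0, 1]; 'habituation' suppresses regions
    that have already been looked at for a while."""
    w_skin, w_color, w_motion = weights
    saliency = (w_skin * skin_map +
                w_color * color_map +
                w_motion * motion_map -
                habituation)
    return np.unravel_index(np.argmax(saliency), saliency.shape)

def update_habituation(habituation, gaze_point, decay=0.95, boost=0.1):
    """Gradually forget old habituation and add some around the current
    gaze point, so whatever is being attended to slowly becomes boring."""
    habituation *= decay
    y, x = gaze_point
    habituation[y, x] += boost
    return habituation

# Example: a higher-level behavior that wants social interaction could
# raise the skin-tone weight, biasing gaze toward faces and hands.
h, w = 48, 64
habituation = np.zeros((h, w))
skin, color, motion = (np.random.rand(h, w) for _ in range(3))
gaze = most_salient_point(skin, color, motion, habituation,
                          weights=(2.0, 1.0, 1.0))
habituation = update_habituation(habituation, gaze)
```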
Social Interaction
The Cog and Kismet visual systems are the bases for their social interactions. Even a naive human observer can understand what the robots are paying attention to by the direction of their gaze. Likewise, the robots can understand what a person is paying attention to by the direction of the person’s gaze [9].
The visual attention system makes it completely intuitive for naive users to direct the robot’s attention to particular objects. Cynthia Breazeal, now at the MIT Media Laboratory, describes a series of experiments in which subjects were asked to get the robot to pay attention to different objects [2]. Typically, they would bring the object into the field of view of the robot, then shake it and move it to the desired position, with the robot smoothly pursuing it, paying attention to what the human subject wanted. The experimental subjects had no knowledge of how the robot’s visual system operated but were able to use the same strategies they would use with a child, and those strategies worked.
By manipulating the weighting the visual system applies to different visual cues, Kismet’s high-level behaviors, such as dialogue turn-taking, can cause it to make or break eye contact; indirectly, these high-level behaviors thus regulate social interaction. Moreover, Kismet expresses its internal emotional state through facial expressions and prosody in its voice. So, for instance, when someone comes very close to Kismet or waves something very quickly near its face, Kismet becomes more fearful. That emotional state is reflected in its posture; it draws back. This reaction triggers a complementary reaction in naive human subjects, who also tend to draw back. Thus Kismet, indirectly through its emotional system and its expression in the world, is able to manipulate people in social settings, just as humans unconsciously manipulate each other.
Kismet is also able to detect basic prosody in the voices of people and classify their speech as “praising,” “prohibiting,” “bidding for attention,” or “soothing,” four basic prosodic signals used in almost all human cultures by mothers with their babies. Kismet’s detection of these cues changes its internal emotional state in appropriate ways; its outward demeanor changes, coupling in people who then intuitively react in appropriate ways.
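To give a feel for how prosodic classification of this kind might work, here is a toy sketch that maps simple pitch-contour statistics onto the four categories. The features and thresholds are invented for illustration; they are not Kismet’s actual classifier, which works from lower-level acoustic measurements.

```python
import numpy as np

def classify_prosody(pitch_contour_hz, energy):
    """Toy classifier over a pitch contour (Hz per frame) and overall
    energy in [0, 1]. The thresholds below are purely illustrative."""
    pitch = np.asarray(pitch_contour_hz, dtype=float)
    mean_pitch = pitch.mean()
    pitch_range = pitch.max() - pitch.min()
    slope = np.polyfit(np.arange(len(pitch)), pitch, 1)[0]

    if energy > 0.7 and pitch_range < 40 and mean_pitch < 180:
        return "prohibiting"             # low, flat, forceful
    if mean_pitch > 250 and slope > 0:
        return "bidding for attention"   # high and rising
    if pitch_range > 120:
        return "praising"                # exaggerated, swooping contour
    return "soothing"                    # gentle, low-energy contour

# Example: a long, exaggerated falling contour reads as praise here.
contour = [350, 330, 300, 260, 220, 190]
print(classify_prosody(contour, energy=0.4))
```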
Breazeal has reported on a number of experiments with human subjects [2]. Naive subjects in one set of experiments sat in front of the robot and were instructed to “talk to the robot.” The robot could understand only their prosody and not the actual words they said. The robot generated speech with prosody, though it was always random strings of English phonemes with no intrinsic meaning.
Most subjects were able to determine when it was their turn to speak, but some did not know what to say. Others engaged in long conversations with the robot—even though there was no conventional linguistic transfer. The more basic social interactions often masked the lack of actual language. For instance, in one session a human subject said, “I want you to take a look at my watch,” and Kismet looked right at the person’s watch. The person had drawn up his left wrist to be in Kismet’s field of view, then tapped his right index finger to the face of the watch. That was a sufficient cue to attract Kismet’s attention system, and Kismet saccaded to the watch. Just as in human-to-human communication, layers of social interaction smoothed the process.
Because Kismet’s processing system made it a little slower at turn-taking than a human, careful examination of the video record showed frequent turn-taking errors (where the robot or the person interrupted the other) at the start of each session, but also that people soon adapted (the robot did not adapt), and that after a few minutes the errors were significantly less frequent. Video clips of many of these experiments are available at www.ai.mit.edu/projects/humanoid-robotics-group.
Humanoids Everywhere?
The first few domestic robots are already on the market, including lawnmowing robots and home floor-cleaning robots. All are easy to use, which will be very important as the functionality of domestic robots is developed further. We can compare robot ease of use with computer ease of use. There are two sorts of computers in people’s homes: one is embedded processors in television sets, coffee machines, and practically any tool or appliance powered by electricity; they are trivial to interact with and induce almost no cognitive load on the human user. The other is home PCs with thousands of options that can be quite difficult to understand; they produce high cognitive loads on users. It would be desirable for robots to follow the path of embedded processors, rather than PCs, and produce little cognitive load. However, unlike today’s embedded processors, robots will be highly visible because they will move around in home environments. Therefore, it will be desirable for them to understand human social conventions, so they can be unobtrusive; meanwhile, humans should be able to interact with them in the same kind of noncognitive ways they interact with other humans. For instance, it will be useful for a large mobile appliance and a person to be able to negotiate who goes first in a tight corridor with the same natural head, eye, and hand gestures all people understand already.
Should we expect these sociable robots to have humanoid form and be as commonplace in our lives as a number of Hollywood fantasies have portrayed? It is difficult to know today, but there are two compelling, and competing, arguments on opposite sides of this question:
The current infatuation with humanoid robots is a necessary but passing phase. It allows researchers to get at the essence of human-robot interactions, but the lessons learned will ultimately be applicable to robots with much more functional forms. For instance, we can expect driverless trucks in our residential neighborhoods. When human drivers stop at an intersection as other vehicles pull up on the cross street, they often engage in informal social interactions through eye contact, head nodding, and finger motions—social interactions ignored in the formal driving rules but that form a negotiation as to which driver should proceed first.
When one of those vehicles is a driverless truck, similar sorts of social negotiations should be possible to lubricate the safe flow of traffic. Current experiences with humanoid sociable robots may well lead to the development of social signals for the robot truck that require no human form but that tap into the same subconscious cues used between humans, and between humans and humanoid robots.
The large number of humanoid robot projects under way today, especially in Japan, may produce enough successful prototype robots that people will find them naturally acceptable and come to expect robots to have human form. It has become well understood over the past 20 years that the technologically superior solution may not be the one that wins out in the marketplace (in the same way the VHS video format won out over the Beta format). Rather, the outcome depends on early market share and the little-understood dynamics of adoption. For this reason, humanoid robots might become common by accident. Or it may turn out there will be a discovery (not yet made) that they have some significant advantage over all other forms, and they will be common precisely because they are technologically superior.
The weight of progress in so many forms of robots for unstructured environments leads to the conclusion that robots will be common in people’s lives by the middle of the century if not significantly earlier. Whether significant numbers of them will have human form is an open question.
Figures
Figure. The many moods of Kismet: angry, surprised, happy, asleep.
Figure 1. Cog is an upper-torso robot with two force-controlled arms, simple hands, and an active-vision head. It has undergone many revisions since 1993; different versions have appeared in the literature with different heads, arms, and hands.
Figure 2. Kismet is an active-vision head with a neck and facial features. It has four cameras (two in the steerable eyes and two wide-angle ones embedded in its face) and active eyebrows, ears, lips, and a jaw. Altogether, it includes 17 motors. A new-generation Kismet is under construction.
Figure 3. The visual attention system used in Kismet and Cog. The robots pay attention to objects with skin color, bright colors, or fast motion. Higher-level behaviors determine how these factors are weighted together; the eye motor system then saccades toward the most interesting part of the image. A habituation signal makes any interesting feature eventually appear less interesting, allowing the robot to pay attention to something new.