News

Decoding the Language of Human Movement

Computers that recognize what is happening in moving images can help defend against crime, and revolutionize rehabilitation.
Figure: Image captured from video of a weightlifter in action.

Researchers are combining techniques from different branches of artificial intelligence and statistical processing to give computers the ability to comprehend the language of human movement. Using these techniques, computer scientists hope to give automated systems the ability to understand what is happening in the sequences of images captured by the video cameras around us.

Jesus del Rincón, researcher at the Institute of Electronics, Communications and Information Technology (ECIT) based at the Queen’s University of Belfast, U.K., points to defending against crime and terrorism as prime motivations. “If someone is on the public transport network and doing something strange, we want to use reasoning to model the intention of the attacker and work out what is going on.”

Healthcare provides another driver for making computers understand human movement. “There are a number of quality of life issues that could be improved. Take the example of a stroke survivor; much of their rehabilitation is done today in the hospital building, but with cameras at home you could change how we do rehabilitation,” says Deva Ramanan, a researcher at the University of California, Irvine (UC Irvine).

A computer can spend much more time with a patient than a specialist can, and in a more comfortable, familiar environment. Potentially, the applications could move from the healthcare environment into daily life as gestural interfaces become more sophisticated.

“If we get this right, it could change the way we interface with everything,” Ramanan adds.

Michael Ryoo, a researcher in the computer vision group at NASA’s Jet Propulsion Laboratory, traces the foundations of video-based analysis of human movement to work carried out 20 years ago by Junji Yamato, now executive manager of NTT’s Innovative Communications Laboratory, and colleagues, who analyzed the strokes of tennis players. That work pioneered an approach that carries through to today’s research: splitting the analysis into two phases.

The first is to extract what researchers call low-level features of interest from each video frame. These features may be limbs and other parts of the body, or may work at a scale that lets them categorize the pose or posture of the body as a whole using statistical techniques.

A popular statistical approach in research today is the histogram of oriented gradients (HOG), in which the edges of a body are matched to a template that contains a matrix of short line segments of differing intensities. The match is best where the edges extracted from the image align closely to the parts of the template with the greatest intensity.
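
As a rough illustration only, the sketch below computes a HOG descriptor for an image window with the scikit-image library and scores it against a stored template vector by a plain dot product. The window and template here are random stand-ins; the systems described in this article learn their templates from training data.

```python
# Illustrative sketch: HOG descriptor for one detection window, scored
# against a template vector. Random data stands in for real frames and
# learned templates.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog

def hog_descriptor(window_rgb):
    """Histogram-of-oriented-gradients descriptor for one detection window."""
    gray = rgb2gray(window_rgb)
    return hog(gray,
               orientations=9,            # edge-direction bins
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               feature_vector=True)

def template_score(window_rgb, template_vec):
    """Higher when the window's strong edges line up with the
    high-intensity entries of the template (dot-product match)."""
    return float(np.dot(hog_descriptor(window_rgb), template_vec))

# Toy usage with random data in place of a real frame and a learned template.
rng = np.random.default_rng(0)
window = rng.random((64, 64, 3))                       # 64x64 RGB window
template = rng.random(hog_descriptor(window).shape[0])
print(template_score(window, template))
```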

The next step is to feed the poses extracted from each video frame to an algorithm that can make sense of them. Initially, researchers used the hidden Markov model (HMM), another statistical technique, this time borrowed from the world of speech processing, to take a series of poses and generate a classification for them.

Take the video of a weightlifter, for example. The lift can be subdivided into three larger-scale actions: a yank from the ground, followed by a pause, and then the press that lifts the bar above the head. Frames within each segment will match different motion templates. An HMM uses probabilities to determine which group of templates best fits an action, and produces that as its output.
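
The sketch below illustrates the basic mechanism with a toy discrete HMM: three hidden states stand in for the yank, pause, and press phases, and the forward algorithm assigns a likelihood to a sequence of quantized per-frame poses. Every probability here is invented for illustration; a real system would train one such model per activity and label a clip with whichever model scores it highest.

```python
# Minimal sketch: scoring a sequence of discrete per-frame pose labels under
# an HMM with the (rescaled) forward algorithm. All numbers are made up.
import numpy as np

def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """log P(observation sequence | HMM) via the forward algorithm."""
    alpha = start_p * emit_p[:, obs[0]]           # initial step
    log_lik = 0.0
    for o in obs[1:]:
        alpha = (alpha @ trans_p) * emit_p[:, o]  # predict, weight by emission
        s = alpha.sum()                           # rescale to avoid underflow
        log_lik += np.log(s)
        alpha /= s
    return log_lik + np.log(alpha.sum())

# Hidden states: 0=yank, 1=pause, 2=press.
# Observations: 0="crouch", 1="upright, bar at chest", 2="arms overhead".
start = np.array([1.0, 0.0, 0.0])
trans = np.array([[0.8, 0.2, 0.0],    # mostly stay in a phase,
                  [0.0, 0.7, 0.3],    # occasionally move on to the next one
                  [0.0, 0.0, 1.0]])
emit  = np.array([[0.8, 0.2, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])

poses = [0, 0, 1, 1, 1, 2, 2]         # a plausible lift, frame by frame
print(forward_log_likelihood(poses, start, trans, emit))
# In practice, the clip is labeled with the activity whose model scores best.
```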

Ryoo says the original HMM-based paradigm suffered from limitations: “It used per-frame human features, which were not that reliable.” Plus, the HMMs could only deal with strictly sequential information. An alternative technique that researchers have combined with HMMs is to treat the x, y, and time dimensions as a 3D space and apply templates to shapes in those volumes, as these can be more robust than frame-based techniques, although results are more abstract. In either case, there was no notion of hierarchy in the results. Someone looking for videos of weightlifting would expect them to be classified by that overarching activity, rather than being forced to search for combinations of yanks, pauses, and presses. One way to represent that hierarchy is to use grammar rules that tell the computer how to group actions into activities.
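
The volume-based idea can be illustrated, very loosely, by sliding a small three-dimensional template over a stack of frames and keeping the offset with the highest normalized correlation. Real systems match templates over extracted features rather than raw pixels, and the array shapes below are arbitrary choices for illustration.

```python
# Toy sketch of spatio-temporal template matching: frames stacked into a
# (time, height, width) volume, a small 3D template slid over it.
import numpy as np

def best_match(video_volume, template):
    """Return the (t, y, x) offset where the template correlates best."""
    T, H, W = video_volume.shape
    t_, h_, w_ = template.shape
    tmpl = (template - template.mean()) / (template.std() + 1e-8)
    best, best_pos = -np.inf, None
    for t in range(T - t_ + 1):
        for y in range(H - h_ + 1):
            for x in range(W - w_ + 1):
                patch = video_volume[t:t+t_, y:y+h_, x:x+w_]
                patch = (patch - patch.mean()) / (patch.std() + 1e-8)
                score = float((patch * tmpl).mean())   # normalized correlation
                if score > best:
                    best, best_pos = score, (t, y, x)
    return best_pos, best

# Random data stands in for real video and a learned template.
rng = np.random.default_rng(0)
volume = rng.standard_normal((20, 40, 40))    # 20 frames of 40x40 "video"
template = rng.standard_normal((5, 12, 12))   # 5-frame spatio-temporal template
print(best_match(volume, template))
```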

One example is work presented by Ramanan and Hamed Pirsiavash, now based at the Massachusetts Institute of Technology (MIT), at the CVPR 2014 computer-vision conference in June. In that work, they combined the ability to scan across frames from a sequential video with a hierarchical approach based on a grammar similar to those used to define many programming languages. The grammar they selected could not only group simple actions into composite activities such as weightlifting, but also accommodate variations in technique, handling situations in which the athlete does not pause between the yank and the press, for example.
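
The sketch below is a heavily simplified illustration of the grammar idea, not the segmental grammar parser described in the paper: rewrite rules group primitive action labels into a composite activity, and an alternative production makes the pause optional, so both lift variants parse.

```python
# Toy context-free grammar over primitive action labels, parsed with NLTK's
# stock chart parser. The rules and labels are illustrative only.
import nltk

grammar = nltk.CFG.fromstring("""
  LIFT  -> YANK PAUSE PRESS | YANK PRESS
  YANK  -> 'yank'
  PAUSE -> 'pause'
  PRESS -> 'press'
""")
parser = nltk.ChartParser(grammar)

for sequence in (['yank', 'pause', 'press'],   # textbook lift
                 ['yank', 'press'],            # no pause: still a lift
                 ['press', 'yank']):           # wrong order: rejected
    trees = list(parser.parse(sequence))
    print(sequence, '->', 'LIFT' if trees else 'no parse')
```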

An ongoing issue is how to train systems to handle many different kinds of movement. Del Rincón says, “The main problem is that you can use machine learning to recognize actions for each particular action, but you are limited by the data you have available for training.”

Much of the current work in movement research focuses on working with a standard set of video clips that can form a corpus useful for training other systems. The work by Pirsiavash and Ramanan, for example, provided not only an approach to deciphering the movements of sportspeople, but also a new batch of videos on which other researchers could work. Yet such collections are limited in their ability to represent the conditions systems will face when deployed. “It’s not clear how representative they are of real-life situations. You take them, use them, and apply the same techniques to the video from a surveillance camera, and they don’t work,” says del Rincón.


Del Rincón worked with Maria Santofimia, a researcher from the University of Castilla-La Mancha, Ciudad Real, Spain, and Jean-Christophe Nebel of Kingston University, Surrey, U.K., on a method that could make the results more robust under real-world conditions by providing a degree of contextualization for the actions seen in a video. For example, faced with a video of someone picking up a suitcase in the street, the system can more or less rule out the possibility of it being part of a weightlifting activity. By applying rules from real-life behavior, the system should be able to make more intelligent decisions about what it sees, and trigger alarms if the activities are seen as unusual.

Santofimia says, “What we are trying to do is identify actions that were performed for a reason.”

One option was to use artificial intelligence techniques based on ontologies or expert-system databases that capture the circumstances in which people undertake different activities, such as weightlifting within a gym. “With ontologies, case-based reasoning, or expert systems, the main disadvantage that they have is that they can only focus on the situations they have been taught,” says Santofimia. The researchers opted for an alternative known as commonsense reasoning, which contains a database of much more generalized rules.

In much the same way that context can disambiguate a sentence in a language with an informal grammar, commonsense reasoning provides an additional tool for recognition when combined with the context-free grammars used in many of today’s experiments. “In commonsense, we describe how the world works and we ask the system to reason about the general case,” Santofimia says.
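
A toy sketch of the idea, with rules and predicates invented purely for illustration (not taken from the researchers’ system), might pair the action labels coming out of the vision stage with general rules about where and why people perform them:

```python
# Toy sketch: general context rules filter the interpretations offered by
# the vision stage. All rules and facts are invented for illustration.
RULES = [
    # (condition over the observation, conclusion)
    (lambda o: o['action'] == 'lift_object' and o['location'] == 'gym',
     'weightlifting'),
    (lambda o: o['action'] == 'lift_object' and o['location'] == 'street',
     'picking_up_luggage'),
    (lambda o: o['action'] == 'fall',
     'possible_accident: raise alarm'),
]

def interpret(observation):
    """Return every interpretation whose general rule fits the context."""
    return [concl for cond, concl in RULES if cond(observation)] or ['unknown']

print(interpret({'action': 'lift_object', 'location': 'street'}))
# ['picking_up_luggage']  -- weightlifting is ruled out away from the gym
print(interpret({'action': 'fall', 'location': 'home'}))
# ['possible_accident: raise alarm']
```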

A further advantage of using reasoning is that it can be tuned to cope with situations that are difficult to train for statistically. For example, systems to monitor the elderly will need to watch for them falling over. Templates built from videos of actors who are trained to fall may not capture the movements of accidental falls, and so may fail to trigger.

“When people fall in real life, they are not acting,” says del Rincón. “Commonsense reasoning could figure out that something has gone wrong, without having to learn that precise action.”

There is a further advantage of using more generalized commonsense reasoning, Santofimia claims: “We also deal with different possible situations in parallel for situations where we are not sure if the video processing was giving us the right actions. We keep stories alive until we can prove which one is the most likely. We can’t do that with ontology- or case-based systems.”
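
A minimal sketch of that idea, with made-up stories and scores, keeps several interpretations alive in parallel, reweights each as new (possibly unreliable) detections arrive, and prunes those that become implausible:

```python
# Toy sketch of keeping several candidate "stories" alive while the action
# labels from the vision stage remain uncertain. Numbers are invented.
stories = {'theft': 1.0, 'lost_luggage': 1.0, 'routine_commute': 1.0}

# Each detection carries how well it fits each story (made-up scores).
detections = [
    {'theft': 0.6, 'lost_luggage': 0.5, 'routine_commute': 0.4},  # loiters near bag
    {'theft': 0.8, 'lost_luggage': 0.2, 'routine_commute': 0.1},  # grabs bag, walks off fast
]

for evidence in detections:
    for story in list(stories):
        stories[story] *= evidence[story]   # update every story in parallel
        if stories[story] < 0.05:           # drop stories that become implausible
            del stories[story]

best = max(stories, key=stories.get)
print(stories, '-> most likely:', best)
```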

Much of the current research remains focused on single-person activities. As the area matures, attention will shift to group or crowd behavior, which will likely require more complex grammars to represent movement.

“One important limitation of context-free grammars is that they enforce a sequential structure. In order to capture the richer semantics of high-level activities with concurrent sub-events, such as ‘a thief stealing an object while the other thieves are distracting the owners,’ more expressive power will be required,” says Ryoo, adding there remains plenty to do to provide computers with the ability to recognize and react to human activities.


Group-level understanding is likely to be needed for more advanced applications such as robotic control, where so-called ‘cobots’ work in the same space as humans and so need to comprehend human movements to avoid causing injury or getting in the way. Much remains to be done both in low-level feature extraction and in higher-level, grammar-oriented processing. The pedestrian-detection systems required for automated driving have simpler needs, but Ryoo points out, “Even pedestrian detection is far from being perfect when faced with noise, camera motion, and changes in viewpoint.”

Further Reading

Aggarwal, J.K., Ryoo, M.S.
Human Activity Analysis: A Review, ACM Computing Surveys (2011) http://bit.ly/1m5PZ2J

Pirsiavash, H., Ramanan, D.
Parsing videos of actions with segmental grammars, Proceedings of the 2014 Conference on Computer Vision and Pattern Recognition http://bit.ly/1BG8iyT

Santofimia, M.J., Martinez-del-Rincon, J., Nebel, J-C.
Episodic Reasoning for Vision-Based Human Action Recognition, Scientific World Journal (2014), Article ID 270171 http://bit.ly/1wwac4Y

Yamato, J., Ohya, J., Ishii, K.
Recognizing human action in time-sequential images using hidden Markov model, Proceedings of the 1992 Conference on Computer Vision and Pattern Recognition http://bit.ly/1BG8GNU

Figures

UF1 Figure. Images captured from video of a weightlifter in action, segmented into larger-scale actions; a yank from the ground (left), followed by a pause (center), and then the press action that lifts the bar above his head (right).
