Research and Advances
Architecture and Hardware Multimodal interfaces that flex, adapt, and persist

Multimodal Processing By Finding Common Cause

Commonalities help answer many context-aware questions that arise in human-computer interaction.

Context plays an important role in human-human communication, as anyone who’s shared jokes with a friend will attest. In contrast, current human-machine interaction makes limited use of this important source of information. The goal of our research is to exploit multiple and varied information sources to obtain context descriptors for use in context-aware systems and applications, in order to move from current styles of rigid and context-free human-machine interaction toward more natural and intuitive interaction.

This article discusses techniques that automatically determine whether events in multiple, multimodal input streams share a common cause. One example of audio and video events sharing a common cause is a talking face on screen with the corresponding speech audible in the soundtrack; we revisit this example throughout this article. Techniques for testing whether events in multimodal input streams have a common cause are of interest to us because they are well-suited for answering a variety of context-awareness questions that arise in human-computer interaction. For instance, consider the teleconferencing-related problem of steering a microphone or camera toward the currently active speaker. After clarifying our notions of context-aware systems and discussing tests for common cause, we show that common cause techniques can solve this and other context-awareness problems.

Following [1], we define context, context-awareness (or context-aware computing), and context-aware system behaviors as follows:

Context: Any information characterizing the situation of an entity, where an entity can be a person, a place, or a physical or computational object.

Context-awareness (or context-aware computing): The use of context to provide task-relevant information and/or services to a user, wherever they may be.

Context-aware system behaviors: These include presentation of information and services to a user; automatic execution of a service; and tagging context to information for later retrieval.

Six questions summarize the key aspects of the context of each entity of interest: who? where? when? what? why? and how? We may seek answers in the local system environment, the world stage (for example, via broadcast news), or both.

Here, we focus on answering context questions relating to human activity, since systems aware of human activity and able to interpret a user’s intentions can provide context-aware and personalized support to aid task completion. For example, smart (sensor-equipped) meeting room systems (such as the M4 Multimodal Meeting Manager and Microsoft Distributed Meetings; ~rcutler/DM/dm.htm) strive to monitor participants’ actions and interactions and to infer individual or group goals in order to provide support of one of two types: timely presentation of information and services, such as retrieving relevant information at particular points in the meeting; and tagging recorded information streams with context to enable automatic minute-taking or later browsing or querying.

Context-aware systems for financial researchers and marketers provide personalized information push or automated event response by monitoring context associated with specific entities of interest (the first two behaviors noted earlier). Sources include radio, TV, newspapers, Web casts, chat rooms, Google Zeitgeist, and the IBM WebFountain, which identifies “patterns, trends and relationships in unstructured text data stores such as news feeds … the Worldwide Web, industry-specific data sources and company documents.” Modeling user context may also help ease human interaction with autonomous systems [6]: since user attention spans are limited, context will assist in selecting the relevant subset of the actions of the “ghosts in the machine” to report in different situations.

Our research starts from the premise that key context questions can be more reliably answered through solutions that extract and process multimodal input streams. The benefits of exploiting the complementary information in multimodal inputs are widely reported, and the following two examples are typical. Solutions combining carefully placed multiple distributed microphone pairs with calibrated static and pan-tilt-zoom cameras have improved our ability to identify the current speaker (who) and their location (where); see, for example, [7]. Solutions for speech recognition that use an audio microphone plus a camera to record mouth movements (audiovisual rather than audio-only speech recognition) have improved transcription accuracy, particularly in noisy environments: on a digit recognition task, adding visual information gives an effective SNR gain of 10dB at 10dB babble noise [5]. Improved transcript accuracy translates (indirectly) into improved ability to understand what is happening in the environment.

Multimodal Events with a Common Cause

We term events in multiple (for example, multimodal) input streams that relate to a single underlying event common cause events, or events having a common cause. For example, a spoken countdown (speech event), an explosion (non-speech audio event), and the sight of a rising rocket (visual event) are all realizations of a rocket launch (common cause). Similarly, an on-screen talking face with corresponding speech in the soundtrack or an on-screen drummer with the corresponding drum beats in the soundtrack represent multimodal events with a common cause.

Our definitions must be more precise in practice. We require tests that decide whether multimodal input events share a common cause. A variety of both knowledge-free and knowledge-based tests are possible and when illustrating these, we will use the working example of testing whether audible speech corresponds to a talking face on-screen.

For a specific type of common cause, knowledge-based tests can often be defined using a detailed model of the relationships between the signatures in different information streams. For our working example, we can decide whether test speech and face movements share a common cause based on the goodness of their relationship as assessed by a model relating facial movements and speech sounds [4].

When data is not available for constructing detailed models of signature relationships, an alternative is to use general, knowledge-free techniques for evaluating relationships across multimodal information streams, such as cross-correlation or mutual information (see footnote 1) [2, 4]. We illustrate this approach by describing a mutual-information-based test for our working example. Figure 1 shows a mutual information image reflecting mutual information values between per-pixel intensity changes and the speech cepstrum over a short time window. This image is calculated under the assumption of Gaussian distributions for speech (cepstral coefficients), per-pixel intensity, and joint speech-and-pixel-intensity. The image is interpreted as follows: the brightness of each pixel qualitatively reflects mutual information between that pixel and the speech audio over the time window, with whiter pixels having higher mutual information with (that is, being more closely related to) the speech. As hoped, when the left speaker is active, the highest per-pixel mutual information occurs around the left speaker’s mouth and jaw. The test for whether a particular face shares a common cause with the audible speech then considers the sum of per-pixel mutual information values around the hypothesized speaker’s mouth [4].
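To make the Gaussian mutual information image more concrete, here is a minimal Python sketch using NumPy. It is an illustration under stated assumptions, not the implementation used in [4]: it assumes the audio has already been reduced to a (T, D) array of cepstral features and the video to a (T, H, W) array of per-pixel values aligned frame by frame, and the function name and the small regularization constant eps are hypothetical. For jointly Gaussian variables, the mutual information between a scalar pixel value v and an audio vector a has the closed form I(a; v) = 0.5 * log(var(v) / var(v | a)), where var(v | a) is the variance left in v after the best linear prediction from a.

```python
import numpy as np

def gaussian_mi_image(audio, video, eps=1e-6):
    """Per-pixel Gaussian mutual information (in nats) between audio and pixels.

    audio: (T, D) array, e.g. cepstral coefficients per frame (assumed format).
    video: (T, H, W) array, e.g. per-pixel intensity changes (assumed format).
    eps:   hypothetical regularizer that keeps covariances well conditioned.
    """
    T, D = audio.shape
    _, H, W = video.shape

    a = audio - audio.mean(axis=0)                 # center the audio features
    v = video.reshape(T, H * W)
    v = v - v.mean(axis=0)                         # center each pixel's time series

    cov_aa = a.T @ a / (T - 1) + eps * np.eye(D)   # (D, D) audio covariance
    cov_av = a.T @ v / (T - 1)                     # (D, P) audio-pixel cross-covariance
    var_v = (v * v).sum(axis=0) / (T - 1) + eps    # (P,) pixel variances

    # Variance of each pixel explained linearly by the audio:
    # c^T cov_aa^{-1} c for every column c of cov_av.
    explained = np.einsum("dp,dp->p", cov_av, np.linalg.solve(cov_aa, cov_av))
    residual = np.maximum(var_v - explained, eps)  # var(v | a), floored at eps

    mi = 0.5 * np.log(var_v / residual)            # closed form for Gaussians
    return mi.reshape(H, W)
```

Displaying the returned array as a grayscale image gives a picture in the spirit of Figure 1: pixels whose values are well predicted by the audio appear brighter.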

Common Cause Finding for Extracting Context Descriptors

We have used common cause finding to answer questions who and where and to indirectly improve answers to what by improving speech transcription performance. Three examples related to speech and facial movements include:

Smart meeting room. Consider a smart meeting room in which one camera per person records frontal facial images and a single microphone records all participants’ speech. We wish to locate the active speaker (who, where) in order to, for example, redirect a steerable microphone. To accomplish this, we first compute a mutual information image for each input camera stream with the speech signal over a short time window (as in Figure 1). We then search each mutual information image for the compact mouth-sized region having the highest sum of per-pixel mutual information values. Finally, we steer the microphone to the speaker for which this quantity is highest. (In visually noisy environments, one might first isolate the speaker mouth regions and then compare the mutual information sums in the mouth regions across speakers.) We used an artificial test set to show how this algorithm outperforms a knowledge-based scheme. The test set comprises 1,016 test cases; each test case comprises four talking faces paired with the same speech soundtrack, of which only one face shares a common cause with the speech. For the task of finding the “true” face-speech pairing in each of the 1,016 test cases, the Gaussian-based mutual information approach gives 82% accuracy versus 45% for a knowledge-based approach modeling the relationship between facial movements and speech [4]. This advantage is partly due to the simplicity of the knowledge-free scheme; unlike the knowledge-based case, no separate training data is required.
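The region search and speaker selection steps described above might be sketched as follows in Python with NumPy. This is a hedged illustration only: it assumes one mutual information image per camera is already available as a 2-D array (however computed), and the function names and the fixed rectangular "mouth-sized" window dimensions are hypothetical choices, not details from [4]. An integral image (summed-area table) is used so that every window sum costs four lookups.

```python
import numpy as np

def best_region(mi_image, box_h, box_w):
    """Return (best_sum, (row, col)) of the box_h x box_w window with the largest sum.

    (row, col) is the top-left corner of the winning window.
    """
    H, W = mi_image.shape
    # Integral image with a zero border: ii[r, c] = sum of mi_image[:r, :c].
    ii = np.zeros((H + 1, W + 1))
    ii[1:, 1:] = mi_image.cumsum(axis=0).cumsum(axis=1)
    # Sum of every window via four corner lookups.
    sums = (ii[box_h:, box_w:] - ii[:-box_h, box_w:]
            - ii[box_h:, :-box_w] + ii[:-box_h, :-box_w])
    r, c = np.unravel_index(sums.argmax(), sums.shape)
    return float(sums[r, c]), (int(r), int(c))

def pick_active_speaker(mi_images, box_h=2, box_w=3):
    """Return (camera_index, scores): the camera whose best region scores highest."""
    scores = [best_region(img, box_h, box_w)[0] for img in mi_images]
    return int(np.argmax(scores)), scores

# Usage: three hypothetical cameras; only camera 1 has a coherent bright region.
images = [np.zeros((6, 8)) for _ in range(3)]
images[0][0, 0] = 2.0            # isolated bright pixel
images[1][2:4, 3:6] = 1.0        # compact mouth-sized patch
speaker, scores = pick_active_speaker(images)
```

Summing over a compact region rather than taking the single brightest pixel is what lets the patch in camera 1 beat the isolated bright pixel in camera 0.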

Teleconferencing. Figure 2 (top) comes from video that might arise in a single-camera, single-microphone teleconferencing scenario. The right-hand person (with a white square over the mouth) is talking. The bottom of the figure shows the corresponding mutual information image between per-pixel intensities and speech cepstrum. We again locate the active speaker (who, where) by seeking the rectangular region (or mouth region) with the highest mutual information sum. In a speaker localization task for which accuracy is defined as a point estimate of the mouth position falling within a 100×100-pixel square centered on the active speaker’s mouth (see white box in Figure 2), this approach gives 65% accuracy versus 50% for a video-only baseline technique [4]. This demonstrates the benefit of working with multiple input modalities rather than a single unimodal input stream.
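The accuracy criterion used in this localization task can be written down in a few lines of Python. The sketch below is an assumption-laden illustration: the function name, the (x, y) coordinate convention, and the choice to count estimates lying exactly on the square's boundary as correct are all hypothetical, since the article does not specify them.

```python
def localization_accuracy(estimates, mouth_centers, box=100):
    """Fraction of (x, y) estimates inside a box x box square centered on the true mouth."""
    half = box / 2
    hits = sum(
        abs(ex - mx) <= half and abs(ey - my) <= half   # inside the square on both axes
        for (ex, ey), (mx, my) in zip(estimates, mouth_centers)
    )
    return hits / len(estimates)

# Usage with made-up coordinates: two of the three estimates fall inside the square.
accuracy = localization_accuracy(
    estimates=[(100, 100), (160, 100), (100, 149)],
    mouth_centers=[(100, 100), (100, 100), (100, 100)],
)
```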

Automatic annotation of monologues in broadcast video. A NIST-sponsored Video TREC 2002 benchmark task required video segments (“shots”) to be ranked according to whether they contained a monologue, defined as a shot containing a talking face for which the corresponding speech is heard. To solve this problem (who, what), we first use a face detector to identify candidate faces in shots (if any), then process the shot audio to identify speech (if any); finally, for shots with speech and face, we rank shots based on plausibility of shared common cause by using mutual information sums. This algorithm improves average precision by over 50% relative to simply seeking shots with at least one face and speech [3] and also benchmarked best among 18 monologue detectors at Video TREC 2002. We now use this technique for identifying speaker-turn points between off-screen interviewers and on-screen interviewees in a large archive (120+ terabytes) of interviews. Reliable turn-pointing is a key step toward analyzing what happens in each interview: the previous lack of reliable turn-point identification was known to negatively impact automatic speech transcription accuracy on this corpus.
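The ranking step and the evaluation measure can be illustrated with a short Python sketch. It assumes the face detector, speech detector, and mutual information sum have already been run for each shot; the dictionary keys, function names, and example values are hypothetical, and the average precision shown is the common non-interpolated form, which may differ in detail from the official Video TREC scoring.

```python
def rank_shots(shots):
    """Rank shots with both a face and speech by descending mutual information sum.

    shots: list of dicts with hypothetical keys 'id', 'has_face', 'has_speech', 'mi_sum'.
    """
    candidates = [s for s in shots if s["has_face"] and s["has_speech"]]
    ranked = sorted(candidates, key=lambda s: s["mi_sum"], reverse=True)
    return [s["id"] for s in ranked]

def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision of a ranked list against a relevant set."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for rank, shot_id in enumerate(ranked_ids, start=1):
        if shot_id in relevant:
            hits += 1
            total += hits / rank        # precision at this relevant item
    return total / len(relevant) if relevant else 0.0

# Usage with made-up shots: "b" is dropped because no speech was detected.
shots = [
    {"id": "a", "has_face": True, "has_speech": True, "mi_sum": 0.9},
    {"id": "b", "has_face": True, "has_speech": False, "mi_sum": 5.0},
    {"id": "c", "has_face": True, "has_speech": True, "mi_sum": 0.2},
    {"id": "d", "has_face": True, "has_speech": True, "mi_sum": 0.5},
]
ranking = rank_shots(shots)
ap = average_precision(ranking, relevant_ids={"a", "c"})
```

The baseline mentioned above corresponds to stopping after the candidate filter; the mutual information sum adds the ordering among shots that pass it.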

These three examples successfully use common cause tests for answering context-awareness questions. Note that these problems all implicitly involve a ranking of different pairs of speech-facial movement input streams according to how plausibly they are generated by a common cause. The first two algorithms rank among the speaker faces in the environment; the monologue labeling algorithm ranks shots according to how plausibly each contains a talking face. In all cases, we implicitly assume there is a common cause explanation for one of the audiovisual examples examined: our solution is to find the example having a common cause. We have also attempted to use common cause tests in situations in which a common cause may or may not exist, such as distinguishing voice-over video segments (where on-screen talking faces are not heard in the soundtrack) from speaker-on-screen video segments (for example, a monologue or a dialogue) as illustrated by the extracted frame in Figure 3. In this case, we do not know if the test audio-video pair has a common cause: we must make an absolute decision as to whether this is the case. Our existing techniques have proved inadequate for discriminating common cause events from non-common cause events [4].

We should note the techniques discussed earlier seek common causes within short time intervals. Common cause ideas also apply across longer time intervals and across space. For example, a system receiving multiple wearable video and audio inputs from person A and person B might seek common cause events across the A and B inputs; this provides information beyond that obtainable from the individuals’ GPS and time-stamping information. Successful identification of cross-person common cause events could help us to infer information about people’s interactions, a useful step toward understanding how or why.

Conclusion

Techniques for finding common cause have proven a useful addition to our arsenal for extracting context descriptors for use in context-aware applications. However, there remain situations for which we have not yet identified useful common cause tests: specifically, cases that involve an absolute decision about whether a set of events shares a common cause, rather than cases that require us to find which of two or more sets of events shares a common cause.

In conclusion, we note common cause finding has been particularly useful for answering questions regarding who, where, when, and (indirectly) what. More robust solutions to these questions continue to be of interest. In the longer term, though, additional techniques for extracting context information such as why and how things happen must also be developed. This will require monitoring the affect, actions, and interactions of participants in the environment plus cognitive social modeling followed by automatic inferences. These issues are only just beginning to be addressed, through programs such as the European Union Information Society Technologies Programme.


Figure 1. Mutual information image.

Figure 2. Single camera and microphone teleconferencing image (CUAVE corpus).

Figure 3. Voice-over or on-screen speakers?

References

    1. Dey, A.K., and Abowd, G.D. Towards a better understanding of context and context-awareness. GVU Technical Report GIT-GVU-99-22. Georgia Institute of Technology, Atlanta.

    2. Fisher III, J.W., and Darrell, T. Informative subspaces for audio-visual processing: High-level function from low-level fusion. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2002).

    3. Iyengar, G., Nock, H.J., and Neti, C. Audio-visual synchrony for detection of monologues in video archives. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2003).

    4. Nock, H.J., Iyengar, G., and Neti, C. Speaker localisation using audio-visual synchrony: An empirical study. In Proceedings of the International Conference on Image and Video Retrieval (2003).

    5. Potamianos, G., Neti, C., Gravier, G., Garg, A., and Senior, A.W. Recent advances in the automatic recognition of audio-visual speech. In Proceedings of the IEEE 91, 9 (2003). IEEE Press, NY.

    6. Russell, D.M., Maglio, P.P., Dordick, R., and Neti, C. Dealing with ghosts: Managing the user experience of autonomic computing. IBM Systems Journal 42, 1 (2003).

    7. Yoshimi, B.H., and Pingali, G.S. A multimodal speaker detection and tracking system for teleconferencing. In Proceedings of the ACM Conference on Multimedia (2002). ACM Press, NY.

    Footnote 1. Mutual information measures the amount of information one random variable (for example, a random variable related to speech) tells us about another (for example, a random variable related to video of a talking face).
