Visualizing Sound

recovering sound from video, illustration — When sound hits the empty bag of chips it causes extremely small surface vibrations that are extracted from high speed video to reconstruct the sound that produced them.

In the movies, people often discover their room is bugged when they find a tiny microphone attached to a light fixture or the underside of a table. Depending on the plot, they can feed their eavesdroppers false information, or smash the listening device and speak freely. Soon, however, such tricks may not suffice, thanks to efforts to recover speech by processing other types of information.

Researchers in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT), for instance, reported at last year’s SIGGRAPH meeting on a method to extract sound from video images. Among other tricks, they were able to turn miniscule motions in the leaves of a potted plant into the notes of “Mary Had a Little Lamb,” and to hear a man talking based on the tiny flutterings of a potato chip bag.

The idea is fairly straightforward. Sound waves are just variations in air pressure at certain frequencies, which cause physical movements in our ears that our brains turn into information. The same sound waves can also cause tiny vibrations in objects they encounter. The MIT team merely used high-speed video to detect those motions, which were often too small for the human eye to notice, and then applied an algorithm to translate the vibrations back into sound.

The work grew out of a project in MIT computer scientist William Freeman’s lab that was designed not for eavesdropping, but simply to amplify motion in video. Freeman’s hope was to develop a way to remotely monitor infants in intensive care units by watching their breathing or their pulse. That project looked for subtle changes in the phase of light from pixels in a video image, then enhanced those changes to show motion that might be otherwise unnoticeable to the naked eye.

“The focus was on amplifying and visualizing these tiny motions in video,” says Abe Davis, a Ph.D. student in computer graphics, computational photography, and computer vision at MIT, and lead author of the sound recovery paper. “It turns out in a lot of cases it’s enough information to infer what sound was causing it.”

The algorithm, he says, is relatively simple, but the so-called visual microphone can take a lot of processing power, simply because of the amount of data involved. To capture the frequencies of human speech, the team used a high-speed camera that takes images at thousands of frames per second (fps). In one test, for instance, they filmed a bag of chips at 20,000 fps. The difficulty with such high frame rates, aside from the sheer number of images the computer has to process, is that they lead to very short exposure times, which means there must be a bright light source. At those rates, the images contain a lot of noise, making it more difficult to extract a signal.

The team got a better signal-to-noise ratio when they filmed the bag at 2,200 fps, and they improved it further with processing to remove noise. Yet even with a standard-speed camera, operating at only 60 fps—well below the 85-255 Hz frequencies typical of human speech—they were able to recover intelligible sounds. They did this by taking advantage of the way many consumer video cameras operate, with a so-called rolling shutter that records the image row by row across the camera’s sensor, so that the top part of the frame is exposed before the bottom part. “You have information from many different times, instead of just from the time the frame starts,” explains Neal Wadwha, a Ph.D. student who works with Davis. The rolling shutter, he says, effectively increases the frame rate by eight or nine times.

Speech recovered using the rolling shutter is fairly garbled, Wadwha says, but further processing with existing techniques to remove noise and enhance speech might improve it. The method was good enough, however, to capture “Mary Had a Little Lamb” again. “You can recover almost all of that because all the frequencies of that song are under 500 Hz,” he says.

Wadwha also has managed to reduce the processing time for this work, which used to take tens of minutes. Initially, researchers looked at motions at different scales and in different orientations. By picking just one view, however, they eliminated about three-quarters of the data while getting almost as good a result, Wadwha says. He is now able to process 15 seconds of video in about 10 minutes, and he hopes to reduce that further.

To their surprise, the researchers found that objects like a wine glass, which ring when struck, are not the best to sources to focus on. “We had this loose notion that things that make good sounds could make good visual microphones, and that’s not necessarily the case,” Davis says. Solid, ringing objects tend to produce a narrow range of frequencies, so they provide less information. Instead, light, thin objects that respond strongly to the motions of air—the potato bag, for instance, or even a piece of popcorn—are much more useful. “If you tap an object like that, you don’t hear a very musical note, because the response is very broad-spectrum.”

Sensing Smartphones

This imaging work is not the only way to derive sound information from vibrations. Researchers in the Applied Crypto Group at Stanford University have written software, called Gyrophone, that turns movements in a smartphone’s gyroscope into speech. The gyroscopes are sensitive enough to pick up minute vibrations from the air or from a surface on which a handset is resting. The devices operate at 200Hz, within the frequency range of the human voice, although in any sort of signal processing, distortions creep in at frequencies above half the sampling rate, so only sounds up to 100Hz are distinguishable.

“We had this loose notion that things that make good sounds could make good visual microphones, and that’s not necessarily the case.”

The reconstructed sound is not good enough to follow an entire conversation, says Yan Michalevsky, a Ph.D. student in the Stanford group, but there is still plenty of information to be gleaned. “You can still recognize information such as certain words and the gender of the speaker, or the identity in a group of known speakers,” he says. That could be useful if, say, an intelligence agency had a speech sample from a potential terrorist it wanted to keep tabs on, or certain phrases for which it wanted to listen.

Researchers used standard machine learning techniques to train the computer to identify specific speakers in a group of known individuals, as well as to distinguish male from female speakers. They also trained it with a dictionary of 11 words—the numbers zero through 10, plus “oh.” That could be useful to someone trying to steal PINs and credit card numbers. “Knowing even a couple of digits from out of this longer number would help you to guess,” Michalevsky says. “The main thing here is extracting some sensitive information.”

He said it would be fairly easy to place a spy program on someone’s phone, disguised as a more innocent app. Most phones do not require the user to give permission to access the gyroscope or the accelerometer. On the other hand, simply changing the permissions requested could defend against the attack. Additionally, many programs that require the gyroscope would work fine with much lower sampling rates, rates that would be useless for an eavesdropper.

Sparkling Conversation

Another technique for spying on conversations, the laser microphone, has been around for some time—the CIA reportedly used one to identify the voice of Osama bin Laden. The device fires a laser beam through a window and bounces it off either an object in the room or the window itself. An interferometer picks up vibration-induced distortions in the reflected beam and translates those into speech. Unfortunately, the setup is complicated: equipment has to be arranged so the reflected beam would return directly to the interferometer, it is difficult to separate speech from other sounds, and it only works with a rigid surface such as a window.

Zeev Zalevsky, director of the Nano Photonics Center at Bar-Ilan University, in Ramat-Gan, Israel, also uses a laser to detect sound, but he relies on a different signal: the pattern of random interference produced when laser light scatters off the rough surface of an object, known as speckle. It does not matter what the object is—it could be somebody’s wool coat, or even his face. No interferometer is required. The technique uses an ordinary high-speed camera.

“The speckle pattern is a random pattern we cannot control, but we don’t care,” Zalevsky says. All he needs to measure is how the intensity of the pattern changes over time in response to the vibrations caused by sound. Because he relies on a small laser spot, he can focus his attention directly on a speaker and ignore nearby noise sources. The laser lets him listen from distances of a few hundred meters. The technique even works if the light has to pass through a semi-transparent object, such as clouded glass used in bathroom windows. It can use infrared lasers, which produce invisible beams that will not hurt anyone’s eyes. Best of all, Zalevsky says, “The complexity of the processing is very low.”

He is less interested in the spy movie aspect of the technology than in biomedical applications. It can, for instance, detect a heartbeat, and might be included in a bracelet that would measure heart rate, respiration, and blood oxygen levels. He’s working with a company to commercialize just such an application.

Davis, too, sees other uses for his video technique. It might provide a way to probe the characteristics of a material without having to touch it, for instance. Or it might be useful in video editing, if an editor needs to synchronize an audio track with the picture. It might even be interesting, Davis says, to use the technique on films where there is no audio, to try and recover sounds from the silence.

Zalevsky’s technique relies on the pattern of random interference produced when laser light scatters off the rough surface of an object.

What it will not do, he says, is replace microphones, because the existing technology is so good. However, his visual microphone can fill in the gaps when an audio microphone is not available. “It’s not the cheapest, fastest, or most convenient way to record sound,” Davis says of his technique. “It’s just there are certain situations where it might be the only way to record sound.”

Figures

Figure. In (a), a video camera aimed at a chip bag from behind soundproof glass captures the vibrations of a spoken phrase (a single frame from the resulting 4kHz video is shown in the inset). Image (b) shows a spectrogram of the source sound recorded by a standard microphone next to the chip bag, while (c) shows the spectrogram of the recovered sound, which was noisy but understandable.