Brain Science Helps Computers Separate Speakers in a Crowded Room

The mechanics of the human brain and hearing system are providing inspiration for the development of algorithms that could lead to better hearing aids, with researchers combining neural networks and techniques that mirror biological behavior. Yet researchers on the computer science side are cautious about pushing the emulation of biology too far.

For decades, researchers in both neuroscience and artificial intelligence have been fascinated by the so-called ‘cocktail party problem’. Cognitive scientist Colin Cherry coined the term in 1953 while working at the Massachusetts Institute of Technology (MIT) on a project for the U.S. Office of Naval Research. The aim was to find out how humans are able to “recognize what one person is saying when others are speaking at the same time.”

Humans can separate voices that lie in the same frequency range, and may even pull out sentences from background conversation and other noises that are louder. University of Surrey research fellow Andrew Simpson says trying to find out how to do this artificially is enticing: “The cocktail party paradigm is compelling because we are trying to equal the brain on its own terms.”

Such research may lead to better hearing aids that suppress noise and background chatter while boosting, and even cleaning up, the speech of the people talking to the wearer. Researchers see other applications in human-machine interaction, particularly for robots that work closely with people and need to distinguish background chatter from commands meant for them.

Research from biology is helping to shape the strategy for a number of projects that aim to solve the cocktail party problem. Biologists have found the brain learns to separate out individual voices using a number of cues, including tell-tale harmonic frequencies in an individual voice, as well as the understanding of how sounds change over time because of the shaping of words and consonants.

In 2012, Nai Ding and Jonathan Simon of the University of Maryland published research that claimed ‘phase locking’ is an important method humans use to home in on one speaker; the brain detects the rhythms of that speaker’s speech and filters out others. Repeating sounds also play a role, according to Josh McDermott of MIT’s Department of Brain and Cognitive Sciences, following research that found the brain stores information on sonic textures, using them to pick out similar sounds from the many others the brain may hear at the same time. Similar techniques can be used by artificial systems, researchers say.

Simpson says, “Mimicking nature gives us a sensible architecture for our neural networks and a cost-effective architecture for our sensors.”

DeLiang Wang of Ohio State University’s Center for Cognitive and Brain Studies adds, “The knowledge of auditory physiology and psychology has helped propel the field, but its role shouldn’t be overstated, and auditory physiology serves more as inspiration than detailed instruction for algorithm design. Domain knowledge does not have to come from auditory physiology or psychophysics. Much can be gained by analyzing the problem itself, as David Marr advocated in his influential study in computational vision decades ago.”

Wang points to the use of the mel scale of pitches in this kind of research to convert raw audio into a time-frequency representation that can be used as the input to other algorithms. The mel scale mirrors the nonlinear way in which the ear interprets pitch. Wang’s team performs a further conversion, called cepstral analysis, which shows harmonic relationships between frequencies more clearly.

“The mel scale is inspired by psychoacoustics, although more from behavioral studies than physiological studies,” says Wang, cautioning, “but literally following auditory neuroscience would not lead one to cepstral analysis or learning-based methods.”
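As a rough illustration of the two conversions described above, the sketch below maps frequencies onto the mel scale and computes a real cepstrum for one frame of audio. It is not taken from Wang's system. The mel formula is one common variant among several, and the function names, the 16 kHz sample rate, and the synthetic 200 Hz harmonic tone are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f_hz):
    # One widely used mel formula; other variants exist (assumption).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel above.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def real_cepstrum(frame):
    # Inverse FFT of the log magnitude spectrum of one windowed frame.
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    return np.fft.irfft(log_mag, n=len(frame))

# Synthetic test tone: harmonics of a 200 Hz fundamental (hypothetical input).
sr = 16000
t = np.arange(1024) / sr
frame = sum(np.sin(2 * np.pi * 200 * k * t) for k in range(1, 40))

cep = real_cepstrum(frame)

# Regularly spaced harmonics show up as a peak at the pitch period.
# Search quefrencies corresponding to pitches between 80 Hz and 400 Hz.
lo, hi = sr // 400, sr // 80
peak_quefrency = lo + int(np.argmax(cep[lo:hi]))
estimated_f0 = sr / peak_quefrency
```

The harmonic spacing of the tone appears as a single cepstral peak at the pitch period, which is the sense in which cepstral analysis exposes harmonic relationships.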

The biggest breakthrough in recent years, Wang notes, is the use of systems that can be trained on large quantities of data. Simpson, Wang, and others have applied deep-learning neural networks (DNNs) to the problem. These systems are themselves inspired by neurons in the brain, although they are extremely simplified compared to their biological equivalents.

Simpson’s work includes an algorithm called Deep Transform that, as well as separating out the audio associated with a single speaker, uses re-synthesis to recreate speech fragments that are heavily obscured by noise in the source audio.

“The use of DNNs has elevated performance to a level that can rival human performance in limited situations,” Wang claims.

The DNNs used in this line of research are based on techniques developed by Geoffrey Hinton and Ruslan Salakhutdinov at the University of Toronto for training large neural networks. The digital neurons process data, such as speech samples or the strength of a frequency at a point in time, by multiplying them with trained weights and feeding them to neurons in subsequent layers that filter the data and recognize structure in the input.
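The layered multiply-and-pass-on computation can be sketched in a few lines. This is a minimal toy, not the architecture used by any of the researchers quoted here: the layer sizes, the random untrained weights, the ReLU and sigmoid activations, and the idea of reading the output as a per-frequency mask are all assumptions made for illustration.

```python
import numpy as np

def forward(x, layers):
    # layers is a list of (weights, bias) pairs, one per layer.
    h = x
    for W, b in layers[:-1]:
        # Hidden layer: weighted sum of inputs followed by a ReLU nonlinearity.
        h = np.maximum(0.0, h @ W + b)
    W, b = layers[-1]
    # Output layer: sigmoid keeps every value strictly between 0 and 1.
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))

rng = np.random.default_rng(0)
n_bins = 64  # hypothetical number of frequency bins in one time frame

# Untrained, randomly initialized weights (stand-ins for trained ones).
layers = [
    (rng.normal(0.0, 0.1, (n_bins, 32)), np.zeros(32)),
    (rng.normal(0.0, 0.1, (32, n_bins)), np.zeros(n_bins)),
]

frame = np.abs(rng.normal(size=n_bins))  # fake magnitude spectrum
mask = forward(frame, layers)            # one value per frequency bin
separated = mask * frame                 # attenuate bins the mask marks low
```

With trained weights, a mask of this kind would keep the bins dominated by the target speaker and suppress the rest. With the random weights used here the output is meaningless and only the data flow is shown.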

The networks are trained by iteratively adjusting the weights that each neuron applies to its input data to try to minimize the error between the output of the entire network and the desired result. The features that the neural networks learn depend strongly on the way the training data is applied.
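The iterative weight adjustment can be shown on a deliberately tiny case. The sketch below trains a single linear layer by gradient descent on a mean-squared error. Real DNNs propagate the same kind of gradient back through many nonlinear layers, and the data, learning rate, and step count here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: inputs X and the "desired result" Y from a hidden linear rule.
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=(8, 1))
Y = X @ true_w

W = np.zeros((8, 1))   # weights start untrained
learning_rate = 0.1    # assumed step size
losses = []

for step in range(200):
    pred = X @ W                            # network output
    err = pred - Y                          # difference from desired result
    losses.append(float(np.mean(err ** 2)))
    grad = 2.0 * X.T @ err / len(X)         # gradient of the mean-squared error
    W -= learning_rate * grad               # nudge weights downhill
```

Each pass measures the error, computes how it changes with every weight, and moves the weights a small step in the direction that reduces it.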

Deep learning could be locking into the same low-level speech features as those used by the brain, but without being explicitly programmed to do so. Simpson says of the type of harmonic analysis performed by the brain: “There is no reason to presume that present deep learning approaches do not inherently learn such filters; they probably do. In a recent paper, I demonstrated that DNN learning is mirrored in the crest factor of the filters learned; filters get sharper as the network learns.”

Other features such as tracking speech rhythms may call for bigger, more processing-intensive DNNs. Simpson adds: “If a deep neural network is exposed during training to useful information at this level of abstraction, and features sufficient layers to demodulate the envelopes, then it will learn filters to exploit it. These filters would be broadly equivalent to the ‘modulation filters’ of the auditory system. However, in practice, this means very long time windows.”

Being able to deal with those longer time windows may call for the use of more complex neural-network structures. However, Simpson says of these types of neural networks that their “abstract learning seems poorly understood at present.”

DNNs can be trained on audio data, but researchers tend to find better results if the inputs receive some preprocessing. Simpson says, “I’ve implemented versions of Deep Transform both in the time domain and using time-frequency representations. Results are better using the time-frequency representations. The same lesson comes from biology.”

Adds Wang, “A successful learning algorithm needs both good features and a good learning machine. DNN plays the latter role. Preprocessing, or feature extraction, is as important. Typical feature extraction involves time-frequency analysis and subsequent filtering, to extract amplitude-modulation patterns, for example.”
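The feature-extraction steps Wang lists can be sketched as a short-time Fourier transform followed by a look at how one frequency band's amplitude varies over time. This is an illustrative sketch only. The frame length, hop size, 8 kHz sample rate, and the synthetic tone with a 4 Hz amplitude modulation are assumptions, not parameters from any system mentioned in the article.

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    # Slice the signal into overlapping windowed frames and take each FFT.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len // 2 + 1)

sr = 8000
hop = 128
t = np.arange(2 * sr) / sr

# Hypothetical input: 1 kHz tone whose loudness rises and falls 4 times a second.
signal = (1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)

spec = stft_magnitude(signal, frame_len=256, hop=hop)

# Follow the 1 kHz band over time to get its amplitude envelope.
carrier_bin = int(round(1000 * 256 / sr))
envelope = spec[:, carrier_bin]

# A second FFT along time reveals the amplitude-modulation rate.
frame_rate = sr / hop
mod_spec = np.abs(np.fft.rfft(envelope - envelope.mean()))
mod_freqs = np.fft.rfftfreq(len(envelope), d=1.0 / frame_rate)
dominant_mod_hz = float(mod_freqs[np.argmax(mod_spec)])
```

The slow modulation that is hard to see in the raw waveform becomes a single clear peak once the signal is viewed band by band over time.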

Although DNNs offer a good technique for handling monaural audio, some researchers, particularly those looking at robotic applications, have focused on the spatial cues made possible by the use of two or more microphones, moving further away from what biological systems do. Wang says although the human brain could use the difference in sound reaching the left and right ears to handle the cocktail party problem, these cues are used mainly just to identify where a speaker is, and they are easily confused by echoes from walls. However, fed with clear binaural data, Simpson’s Deep Transform seemed to train itself primarily on spatialization cues.

Simpson says, “The deep learning approach is very general and will learn whatever abstract feature space it is able to exploit to minimize the cost function during training. The binaural gradient provides a very convenient separation plane and I would interpret the resulting binaural Deep Transform as having learned ‘spatial filters’.”
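One of the simplest spatial cues in two-channel audio is the difference in arrival time between the two ears or microphones. The sketch below estimates it by cross-correlation. It is a generic signal-processing illustration, not part of Deep Transform, and the function name, the white-noise source, and the 5-sample delay are assumptions.

```python
import numpy as np

def estimate_delay_samples(left, right, max_lag):
    # Returns the lag (in samples) by which `right` trails `left`.
    # A positive value means the sound reached the left channel first.
    lags = np.arange(-max_lag, max_lag + 1)
    # Ignore the edges so samples wrapped around by np.roll are never used.
    core = slice(max_lag, len(left) - max_lag)
    scores = [np.dot(left[core], np.roll(right, -lag)[core]) for lag in lags]
    return int(lags[int(np.argmax(scores))])

rng = np.random.default_rng(2)
sr = 16000
source = rng.normal(size=4000)  # hypothetical broadband source

true_delay = 5  # samples; the source is closer to the left microphone
left = source
right = np.concatenate([np.zeros(true_delay), source[:-true_delay]])

delay = estimate_delay_samples(left, right, max_lag=20)
delay_seconds = delay / sr
```

In a reverberant room the echoes add competing correlation peaks, which is one reason these cues are easily confused by reflections from walls.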

Using more traditional signal processing approaches rather than deep learning, Tobias May and colleagues from the University of Oldenburg and Philips Research used a combination of voice-frequency detection and binaural localization to filter out individual speakers within a room of many, among other noise sources and echoes that can make localization difficult.

Other approaches have even made use of confusing echoes. Martin Vetterli and colleagues at the Swiss Federal Institute of Technology in Lausanne (EPFL) borrowed a concept from cellular communications. The 3G cellular protocols make wide use of the rake-receiver algorithm, which is highly effective at separating the main signals from echoes caused by radio waves bouncing off the sides of buildings and other surfaces. The ability to detect the echoes turned out to be useful in speech processing; even if a loud interferer stands in the way of the person to whom the system is tuned, the algorithm can analyze the echoes to successfully separate and amplify that person's speech, according to the researchers.
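The core rake idea, treating each echo as an extra copy of the wanted signal and realigning the copies before adding them, can be sketched in a single channel. This is a heavily simplified illustration and not the EPFL team's method, which works with microphone arrays. The delays and gains of each path are assumed to be known in advance, and the interferer is modeled as white noise.

```python
import numpy as np

def delayed(x, d):
    # Shift x later in time by d samples, padding the start with zeros.
    return np.concatenate([np.zeros(d), x[: len(x) - d]])

def rake_combine(received, delays, gains):
    # Pull each known path (direct sound and echoes) back into alignment
    # and sum the copies, weighting stronger paths more heavily.
    n = len(received)
    out = np.zeros(n)
    for d, g in zip(delays, gains):
        aligned = np.zeros(n)
        aligned[: n - d] = received[d:]
        out += g * aligned
    return out / np.sum(np.square(gains))

rng = np.random.default_rng(3)
source = rng.normal(size=20000)   # hypothetical target speech stand-in
delays = [0, 40, 95]              # direct path plus two echoes (assumed known)
gains = [1.0, 0.8, 0.7]           # assumed path strengths

interferer = 2.0 * rng.normal(size=len(source))  # loud competing noise
received = sum(g * delayed(source, d) for d, g in zip(delays, gains)) + interferer

combined = rake_combine(received, delays, gains)

# Compare how closely each signal matches the clean source.
corr_raw = float(np.corrcoef(source, received)[0, 1])
corr_rake = float(np.corrcoef(source, combined)[0, 1])
```

In this toy setup the combined signal correlates more strongly with the clean source than the raw recording does, because the realigned echoes reinforce the target while the interferer adds up incoherently.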

Other teams have used additional microphones to give algorithms more spatial cues. Simpson suggests greater gains might come from combining camera input with the audio data, so that robots can link lip movements with sounds.

Further advances can come from building larger deep-learning systems, Simpson says. “We will see upscaling, from neural networks of a billion parameters, to massive parallel GPU farms running hundreds of thousands of deep nets with cumulative parameters reaching to the range of trillions. Source separation is entering the arms race in a big way.”

Figures

Figure. Flowchart of the simulation setup used for numerical experiments on an acoustic rake receiver to validate theoretical results of its efficiency in suppressing an interferer.