Thinking Deeply to Make Better Speech

A humanoid robot, named Aiko Chihira by its creators at Toshiba and Osaka University, at a 2015 trial in Tokyo's Mitsukoshi department store. Toshiba says it will incorporate speech recognition and synthesis into the robot by 2020.

Machines that speak are nothing new. Siri has been answering questions from iPhone users since 2011, and text-to-voice programs have been around even longer. People with speaking disabilities—most famously, Stephen Hawking—have used computers to generate speech for decades. Yet synthesizing speech that sounds as natural as if spoken by a human is still an elusive goal, although one that appears to be getting closer to reality.

If you listen to the latest version of Apple’s Siri, “it sounds pretty amazing,” says Simon King, a professor of speech processing and director of the Centre for Speech Technology Research at the University of Edinburgh. Apple, Google, and Microsoft all have commercial speech applications that read text in a neutral but reasonable-sounding tone. Words are pronounced correctly, for the most part, and generally flow from one to the next in perfectly acceptable sentences. “We’re quite good at that and the speech is very intelligible,” King says.

Researchers in speech synthesis, however, would like to move beyond merely “intelligible” to speech that sounds more natural. Their work could make synthesized speech easier to understand and more pleasant to hear. It could also allow them to synthesize better voices for people unable to speak for themselves, and create text-to-speech systems for less-common languages.

“Practically all the systems work well at the sentence level,” says Alex Acero, senior director of Siri at Apple. Ask a machine to read you a newspaper article or an email message from your mother, however, and the result will be flat. “Yes, you can understand it if you pay attention, but it’s still not the same as having someone read it to you,” he says. Computerized speech cannot handle prosody—the rhythm and intonation of speech that conveys meaning and adds emotional context. “That is incredibly important for humans,” says Acero. “That’s why when you send text messages, you add emojis.”

There are two basic approaches to creating speech. The older one is parametric speech synthesis, in which a computer generates sounds from the elements of text. Over the years, that has evolved into statistical parametric speech synthesis, which uses a statistical model to create the proper waveform for each sound. For a long time the statistical model used was a hidden Markov model, which calculates the future state of a system based on its current state. In the past couple of years, however, hidden Markov models have been replaced with deep neural networks, which compute the interaction between different factors in successive layers. That switch, King says, has led to an improvement in the accuracy of the parametric approach.
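The Markov idea mentioned above, that the next state depends only on the current one, can be illustrated with a toy sketch. The three states and their transition probabilities below are invented for illustration and do not come from any real synthesizer; a true hidden Markov model would also hide the states and attach acoustic parameters to each.

```python
# Toy first-order Markov chain. States and probabilities are
# hypothetical, chosen only to illustrate the idea.
TRANSITIONS = {
    "silence": {"silence": 0.6, "vowel": 0.3, "consonant": 0.1},
    "vowel": {"silence": 0.1, "vowel": 0.5, "consonant": 0.4},
    "consonant": {"silence": 0.2, "vowel": 0.7, "consonant": 0.1},
}


def step(dist):
    """Advance a probability distribution over states by one time step.

    Only the current distribution is used; no earlier history is needed.
    """
    nxt = {state: 0.0 for state in TRANSITIONS}
    for state, p in dist.items():
        for target, q in TRANSITIONS[state].items():
            nxt[target] += p * q
    return nxt


start = {"silence": 1.0, "vowel": 0.0, "consonant": 0.0}
after_one = step(start)
after_two = step(after_one)
```

Starting from silence, each call to `step` consults only the distribution it is handed, which is the property that made such models tractable for speech.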

The technique that has mainly been used over the last couple of decades is concatenative speech synthesis, in which a human speaker records many hours of speech, which is then diced into individual units of sound called phonemes and then spliced back together to create new phrases that the original speaker never uttered. Apple, for instance, splits the phonemes, represented as waveforms, in half. That provides more choices for finding different phonemes that fit together smoothly, Acero explains.

The latest iteration of Siri combines parametric and concatenative speech synthesis. It relies on a statistical model called a mixture density network, a type of neural network, to learn the parameters of the phonemes it is looking for, examining hundreds of features such as whether a sound is stressed or not, or which phonemes usually precede or follow others. Once it knows what the waveforms of the speech are supposed to look like, it searches for appropriate ones in the recorded speech and fits them together. The system does not necessarily create every phrase from scratch; groups of words and sometimes even whole sentences can be taken directly from the recording. “It is more automated and it’s more accurate because it’s more data-driven,” says Acero.
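The search step described above, picking recorded units that match a predicted target and also fit their neighbors, is commonly framed as a shortest-path problem. The following is a minimal sketch under stated assumptions, not Apple's method: each unit is reduced to a single pitch value, the cost functions are simple absolute differences, and the names `select_units`, `inventory`, and `join_weight` are hypothetical.

```python
def select_units(targets, inventory, join_weight=1.0):
    """Pick one recorded unit per phoneme with dynamic programming.

    targets:   list of (phoneme, predicted_pitch) pairs
    inventory: dict mapping phoneme -> list of (unit_id, pitch)
    Returns (list of chosen unit ids, total cost).
    """
    # best[i][j] = (lowest cost ending in candidate j at position i, backpointer)
    best = []
    for i, (phoneme, wanted) in enumerate(targets):
        row = []
        for _unit_id, pitch in inventory[phoneme]:
            target_cost = abs(pitch - wanted)  # mismatch with the prediction
            if i == 0:
                row.append((target_cost, None))
                continue
            prev_units = inventory[targets[i - 1][0]]
            # Join cost: how badly this unit fits each possible predecessor.
            cost, back = min(
                (best[i - 1][k][0] + join_weight * abs(prev_pitch - pitch), k)
                for k, (_pid, prev_pitch) in enumerate(prev_units)
            )
            row.append((cost + target_cost, back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(inventory[targets[i][0]][j][0])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Hypothetical recorded units and predicted pitch targets (in Hz).
inventory = {
    "h": [("h_01", 110.0), ("h_02", 140.0)],
    "e": [("e_01", 120.0), ("e_02", 180.0)],
    "l": [("l_01", 125.0), ("l_02", 90.0)],
}
targets = [("h", 115.0), ("e", 125.0), ("l", 120.0)]
chosen, cost = select_units(targets, inventory)
```

A production system would score candidates on many more features than pitch, but the trade-off is the same: a unit that matches its target well can still lose out if it joins poorly with its neighbors.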

As good as the results are, however, the speech still lacks prosody, because the machine does not really understand what it is saying. That lack may explain one problem with synthesized speech, King believes: while it may be completely intelligible to someone in a quiet room who is paying attention, if the listener is in a noisy environment, or trying to multitask, or has hearing loss or dyslexia, the intelligibility drops off much more rapidly than it does with natural speech.

King hypothesizes the drop-off occurs because natural speech contains a lot of redundancies, cues that aid in understanding what is being said. There may be, for instance, changes in intonation or stress or pitch when one word leads into another in natural speech. Such acoustic cues are not there in synthesized speech, and in concatenative speech words plucked from different sentences may even contain the wrong cues.

It may also be that having to process such inconsistencies makes the listener’s brain work harder, which may increase the chances of missing something. “You couldn’t say your synthetic speech is truly natural until it’s as good as natural speech for everybody in every environment,” King says.

“In order to say something in the most natural way, you pretty much need to understand what it means,” King says. Though speech recognition is good enough for Siri and similar systems to respond to questions and commands, their level of understanding is still fairly shallow, he says. They can recognize individual words, identify nouns and verbs, notice local sentence structure, even distinguish a question from a statement. Researchers working on natural language understanding are using approaches such as vector spaces, which focus on statistics such as how frequently words appear, but so far machines are not able to understand speech—especially in large chunks such as paragraphs or entire passages—on a deep-enough level to be able to read them the way a human would.

A New Wave

In September 2016, Google announced it had made great strides with a technique called WaveNet. Developed by DeepMind, a London-based company that Google bought in 2014, WaveNet uses statistical parametric synthesis relying on deep neural networks to produce speech in both English and Mandarin that listeners rated as superior to the best existing systems (there is no objective measurement of speech quality, so it is always assessed by human listeners). The system also automatically generated piano music. Google published its results in a blog post and in a paper on arXiv, but declined to make the researchers available for press interviews.

Google’s approach was inspired by a model it had published earlier in the year that used a neural network to generate natural-looking images one pixel at a time. The researchers trained the system by feeding it waveforms recorded from human speakers. Such raw audio can contain 16,000 samples per second, so processing it is computationally expensive. Once the system was trained, they fed it text they had broken down into a sequence of linguistic and phonetic features, giving the computer such information as what word, syllable, and phoneme it was seeing. They were able to train it on different speakers so it could speak in different voices, and provided it with different accents and emotions.
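To see why sample-by-sample generation is costly, consider a rough sketch of the loop involved. This is not WaveNet itself: the function `toy_next_sample_distribution` is a hypothetical stand-in for the trained neural network, and the 16 amplitude levels are an assumption chosen to keep the example small. Only the 16,000 samples-per-second figure comes from the text above.

```python
import random

SAMPLE_RATE = 16_000  # samples per second, as quoted above
LEVELS = 16           # assumed toy number of amplitude levels


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the trained network.

    Returns one weight per amplitude level, favoring levels near the
    previous sample so that the output varies smoothly.
    """
    previous = history[-1] if history else LEVELS // 2
    return [1.0 / (1 + abs(level - previous)) for level in range(LEVELS)]


def generate(duration_seconds, seed=0):
    """Produce audio one sample at a time, each conditioned on the past."""
    rng = random.Random(seed)
    samples = []
    for _ in range(round(SAMPLE_RATE * duration_seconds)):
        weights = toy_next_sample_distribution(samples)
        samples.append(rng.choices(range(LEVELS), weights=weights)[0])
    return samples


audio = generate(0.01)  # one hundredth of a second already needs 160 model calls
```

Because every sample depends on the ones before it, the model must be evaluated 16,000 times in sequence for each second of audio, and in a real system each evaluation is a pass through a deep network rather than a one-line formula.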

Acero calls WaveNet a very interesting approach, which somewhere down the road might replace concatenative synthesis. At the moment, though, it takes several hours of computing to produce one second of speech, so it is not immediately practical.

A Physical Model

Oriol Guasch, a physicist and mathematician at Ramon Llull University in Barcelona, Spain, is also taking a computationally intensive approach to speech synthesis. He is working on mathematically modeling the entire human vocal tract. “We’d like to simulate the whole physical process, which will, in the end, generate the final sound,” he says.

To do that, he takes an MRI image of a person’s vocal tract as that person pronounces, say, the vowel “E.” He then represents the geometry of the vocal folds, soft palate, lips, nose, and other parts with differential equations. Using that, he generates a computational mesh, a many-sided grid that approximates the geometry. The process is not easy; a desktop computer can generate a mesh with three to four million elements in about three or four hours to represent the short “A” sound, he says. A sibilant “S,” though, requires a computer with 1,000 processors to run for a week to generate 45 million elements. The added complexity of that sound arises from the air flowing between the teeth and creating turbulent eddies swirling in complex patterns. Imagine, then, the time required to produce a whole word, let alone a sentence.

Guasch sees his approach more as an interesting computing challenge than a practical attempt to create speech. “The final goal is not just synthesizing speech, it’s about reproducing the way the human body behaves,” he says. “I believe when you have a computational problem, it’s good to face it from many different angles.”

The University of Edinburgh’s King, on the other hand, is working toward practical applications. He recently received funding for a three-year project, in conjunction with the BBC World Service, to create text-to-speech systems for languages that do not have enough speakers to make developing a system a financially attractive process for companies. It should be possible to use machine learning on data such as radio broadcasts and newspapers to build a credible system, King says, without the expense of hiring linguistic experts and professional voice artists. He has already built a Swahili prototype, which he says works pretty well.

King also has developed a system that can take a small number of recordings of a particular individual’s speech and apply them to a model already trained with a much larger dataset, and use that to generate new speech that sounds like that individual. The system is undergoing clinical trials in a U.K. hospital to see if it can be a practical way of helping people with amyotrophic lateral sclerosis, who are expected to lose their ability to speak as their disease progresses. “This is not going to help them live any longer, but for the time they do live it could help make their quality of life better,” he says.

Further Reading

Van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., and Kalchbrenner, N.
WaveNet: A Generative Model for Raw Audio, ArXiv, Cornell University Library, 2016 http://arxiv.org/pdf/1609.03499

King, S., and Karaiskos, V.
The Blizzard Challenge 2016, Blizzard Challenge Workshop, Sept. 2016, Cupertino, CA http://www.festvox.org/blizzard/bc2016/blizzard2016_overview_paper.pdf

Arnela, M., Dabbaghchian, S., Blandin, R., Guasch, O., Engwall, O., Van Hirtum, A., and Pelorson, X.
Influence of vocal tract geometry simplifications on the numerical simulation of vowel sounds, Journal of the Acoustical Society of America, 140, 2016 http://dx.doi.org/10.1121/1.4962488

Deng, L., Li, J., Huang, J-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., Gong, Y, and Acero, A.
Recent advances in deep learning for speech research at Microsoft, IEEE International Conference on Acoustics, Speech and Signal Processing, 2013 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6639345

Simon King – Using Speech Synthesis to Give Everyone Their Own Voice https://www.youtube.com/watch?v=xzL-pxcpo-E
