In an interview, Baidu engineer Awni Hannun discusses a new model for handling Mandarin voice queries that tests found to be accurate 94 percent of the time.
He says the model employs Deep Speech, a deep-learning system that differs from other deep-learning-based systems such as Microsoft's Skype Translate. Hannun says the latter usually has three modules in its pipeline--a speech-transcription module, a machine-translation module, and a speech-synthesis module.
"Our system is different than that system in that it's more what we call end-to-end," he notes. "Rather than having a lot of human-engineered components that have been developed over decades of speech research--by looking at the system and saying what features are important or which phonemes the model should predict--we just have some input data, which is an audio .WAV file on which we do very little pre-processing. And then we have a big, deep neural network that outputs directly to characters."
Hannun says the network is fed enough data so it can learn what is relevant from the input to correctly transcribe the output, with a minimum of human intervention. He says Baidu plans to build a speech system that can interface with any smart device, and compressing existing models may be of help in this regard.
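The end-to-end approach Hannun describes can be sketched as a single network that maps lightly pre-processed audio frames directly to character probabilities. The sketch below is a minimal illustration under assumptions of our own (feature dimensions, alphabet, a single hidden layer, random untrained weights); it is not Baidu's actual Deep Speech architecture, which is a much larger trained network.

```python
import numpy as np

# Illustrative character set: letters, space, apostrophe, plus a blank
# symbol as used by CTC-style decoders (an assumption, not Baidu's spec).
ALPHABET = list("abcdefghijklmnopqrstuvwxyz '") + ["<blank>"]  # 29 symbols

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def acoustic_model(frames, n_hidden=64):
    """Map (T, n_features) audio-feature frames to (T, len(ALPHABET))
    per-frame character probabilities -- audio in, characters out."""
    n_features = frames.shape[1]
    # Randomly initialized weights stand in for a trained network.
    w1 = rng.standard_normal((n_features, n_hidden)) * 0.1
    w2 = rng.standard_normal((n_hidden, len(ALPHABET))) * 0.1
    hidden = np.tanh(frames @ w1)   # one hidden layer; real systems stack many
    return softmax(hidden @ w2)     # distribution over characters per frame

# 100 frames of 40-dimensional features, e.g. from a spectrogram of a .WAV file.
probs = acoustic_model(rng.standard_normal((100, 40)))
# Greedy decode: most likely symbol per frame (a trained system would use
# a proper decoder such as CTC beam search).
decoded = [ALPHABET[i] for i in probs.argmax(axis=1)]
print(probs.shape)  # (100, 29)
```

The point of the sketch is the data flow: there are no hand-engineered phoneme targets or separate modules, only input features and an output distribution over characters, which is what "end-to-end" refers to in the interview.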
Abstracts Copyright © 2015 Information Inc., Bethesda, Maryland, USA