Artificial Intelligence and Machine Learning News

Talking to Machines

Voice recognition programs like Siri are now capable of understanding spoken commands, recognizing a conversation's context, and answering questions in a personable manner.
  1. Introduction
  2. Executing Commands
  3. Learning and Organizing
  4. Measures of Success
  5. Further Reading
  6. Author
  7. Figures
Siri assistant on iPhone 4S
Siri can answer a wide variety of spoken questions in a conversational manner even in difficult conditions but, like its human inventors, it has yet to solve the P versus NP problem.

When Apple integrated Siri into the iOS operating system last October, it spurred iPhone owners to start talking to their phones as well as through them. The program, which converts spoken commands such as “Schedule dinner with Lisa at 6 tonight” into calendar appointments, Web searches, and the like, is the most widely distributed example of a cognitive assistant to date. More than four million iPhone 4S’s featuring Siri were sold during its first weekend. Although users might see it as simple speech recognition, its abilities go far beyond transcription.

Siri represents an important moment when voice recognition, information management, artificial intelligence, task fulfillment, and user interface marry in a way the general public finds usable and productive. As Wolfram Research executive director Luc Barthelet says, “The news about Siri is that it works. People have tried to get computers to answer questions conversationally for at least 15 years, but only now has the technology reached a threshold where people overall like it.” The iPhone’s popularity also gives intelligent software assistants wider exposure than they would get otherwise. Roger K. Moore, editor in chief of the journal Computer Speech and Language, points out that “the field of research hasn’t changed dramatically. What’s new is that Siri’s brought several complementary technologies together. Our business has been going for many years. Only now, with Siri, everybody knows about it.”

Executing Commands

There is a long road between the spoken command and its fulfillment, though. The first step in the process is to convert the audio of speech into meaning. The two main applications of speech recognition—dictation and command recognition—have forced researchers to pursue parallel methods that balance vocabulary, accent, and context needs.

Grammar-based voice recognition is optimized for situations where the program has a very good idea of what the speaker will say. Its most common application is in Interactive Voice Response (IVR) systems, such as those that some airlines use to interpret spoken reservations and requests for information. These are often conversational. A recorded voice asks the speaker a question, then listens for the response. As a result, the system needs to understand only a limited vocabulary. But according to Dan Faulkner, vice president of product and strategy for the Enterprise Business Unit at Nuance, responses can vary widely no matter how restricted the domain is. “When a phone prompt system tells you to ‘please say yes or no,’ we might be pretty confident the speaker will say ‘yes’ or ‘no.’ But ‘yes’ could be ‘yeah,’ ‘that’s correct,’ or ‘yup,’ and in the Southern states some people will say ‘yes, ma’am’ and ‘no, ma’am.’ So even for something as simple as ‘yes’ or ‘no,’ you need quite an extensive list of phrases.”
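Faulkner’s yes/no example can be sketched as a toy grammar matcher. This is an illustrative sketch only; the phrase lists and function are hypothetical, not Nuance’s actual grammars:

```python
# Hypothetical sketch of a grammar-based recognizer's final step: mapping
# the many surface forms of "yes" and "no" to a single canonical intent.
# The phrase lists here are illustrative; real IVR grammars are far larger.
YES_PHRASES = {"yes", "yeah", "yup", "that's correct", "yes, ma'am"}
NO_PHRASES = {"no", "nope", "no, ma'am"}

def interpret_yes_no(transcript: str) -> str:
    """Return 'yes', 'no', or 'unknown' for a transcribed response."""
    phrase = transcript.strip().lower()
    if phrase in YES_PHRASES:
        return "yes"
    if phrase in NO_PHRASES:
        return "no"
    return "unknown"

print(interpret_yes_no("Yes, ma'am"))  # yes
print(interpret_yes_no("maybe"))       # unknown
```

Because the domain is so restricted, anything outside the lists can simply trigger a reprompt, which is why IVR systems can work with small vocabularies.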

By contrast, a language-based voice recognition system is optimized for dictation. It makes few assumptions about context, attempting to recognize and transcribe every word it hears. In gaining a wide vocabulary, it gives up the ability to understand a variety of accents. According to Peter Mahoney, general manager of the division of Nuance that produces its Dragon NaturallySpeaking products, modern dictation programs no longer need the 30–60 minute initial training period that earlier versions did. “People now get well above 90% accuracy the first time they use [our dictation programs]; that figure used to be only 75% to 80%.”

Although initial training is not as necessary as before, dictation programs still train themselves as you use them. “The program looks at three things to get to know its user,” says Mahoney. “First is acoustics. How do you actually say words? That’s what the initial training used to focus on. Second is the kind of words you use—your spoken writing style. Third is a variety of user preferences, such as how you say numbers and names, and how you like them to be formatted. So if you were to say, ‘I gave you two dollars and forty-two cents,’ the program knows to transcribe that as ‘I gave you $2.42.'”
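Mahoney’s third kind of personalization, number formatting, can be illustrated with a toy normalizer. This is a sketch under stated assumptions (a tiny number vocabulary and one fixed phrase pattern), not Dragon’s actual code:

```python
# Illustrative sketch of formatting-preference rewriting: turning a spoken
# dollar amount like "two dollars and forty-two cents" into "$2.42".
# The number vocabulary is deliberately minimal.
import re

SMALL = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}

def words_to_int(words: str) -> int:
    """Convert a simple spoken number like 'forty-two' to 42."""
    return sum(SMALL[w] for w in re.split(r"[\s-]+", words.lower()))

def format_dollars(spoken: str) -> str:
    """Rewrite 'X dollars and Y cents' as '$X.YY'; pass other text through."""
    m = re.fullmatch(r"(.+?) dollars and (.+?) cents", spoken.lower())
    if not m:
        return spoken
    return f"${words_to_int(m.group(1))}.{words_to_int(m.group(2)):02d}"

print(format_dollars("two dollars and forty-two cents"))  # $2.42
```

A real dictation program would select among many such rewrite rules based on the stored user preferences Mahoney describes.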

These two methods of understanding speech are starting to converge, however. Dictation programs adapt somewhat to the tasks they are performing, for example favoring the phrase “dot com” when the cursor is in an email program’s “To” field. And Siri switches from command to dictation mode when appropriate.
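The convergence described above amounts to routing audio to one recognizer or the other based on interface context. A minimal sketch, assuming a hypothetical set of field names (this is not Siri’s actual design):

```python
# Hypothetical mode dispatcher: free-text fields get the wide-vocabulary
# dictation recognizer; everything else gets the grammar-based command
# recognizer. Field names here are invented for illustration.
DICTATION_FIELDS = {"message body", "note", "email body"}

def choose_mode(active_field: str) -> str:
    """Pick a recognition mode from the current UI context."""
    return "dictation" if active_field in DICTATION_FIELDS else "command"

print(choose_mode("note"))         # dictation
print(choose_mode("home screen"))  # command
```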

Learning and Organizing

But voice recognition is only a small part of the puzzle. Before a cognitive assistant can schedule that dinner with Lisa, it needs to understand that dinner is an event of limited duration, that Lisa is a person whose contact information is found in the device’s address book, and so forth.

Some of this understanding came from a Defense Advanced Research Projects Agency (DARPA)-funded project, Cognitive Assistant that Learns and Organizes (CALO), which was part of DARPA’s PAL (Personalized Assistant that Learns) program. CALO’s focus was not on voice recognition per se, on natural-language understanding, or on human-computer interaction in general. Rather, it was about making computer systems learn in everyday settings: recognizing concepts, relating them through an underlying ontology, and triggering desktop applications and Web services.

SRI International principal investigator C. Raymond Perrault describes the challenge of resolving ambiguities in natural language. “Movie titles are typically just phrases in the language,” Perrault points out, “so you could say ‘Get me two tickets to The Artist,’ and the system would recognize that phrase as a movie title. On CALO we tried to resolve such ambiguities robustly, and to make it easy to build systems that do so as well.” Humans resolve these ambiguities using context; teaching a computer to do the same requires building very long lists of such things as addresses, people, movies, and organizations, along with a solid categorization system to manage them. Ideas from CALO suggested a new approach to developing a spoken interface to a set of Web services, eventually realized in Siri, which was built by a company that spun out of SRI International.
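The list-based disambiguation Perrault describes can be reduced to a toy classifier. This is a hypothetical sketch, not CALO’s actual code; the title list is illustrative:

```python
# Sketch of list-based entity disambiguation: a phrase is treated as a
# movie title only if it appears in a large curated list; otherwise it is
# read as ordinary language. Real systems hold many thousands of entries
# per category (people, places, organizations, titles).
MOVIE_TITLES = {"the artist", "up", "jaws"}

def classify_phrase(phrase: str) -> str:
    """Label a phrase as a known movie title or ordinary language."""
    return "movie_title" if phrase.lower() in MOVIE_TITLES else "plain_phrase"

print(classify_phrase("The Artist"))   # movie_title
print(classify_phrase("the weather"))  # plain_phrase
```

In practice the hard part is exactly what the article notes: a phrase like “Up” is both a common word and a title, so the categorization system must weigh context, not just list membership.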

Neither a robust ontology nor high-speed voice recognition are possible without substantial data as input. Here, the Web’s social nature—and some carefully worded clauses in end-user license agreements—allow software assistants to collect acoustic, syntactic, and factual information. “It wouldn’t be possible to do something like Siri 10 to 15 years ago because you couldn’t get enough data to train the system,” says Alan W. Black, associate professor at the Language Technologies Institute of Carnegie Mellon University. “Google started collecting data years ago through its free 411-GOOG informational service. Notably, they advertised on billboards rather than online. What they were actually doing was finding out how ordinary people asked questions.” By contrast, Nuance’s Faulkner recalls how the company trained its dictation products to understand noisy phone transmissions in past years. “We’d pay people to come into the office and give them scripts. We’d give them a mobile phone, put them in a cab, tell them to call a number, then record their speech.”

The resulting collection of data is much too big to fit on today’s portable devices, so command agents rely on two other recent developments: ubiquitous high-speed bandwidth and cloud storage. For speed, recognition actually takes place on the server rather than on the device itself. And because the voice channel’s relatively low 8 kHz sampling rate introduces noise and distortion that could hurt recognizability, the sound is transmitted via the data channel instead. “We’ve now reached a point that we put a fairly fast stream of language over an iPhone at 16kHz,” says Faulkner. “Being able to capture twice as much data makes a big difference.”
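Faulkner’s “twice as much data” is simple arithmetic on the sampling rates. A back-of-the-envelope check, assuming 16-bit samples (an assumption; actual codecs and bit depths vary):

```python
# Doubling the sampling rate doubles the samples captured per second,
# and hence the raw data rate at a fixed bit depth.
BITS_PER_SAMPLE = 16            # assumed bit depth; codecs vary
voice_channel = 8_000 * BITS_PER_SAMPLE   # 8 kHz voice channel, bits/s
data_channel = 16_000 * BITS_PER_SAMPLE   # 16 kHz over the data channel

print(data_channel / voice_channel)  # 2.0
```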

On the back end, such programs rely on third-party services. Siri counts among its providers Yelp, OpenTable, StubHub, Rotten Tomatoes, and The New York Times. One of its most unusual providers is Wolfram|Alpha, which catalogs general concepts and facts rather than the raw Web site data stored by such search engines as Google. As Wolfram’s Barthelet describes, “The Web pushes the world back on you. You’re asking, ‘What does the world know about this subject?’ But often, you just want to know the answer.” As with conversational voice recognition systems, Wolfram|Alpha attempts to deliver that answer by first limiting the question’s domain based on the questioner’s location, previous questions, and other factors.

Measures of Success

Ultimately, these steps serve the single goal of delivering a relevant, true, and useful response with acceptable speed. But as Carnegie Mellon’s Black points out, a software agent’s job is not done if it only delivers facts. “One standard measure of spoken dialogue systems is task completion,” he notes. “Did the user successfully get the weather? But it’s clear that that’s not the only goal. You can have an interaction that’s successful and takes little time, but is unpleasant. So satisfaction is another goal.”

Black believes Siri delivers that satisfaction partly through its helpful-yet-sassy tone. “It doesn’t just answer questions,” he says. “It has a character. It wants to name you, to know who you are. You can tell it to call you “Master” or “Darth Vader” or whatever, but it wants to call you that. It makes things a little more personal, and that’s important.” Faulkner also points to Siri’s many handcrafted, hidden Easter eggs. For instance, if you tell Siri “I’m drunk,” it offers to call you a cab.

More importantly, today’s software agents have taken context to a level never before attempted, striving to know more about you than you know about yourself—your situation, tastes, and patterns—before running off to find exactly what they believe you want. They paradoxically expand your power by narrowing its domain, collapsing infinite possibilities into a single action.

Still, Moore believes the game is far from over. “The history of this field has always been one of waves of success, followed by going into the doldrums. The joke about our success is, ‘Just keep showing the same graph of escalating future returns, but don’t put any dates on it.’ ” He has tested that theory empirically by surveying his colleagues every six years, asking them when voice recognition will hit certain milestones. But every time he resurveys them, “all the dates have moved out another six years! So the future isn’t getting any closer.”

“Something like Siri appears and people think, ‘We’ve solved it!’ But you can’t use it in a pub or a train station,” says Moore. “Then when you point out all the realities of the fantastic abilities that human beings have at holding conversations in difficult circumstances, you realize we still need to solve artificial intelligence, language, neuro-computing, and so on before we have a truly autonomous agent.”

Further Reading

Baker, J., et al.
Research developments and directions in speech recognition and understanding, part 1, IEEE Signal Processing Magazine 26, 3, May 2009.

Baker, J., et al.
Updated MINDS report on speech recognition and understanding, part 2, IEEE Signal Processing Magazine 26, 4, July 2009.

Lecouteux, B., Linarès, G., and Oger, S.
Integrating imperfect transcripts into speech recognition systems for building high-quality corpora, Computer Speech & Language 26, 2, April 2012.

Moore, R.K.
Progress and prospects for speech technology: Results from three sexennial surveys, INTERSPEECH 2011, Florence, Italy, August 27–31, 2011.

Prasad, R., et al.
BBN TransTalk: Robust multilingual two-way speech-to-speech translation for mobile platforms, Computer Speech & Language, Nov. 15, 2011.

Yorke-Smith, N. and Myers, K.
Like an intuitive and courteous butler: A proactive personal agent for task management, Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2009, Budapest, Hungary, May 10–15, 2009.



