Imagine using your cell phone to navigate through a network of hyperlinked speech applications as rich and diverse as the visual applications on the conventional Web. On holiday at the beach, you could use it to navigate through a vendor's site for information on the new camera you're using to take snapshots of your friends. In the evening, by the fire, you could verbally browse a library of electronic stories, picking the one most appropriate for the moment. Imagine, too, being able to build speech applications and add them to a public-domain SpeechWeb as easily as you build and add pages to the conventional visual Web. You could have a speech-accessible resumé. Information (such as directions on how to get to your home) could be made available for your guests to access through hands-free speech in their vehicles. Companies (with the expertise) could create natural-language speech interfaces to their products and services, making them available to anyone, anywhere using a handheld SpeechWeb browser. Moreover, a public-domain SpeechWeb would allow blind users to participate more fully in our knowledge-based society.
A Kelsey Group study estimated that speech-enabled services could generate more than $4.6 billion in revenue for North American wireless carriers and $25 billion worldwide by 2006. However, says Dale Hartzell of Sandcherry, Inc., a software company specializing in speech products, this world of speech-enabled services will become a reality only if speech solutions deliver flexibility, scalability, and economy, and if speech applications can be created and deployed as easily as text- and picture-based Web services are today [6]. This has not been possible because existing architectures for distributing speech applications are not conducive to public involvement in the construction of SpeechWebs.
Here, I present a new architecture for distributed speech processing, called Local Recognition Remote Processing (LRRP), that I have developed at the University of Windsor. I show how it can be implemented through existing software and Web protocols and why it is conducive to the development of a public-domain SpeechWeb (see the figure). My goal is to encourage participation in the construction of a public-domain SpeechWeb by showing the ease with which speech browsers and applications can be built using existing software and the ease with which hyperlinked speech applications can be built on conventional Web servers.
Speech applications can be distributed via three main architectures: speech interfaces to HTML pages; networks of hyperlinked VXML pages; and call centers containing speech applications. Early versions of the first (in the early 1990s) allowed users to scan HTML pages using a standard set of voice commands. More recent versions translate Web pages into speech dialogues. Recognition grammars, or sets of rules defining user input languages, are derived from the Web pages and used to direct speech-recognition software, thereby improving speech-recognition accuracy. The Speech-Aware Multimedia (SAM) system, introduced in 1995 by Texas Instruments, is an example that also allowed grammars (defined by application developers) to be associated with so-called "smart pages." When an end user downloads a smart page, the associated grammar tailors the speech interface to recognize utterances that are interpreted locally by the speech browser.
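To make the idea of a recognition grammar concrete, here is a small fragment in the Java Speech Grammar Format (JSGF), one common notation for such grammars. The camera-vendor vocabulary is invented for illustration and is not taken from any actual smart page:

```
#JSGF V1.0;
grammar camera;

// Constrains the recognizer to a handful of camera queries,
// greatly improving recognition accuracy over open dictation.
public <query> = (what is | tell me) the <feature> [of this camera];
<feature> = price | zoom range | battery life;
```

Because the recognizer only has to distinguish among the utterances the grammar generates, even modest recognition engines can achieve high accuracy on them.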
The use of speech interfaces to HTML pages goes some way toward creating a public-domain SpeechWeb. Such interfaces are easy for nonexperts to install and provide speech access to knowledge already available on the Web; local speech recognition allows users to train the interfaces for their voices, further improving recognition accuracy. However, this architecture is somewhat limited for three main reasons:
More information on tools that enable speech access to Web pages is available in the Eurescom P923-PF project's documentation (www.eurescom.de/).
In the second architecture, VXML, a markup language similar to HTML, is used to create hyperlinked speech applications. VXML pages, which are executed on VXML browsers, include commands for prompting user speech input, invoking recognition grammars, outputting synthesized voice, iterating through blocks of code, calling local Java scripts, and hyperlinking to other remote VXML pages, which are downloaded much as linked HTML pages are in the conventional Web. Though this architecture is likely to be a major aspect of the development of a public-domain SpeechWeb, it is limited by two factors:
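A minimal VXML page in this style might look as follows; the grammar file, URLs, and prompt text are hypothetical placeholders, not examples from the article:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a VXML page: prompt the user, constrain recognition
     with a grammar, then hyperlink to another remote VXML page. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="ask">
    <field name="question">
      <prompt>What would you like to know about the solar system?</prompt>
      <grammar src="solarman.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- hyperlink to a remote VXML page, as in the HTML Web -->
        <goto next="http://example.org/answer.vxml"/>
      </filled>
    </field>
  </form>
</vxml>
```

The `<grammar>` element downloads a recognition grammar, and `<goto>` plays the role that an anchor tag plays in HTML, producing a navigable network of speech pages.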
In the third architecture, which involves telephone access to remote speech applications, call centers allow users to phone in and speak to their applications. Speech recognition and application processing are carried out at the call center. A fairly recent innovation is the integration of VXML pages stored on call-center Web servers with speech dialogues accessed through remote client telephones. (For example, the city of Montréal implemented this approach in 2003 to accommodate the hundreds of thousands of calls it receives annually.) The resulting architecture, which can be viewed as a variant of the second architecture, supports sophisticated applications requiring powerful processing. It also allows applications to reside on conventional Web servers, reducing administrative overhead. However, using the call-center architecture in a public-domain SpeechWeb also involves three main limitations:
The Speech Application Language Tags (SALT) protocol, developed by Microsoft, Intel, and others, includes tags (fewer than VXML defines) that are intended to be added to HTML pages so they can be browsed through speech. An advantage of this approach is that existing Web browsers are more easily extended to accommodate SALT tags, since SALT relies on the execution model of HTML. VXML, on the other hand, is an independent language that can be used to create and hyperlink speech applications. VXML and SALT might converge over the next five years into a single voice-application markup language.
The Microsoft Speech Server (MSS), which supports SALT, is intended to simplify the integration of an enterprise's Web services, speech processing, and telephony. MSS can help implement all three of the architectures described earlier for distributed speech applications. However, the role MSS might play in the creation of a public-domain SpeechWeb is less clear, owing to its high cost and complexity for nonexpert SpeechWeb-content developers.
Statistical language models, according to [8], were popular in 1995, but by 2001 grammar-based models had claimed the preeminent position in commercial speech-recognition products. My proposed LRRP architecture takes advantage of this fact, making use of the grammar-based speech-recognition software readily available today. In the LRRP architecture, speech applications and their associated recognition grammars are stored on regular Web servers, and LRRP speech browsers reside on end-user devices. When a user accesses a remote application, its grammar is downloaded to the browser, where it is used to tailor the speech recognizer for the application. When a user utterance is subsequently recognized, it is sent to the remote application for processing. If the utterance is a request for information, the result is returned in the form of synthesized voice. Speech recognition is local, and application processing is remote. Alternatively, users might ask to be connected to other speech applications by uttering a speech hyperlink (such as "Can I talk to the chemistry expert?"). The Web address of the new application is then returned to the browser, which downloads a new recognition grammar from that address and subsequently sends all recognized utterances to the new application.
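The browser's control loop can be sketched as follows. This Python sketch replaces the network, recognizer, and synthesizer with stubs, and the `link:` reply convention, the URLs, and the answers are invented stand-ins for whatever protocol a real implementation would use:

```python
def mock_application(url, utterance):
    """Stand-in for an HTTP request to a remote speech application.
    (Invented behavior: a real application would run on a Web server.)"""
    if url == "http://example.org/solarman":
        if "chemistry" in utterance:
            # a speech hyperlink: redirect the browser to another application
            return "link:http://example.org/chemist"
        return "Mars has two moons."
    return "I am the chemistry expert."

def lrrp_browser(start_url, utterances):
    """Local recognition, remote processing: send each recognized
    utterance to the current application; follow speech hyperlinks."""
    url = start_url
    spoken = []
    for utt in utterances:      # each string stands in for a recognized utterance
        reply = mock_application(url, utt)
        if reply.startswith("link:"):
            url = reply[5:]     # retarget the browser; a new grammar would be
                                # downloaded from this address at this point
        else:
            spoken.append(reply)  # text that would be synthesized as voice
    return url, spoken
```

Running the loop with a hyperlink utterance in the middle shows the browser switching applications while recognition stays local throughout.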
The LRRP architecture has something in common with each of the three architectures discussed earlier. Recognition is local, as it is with speech interfaces to HTML pages. The downloading of recognition grammars from remote applications is similar to SAM's smart pages. Like VXML, LRRP is intended to support access to applications designed for speech (rather than multimodal access to HTML pages). And finally, it involves remote execution of applications, like the call-center architecture. In terms of support for a public-domain SpeechWeb, LRRP reflects a combination of the following advantages:
Aiming to demonstrate the viability of the LRRP architecture, my research group and I have experimented with a number of technologies for building browsers and deploying speech applications. We built early versions of the browser (1998) in Java using IBM's proprietary speech APIs. I demonstrated them at the 1999 Conference of the Pacific Association for Computational Linguistics [5] and at the 2002 Conference of the American Association for Artificial Intelligence [4].
Last spring, one of my students and I rebuilt the browser as a single VXML page, integrating it with two simple Java objects [12]. The single VXML page remains on the local device and accesses remote hyperlinked non-VXML speech applications. To run this upgraded browser, users install the Java runtime environment and a VXML interpreter, both readily available as free downloads on the Web. To add a speech application to the SpeechWeb, developers need access to a Web server and software with which to create their text-in/text-out applications. We tested the browser on PCs and laptops using both wired and wireless connections to local and wide-area networks, then demonstrated it in 2004 at the Second Annual Conference on Communication Networks and Services Research in Fredericton, N.B. [3].
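A text-in/text-out application of this kind can be as simple as a script hosted on a conventional Web server. The sketch below uses Python's standard WSGI interface; the query parameter name and the question/answer content are invented for illustration, since the article does not specify an application protocol:

```python
# Minimal text-in/text-out speech application, deployable behind any
# conventional Web server that speaks WSGI. Content is illustrative only.
from urllib.parse import parse_qs

ANSWERS = {
    "how many moons does mars have": "Mars has two moons, Phobos and Deimos.",
}

def answer(utterance):
    """Map a recognized utterance to the text to be spoken back."""
    return ANSWERS.get(utterance.strip().lower(),
                       "I do not know the answer to that.")

def application(environ, start_response):
    """WSGI entry point: reads ?question=... and returns plain text,
    which the LRRP browser would pass to its speech synthesizer."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    utterance = query.get("question", [""])[0]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [answer(utterance).encode("utf-8")]
```

Because the application is ordinary server-side code, anyone with access to a Web server can deploy one, which is precisely the low barrier to entry the LRRP architecture is meant to provide.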
My somewhat unorthodox use of VXML (to implement the LRRP browser) provides several notable advantages: the browser can be distributed freely; VXML interpreters are available for a range of devices, including desktop PCs, laptops, and pocket PCs; and the browser inherits improvements in speech-recognition technology as they're integrated into new versions of VXML interpreters. Finally, this method of constructing the browser facilitates integration of the LRRP architecture with the conventional "networks of VXML pages" architecture; the result is a richer environment for SpeechWeb application development. I also expect that the browser can be implemented as a single HTML page containing appropriate SALT tags, with comparable advantages.
The functionality of a public-domain SpeechWeb will be limited by three main issues:
Addressing the first involves developing special tools and techniques. Tools are needed to facilitate the construction of natural-language applications by nonexperts. Microsoft, IBM, and others have developed software for constructing natural-language interfaces to proprietary database systems. In addition, advanced developers can construct natural-language processors using Yet Another Compiler Compiler (YACC), the Cup parser generator for Java, or the parser combinators of logic and functional programming. Compositional semantic theories (such as those devised by Richard Montague at the University of California, Los Angeles) can help in developing the semantics for the subsets of natural language that are appropriate for a given application. Even with these tools and theories, developing natural-language interfaces remains a difficult technical task for all but experts.
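The flavor of compositional, Montague-style semantics can be conveyed with a toy example: each phrase denotes a set, and the meaning of a query is computed by combining the denotations of its parts. The vocabulary, denotations, and single sentence pattern below are invented for illustration and are far simpler than any real application:

```python
# Toy compositional semantics for queries like "which moons orbit mars".
# The mini knowledge base is invented: each moon maps to its planet.
MOONS = {"phobos": "mars", "deimos": "mars", "luna": "earth"}

def denote_noun(word):
    """A common noun denotes a set of entities."""
    return set(MOONS) if word == "moon" else set()

def denote_vp(verb, obj):
    """A verb phrase such as 'orbit mars' denotes the set of its subjects."""
    if verb == "orbit":
        return {m for m, planet in MOONS.items() if planet == obj}
    return set()

def which(nouns, vp):
    """'Which N VP?' denotes the intersection of the two sets."""
    return sorted(nouns & vp)

def query(sentence):
    """Handle the single invented pattern 'which moons orbit <planet>'."""
    words = sentence.lower().rstrip("?").split()
    return which(denote_noun(words[1].rstrip("s")),
                 denote_vp(words[2], words[3]))
```

The point of the compositional approach is that adding a new word means adding one denotation, not rewriting the whole interpreter; real systems replace the naive pattern match with a proper parser, such as one built from parser combinators.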
Guidelines are needed to help developers create recognition grammars that provide the proper mix of accuracy and robustness for a given application. Although a number of tutorial-style documents on recognition-grammar design are available on the Web, they must be augmented with more comprehensive material based on empirical results obtained through grammar-based recognition engines in a range of applications.
Design rules are needed to construct sets of hyperlinked speech objects and decompose large speech applications into sets of hyperlinked units, balancing recognition accuracy, robustness, and ease of navigation; these tasks are similar to dialogue design.
The second limitation, the lack of support for free-flowing speech in current technology, was discussed in work that identified the steps researchers must take to improve accuracy and robustness, thereby facilitating the mainstream adoption of speech technology. The recommendations of these researchers apply equally to the development and acceptance of a public-domain SpeechWeb.
Fortunately, the third limitation, the need for speech recognizers on lightweight end-user devices, has been overcome to some extent by the recent availability (June 2005) of freely downloadable browsers supporting the multimodal markup language XHTML+Voice, also known as X+V. This language was developed by Access Systems, IBM, Motorola, and Opera Software and has been submitted to the World Wide Web Consortium for approval. Trial versions of X+V multimodal browsers for PCs, pocket PCs, and cell phones are freely downloadable from www-306.ibm.com/software/pervasive/multimodal/. They can execute a single X+V page constituting a SpeechWeb interface and provide state-of-the-art local speech recognition and text-to-speech synthesis.
The conventional Web has grown dramatically over the past decade in terms of numbers of users and Web sites, e-commerce activity, collaborative applications, information accessibility, and social networking. The availability of free browsers and the ease of developing HTML pages have been important factors in this growth. My aim here is to motivate the development of a public-domain SpeechWeb in which speech browsers and applications can be built and deployed using readily available software and conventional Web protocols.
3. Frost, R., Abdullah, N., Bhatia, K., Chitte, S., Hanna, F., Roy, M., Shi, Y., and Su, L. LRRP SpeechWebs. In Proceedings of the Second Annual Conference on Communication Networks and Services Research (Fredericton, N.B., Canada, May 19–21). IEEE Computer Society, 2004, 91–98.
4. Frost, R. SpeechWeb: A Web of natural language speech applications. In Proceedings of AAAI Intelligent Systems Demonstrations (University of Alberta, Edmonton, July 28–Aug. 1). AAAI Press/MIT Press, 2002, 998–999.
5. Frost, R. and Chitte, S. A new approach for providing natural-language speech access to large knowledge bases. In Proceedings of the Conference of the Pacific Association for Computational Linguistics (University of Waterloo, 1999), 82–90.
6. Hartzell, D. Simplifying Speech-Enabled Solutions or Deploying Speech-Enabled Services Should Be as Easy as Deploying Web Services. Invited talk at the Conference on Voice Enabled Services, London, Jan. 2003.
8. Knight, S., Gorrell, G., Rayner, M., Milward, D., Koeling, R., and Lewin, I. Comparing grammar-based and robust approaches to speech understanding: A case study. In Proceedings of Eurospeech 2001, the Seventh European Conference on Speech Communication and Technology (Aalborg, Denmark, Sept. 3–7, 2001), 1779–1782.
12. Su, L. and Frost, R. A novel use of VXML to construct a speech browser for a public-domain SpeechWeb. In Proceedings of the 18th Conference of the Canadian Society for Computational Studies of Intelligence (Victoria, B.C., 2005), 401–405.
This research is supported by the Natural Sciences and Engineering Research Council of Canada. Sanjay Chitte, Kunal Bhatia, and (Jack) Li Su (all graduates of the University of Windsor) helped implement versions of the SpeechWeb browser.
©2005 ACM 0001-0782/05/1100 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2005 ACM, Inc.