Apple pioneered the voice revolution in 2011 with the introduction of Siri on its iPhone 4S. Today, you tell your iPhone 11, "Hey Siri, Play Bruce Springsteen by Spotify," and it responds, "I can't talk to Spotify, but you can use Apple Music instead," politely displaying options on the screen,a as shown in the figure here. Or, you tell one of your five Amazon Echo devices at home, "Alexa, add pumpkin pie to my Target shopping list,"b then "order AA Duracell batteries," and it adds pumpkin pie and Amazon Basics batteries to your Amazon shopping cart, ignoring your requests to shop at Target and to remain loyal to Duracell. You are the consumer, but your choices have been ignored.
Or, consider you are a brand manager. You want to customize the voice of Echo to match your brand persona, but that option is not offered: Amazon lets users change the default female voice only to "male" and a few other lonely options. Instead, you decide to create a personalized assistant. However, unlike the leading technology companies (the so-called FAANGs—Facebook, Apple, Amazon, Netflix, and Google), your company does not have the thousands of engineers and vast advertising budgets needed to develop it.
These examples suggest artificial intelligence (AI) environments are evolving toward more limited choice. However, we believe an open and standardized approach would be beneficial to most, including the FAANGs themselves. In fact, this approach may be urgent, given the rapid growth of adoption and increasing level of complexity for users and skill suppliers.
Large companies are rushing into the rapidly growing voice opportunity, introducing mutually incompatible smart speakers spanning many markets and languages. Each has its own "wake construct," which we define as the utterance that activates a skill. Typically, it consists of a "Wake Word" followed by a skill name and associated parameters. It may also include input from face recognition or other sensors, as happens when you activate Siri on your Apple Watch by raising your wrist. Incompatibility among devices from different suppliers is pervasive and may stem from any of the wake construct's components.
For example, if you say "Alexa, Bedtime" to your Echo, it starts a skill operated by Johnson & Johnson, while if you tell Google Home, "OK, Google, Bedtime," the device initiates a different routine that is owned by Google itself. Sometimes the behavior is similar on both devices but the service operator is different—for example, saying the words "sleep sounds" to your Echo or Google Home will initiate two different skills related to sleep and relaxation sounds. There is no central and standard repository of words to avoid this sort of inconsistency, such as one where a shop owner could reserve an action name across devices and languages.
For skill developers, devices are also incompatible and may have language-specific skill programming options. This means porting P&G's Tide Alexa skill to all combinations of devices and languages would require maintaining hundreds and eventually thousands of different versions. Even then, the front-facing experience would not be consistent across devices and languages, which could confuse customers. Creating a somewhat consistent user experience across this myriad of options would require the sort of budget that is unjustifiable for small businesses. Even P&G's Tide skill, introduced several years ago, is still available only in English.
Amazon and Apple's business aggressiveness—their choice to route product requests to their own services while developing incompatible closed-garden solutions—makes perfect sense for large companies trying to establish dominance in this competitive space. However, any advantages from these aggressive, incompatible offerings may be offset by raised entry costs for skill suppliers such as P&G, antitrust violations affecting retailers large and small,5 or unmet consumer preferences.c
Incompatibility in the voice space contrasts with the situation in Wake Neutrality markets. In these markets, such as the Web, the phone network, barcodes, or even WiFi networking, there are standard ways of activating services.9 This gives consumers a consistent experience across more choices and at lower entry costs. On the Web, one can type "www.kohls.com" and trust reaching the same retailer, no matter the browser type, WiFi network, or device OS. Toscanini's Ice Cream's phone number is the same, no matter the phone maker, calling app, or network provider. The Internet has the Domain Name System (DNS),d which guarantees each name is unique, and phone networks have the North American Numbering Plan (NANP), which does the same for phone numbers.
Similarly, we believe the voice space, and AI in general, would greatly benefit from a standard "Voice Name System" (VNS) enabling unique skill names across devices. We suggest the VNS incorporate three architectural components: "Common Wake Constructs" (CWC), "Secure Voice Channels" (SVC), and "Conversational Privacy Firewalls" (CPF); each is reviewed in turn below.
Our first architectural suggestion is to implement Common Wake Constructs to standardize voice request routing. For example, the analogue of typing "http://www.lidl.com/shopping-list" could be to say "OK Google, Lidl Open Shopping list."e The DNS may be a starting point, so that one cannot reserve a word on the VNS without holding the corresponding DNS name, but an extra step is needed because voice is ambiguous. The identical pronunciation of the store brands "Coles" and "Kohls" requires disambiguation. This could be done geographically, by adding a prompt such as "OK, Coles" versus "Coles," or by a wrist movement as in the Siri example here. Deciding whether two names are too similar may be tricky, requiring arbitration. Domain name registrars have performed such arbitration functions within the DNS; for example, they prevent C0LES.org—with a zero "0" instead of the letter "O"—from being registered, to avoid phishing.
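The URL analogy above can be sketched in a few lines of code. This is a minimal illustration only: the grammar, the registry entries, and the endpoint URLs are invented, and a real VNS would resolve skill names through a DNS-like lookup rather than a local table.

```python
# Hypothetical VNS registry: skill name -> service endpoint,
# mirroring how DNS maps "www.lidl.com" to an IP address.
VNS_REGISTRY = {
    "lidl": "https://voice.lidl.com/skill",
    "tide": "https://voice.tide.com/skill",
}

WAKE_WORDS = ("ok google", "alexa", "hey siri")

def parse_cwc(utterance):
    """Split an utterance into (wake word, skill, action, endpoint), or None."""
    lowered = utterance.lower().strip()
    for wake in WAKE_WORDS:
        if lowered.startswith(wake):
            rest = lowered[len(wake):].lstrip(" ,")
            skill, _, action = rest.partition(" ")
            if skill in VNS_REGISTRY:
                return wake, skill, action, VNS_REGISTRY[skill]
    return None

print(parse_cwc("OK Google, Lidl Open Shopping list"))
# -> ('ok google', 'lidl', 'open shopping list', 'https://voice.lidl.com/skill')
```

In this sketch the wake word plays the role of the URL scheme, the skill name plays the role of the domain, and the remaining words play the role of the path.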
Disambiguating voice is more complex than disambiguating text, and it may lead to errors because computer speech recognition is not 100% accurate. The VNS could address error correction by using other sensors, requiring a user to spell out a word, or requiring verification on a different device. To decide whether two speech utterances resemble each other too closely, phoneme matches could determine whether the probability of a phishing attack is above a threshold before a voice domain is granted. To establish this probability automatically with deep learning algorithms, one can train on a public speech sample repository using an open source speech-to-text solution such as Baidu's, which outputs probabilities in its last layer.
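A toy version of this phishing check can be sketched as follows: before granting a voice domain, compare its phoneme sequence against already-registered names and reject near-matches. The phoneme transcriptions and the 0.8 threshold are invented for illustration; a production VNS would derive match probabilities from a speech-to-text model rather than a lookup table.

```python
from difflib import SequenceMatcher

# Hypothetical ARPAbet-style phoneme transcriptions.
PHONEMES = {
    "kohls": ["K", "OW", "L", "Z"],
    "coles": ["K", "OW", "L", "Z"],
    "lidl":  ["L", "IY", "D", "AH", "L"],
}

def too_similar(name_a, name_b, threshold=0.8):
    """Flag a likely phishing collision when phoneme overlap is high."""
    ratio = SequenceMatcher(None, PHONEMES[name_a], PHONEMES[name_b]).ratio()
    return ratio >= threshold

print(too_similar("coles", "kohls"))  # -> True: identical pronunciation
print(too_similar("coles", "lidl"))  # -> False
```

Under this scheme, "coles" and "kohls" collide and would require the geographic or prompt-based disambiguation described above before either could be granted.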
A CWC as proposed here is feasible using today's technology. So is a standard that also includes as CWCs basic command phrases such as "<raise volume>" or "<play> Bruce Springsteen <in> Spotify." Together with one of my students, I have created Huey, a CWC programming language based on human natural language.11,13
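Standardized command phrases of this kind could be handled by simple template matching. The sketch below is not Huey's actual syntax, only a hypothetical illustration of how phrases like "<play> Bruce Springsteen <in> Spotify" could map to commands with slots.

```python
import re

# Hypothetical standard command templates; slot names are invented.
TEMPLATES = {
    "play_media": re.compile(r"^play (?P<content>.+) in (?P<service>\S+)$"),
    "raise_volume": re.compile(r"^raise volume$"),
}

def match_command(utterance):
    """Return (command name, captured slots) for a recognized phrase."""
    lowered = utterance.lower().strip()
    for name, pattern in TEMPLATES.items():
        found = pattern.match(lowered)
        if found:
            return name, found.groupdict()
    return None

print(match_command("Play Bruce Springsteen in Spotify"))
# -> ('play_media', {'content': 'bruce springsteen', 'service': 'spotify'})
```

Because the templates would be part of the standard rather than of any one vendor's platform, the same phrase would behave identically across devices.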
Because voice can be very insecure, devices currently avoid security issues by allowing only simple use cases, such as ordering a cab or a pizza, in private environments like the car, home, or office. In fact, researchers have shown existing wake algorithms can be fooled by sound sequences inaudible to the human ear, even to the point of forcing transcription of potentially any desired phrase.2
To expand security options, we propose designing Secure Voice Channels by adding a security layer to smart speakers, analogous to "secure http" (https), so that selected CWCs require a more secure process to be activated. In public spaces, this could be based on answering a "trick question" or on pressing a specific "secure" button associated with the device, much as we do today with a car key. The VNS could also blacklist spaces where harmful voice intrusions are known to occur.
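The https analogy can be made concrete with a minimal dispatcher sketch: commands registered as "secure" are refused until an out-of-band confirmation (a button press or a trick-question answer) arrives. The command set and the confirmation mechanism are invented for illustration.

```python
# Hypothetical set of CWCs that require the secure channel.
SECURE_COMMANDS = {"unlock front door", "transfer money"}

def dispatch(command, confirmed=False):
    """Run open commands directly; gate secure ones behind confirmation."""
    if command in SECURE_COMMANDS and not confirmed:
        return "denied: press the secure button to confirm"
    return f"executing: {command}"

print(dispatch("play music"))                        # open channel
print(dispatch("unlock front door"))                 # blocked
print(dispatch("unlock front door", confirmed=True)) # secure channel
```

As with https, the point is that the caller cannot choose to skip the secure handshake: the channel requirement travels with the command itself.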
A Burger King TV ad woke nearby Google devices when it stated, "OK, Google, what's a Whopper?" The Cannes Lions International Festival of Creativity singled out this Burger King ad as the most intrusive advertisement ever because it triggered a follow-on ad skill in each home and informed Google which homes were watching. Hackers changed the skill to be harmful, and Google had to pull it immediately.f The potential for privacy violations is unprecedented because short audio segments recorded by smart speakers can carry Personally Identifiable Information (PII), including gender, race, mood, alcohol intake, and even personality disorders.6
Automated analysis of speech has been shown to detect the onset of psychosis in young adults even before human experts can. In my lab, we recently used AI to diagnose Alzheimer's with only 20 seconds of speech and identified longitudinal biomarkers to track disease progression, achieving a spontaneous-speech detection rate of 93.8%, the highest reported so far.7 Similarly, we used forced-cough recordings to identify COVID-19 subjects12 and demonstrated that a single cough can reveal cultural and biological information.4
Our third suggestion, Conversational Privacy Firewalls (CPF), is an architectural block that filters input and output voice signals, limiting the amount of PII available to intervening players. Depending on what type of PII one wants to protect, a different CPF filter mode may be appropriate. Here are some examples:
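As one minimal illustration of a filter mode, a CPF could forward the text transcript while stripping paralinguistic metadata (mood, estimated age, and similar PII) that intervening players do not need. The field names and mode names below are hypothetical.

```python
# Hypothetical filter modes: each mode whitelists the fields that may
# pass through the firewall to the skill operator.
FILTER_MODES = {
    "transcript_only": {"text"},
    "retail": {"text", "language"},
}

def cpf_filter(signal, mode="transcript_only"):
    """Pass through only the fields allowed by the chosen filter mode."""
    allowed = FILTER_MODES[mode]
    return {key: value for key, value in signal.items() if key in allowed}

raw = {"text": "add pumpkin pie", "mood": "tired",
       "language": "en", "estimated_age": 34}
print(cpf_filter(raw))
# -> {'text': 'add pumpkin pie'}
```

A whitelist design is deliberate here: any PII field the mode does not explicitly allow is dropped by default rather than forwarded.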
Two suggestions for short-term actions to begin establishing the VNS standard include:g
Eventually, more difficult choices will have to be made, such as determining how to manage the capturing, storing, and sharing of sensor samples to improve AI device communication abilities using legal programming.8 For instance, when should devices be allowed to listen and talk "intelligently" to each other? When should we selectively process video footage from the home and from public spaces? Can the intelligence gathered then be used to customize AI personalities that induce you to consume more? If what we say, stored at scale, is gold, then who owns our voice samples? And what about safety? Should AI agents, for example, be allowed to prevent you from driving if they hear you sound intoxicated, and should they warn you if they detect an increased risk of depression? May devices disclose your whereabouts if there is an active search warrant for you? An open discussion of these questions could enlarge the VNS standardization effort for the benefit of all.
12. Subirana, B. et al. Hi Sigma, do I have the Coronavirus?: Call for a New Artificial Intelligence Approach to Support Health Care Professionals Dealing With The COVID-19 Pandemic. arXiv preprint arXiv:2004.06510 (2020).
b. We developed a skill in our lab, and in July 2017 Alexa's parsing surprisingly changed: the phrase "Alexa, Target shopping list add soap" went from adding the item to the skill's list to adding it directly to Amazon's shopping list.
c. For example, some consumers reject the courtesy behavior exhibited by dominant devices: https://bit.ly/36yYEEi
e. "OK Google" would be like the first part of the URL, that is, "http://"; "Lidl" like "www.lidl.com"; and "Open Shopping list" like "shopping-list." Just as there are "http" and "ftp," one could imagine different voice services. These could include a "local skill mode" and a "single-shot" transfer mode that sends a command but does not transfer subsequent speech commands to "Lidl." They could also include a "multiple-shot" transfer mode that keeps you in the skill for multiple interactions.
g. There are a number of initiatives already under way, but none yet with the scope we suggest. Amazon, for example, enabled Echo to wake Microsoft Cortana (The Washington Post, August 16, 2018) and later announced a voice effort with a few more partners (The Verge, September 16, 2019). A few retailers are behind the Open Voice Network (www.openvoicenetwork.org), a Linux Foundation initiative to standardize voice based on the MIT research of my lab, which has long-term objectives similar to the ones suggested here. W3C has a few groups interested in the topic too. Last September, Apple introduced "voice over," a way for users to control via speech any application on iOS screens. Inspired by the MIT Center for Brains, Minds and Machines' four-module model of the human brain, MIT is introducing reference architectures for the VNS through the MIT Auto-ID Lab Open Voice initiative.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2020 ACM, Inc.