Apple pioneered the voice revolution in 2011 with the introduction of Siri on its iPhone 4S. Today, you tell your iPhone 11, “Hey Siri, play Bruce Springsteen on Spotify,” and it responds, “I can’t talk to Spotify, but you can use Apple Music instead,” politely displaying options on the screen,a as shown in the figure here. Or, you tell one of your five Amazon Echo devices at home, “Alexa, add pumpkin pie to my Target shopping list,”b then “order AA Duracell batteries,” and it adds pumpkin pie and Amazon Basics batteries to your Amazon shopping cart, ignoring your request to shop at Target and to stay loyal to Duracell. You are the consumer, but your choices have been ignored.
Or, consider that you are a brand manager. You want to customize the voice of the Echo to match your brand persona, but that option is not offered: Amazon only lets users change the default female voice to a male voice and a few other options. Instead, you decide to create a personalized assistant. However, unlike the leading technology companies (the so-called FAANGs—Facebook, Apple, Amazon, Netflix, and Google), your company does not have the thousands of engineers and vast advertising budgets needed to develop it.
These examples suggest artificial intelligence (AI) environments are evolving toward more limited choice. However, we believe an open and standardized approach would benefit most players, including the FAANGs themselves. In fact, this approach may be urgent, given the rapid growth in adoption and the increasing complexity facing users and skill suppliers.
Large companies are rushing toward the rapidly growing voice opportunity, introducing mutually incompatible smart speakers spanning many markets and languages. Each has its own “wake construct,” which we define as the utterance that activates a skill. Typically, it consists of a “Wake Word” followed by a skill name and associated parameters. It may also include input from face recognition or other sensors, such as when you activate Siri on your Apple Watch by raising your wrist. Incompatibility among devices from different suppliers is pervasive and may stem from any of the wake construct’s components.
For example, if you say “Alexa, Bedtime” to your Echo, it starts a skill operated by Johnson & Johnson, while if you tell Google Home, “OK Google, Bedtime,” the device initiates a different routine, owned by Google itself. Sometimes the behavior is similar on both devices but the service operator is different—for example, saying “sleep sounds” to your Echo or Google Home will initiate two different skills related to sleep and relaxation sounds. There is no central, standard repository of names to prevent this sort of inconsistency, such as one where a shop owner could reserve an action name across devices and languages.
For skill developers, devices are also incompatible and may have language-specific skill programming options. This means that porting P&G’s Tide Alexa Skill to all combinations of devices and languages would require maintaining hundreds, and eventually thousands, of different versions. Even then, the customer-facing experience would not be consistent across devices and languages, which could confuse customers. Creating a somewhat consistent user experience across this myriad of options would require the sort of budget that is unjustifiable for small businesses. Even P&G’s Tide skill, introduced several years ago, is still available only in English.
Wake Neutrality
Amazon’s and Apple’s business aggressiveness—their choice to route product requests to their own services while developing incompatible closed-garden solutions—makes perfect sense for large companies trying to establish dominance in this competitive space. However, any advantages from these aggressive, incompatible offerings may be offset by raised entry costs for skill suppliers such as P&G, antitrust violations affecting retailers large and small,5 or ignored consumer preferences.c
Incompatibility in the voice space contrasts with markets that enjoy wake neutrality. In these markets, such as the Web, the phone network, barcodes, or even WiFi networking, there are standard ways of activating services.9 This gives consumers a consistent experience across more choices and at lower entry costs. On the Web, one can type “www.kohls.com” and trust that the same retailer will be reached, no matter the browser, WiFi network, or device OS. Toscanini’s Ice Cream’s phone number is the same, no matter the phone maker, calling app, or network provider. The Internet has the Domain Name System (DNS),d which guarantees that each name is unique, and phone networks have the North American Numbering Plan (NANP), which does the same for phone numbers.
Similarly, we believe the voice space, and AI in general, would greatly benefit from a standard “Voice Name System” (VNS) enabling unique skill names across devices. We suggest the VNS incorporate three architectural components: “Common Wake Constructs” (CWC), “Secure Voice Channels” (SVC), and “Conversational Privacy Firewalls” (CPF); each is reviewed in turn below.
Common Wake Constructs (CWC)
Our first architectural suggestion is to implement Common Wake Constructs to standardize voice request routing. For example, the analogue of typing “http://www.lidl.com/shopping-list” could be to say “OK Google, Lidl open shopping list.”e The DNS may be a starting point, so that one could not reserve a word in the VNS without holding the corresponding DNS name, but an extra step is needed because voice is ambiguous. The identical pronunciation of the store brands “Coles” and “Kohls” requires disambiguation. This could be done geographically, by adding a distinguishing prompt such as “OK, Coles” versus “Coles,” or by a gesture such as the wrist movement in the Apple Watch example mentioned earlier. Deciding whether two names sound too similar may be tricky and require arbitration. Domain name registrars have performed similar functions within the DNS; for example, they prevent C0LES.org—with a zero “0” instead of the letter “O”—from being registered, to prevent phishing.
Figure. The iPhone displays various options for listening to Bruce Springsteen, but not the one requested.
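To make the CWC idea concrete, here is a minimal Python sketch of how a spoken wake construct such as “OK Google, Lidl open shopping list” might be parsed and resolved to a skill endpoint. The registry, endpoint URLs, and function names are assumptions for illustration; a real VNS would presumably sit on top of DNS rather than an in-memory table.

```python
# Minimal, illustrative sketch of Common Wake Construct (CWC) resolution.
# The registry and endpoints are hypothetical; a real VNS would likely be
# backed by DNS rather than an in-memory dictionary.
from dataclasses import dataclass

# Toy registry mapping a reserved skill name to a service endpoint,
# mirroring how a DNS name maps to an address.
VNS_REGISTRY = {
    "lidl": "https://vns.lidl.com/skills",   # assumed endpoint
    "tide": "https://vns.tide.com/skills",   # assumed endpoint
}

WAKE_WORDS = {"ok google", "alexa", "hey siri"}

@dataclass
class WakeConstruct:
    wake_word: str    # e.g., "ok google"
    skill: str        # e.g., "lidl"
    action: str       # e.g., "open shopping list"

def parse_wake_construct(utterance: str) -> WakeConstruct:
    """Split an utterance into wake word, skill name, and action."""
    text = utterance.lower().strip()
    for wake in WAKE_WORDS:
        if text.startswith(wake):
            rest = text[len(wake):].strip(" ,")
            skill, _, action = rest.partition(" ")
            return WakeConstruct(wake, skill, action)
    raise ValueError("no known wake word found")

def resolve(construct: WakeConstruct) -> str:
    """Look up the skill's endpoint, much as DNS resolves a hostname."""
    try:
        return VNS_REGISTRY[construct.skill]
    except KeyError:
        raise LookupError(f"skill '{construct.skill}' is not registered")

if __name__ == "__main__":
    cwc = parse_wake_construct("OK Google, Lidl open shopping list")
    print(cwc.skill, "->", resolve(cwc))   # lidl -> https://vns.lidl.com/skills
```

A real registry would, of course, also need the registration-time disambiguation discussed next.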
Disambiguating voice is more complex than disambiguating text, and it may lead to errors because computer speech recognition is not 100% accurate. The VNS could address error correction by using other sensors, requiring the user to spell out a word, or requiring verification on a different device. To decide whether two speech utterances resemble each other too closely, phoneme matches could determine whether the probability of a phishing attack exceeds a threshold before a voice domain is granted. This probability could be established automatically with deep learning, for example by running an open source speech-to-text model such as Baidu’s, which outputs probabilities in its final layer, over one of the public repositories of speech samples.
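As a rough illustration of such a registration-time check, the sketch below compares two candidate names using a crude phonetic normalization and an edit-distance ratio. The normalization, the 0.8 threshold, and the function names are placeholders; as noted above, a deployed VNS would more plausibly rely on phoneme probabilities from a trained speech-to-text model.

```python
# Illustrative registration-time check: reject a new voice name if it sounds
# too close to one already registered. The phonetic encoding and the 0.8
# threshold are placeholders, not a production design.
from difflib import SequenceMatcher

def rough_phonetic(name: str) -> str:
    """Very crude phonetic normalization (stand-in for real phoneme output)."""
    name = name.lower()
    for src, dst in [("0", "o"), ("ck", "c"), ("kh", "c"), ("k", "c"), ("ph", "f"), ("h", "")]:
        name = name.replace(src, dst)
    if not name:
        return ""
    # Keep the first character, then drop vowels, a Soundex-style simplification.
    return name[0] + "".join(ch for ch in name[1:] if ch not in "aeiou")

def confusion_score(a: str, b: str) -> float:
    """Similarity of the two names' phonetic forms, in [0, 1]."""
    return SequenceMatcher(None, rough_phonetic(a), rough_phonetic(b)).ratio()

def may_register(candidate: str, registered: list, threshold: float = 0.8) -> bool:
    """Grant the voice domain only if no existing name is confusably close."""
    return all(confusion_score(candidate, name) < threshold for name in registered)

print(confusion_score("Kohls", "Coles"))          # high: likely rejected
print(may_register("Coles", ["Kohls", "Tide"]))   # False under this toy threshold
```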
A CWC as proposed here is feasible with today’s technology. So is a standard that also includes, as CWCs, basic command phrases such as “<raise volume>” or “<play> Bruce Springsteen <in> Spotify.” Together with one of my students, I have created Huey, a CWC programming language based on human natural language.11,13
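Purely as an illustration, and not Huey’s actual syntax, the following sketch suggests how bracketed command templates of this kind could be matched against free-form utterances.

```python
# Illustrative matching of a spoken request against bracketed command
# templates such as "<play> {item} <in> {service}". This is not Huey's
# actual syntax; it only suggests how natural-language CWC phrases
# could be handled.
import re

def template_to_regex(template: str) -> re.Pattern:
    """Turn '<play> {item} <in> {service}' into a regular expression."""
    pattern = template.replace("<", "").replace(">", "")       # keywords stay literal text
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>.+)", pattern)     # slots become named groups
    return re.compile("^" + pattern + "$", re.IGNORECASE)

TEMPLATES = {
    "play_in": template_to_regex("<play> {item} <in> {service}"),
    "raise_volume": template_to_regex("<raise volume>"),
}

def match_command(utterance: str):
    """Return (template_name, slot_values) for the first matching template."""
    for name, regex in TEMPLATES.items():
        m = regex.match(utterance.strip())
        if m:
            return name, m.groupdict()
    return None, {}

print(match_command("play Bruce Springsteen in Spotify"))
# ('play_in', {'item': 'Bruce Springsteen', 'service': 'Spotify'})
print(match_command("raise volume"))
# ('raise_volume', {})
```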
Secure Voice Channels (SVC)
Because voice can be very insecure, devices currently sidestep security issues by allowing only simple use cases, such as ordering a cab or a pizza, in private environments like the car, home, or office. In fact, researchers have shown existing wake algorithms can be fooled by sound sequences inaudible to the human ear, even to the point of forcing the transcription of potentially any desired phrase.2
To expand security options, we propose designing Secure Voice Channels by adding a security layer to smart speakers, analogous to “secure HTTP” (HTTPS), so that selected CWCs require a more secure process to be activated. In public spaces, this could be based on answering a “trick question” or on pressing a specific “secure” button associated with the device, much as we do today with a car key. The VNS could also blacklist spaces where harmful voice intrusions are known to occur.
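A minimal sketch of such a gate, with assumed policy lists and a placeholder confirmation step, might look as follows.

```python
# Illustrative Secure Voice Channel (SVC) check: certain skills are marked as
# requiring a stronger confirmation before activation, loosely analogous to a
# browser insisting on HTTPS for sensitive pages. Names and policies are
# assumptions for this sketch only.

SECURE_SKILLS = {"bank_transfer", "unlock_front_door"}   # assumed policy list
BLOCKED_LOCATIONS = {"public_kiosk"}                     # e.g., known intrusion spots

def confirmed_out_of_band(skill: str) -> bool:
    """Placeholder for a trick question, a physical button press, or a
    confirmation on a second device; always denies in this sketch."""
    print(f"Secure confirmation required for '{skill}'.")
    return False

def activate(skill: str, location: str) -> str:
    """Run a skill only if the SVC policy for this skill and space allows it."""
    if location in BLOCKED_LOCATIONS:
        return "refused: voice commands disabled in this space"
    if skill in SECURE_SKILLS and not confirmed_out_of_band(skill):
        return "refused: secure channel not established"
    return f"running skill '{skill}'"

print(activate("sleep_sounds", "home"))           # ordinary skill, runs directly
print(activate("bank_transfer", "home"))          # needs the extra confirmation
print(activate("bank_transfer", "public_kiosk"))  # blacklisted space
```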
Conversational Privacy Firewalls (CPF)
A Burger King TV ad woke nearby Google devices when it stated, “OK Google, what’s a Whopper?” The Cannes Lions International Festival of Creativity singled out this Burger King ad as the most intrusive advertisement ever because it triggered a follow-on ad skill in each home and informed Google which homes were watching. Hackers changed the skill to be harmful, and Google had to pull it immediately.f The potential for privacy violations is unprecedented because short audio segments recorded by smart speakers can carry Personally Identifiable Information (PII), including gender, race, mood, alcohol intake, and even personality disorders.6
Automated analysis of speech has been shown to detect the onset of psychosis in young adults even before human experts can. In my lab, we recently used AI to diagnose Alzheimer’s with only 20 seconds of speech and identified longitudinal biomarkers to track disease progression, achieving a spontaneous-speech detection rate of 93.8%, the highest reported so far.7 Similarly, we used forced-cough recordings to identify COVID-19 subjects12 and demonstrated that a single cough can reveal cultural and biological information.4
Our third suggestion, Conversational Privacy Firewalls (CPF), is an architectural block that filters input and output voice signals, limiting the amount of PII available to intervening players. Depending on what type of PII one wants to protect, a different CPF filter mode may be appropriate. Here are some examples (a minimal sketch of how such filters could compose follows this list):
- “Speech Incognito Mode”: This converts speech to text, so that skill suppliers don’t receive any information contained in the soundwave other than text.
- “Vision Incognito Mode”: For devices such as the Echo Look, the filter could transform images to prevent proper face recognition, while blocking PII about gender or race.
- “Alexander Mode”: This mode takes speech and converts it into commands using a synthesized robot voice, spacing requests when possible. This ensures that neither the voice, the location, nor the sequence of commands is shared.
- “Strong Incognito Mode”: Evidence shows our choice of words conveys a lot of PII.3 This mode would convert “I desperately need two tickets for Sunday’s baseball match and would pay any amount” into neutral text, ensuring all a service provider receives is a request in “neutral English” with limited location and user-sentiment information, such as, “Are there any tickets available for Sunday’s match?” This type of incognito mode could eventually give rise to a new form of “Esperanto” for AI devices.
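Here is a minimal sketch of how two of these modes could be composed on the device, assuming placeholder transcription and rephrasing functions; the hard natural-language work is precisely what a real CPF would have to solve.

```python
# Illustrative Conversational Privacy Firewall (CPF): each mode is a filter
# applied on-device before a request is forwarded to a skill supplier.
# transcribe() and rephrase_neutrally() are stand-ins for real speech and
# NLP components, not actual APIs.
from typing import Callable

def transcribe(audio: bytes) -> str:
    """Stand-in for on-device speech-to-text."""
    return "I desperately need two tickets for Sunday's match and would pay any amount"

def rephrase_neutrally(text: str) -> str:
    """Toy stub; a real CPF would need genuine natural-language rewriting."""
    if "tickets" in text.lower():
        return "Are there any tickets available for Sunday's match?"
    return text

def speech_incognito(audio: bytes) -> str:
    """Forward text only: the raw soundwave (voice, mood, identity) never leaves."""
    return transcribe(audio)

def strong_incognito(audio: bytes) -> str:
    """Also strip sentiment and urgency, keeping only the bare request."""
    return rephrase_neutrally(transcribe(audio))

def forward_to_skill(audio: bytes, mode: Callable[[bytes], str]) -> str:
    """Apply the selected CPF mode, then send only the filtered result."""
    return mode(audio)

print(forward_to_skill(b"<raw audio>", speech_incognito))
print(forward_to_skill(b"<raw audio>", strong_incognito))
```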
Call to Action Toward a Voice Name System (VNS)
Here are two suggestions for short-term actions to begin establishing the VNS standard:g
- Defining Roles: Implement the first version of the VNS, which would include CWC, SVC, and CPF, using existing open source software, and host it with an existing non-profit organization. Non-open-source development choices are also possible and may coexist; for example, large players could continue to grow closed-garden solutions while sharing key ingredients of CWC, SVC, and CPF. The competitive landscape, including the role of each constituent, must be clarified, and existing standards bodies, such as the W3C, IEEE, or GS1, may need to get involved.
- Setting Community Objectives: For a first version of the VNS to be used widely, agreement is needed on which application area comes first. This will help settle technical choices such as file formats, routing protocols, and command syntax. We suggest working toward a first version of the VNS that allows you to talk to any Web page, phone app, or smart speaker. Subsequent work would pursue a new version that allows you to converse with any object in the world, so that you can ask the tomato sauce you are holding in your hand, “Am I allergic to you?” This would require combining natural language processing with other AI modalities such as high-level computer vision, gaze detection, SSVEP brain sensing, or gesture recognition, and could imply interfacing with the Internet of Things (IoT).6 In addition to the VNS, we may need an Artificial Intelligence Name System (AINS).
Eventually, more difficult choices will have to be made, such as determining how to manage the capture, storage, and sharing of sensor samples to improve AI devices’ communication abilities using legal programming.8 For instance, when should devices be allowed to listen and talk “intelligently” to each other? When should we selectively process video footage from the home and from public spaces? Can the intelligence gathered then be used to customize AI personalities that induce you to consume more? If what we say, stored at scale, is gold, then who owns our voice samples? And what about safety? Should AI agents, for example, be allowed to prevent you from driving if they hear you sound intoxicated, and should they warn you if they detect an increased risk of depression? Should devices be allowed to disclose your whereabouts if there is an active search warrant for you? An open discussion of these questions could enlarge the VNS standardization effort for the benefit of all.