Screen Recognition Makes Apps More Accessible

Using Apple's VoiceOver algorithm. — Missing semantic metadata is generated at runtime by Screen Recognition and spoken to visually challenged users through a built-in smartphone text-to-voice algorithm called VoiceOver: a single-tap makes the metadata accessible (the smartphone speaks “Brow

Sighted people can recognize the different elements of an app or Web page just by looking at them, which the blind and visually impaired obviously cannot. For years, the best the technology industry has been able to offer users with visual impairments was specialized assistive technology that rendered text and image content as speech or Braille.

At ACM's virtual Conference on Human Factors in Computing Systems (CHI 2021) in May, however, Apple described its Screen Recognition algorithm, which identifies the metadata needed to communicate the functions of any app to the visually impaired.

Screen Recognition is a run-time algorithm developed by Apple engineers that automatically recognizes the visual elements of apps, vocalizes their functions, locates them on a touch screen, and tells the blind how to use them, as well as how to navigate among them.

"The Screen Recognition approach can also be applied to other mobile platforms, including Android, as well as beyond mobile context, such as for desktop computers and Web user interfaces, which are similarly inaccessible to the blind," said Apple engineer Xiaoyi Zhang.

Screen Recognition works by combining machine learning with knowledge-based heuristics into a real-time model that allows it to not only recognize each screen element type, but also to group them together by function (such as side-by-side rows of social-media sharing icons), according to Zhang. Another important heuristic makes explicit the navigational flow of an app (such as naming a recipient before sending a message).
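The grouping step can be illustrated with a minimal sketch. It assumes a detector has already produced a bounding box and type for each element, and it is not Apple's implementation: the `Element` class, the `same_row` and `group_rows` functions, and the 16-point gap threshold are all hypothetical. Elements of the same type that overlap vertically and sit close together horizontally are collected into one group, as with a row of sharing icons.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Element:
    """A detected screen element (hypothetical structure, for illustration only)."""
    kind: str   # e.g. "icon", "button", "text"
    x: float    # left edge, in points
    y: float    # top edge, in points
    w: float    # width
    h: float    # height


def same_row(a: Element, b: Element, max_gap: float) -> bool:
    """True when b has the same kind as a, overlaps it vertically, and sits close to its right."""
    overlap = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    gap = b.x - (a.x + a.w)
    return a.kind == b.kind and overlap > 0.5 * min(a.h, b.h) and gap <= max_gap


def group_rows(elements: List[Element], max_gap: float = 16.0) -> List[List[Element]]:
    """Group side-by-side elements of one kind; max_gap is an assumed threshold."""
    groups: List[List[Element]] = []
    # Walk left to right so the last member of each group is its rightmost element.
    for el in sorted(elements, key=lambda e: e.x):
        for group in groups:
            if same_row(group[-1], el, max_gap):
                group.append(el)
                break
        else:
            groups.append([el])
    # Return groups in top-to-bottom order.
    return sorted(groups, key=lambda g: min(e.y for e in g))


screen = [
    Element("icon", 10, 100, 40, 40),
    Element("icon", 60, 100, 40, 40),
    Element("icon", 110, 100, 40, 40),
    Element("button", 10, 300, 200, 44),
]
groups = group_rows(screen)
```

Here the three icons end up in one group and the button in another, so a screen reader could announce the icons as a single row and not as three unrelated items.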

The dataset used to create the Screen Recognition model contained 72,635 screens for training, and 5,002 screens for testing. Both blind and sighted accessibility engineers participated in coding the machine learning algorithm and in creating the heuristics that classified each element by type (button, icon, picture, text box, check box, menu, link, and so forth) and group (navigation, data input, sharing, and so forth).

"We need to get the word out that these accessibility capabilities are now available," said Chieko Asakawa, an IBM Fellow at the company's T. J. Watson Research Center, and IBM Distinguished Service Professor at Carnegie Mellon University. "For three decades, I have been doing accessibility research and development on prototypes, but including accessibility capabilities in existing popular smartphones will reach many more people. And since users can control the functions themselves, Screen Recognition could enable the visually impaired community to participate in the world to a much greater degree than they may have thought was possible."

As keynote speaker at CHI 2021, Asakawa described the history of accessibility and how Apple's Screen Recognition improvements show both users and developers that apps previously considered hopelessly unusable by the visually impaired have been made accessible, at least in iOS. She said developers can implement similar capabilities on other platforms, including smartphones using other operating systems, desktop computers, and Web sites.

Asakawa, who lost her sight in her teens, built prototype software to make Internet home pages accessible to the visually impaired. At CHI 2021, she described accessibility prototypes she has invented, including NavCog, a real-world navigation app for smartphones that enables the blind to explore the world by themselves. She also demonstrated a prototype she calls the AI Suitcase, a robotic rolling suitcase whose extended handle literally leads the blind step by step, avoiding obstacles by using Lidar to navigate indoor terrain as difficult as stairs, elevators, and escalators, and even standing in lines. Her work in this area earned her and co-author Takashi Itoh the ACM Special Interest Group on Accessible Computing (SIGACCESS ASSETS) 2013 Impact Award.

According to Asakawa, Apple's solution for the visually impaired is unique in that it builds upon VoiceOver, an accessibility algorithm already built into every iPhone. Largely ignored by sighted users, VoiceOver adds an initial semantic step to using iOS apps by providing spoken explanations for each element on an iOS screen. These semantic explanations are accessed by tapping on the screen; each time a tap lands on a button, icon, picture, text box, check box, menu, or link, VoiceOver speaks a descriptive name for the element and describes how to use it. For a button, for instance, VoiceOver speaks a name that identifies the button's function, then says how to activate that function. By systematically tapping across and down a screen, blind users can build a topological map of the screen in their minds that they can then use to access the app in much the same way as sighted users.
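The tap-to-speech behavior described above can be sketched as a simple hit test. This is an illustration under stated assumptions, not VoiceOver's actual code: the `AccessibleElement` class, the `announce` function, and the exact wording of the spoken string are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AccessibleElement:
    """A screen element with semantic metadata (hypothetical structure)."""
    label: str  # descriptive name, e.g. "Send"
    kind: str   # element type, e.g. "button"
    hint: str   # how to use the element
    x: float
    y: float
    w: float
    h: float

    def contains(self, px: float, py: float) -> bool:
        """True when the point (px, py) falls inside this element's box."""
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h


def announce(elements: List[AccessibleElement], px: float, py: float) -> Optional[str]:
    """Return the text to speak for a tap at (px, py), or None if nothing was hit."""
    # Later elements are assumed to be drawn on top, so search them first.
    for el in reversed(elements):
        if el.contains(px, py):
            return f"{el.label}, {el.kind}. {el.hint}"
    return None


screen = [
    AccessibleElement("Recipient", "text box", "Double-tap to edit.", 10, 10, 300, 44),
    AccessibleElement("Send", "button", "Double-tap to activate.", 10, 70, 100, 44),
]
spoken = announce(screen, 20, 80)
nothing = announce(screen, 500, 500)
```

Repeating `announce` for taps across and down the screen is what lets a user build up a mental map of where each element sits.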

Unfortunately, VoiceOver only works on iOS apps that strictly adhere to all the guidelines in Apple's User Interface Kit (UIKit). Many app developers do not supply the accessibility metadata that UIKit supports for visually impaired users; when those apps are accessed, VoiceOver will merely say "button" for a button, without describing its function or use. Legacy apps written before UIKit gained its semantic capabilities also do not supply the necessary metadata describing screen elements for VoiceOver.
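The gap can be made concrete with a small sketch. It assumes each element is a dictionary with an `id`, a `kind`, and an optional `label`, and that some run-time recognizer supplies guessed labels keyed by element `id`. The function names and data layout are hypothetical and do not reflect UIKit or Apple's implementation.

```python
from typing import Dict, List, Optional


def spoken_text(kind: str, label: Optional[str]) -> str:
    """With no label, only the element type can be spoken."""
    return f"{label}, {kind}" if label else kind


def fill_missing_labels(elements: List[dict], inferred: Dict[str, str]) -> List[dict]:
    """Return copies of the elements, filling absent labels from run-time guesses.

    Labels supplied by the developer are kept; guesses are used only as a fallback.
    """
    return [{**el, "label": el["label"] or inferred.get(el["id"])} for el in elements]


elements = [
    {"id": "btn1", "kind": "button", "label": None},    # developer supplied no metadata
    {"id": "btn2", "kind": "button", "label": "Send"},  # developer supplied a label
]
inferred = {"btn1": "Back"}  # hypothetical recognizer output

before = [spoken_text(el["kind"], el["label"]) for el in elements]
after = [spoken_text(el["kind"], el["label"]) for el in fill_missing_labels(elements, inferred)]
```

Before the fill, the first element is announced only as "button"; afterward it carries a usable name, while the developer-supplied label on the second element is left unchanged.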

Prior screen-recognition efforts aimed at merely improving specific apps without giving the user control over fine-tuning accessibility capabilities, making them useful only for a subset of the visually impaired community, according to accessibility expert Jennifer Mankoff, Richard E. Ladner Endowed Professor and Associate Director for Diversity and Inclusion in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, in Seattle.

"Solutions to app inaccessibility can be problematic," said Mankoff, adding that Apple's Screen Recognition app "shows that they can hold promise."

VoiceOver and Screen Recognition together supply over 200 user-selectable fine-tuning accessibility options. Zhang says Apple aims to improve accessibility further by automatically adding the semantic metadata required for use by the blind to many more of the millions of apps available under iOS, iPadOS, and macOS. That goal has not yet been achieved, but Zhang says he and his Apple engineering colleagues are committed to pursuing it. Their guiding principle for accessibility algorithms is that they work at run-time, rather than requiring prior human tweaking app by app, allowing visually impaired users to fine-tune any app they select so that it works best for them.

Screen Recognition does not yet make every app usable by the blind, Asakawa acknowledges (most games are a notable exception), but by putting the user in charge of fine-tuning its capabilities, it can make many apps more usable to the visually impaired today. More importantly, Asakawa hopes that the usability of apps previously thought impossible to make accessible will inspire a landslide of accessibility improvements by other smartphone, desktop computer, and website developers.

R. Colin Johnson is a Kyoto Prize Fellow who has worked as a technology journalist for two decades.