(Computer) Vision Without Sight

Computer vision holds the key for the blind or visually impaired to explore the visual world.

Posted Jan 1 2012

More than 20 million people in the U.S. live with visual impairments ranging from difficulty seeing, even with eyeglasses, to complete blindness. Vision loss affects almost every activity of daily living. Walking, driving, reading and recognizing objects, places, and people become difficult or impossible without vision. Thus, technology that can assist visually impaired (VI) people in at least some of these tasks may have a very relevant social impact.

Key Insights

Computer vision and mobile computing are powerful tools with great potential to enable a range of assistive technologies for the growing population of blind and visually impaired people.
The actual needs of this population must drive the development of this technology in order for it to be truly useful and likely to be widely adopted. To this end, blind and visually impaired users must be involved at all stages of design, research, and development.
Particular attention must be paid to the development of appropriate user interfaces for this technology that respect the needs of blind and visually impaired users.

Research in assistive technology for VI people has resulted in some very useful hardware and software tools in widespread use. The most successful products to date include text magnifiers and screen readers, Braille note takers, and document scanners with optical character recognition (OCR). This article focuses specifically on the use of computer vision systems and algorithms to support VI people in their daily tasks. Computer vision seems like a natural choice for these applications—in a sense, replacing the lost sense of sight with an “artificial eye.” Yet, in spite of the success of computer vision technology in several other fields (such as robot navigation, surveillance, and user interface), very few computer vision systems and algorithms are currently employed to aid VI people.

In this article, we review current research work in this field, analyze the causes of past failed experiences, and propose promising research directions marrying computer vision and assistive technology for the VI population. Our considerations stem in large part from our own direct experience developing technology for VI people, and from conducting the only specific workshop on Computer Vision Applications for the Visually Impaired, which was held in 2005 (San Diego), 2008 (Marseille), and 2010 (San Francisco).

The VI Population

The VI community is very diverse in terms of degree of vision loss, age, and abilities. It is important to understand the various characteristics of this population if one is to design technology that is well fit to its potential “customers.” Here is some statistical data made available by the American Foundation for the Blind: Of the 25 or more million Americans experiencing significant vision loss, about 1.3 million are legally blind (meaning their visual field in their best eye is 20 degrees or less or their acuity is less than 20/200), and only about 290,000 are totally blind (with at most some light perception). Since the needs of a low-vision person and of a blind person can be very different, it is important not to overgeneralize the nature of visual impairment. Another important factor to be considered is the age of a VI person. Vision impairment is often due to conditions such as diabetic retinopathy, macular degeneration, and glaucoma that are prevalent at later age. Indeed, about one-fourth of those reporting significant vision loss are 65 years of age and older. It is important to note that multiple disabilities in addition to vision loss are also common at later age (such as hearing impairment due to presbycusis or mobility impairment due to arthritis). In regard to the younger U.S. population, about 60,000 individuals 21 years old or younger are legally blind. Of these, fewer than 10% use Braille as their primary reading medium.

Application Areas

Mobility. In the context of assistive technology, mobility takes the meaning of “moving safely, gracefully, and comfortably;”³ it relies in large part on perceiving the properties of the immediate surroundings, and it entails avoiding obstacles, negotiating steps, drop-offs, and apertures such as doors, and maintaining a possibly rectilinear trajectory while walking. Although blind people are more in need of mobility aids, low-vision individuals may also occasionally trip over unseen small obstacles or steps, especially in poor lighting conditions.

The most popular mobility tool is the white cane (known in jargon as the long cane), with about 110,000 users in the U.S. The long cane allows one to extend touch and to “preview” the lower portion of the space in front of oneself. Dog guides may also support blind mobility, but have many fewer users (only about 7,000 in the U.S.). A well-trained dog guide helps maintain a direct route, recognizes and avoids obstacles and passageways that are too narrow to go through, and stops at all curbs and at the bottom and top of staircases until told to proceed. Use of a white cane or of a dog guide publicly identifies a pedestrian as blind, and carries legal obligations for nearby drivers, who are required to take special precautions to avoid injury to such a pedestrian.

A relatively large number of devices meant to provide additional support, or possibly to replace the long cane and the dog guide altogether, have been proposed over the past 40 years. Termed Electronic Travel Aids or ETAs,³ these devices typically utilize different types of range sensors (sonars, active triangulation systems, and stereo vision systems). Some ETAs are meant to simply give an indication of the presence of an obstacle at a certain distance along a given direction (clear path indicators). A number of ETAs are mounted on a long cane, thus leaving one of the user’s hands free (but at the expense of adding weight to the cane and possibly interfering with its operation). For example, the Nurion Laser Cane (no longer in production) and the Laser Long Cane produced by Vistac use three laser beams to detect (via triangulation) obstacles at head-height level, while the UltraCane (formerly BatCane) produced by Sound Foresight uses sonar on a regular cane to detect obstacles up to head height. ETAs of a different type, such as the Sonic Pathfinder (worn as a special spectacle frame) and the Bat K-Sonar (mounted on a cane), use one or more ultrasound transducers to provide the user with something closer to a “mental image” of the scene (such as the distance and direction of an obstacle and possibly some physical characteristics of its surface).

In recent years, a number of computer vision-based ETAs have been proposed. For example, a device developed by Yuan and Manduchi⁴⁰ utilizes structured light to measure distances to surfaces and to detect the presence of a step or a drop-off at a distance of a few meters. Step and curb detection can also be achieved via stereo vision.²⁵ Range data can be integrated through time using a technique called “simultaneous localization and mapping” (SLAM), allowing for the geometric reconstruction of the environment and for self-localization. Vision-based SLAM, which has been used successfully for robotic navigation, has been recently proposed as a means to support blind mobility.^26,28,37 Range cameras, such as Microsoft’s popular Kinect (built on PrimeSense technology), also represent a promising sensing modality for ETAs.

Although many different types of ETAs have appeared on the market, they have met with little success among the intended users so far. Multiple factors, including cost, usability, and performance, contribute to the lack of adoption of these devices. But the main reason is likely the fact that the long cane is difficult to surpass. The cane is economical, reliable, and long lasting, and never runs out of power. Also, it is not clear whether some of the innovative features of newly proposed ETAs (longer detection range, for example) are really useful for blind mobility. Finally, presenting complex environmental features (such as the direction and distance to multiple obstacles) through auditory or tactile channels can easily overwhelm the user, who is already concentrating on using his or her remaining sensory capacity for mobility and orientation.

Neither the long cane nor the dog guide can protect the user from all types of hazards. One example is given by obstacles that are at head height (such as a propped-open window or a tree branch), and thus are beyond the volume of space surveyed by the cane. In a recent survey of 300 blind and legally blind persons,²¹ 13% of the respondents reported they experience head-level accidents at least once a month. The type of mobility aid (long cane or dog guide) does not seem to have a significant effect on the frequency of such accidents. Another type of hazard is represented by walking in trafficked areas, and in particular crossing a street. This requires awareness of the environment around oneself as well as of the flow of traffic, and good control of one’s walking direction to avoid drifting out of the crosswalk. Technology that increases the pedestrian’s safety in these situations may be valuable, such as a mobile phone system using computer vision to orient the user to the crosswalk and to provide information about the timing of Walk lights^12,13 (see Figure 1).

Wayfinding. Orientation (or wayfinding) can be defined as the capacity to know and track one’s position with respect to the environment, and to find a route to a destination. Whereas sighted persons use visual landmarks and signs in order to orient themselves, a blind person moving in an unfamiliar environment faces a number of hurdles:²⁰ accessing spatial information from a distance; obtaining directional cues to distant locations; keeping track of one’s orientation and location; and obtaining positive identification once a location is reached.

According to Loomis et al.,²⁰ there are two main ways in which a blind person can navigate with confidence in a possibly complex environment and find his or her way to a destination: piloting and path integration. Piloting means using sensory information to estimate one’s position at any given time, while path integration is equivalent to the “dead reckoning” technique of incremental position estimation, used for example by pilots and mariners. Although some blind individuals excel at path integration, and can easily retrace a path in a large environment, this is not the case for most blind (as well as sighted) persons.
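As a rough illustration of path integration, the sketch below accumulates a position estimate from a sequence of heading and distance measurements. The function name, the step representation, and the heading convention are assumptions made here for illustration; they are not taken from Loomis et al.

```python
import math

def integrate_path(steps, start=(0.0, 0.0)):
    """Dead reckoning: accumulate a position from (heading_deg, distance_m) steps.

    Headings are measured clockwise from north (a convention assumed here).
    Returns (east_m, north_m) relative to the starting point.
    """
    east, north = start
    for heading_deg, distance_m in steps:
        heading = math.radians(heading_deg)
        east += distance_m * math.sin(heading)
        north += distance_m * math.cos(heading)
    return east, north

# Walk 10 m north, 5 m east, then 10 m south: the net displacement is 5 m east.
east, north = integrate_path([(0, 10), (90, 5), (180, 10)])
```

Because each step is added to the previous estimate, any error in a heading or distance measurement carries forward into every later position, which is one reason path integration alone tends to drift over long routes.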

Path integration using inertial sensors or visual sensors has been used extensively in robotics, and a few attempts at using this technology for blind wayfinding have been reported.^9,18 However, the bulk of research on wayfinding has focused on piloting, with very promising results and a number of commercial products already available. For outdoor travelers, GPS represents an invaluable technology. Several companies offer GPS-based navigational systems specifically designed for VI people. None of these systems, however, can help the user in tasks such as “Find the entrance door of this building,” due to the low spatial resolution of GPS readings and to the lack of such details in available GIS databases. In addition, GPS is viable only outdoors. Indoor positioning systems (for example, based on multilateration from WiFi beacons) are gaining momentum, and it is expected they will provide interesting solutions for blind wayfinding.
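To make the idea of multilateration concrete, the following sketch estimates a 2D position from distances to beacons at known locations. It assumes the distances are already available (how they would be derived from WiFi signals is outside its scope) and uses a standard linearized least-squares formulation; the function name and beacon layout are hypothetical.

```python
import math

def multilaterate(beacons, distances):
    """Estimate (x, y) from distances to three or more beacons at known positions.

    Subtracting the first range equation from the others gives a linear
    system, solved here in the least-squares sense via the normal equations.
    """
    (x0, y0), d0 = beacons[0], distances[0]
    rows, rhs = [], []
    for (xi, yi), di in zip(beacons[1:], distances[1:]):
        rows.append((2 * (xi - x0), 2 * (yi - y0)))
        rhs.append(xi**2 - x0**2 + yi**2 - y0**2 + d0**2 - di**2)
    a11 = sum(a * a for a, _ in rows)
    a12 = sum(a * b for a, b in rows)
    a22 = sum(b * b for _, b in rows)
    b1 = sum(a * r for (a, _), r in zip(rows, rhs))
    b2 = sum(b * r for (_, b), r in zip(rows, rhs))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("beacons are collinear or too few to fix a position")
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

# Hypothetical beacons at the corners of a 10 m x 10 m room.
beacons = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
true_position = (3.0, 4.0)
distances = [math.dist(true_position, b) for b in beacons]
estimate = multilaterate(beacons, distances)
```

With noise-free distances the estimate coincides with the true position; with real, noisy range measurements the least-squares solution only approximates it.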

A different approach to wayfinding, one that does not require a geographical database or map, is based on recognizing (via an appropriate sensor carried by the user) specific landmarks placed at key locations. Landmarks can be active (light, radio, or sound beacons) or passive (reflecting light or radio signals). Thus, rather than absolute positioning, the user is made aware of their own relative position and attitude with respect to the landmark. This may be sufficient for a number of navigational tasks, for example, when the landmark is placed near a location of interest. For guidance to destinations that are beyond the landmark’s “receptive field” (the area within which the landmark can be detected), a route can be built as a set of waypoints that need to be reached in sequence. Contextual information about the environment can also be provided to the VI user using digital map software and synthetic speech.¹⁴
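The notion of a route as a sequence of waypoints can be sketched as follows. The landmark identifiers and message wording are hypothetical, and a real system would receive detections from a sensor rather than from a list.

```python
def follow_route(route, detections):
    """Advance through `route` (ordered landmark IDs) as landmarks are detected.

    Detections that do not match the next expected waypoint are ignored.
    Returns the messages that would be presented to the user.
    """
    next_index = 0
    messages = []
    for landmark_id in detections:
        if next_index < len(route) and landmark_id == route[next_index]:
            next_index += 1
            if next_index == len(route):
                messages.append(f"Arrived at {landmark_id}")
            else:
                messages.append(f"Reached {landmark_id}; next: {route[next_index]}")
    return messages

route = ["lobby", "elevator", "room_210"]
detections = ["lobby", "room_210", "elevator", "elevator", "room_210"]
messages = follow_route(route, detections)
```

Here the early sighting of "room_210" is ignored because the elevator has not yet been reached, which reflects the requirement that waypoints be visited in order.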

The best-known beaconing system for the blind is Talking Signs, now a commercial product based on technology developed at The Smith-Kettlewell Eye Research Institute.^a Already deployed in several cities, Talking Signs uses a directional beacon of infrared light, modulated by a speech signal. This can be received at a distance of several meters by a specialized handheld device, which also demodulates the speech signal and presents it to the user. RFID technology has also been proposed recently in the context of landmark-based wayfinding for the blind.¹⁶ Passive RFIDs are small, inexpensive, and easy to deploy, and may contain several hundreds of bits of information. The main limitation of RFID systems is their limited reading range and lack of directionality.

A promising research direction is the use of computer vision to detect natural or artificial landmarks, and thus assist in blind wayfinding. A VI person can use their own cellphone, the camera pointing forward, to search for landmarks in view. Natural landmarks are distinctive environmental features that can be detected robustly, and used for guidance either using an existing map¹¹ or by matching against possibly geotagged image datasets.^10,19 Detection is usually performed by first identifying specific key points in the image; the brightness or color image profile in the neighborhood of these key points is then represented by compact and robust descriptors. The presence of a landmark is tested by matching the set of descriptors in an image against a dataset formed by exemplar images collected offline. Note that some of this research work (for example, Hile et al.¹¹) was aimed at supporting navigation in indoor spaces for persons with cognitive impairments. Apart from the display modality, the same technology is applicable for assistance to visually impaired individuals.
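The matching step described above can be sketched as a nearest-neighbor search over descriptors. This is a minimal illustration, not the method of any cited system: the descriptors are toy three-element vectors, and the ratio test and both thresholds are assumptions borrowed from common practice.

```python
import math

def count_matches(query_descriptors, exemplar_descriptors, ratio=0.8):
    """Count query descriptors whose nearest exemplar is clearly closer
    than the second nearest (a ratio test, assumed here as the criterion)."""
    matches = 0
    for q in query_descriptors:
        dists = sorted(math.dist(q, e) for e in exemplar_descriptors)
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:
            matches += 1
    return matches

def detect_landmark(query_descriptors, exemplars, min_matches=2):
    """Return the name of the best-matching landmark, or None if no
    landmark reaches `min_matches` (a hypothetical threshold)."""
    best_name, best_count = None, 0
    for name, descriptors in exemplars.items():
        count = count_matches(query_descriptors, descriptors)
        if count > best_count:
            best_name, best_count = name, count
    return best_name if best_count >= min_matches else None

# Toy exemplar set "collected offline"; real descriptors have many more dimensions.
exemplars = {
    "door_sign": [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "elevator": [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0)],
}
query = [(0.05, 0.0, 1.0), (1.0, 0.05, 0.0), (9.0, 9.0, 9.0)]
found = detect_landmark(query, exemplars)
```

Requiring several unambiguous matches before reporting a landmark is one simple way to trade missed detections against false alarms.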

Artificial landmarks are meant to facilitate the detection process. For example, the color markers developed by Coughlan and Manduchi^5,22 (see Figure 2) are designed so as to be highly distinctive (thus minimizing the rate of false alarms) and easily detectable with very moderate computational cost (an important characteristic for mobile platforms such as cellphones with modest computing power). A similar system, designed by researchers in Gordon Legge’s group at the University of Minnesota, uses retro-reflective markers that are detected by a “Magic Flashlight,” a portable camera paired with an infrared illuminator.³³

Artificial landmarks can be optimized for easy and fast detection by a mobile vision system. This is an advantage with respect to natural landmarks, whose robust detection is more challenging. On the other hand, artificial landmarks (as well as beacons such as Talking Signs) involve an infrastructure cost: they must be installed and maintained, and represent an additional element to be considered in the overall environment design. This trade-off should be considered carefully when developing wayfinding technology. It may be argued that the additional infrastructure cost could be better justified if other communities of users in addition to the VI population would benefit from the wayfinding system. For example, even sighted individuals who are unfamiliar with a certain location (for example, a shopping mall), and cannot read existing signs (because of a cognitive impairment, or possibly because of a foreign language barrier), may find a guidance system beneficial. Under this perspective, even the signage commonly deployed for sighted travelers can be seen as a form of artificial landmarks. Automatic reading of existing signs and, in general, of printed information via mobile computer vision, is discussed next.

Printed Information Access. A common concern among the VI population is the difficulty of accessing the vast array of printed information that normally sighted people take for granted in daily life. Such information ranges from printed documents such as books, magazines, utility bills, and restaurant menus to informational signs labeling streets, addresses, and businesses in outdoor settings as well as office numbers, exits, and elevators found indoors. In addition, a variety of “non-document” information must also be read, including LED/LCD displays required for operating a host of electronic appliances such as microwave ovens, stoves, and DVD players, and barcodes or other information labeling the contents of packaged goods such as grocery items and medicine containers.

Great progress has been made in providing solutions to this problem by harnessing OCR, which has become a mature and mainstream technology after decades of development. Early OCR systems for VI users (for example, the Arkenstone Reader and Kurzweil Reading Machine) were bulky machines that required the text to be imaged using a flatbed scanner. More recent incarnations of these systems have been implemented in portable platforms such as mobile (cell) phones (for example, the KNFB reader^b) and tablets (for example, the IntelReader^c), which allow the user to point the device’s camera toward a document of interest and have it read aloud in a matter of seconds. A major challenge of mobile OCR systems for VI users is the difficulty of aiming the camera accurately enough to capture the desired document area; thus, an important feature of the KNFB user interface is that it provides guidance to the user to help him/her frame the image properly.

However, while OCR is effective for reading printed text that is clearly resolved and fills up most of the image, it is not equipped to find text in images that contain large amounts of unrelated clutter, such as an image of a restaurant sign captured from across the street. The problem of text detection and localization is an active area of research^4,29,35,36 that addresses the challenge of swiftly and reliably sorting through visual patterns to distinguish between text and non-text patterns, despite the huge variability of text fonts and background surfaces on which they are printed (for example, the background surface may be textured and/or curved) and the complications of highly oblique viewing perspectives, limited or poor resolution (due to large distances or motion blur), and low contrast due to poor illumination. A closely related problem is finding and recognizing signs²⁴ characterized by non-standard fonts and layouts that may encode important information using shape (such as stop signs and signs or logos labeling business establishments).

To the best of our knowledge, there are currently no commercially available systems for automatically performing OCR in cluttered scenes for VI users. However, Blindsight Corporation’s^d Smart Telescope SBIR project seeks to develop a system to detect text regions in a scene and present them to a partially sighted user via a head-mounted display that zooms into the text to enable him/her to read it. Mobile phone apps such as Word Lens go beyond the functionality offered by systems targeted to VI users, such as KNFB, in that they detect and read text in cluttered scenes, though these newer systems are intended for normally sighted users.

Research is under way to expand the reach of OCR beyond standard printed text to “non-document” text such as LED and LCD displays,³² which provide access to an increasingly wide range of household appliances. Such displays pose formidable challenges that make detection and reading difficult, including contrast that is often too low (LCDs) or too high (LEDs), the prevalence of specular highlights, and the lack of contextual knowledge to disambiguate unclear characters (for example, dictionaries are used in standard OCR to find valid words, whereas LED/LCD displays often contain arbitrary strings of digits).

Another important category of non-document text is the printed information that identifies the contents of packaged goods, which is vital when no other means of identification is available to a VI person (for example, a can of beans and a can of soup may feel identical in terms of tactile cues). UPC barcodes provide product information in a standardized form, and though they were originally designed for use with laser scanners, there has been growing interest in developing computer vision algorithms for reading them from images acquired by digital cameras, especially for mobile phone platforms (for example, the Red Laser app^e). Such algorithms⁸ have to cope with noisy and blurred images and the need to localize the barcode in a cluttered image (for example, taken by a VI user who has little prior knowledge of the barcode’s location on a package). Some research in this area^17,31 has specifically investigated the usability of these algorithms by VI persons, and at least one commercial system (DigitEyes^f) has been designed specifically for the VI population. Finally, an alternative approach to package identification is to treat it as an object recognition problem,³⁸ which has the benefit of not requiring the user to locate the barcode, which comprises a small portion of the entire surface of the package.
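One safeguard available to any barcode reader working from noisy images is the check digit built into the UPC-A format, which lets a decoder reject many misread digit strings. The sketch below implements the standard UPC-A check; the function names are hypothetical, and the cited algorithms are not claimed to work this way.

```python
def upc_a_check_digit(first_eleven):
    """Compute the UPC-A check digit from the first 11 digits (as a string)."""
    digits = [int(c) for c in first_eleven]
    if len(digits) != 11:
        raise ValueError("expected exactly 11 digits")
    odd_sum = sum(digits[0::2])   # 1st, 3rd, ..., 11th digits
    even_sum = sum(digits[1::2])  # 2nd, 4th, ..., 10th digits
    return (10 - (3 * odd_sum + even_sum) % 10) % 10

def is_valid_upc_a(code):
    """Return True if `code` is a 12-digit string with a consistent check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[11])

good = is_valid_upc_a("036000291452")
bad = is_valid_upc_a("036000291453")  # last digit altered
```

A decoded string that fails this test can be discarded and the user prompted to keep scanning, rather than being announced as a product.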

Object Recognition. Over the past decade, increasing research efforts within the computer vision community have focused on algorithms for recognizing generic “objects” in images. For example, the PASCAL Visual Object Classes Challenge, which attracts dozens of participants every year, evaluates competing object recognition algorithms on a number of visual object classes in challenging realistic scenes.^g Another example is Google Goggles, an online service that can be used for automatic recognition of text, artwork, book covers, and more. Other commercial examples include oMoby, developed by IQ Engines, A9’s SnapTell, and Microsoft’s Bing Mobile application with visual scanning.

Visual object recognition for assistive technology is still in its infancy, with only a few applications proposed in recent years. For example, Winlock et al.³⁸ have developed a prototype system (named ShelfScanner) for assistance to a blind person while shopping at a supermarket. Images taken by a camera carried by the user are analyzed to recognize shopping items from a known set; the user is then informed about whether any of the items in his or her shopping list is in view. LookTel, a software platform for Android phones developed by IPPLEX LLC,³⁰ performs real-time detection and recognition of different types of objects such as bank notes, packaged goods, and CD covers. The detection of doors (which can be useful for wayfinding applications) has been considered by Yang and Tian.³⁹

A Human in the Loop? The goal of the assistive technology described so far is to create the equivalent of a “sighted companion,” who can assist a VI user and answer questions such as “Where am I?” “What’s near me?” “What is this object?”

Some researchers have begun questioning whether an automatic system is the right choice for this task. Will computer vision ever be powerful enough to produce satisfactory results in any context of usage? What about involving a “real” sighted person in the loop, perhaps through crowd-sourcing? For example, the VizWiz system² uses Amazon’s Mechanical Turk to provide a blind person with information about an object (such as the brand of a can of food). The user takes a picture of the object, which is then transmitted to Mechanical Turk’s remote work force for visual analysis, and the results are reported back to the user. The NIH-funded “Sight on Call” project by the Blindsight Corporation addresses a similar application. However, rather than relying on crowdsourcing, it uses specially trained personnel interacting remotely with the visually impaired user, on the basis of video streams and GPS data taken by the user’s cellphone and transmitted to the call center.

Interfaces

Each one of the systems and algorithms described here furnishes some information (for example, the presence of an obstacle, the bearing of a landmark, or the type and brand of items on a supermarket’s shelf) that must be presented to the VI user. This communication can use any of the user’s remaining sensory channels (tactile or acoustic), but should be carefully tailored so as to provide the necessary information without annoying or tiring the user. The fact that blind persons often rely on aural cues for orientation precludes the use of regular headphones for acoustic feedback, but ear-tube earphones and bonephones³⁴ are promising alternatives. In the case of wayfinding, the most common methods for information display include: synthesized speech; simple audio (for example, spatialized sound, generated so that it appears to come from the direction of the landmark); auditory icons;⁶ haptic point interface,²³ a modality by which the user can establish the direction to a landmark by rotating a handheld device until the sound produced has maximum volume; and tactual displays such as “tappers.”²⁷
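The haptic point interface lends itself to a small sketch: the volume of the feedback sound depends on how far the device is pointing away from the landmark, and peaks when the two are aligned. The linear falloff and the `beam_width_deg` parameter below are assumptions chosen for illustration, not the mapping used in the cited work.

```python
def angular_offset(device_heading_deg, landmark_bearing_deg):
    """Signed angle from the device heading to the landmark, in [-180, 180)."""
    return (landmark_bearing_deg - device_heading_deg + 180.0) % 360.0 - 180.0

def feedback_volume(device_heading_deg, landmark_bearing_deg, beam_width_deg=60.0):
    """Volume in [0, 1]: 1 when pointing at the landmark, falling linearly
    to 0 at `beam_width_deg` away (an assumed falloff)."""
    offset = abs(angular_offset(device_heading_deg, landmark_bearing_deg))
    return max(0.0, 1.0 - offset / beam_width_deg)

aligned = feedback_volume(90.0, 90.0)
halfway = feedback_volume(0.0, 30.0)
behind = feedback_volume(0.0, 180.0)
```

Wrapping the offset into a signed range matters near north, where headings of 350 and 10 degrees are only 20 degrees apart.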

One major issue to be considered in the design of an interface is whether a rich description of the scene, or only highly symbolic information, should be provided to the user. An example of the former is the vOICe, developed by Peter Meijer, which converts images taken by a live camera to binaural sound. At the opposite end are computer vision systems that “filter” incoming images to recognize specific features, and provide the user with just-in-time, minimally invasive information about the detected object, landmark, or sign.

Usability

Despite the prospect of increased independence enabled by assistive technology devices and software, very few such systems have gained acceptance by the VI community as yet. Here, we analyze some of the issues that, in our opinion, should be taken into account when developing a research concept in this area. It is important to bear in mind that these usability issues can only be fully evaluated with continual feedback from the target VI population obtained by testing the assistive technology as it is developed.

Cosmetics, Cost, Convenience. No one (except perhaps for a few early adopters) wants to carry around a device that attracts unwanted attention, is bulky or inconvenient to wear or to hold, or detracts from one’s attire. Often, designers and engineers seem to forget these basic tenets and propose solutions that are either inconvenient (for example, interfering with use of the long cane or requiring a daily change of batteries) or simply unattractive (for example, a helmet with several cameras pointing in different directions). A forward-looking extensive discussion of design for disability can be found in the beautiful book Design Meets Disability by G. Pullin.

Cost is also an important factor determining usability. Economies of scale are hardly achievable in assistive technology given the relatively small size of the pool of potential users, and the diversity of such a population. This typically leads to high costs for the devices that do make it to the market, which may make them unaffordable by VI users who in many cases are either retired or on disability wages.

Performance. How well should a system work before it becomes viable? The answer clearly depends on the application type. Consider, for example, an ETA that informs the user about the presence of a head-level obstacle. If the system produces a high rate of false alarms, the user will quickly become annoyed and turn the system off. At the same time, the system must have a very low missed detection rate, lest users hurt themselves against undetected obstacles, possibly resulting in medical (and legal) consequences. Other applications may have less stringent requirements. For example, in the case of a cellphone-based system that helps one find a certain item in the grocery store, no harm will be caused to the user if the item is not found or if the wrong item is selected. Still, poor performance is likely to lead to users abandoning the system. Establishing functional performance metrics and assessing minimum performance requirements for assistive technology systems is still an open and highly needed research topic.
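The two error rates mentioned above can be computed from a labeled test run as in the sketch below. The counts are invented for illustration, and the function name is hypothetical; what rates are acceptable for a given application remains, as noted, an open question.

```python
def detection_rates(hits, misses, false_alarms, correct_rejections):
    """Return (missed_detection_rate, false_alarm_rate) from outcome counts.

    hits / misses: obstacle present and detected / not detected.
    false_alarms / correct_rejections: no obstacle, alarm raised / not raised.
    """
    obstacle_cases = hits + misses
    clear_cases = false_alarms + correct_rejections
    if obstacle_cases == 0 or clear_cases == 0:
        raise ValueError("need at least one case with and one without an obstacle")
    return misses / obstacle_cases, false_alarms / clear_cases

# Invented example: 50 obstacle encounters and 950 obstacle-free frames.
miss_rate, false_alarm_rate = detection_rates(
    hits=48, misses=2, false_alarms=19, correct_rejections=931
)
```

In this invented run the system misses 4% of obstacles and raises an alarm on 2% of clear frames; because clear frames vastly outnumber obstacles, even that false alarm rate would mean many more false alarms than true detections.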

Mobile Vision and Usability. The use of mobile computer vision for assistive technology imposes particular functional constraints. Computer vision requires use of one or more cameras to acquire snapshots or video streams of the scene. In some cases, the camera may be handheld, for example, when embedded in a cellphone. In other cases, a miniaturized camera may be worn by the user, perhaps attached to one’s jacket lapel or embedded in one’s eyeglasses frames. The camera’s limited field of view is an important factor in the way the user interacts with the system to explore the surrounding environment: if the camera is not pointed toward a feature of interest, this feature is simply not visible. Thus, it is important to study how a visually impaired individual, who cannot use feedback from the camera’s viewfinder, can maneuver the camera in order to explore the environment effectively. Of course, the camera’s field of view could be expanded, but this typically comes at the cost of a lower angular resolution. Another possibility, explored by Winlock et al.,³⁸ is to build a panoramic image by stitching together several images taken by pointing the camera in different directions.
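The trade-off between field of view and angular resolution can be made concrete with a little arithmetic. The sketch below assumes pixels are spread uniformly over the field of view (an approximation that ignores lens geometry), and the sign size, distance, and sensor width are invented numbers.

```python
import math

def degrees_per_pixel(fov_deg, pixels_across):
    """Average angle covered by one pixel, assuming uniform angular sampling."""
    return fov_deg / pixels_across

def pixels_on_target(target_width_m, distance_m, fov_deg, pixels_across):
    """Approximate number of pixels spanned by a target facing the camera."""
    angular_size_deg = math.degrees(2 * math.atan(target_width_m / (2 * distance_m)))
    return angular_size_deg / degrees_per_pixel(fov_deg, pixels_across)

# A 30 cm sign seen from 5 m with a 640-pixel-wide sensor.
narrow = pixels_on_target(0.3, 5.0, fov_deg=60.0, pixels_across=640)
wide = pixels_on_target(0.3, 5.0, fov_deg=120.0, pixels_across=640)
```

Doubling the field of view with the same sensor halves the number of pixels on the sign (about 37 versus about 18 here), which may be the difference between a readable and an unreadable target.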

It should be noted that, depending on the camera’s shutter speed (itself determined by the amount of light in the scene), pictures taken by a moving camera may be blurred and difficult or impossible to decipher. Thus, the speed at which the user moves the camera affects recognition. Another important issue is the effective frame rate, that is, the number of frames per second that can be processed by the system. If the effective frame rate is too low, visual features in the environment may be missed if the user moves the camera too fast in the search process. For complex image analysis tasks, images can be sent to a remote server for processing (for example, the LookTel platform³⁰), in which case the speed and latency are determined by the communication channel. Hybrid local/remote processing approaches, with scene or object recognition performed on a remote server and fast visual tracking of the detected feature performed by the cellphone, may represent an attractive solution for efficient visual exploration.
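Both effects of camera motion can be estimated with a simplified model, sketched below. It assumes a pure panning motion at constant angular speed and uniform angular sampling; the `overlap` parameter (the fraction of the view shared by consecutive processed frames) and all numbers are assumptions for illustration.

```python
def motion_blur_pixels(pan_speed_deg_s, exposure_s, deg_per_pixel):
    """Approximate blur length in pixels for a camera panning during exposure."""
    return pan_speed_deg_s * exposure_s / deg_per_pixel

def max_pan_speed_deg_s(fov_deg, frames_per_second, overlap=0.5):
    """Fastest pan for which consecutive processed frames still share
    the fraction `overlap` of the field of view."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return fov_deg * (1.0 - overlap) * frames_per_second

# Panning at 90 degrees/s with a 1/30 s exposure and 0.09375 degrees per pixel.
blur = motion_blur_pixels(90.0, 1.0 / 30.0, 0.09375)
# A 60-degree camera whose frames are processed at 5 frames per second.
pan_limit = max_pan_speed_deg_s(60.0, 5.0)
```

Under these assumptions the blur spans about 32 pixels, while the frame rate alone would allow panning at up to 150 degrees per second; in this example blur, not coverage, is the tighter constraint on how fast the user can sweep the camera.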

Thus, a mobile vision system for assistive technology is characterized by the interplay between camera characteristics (field of view, resolution), computational speed (effective achievable frame rate for a given recognition task), and user interaction (including the motion pattern used to explore the scene, possibly guided by acoustic or tactile feedback).

Preliminary research work has explored the usability of such systems for tasks such as wayfinding²² and access to information embedded in bar codes.^17,31

Conclusion

Advances in mobile computer vision hold great promise for assistive technology. If we can teach computers to see, they may become a valuable support for those of us whose sight is compromised or lost. However, decades-long experience has shown that creating successful assistive technology is difficult. Far too often, engineers have proposed technology-driven solutions that either do not directly address the actual problems experienced by VI persons, or that are not satisfactory in terms of performance level, ease of use, or convenience. Assistive technology is a prime example of user-centered technology: the needs, characteristics, and expectations of the target population must be understood and taken into account throughout the project, and must drive all of the design choices, lest the final product result in disappointment for the intended user and frustration for the designer. Our hope is that a new generation of computer vision researchers will take on the challenge, arm themselves with enough creativity to produce innovative solutions, and have the humility to listen to the persons who will use this technology.

In closing, we would like to propose a few novel and intriguing application areas that in our opinion deserve further investigation by the research community.

Independent Wheeled Mobility. One dreaded consequence of progressive vision loss (for example, due to an age-related condition) is the ensuing loss of driving privileges. For many individuals, this is felt as a severe blow to their independence. Alternative means of personal wheeled mobility that do not require a driving license could be very desirable to active individuals who still have some degree of vision left. For example, some low-vision persons reported good experiences using the two-wheel Segway, driven on bicycle lanes.¹ These vehicles could be equipped with range and vision sensors to improve safety, minimizing the risk of collisions and ensuring that the vehicle remains within a marked lane. With the recent emphasis on sensors and machine intelligence for autonomous cars in urban environments, it is only reasonable that the VI community should soon benefit from these technological advances.

Blind Photography. Many people find it surprising that people with low vision or blindness enjoy photography as a recreational activity. In fact, a growing community of VI photographers takes and shares photos of family and friends, of objects, and of locations they have visited; some have elevated the practice of photography to an art form, transforming what would normally be considered a challenge (the visual impairment) into an opportunity for creativity. There are numerous Web sites (for example, http://blindwithcameraschool.org), books, and art exhibitions focused on this subject, which could present an interesting opportunity for computer vision researchers. A variety of computer vision techniques such as face detection, geometric scene analysis and object recognition could help a VI user correctly orient the camera and frame the picture. Such techniques, when coupled with a suitable interface, could provide a VI person with a feedback mechanism similar to the viewfinder used by sighted photographers.
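As a rough illustration of the viewfinder-like feedback mentioned above, the sketch below turns a detected face location into a short verbal hint. It is a minimal example under stated assumptions: the face box is taken to come from some face detector (not shown), the function name, the tolerance value, and the wording of the hints are all hypothetical, and a real interface would need to be designed and tested with VI users.

```python
def framing_hint(face_box, frame_size, tolerance=0.15):
    """Return a short hint for centering a detected face in the frame.

    face_box:   (x, y, w, h) in pixels with a top-left origin; assumed to be
                the output of some face detector.
    frame_size: (width, height) of the image in pixels.
    tolerance:  allowed offset of the face center from the image center,
                as a fraction of the frame size (an arbitrary choice here).
    """
    x, y, w, h = face_box
    frame_w, frame_h = frame_size
    # Normalized offset of the face center from the image center.
    dx = (x + w / 2) / frame_w - 0.5  # positive: face is right of center
    dy = (y + h / 2) / frame_h - 0.5  # positive: face is below center

    hints = []
    if dx > tolerance:
        hints.append("turn right")
    elif dx < -tolerance:
        hints.append("turn left")
    if dy > tolerance:
        hints.append("tilt down")
    elif dy < -tolerance:
        hints.append("tilt up")
    return ", ".join(hints) if hints else "hold steady"


if __name__ == "__main__":
    # Face near the right edge of a 640x480 frame.
    print(framing_hint((500, 200, 100, 100), (640, 480)))
```

The hint could be delivered through speech, tones, or vibration; which modality works best is a usability question rather than a vision one.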

Social Interaction. Blindness may, among other things, affect one’s interpersonal communication skills, especially in scenarios with multiple persons interacting (for example, in a meeting). This is because communication in these situations is largely nonverbal, relying on cues such as facial expressions, gaze direction, and other forms of the so-called “body language.” Blind individuals cannot access these nonverbal cues, leading to a perceived disadvantage that may result in social isolation. Mobile computer vision technology may be used to capture and interpret visual cues from other persons nearby, thus empowering the VI user to participate more actively in the conversation. The same technology may also help a VI person become aware of how he or she is perceived by others. A survey conducted with 25 visually impaired persons and two sighted specialists¹⁵ has highlighted some of the functionalities that would be most desirable in such a system. These include: understanding whether one’s personal mannerisms may interfere with social interactions with others; recognizing the facial expressions of other interlocutors; and knowing the names of the people nearby.

Assisted Videoscripting. Due to their overwhelmingly visual content, movies are usually considered inaccessible to blind people. In fact, a VI person may still enjoy a movie from its soundtrack, especially in the company of friends or family. In many cases, though, it is difficult to correctly interpret ongoing activities in the movie (for example, where the action is taking place, which characters are currently in the scene and what they are doing) from the dialogue alone. In addition, many relevant nonverbal cues (such as the facial expression of the actors) are lost. Videodescription (VD) is a technique meant to increase accessibility of existing movies to VI persons by adding a narration of key visual elements, which is presented to the listener during pauses in the dialogue. Although the VD industry is fast growing, due to increasing demand, the VD generation process is still tedious and time consuming. This process, however, could be facilitated by the use of semiautomated visual recognition techniques, which have been developed in different contexts (such as surveillance and video database indexing). An early example is VDManager,⁷ a VD editing software tool, which uses speech recognition as well as key-places and key-faces visual recognition.

Acknowledgments

Roberto Manduchi was supported by the National Science Foundation under Grants IIS-0835645 and CNS-0709472. James Coughlan was supported by the National Institutes of Health under Grants 1 R01 EY018345-01, 1 R01 EY018890-01 and 1 R01 EY01821001A1.

Figures

Figure 1. Crosswatch system for providing guidance to VI pedestrians at traffic intersections. (a) Blind user “scans” the crosswalk by panning cellphone camera left and right, and system provides feedback to help user align himself to crosswalk before entering it. (b) Schematic shows that system announces to user when the Walk light is illuminated.

Figure 2. Experiments with a blind user searching for a landmark (represented by a color marker placed on the wall) using a cellphone camera.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/2063176.2063200

January 2012 Issue

Published: January 1, 2012

Vol. 55 No. 1

Pages: 96-104
