Research and Advances
Architecture and Hardware Contributed articles: Virtual extension

On the Move, Wirelessly Connected to the World

How to experience real-world landmarks through a wave, gaze, location coordinates, or touch, prompting delivery of useful digital information.

Today’s mobile handheld devices offer opportunities never before possible for interacting with digital information that responds to users’ physical locations. But mobile interfaces have only limited input capabilities, usually just a keyboard and audio, while emerging multimodal interaction paradigms are beginning to take advantage of user movements and gestures through sensors, actuators, and content. For example, tourists asking about an unfamiliar landmark might point at it intuitively and would certainly welcome a handheld computer that responds directly to that interest. When passersby provide directions, the description might include local features, as in, say, “Turn right after the red building and enter through the metal gates.” The tourists would also welcome being able to see these features represented in a directly recognizable way on their handhelds. Or when following a route to a remote destination, they would want to learn the turns and distances ahead through tactile or auditory cues, without having to switch their gaze between the environment and the display.

Here, we explore the synthesis of several emerging research trends we call Mobile Spatial Interaction, or MSI, covering new interaction techniques that let users interact with physical, natural, and urban surroundings through today’s sensor-rich mobile devices.

Interaction Paradigms

Mobile devices are used in different ways as digital spatial interfaces,6 including four main interaction paradigms characteristic of MSI systems:

Magic wand. A handheld with an embedded compass sensor can be pointed at a distant object to access further information about it (see Figure 1). The research project Point-to-Discover has developed a prototype spatially aware mobile phone that lets users retrieve information about nearby points of interest by pointing the device at, say, historic buildings.

Smart lens. Mobile devices can also be used as “smart” magnifying glasses or lenses.1,24 When placed in the field of vision and “looked through,” the device superimposes digital content directly onto recognized real-world objects. Augmenting a real view of the world with virtual information-rich objects can be done on handhelds with satisfactory precision and performance.18 One of the first publicly available smart-lens services was Wikitude, introduced in 2008, a travel guide enabling users to explore their surroundings by overlaying related Wikipedia entries on them (see Figure 2). To address the problem of how to select a point of interest within a field of depth, users access information by pointing the device in a particular direction and selecting a distance.

Virtual peephole. Rather than annotating the current real view, a handheld can instead serve as a peephole to related places distant in space and time. Virtual views aligned with the physical background provide a sense of, say, how historic sites might have looked hundreds of years ago. This idea has been explored in mixed-reality research based initially on “relocatable” tripod-mounted displays19 but more recently also on mobile phones.21 Peepholes beyond the proximate environment are beginning to be broadly available on handhelds, ranging from photo-based street-view applications to 3D cities and landscape models, including Google Earth on the Apple iPhone.

Sixth sense. Handhelds are also capable of delivering multimodal feedback to alert users of changes and opportunities in the dynamic environment.9,22 Enabling a “sixth sense” would be especially helpful for the visually impaired11; for example, the Swedish project called e-Adept5 deploys a phone-based navigation system providing detailed route guidance through synthetic speech on urban roads and walkways. Vibrations would help users stay on track and alert them to any potential obstacles (see Figure 3).

Enablers

Spatial sensing, georeferenced digital content, and 3D environment models are the key MSI enablers:

Spatial sensing. The primary MSI prerequisite is a handheld’s ability to sense its orientation toward its physical surroundings. Three technologies for spatial sensing are common:

Geospatial calculation. By means of a mobile device’s built-in GPS receiver, electronic compass (magnetometer), and (optionally) acceleration sensors, the user’s field of view can be calculated on a 2D map or in a 3D model, helping determine the georeferenced content within it, as in Point-to-Discover. For magic-wand and smart-lens applications, the set of available georeferenced items must be limited to those that are currently visible, either through additional integration of an environmental model or through an adjustable virtual horizon to limit the calculated field of view. While GPS receivers and acceleration sensors are common features in today’s mobile phones, devices with integrated compasses were introduced only recently. At the same time, environmental 3D models became routine in applications like Google Earth and MS Virtual Earth and are available from providers (such as Tele Atlas and Navteq) of digital map data. Following the geospatial-calculation approach, service providers can supply content easily and cheaply; new points of interest are simply anchored (virtually) on a map, with no further instrumentation of the physical environment necessary. MSI approaches based on geospatial calculations are well suited to distant interactions (such as pointing a mobile device at a building or at topographical features).
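The geospatial calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Point-to-Discover implementation: the `Poi` record, the function names, the 30-degree pointing sector, and the 300-meter virtual horizon are all assumptions chosen for the example, and the Earth is treated as a sphere. Visibility is approximated only by the distance limit; no environmental model is consulted.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


@dataclass(frozen=True)
class Poi:
    """A georeferenced point of interest (hypothetical record type)."""
    name: str
    lat: float  # degrees
    lon: float  # degrees


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def angle_diff_deg(a, b):
    """Smallest signed difference a - b, in the range [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def pois_in_view(user_lat, user_lon, heading_deg, pois, fov_deg=30.0, horizon_m=300.0):
    """Return (poi, distance) pairs inside the pointing sector, nearest first.

    heading_deg comes from the compass, the position from GPS; horizon_m
    plays the role of the adjustable virtual horizon.
    """
    hits = []
    for poi in pois:
        d = distance_m(user_lat, user_lon, poi.lat, poi.lon)
        if d > horizon_m:
            continue  # beyond the virtual horizon
        b = bearing_deg(user_lat, user_lon, poi.lat, poi.lon)
        if abs(angle_diff_deg(b, heading_deg)) <= fov_deg / 2:
            hits.append((poi, d))
    return sorted(hits, key=lambda h: h[1])
```

Calling `pois_in_view(lat, lon, compass_heading, pois)` with a wider `fov_deg` approximates a smart-lens field of view, while a narrow sector approximates magic-wand pointing.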

Visual detection. Another approach to spatial sensing is detection of a device’s orientation through a built-in camera and computer-vision algorithms. Depending on the application, photos and video streams may be analyzed on the mobile device itself or forwarded to a remote server for computationally intensive analysis. Covering a variety of MSI scenarios, visual detection applies four main machine-vision methods: detection of text and graphical patterns, visual markers, real-time feature tracking, and image-based localization.

Common image-recognition algorithms enable detection of text and graphical patterns, allowing interaction with ordinary posters, printed schedules, and other graphical representations. Once identified, the text can be communicated wirelessly to trigger a translation or Web search.

Attaching a “physical hyperlink” to a real-world object involves printed visual markers (such as the 2D matrix codes in Figure 4). When photographing them, the mobile device decodes the hidden URL and opens a Web browser to display the attached information or to trigger a Web service.

Real-time feature tracking. Real-time feature tracking detects local features in images, allowing augmentation of the video stream with precisely overlaid digital information.17,23 In such mobile augmented-reality applications, it can be used to, say, enrich physical objects with virtual 3D models bound to corresponding local feature points.

Image-based localization. Image-based localization approaches enable detection of a mobile device’s absolute location and orientation without built-in positioning features. Letting the device itself compare a submitted photo with a large set of existing georeferenced photos determines the user’s current field of view and returns attached georeferenced digital information.26

The hardware component necessary for MSI-based visual detection—a camera—is integrated into almost every cellphone manufactured today. However, implementation of appropriate applications requires considerable expertise by handheld-device designers in the field of computer vision to, for example, optimize common algorithms to cope with the device’s limited processing power. Depending on the scenario and methods applied, visual-detection methods are suitable for short-range MSI applications (such as visual markers) and mid-range applications (such as image-based localization).

Instrumented environments. Alternatively, wireless-communication technologies allow users to interact with surrounding active digitally enhanced landmarks. Today’s cellphones support Near Field Communication (NFC), WiFi, Bluetooth, and other technologies that enable communication with pervasive data sources and transmitters; for example, integrated NFC modules interact with corresponding tags over a distance of a few centimeters for submitting personal payment information and other applications. In contrast to the aforementioned camera-detected markers, NFC tags may provide not only more data but also advanced interaction, including bidirectional exchange of data and attachment of new information to NFC-enabled objects.

WiFi and Bluetooth lack orientation awareness but provide the basis for indoor localization techniques necessary for future hybrid MSI applications. Whereas NFC-enabled cellular phones have not yet broken through to the mass consumer-electronics market, and WiFi is supported only on the most current devices, Bluetooth is a common feature in today’s mobile devices. However, the investment in communications infrastructure for MSI applications based on wireless communication is significant for network and service providers; for example, NFC tags must be purchased, written to, and attached, while environments must be instrumented with Bluetooth services. Wireless communication technologies may be applied for short-range MSI (such as a few centimeters in NFC) and as part of medium-range hybrid MSI applications (such as Bluetooth and WiFi).

Georeferenced digital content. Georeferencing, or annotating data with global coordinates, has been around for at least the past 2,000 years in the form of maps. However, digital models are much more complex than coordinate pairs. Content, rich with semantics, is associated with urban topologies and hierarchical structures, context, and metadata. The interaction paradigms offer multiple views of the same, persistent, application-independent digital world, which, unfortunately, is still not standardized. Standardization efforts (such as CityGML) address this issue, providing both format and data model.

While the digital world takes shape, vast amounts of traditional real-world content are still being produced. The basic georeferencing feature is being adapted to cameras and phones that automatically annotate freshly created content with latitude and longitude. The resulting subset of user-generated data is called “volunteered geographical information.”8 This collaborative creation of geographical information helps generate many new georeferenced items and landmarks every day while keeping existing content up to date. However, even commercially available data sets are neither accurate nor bound to the underlying logical structures. For example, a restaurant’s street address does not necessarily yield the coordinates of its doorway. The entrance position may be somewhere close to the restaurant but is not bound directly to the doorway or to the doorway’s representation in the digital domain.

3D environments. Urban environmental models provide novel means to visualize, organize, and process spatial content. Methods based on laser scans and aerial photographs allow automatic creation of 3D models on a large scale. Moreover, editing tools (such as Google’s SketchUp) let users produce accurate models of their local surroundings. The results can be embedded in urban hierarchies and topologies. Content attached by a user to a street location is bound to the physical topology, not just to the street’s coordinates. When a building is annotated, its digital identity can offer that content whenever the building itself is used as a reference. Waving a device at a building can launch a context-dependent menu from which the user might select, say, a restaurant and then, via a routing algorithm, be taken directly and correctly to its door.

Challenges

Even as researchers and designers seeking to develop MSI applications take advantage of a growing body of gesture-interface and remote-sensing expertise and empirical knowledge, they still face a number of obstacles in making MSI interfaces that are widely deployable:

Target selection. Target selection constitutes a core challenge for MSI. Users must find it easy to select physical objects from their surroundings and know to which physical object a digital object refers. The magic wand might seem a perfect solution, since it requires only a familiar pointing gesture and the simple touch of a button. But such immediacy is diminished if feedback for the selection must be read from the on-screen display. Furthermore, reliable selection is possible only for relatively large targets that happen to be within sight.

A smart lens couples selection and consumption on handheld displays. Laboratory studies have found that selection with a smart lens involves two separate user-initiated phases16: users first move the device to occlude the physical target, then move the object to the center of the display for selection. Both demand careful coordination of hand movement with perceptual information.

Techniques developed for desktop pointing could be useful in solving this problem; for instance, if the handheld is able to map itself to the 3D space around the user, it could provide halo-like cues already on the display when the “pointer” approaches a target. To improve interaction, magic-wand pointing could be combined with tactile feedback. That is, pointing at a building with the device could provide tactile cues, eliminating the viewfinder from the loop and helping users focus directly on the environment.

Moreover, bodily gestures (such as sweeping the thumb over a display) could aid in scanning the non-visible (remote) environment. In the future, assuming wearable mobile projectors and cameras12 are available on mass-market commercial devices, users could even interact with projected spatial media (see Figure 5).

Remote access. While supporting interaction with proximate objects is MSI’s most basic goal, users are often interested in remote objects. Failing to support remote interaction would limit an application to only a subset of the tasks users expect. Such an out-of-the-body experience can be realized through current mobile devices based on the display of highly textured 3D objects from dynamically modified viewpoints.

While a schematized visual representation of the close environment for nearby point-of-interest access is sufficient for most consumer applications,7 exploration of distant places requires more refined representations. When a user takes a mobile device on a “reconnaissance flight” to a nearby street corner, realistic 3D representations of the scenery and highlighted points of interest would make it easier to make decisions about, say, where to go next. Future research must determine which level of realism is most useful for meaningful exploration with minimum distraction.

Way-finding. Prompted by a personal interest in a remote object, users may want to move (physically) to that object. Navigation support is not trivial for service developers, because in each phase of navigation, different interaction techniques are optimal. For example, text-based search functionality is needed to locate a predefined target by street address. Because address information is alphanumeric, totally eliminating manual text input in a commercial device is unlikely for at least the next few years. However, predictive input combined with location-aware adaptation may significantly reduce text-entry time. Combining location information with modeling of users’ frequently accessed addresses may help narrow the set of cues.
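One way to picture predictive input with location-aware adaptation is a ranking function that completes a typed prefix and orders the matches by how often the user has visited an address and how close it is. The following Python sketch is purely illustrative: the function name, the planar coordinates in meters, and the scoring formula (logarithm of the visit count minus distance in kilometers) are assumptions made for this example, not a method taken from the cited work.

```python
import math


def rank_addresses(prefix, candidates, user_xy, visit_counts, limit=3):
    """Rank address completions for a typed prefix.

    candidates:   dict mapping address -> (x, y) position in meters in a
                  local planar frame (an assumption to keep the math simple)
    user_xy:      the user's current (x, y) position in the same frame
    visit_counts: dict mapping address -> number of past visits
    """
    prefix = prefix.strip().lower()
    scored = []
    for address, (x, y) in candidates.items():
        if not address.lower().startswith(prefix):
            continue
        dist_m = math.hypot(x - user_xy[0], y - user_xy[1])
        # Hypothetical score: frequent addresses rank higher, distant ones lower.
        score = math.log1p(visit_counts.get(address, 0)) - dist_m / 1000.0
        scored.append((score, address))
    # Best score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [address for _, address in scored[:limit]]
```

With such a ranking, a user who often travels to one address two kilometers away would see it offered before a never-visited address around the corner after typing only a couple of characters; the relative weight of frequency and distance would need tuning against real usage data.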

When physically moving to the target, users need continuously updated information on route choice and available navigation options. As shown by experience with car navigation devices, redundant multimodal information works reasonably well in multitasking situations. Design for pedestrian navigation is further complicated by an unpredictable environment. While hiding everything not in front of the driver is the right strategy in a car, pedestrians often alter their gaze and therefore prefer “surround-type” overviews.7 A prototype called Going My Way developed at MIT gathers data on user movement patterns, selecting points of interest accordingly and showing only the cues most likely to be known by the users.4

In following a route, knowing one’s orientation in relation to the target or points of interest is a constant challenge; failing this is likely to lead a traveler astray. In principle, the information can be (again) provided through tactile or auditory cues, but one can expect to learn only a limited number of cues. Researchers at the University of Glasgow are exploring the option of orientation updating through continuous tactile feedback.22 Moreover, using tactile cues has the advantage of working better than auditory cues in noisy outdoor environments. Such feedback, combined with adaptive on-screen information on buildings and other landmarks,4 could be a suitable alternative to text-based navigation.
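To make the idea of orientation updating through tactile feedback concrete, the sketch below maps the difference between the user's compass heading and the bearing to the target onto a vibration cue. It is a simplified illustration and not the Glasgow researchers' technique: the function name, the 10-degree tolerance, and the linear intensity ramp that saturates at 90 degrees are assumptions.

```python
def tactile_cue(heading_deg, target_bearing_deg, tolerance_deg=10.0):
    """Map the heading error to a (side, intensity) vibration cue.

    Returns ("ahead", 0.0) when the user is roughly on course; otherwise
    the side to turn toward and an intensity between 0 and 1.
    """
    # Signed error in [-180, 180): positive means the target lies clockwise (right).
    error = (target_bearing_deg - heading_deg + 180.0) % 360.0 - 180.0
    if abs(error) <= tolerance_deg:
        return ("ahead", 0.0)
    side = "right" if error > 0 else "left"
    # Intensity grows linearly with the error and saturates at 90 degrees.
    intensity = min(1.0, abs(error) / 90.0)
    return (side, round(intensity, 2))
```

A single continuous cue of this kind keeps the number of signals a user has to learn small, which matters given the limited set of cues a traveler can be expected to memorize.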

Contextual factors. Designers ultimately need to know how contextual factors affect the use of different interactive features. Relevant studies are scarce but increasing in number as more mature prototypes emerge. For example, Morrison et al.13 conducted a field study comparing a magic-lens solution to a non-augmented standard 2D map. Teams of three or more users played a game in the city center of Helsinki where they were required to use a map to find the game sites. The analysis concentrated on group practices in using these solutions. The researchers observed that the magic-lens solution made the information available to all group members, who gathered around the display more often to solve the problem jointly, like bees around a hive; meanwhile, the standard 2D map was used in a centralized fashion, with one person holding the handheld, doing the problem solving, and delegating tasks to the rest of the group. While the 2D-map users were more effective, the magic-lens users found the game experience more enjoyable.

Comprehensive evaluation of such work remains scarce because a lack of real content and the fragility of prototypes limit the opportunity for field studies. User-acceptance studies are also necessary for assessing whether certain paradigms are realistic for real-world use. For example, how would passersby react to (and how would users expect them to react to) a pedestrian’s sweeping hand and arm gestures with no apparent or verbal explanation? Would such a sixth sense conflict with the pedestrian’s ability to follow safety-critical signals in the physical environment?

Performance limitations. Despite development of technological enablers, we can expect sensing uncertainty and other performance limitations for years to come.25 Application designers must therefore be aware of the specific technology implications of each MSI interaction technique they devise to ensure their applications deliver reliable performance. One lesson researchers have drawn from field trials is that magic-wand pointing is less robust in response to inaccurate real-world alignment than map presentations; for example, when a magic-wand system we were investigating dislocated a pedestrian by no more than 10 meters to another intersection, pointing at a certain building resulted in the device delivering the wrong information.20 In contrast, an orientation-aware overview map indicating a user’s position within the 3D landscape helped users judge the trustworthiness of the results and cognitively adjust their incorrectly displayed position.
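A rough calculation shows why a positioning error of this size matters for pointing. Assuming, as a simplification, that the 10-meter error lies entirely sideways to the pointing direction, the bearing from the reported position to the target differs from the true bearing by the arctangent of the offset divided by the target distance. The function name and the sample distances below are illustrative assumptions.

```python
import math


def bearing_error_deg(lateral_offset_m, target_distance_m):
    """Angular pointing error caused by a sideways position error."""
    return math.degrees(math.atan2(lateral_offset_m, target_distance_m))


# A 10 m sideways position error, for targets at increasing distances.
for distance_m in (20, 50, 200):
    print(distance_m, round(bearing_error_deg(10, distance_m), 1))
# prints:
# 20 26.6
# 50 11.3
# 200 2.9
```

Under this assumption, nearby targets suffer most: at street-crossing distances the error can exceed the width of a typical pointing sector, so the device may select a neighboring building.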

Smart-lens interfaces are even less forgiving of inaccurate alignment than magic wands. Imagine, for example, the price of an apartment being shown by a real-estate service atop the wrong building. A standard design principle is that users must be informed about the level of accuracy of any particular location. The indication could be as simple as showing GPS signal strength, a feature already offered by many mobile-map providers.

The scarcity of power resources is another challenge to all MSI-based applications. Rendering large 3D environments and using additional sensors are a constant drain on battery power. Application-level solutions can improve the situation; for example, 3D updates can be synchronized with GPS updates (once per second), unless interaction is required. Simple spatial-sensing devices can be enabled at the user’s discretion, through, say, an assigned quick activation button or usage context. Moreover, sensors should be shut off when the handheld device’s lid is closed or the keypad locked. Such simple methods for avoiding the irritation of frequent low-battery warnings promise to help win wider acceptance of mobile spatial applications.
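The application-level power measures mentioned above can be expressed as two small decision functions. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, time is passed in as seconds from any monotonic clock, and the one-second interval mirrors the GPS update rate given in the text.

```python
class RenderThrottle:
    """Decide when the 3D view should be redrawn."""

    def __init__(self, min_interval_s=1.0):
        self.min_interval_s = min_interval_s
        self._last_render_s = None  # time of the most recent redraw

    def should_render(self, now_s, new_gps_fix, user_interacting):
        """Redraw on interaction, or on a new GPS fix at most once per interval."""
        if user_interacting:
            due = True
        elif new_gps_fix:
            due = (self._last_render_s is None
                   or now_s - self._last_render_s >= self.min_interval_s)
        else:
            due = False
        if due:
            self._last_render_s = now_s
        return due


def sensors_enabled(user_enabled, lid_closed, keypad_locked):
    """Spatial sensors run only when the user enabled them and the device is in use."""
    return user_enabled and not (lid_closed or keypad_locked)
```

An application loop would call `should_render` on every tick and skip the expensive 3D update whenever it returns `False`.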

Personalization of content. Today’s location-based services focus on attaching elements of static information to certain location coordinates. This basic limitation of connecting with points of interest implies the same content is presented regardless of user motivation, time of day, or length of journey. Approaches for better adapting spatial content to current user activity and interests are promising and should therefore be pursued.10 Applications should regard the bits and pieces of spatial information as an ensemble that could be used for recommending routes to suit immediate user needs.14 Analysis of what other users have done in the same or similar places can also help make mobile spatial services more user friendly and commercially viable.3,15

Conclusion

MSI stands a good chance of winning consumer acceptance as a standard form of mobile interaction; the interaction techniques are feasible and attractive, the technology enablers are becoming widely available, and the most important remaining problem areas have been identified by human-computer-interaction researchers. However, unless MSI is viewed as a coherent whole—a new paradigm of interaction with the physical real-world environment that uses handheld apps as if they were physical objects—interaction techniques and contents could become fragmented across applications, discouraging users from adopting or even trying this convenient form of interaction.

The greatest challenge is how to make MSI truly ubiquitous, encompassing relevant sources and types of information, as well as multiple usage contexts. In a worst-case scenario, users are constantly opening a special-purpose application, waiting to load georeferenced data, and only then interacting with the application, in a limited way. Addressing this challenge requires a joint effort by the research community, device manufacturers, telecom operators, content providers, and hundreds of millions of end users worldwide.

Figures

Figure 1. Pointing with a magic wand.

Figure 2. Looking through a smart lens.

Figure 3. A sixth sense for blind users.

Figure 4. Visual-marker-based techniques.

Figure 5. Finger-pointing to virtual spatial media with a wearable camera.

References

    1. Azuma, R.T. The challenge of making augmented reality work outdoors. Chapter 21 in Mixed Reality: Merging Real and Virtual Worlds, Y. Ohta and H. Tamura, eds. Springer-Verlag, New York, 1999, 379–390.

    2. Baldauf, M., Fröhlich, P., and Reichl, P. Gestural interfaces for micro projector-based mobile phone applications. In Proceedings of the International Conference on Ubiquitous Computing (Orlando, FL, Sept. 30–Oct. 3). ACM Press, New York, 2009.

    3. Bilandzic, M., Foth, M., and De Luca, A. CityFlocks: Designing social navigation for urban mobile information systems. In Proceedings of the Seventh ACM Conference on Designing Interactive Systems (London, Apr. 7–8). ACM Press, New York, 2008, 174–183.

    4. Chung, J. and Schmandt, C. Going My Way: A user-aware route planner. In Proceedings of the 27th International Conference on Human Factors in Computing Systems (Boston, Apr. 4–9). ACM Press, New York, 2009, 1899–1902.

    5. e-Adept. Electronic Assistance for Disabled and Elderly Pedestrians and Travelers.

    6. Egenhofer, M.J. Spatial information appliances: A next generation of geographic information systems. In Proceedings of the First Brazilian Workshop on GeoInformatics (Campinas, Brazil, Oct. 10–13, 1999).

    7. Fröhlich, P., Obernberger, G., Simon, R., and Reichl, P. Exploring the design space of Smart Horizons. In Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services (Amsterdam, The Netherlands, Sept. 2–5). ACM Press, New York, 2008, 363–366.

    8. Goodchild, M. Citizens as sensors: The world of volunteered geography. GeoJournal 69 (2007), 211–221.

    9. Holland, S., Morse, D., and Gedenryd, H. AudioGPS: Spatial audio navigation with a minimal attention interface. Personal Ubiquitous Computing 6, 4 (Jan. 2002), 253–259.

    10. Krösche, J. and Boll, S. The xPOI concept. In Proceedings of the International Workshop on Location- and Context-Awareness (Oberpfaffenhofen, Germany, May 12–13). Springer, New York, 2005, 113–119.

    11. Makino, H., Ishii, I., and Nakashizuka, M. Development of navigation systems for the blind using GPS and mobile phone combination. In Proceedings of the International Conference of the IEEE (Chicago, Oct. 30–Nov. 2, 1997), 506–507.

    12. Mistry, P., Maes, P., and Chang, L. WUW—Wear Ur World—A wearable gestural interface. In Extended Abstracts of the Conference on Computer Human Interaction (Boston, Apr. 4–9). ACM Press, New York, 2009, 4111–4116.

    13. Morrison, A., Oulasvirta, A., Peltonen, P., Lemmelä, S., Jacucci, G., Reitmayr, G., Näsänen, J., and Juustila, A. Like bees around the hive: A comparative study of a mobile augmented reality map. In Proceedings of the Conference on Computer Human Interaction (Boston, Apr. 4–9). ACM Press, New York, 2008, 1889–1898.

    14. Paulini, M. and Schnabel, M.A. Surfing the city: An architecture for context-aware urban exploration. In Proceedings of the Fifth International Conference on Advances in Mobile Computing and Multimedia (Jakarta, Indonesia, Dec. 3–5). Austrian Computer Society, Vienna, 2007, 31–44.

    15. Robinson, S., Eslambolchilar, P., and Jones, M. Point-to-geoblog: Gestures and sensors to support user-generated content creation. In Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services (Amsterdam, The Netherlands, Sept. 2–5). ACM Press, New York, 2008, 197–206.

    16. Rohs, M. and Oulasvirta, A. Target acquisition with camera phones when used as magic lenses. In Proceedings of the 26th Annual SIGCHI Conference on Human Factors in Computing Systems (Boston, Apr. 4–9). ACM Press, New York, 2008, 1409–1418.

    17. Rohs, M., Schöning, J., Krüger, A., and Hecht, B. Towards real-time markerless tracking of magic lenses on paper maps. In Adjunct Proceedings of the Fifth International Conference on Pervasive Computing (Toronto, May 13–16). ACM Press, New York, 2007, 69–72.

    18. Schmalstieg, D. and Wagner, D. The world as a user interface: Augmented reality for ubiquitous computing. In Proceedings of the Central European Multimedia and Virtual Reality Conference (Prague, June 8–10, 2005).

    19. Schnädelbach, H., Koleva, B., Flintham, M., Fraser, M., Izadi, S., Chandler, P., Foster, M., Benford, S., Greenhalgh, C., and Rodden, T. The Augurscope: A mixed-reality interface for outdoors. In Proceedings of the Conference on Computer Human Interaction (Minneapolis, Apr. 20–25). ACM Press, New York, 2002, 9–16.

    20. Simon, R. Mobilizing the Geospatial Web: A Framework and Conceptual Model for Spatially Aware Mobile Web Applications. Ph.D. Thesis, Vienna University of Technology, Vienna, Austria. 2008.

    21. Simon, R. The Creative Histories Mobile Explorer: Implementing a 3D multimedia tourist guide for mass-market mobile phones. In Proceedings of Electronic Information, the Visual Arts and Beyond, Digital Cultural Heritage (Vienna, Aug. 25–28). Austrian Computer Society, Vienna, 2006.

    22. Strachan, S. and Murray-Smith, R. Bearing-based selection in mobile spatial interaction (special issue on mobile spatial interaction). Personal Ubiquitous Computing 13, 4 (May 2009), 265–280.

    23. Wagner, D., Langlotz, T., and Schmalstieg, D. Robust and unobtrusive marker tracking on mobile phones. In Proceedings of the Seventh IEEE/ACM International Symposium on Mixed and Augmented Reality (Cambridge, U.K., Sept. 15). IEEE Computer Society, Washington, D.C., 2008, 121–124.

    24. Wellner, P., Mackay, W., and Gold, R. Back to the real world. Commun. ACM 36, 7 (July 1993), 24–26.

    25. Williamson, J., Strachan, S., and Murray-Smith, R. It's a long way to Monte Carlo: Probabilistic display in GPS navigation. In Proceedings of the Eighth Conference on Human-Computer Interaction with Mobile Devices and Services (Espoo, Finland, Sept. 12–15). ACM Press, New York, 2006, 89–96.

    26. Zhang, W. and Kosecka, J. Image-based localization in urban environments. In Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission (Chapel Hill, NC, June 14–16). IEEE Computer Society, Washington D.C., 2006, 33–40.

