Architecture and Hardware Contributed articles: Virtual extension

On the Move, Wirelessly Connected to the World

How to experience real-world landmarks through a wave, gaze, location coordinates, or touch, prompting delivery of useful digital information.

By Peter Fröhlich, Antti Oulasvirta, Matthias Baldauf, and Antti Nurminen

Posted Jan 1 2011

Today’s mobile handheld devices offer opportunities never before possible for interacting with digital information that responds to users’ physical locations. But mobile interfaces have only limited input capabilities, usually just a keyboard and audio, while emerging multimodal interaction paradigms are beginning to take advantage of user movements and gestures through sensors, actuators, and content. For example, tourists asking about an unfamiliar landmark might point at it intuitively and would certainly welcome a handheld computer that responds directly to that interest. When passersby provide directions, the description might include local features, as in, say, “Turn right after the red building and enter through the metal gates.” Tourists would welcome being able to see these features represented in a directly recognizable way on their handhelds. Or when following a route to a remote destination, they would want tactile or auditory cues to convey the turns and distances ahead, without having to switch their gaze between the environment and the display.

Here, we explore the synthesis of several emerging research trends we call Mobile Spatial Interaction, or MSI (http://msi.ftw.at), covering new interaction techniques that let users interact with physical, natural, and urban surroundings through today’s sensor-rich mobile devices.

Interaction Paradigms

Mobile devices are used in different ways as digital spatial interfaces,⁶ including four main interaction paradigms characteristic of MSI systems:

Magic wand. A handheld can be pointed at a distant object to access further information about it, with the pointing direction sensed through an embedded compass (see Figure 1). The research project Point-to-Discover (http://p2d.ftw.at) has developed a prototype spatially aware mobile phone that allows users to retrieve information about nearby points of interest by pointing the device at, say, historic buildings.

Smart lens. Mobile devices can also be used as “smart” magnifying glasses or lenses.¹,²⁴ When placed in the field of vision and “looked through,” the device superimposes digital content directly onto recognized real-world objects. Augmenting a real view of the world with virtual information-rich objects can be done on handhelds with satisfactory precision and performance.¹⁸ One of the first publicly available smart-lens services was Wikitude, introduced in 2008 (http://www.mobilizy.com/wikitude.php), a travel guide enabling users to explore their surroundings by overlaying related Wikipedia entries on them (see Figure 2). To address the problem of how to select a point of interest within a field of depth, users access information by pointing the device in a particular direction and selecting a distance.

Virtual peephole. Rather than annotating the current real view, a handheld can instead serve as a peephole to related places distant in space and time. Virtual views aligned with the physical background provide a sense of, say, how historic sites might have looked hundreds of years ago. This idea has been explored in mixed-reality research based initially on “relocatable” tripod-mounted displays¹⁹ but more recently also on mobile phones.²¹ Peepholes beyond the proximate environment are beginning to be broadly available on handhelds, ranging from photo-based street-view applications to 3D cities and landscape models, including Google Earth on the Apple iPhone.

Sixth sense. Handhelds are also capable of delivering multimodal feedback to alert users of changes and opportunities in the dynamic environment.⁹,²² Enabling a “sixth sense” would be especially helpful for the visually impaired¹¹; for example, the Swedish project called e-Adept⁵ deploys a phone-based navigation system providing detailed route guidance through synthetic speech on urban roads and walkways. Vibrations help users stay on track and alert them to potential obstacles (see Figure 3).

Enablers

Spatial sensing, georeferenced digital content, and 3D environment models are the key MSI enablers:

Spatial sensing. The primary MSI prerequisite is a handheld’s ability to sense its orientation toward its physical surroundings. Three technologies for spatial sensing are common:

Geospatial calculation. By means of a mobile device’s built-in GPS receiver, electronic compass (magnetometer), and (optionally) acceleration sensors, the user’s field of view can be calculated on a 2D map or in a 3D model, helping determine the georeferenced content within it, as in Point-to-Discover. For magic-wand and smart-lens applications, the set of available georeferenced items must be limited to those that are currently visible, either through additional integration of an environmental model or through an adjustable virtual horizon that limits the calculated field of view. While GPS receivers and acceleration sensors are common features in today’s mobile phones, devices with integrated compasses were introduced only recently. At the same time, environmental 3D models became routine in applications like Google Earth and MS Virtual Earth and are available from providers of digital map data (such as Tele Atlas and Navteq). Following the geospatial-calculation approach, service providers can add content easily and cheaply; new points of interest are simply anchored (virtually) on a map, with no further instrumentation of the physical environment necessary. MSI approaches based on geospatial calculations are well suited to distant interactions (such as pointing a mobile device at a building or at topographical features).
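The geospatial calculation described above can be sketched in a few lines. The following is a minimal illustration, not the Point-to-Discover implementation: it assumes a spherical Earth, a flat pointing cone of width `fov_deg`, and a fixed `horizon_m` standing in for the adjustable virtual horizon. The names `PointOfInterest`, `pois_in_view`, `fov_deg`, and `horizon_m` are hypothetical, and no environmental model is used, so occlusion by buildings is ignored.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


@dataclass(frozen=True)
class PointOfInterest:
    name: str
    lat: float  # degrees
    lon: float  # degrees


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def angle_diff_deg(a, b):
    """Smallest signed difference a - b, in the range [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def pois_in_view(lat, lon, heading_deg, pois, fov_deg=30.0, horizon_m=300.0):
    """Return (poi, distance) pairs inside the pointing cone, nearest first.

    lat/lon come from GPS, heading_deg from the compass. fov_deg and
    horizon_m are assumed tuning values, not figures from the article.
    """
    hits = []
    for poi in pois:
        d = distance_m(lat, lon, poi.lat, poi.lon)
        if d > horizon_m:
            continue  # beyond the virtual horizon
        off = angle_diff_deg(bearing_deg(lat, lon, poi.lat, poi.lon), heading_deg)
        if abs(off) <= fov_deg / 2:
            hits.append((poi, d))
    return sorted(hits, key=lambda h: h[1])
```

Sorting by distance is one simple way to resolve several candidates in the same cone; an application with a 3D model would instead discard items hidden behind nearer buildings.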

Visual detection. Another approach to spatial sensing is detection of a device’s orientation through a built-in camera and computer-vision algorithms. Depending on the application, photos and video streams may be analyzed on the mobile device itself or forwarded to a remote server for computationally intensive analysis. Covering a variety of MSI scenarios, visual detection applies four main machine-vision methods: detection of text and graphical patterns, visual markers, real-time feature tracking, and image-based localization.

Common image-recognition algorithms enable detection of text and graphical patterns, allowing interaction with ordinary posters, printed schedules, and other graphical representations. Once identified, the text can be communicated wirelessly to trigger a translation or Web search (http://pointandfind.nokia.com).

Attaching a “physical hyperlink” to a real-world object involves printed visual markers (such as the 2D matrix codes in Figure 4). When photographing them, the mobile device decodes the hidden URL and opens a Web browser to display the attached information or to trigger a Web service.

Real-time feature tracking. Real-time feature tracking detects local features in images, allowing augmentation of the video stream with precisely overlaid digital information.¹⁷,²³ In such mobile augmented-reality applications, it can be used to, say, enrich physical objects with virtual 3D models bound to corresponding local feature points.

Image-based localization. Image-based localization approaches enable detection of a mobile device’s absolute location and orientation without built-in positioning features. Letting the device itself compare a submitted photo with a large set of existing georeferenced photos determines the user’s current field of view and returns attached georeferenced digital information.²⁶

The hardware component necessary for MSI-based visual detection—a camera—is integrated into almost every cellphone manufactured today. However, implementation of appropriate applications requires considerable expertise by handheld-device designers in the field of computer vision to, for example, optimize common algorithms to cope with the device’s limited processing power. Depending on the scenario and methods applied, visual-detection methods are suitable for short-range MSI applications (such as visual markers) and mid-range applications (such as image-based localization).

Instrumented environments. Alternatively, wireless-communication technologies allow users to interact with surrounding active digitally enhanced landmarks. Today’s cellphones support Near Field Communication (NFC), WiFi, Bluetooth, and other technologies that enable communication with pervasive data sources and transmitters; for example, integrated NFC modules interact with corresponding tags over a distance of a few centimeters for submitting personal payment information and other applications. In contrast to the aforementioned camera-detected markers, NFC tags may provide not only more data but also advanced interaction, including bidirectional exchange of data and attachment of new information to NFC-enabled objects.

WiFi and Bluetooth lack orientation awareness but provide the basis for indoor localization techniques (http://www.koozyt.com) necessary for future hybrid MSI applications. Whereas NFC-enabled cellular phones have not yet broken through to the mass consumer-electronics market, and WiFi is supported only on the most current devices, Bluetooth is a common feature in today’s mobile devices. However, the investment in communications infrastructure for MSI applications based on wireless communication is significant for network and service providers; for example, NFC tags must be purchased, written to, and attached, while environments must be instrumented with Bluetooth services. Wireless communication technologies may be applied for short-range MSI (such as a few centimeters in NFC) and part of medium-range hybrid MSI applications (such as Bluetooth and WiFi).

Georeferenced digital content. Georeferencing, or annotating data with global coordinates, has been around for at least the past 2,000 years in the form of maps. However, digital models are much more complex than coordinate pairs. Content, rich with semantics, is associated with urban topologies and hierarchical structures, context, and metadata. The interaction paradigms offer multiple views of the same, persistent, application-independent digital world, which, unfortunately, is still not standardized. Standardization efforts (such as CityGML www.citygml.org) address this issue, providing both format and data model.

While the digital world takes shape, vast amounts of traditional real-world content are still being produced. The basic georeferencing feature is being adapted to cameras and phones that automatically annotate freshly created content with latitude and longitude. The resulting subset of user-generated data is called “volunteered geographical information.”⁸ This collaborative creation of geographical information helps generate many new georeferenced items and landmarks every day while keeping existing content up to date. However, even commercially available data sets are neither accurate nor bound to the underlying logical structures. For example, a restaurant’s street address does not necessarily yield the coordinates of its doorway. The entrance position may be somewhere close to the restaurant but is not bound directly to the doorway or to the doorway’s representation in the digital domain.

3D environments. Urban environmental models provide novel means to visualize, organize, and process spatial content. Methods based on laser scans and aerial photographs allow automatic creation of 3D models on a large scale. Moreover, editing tools (such as Google’s SketchUp http://sketchup.google.com) let users produce accurate models of their local surroundings. The results can be embedded in urban hierarchies and topologies. Content attached by a user to a street location is bound to the physical topology, not just to the street’s coordinates. When a building is annotated, its digital identity can offer that content whenever the building itself is used as a reference. Waving a device at a building can launch a context-dependent menu from which the user might select, say, a restaurant and then, via a routing algorithm, be taken directly and correctly to its door.

Challenges

Even as researchers and designers seeking to develop MSI applications take advantage of a growing body of gesture-interface and remote-sensing expertise and empirical knowledge, they still face a number of obstacles in making MSI interfaces widely deployable:

Target selection. Target selection constitutes a core challenge for MSI. Users must find it easy to select physical objects from their surroundings and know to which physical object a digital object refers. The magic wand might seem a perfect solution, since it requires only a familiar pointing gesture and the simple touch of a button. But such immediacy is diminished if feedback for the selection must be read from the on-screen display. Furthermore, reliable selection is possible only for relatively large targets that happen to be within sight.

A smart lens couples selection and consumption on handheld displays. Laboratory studies have found that selection with a smart lens involves two separate user-initiated phases¹⁶: users first move the device to occlude the physical target, then move the object to the center of the display for selection. Both demand careful coordination of hand movement with perceptual information.

Techniques developed for desktop pointing could be useful in solving this problem; for instance, if the handheld is able to map itself to the 3D space around the user, it could provide halo-like cues already on the display when the “pointer” approaches a target. To improve interaction, magic-wand pointing could be combined with tactile feedback. That is, pointing at a building with the device could provide tactile cues, eliminating the viewfinder from the loop and helping users focus directly on the environment.

Moreover, bodily gestures (such as sweeping the thumb over a display) could aid in scanning the non-visible (remote) environment. In the future, assuming wearable mobile projectors and cameras¹² are available on mass-market commercial devices, users could even interact with projected spatial media (see Figure 5).

Remote access. While supporting interaction with proximate objects is MSI’s most basic goal, users are often interested in remote objects. Failing to support remote interaction would limit an application to only a subset of the tasks users expect. Such an out-of-the-body experience can be realized through current mobile devices based on the display of highly textured 3D objects from dynamically modified viewpoints.

While a schematized visual representation of the close environment for nearby point-of-interest access is sufficient for most consumer applications,⁷ exploration of distant places requires more refined representations. When a user takes a mobile device on a “reconnaissance flight” to a nearby street corner, realistic 3D representations of the scenery and highlighted points of interest would make it easier to make decisions about, say, where to go next. Future research must determine which level of realism is most useful for meaningful exploration with minimum distraction.

Way-finding. Prompted by a personal interest in a remote object, users may want to move (physically) to it, which means navigating the physical environment. Navigation support is not trivial for service developers, because different interaction techniques are optimal in each phase of navigation. For example, text-based search functionality is needed to locate a predefined target by street address. Because address information is alphanumeric, totally eliminating manual text input in a commercial device is unlikely for at least the next few years. However, predictive input combined with location-aware adaptation may significantly reduce text-entry time. Combining location information with modeling of users’ frequently accessed addresses may help narrow the set of candidates.
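As an illustration of predictive input combined with location-aware adaptation, the sketch below ranks address completions for a typed prefix by how often the user has accessed them and how close they are. The `KnownAddress` record, the `suggest` function, and the scoring formula are all hypothetical choices made for this example; a real service would tune or learn the weighting.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownAddress:
    text: str
    lat: float   # degrees
    lon: float   # degrees
    visits: int  # how often this user has accessed the address


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation; adequate at city scale."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6_371_000.0 * math.hypot(dx, dy)


def suggest(prefix, user_lat, user_lon, addresses, limit=3):
    """Rank addresses matching the typed prefix by frequency and proximity."""
    prefix = prefix.strip().lower()
    scored = []
    for addr in addresses:
        if not addr.text.lower().startswith(prefix):
            continue
        d = approx_distance_m(user_lat, user_lon, addr.lat, addr.lon)
        # Assumed scoring: the frequency term grows with the log of past
        # visits; the proximity term falls from 1 toward 0 with distance.
        score = math.log1p(addr.visits) + 1000.0 / (1000.0 + d)
        scored.append((score, addr.text))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [text for _, text in scored[:limit]]
```

With this weighting, a frequently visited address across town can outrank an unvisited one next door, while proximity decides between addresses the user has never accessed.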

When physically moving to the target, users need continuously updated information on route choice and available navigation options. As shown by experience with car navigation devices, redundant multimodal information works reasonably well in multitasking situations. Design for pedestrian navigation is further complicated by an unpredictable environment. While hiding everything not in front of the driver is the right strategy in a car, pedestrians often alter their gaze and therefore prefer “surround-type” overviews.⁷ A prototype called Going My Way developed at MIT (http://www.media.mit.edu/speech/publications/) gathers data on user movement patterns, selecting points of interest accordingly and showing only the cues most likely to be known by the users.⁴

In following a route, knowing one’s orientation in relation to the target or points of interest is a constant challenge; failing this is likely to lead a traveler astray. In principle, the information can be (again) provided through tactile or auditory cues, but one can expect to learn only a limited number of cues. Researchers at the University of Glasgow are exploring the option of orientation updating through continuous tactile feedback.²² Moreover, using tactile cues has the advantage of working better than auditory cues in noisy outdoor environments. Such feedback, combined with adaptive on-screen information on buildings and other landmarks,⁴ could be a suitable alternative to text-based navigation.
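One way to keep the cue vocabulary small is to reduce the traveler's orientation to a single signed angle and map it to a handful of vibration patterns. The sketch below assumes a compass heading and a precomputed bearing to the target; the four cue names and the 15-degree tolerance are hypothetical and are not taken from the Glasgow work.

```python
def relative_bearing_deg(target_bearing_deg, heading_deg):
    """Signed angle from heading to target in [-180, 180); positive = right."""
    return (target_bearing_deg - heading_deg + 180.0) % 360.0 - 180.0


def tactile_cue(target_bearing_deg, heading_deg, tolerance_deg=15.0):
    """Map the relative bearing to one of four assumed vibration cues."""
    rel = relative_bearing_deg(target_bearing_deg, heading_deg)
    if abs(rel) <= tolerance_deg:
        return "on_course"
    if abs(rel) >= 180.0 - tolerance_deg:
        return "turn_around"
    return "veer_right" if rel > 0 else "veer_left"
```

A continuous variant could scale vibration intensity with the absolute value of the relative bearing instead of switching between discrete patterns.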

Contextual factors. Designers ultimately need to know how contextual factors affect the use of different interactive features. Relevant studies are scarce but increasing in number as more mature prototypes emerge. For example, Morrison et al.¹³ conducted a field study comparing a magic-lens solution to a non-augmented standard 2D map. Teams of three or more users played a game in the city center of Helsinki where they were required to use a map to find the game sites. The analysis concentrated on group practices in using these solutions. The researchers observed that the magic-lens solution made the information available to all group members gathering around the display more often to solve the problem jointly, like bees around a hive; meanwhile, the standard 2D map was used in a centralized fashion, with one person, with a handheld, doing the problem solving and delegating tasks to the rest of the group. While the 2D-map users were more effective, the magic-lens users found the game experience more enjoyable.

Comprehensive evaluation of such work remains scarce because a lack of real content and the fragility of prototypes limit the opportunity for field studies. User-acceptance studies are also necessary for assessing whether certain paradigms are realistic for real-world use. For example, how would passersby react to (and how would users expect them to react to) a pedestrian’s sweeping hand and arm gestures with no apparent or verbal explanation? Would such a sixth sense conflict with the pedestrian’s ability to follow safety-critical signals in the physical environment?

Performance limitations. Despite development of technological enablers, we can expect sensing uncertainty and other performance limitations for years to come.²⁵ Application designers must therefore be aware of the specific technology implications of each MSI interaction technique they devise to ensure their applications deliver reliable performance. One lesson researchers have drawn from field trials is that magic-wand pointing is less robust in response to inaccurate real-world alignment than map presentations; for example, when a magic-wand system we were investigating dislocated a pedestrian by no more than 10 meters to another intersection, pointing at a certain building resulted in the device delivering the wrong information.²⁰ In contrast, an orientation-aware overview map indicating a user’s position within the 3D landscape helped users judge the trustworthiness of the results and cognitively adjust their incorrectly displayed position.
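A short calculation shows why pointing is so sensitive to position error. If the true position lies anywhere within a circle of radius e around the reported one, the bearing to a target at distance d can be off by up to arcsin(e/d). The function below is a geometric sketch under that assumption (flat ground, perfect compass); its name is hypothetical and it does not reproduce the analysis of the cited field trial.

```python
import math


def worst_case_bearing_error_deg(position_error_m, target_distance_m):
    """Largest possible bearing error caused by position uncertainty alone.

    Assumes the true position is somewhere within position_error_m of the
    reported one and that the compass reading itself is exact.
    """
    if target_distance_m <= position_error_m:
        return 180.0  # the target may lie in any direction, even behind
    return math.degrees(math.asin(position_error_m / target_distance_m))


for distance in (20.0, 50.0, 200.0):
    error = worst_case_bearing_error_deg(10.0, distance)
    print(f"10 m position error, target at {distance:.0f} m: up to {error:.1f} degrees off")
```

Under these assumptions a 10-meter error matters little for a landmark 200 meters away but can exceed a narrow pointing cone for a building across the street, which is consistent with nearby targets being the ones most easily confused.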

Smart-lens interfaces are even less forgiving of inaccurate alignment than magic wands. Imagine, for example, the price of an apartment being shown by a real-estate service atop the wrong building. A standard design principle is that users must be informed about the level of accuracy of any particular location. The remedy could be as simple as indicating GPS signal strength, a feature already offered by many mobile-map providers.

The scarcity of power resources is another challenge to all MSI-based applications. Rendering large 3D environments and using additional sensors are a constant drain on battery power. Application-level solutions can improve the situation; for example, 3D updates can be synchronized with GPS updates (once per second), unless interaction is required. Simple spatial-sensing devices can be enabled at the user’s discretion, through, say, an assigned quick activation button or usage context. Moreover, sensors should be shut off when the handheld device’s lid is closed or the keypad locked. Such simple methods for avoiding the irritation of frequent low-battery warnings promise to help win wider acceptance of mobile spatial applications.
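The application-level measures above amount to a small scheduling policy. The sketch below is one hypothetical way to express it: redraw the 3D view at most once per GPS update unless the user is interacting, and keep sensors off while the device is locked or the user has not activated them. The class and function names are assumptions, and timestamps are passed in explicitly so the logic can be exercised without real hardware.

```python
from dataclasses import dataclass


@dataclass
class RenderScheduler:
    """Decides when to redraw the 3D view (hypothetical policy object)."""
    gps_interval_s: float = 1.0            # GPS update period from the text
    last_render_s: float = float("-inf")   # no frame rendered yet

    def should_render(self, now_s, user_interacting, screen_locked):
        if screen_locked:
            return False  # nothing is visible, so rendering wastes power
        if user_interacting or now_s - self.last_render_s >= self.gps_interval_s:
            self.last_render_s = now_s
            return True
        return False


def sensors_enabled(screen_locked, user_requested):
    """Spatial sensors run only when unlocked and explicitly activated."""
    return user_requested and not screen_locked
```

Called from the main loop with the current time, `should_render` returns `True` roughly once per second while the user is idle and on every call while the user is interacting.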

Personalization of content. Today’s location-based services focus on attaching elements of static information to certain location coordinates. This basic limitation of connecting with points of interest implies the same content is presented regardless of user motivation, time of day, or length of journey. Approaches for better adapting spatial content to current user activity and interests are promising and should therefore be pursued.¹⁰ Applications should regard the bits and pieces of spatial information as an ensemble that could be used for recommending routes to suit immediate user needs.¹⁴ Analysis of what other users have done in the same or similar places can also help make mobile spatial services more user friendly and commercially viable.³,¹⁵

Conclusion

MSI stands a good chance of winning consumer acceptance as a standard form of mobile interaction; the interaction techniques are feasible and attractive, the technology enablers are becoming widely available, and the most important remaining problem areas have been identified by human-computer-interaction researchers. However, unless MSI is viewed as a coherent whole—a new paradigm of interaction with the physical real-world environment that uses handheld apps as if they were physical objects—interaction techniques and contents could become fragmented across applications, discouraging users from adopting or even trying this convenient form of interaction.

The greatest challenge is how to make MSI truly ubiquitous, encompassing relevant sources and types of information, as well as multiple usage contexts. In a worst-case scenario, users are constantly opening a special-purpose application, waiting for georeferenced data to load, and only then interacting with the application, in a limited way. Meeting this challenge requires a joint effort by the research community, device manufacturers, telecom operators, content providers, and hundreds of millions of end users worldwide.

Figures

Figure 1. Pointing with a magic wand.

Figure 2. Looking through a smart lens.

Figure 3. A sixth sense for blind users.

Figure 4. Visual-marker-based techniques.

Figure 5. Finger-pointing to virtual spatial media with a wearable camera.

Sidebar: key insights

Smart lenses, magic wands, and other interaction metaphors are beginning to let users explore digital information associated with everyday real-world objects.
Key technology enablers include spatial sensing, georeferenced digital content, and 3D environments, though effective use depends on first understanding related strengths and weaknesses.
Providers of new mobile spatial-interaction services should look to adapt human-computer interaction research, including optimal target selection techniques, attractive visualizations for remote access, and novel forms of way-finding.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DOI

10.1145/1866739.1866766

January 2011 Issue

Published: January 1, 2011

Vol. 54 No. 1

Pages: 132-138
