Research and Advances
Computing Applications Contributed articles

Sketch-Thru-Plan: A Multimodal Interface For Command and Control

Speaking military jargon, users can create labels and draw symbols to position objects on digitized maps.
  1. Introduction
  2. Command and Control
  3. Key Insights
  4. Multimodal Map-Based Systems
  5. Sketch-Thru-Plan
  6. STP Components
  7. Evaluations
  8. Transition and Deployment
  9. Conclusion
  10. Acknowledgments
  11. References
  12. Authors
  13. Footnotes
  14. Figures
  15. Tables

In 2000, Oviatt and Cohen25 predicted multimodal user interfaces would “supplement, and eventually replace, the standard GUIs of today’s computers for many applications,” focusing on mobile interfaces with alternative modes of input, including speech, touch, and handwriting, as well as map-based interfaces designed to process and fuse multiple simultaneous modes. In the intervening years, basic multimodal interfaces employing alternative input modalities have indeed become the dominant interface for mobile devices. Here, we describe an advanced fusion-based multimodal map system called Sketch-Thru-Plan, or STP, developed from 2009 to 2011 under the DARPA Deep Green program, enabling rapid creation of operational plans during command and control (C2) for military ground operations. As background, we describe the challenges ground operations pose for C2 systems and their user interfaces, and we discuss how C2 GUIs have led to inefficient operation and high training costs. We then cover STP’s multimodal interface, which addresses these problems, and its evaluations. Finally, we discuss deployment of the system by the U.S. Army and U.S. Marine Corps. This case study involves the user-centered design-and-development process required for promising basic research to scale reliably and be incorporated into mission-critical products in large organizations.

Command and Control

Command-and-control software must meet the needs of the commander and many types of staff, ranging from higher-echelon commanders (such as of an Army division or brigade) and their own dedicated staff to relatively inexperienced commanders of smaller units.1 Across this range, there is great need for a planning tool that is easy to learn and use for both actual and simulated operations while being functional in field and mobile settings with varying digital infrastructure and computing devices. No military C2 system currently meets all these requirements, due in part to GUI limitations.

Key Insights

  • Multimodal interfaces allow users to concentrate on the task at hand, not on the tool.
  • Multimodal speech+sketch employing standardized symbol names and shapes can be a much more efficient means of creating and positioning symbols on maps during planning.
  • Nearly all users tested preferred the multimodal interface to the graphical user interfaces commonly used for command-and-control and planning.

Prior to the introduction of digital systems, C2 functions were performed on paper maps with transparent plastic overlays and grease pencils. Users would collaboratively develop plans by speaking to one another while drawing on a map overlay. Such an interface had the benefits of requiring no interface training and offering fail-safe operation. However, obvious drawbacks included the need to copy data into digital systems and the lack of remote collaboration. To address these drawbacks, C2 systems today are based on GUI technologies. The most widely used Army C2 system, called the Command Post of the Future, or CPOF,11 is a three-screen system that relies on a drag-and-drop method of manipulating information. It supports co-located and remote collaboration through human-to-human dialogue and collaborative sketching. CPOF was a major advance over prior C2 systemsa and the primary Army C2 system during Operation Iraqi Freedom, starting in 2003.

Table 1 outlines how a CPOF user would send a unit to patrol a specified route from time 0800 to 1600. These 11 high-level steps can take an experienced user one minute to perform, with many more functions necessary to properly specify a plan. In comparison, with a version of the QuickSet multimodal interface that was tightly integrated with CPOF in 2006, a user could say “Charlie company, patrol this route <draw route> from oh eight hundred to sixteen hundred.” All attribute values are filled in with one simple utterance processed in six seconds on a tablet PC.

Soldiers must learn where the many functions are located within the menu system, how to link information by “ctrl-dragging”b a rendering of it and how to navigate among various screens and windows. With CPOF requiring so many atomic GUI steps to accomplish a standard function, SRI International and General Dynamics Corp. built a learning-by-demonstration system an experienced user could use to describe a higher-level procedure.21 Expert users were trained within their Army units to create such procedures, and the existence of the procedures would be communicated to the rest of the unit as part of the “lore” of operating the system. However, if the interface had supported easier expression of user intent, there would have been less need for the system to learn higher-level procedures. Thousands of soldiers are trained at great expense in Army schoolhouses and in deployed locations each year to operate this complex system.

One essential C2 planning task is to position resources, as represented by symbols on a map of the terrain. Symbols are used to represent military units, individual pieces of equipment, routes, tactical boundaries, events, and tasks. The symbol names and shapes are part of military “doctrine,” or standardized procedures, symbols, and language, enabling people to share meaning relatively unambiguously. Soldiers spend considerable time learning doctrine, and anything that reinforces doctrine is viewed as highly beneficial.

Each unit symbol has a frame and color (indicating friendly, hostile, neutral, and coalition), an echelon marking (such as a platoon) on top, a label or “designator” on the side(s), and a “role” (such as armored, medical, engineering, and fixed-wing aviation) in the middle, as well as numerous other markings (see Figure 1). This is a compositional language through which one can generate thousands of symbol configurations. In order to cope with the large vocabulary using GUI technology, C2 systems often use large dropdown menus for the types of entities that can be positioned on the map. Common symbols may be arrayed on a palette a user can select from. However, these palettes can become quite large, taking up valuable screen space better used for displaying maps, plans, and schedules.

Another method used in GUIs to identify a military unit involves specifying its compositional pieces in terms of the attributes and values for unit name, role, echelon, and strength. Each is displayed with multiple smaller menus from which the user chooses a value. The user may type into a search field that finds possible units through a string match, but must still select the desired entity and set any attribute values via menus. When a symbol is created or found, it is then positioned through a drag-and-drop operation onto the map. Due to these constraints (and many more) on system design, users told STP developers that C2 system interfaces based on such classical GUI technologies are difficult to learn and use. We have found speech-and-sketch interfaces employing doctrinal language, or standardized symbol names and shapes, to be a much more efficient means for creating and positioning symbols.

Multimodal Map-Based Systems

Many projects have investigated multimodal map-based interaction with pen and voice3,5,6,14,20,23 and with gesture and voice.2,7,16,19 Some such systems represent the research foundation for the present work, though none to our knowledge is deployed for C2. Apart from smartphones, the most widely deployed multimodal system is Microsoft’s Kinect, which tracks the user’s body movements and allows voice commands, primarily for gaming and entertainment applications, though enterprise and health applications are also beginning to appear.22 Other commercial multimodal systems have been developed for warehousing and are emerging in automobiles.26 Adapx’s STP work is most related to the QuickSet system developed at the Oregon Graduate Institute in the late 1990s.c QuickSet5,6,14,23,25 was a prototype multimodal speech/sketch/handwriting interface used for map-based interaction. Because speech processing needs no screen space, its multimodal interface was easily deployed on tablets, PDAs, and wearables, as well as on wall-size displays. Offering distributed, collaborative operation, it was used to position entities on a map by speaking and/or drawing, as well as to create tasks for them that could be simulated through the Modular Semi-Automated Forces, or ModSAF,4,8 simulator. QuickSet was also used to control 3D visualizations7 and various networked devices (such as TV monitors and augmented-reality systems16) through hand gestures tracked with acoustic, magnetic, and camera-based methods.

Based on extensive user-centered-design research, the Oregon Graduate Institute team showed users prefer to interact multimodally when manipulating a map. They are also able to select the best mode or combination of modes to suit their situation and task.23,25 User sketching typically provides spatial information (such as shape and location), while speech provides information about identity and other attributes. This user interface emulates the military’s non-digital practices using paper maps6 and leads to reduced cognitive load for the user.23

QuickSet’s total vocabulary was approximately 250 unit symbols and approximately 180 “tactical graphics,” as in Figure 1. Speech recognition was based on an early IBM recognizer, and sketch recognition involved a lightly trained neural network and hidden Markov-model recognizer. The major research effort was devoted to establishing innovative methods for multimodal fusion processing.14 QuickSet’s unification-based fusion of multimodal inputs14 supported mutual disambiguation, or MD, of modalities16,24 in which processing of information conveyed in one mode compensated for errors and ambiguities in others, leading to relative error rate reduction of 15%–67%; for example, a sketch with three objects could disambiguate that the user said “boats” and not “boat.” MD increased system robustness to recognition errors, which is critical in high-noise environments, when users speak with heavy accents, when sketches are created while moving, or when the user’s arm is tired.18 QuickSet demonstrated a multimodal interface could function robustly under such real-world conditions, a necessary precondition of field deployment.
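
The “boats” versus “boat” example can be illustrated with a small sketch. It is not QuickSet’s unification-based algorithm: the hypothesis classes, the scores, and the `compatible` rule below are all assumptions made for illustration. The idea shown is only that each recognizer contributes a ranked list of hypotheses, incompatible pairings are discarded, and the best surviving joint interpretation can overturn the top choice of a single mode.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SpeechHyp:
    text: str
    plural: bool
    score: float  # hypothetical recognizer confidence


@dataclass(frozen=True)
class SketchHyp:
    object_count: int
    score: float  # hypothetical recognizer confidence


def compatible(speech, sketch):
    # Assumed constraint: a plural noun needs several drawn objects,
    # a singular noun needs exactly one.
    return speech.plural == (sketch.object_count > 1)


def fuse(speech_nbest, sketch_nbest):
    """Return the best-scoring compatible (speech, sketch) pair, or None."""
    candidates = [
        (sp.score * sk.score, sp, sk)
        for sp, sk in product(speech_nbest, sketch_nbest)
        if compatible(sp, sk)
    ]
    if not candidates:
        return None
    _, sp, sk = max(candidates, key=lambda c: c[0])
    return sp, sk


# Speech alone slightly prefers "boat"; the sketch clearly shows three objects.
speech = [SpeechHyp("boat", False, 0.55), SpeechHyp("boats", True, 0.45)]
sketch = [SketchHyp(3, 0.9), SketchHyp(1, 0.1)]
best = fuse(speech, sketch)
```

Here the top speech hypothesis is wrong on its own, yet the fused result selects “boats” because it is the only reading consistent with the strongest sketch hypothesis.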

Sketch-Thru-Plan

DARPA established the Deep Green program in 2008 with the aim of using simulation during mission planning to enable planners to play out the consequences of a course of action, or COA, against the most likely and most dangerous enemy COAs. A COA is a specification of the actions a set of entities will perform over time to accomplish a mission. A critical piece of the Deep Green effort was to develop an easy-to-use interface that would allow a planning team to create COAs rapidly. DARPA chose Adapx’s multimodal technology, a derivative of QuickSet, for this task, along with the companies SAIC and BAE Systems. Finally, as prime contractor, Adapx developed the STP system with guidance from a team of subject-matter experts and testing by ROTC students. STP users develop their plans by speaking and drawing, with an optional GUI input if appropriate. STP interoperates with existing C2 systems, notably CPOF, the LG-RAID simulator,27 and visualizations based on Google Earth, populating them with planned entity positions and tasks.

Sketched inputs are captured through stylus, finger, mouse, or digital pen and paper inputs. Most people prefer both speech and sketch, but sketch alone can be very useful, especially with digital paper and pen. The fastest and easiest input style is to sketch a simple point, line, or area where a symbol should be placed while saying the symbol’s features (such as status, affiliation, role, strength, and designation); for example, in Figure 2 (left panel), the user can speak the type of unit or tactical graphic (“hostile mechanized infantry platoon,” “objective black,” or “company boundary alpha north bravo south”) while providing a point, line, or area drawing. The user can also draw the symbols, as in the “Fix” symbol, or zigzag arrow in Figure 2. These multimodal inputs are recognized and fused to provide military symbols, as in Figure 2, right panel. The resulting symbols are not just icons on a bitmap but digitally geo-registered objects that are part of the system’s databases and populate C2 and simulation systems.

The STP multimodal interface recognizes more than 4,000 symbol and tactical graphic configurations, as in Figure 1, each with a set of attributes and values. It also recognizes more than 150 tactical tasks (such as patrolling a route and delivering supplies) that make use of spatially and semantically related groups of symbols. The symbols are provided to both the speech and sketch recognizers through a database that specifies related labels, icons, and unique identifiers. The recognizer vocabularies are populated automatically, enabling the system to adopt additional symbols or switch to new domains.
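
To illustrate how a recognizer vocabulary might be populated automatically from a symbol database, here is a minimal sketch. The table contents, the attribute names, and the `build_vocabulary` helper are hypothetical and are not taken from STP; the point is only that phrases are generated from the data rather than written by hand, so adding a row extends what can be spoken.

```python
from itertools import product

# Hypothetical database contents: each attribute with its allowed spoken values.
SYMBOL_DB = {
    "affiliation": ["friendly", "hostile"],
    "role": ["mechanized infantry", "armor"],
    "echelon": ["platoon", "company"],
}


def build_vocabulary(db):
    """Map every generated spoken phrase to its attribute-value description."""
    names = list(db)
    vocab = {}
    for values in product(*(db[name] for name in names)):
        phrase = " ".join(values)
        vocab[phrase] = dict(zip(names, values))
    return vocab


vocab = build_vocabulary(SYMBOL_DB)

# Adding one new role to the data enlarges the vocabulary without code changes.
extended = dict(SYMBOL_DB, role=SYMBOL_DB["role"] + ["medical"])
bigger_vocab = build_vocabulary(extended)
```

With two values per attribute the sketch yields 8 phrases, and one extra role raises that to 12, which hints at how a compositional symbol language reaches thousands of configurations.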

STP enables a team of users to develop a plan collaboratively, with different users serving various functional roles (such as commander, logistics, intelligence, and engineering) and contributing their portions of the overall plan. The system supports users in creating notional entities (such as a default rifle company) or positioning actual entities from an existing military organization, assigning and synchronizing tasks, filling out worksheets that populate role-specific systems (such as for logistics and engineering), and creating required documents (such as an operations order).

The scope of STP’s vocabulary and task coverage is an order of magnitude greater than that of QuickSet. It also uses more capable speech recognition, sketch recognition, and mapping subsystems.d Like QuickSet, STP supports the same multimodal interface across multiple form factors, including handheld and tablet computers, as well as digital pen and paper.6 STP’s design is informed by domain experts, resulting in a full planning tool. Rather than being a research prototype, it has been developed to a higher level of technology readiness,e and STP has been tested by actual operational military units. As a result, STP is being transitioned to multiple organizations within the U.S. military.

STP Components

The system’s major functional components are covered in the following sections.

Speech and natural language. The goal of STP’s spoken-language processing is support for multimodal language, which is briefer and less complex than speech-only constructions.25 The aim is to provide significantly improved user performance, be transparent and memorable to the user, and minimize cognitive load and recognition errors. The STP approach is as follows: Users typically speak noun phrases and draw symbols to create and position entities. The basic spoken language vocabulary of nouns with their attributes and values is defined in the database, as discussed earlier.

When creation of an entity involves multiple attributes, users are encouraged to not impart all information in one long utterance, as it can lead to syntactic complexity, “disfluencies,” and other obstacles to robust spoken interaction. Instead, users can select an entity by drawing a mark on it and speaking subsequent attribute-value informationf; for example, a unit’s strength can be altered by marking it and saying “strength reduced.” Likewise, a user can create a restricted operating zone for aviation for a certain time period and altitude with “R O Z <draw area> from oh eight hundred to sixteen hundred,” <mark the ROZ> “minimum altitude one thousand meters maximum two thousand meters.” For most C2 systems, creating this 3D tactical graphic is very time consuming.

Unlike QuickSet, STP includes two natural language parsers, one handling just noun phrases, whose constituent phrases can occur in any order, and Gemini,9 which uses a large-scale grammar of English. The noun phrase parser is used during symbol creation for analyzing such phrases as “anticipated enemy mechanized platoon strength reduced.” The broader-coverage Gemini parser is used when describing tasks for the entities that typically involve uttering verb phrases (such as “Resupply along MSR alpha”). Gemini has been used in many systems, including at NASA,10 and is one of the most competent parsers of English designed for spoken-language systems. The verb phrases in the grammar are derived from a “task signature” table that specifies the types of required and optional arguments for each military task. Because the system can infer potential tasks, the need for a user to utter complex sentences is minimized.
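
A noun-phrase parser whose constituents can occur in any order can be sketched as a lexicon lookup that fills attribute slots. This is an illustrative assumption, not STP’s parser: the `LEXICON` entries and the greedy longest-match strategy are invented for the example.

```python
# Hypothetical lexicon: word sequences mapped to (attribute, value) pairs.
LEXICON = {
    ("anticipated",): ("status", "anticipated"),
    ("enemy",): ("affiliation", "hostile"),
    ("hostile",): ("affiliation", "hostile"),
    ("mechanized",): ("role", "mechanized"),
    ("platoon",): ("echelon", "platoon"),
    ("strength", "reduced"): ("strength", "reduced"),
}
MAX_LEN = max(len(key) for key in LEXICON)


def parse_noun_phrase(utterance):
    """Fill attribute slots from an utterance, ignoring constituent order."""
    tokens = utterance.lower().split()
    result = {}
    i = 0
    while i < len(tokens):
        # Try the longest lexicon entry first so "strength reduced" wins.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LEXICON:
                attr, value = LEXICON[key]
                if attr in result:
                    raise ValueError(f"attribute {attr!r} given twice")
                result[attr] = value
                i += n
                break
        else:
            raise ValueError(f"unknown word {tokens[i]!r}")
    return result


a = parse_noun_phrase("anticipated enemy mechanized platoon strength reduced")
b = parse_noun_phrase("strength reduced platoon mechanized enemy anticipated")
```

Both word orders produce the same attribute-value description, which is the property the text attributes to the noun-phrase parser.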

Speech recognition employs the Microsoft Speech engine in Windows 7/8 using grammar-based recognition. The system is always listening, but speech recognition is coordinated with screen touching, so human-to-human conversation without sketch does not result in spurious recognition. STP coordinates multiple simultaneous instances of the recognizer, each with a different grammar, or “context,” as a function of the user interface state. Contextual knowledge restricts potential speech and language, thus increasing accuracy and speed; for example, an “attribute-value” grammar context is invoked when a stroke is drawn over an object on the map. As the context-setting actions may themselves be ambiguous, STP is designed to compare the results of multiple simultaneous recognizers embodying different restrictions.
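
The coordination described above can be sketched in a few lines. Everything here is an assumption for illustration: the grammar contents, the two-second window, the context priors, and the exact-match “recognizer” stand in for real components and do not reflect the Microsoft Speech API or STP internals. The sketch shows two ideas: speech is ignored unless a screen touch occurs close in time, and several grammar contexts are consulted at once because the context-setting stroke may be ambiguous.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str
    start: float  # seconds
    end: float    # seconds


# Hypothetical grammar contexts, each a set of accepted phrases.
GRAMMARS = {
    "create_symbol": {"hostile mechanized infantry platoon", "objective black"},
    "attribute_value": {"strength reduced"},
}


def recognize(utterance, touch_times, stroke_over_object, window=2.0):
    """Return (context, text), or None if the speech should be ignored."""
    # Gate: speech with no nearby touch is treated as human-to-human talk.
    touched = any(
        utterance.start - window <= t <= utterance.end + window
        for t in touch_times
    )
    if not touched:
        return None
    # A stroke over an existing object favors attribute-value input, but
    # symbol creation remains possible, so both contexts stay active.
    if stroke_over_object:
        priors = {"attribute_value": 0.7, "create_symbol": 0.3}
    else:
        priors = {"create_symbol": 1.0}
    matches = [
        (prior, context)
        for context, prior in priors.items()
        if utterance.text in GRAMMARS[context]
    ]
    if not matches:
        return None
    _, context = max(matches)
    return context, utterance.text


chatter = recognize(Utterance("strength reduced", 5.0, 6.0), [], True)
edit = recognize(Utterance("strength reduced", 5.0, 6.0), [5.2], True)
create = recognize(Utterance("objective black", 5.0, 6.0), [5.2], True)
```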

In the future, it may be helpful to use spoken dictation, as in Google Voice, Nuance Communications’ Dragon Dictate, Apple’s Siri, and speech-to-speech translation systems,13 that require development of large-scale statistical-language models. However, because the spoken military data needed to build such language models is likely classified, this approach to creating a language model could be problematic. Since STP can take advantage of users’ knowledge of military jargon and a structured planning process,g grammar-based speech recognition has thus far been successful.

Sketch recognition. STP’s sketch recognizer is based on algorithms from computer vision, namely Hausdorff matching,17 using an array of ink interpreters to process sketched symbols and tactical graphics (see Figure 3). For unit symbols, the recognizer’s algorithm uses templates of line segments, matching the sketched digital ink against them and applying a modified Hausdorff metric based on stroke distance and stroke angles to compute similarity. For tactical graphics, the recognizer creates graphs of symbol pieces and matches them against the input. Fundamental to them all is a spatiotemporal ink segmenter. Regarding spatial segmentation, if the minimum distance of a given stroke from the currently segmented group of strokes, or “glyph,” is below a threshold proportional to the already existing glyph size and its start time is within a user-settable threshold from the end-time of the prior stroke, then the new stroke is added to the existing glyph.
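
The spatiotemporal segmentation rule just described can be sketched as follows. The `Stroke` type, the threshold values, and the minimum glyph size (added so a single dot does not yield a zero threshold) are assumptions for illustration rather than STP’s actual parameters.

```python
import math
from dataclasses import dataclass


@dataclass
class Stroke:
    points: list   # [(x, y), ...] in screen units
    start: float   # seconds
    end: float     # seconds


def min_distance(stroke, glyph):
    """Smallest point-to-point distance between a stroke and a glyph."""
    return min(
        math.dist(p, q)
        for p in stroke.points
        for other in glyph
        for q in other.points
    )


def glyph_size(glyph, min_size=1.0):
    """Larger side of the glyph's bounding box, with an assumed floor."""
    xs = [x for s in glyph for x, _ in s.points]
    ys = [y for s in glyph for _, y in s.points]
    return max(max(xs) - min(xs), max(ys) - min(ys), min_size)


def segment(strokes, space_factor=0.5, max_gap=1.0):
    """Group strokes into glyphs using a spatial and a temporal threshold."""
    glyphs = []
    for stroke in sorted(strokes, key=lambda s: s.start):
        if glyphs:
            current = glyphs[-1]
            near = min_distance(stroke, current) < space_factor * glyph_size(current)
            soon = stroke.start - current[-1].end <= max_gap
            if near and soon:
                current.append(stroke)
                continue
        glyphs.append([stroke])
    return glyphs


strokes = [
    Stroke([(0, 0), (10, 0)], 0.0, 0.5),
    Stroke([(10, 0), (10, 10)], 0.7, 1.2),        # close in space and time
    Stroke([(100, 100), (110, 100)], 1.5, 2.0),   # too far away
    Stroke([(100, 101), (110, 101)], 10.0, 10.5), # close, but too late
]
glyphs = segment(strokes)
```

The first two strokes merge into one glyph, while the third fails the spatial test and the fourth fails the temporal test, so each of those starts a new glyph.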

For template-based unit-icon interpretation, the affiliation “frame” is first located, after which the glyph is broken into its constituent parts, including affiliation, role, and echelon, that have canonical locations relative to the frame. Though the roles may themselves have compositional structure, they are matched holistically. Where linguistic content that annotates the icon is conventionally expected, handwriting is processed by Microsoft’s recognizer. These parts are then compared to a library of template images, with the results combined to form recognition outputs. If a symbol “frame” is not found, the sketch recognizer attempts to use the tactical graphics interpreter. For tactical graphics, whose shapes can be elongated or contorted, the algorithm uses a graph-matching approach that first partitions the glyph into a graph of line segments and nodes. This graph is then matched against piecewise graph templates that allow for elongation or bending. The pieces are recombined based on sketch rules that define the relations between the pieces and anchor points from which a complete symbol can be composed; for example, such rules define a “forward line of own troops,” or FLOT, symbol, as in Figure 1, as a “linear array” of semicircles (a “primitive”), with a barbed-wire fence composed of two approximately parallel lines, plus a parallel linear array of circles.

An advantage of the template-based approach to unit-icon recognition is easy expansion by adding new templates to the library; for example, new unit roles can be added in the form of scalable vector graphics that would then be located within the affiliation border by the compositional unit symbol recognizer.
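
The appeal of a template library can be seen in a small matching sketch. It uses the classic symmetric Hausdorff distance between point sets; STP’s modified metric, which also considers stroke angles, is not reproduced here, and the templates, the normalization step, and the function names are hypothetical. Templates are plain point lists rather than scalable vector graphics to keep the example self-contained.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def normalize(points):
    """Translate to the origin and scale the larger side to length 1."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, y0 = min(xs), min(ys)
    scale = max(max(xs) - x0, max(ys) - y0) or 1.0
    return [((x - x0) / scale, (y - y0) / scale) for x, y in points]


def classify(ink, templates):
    """Return (distance, name) pairs, best match first."""
    ink = normalize(ink)
    return sorted(
        (hausdorff(ink, normalize(shape)), name)
        for name, shape in templates.items()
    )


# Hypothetical template library.
templates = {
    "horizontal_line": [(i / 4, 0.0) for i in range(5)],
    "diagonal_line": [(i / 4, i / 4) for i in range(5)],
}

wobbly = [(10, 20), (20, 20.5), (30, 20), (40, 19.5), (50, 20)]
first = classify(wobbly, templates)

# Extending coverage only requires adding a template to the library.
templates["vertical_line"] = [(0.0, i / 4) for i in range(5)]
upright = [(5, 0), (5, 10), (5, 20), (5, 30), (5, 40)]
second = classify(upright, templates)
```

Returning the full ranked list rather than a single winner mirrors the list of alternative hypotheses a recognizer can offer for later fusion or user correction.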

Explicit and implicit task creation. Aside from creating and positioning symbols on a map, users can state tasks explicitly or rely on the system to implicitly build up an incremental interpretation of the set of tasks that use those symbols (see Figure 4). STP does the latter inference by matching the symbols on the map against the argument types of possible domain tasks (such as combat service units perform “supply” and medical units perform “evacuate casualties”), as in Figure 4, subject to spatiotemporal constraints. STP presents the planner with a running visualization of matching tasks in the evolving plan under creation. The planner can readily inspect the potential tasks, accepting or correcting them as needed. Here, STP has inferred that the Combat Service Support unit and Main Supply Route A can be combined into the task Resupply along Main Supply Route A. If that is correct, the planner can select the checkbox that then updates the task matrix and schedule. As the planner adds more symbols to the map, the system’s interpretations of matching tasks are likewise updated. Task start and end times can be spoken or adjusted graphically in a standard Gantt chart task-synchronization matrix. Note that STP is not trying to do automatic planning or plan recognition but rather to assist during the planning process; for instance, STP can generate a templated “operations order” from the tasks and graphics, a required output of the planning process. Much more planning assistance can, in principle, be provided, though it is not yet clear what a planner would prefer.
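
Implicit task inference of this kind can be sketched as matching map symbols against a table of task signatures. The signature table, the symbol kinds, and the distance threshold below are hypothetical, and only a spatial constraint is modeled; temporal constraints and STP’s actual matching logic are not shown.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Symbol:
    name: str
    kind: str
    position: tuple  # (x, y) map coordinates, units assumed


# Hypothetical task signatures: task name -> required argument kinds.
TASK_SIGNATURES = {
    "Resupply along": ("combat_service_support_unit", "supply_route"),
    "Evacuate casualties along": ("medical_unit", "supply_route"),
}


def infer_tasks(symbols, max_distance=5.0):
    """Propose tasks whose argument kinds are satisfied by nearby symbols."""
    proposals = []
    for task, arg_kinds in TASK_SIGNATURES.items():
        pools = [[s for s in symbols if s.kind == kind] for kind in arg_kinds]
        for unit, route in product(*pools):
            # Assumed spatial constraint: the unit must be near the route.
            if math.dist(unit.position, route.position) <= max_distance:
                proposals.append(f"{unit.name}: {task} {route.name}")
    return proposals


symbols = [
    Symbol("Combat Service Support unit", "combat_service_support_unit", (1, 1)),
    Symbol("Main Supply Route A", "supply_route", (2, 2)),
]
initial = infer_tasks(symbols)

# Adding a symbol to the map updates the set of proposed tasks.
symbols.append(Symbol("Medical unit", "medical_unit", (3, 2)))
updated = infer_tasks(symbols)
```

The proposals are only suggestions; as in the text, the planner would accept or correct them rather than have the system plan automatically.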

Because the system is database driven, the multimodal interface and system technology have many potential commercial uses, including other types of operations planning (such as “wildland” firefighting, as firefighters say), as well as geographical information management, computer-aided design, and construction management.

Evaluations

Four types of evaluations of STP have been conducted by the U.S. military: component evaluations, user juries, a controlled study, and exercise planning tests.

Component evaluations. In recognition tests of 172 symbols by a DARPA-selected third-party evaluator during the Deep Green Program in 2008, the STP sketch-recognition algorithm had an accuracy of 73% for recognizing the correct value at the top of the list of potential symbol-recognition hypotheses. The next-best Deep Green sketch recognizer built for the same symbols and tested at the same time with the same data had a 57% recognition accuracy for the top-scoring hypothesis.12 Rather than use sketch alone, most users prefer to interact multimodally, speaking labels while drawing a point, line, or area. STP’s multimodal recognition in 2008, as reported by an externally contracted evaluator, was a considerably higher 96%. If STP’s interpretation is incorrect, users are generally able to re-enter the multimodal input, select among symbols on a list of alternative symbol hypotheses, or invoke the multimodal help system that presents the system’s coverage and can be used for symbol creation.

STP has also been tested using head-mounted noise-canceling microphones in high-noise Army vehicles. Two users—one male, one female—issued a combined total of 221 multimodal commands while riding in each of two types of moving vehicles in the field, with mean noise 76.2 dBA and spikes to 93.3 dBA. They issued the same 220 multimodal commands to STP with the recorded vehicle noise played at maximum volume in the laboratory, with mean 91.4 dBA and spikes to 104.2 dBA. These tests resulted in 94.5% and 93.3% multimodal recognition accuracy, respectively. We conjecture that, in addition to the multimodal architecture, the noise-canceling microphones may have compensated for the loud but relatively constant vehicle noise. Further research by the STP team will look to tease apart the contributions of these factors in a larger study.

User juries. One way the Army tests software is to have soldiers just returning from overseas deployment engage in a “user jury” to try a potential product and provide opinions as to whether it would have been useful in their recent activities. To get soldier feedback on STP from 2011 to 2013, the Army’s Training and Doctrine Command invited 126 soldiers from four Army divisions experienced with the vehicle C2 system and/or CPOF to compare them with STP. For privacy, this article has changed the names of those divisions to simply Divisions 1, 2, 3, and 4. STP developers trained soldiers for 30 minutes on STP, then gave them a COA sketch to enter using STP. The soldiers later filled out a five-point Likert-style questionnaire. In all areas, STP was judged more usable and preferred to the soldiers’ prior C2 systems; Table 2 summarizes their comparative ratings of STP versus their prior C2 systems.

Controlled user study. Contractors have difficulty running controlled studies with active-duty soldiers. However, during the STP user jury in January 2013, 12 experienced CPOF users from Division 3 evaluated STP vs. CPOF in a controlled experiment. Using a within-subject design with order-of-system-use counterbalanced, experienced CPOF users were trained for 30 minutes by the STP development team on STP, then given the COA sketch in Figure 5 to enter through both STP and CPOF. Results showed these experienced CPOF users created and positioned units and tactical graphics on the map using STP’s multimodal interface 235% faster than with CPOFh; these subjects’ questionnaire remarks are included in Table 2. Note “symbol laydown” is only one step in the planning process, which also included tasking of units and creating a full COA and an operations order. Experts have reported the STP time savings for these other planning functions is considerably greater.

Exercise planning test. STP was recently used by a team of expert planners charged with developing exercise COAs that would ultimately appear in CPOF. The STP team worked alongside another expert planning team using Microsoft PowerPoint in preference to CPOF to develop its plans. Many attempts to develop exercise COAs have used various planning tools, including CPOF itself, but PowerPoint continues to be used in spite of its many limitations (such as lack of geospatial fidelity) because it is known by all. When the exercise was over, the team using PowerPoint asked for STP for planning future exercises.

Transition and Deployment

Although the U.S. military is extremely conservative in its adoption of computing technology, there is today a growing appreciation that operational efficiency and training are being hampered by systems with different user interfaces and operational difficulty. Still, such a realization takes time to pervade such a large organization with so many military and civilian stakeholders, including operational users and the defense-acquisition community. In addition to technology development, it took the STP development team years of presentations, demonstrations, tests, and related activities to achieve the visibility needed to begin to influence organizational adoption. Over that time, although the STP prototypes had been demonstrated, commercial availability of speech recognition was necessary to enable conservative decision makers to decide that the risk from incorporating speech technology into mission-critical systems had been sufficiently reduced. Moreover, the decision makers had independently become aware of the effects of interface complexity on their organizations’ training and operations. Still, the process is by no means complete, with organizational changes and thus customer education always a potential problem. Currently, STP has been transitioned to the Army’s Intelligence Experimentation Analysis Element, the Army Simulation and Training Technology Center, and the Marine Corps Warfighting Laboratory’s Experiments Division where it is used for creating plans for exercises and integrating with simulators. We have also seen considerable interest from the Army’s training facilities, where too much time is spent training students to use C2 systems, in relation to time spent on the subject matter. Moreover, beyond STP’s use as a planning tool, there has been great interest in its multimodal technology for rapid data entry, in both vehicle-based computers and handhelds.

Regarding full deployment of STP, the congressionally mandated “program of record” acquisition process specifies program budgets many years into the future; new technologies have a difficult time being incorporated into such programs, as they must become an officially required capability and selected to displace already budgeted items in a competitive feature triage process. In spite of these hurdles, STP and multimodal interface technology are now being evaluated for integration into C2 systems by the Army’s Program Executive Office responsible for command-and-control technologies.

Conclusion

We have shown how the STP multimodal interface can address user-interface problems challenging current C2 GUIs. STP is quick and easy to learn and use and supports many different form factors, including handheld, tablet, vehicle-based, workstation, and ultra-mobile digital paper and pen. The interface takes advantage of and reinforces skills soldiers already have, as they are trained in the standardized language, symbols, and military decision-making process. In virtue of this common “doctrinal” language, STP users can quickly create a course of action or enter data multimodally for operations, C2, and simulation systems without extensive training on a complicated user interface. The result is a highly usable interface that can be integrated with existing C2 systems, thus increasing user efficiency while decreasing cost.

Acknowledgments

STP development was supported by Small Business Innovation Research Phase III contracts, including HR0011-11-C-0152 from DARPA, a subcontract from SAIC under prime contract W15P7T-08-C-M011, a subcontract from BAE Systems under prime contract W15P7T-08-C-M002, and contract W91CRB-10-C-0210 from the Army Research, Development, and Engineering Command/Simulation and Training Technology Center. This article is approved for public release, distribution unlimited. The results of this research and the opinions expressed herein are those of the authors, and not those of the U.S. Government. We thank Todd Hughes, Colonels (ret.) Joseph Moore, Pete Corpac, and James Zanol and the ROTC student testers. We are grateful to Paulo Barthelmess, Sumithra Bhakthavatsalam, John Dowding, Arden Gudger, David McGee, Moiz Nizamuddin, Michael Robin, Melissa Trapp-Petty, and Jack Wozniak for their contributions to developing and testing of STP. Thanks also to Sharon Oviatt, General (ret.) Peter Chiarelli, and the anonymous reviewers.

Figures

Figure 1. Compositional military-unit symbols and example tactical graphic.

Figure 2. Speech and sketch (left) processed by STP into the digital objects on the right.

Figure 3. Overview of sketch-recognition processing: TG = tactical graphic, PLA = point-line-area.

Figure 4. Implicit task creation.

Figure 5. COA sketch used in controlled study, January 2013.

Tables

Table 1. Steps to send a unit on patrol through Command Post of the Future.

Table 2. Questionnaire results for STP vs. prior C2 system(s) used by the subjects.

References

    1. Alberts, D.S. and Hayes, R.E. Understanding Command and Control. DoD Command and Control Research Program Publication Series, Washington, D.C., 2006.

    2. Bolt, R.A. Voice and gesture at the graphics interface. ACM Computer Graphics 14, 3 (1980), 262–270.

    3. Cheyer, A. and Julia, L. Multimodal maps: An agent-based approach. In Proceedings of the International Conference on Cooperative Multimodal Communication (Eindhoven, the Netherlands, May). Springer, 1995, 103–113.

    4. Clarkson, J.D. and Yi, J. LeatherNet: A synthetic forces tactical training system for the USMC commander. In Proceedings of the Sixth Conference on Computer Generated Forces and Behavioral Representation, Technical Report IST-TR-96-18, University of Central Florida, Institute for Simulation and Training, Orlando, FL, 1996, 275–281.

    5. Cohen, P.R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L., and Clow, J. QuickSet: Multimodal interaction for distributed applications. In Proceedings of the Fifth ACM International Conference on Multimedia (Seattle, WA, Nov. 9–13). ACM Press, New York, 1997, 31–40.

    6. Cohen, P.R. and McGee, D.R. Tangible multimodal interfaces for safety-critical applications. Commun. ACM 47, 1 (Jan. 2004), 41–46.

    7. Cohen, P.R., McGee, D., Oviatt, S., Wu, L., Clow, J., King, R., Julier, S., and Rosenblum, L. Multimodal interaction for 2D and 3D environments. IEEE Computer Graphics and Applications 19, 4 (Apr. 1999), 10–13.

    8. Courtemanche, A.J. and Ceranowicz, A. ModSAF development status. In Proceedings of the Fifth Conference on Computer Generated Forces and Behavioral Representation, University of Central Florida, Institute for Simulation and Training, Orlando, FL, 1995, 3–13.

    9. Dowding, J., Gawron, J.M., Appelt, D., Bear, J., Cherny, L., Moore, R., and Moran, D. Gemini: A natural language system for spoken-language understanding. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (Ohio State University, Columbus, OH, June 22–26). Association for Computational Linguistics, Stroudsburg, PA, 1993, 54–61.

    10. Dowding, J., Frank, J., Hockey, B.A., Jonsson, A., Aist, G., and Hieronymus, J. A spoken-dialogue interface to the EUROPA planner. In Proceedings of the Third International NASA Workshop on Planning and Scheduling for Space (Washington, D.C.). NASA, 2002.

    11. Greene, H., Stotts, L., Patterson, R., and Greenburg, J. Command Post of the Future: Successful Transition of a Science and Technology Initiative to a Program of Record. Defense Acquisition University, Fort Belvoir, VA, Jan. 2010.

    12. Hammond, T., Logsdon, D., Peschel, J., Johnston, J., Taele, P., Wolin, A., and Paulson, B. A sketch-recognition interface that recognizes hundreds of shapes in course-of-action diagrams. In Proceedings of ACM CHI Conference on Human Factors in Computing Systems (Atlanta, Apr. 10–15). ACM Press, New York, 2010, 4213–4218.

    13. Hyman, P. Speech-to-speech translations stutter, but researchers see mellifluous future. Commun. ACM 57, 4 (Apr. 2014), 16–19.

    14. Johnston, M., Cohen, P.R., McGee, D., Oviatt, S.L., Pittman, J.A., and Smith, I. Unification-based multimodal integration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Annual Meeting of the European ACL (Madrid, Spain, July 7–12). Association for Computational Linguistics, Stroudsburg, PA, 1997, 281–288.

    15. Johnston, M., Bangalore, S., Varireddy, G., Stent, A., Ehlen, P., Walker, M., Whittaker, S., and Maloor, P. MATCH: An architecture for multimodal dialogue systems. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (Philadelphia, PA, July). Association for Computational Linguistics, Stroudsburg, PA, 2002, 376–383.

    16. Kaiser, E., Olwal, A., McGee, D., Benko, H., Corradini, A., Li, X., Cohen, P.R., and Feiner, S. Mutual disambiguation of 3D multimodal interaction in augmented and virtual reality. In Proceedings of the Seventh International Conference on Multimodal Interfaces (Trento, Italy, Oct. 4–6). ACM Press, New York, 2005, 12–19.

    17. Kara, L.B. and Stahovich, T.F. An image-based, trainable symbol recognizer for hand-drawn sketches. Computers and Graphics 29, 4 (2005), 501–517.

    18. Kumar, S., Cohen, P.R., and Coulston, R. Multimodal interaction under exerted conditions in a natural field setting. In Proceedings of the Sixth International Conference on Multimodal Interfaces (State College, PA, Oct. 13–15). ACM Press, New York, 2004, 227–234.

    19. MacEachren, A.M., Cai, G., Brewer, I., and Chen, J. Supporting map-based geo-collaboration through natural interfaces to large-screen display. Cartographic Perspectives 54 (Spring 2006), 16–34.

    20. Moran, D.B., Cheyer, A.J., Julia, L.E., Martin, D.L., and Park, S. Multimodal user interfaces in the Open Agent Architecture. In Proceedings of the Second International Conference on Intelligent User Interfaces (Orlando, FL, Jan. 6–9). ACM Press, New York, 1997, 61–68.

    21. Myers, K., Kolojejchick, J., Angiolillo, C., Cummings, T., Garvey, T., Gervasio, M., Haines, W., Jones, C., Knittel, J., Morley, D., Ommert, W., and Potter, S. Learning by demonstration for military planning and decision making: A deployment story. In Proceedings of the 23rd Innovative Applications of Artificial Intelligence Conference (San Francisco, CA, Aug. 6–10). AAAI Press, Menlo Park, CA, 2011, 1597–1604.

    22. O'Hara, K., Gonzalez, G., Sellen, A., Penney, G., Varnavas, A., Mentis, H., Criminisi, A., Corish, R., Rouncefield, M., Dastur, N., and Carrell, T. Touchless interaction in surgery. Commun. ACM 57, 1 (Jan. 2014), 70–77.

    23. Oviatt, S.L. Multimodal interfaces. In The Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, Revised Third Edition, J. Jacko, Ed. Lawrence Erlbaum Associates, Mahwah, NJ, 2012, 405–430.

    24. Oviatt, S.L. Taming recognition errors with a multimodal architecture. Commun. ACM 43, 9 (Sept. 2000), 45–51.

    25. Oviatt, S.L. and Cohen, P.R. Perceptual user interfaces: Multimodal interfaces that process what comes naturally. Commun. ACM 43, 3 (Mar. 2000), 45–53.

    26. Oviatt, S.L. and Cohen, P.R. The Paradigm Shift to Multimodality in Contemporary Computer Interfaces. Morgan & Claypool Publishers, San Francisco, CA, 2015.

    27. Stilman, B., Yakhnis, V., and Umanskiy, O. Strategies in large-scale problems. In Adversarial Reasoning: Computational Approaches to Reading the Opponent's Mind, A. Kott and W. McEneaney, Eds. Chapman & Hall/CRC, London, U.K., 2007, 251–285.

    28. U.S. Army. U.S. Army Field Manual 101-5-1, Chapter 5, 1997.


Footnotes

    b. "Ctrl-dragging" refers to holding down the CTRL key while also holding the left mouse button on a map symbol, then dragging the symbol elsewhere in the user interface; a "clone" of the symbol appears at the destination location, such that if the original one is changed, the clone is changed as well.

    c. Adapx Inc. was a corporate spin-off of the Oregon Graduate Institute's parent institution, the Oregon Health and Science University, Portland, OR.

    d. Unlike QuickSet's use of a geo-registered bitmap, STP uses the ArcGIS mapping system from ESRI.

    e. In terms of NASA's "technology readiness level" scale, QuickSet was developed through level 3, whereas STP has been developed through level 6. Ongoing development and deployment will take it to level 9.

    f. Note "selection" through marking is not an atomic operation but must be recognized and interpreted, as a stroke can be ambiguous. Selection through marking avoids having a "moded" interface in which selection is a distinct mode; likewise, STP does not use hardware-assisted selection (such as a specific button on a pen or keyboard) to support simple touch or digital pen and paper.

    g. Soldiers are taught to use the structured Military Decision-Making Process.28

    h. Two-sample F-test for variances: F(19) = 4.05, p < 0.02.
