For the design of computer interfaces, interaction can be represented as situated acts that abstract up from situated action. The situated-act representation specifies interaction independent of interface modalities, enabling specialization of interfaces into particular modalities. The approach is illustrated through interfaces in two domains: an aircraft navigation system and a multimodal VR viewer.
Situated action, situated acts, modality, multimodal interfaces
It is generally accepted that when humans and machines interact cooperatively, they rely on communicative actions to carry out their intentions. A problem with this view, though, is that the concept of action is typically linked to the physical affordances of the interface or the transmissive characteristics of the medium of communication. As a result, accounts of interaction--even situated interaction--tend to be expressed in terms of actions: "Push button B" or "Move levers forward to position P." We suggest that interaction can be more fruitfully represented at a higher level of abstraction, situated acts, which can represent the force of the communication in terms of the domain task instead of in terms of the interface: "Go to navigation fix F" or "Accelerate to speed S." In this paper, then, we develop a theory of situated acts and show how this approach can be applied to interaction in a variety of domains, including (a) aircraft navigation systems and (b) multimodal navigation interfaces. We discuss the theory's implications for design methodology and explain how the situated-act approach offers the promise of designing cooperative systems at a high level, independent of interface modality.
The distinction between action and act can be seen in the photocopier interface studied by Suchman (1987). For example, in Display 4 (p. 173), the machine's instruction reads "Press the Start button (to produce a copy in the output tray)." The action is pressing the start button; the goal is producing a copy in the output tray; and the act is a command to copy a page. In contrast, in Display 3 (p. 173), the machine's instruction reads "Place your original face down on the glass, centered over the registration guide (to position it for the copier lens)." Here the action is to place the original face down on the glass, centered over the registration guide; the goal is positioning the original for the copier lens; but there is no act, because the user's action has no communicative value for the copier and the user has no expected outcome other than the direct product of their own action.
The kinds of acts we suggest are taking place in cases such as the photocopier example are like classical speech acts (Austin, 1962; Searle, 1969) but are broader in their definition. These situated acts, like speech acts, have intended effects, symbolic force, form of expression, and actual effects. Unlike speech acts, situated acts do more than "do things with words:" they include the kinds of direct actions that speech cannot usually carry out, such as direct manipulation of objects in the interface or in the real world. In a graphical user interface, direct manipulation has symbolic content: the action of moving a document icon to a location depicting a folder might mean to move or to copy the document into the folder. Symbolic content constituting an act could also be ascribed to physical actions in the world. These actions could be subtle and overtly para-linguistic, such as casting a quizzical glance, or hugely physical, not obviously linguistic but no less symbolically communicative, such as punching someone in the nose.
Task-action grammars (Payne & Green, 1989) related user behaviors to task accomplishment; they provided means of expressing interactions in detail at the level of action. Here, we are attempting to abstract up from actions to acts; our problem is to produce accounts of interaction that make sense--especially to the human actors--at the act level and then to develop appropriate means of expressing these acts as actions in an interface.
A version of abstraction from action can be found in activity theory (Nardi, 1996), which describes interaction at an "activity" level and pays particular attention to social context. Activity theory, however, has some characteristics that limit its usefulness for the analysis and reformulation of interfaces of the kind we consider here. In particular, it lacks formalisms for dealing with communicative intent and multiple agents. More generally, it has not been operationalized in terms of techniques that offer specific analyses of particular interfaces and their contexts. Consequently, our notion of situated acts tries to embody some of the spirit of activity theory while providing a more concrete basis for development and reformulation of interfaces in communicative, multi-agent settings.
There are a number of fairly well-known problems with speech-act models of interaction (cf. Levinson, 1981). The most serious of these is the problem of knowing the complete range of intended and actual effects (Marcu, 1997). This is especially the case for interactions among human agents, which reflect a wide variety of social needs and goals. For machine agents (at least agents with no social goals, which is all we are likely to encounter for the foreseeable future), these objections to speech acts are less critical. More troublesome, for our purposes, is that classical speech acts necessarily involve speech, so that they tend to be ill-suited for representing acts expressed more broadly. The theory of classical speech acts includes categories of acts such as requests, informs and promises. These have been broadened to include meta-acts such as turn-taking (Novick, 1988) and topic control (Traum & Hinkelman, 1992). Conversely, one can extend act models in the other direction--toward domain-level acts that have meaning in the particular context of an interface, the interface's underlying application, and the application's situation of use.
Some results have been achieved in creating increasingly complex models of what might be generally termed dialogue acts, including domain-independent conversational exchanges (Winograd & Flores, 1987) and grounding (Clark, 1996). Novick (1988) proposed a set of "conversation levels" that included both domain and meta-acts. Traum and Hinkelman (1992) proposed a related approach, which they called "conversation acts". However, neither of these models explicitly abstracted communicative action to a level independent of modality. In fact, it is possible to change the modality while keeping the commonality of the user expectations of interaction (Pérez-Quiñones & Sibert, 1996). Similarly, an abstract set of communicative acts has been used to generate text or graphical expressions for referring expressions (Green et al., 1997; Kerpedjiev et al., 1997).
We now turn to our theory of situated acts, which are a kind of dialogue act expressed largely in domain, rather than domain-independent, terms. The intended perlocutionary effects of situated acts tend to be the accomplishment of domain, rather than interface, goals. The illocutionary force of such acts is thus linked to domain concepts (e.g., climb), rather than putatively universal meanings (e.g., request). The acts are situated in the sense that they do not have meaning outside the context of the interaction itself; they are the logical thing for the user to do under the circumstances. Indeed, one of the purposes of expressing interaction at the act level is that the salience of the situation is increased. That is, the state of the interface may be observable at the action level (e.g., a new prompt appears), but the state of the underlying system is a higher-level construct that provides a deeper level of understanding. This is why the title of the paper refers to abstracting "from action to situated acts": while every action is--by definition--situated, the salience or meaning of the situation at that level may not be optimal. So our idea here is to "lift" the actions into acts that have a higher degree of situational meaning.
What we have, then, is a two-tier approach. At the more abstract level, we have a representation of situated acts that serve as communication goals, independent of specific interface style or even dialogue style. At the more concrete level, we have the interface itself, where design decisions help give shape to the interface. This level is specific to each interface design. An important consequence of our situated-act approach is that for each "application" there might be more than one design that enables the situated acts to take place.
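The two tiers can be sketched in code. The following is only an illustration under stated assumptions, not part of the systems discussed in this paper: the class `SituatedAct`, the two design functions, and the action strings they produce are all hypothetical. It shows one act, stated purely in domain terms, being specialized into the actions of two different designs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SituatedAct:
    """Act-level tier: stated in domain terms, with no reference to any interface."""
    force: str      # domain concept, e.g. "go_to_fix" (hypothetical name)
    parameter: str  # e.g. the navigation fix F


# Action-level tier: each design spells out the same act as its own
# interface actions. Both designs and their action strings are hypothetical.
def keypad_design(act: SituatedAct) -> list[str]:
    return [f"type {act.parameter} into scratch pad", f"push key for {act.force}"]


def speech_design(act: SituatedAct) -> list[str]:
    return [f"say '{act.force.replace('_', ' ')} {act.parameter}'"]


DESIGNS = {"keypad": keypad_design, "speech": speech_design}


def realize(act: SituatedAct, design: str) -> list[str]:
    """Specialize one act into the actions of the chosen design."""
    return DESIGNS[design](act)


if __name__ == "__main__":
    act = SituatedAct(force="go_to_fix", parameter="F")
    for name in DESIGNS:
        print(name, "->", realize(act, name))
```

Adding a third design in this sketch means adding one entry to `DESIGNS`; the definition of the act itself is untouched.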
Having defined our theory of situated acts, we now show how it can be applied to complex interfaces. First, we show how a widely used command-line interface for aircraft navigation can be abstracted to an act-based representation that (a) accounts for interaction among the human and machine agents in the cockpit and (b) can be specialized to produce new kinds of interfaces for the same task. Second, we apply the situated-act approach to the problem of integrating a 3D viewer of a Navy battle simulation with a speech system.
The navigation of modern-generation aircraft such as the Airbus A340 is controlled by a flight management and guidance system (FMGS). The interface to this system is called the multifunction control and display unit (MCDU). Some of the functions handled by the system include navigation, lateral and vertical flight planning, performance calculations, and guidance. Within the navigation subsystem, functions include alignment of the inertial navigation system, computation of position, assessment of accuracy level, selection of radio navigation aids, and polar navigation.
The interface embodied in the MCDU has different displays, pages, modes, and input items that correspond to the aspects of the system functions. For example, the MCDU for the A340 contains pages that cover everything from initializing a flight plan to performance on approach. The pages can be slewed vertically or horizontally on the display. The name of the page is normally indicated at the top of the display, typically in abbreviated form. The crew using the MCDU receive messages from the flight management system and type in entries using a "scratch-pad" line at the bottom of the display. An entry from the scratch pad is inserted into a field by pushing a selector key adjacent to the field. It is important to note that the MCDU is an interface used by sophisticated users in a safety-critical environment; it is about as far removed from a walk-up-and-use kiosk as one can get. But the task of following instructions--understanding how to do things and how to interpret the results--remains at issue even for these sophisticated users. Anecdotal evidence (Glasgow, 1997) suggests that even experienced crews of Boeing 747 aircraft differ widely in their knowledge of the interface for the aircraft's flight-management system, and that the complexity of the interface design makes it difficult to carry out foreseeable tasks under conditions of high cognitive load, such as during a flight's approach phase.
The aircraft's crew is, by definition, in a situation. Unlike the Micronesian navigators cited by Suchman (1987), who do not give evidence of a pre-conceived plan and who use purely local, observable phenomena to guide their craft, commercial aircrews have explicit flight plans and a variety of long-distance navigational aids. But the presence of an explicit plan makes the aircrew's actions no less "situated" than the Micronesians': the plan is part of the context. An element of a "situation" is no less authentic because it was created by the actor; presumably the Micronesians' situation included elements such as the trim of their sails and their intentions to journey to a particular destination. It is appropriate simply to accept that aircrews plan their flights and that these plans thus become a factor in their interactions with the flight management system through the MCDU.
The point of this analysis is (a) to show the action-level interaction of the interface and its associated procedures as presented in the flight crew operating manual, and (b) to demonstrate that these actions can be abstracted into situated acts. Please bear in mind that this account can present the analysis for only a tiny fraction of the MCDU, and that many links to other parts of the interface and its underlying system will have to go unexplained because they are of limited relevance for present purposes. Also, some of the acronyms used in the manual have been expanded or clarified to make the account of the interaction more comprehensible (e.g., "direct-to" for the manual's "DIR TO").
One of the things that the MCDU permits the flight crew to do is to define a leg from the aircraft's present position to any waypoint; this "direct-to" waypoint may be either already in the active flight plan or otherwise designated by the pilot. In fact, there are two other distinct direct-to functions: the direct-to/abeam function defines waypoints that are projected along a leg of the initial flight plan; the direct-to/intercept function defines a means of intercepting a course defined by a navigational signal at a specified waypoint. Each of these three functions has its own procedure, although there are similarities among the procedures. This example presents the simplest case, direct-to-waypoint; the analyses for the other cases are similar. As published in the Flight Crew Operating Manual for the Airbus A340, the first part of the direct-to-waypoint procedure comprises the following actions:
The effect of the crew's action is to tell the flight management and guidance system to travel directly to the indicated waypoint instead of continuing the current leg of the flight plan. If the new waypoint was already in the flight plan, the revised flight plan is already complete. Otherwise, if the new waypoint was not in the flight plan, then the direct-to waypoint creates what is called a flight-plan discontinuity, where the system does not have sufficient knowledge to link the new waypoint into the flight plan. The manual instructs the crew to adjust the flight plan to get the most probable flight plan beyond the new direct-to waypoint.
These actions, then, can be abstracted into situated acts along the following lines:
Or, even more generally, the actions could be abstracted into the following single act:
Clearly either account of these acts could be realized through a large range of interface design alternatives. An assumption underlying the situated-act view is that a better design will track more closely the "natural" lines of the abstract acts. In point of fact, the MCDU's interface is relatively good in this example, as pushing the "DIR" button initiates the command, the system then offers the crew the choice of available waypoints, and the crew then selects the waypoint and confirms the action. An alternate design might involve first selecting a waypoint and then pushing the "DIR" button to take the action; this would semantically link the "DIR" button with the domain action of the command rather than the interface action of displaying the page of waypoints.
The second example is a retrospective design analysis: after a project was completed, we examined how our act-based model could have been used to resolve design decisions in integrating a speech interface into a virtual reality (VR) project.
The VR system was a Navy battle simulator that enabled viewing a pre-scripted scenario either as 3D graphics on a computer monitor or on an overhead display (similar to a head-mounted display except that the display was mounted on a device and the user just looks into it, like a set of binoculars). The system had a few basic commands, mostly dealing with controlling the playback of the simulation and moving the user's position in the simulation. Most of the commands were implemented using either a keyboard or a mouse. There were no advanced 3D pointing devices used in this program. To integrate a speech input and natural language processing system into the viewer, we defined the different user-level acts available in the system. From these, we defined the set of verbs and objects needed for the grammar in the domain.
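As a rough sketch of this point, the two button orderings can be treated as two action-level front ends that yield the same act. Everything named here is an assumption for illustration: the `DirectTo` class, the event strings, the waypoint names, and the list-based flight plan are hypothetical, and the plan revision is a deliberate simplification of the behavior described above, not a model of the actual FMGS.

```python
from __future__ import annotations

from dataclasses import dataclass

DISCONTINUITY = "<discontinuity>"  # hypothetical marker for a flight-plan discontinuity


@dataclass(frozen=True)
class DirectTo:
    """The situated act: proceed directly to a waypoint."""
    waypoint: str


def dir_first(events: list[str]) -> DirectTo:
    """Existing ordering: push DIR, select a waypoint, confirm."""
    key, waypoint, confirm = events
    if key != "DIR" or confirm != "CONFIRM":
        raise ValueError("not a direct-to sequence")
    return DirectTo(waypoint)


def waypoint_first(events: list[str]) -> DirectTo:
    """Alternate ordering: select a waypoint, then push DIR to act on it."""
    waypoint, key = events
    if key != "DIR":
        raise ValueError("not a direct-to sequence")
    return DirectTo(waypoint)


def apply_direct_to(plan: list[str], act: DirectTo) -> list[str]:
    """Return a revised plan (simplified to a list of upcoming waypoint names).

    Assumption: if the waypoint is already in the plan, the waypoints before
    it are dropped and the plan is complete. Otherwise the new waypoint is
    followed by a discontinuity, since nothing links it to the rest of the plan.
    """
    if act.waypoint in plan:
        return plan[plan.index(act.waypoint):]
    return [act.waypoint, DISCONTINUITY] + plan


if __name__ == "__main__":
    plan = ["ALPHA", "BRAVO", "CHARLIE"]  # hypothetical waypoint names
    act = dir_first(["DIR", "BRAVO", "CONFIRM"])
    print(act == waypoint_first(["BRAVO", "DIR"]))
    print(apply_direct_to(plan, act))
    print(apply_direct_to(plan, DirectTo("DELTA")))
```

In this sketch the effect on the flight plan is defined once, at the act level, so either ordering of interface actions leads to the same revised plan.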
The application domain in this example is very simple. There are six basic domain commands (acts). These are
Each act is paired with one or more interface actions. The original interface had two forms of interaction: a 3D display on a monitor and an overhead display. A third interface style, using speech input and natural language processing, was added. This section discusses how each act was paired with one or more interface actions. The discussion is organized by interface style.
3D Graphical Display. The graphical display included the use of the mouse to select small windows that allowed access to the application domain acts. Most of the acts described above were implemented in windows that floated above the 3D view of the world. Also, the two different views of the world allowed some different commands that were specific for each view. For example, the top-of-the-world view allowed the user to select the platform for the out-of-window view by simply clicking on the platform icon. The out-of-window view did not allow changing the platform. For each domain act, we present the interface action in 3D-monitor style:
Overhead Display. The overhead display interface was designed to work as an extension of the 3D style. Therefore not all controls were implemented in this style. The list below shows only those acts that were implemented with the overhead display, and the rest were taken from the list above. The overhead display provided not only angle of view but also direction of viewing, so the system could track the user's head and use it to change the viewing area accordingly. The overhead display had a small button that was used for movement. For each domain act, we present the interface action in overhead style:
While there was only one new interface action added for this style of interface, this proves to be a considerable one. The user can "move" more freely around the simulation using the overhead display than with the 3D monitor style. The limitation of this style, much like the limitation of many VR systems, is how to interact with the world. Removing this limitation is what led us to integrate a speech interface style with the application.
Speech and Natural Language Processing. The speech interface was added to the application primarily to enhance the use of the overhead display. We were also interested in studying the use of voice commands for a hands-busy interface. The next list presents the interface actions implemented. As before, the list shows only those acts that were mapped to the speech interface; all previously mentioned interface actions are still available. The set of new acts is as follows:
The acts in this domain should have served as guidelines for implementing interface actions in the system. Because this system was not designed with an NLP front-end in mind, many other domain acts were not considered because the equivalent interface actions would have been cumbersome in the other styles. One example was the only query on the system, a window that displayed information on the platform specifics. With the NLP addition it becomes feasible to implement new interface actions that did not have a mapping in the domain acts. For example, a new set of queries was implemented to show/hide platforms that meet certain criteria:
In this example, a single domain act mapped to many new actions in the interface afforded by natural language expressiveness. The result was new functionality that should have been available at the act level but was not included. The new act should have been:
The addition of the natural language front end to the application improved the usability of the system significantly. The overhead display, for example, now had more uses because the user did not have to switch back and forth to the 3D view to manipulate the simulation. Also, the use of the natural language processor with some simple queries allowed the user to learn about items such as the domain and platforms much more quickly than clicking on each platform independently.
But what is important for our model about these new multimodal interactions in the domain is that the act-level representation did not have to change. The act-level representation, if expressed completely, should afford new interactions and interaction styles without changing that representation, consonant with our point that the act-level representation is specific to the domain and the action-level representation is specific to the interface style. The adaptation of styles through different actions can produce better usability but the act level remains unchanged. Open questions include whether it is possible or advantageous to provide act-level interface entities or descriptions to users, and whether action- and act-level elements should or should not be combined within an interface or procedure.
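The many-to-one relation between spoken phrasings and a single act for showing or hiding platforms can be sketched as follows. This is a minimal illustration under stated assumptions: the class `SetPlatformVisibility`, the pattern, the verbs, and the example criteria are hypothetical, and the grammar of the original system is not reproduced here.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SetPlatformVisibility:
    """One domain act: show or hide the platforms that meet a criterion."""
    visible: bool
    criterion: str


# Hypothetical phrasings accepted by this sketch.
_PATTERN = re.compile(
    r"^(?P<verb>show|display|hide|remove)\s+(?:me\s+)?(?:all\s+)?(?:the\s+)?"
    r"(?P<criterion>.+?)\s+platforms$",
    re.IGNORECASE,
)
_SHOW_VERBS = {"show", "display"}


def parse(utterance: str) -> Optional[SetPlatformVisibility]:
    """Map an utterance to the act, or return None if it is not recognized."""
    match = _PATTERN.match(utterance.strip())
    if match is None:
        return None
    return SetPlatformVisibility(
        visible=match["verb"].lower() in _SHOW_VERBS,
        criterion=match["criterion"].lower(),
    )


if __name__ == "__main__":
    for text in ("Show all the hostile platforms",
                 "display hostile platforms",
                 "Hide the airborne platforms"):
        print(text, "->", parse(text))
```

The first two utterances differ at the action level but parse to equal act values, which is the sense in which the act-level representation stays fixed while the interface style varies.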
Helen Gigley inspired the collaboration and work that led to this paper. David Novick's research was supported by a grant from Aerospatiale Aeronautique.
Austin, J. (1962). How to do things with words. Cambridge, MA: Harvard University Press.
Clark, H. (1996). Using language. Cambridge: Cambridge University Press.
Glasgow, S. (1997). Mill visual. E-mail message to Bluecoat Forum, September 8, 1997.
Green, N., Kerpedjiev, S., Carenini, G., Moore, J., and Roth, F. (1997). Media-independent communicative actions in integrated text and graphics generation. Working Notes of the AAAI Fall Symposium on Communicative Action, Cambridge, MA, November, 1997, 43-50.
Kerpedjiev, S., Carenini, G., Roth, S. F., and Moore, J. D. (1997). Integrating planning and task-based design for multimedia presentation, International Conference on Intelligent User Interfaces, Orlando, FL, January, 1997, 145-152.
Levinson, S. (1981). The essential inadequacies of speech act models of dialogue. In H. Parret, M. Sbisa, and J. Verschuren (eds.), Possibilities and limitations of pragmatics: Proceedings of the Conference on Pragmatics at Urbino, July, 1979. Amsterdam: Benjamins, 473-492.
Marcu, D. (1997). Perlocutions: The Achilles heel of speech act theory. Working Notes of the AAAI Fall Symposium on Communicative Action, Cambridge, MA, November, 1997, 51-58.
Nardi, B. (ed.) (1996). Context and consciousness. Cambridge, MA: MIT Press.
Novick, D. (1988). Control of mixed-initiative discourse through meta-locutionary acts: A computational model. Doctoral dissertation, available as Technical Report CIS-TR-88-18, Department of Computer and Information Science, University of Oregon.
Payne, S. and Green, T. (1989). The structure of command languages: An experiment on task-action grammar. Int. J. Man-Machine Studies, 30(2), 213-234.
Pérez-Quiñones, M. and Sibert, J. (1996). A collaborative model of feedback in human-computer interaction. Proceedings of CHI 96, Vancouver, BC, April, 1996, 316-323.
Searle, J. (1969). Speech acts. Cambridge: Cambridge University Press.
Suchman, L. (1987). Plans and situated actions. Cambridge: Cambridge University Press.
Traum, D., and Hinkelman, E. (1992). Conversation acts in task-oriented spoken dialogue, Computational Intelligence, 8, 3, 575-599.
Winograd, T., and Flores, F. (1987). Understanding computers and cognition: A new foundation for design. Reading, MA: Addison-Wesley.
1. Work performed by second author (Pérez-Quiñones) while working at the Naval Research Laboratory, Washington, DC.
This paper appeared as: Novick, D., and Pérez-Quiñones, M. (1998). Cooperating with computers: Abstracting from action to situated acts, Proceedings of the European Conference on Cognitive Ergonomics (ECCE-9), Limerick, IR, August, 1998, 49-54.