Evaluating Design of Human-Machine Cooperation:
The Cognitive Walkthrough for Operating Procedures

David G. Novick


Meriem Chater



The methodology of development of interfaces can be adapted to development of operating procedures. In particular, the cognitive walkthrough can be adapted to account for steps and resources outside the computer's part of the system interface. Empirical evaluation suggests that a cognitive walkthrough for operating procedures (CW-OP) is reasonably efficient and can provide useful information for developers.


Methodology, operating procedures, cognitive walkthrough

Evaluation of the Design of Human-Machine Cooperation:
The Cognitive Walkthrough for Operating Procedures


Our goal is to propose methodological support for the human-centered design of operating procedures for process control. Methodologies for developing human-machine interfaces can be adapted to operating procedures. This is the case for the cognitive walkthrough, which has been adapted to the evaluation of procedures and their documentation. The adaptation involves five points: (1) treating the steps of the walkthrough at the level of the procedure rather than at the level of the interface; (2) drawing the evaluator's attention to how procedures are presented in the documentation; (3) asking the evaluator to determine explicitly whether training or experience is needed; (4) determining whether the procedure correctly carries out the intended function; (5) determining how likely errors are and what they imply for safety. Empirical evaluation shows that the method is effective and can provide useful information to developers.


We seek to introduce methodological support for human-centered development of operating procedures for process control. Grudin (1990) observed that the user's interface to a computer is not just the hardware and software that compose the computer. From the user's view, the interface includes an array of associated elements in the context of use, including documentation, training, and advice from colleagues. This suggests that the methodology of development of interfaces could be transferred or adapted to development of operating procedures. In particular, we focus on the adaptation and use of the cognitive walkthrough (Wharton, Bradford, Jeffries, & Franzke, 1992) for the evaluation of operating procedures and their documentation. We ask two key questions: (1) Can the cognitive walkthrough for operating procedures be used effectively by evaluators? and (2) Does the cognitive walkthrough for operating procedures provide perspectives of value to developers of procedures and their documentation? To address these questions, we review the use of and need for procedures in aviation, review current human-centered methodologies for development of procedures, introduce the cognitive walkthrough for operating procedures (CW-OP), and report empirical results from use of the walkthrough.

Methodology of Procedures

In some complex, dynamic, and risky domains, such as aviation, operating procedures are an inevitable part of the crew's interaction with the system. An operating procedure exists to specify, unambiguously, what the task is, when the task should be conducted, how the task should be done, by whom it should be conducted, and what feedback is provided to other agents (Degani & Wiener, 1997). To a certain extent, procedures exist to deal with irreducible issues of the design of interfaces; humans may have a better capacity to handle dynamic or other complex and difficult situations.

A number of approaches for development of operating procedures have been proposed in the field of aviation. While researchers generally agree that the field lacks a genuine methodology of procedure development, these approaches are steps forward. The most prominent approach is Degani and Wiener's (1997) "Four P's" model, which incorporates the organization's philosophy of operations, its business policies, procedures that effectuate the operations consistently with the policies, and the crews' actual practices on the flight deck. Within the design process itself, however, Degani and Wiener's model does not offer specific guidance on assessing usefulness or usability. Another approach that relates procedures to documentation is the act-function-phase (AFP) model (Novick & Tazi, 1998). While AFP provides a basis for classifying kinds of acts in operating procedures, it does not yet provide means of developing the procedures as such. A third approach was proposed by Drury and Rangel (1996) for reducing automation-related errors in aircraft maintenance and inspection. This approach does have a strong analytic component, which concentrates on identifying error types and opportunities in the procedures.

These approaches all encompass some degree of human-centered design, but tend to rely on post-design testing or interviews. What is missing is an exploratory inspection method for evaluating use, so that issues of both usefulness and usability (Gould & Lewis, 1983) could be addressed early in the design process.


Developers of user interfaces could address these issues through the use of a human-centered analytical technique such as the cognitive walkthrough (Lewis, Polson, Wharton, & Rieman, 1991; Wharton, Bradford, Jeffries, & Franzke, 1992; Wharton, Rieman, Lewis, & Polson, 1994). The cognitive walkthrough is a usability inspection method for interfaces that originally focused on evaluating a design for ease of learning. The method leads the designer to consider factors such as users' backgrounds and mental effort. It is based on an interaction model similar to Norman's (1986) stages of user activities. In brief, a task is decomposed into steps in the interface, which are analyzed individually with respect to connections between goals, artifacts, actions, and results. Each step is described as a "success story" or a "failure story," depending on the outcome of the analysis. An extensive literature, summarised by Wharton, Bradford, Jeffries, and Franzke (1992), has examined the relative effectiveness of the cognitive walkthrough.

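From the most practical version of the cognitive walkthrough (Wharton, Rieman, Lewis, & Polson, 1994), we developed a new version adapted to operating procedures. This involved an iterative process of revision and use of forms and instructions adapted to procedures and their documentation. The CW-OP included five key changes:

1. Steps of the walkthrough are defined at the level of the procedure rather than at the level of the interface.
2. The evaluator's attention is drawn to how the procedure is presented in the documentation.
3. The evaluator is asked to determine explicitly whether training or experience is needed.
4. The evaluator determines whether the procedure correctly carries out the intended function.
5. The evaluator assesses how likely errors are and what they imply for safety.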

The current version of the CW-OP, like the cognitive walkthrough for the physical interface, is supported by the use of two forms: (1) a cover sheet and (2) a form that presents the success or failure "story" for each step analyzed. The contents of the cover sheet, which are basically the same as those in the original walkthrough, are presented in Figure 1. The contents of the form for an individual step are presented in Figure 2. The second form reflects the five changes adapting the walkthrough to procedures. In the actual forms, much more space is provided.

CW-OP Cover Sheet

Action sequence:

Figure 1. Contents of the Cover Sheet

CW-OP Story:   ( ) Success    ( ) Failure



1. Will the users try to achieve the intended effect?
2. Will the users notice that the correct action is available?
   a. Documentation
   b. Interface
3. Will the users associate the correct action with the effect
   they are trying to achieve?
4. If the correct action is performed, will the users see that
   progress is being made toward solution of the task?


1. Are experience or training needed?
   If so,
   a. Is this kind of step common or rare?
   b. Will training be easy or difficult?
2. Is the step correct in terms of function?
3. Are particular errors likely?
   If so, what is their impact on safety? 
4. Design suggestions
5. Other comments

Figure 2. Contents of the procedure-step form
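
The step form in Figure 2 can also be read as a simple record, which may help when responses from several evaluators are to be aggregated. The following Python sketch is illustrative only and was not part of the study: the class name `StepForm`, its field names, and the rule that a step counts as a success story only if all four walkthrough questions (including both parts of question 2) are answered yes are assumptions made for the example. In the study itself, evaluators marked success or failure directly on paper forms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepForm:
    """Hypothetical record of one CW-OP step form (see Figure 2)."""

    step: str
    # Walkthrough questions 1-4 (question 2 has two parts).
    will_try_effect: bool
    action_in_documentation: bool
    action_in_interface: bool
    will_associate_action: bool
    will_see_progress: bool
    # Procedure-specific questions.
    training_needed: bool = False
    step_is_common: Optional[bool] = None    # only meaningful if training is needed
    training_is_easy: Optional[bool] = None  # only meaningful if training is needed
    functionally_correct: bool = True
    likely_errors: str = ""
    safety_impact: str = ""
    design_suggestions: str = ""
    other_comments: str = ""

    def is_success_story(self) -> bool:
        # Assumed rule: every walkthrough question must be answered "yes".
        return all((
            self.will_try_effect,
            self.action_in_documentation,
            self.action_in_interface,
            self.will_associate_action,
            self.will_see_progress,
        ))


# Example with made-up answers: the action term is missing from the documentation.
form = StepForm(
    step="2.1",
    will_try_effect=True,
    action_in_documentation=False,
    action_in_interface=True,
    will_associate_action=True,
    will_see_progress=True,
    design_suggestions="Explain the action term in the manual.",
)
print(form.step, "success" if form.is_success_story() else "failure")
```

Keeping the documentation and interface parts of question 2 as separate fields mirrors the form, so that failures can be attributed to the procedure's documentation or to the physical interface.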


An empirical evaluation of the cognitive walkthrough for procedures was conducted in order to address the study's main questions, which were operationalised in terms of the following hypotheses: (1) The CW-OP could incorporate elements dealing with procedures and documentation without undue burden on evaluators; (2) the CW-OP would identify issues involving the procedural as well as the physical interface; and (3) the evaluators' assessments would show a high level of agreement.

The hypotheses were tested through a walkthrough of draft operating procedures for a proposed text-based cockpit interface for air-traffic control (ATC) communications. Figure 3 presents one of the draft procedures. The test used six evaluators, including a computer scientist, two industrial ergonomists, two doctoral students in computer science and ergonomics, and one graduate-student intern in computer science. The evaluation was preceded by a half-day of training. The evaluators were given three pages of a draft manual explaining the physical interface and the meaning of the air-traffic-control messages, along with three draft procedures. Taken together, the procedures encompassed eight unique top-level steps. Evaluators performed their own decomposition of steps into sub-steps as they judged appropriate. The tasks corresponded to the procedures. The action sequences that served as the standard for evaluation were determined by the procedures themselves plus the regulatory standards from which the interface and the procedures were developed.

Procedure: Respond to a clearance


Aircraft receives message "AT ALCOA CLB
TO & MAINT FL390."

Aircraft sends message "WILCO."

Figure 3. Draft procedure "Respond to a clearance"


Following the CW-OP, all forms were reviewed for completeness, and the responses to individual questions were aggregated.

Hypothesis 1

The data confirmed the hypothesis that the CW-OP could incorporate elements dealing with procedures and documentation without undue burden on evaluators. All evaluators completed the task within the 90-minute period. The number of steps analyzed per evaluator ranged between 8 and 13. These rates appear consistent with those reported for the non-procedure walkthrough (cf. Lewis, Polson, Wharton, & Rieman, 1991). The evaluators all expressed the opinion that the session had been valuable for them.

Hypothesis 2

The data confirmed the hypothesis that the CW-OP would highlight issues involving the procedures as well as the interface. As indicated in Table 1, 31 of 48 comments or design suggestions concerned procedures and their documentation rather than the physical interface.

Table 1. Distribution of evaluator responses: Comments and suggestions (N = 48)

                           Physical interface    Procedures or documentation
Overall comments                   0%                        2%
Step comments                     28%                       22%
Step design suggestions            6%                       38%
Total                             34%                       62%

Most of the step comments about the physical interface were actually questions about how the interface or system worked. The design suggestions included items such as changes to procedure embedding, wording, level of detail, and sequencing of steps. Additionally, there were four overall comments on the evaluation process itself, including suggestions for reworking the evaluation forms. The form in Figure 2 reflects some of these suggestions.

Whether the evaluators' findings were useful in the redesign of the procedures is a different question. While this is difficult to quantify, development following the evaluation included a redesign of all three procedures, in which each procedure was either eliminated or reformulated using specific findings reported on the forms.

Hypothesis 3

The data were inconclusive with respect to the hypothesis that the evaluators would agree in their assessments. The raw distributions suggest reasonable agreement. Application of Cohen's Kappa statistic (Carletta, 1996), which assesses whether classification by multiple coders exceeds chance levels, suggests that the agreement is better than chance but that confirmation of inter-rater reliability requires a larger test set. These results can be interpreted in terms of evaluators' responses to representative questions on the step form, including success/failure, availability of the action in the documentation, and availability of the action in the interface.

Success/Failure Stories. There was some disagreement among the evaluators as to whether the procedure steps presented success or failure stories. Raw percentage agreement among all six coders was 0.750. However, given that there were only two categories, percentage agreement was necessarily at least 0.500. Kappa, which is 0 when agreement is no better than chance and 1 when agreement is perfect, was K=0.40 for classification of success/failure. This indicates a good measure of agreement above chance, but not as high as is generally sought for reliable classification. Moreover, Kappa is not stable at low n, as is the case here. Consequently, these values are considered reasonable for exploratory work involving initial sessions by relatively untrained evaluators while coding categories are being developed.
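
For reference, Carletta (1996) defines Kappa as K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed proportion of agreement and P(E) is the proportion expected by chance. The Python sketch below simply evaluates that formula. The chance level P(E) is not reported in the text, so the second function only back-solves the value implied by the reported P(A) = 0.750 and K = 0.40; it should not be read as the computation actually performed in the study. The function names are illustrative.

```python
def kappa(p_agree: float, p_chance: float) -> float:
    """Kappa as K = (P(A) - P(E)) / (1 - P(E))."""
    if not 0.0 <= p_chance < 1.0:
        raise ValueError("chance agreement must lie in [0, 1)")
    return (p_agree - p_chance) / (1.0 - p_chance)


def implied_chance(p_agree: float, k: float) -> float:
    """Rearrange the Kappa formula to recover P(E) from P(A) and K."""
    if k >= 1.0:
        raise ValueError("K must be below 1 to solve for P(E)")
    return (p_agree - k) / (1.0 - k)


# Values reported in the text; P(E) itself is an inferred quantity.
p_e = implied_chance(0.750, 0.40)
print(f"implied P(E) = {p_e:.3f}")
print(f"K recomputed = {kappa(0.750, p_e):.2f}")
```

The formula makes the instability at low n easy to see: when P(E) is close to 1, the denominator is small, and a change in a single rating can move K substantially.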

A more informative view of data such as these comes from direct examination of the distribution of responses, which is presented in Table 2. The six evaluators are identified by letters that are consistent across the table. The eight steps are numbered for purposes of the table; these numbers were not in the materials provided to the evaluators. The table indicates strong agreement among the evaluators on three of the eight steps, and an even split on one step. This suggests that developers of operating procedures might benefit from understanding the reasons behind the evaluators' disagreements. For example, evaluator F considered that task sharing and coordination between the crew members was not sufficiently explicit.

Table 2. Distribution of evaluator responses: Is the step a success or failure story?

Story for a Procedure Step?
            Success        Failure
Step 1.1    C D E F        A B
Step 1.2    D E            A B C F
Step 2.1    (unanimous: A B C D E F)
Step 2.2    (unanimous: A B C D E F)
Step 2.3    B C D E        A F
Step 2.4    A B C D E      F
Step 3.1    B C D E        A F
Step 3.2    B C E          A D F
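
The raw agreement figure can be recomputed from Table 2. The Python sketch below assumes one particular reading of "raw percentage agreement," namely the fraction of all ratings that fall in the majority category for their step. The text does not spell out the computation, so this reading is an assumption, although it is consistent with the 0.500 floor for two categories. The group sizes are transcribed from the table; the names `TABLE_2` and `majority_agreement`, and the threshold of five of six evaluators for "strong" agreement, are illustrative.

```python
# (first-column evaluators, second-column evaluators) for each step in Table 2.
# For unanimous steps the choice of column does not affect this calculation.
TABLE_2 = {
    "1.1": ("CDEF", "AB"),
    "1.2": ("DE", "ABCF"),
    "2.1": ("ABCDEF", ""),
    "2.2": ("ABCDEF", ""),
    "2.3": ("BCDE", "AF"),
    "2.4": ("ABCDE", "F"),
    "3.1": ("BCDE", "AF"),
    "3.2": ("BCE", "ADF"),
}


def majority_agreement(table):
    """Fraction of all ratings that agree with the majority for their step."""
    in_majority = sum(max(len(a), len(b)) for a, b in table.values())
    total = sum(len(a) + len(b) for a, b in table.values())
    return in_majority / total


# Steps where at least five of six evaluators agree, and steps split evenly.
strong = [s for s, (a, b) in TABLE_2.items() if max(len(a), len(b)) >= 5]
even = [s for s, (a, b) in TABLE_2.items() if len(a) == len(b)]

print(majority_agreement(TABLE_2))  # 36 of 48 ratings -> 0.75
print(strong, even)
```

Under these assumptions the tally gives 0.75, three steps with strong agreement (2.1, 2.2, and 2.4), and one even split (3.2).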

Action Available in Documentation. The distribution of ratings on availability of the correct action in the documentation, presented in Table 3, suggests reasonably strong levels of agreement. Kappa was highly unstable on this distribution, and thus was not meaningful. The distribution suggests, though, that most evaluators will agree on whether the documentation provides the right action. For step 2.1, the evaluators generally found that an action term in the procedure was not explained in the documentation.

Table 3. Distribution of evaluator responses: Is the correct action available in the documentation?

Action Available in Documentation?
            Available      Not Available
Step 1.1    (unanimous: A B C D E F)
Step 1.2    (unanimous: A B C D E F)
Step 2.1    D              A B C E F
Step 2.2    (unanimous: A B C D E F)
Step 2.3    B C D E F      A
Step 2.4    (unanimous: A B C D E F)
Step 3.1    (unanimous: A B C D E F)
Step 3.2    B C D E F      A

Action Available in Interface. The distribution of ratings on availability of the correct action in the interface, presented in Table 4, suggests moderate levels of agreement. Kappa was highly unstable on this distribution, and thus was not meaningful.

Table 4. Distribution of evaluator responses: Is the correct action available in the interface?

Action Available in Interface?
            Available      Not Available
Step 1.1    B C D E F      A
Step 1.2    (unanimous: A B C D E F)
Step 2.1    (unanimous: A B C D E F)
Step 2.2    (unanimous: A B C D E F)
Step 2.3    B C E          A D F
Step 2.4    (unanimous: A B C D E F)
Step 3.1    A B C E F      D
Step 3.2    B C E          A D F

For step 2.1, the evaluators all found that an action term in the procedure was not a label in the interface. For step 2.3, some of the evaluators found that a label's meaning was confusing. For step 3.2, some evaluators found that the documentation referred only to the goal rather than the action.

The range of evaluator agreement for these questions on the step form suggests that multiple evaluators can provide complementary perspectives on the usefulness and usability of operating procedures through use of the CW-OP. Multiple evaluators bring a variety of experiences with interfaces and procedures, different kinds of knowledge about human factors, and different levels of expertise about the interface being evaluated.


The empirical evaluation suggests that the cognitive walkthrough for procedures can be used effectively by evaluators and that it provides perspectives of value to developers of procedures and their documentation. Post-experiment debriefings and reviews of the study's evaluation forms suggested a number of improvements in the evaluation process, including training and coding.

The study involved parallel individual evaluations by persons with training in cognitive science. Users could be involved directly in the evaluation through use of the group-style cognitive walkthrough (Wharton, Rieman, Lewis, & Polson, 1994).

Training should include some individual (i.e., non-group) evaluations in order to make sure that each person has experience in answering all questions. For example, one evaluator reported difficulty in making the success/failure distinction.

Developers may want to ask evaluators to mark sections they read or used during the evaluation process. For example, no evaluator provided comments on the examples included as part of the procedures. Because there was no record of whether the evaluators used the examples to understand the procedures, the process did not provide a basis for assessing whether the examples were helpful.

Some other aspects of the procedures and their documentation did not get evaluated explicitly in the walkthrough. First, no one evaluated the names of the procedures, presumably because they are not steps as such. This leaves open the question of how to obtain analyses of the role of the name in accessing procedures as a part of their context of use. Second, overall instructions tend not to be evaluated. For example, the draft manual contained an instruction on when members of the crew should make announcements to each other during procedures. There was only one comment on this instruction in the session. How to obtain analyses of such overall instructions remains an open question.

Analysis of the evaluations and post-session comments also suggests that the task model presented to evaluators was relatively weak. In determining the correct action during evaluation of steps in the computer interface, the evaluators used the interface description provided in the draft documentation, backed up by (1) an available reference document for the requirements set by the regulatory authorities and (2) their personal knowledge of the interface. The number of questions written by the evaluators with respect to the operation of the interface and its underlying system, as reported in Table 1, indicates that these resources were not sufficient.

This experience suggests using, to the extent possible, an explicit "gold-standard" definition of the interface that would eliminate uncertainty about correct actions in the physical interface and the functions of the underlying system. However, requiring a precisely defined physical interface makes it more difficult to develop the procedural interface while the physical interface has not been specified in detail. This may limit the method's usefulness during early development phases.

An alternative to the CW-OP might be the GOMS method (John, 1995), which describes a task and the knowledge a user needs to perform it in terms of goals, operators, methods, and selection rules. GOMS can predict skilled-performance time, method-learning time, operator sequence, and likelihood of memory errors. GOMS can also be used to design training programs and help systems, as it describes the content of task-oriented documentation. However, GOMS may not be appropriate for use early in the design process of operating procedures, particularly where issues of usability and safety are more important than issues of timing. GOMS is of relatively little help on design issues independent of the procedural quality of the interface, including (1) standard human-factors issues such as the visual quality of a display layout and the readability of words and letters, (2) the quality of the work environment and user acceptance, and (3) the social and organizational impact of the system (John & Kieras, 1996). And, most important for operating procedures, GOMS does not help assess factors such as guidance and feedback.

Aside from the direct assessment of usability, the CW-OP provides additional insight into usefulness and safety. In particular, the cognitive walkthrough's requirement of a reason for linking goals to actions leads evaluators to determine whether training or experience is required for correct use of the procedure. Reductions in the training required to operate an aircraft would both increase reliability and lower operating costs.


This research was supported by a contract from Aerospatiale Aeronautique. We thank Florence Buratto, Laurent Moussault, Florence Reuzeau and Stéphane Sikorski for their participation in and contributions to the research, and we thank Jean Carletta for her advice about how to use and interpret findings on inter-rater reliability.


Carletta, J. (1996). Assessing agreement on classification tasks: The Kappa statistic. Computational Linguistics, 22(2), 249-254.

Degani, A., and Wiener, E. (1997). Procedures in complex systems: The airline cockpit. IEEE Transactions on Systems, Man, and Cybernetics, 27(3), 302-312.

Drury, C. G., and Rangel, J. (1996). Reducing automation-related errors in maintenance and inspection. In Human Factors in Aviation Maintenance-Phase VI: Progress Report Vol. II (pp. 281-306). Washington, DC: Federal Aviation Administration/Office of Aviation Medicine.

Gould, J., and Lewis, C. (1983). Designing for usability: Key principles and what designers think. Proceedings of the Conference on Human Factors in Computing Systems (CHI 83) (pp. 50-53). New York, NY: ACM Press.

Grudin, J. (1990). interface. Proceedings of the Conference on Computer-Supported Cooperative Work (CSCW 90) (pp. 269-278). New York, NY: ACM Press.

John, B. (1995). Why GOMS? ACM Interactions, 2(4), 81-89.

John, B., and Kieras, D. (1996). Using GOMS for user interface design and evaluation: Which technique? ACM Transactions on Computer-Human Interaction, 3(4), 287-319.

Lewis, C., Polson, P., Wharton, C., and Rieman, J. (1991). Testing a walkthrough methodology for theory-based design of walk-up-and-use interfaces. Proceedings of the Conference on Human Factors in Computing Systems (CHI 91) (pp. 235-242). New York, NY: ACM Press.

Norman, D. (1986). Cognitive engineering. In Norman, D., and Draper, S. (Eds.), User centered system design. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Novick, D., and Tazi, S. (1998). Flight crew operating manuals as dialogue: The act-function-phase model. Proceedings of the International Conference on Human-Computer Interaction in Aeronautics (HCI-Aero'98) (pp. 179-184). Montreal, Canada: Editions de l'Ecole Polytechnique de Montréal.

Wharton, C., Bradford, J., Jeffries, R., and Franzke, M. (1992). Applying cognitive walkthroughs to more complex user interfaces: Experiences, issues and recommendations. Proceedings of the Conference on Human Factors in Computing Systems (CHI 92) (pp. 381-388). New York, NY: ACM Press.

Wharton, C., Rieman, J., Lewis, C., and Polson, P. (1994). The cognitive walkthrough method: A practitioner's guide. In Nielsen, J., and Mack, R. (Eds.), Usability inspection methods. New York, NY: John Wiley & Sons, Inc.

This paper is in press as Novick, D., and Chater, M. (in press). Evaluating the design of human-machine cooperation: The cognitive walkthrough for operating procedures. Proceedings of the Conference on Cognitive Science Approaches to Process Control (CSAPC 99), Villeneuve d'Ascq, FR, September, 1999.

DGN, May 24, 2005