Chapter IV



In the preceding chapters I considered that the general question of coherence is posed in the integration problem, examined a range of prior work which provided foundations for attacking the problem, and proposed a theory of meta-locutionary acts for explaining the control processes of conversation. In this chapter I discuss the specifics of a study for assessing the theory. In particular, I look at the effects to be studied, and the mechanics of the protocol analysis of conversation. The computational modeling of the observed behaviors is discussed in Chapters V and VI.


Methodology in Cognitive Science


Although the field is not settled (Suppes, 1984), methodology in cognitive science has produced some agreement on what constitutes adequate work. Unfortunately, research into the processes of conversation has (at least) two major methodological pitfalls. The first is the lure of introspection. Introspection seems inexpensive and easy; virtually every human being communicates interactively with others. But one cannot simply present one’s intuitions as a descriptive study. It is precisely because mixed‑initiative discourse is so familiar to us that its processes are transparent to the introspective observer. The observed behaviors must be explained in the context of a theory that proposes a coherent model (Miller, Polson, & Kintsch, 1984). What sort of theory is adequate? Miller et al. (1984) suggest that there is a continuum between a naked idea and an explicit process model. In the center of this continuum lie “middle ground” theories in which structures and processes have been specified in sufficient detail to support both the instantiation of the theory as a computer program and qualitative experimental study. In this chapter, I explain how the theory of meta‑locutionary acts can be validated in both regards; it thus avoids the first pitfall. The second pitfall is that of failure to explain behavior as a process. Although all studies must inherently look back at behavior that is now--because it occurred in the past--static, a good model will explain the behavior on the basis of real‑time factors: those which were available in the actual context of the observed interaction (Swinney, 1984). Although this view has been criticized as overly restrictive (Charniak, 1984), its essential truth is that the model cannot embody in its explanation the kinds of post‑perceptual processes which are available to observers rather than to the participants.
I have tried to address this concern through (1) minimizing domain‑level, real‑world knowledge in the experimental task, (2) using temporally‑significant acts to explain the observed behaviors, and (3) trying to replicate the behaviors with a simulation of the model.

Effects to be Studied

The eventual goal of this study is to test the theory of meta‑locution through computational modeling of actual interactive discourse. Within the general philosophical framework of the speech‑act approach to language, most of the discourse theories we have discussed have looked at basically sentence‑level (or utterance‑level) interaction. At a minimum, the discourse unit of interest has been the noun phrase, where the noun phrase has been characterized as an act of assertion or the subject of negotiation. As I have tried to show in Chapter II, both Cohen (1984) and Clark and Wilkes‑Gibbs (1986) took constructive steps toward solving the integration problem through partial reduction of the units of discourse. Taking this research as leads or indications of the proper direction for more general solutions of the integration problem, one can go much farther in atomizing discourse for purposes of both understanding and generation. In Chapter V, I extend analytical techniques such as those used by Grosz, Allen, Clark, and Cohen to sub-domain levels of discourse.

I thus seek to examine sub‑domain levels of conversational interaction. As the socio‑linguistic literature has described the features of conversation which apparently lead to coherence, this study concentrates on aspects of language which are related to these observations. Thus the model should explain turn‑taking, negotiation of reference, and confirmation of the mutuality of knowledge.

General Approach

I have argued that maintenance of a conversational model is clearly part of any reasonably sophisticated approach to generational simulation of conversation, yet our initial observations of conversations suggest that maintenance of the model is a continuous, multi-level process of incremental addition and revision rather than a post-utterance assessment. This set of behaviors of the speaker and hearer is a) at a more finely grained level than that of the standard illocutionary act and b) about the process of conversation itself. They are, broadly speaking, locutions which embody illocutionary acts in the sense that they are intended to produce a change of state in the world outside the speaker, and are meta‑acts in the sense that in combination they effect Austinian speech acts but individually are specifically directed to changes in the state of the conversational model itself. That is, these are illocutionary acts which correspond to intended perlocutionary effects performed on the shared conversational structure. The feedback-suffused basis of coherence in conversation is a “critical fact” (VanLehn, Brown, & Greeno, 1984) which the theory of meta-locutionary acts should better explain than traditional speech-act theory.

Given a theory of the meta-(il)locutionary act as a maintenance device for a multi‑layered shared conversational model, what is the immediate program of research to be followed? Cohen (1984) states that one should derive the illocutionary acts as a rational strategy of action, given attributions of participants’ beliefs, goals, and expectations at the point in the discourse in which the illocutionary acts actually occurred. Accordingly, I first identify a set of speech acts which (1) comprise illocutionary acts and (2) handle the micro‑tasks of turn‑taking, negotiation of reference, and confirmation of the mutuality of knowledge. This is accomplished through analysis of conversation protocols at a sufficiently small granularity that the acts can be discerned. Second, I develop, using these acts, a computational model of the belief structures of the conversants. The structures, with a suitable set of operators, are sufficient to account for most of the observed meta-locutionary behavior.


Study of Spoken, Face‑to‑Face Conversation

The effects of meta‑locutions, as I have defined them, are peculiar to real-time interaction and especially to spoken conversation. However, not all discourse is interactive or spoken, much less face-to-face. In fact, there are a number of dimensions which characterize the various modalities of discourse. The character of the discourse changes in ways corresponding to the characteristics of the modality.

Why then should one conduct research into computational models of meta‑locutions using spoken, face-to-face conversation? It appears that the most efficient (in temporal terms, anyway) form of interactive communication for human beings is spoken conversation. It is certainly easier. Why should users of interactive computer systems be limited to the capabilities of computing technology circa 1965? Moreover, it may be difficult or impossible to understand the processes of language and interaction without starting with face-to-face, spoken conversation. While the precise nature of the contributions of verbal and nonverbal communication to interaction is not known, it is ascertainable through research:

This notion, though simple and intuitive, carries a strong implication for research: we can fully understand language only by examining its functioning as an aspect of face-to-face interaction, and not by treating it as an autonomous entity, unsullied by contact with everyday social processes, including co-occurring “nonverbal” actions. (Duncan, 1980, p. 67).

Looking at the advantages of conversation to interfaces, cross-modal research on discourse as surveyed by Cohen (1984) found significant efficiencies in spoken conversation. In particular, a series of cross-modality studies was conducted by Chapanis, Ochsman, Parrish, & Weeks (1972, 1977). They found, inter alia, that problems are solved twice as fast in vocal modalities as they are in written ones, even though conversants use twice as many words when speaking. Cohen (1984) also cites a thread of psychological research on cross-modal comparisons of reference. These studies show that for spoken interaction the length of noun phrases tends to decrease as subsequent references are made; this decrease is not as sharp for non-interactive spoken modalities. Cohen concludes that these results indicate that efficiency in referential communication is a function of conversants’ feedback.

Many of the differences between narrative and interactive discourse are well known. People looking at conversational language often remark on its apparently ill-formed and “ungrammatical” qualities. Not only are what one often considers speakers’ mistakes more prevalent; the opportunities for repair are correspondingly abundant. Thus Goodman suggests that the study of miscommunication is a necessary task for building natural language understanding systems since any computer capable of communicating with humans in natural language must be tolerant of the complex, imprecise, or ill-devised utterances that people often use (Goodman, 1986). Interactive systems which aspire to live in the real world of feral language must be able to cope with its perplexing characteristics. Among the factors affected by modality of discourse is the conversants’ ability or opportunity to maintain their model of the conversation. Clearly, authors of novels do not depend on real-time feedback from their readers to check uncertainty in the readers’ models, nor can authors negotiate the meaning of acts, lexical items, and references. Thus Jernudd and Thuan observe that discourse represents a continuous process of accommodation among conversants.

The relationship between conversants can vary greatly, from the great distance of written discourse to closeness of face-to-face conversation, which brings the question of accommodation into sharp focus (Jernudd & Thuan, 1983). If one is to study the shared model created by discourse and its maintenance through negotiation and accommodation, then one should study conversational rather than textual discourse. Other modalities of discourse, such as keyboard input, can be interactive but not spoken. It turns out that the presence of speech itself in place of written interactive language is significant for the structure and processes of discourse. Cohen (1984) found that keyboard communication is distinctly different from other modalities of discourse like face-to-face conversation. In particular, among other differences, keyboard interaction emphasized optimal packing of information into the smallest linguistic space. As a result, keyboard communication alters the normal organization of discourse. Of course, non‑textual information represents some sort of (presumably most efficient) lexicalization of an underlying meta‑locutionary act. To the extent that acts can be lexicalized in the alternate modality, the meta-locutionary content can be transmitted. However, this may require a new vocabulary; unless recognized through social acceptance or immediately negotiated, new lexemes will not be understood.8

Modality-induced differences in discourse also include the way conversants use reference. For example, voice‑only communication removes some of the forms of acts of reference (e.g., deixis and common visual context) which keep the interaction coherent. When these acts are not available, the interaction can easily break down:

O.K., uh ... now, we need to attach the um ... conduit to the motor ... the conduit is the uh ... the covering around the wire that you ... uh ... were working with earlier. Um, there is a small part um ... oh brother ... (Grosz, 1982, p. 88).

Accordingly, cross-modal studies have shown that different modalities of communication lead to different uses of referring expressions. Analysis of protocols from teletype and telephone interactions shows marked differences in the use of explicit requests for identification. As a consequence, systems which understand spoken language will have to be prepared for language which differs from that observed in teletype interaction (Cohen, 1981). If the goal of research in natural language processing is either (1) understanding how people actually converse or (2) developing systems which converse with people, the evidence with respect to differences in conversational characteristics from differences in modality suggests that spoken conversation would be a fruitful area of study.

The Study

The methodological strategy of this study, then, is to examine conversational interaction in some reasonable domain, and then to derive the underlying illocutionary acts as a rational strategy of action, given attributions of the participants’ beliefs, goals, and expectations at the points in the discourse in which the illocutionary acts actually occurred (Cohen, 1984). I thus turn to the particulars of the design of the study that instantiate this strategy.

Domain and Tasks

The first part of the empirical work involved development of a suitable domain for the observed conversations. The general requirements were that the conversation produce acts by the conversants which included turn‑taking, negotiation of reference, and determination of mutuality of knowledge. In order to facilitate ready experimental determination of mutuality, the domain should be a simplified one in which as much of the mutuality as possible is created through direct rather than indirect copresence. That is, if the knowledge becomes mutual to the conversants during the conversation itself, the fact of the mutuality can be more easily ascertained by an experimenter‑observer. In contrast, if the subject of the conversation is something which is mutual through indirect copresence, then the experimenter has virtually no basis from which to determine the knowledge confirmed or the extent of the confirmation. Unfortunately, extreme reduction in domain complexity may also lead to unrealistic interaction that exhibits artificial effects. In reaching a balance here, one may simply have to qualitatively evaluate the resulting conversation for verisimilitude.

A second factor in selecting the domain concerns deixis. I sought to reduce confusion over gestures which were referential by minimizing circumstances which called for deictic reference. This meant that a task such as jointly building a structure would have to be limited to structures which were wholly mental rather than physical or visual. The advantages of reduced confusion from deictic gestures are not obtained without cost. DeLancey (personal communication, February 23, 1988) has pointed out that a shared physical workspace serves as a default place for gaze; thus where the conversants have something other than each other to look at, direction of gaze toward each other may be more significant than in the case where they look at each other more or less continuously.

A third factor which influenced the choice of both the domain and the task was a need to keep the conversations reasonably short--around two minutes at the most. At the fine‑grained level of transcription which this study requires, longer protocols would simply be impractical to transcribe. Experience with the pilot study suggests that to transcribe fully ten seconds at the appropriate level of detail takes about six hours. Thus forty seconds of protocol requires three working days to transcribe. It would also be possible to elicit longer conversations with greater structure and then choose short sections for analysis. However, this would lead to more speculative analysis in initializing the conversational models. Accordingly, for this initial study, the domain and task had to revolve around concepts and activities that could have fairly rapid closure.
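The transcription estimate above can be checked with a short calculation. The following is a minimal sketch, not part of the study itself; it assumes an eight-hour working day, which the text implies but does not state, and the function names are hypothetical.

```python
# Rough transcription-cost estimate from the pilot-study figure:
# about six hours of work for every ten seconds of protocol.
HOURS_PER_TEN_SECONDS = 6.0   # reported in the text
HOURS_PER_WORKING_DAY = 8.0   # assumption; not stated in the text


def transcription_hours(protocol_seconds: float) -> float:
    """Estimated hours needed to transcribe a protocol of the given length."""
    return protocol_seconds / 10.0 * HOURS_PER_TEN_SECONDS


def transcription_days(protocol_seconds: float) -> float:
    """The same estimate expressed in working days."""
    return transcription_hours(protocol_seconds) / HOURS_PER_WORKING_DAY


if __name__ == "__main__":
    for seconds in (10, 40, 120):
        print(seconds, transcription_hours(seconds), transcription_days(seconds))
```

Under these assumptions forty seconds of protocol comes to 24 hours, or three working days, and a full two-minute conversation would take roughly nine working days, which suggests why the conversations had to be kept short.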

Given these constraints, I chose for the domain the task of jointly reconstructing a sequence of random letters. Thus a typical protocol involves two subjects, each of whom has a copy of the same sequence of 15 letters. Some of the letters have been replaced by blanks. The blanks do not overlap. Thus if one had both copies of the sequence, the entire sequence could be determined. The sequences of letters and the positions of the blanks were chosen randomly. Figure 7 depicts the copies of the sequences which subjects received in one of the trials.

The tasks were structured as follows: The subjects were given, by random choice, their copies of a sequence. The subjects then had one minute in which to memorize their respective sequences. After the minute expired, the subjects turned over the pieces of paper on which the sequences were printed and were instructed to jointly recall the entire sequence. This task was repeated two additional times, so that for each pair of subjects I obtained three conversations. In the third task, the copies of the sequence were altered so that toward the end of the sequence the copies differed by one letter. Copies of this kind are depicted in Figure 8.


Copy of sequence given to subject A:

_ i s u t w r q g l d _ _ _ _ g o

Copy of sequence given to subject B:

o _ s u t w r q g _ _ f w w d g o

Figure 7. Example of random‑letter sequences used in protocols. Neither copy contains the entire sequence. The blanks do not overlap. Therefore, the entire sequence can be reconstructed if the information in both copies is used.
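The property described in the caption, that the two copies jointly determine the whole sequence, can be illustrated with a short sketch. The code below is an illustration only and played no role in the study; the function names and the use of "_" as the blank marker are assumptions made for the example, and the two strings are the copies shown in Figure 7.

```python
BLANK = "_"  # marker used for a blanked-out position (as printed in Figure 7)

COPY_A = "_ i s u t w r q g l d _ _ _ _ g o"
COPY_B = "o _ s u t w r q g _ _ f w w d g o"


def parse_copy(text: str) -> list[str]:
    """Split a printed copy into one symbol per position."""
    return text.split()


def merge_copies(a: list[str], b: list[str]) -> list[str]:
    """Reconstruct the full sequence from two copies whose blanks do not overlap."""
    if len(a) != len(b):
        raise ValueError("copies must have the same length")
    merged = []
    for position, (x, y) in enumerate(zip(a, b), start=1):
        if x == BLANK and y == BLANK:
            raise ValueError(f"both copies are blank at position {position}")
        # Take whichever copy has a letter at this position.
        merged.append(y if x == BLANK else x)
    return merged


if __name__ == "__main__":
    print(" ".join(merge_copies(parse_copy(COPY_A), parse_copy(COPY_B))))
```

For the Figure 7 copies the merge yields o i s u t w r q g l d f w w d g o. Where both copies carry a letter, this sketch simply takes the letter from copy A and does not check that the two agree.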


Copy of sequence given to subject A:

n _ _ b w _ e g u q t y v o x u e

Copy of sequence given to subject B:

_ i k b w f e g u q t y _ _ _ y e

Figure 8. Example of non‑identical sequences used in third task. The copies of the sequences given to the subjects differ toward the end. In this example, the difference occurs in the next-to-last position. This stimulus was used to induce apparent failure of the subjects’ models of the conversation.
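The planted discrepancy in the Figure 8 stimulus can be located mechanically by comparing the positions at which both copies show a letter. This is a minimal sketch for illustration only; the function name and the "_" blank marker are assumptions, and positions are counted from 1.

```python
BLANK = "_"  # marker used for a blanked-out position (as printed in Figure 8)

COPY_A = "n _ _ b w _ e g u q t y v o x u e"
COPY_B = "_ i k b w f e g u q t y _ _ _ y e"


def conflicting_positions(a: list[str], b: list[str]) -> list[int]:
    """Return the 1-based positions where both copies show a letter and the letters differ."""
    if len(a) != len(b):
        raise ValueError("copies must have the same length")
    return [
        position
        for position, (x, y) in enumerate(zip(a, b), start=1)
        if x != BLANK and y != BLANK and x != y
    ]


if __name__ == "__main__":
    a, b = COPY_A.split(), COPY_B.split()
    print(conflicting_positions(a, b), "of", len(a), "positions")
```

Applied to the Figure 8 copies, the only conflict is at position 16 of 17, the next-to-last position, where copy A has "u" and copy B has "y".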




The rationale for the three tasks is that the first provides familiarization to the subjects, the second lets the subjects interact proficiently, and the third uses the subjects’ expectations garnered in the first two tasks to induce apparent failure of the subjects’ model of the conversation. The domain has the needed characteristic that all of the knowledge will be directly copresent, because the entire domain‑level knowledge structure is contained in the sequences. Moreover, because the sequences are memorized, the resulting conversations revolve entirely around mental rather than physical constructions. The relatively short length of the sequences meant both (1) that memorization and recall would be feasible tasks and (2) that the conversations could conclude rapidly. These assumptions were confirmed in a series of pre‑experimental tests.


I conducted for this paper a pilot study with two sets of subjects. Each set consisted of two persons who performed the three tasks described above. The subjects sat roughly at a ninety‑degree angle to each other so as to face each other and still be visible to the camera. An overhead view of the situation is presented diagrammatically in Figure 9.

The subjects were recorded on VHS videotape during all instructions, preparations, and the actual tasks. The subjects completed, from their subjective standpoint, all of the tasks, although objectively they did fail on occasion to recall the sequences accurately. However, their accuracy in the task was sufficient for the analysis. The conversations typically lasted about two to three minutes. In all, then, I obtained six protocols.


Figure 9. Overhead view of the experimental set-up for the letter-sequence protocols.



Protocol Analysis of Conversational Interaction

In the preceding section, I discussed the rationale, design, and collection of the protocol data for this research. I now turn to the analytical methods applied to the protocols. Useful sections of the protocols were transcribed for both verbal and non‑verbal behaviors. The verbal portions were transcribed with the aid of the SoundCap™ digitizing program on an Apple Macintosh computer. The sounds were resolved down to a precision of about a tenth of a second. Ambiguous words were resolved by listening to the original videotape. Sounds that could not be resolved into words were noted phonetically. Intonation was coded through punctuation. The subjects’ physical actions were transcribed using an editing videotape recorder with slow-motion control. The actions were resolved, where necessary, on a frame-by-frame basis down to a thirtieth of a second. A notation for the physical actions was developed to the extent necessary to impart the general nature of the actions observed; while accuracy in timing is important, accuracy in the specifics of the motions is not needed if the gist of the action can be discerned. This feature of the transcription process is facilitated by the fact that the study does not examine nor does it make claims about lexicalization of physical action (or of verbal acts either, for that matter). A transcript of the protocol specifically analyzed in this paper is set forth in Appendix A.
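Because the verbal and physical transcriptions use different timing resolutions (about a tenth of a second for sound, a thirtieth of a second for video frames), placing the two on a common timeline requires a simple conversion. The sketch below shows one way this could be done; it is not the procedure used in the study, the names are hypothetical, and it assumes exactly thirty frames per second with both recordings sharing a common zero point.

```python
from fractions import Fraction

FRAME_DURATION = Fraction(1, 30)    # video resolution given in the text
AUDIO_RESOLUTION = Fraction(1, 10)  # approximate sound resolution given in the text


def frame_to_time(frame_index: int) -> Fraction:
    """Time in seconds at which a video frame begins (assumes 30 frames per second)."""
    return frame_index * FRAME_DURATION


def audio_bin(time_seconds: Fraction) -> int:
    """Index of the tenth-of-a-second interval that contains the given time."""
    return int(time_seconds // AUDIO_RESOLUTION)


if __name__ == "__main__":
    for frame in (44, 45, 46):
        t = frame_to_time(frame)
        print(frame, float(t), audio_bin(t))
```

Exact fractions are used rather than floating-point numbers so that frame boundaries falling exactly on a tenth of a second are assigned to the correct interval. Under these assumptions each tenth-of-a-second interval spans three video frames.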

Once the verbal and physical behaviors had been transcribed, I then assigned identifiers to the behaviors which I deemed could be considered acts. These included virtually all of the behaviors observed. Only later in the study were behaviors declassified as acts; the motivation for this approach is that it was better to over‑include acts as significant rather than risk ignoring a behavior which might have been significant within the context of the conversation. As Birdwhistell (1970) observed, no behavior ever carries meaning in and of itself; it is the context which provides the meaning, if any. For example, in the protocol reported in Appendix A, B’s behavior of leaning back was initially coded as a separate act identified as Bi3; subsequent analysis of the interaction suggested no reasonable basis for treating this action as an act distinct from its other associated actions, and so the notation of Bi3 was deleted. As it turned out, though, very few behaviors were found insignificant. Conversely, in some cases additional acts were noted where, on further analysis, the behaviors appeared to require decomposition. In the protocol reported in Appendix A, act Ai6a was added in this way.


In assessing the validity of the theory of meta‑locutionary acts, the effects to be studied should include turn-taking, negotiation of reference, and confirmation of the mutuality of knowledge. Accordingly, a domain was developed in which these meta‑conversational phenomena could be observed and modeled. The domain task required face-to-face, spoken conversation between the conversants, maximized the use of directly copresent knowledge, and minimized deictic gestures. The actual task given the conversants in the experimental protocols--joint recall of a random sequence of letters--met these requirements. Protocols were obtained of three trials each of two pairs of subjects. The protocols were transcribed for verbal and non-verbal behaviors, and the behaviors were noted as possible meta‑locutionary acts.

In the next chapter I describe how the behaviors were connected to the acts developed as part of the theory of meta‑locutionary acts. These acts, and the conversational models which they affect, explicate the conversation in meta‑locutionary terms.


8. Some new lexemes may be slowly evolving for textually based computer interaction. For example, the smiley-face icon, “:-)”, and its variants convey the meta-locutionary sense of satire or irony.