This dissertation presents the motivation, methodology, and results of a qualitative study of mixed-initiative conversational interaction. The study examines speech meta-acts which handle the various functions of coördinating and producing conversational discourse. It addresses the problem of multi-agent computer and computer-human interaction by examining human‑human interaction. In the day-to-day experience of ordinary human interaction, people encounter and deal with multi-agent communication situations all of the time. If the processes humans use to produce conversation can be understood, these insights may facilitate human-computer interaction that conveniently resembles the interactive discourse of independent human agents.
Human-computer interaction tends to be fragile and frustrating while human-human interaction displays robustness over a wide variety of circumstances (Hayes & Reddy, 1983). Some of the differences between the two forms of interaction are summarized in Figure 1. In human-computer interaction, typically either the human or the computer controls the flow and structure of the interaction. This is single-initiative interaction in the sense that only one of the parties is empowered to initiate exchanges. This form of interaction is largely pre-planned, either by the human or by the designers of the program. Even so, it is largely because they are forced by the interface (or have learned from earlier frustration) to use well-formed language that the human and the computer are able to understand each other at all. In contrast, human-human conversations are characterized by unpredictability, ungrammatical utterances, non-verbal expression, and mixed‑initiative control in which the conversants take independent actions. In the face of an unexpected shift of conversational context, somehow humans are able to understand that the shift has occurred. Our everyday language is notoriously ill-formed, full of improper usages, mismatched agreements, halts and starts, in-line correction of self and other, and utterances that may not even be words. Moreover, we interpret and express a wide range of non‑verbal behaviors as a concomitant of verbal interaction. In human-human conversation, both conversants have the power to seize the initiative to meet their own needs and expectations. If the language of human-computer interaction seems stilted and tame, the language of ordinary human-human conversation is wild and untamed.
            HHI                  HCI
control     mixed-initiative     single-initiative
actions     situated             pre-planned
language    informal, relaxed    formal, stilted
Figure 1. Differences between interaction modes. Human‑human interaction (HHI) is generally more flexible than human‑computer interaction (HCI).
Traditional natural-language systems--even those based on speech acts--are largely unable to handle these aspects of “feral” language because they mainly rely on parsing sentence-level interaction and fail to account properly for the context created by the conversation itself. Conversational processes are apparently not characterized by top-down planning; they are made up of actions which are responsive to the dynamic conversational situation. A striking aspect of conversational discourse is its coherence--the character of making sense to the conversants. What accounts for the coherent nature of conversation? Despite the incoherence of human-human interaction to an observer to whom only the words are given, the interaction is coherent for the participants; the conversants take turns, make interruptions, detect and cure misunderstandings, and resolve ambiguous references.
How can these processes of control be modeled formally in a manner sufficient to bring this sort of coherence to human‑computer interaction? Although speech act theory adds clarity to the role of intention in interactive discourse, and implies a central function for intention in making discourse coherent, intention is not sufficient to explain the coherent nature of interactive discourse. I argue in this dissertation that shared models play a key role in the mechanics of conversation and creation of coherence. If interactive discourse is achieved through a process of maintenance of a shared model, the conversants will have to use feedback in their discourse as a process control. This feedback is contained in non‑sentential aspects of conversation such as nods, fragmentary utterances, and correction. Such actions by the conversants, based on the context of their interaction, determine the form of the conversation. These actions are thus “meta” to the conversation because they are about the conversation itself rather than constituting its primary substance. In this view ungrammaticality, for example, is not a problem but evidence of these “meta” acts.
This dissertation develops a theory of meta-locutionary acts that explains these control processes. The theoretical approach developed and used here is based on a synthesis of two schools of thought with respect to natural language.1 The first, speech-act theory, is a branch of the philosophy of language. The second, conversational analysis, is a branch of the sociology of language. The differences between these approaches result in what I term the integration problem. I propose a solution to the integration problem and then use this result to build an operator-based model of conversation. The theory, which encompasses a taxonomy of meta-locutionary acts, thus extends speech-act theory to real-world conversational control.
The theory of meta-locutionary acts was refined and validated by a protocol study of human‑human interaction and a computational simulation of this interaction. In the protocol study, subjects were given a coöperative problem‑solving task in a simple domain which necessitated linguistic interaction. The conversants’ behaviors, both verbal and non‑verbal, were transcribed, and their interaction was mapped into illocutionary and meta-locutionary acts. The simulation was developed using a rule-based system written in Prolog; it embodies the theory of meta-locutionary acts and shows that the theory has practical explanatory power. It extends the work of Power (1979) to meta-locutionary phenomena. The system contains independent knowledge bases for both conversants and can represent their separate conversational knowledge simultaneously. The conversants’ simultaneous actions can then be simulated in the rule‑based system on a cycle-by-cycle basis. Simulations of illocutionary and meta-locutionary acts in the conversations obtained in the protocol study showed that meta-locutionary acts are capable of providing contextually appropriate control of mixed-initiative discourse.
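The architecture just described (independent knowledge bases, rule-based operators, and simultaneous action on a cycle-by-cycle basis) can be illustrated with a minimal sketch. The sketch is in Python rather than Prolog and is not the simulation reported in this dissertation; the class `Conversant`, the functions `select_acts`, `step`, and `simulate`, and the three repair rules are all hypothetical names chosen for illustration. It shows only the general idea that both conversants select acts from a snapshot of their own beliefs before either knowledge base is updated.

```python
from dataclasses import dataclass, field


@dataclass
class Conversant:
    """One conversant with a private knowledge base of belief strings."""
    name: str
    beliefs: set = field(default_factory=set)


# Hypothetical condition -> act rules for a simple repair sequence.
# Each act concerns the conversation itself, not the task domain.
RULES = {
    "heard_unclear_utterance": "request_repeat",
    "heard_request_repeat": "repeat_utterance",
    "heard_repeat_utterance": "acknowledge",
}


def select_acts(agent):
    """Return the acts whose conditions hold in this agent's own beliefs."""
    return sorted(act for cond, act in RULES.items() if cond in agent.beliefs)


def step(a, b):
    """Run one cycle: both agents choose first, then both are updated."""
    chosen = {a.name: select_acts(a), b.name: select_acts(b)}
    # Consume the conditions that triggered each chosen act.
    for agent in (a, b):
        for cond, act in RULES.items():
            if act in chosen[agent.name]:
                agent.beliefs.discard(cond)
    # Deliver each act to the other conversant's knowledge base.
    for actor, hearer in ((a, b), (b, a)):
        for act in chosen[actor.name]:
            hearer.beliefs.add("heard_" + act)
    return chosen


def simulate(a, b, max_cycles=10):
    """Step until neither conversant acts, returning the per-cycle trace."""
    trace = []
    for _ in range(max_cycles):
        chosen = step(a, b)
        if not any(chosen.values()):
            break
        trace.append(chosen)
    return trace


if __name__ == "__main__":
    speaker = Conversant("A")
    hearer = Conversant("B", {"heard_unclear_utterance"})
    for cycle, acts in enumerate(simulate(speaker, hearer), start=1):
        print(cycle, acts)
```

Selecting acts for both conversants before applying any update is what makes the actions within a cycle simultaneous in this sketch: neither agent can react to an act performed in the same cycle.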
The dissertation concludes with an assessment of the computational model, including proposals for extension, and a discussion of the implications of meta-locution for understanding human-computer interaction.
language is viewed as a series of assertions about the world, the reasons for
speaking and the way language is comprehended are difficult to understand. A
breakthrough in understanding the functions of language was made by the
development of a theory of speech-acts.
Problems of Speech-Act Theory
Speech-act theory has not been without critics. Levinson (1981) argued that speech acts are inadequate for modeling dialogue. First, he argued, the immense variety of surface utterances cannot be reduced to a limited set of acts. Second, speech acts do not appear to correspond neatly to sentences. Third, assignment of meaning to acts cannot be accomplished by a set of rules expressing conventions; rather, this process must involve inferential techniques that take many aspects of context into account. Fourth, sequence‑oriented speech‑act models are too simple to account for the complexity of real‑world exchanges, and coherence cannot, as a matter of principle, be explained by rule‑based systems. The first three points are addressed directly by the theory advanced in this dissertation. It is plausible that speech acts are specialized by domain. This is analogous to observations of development of sublanguages in communities of speakers with specialized interests or needs. In this case, viewing the process of conversation as a domain itself leads to the observation that speech acts need not correspond to sentence‑level discourse units. The circumstance that early speech-act models were simple does not mean that more sophisticated rule‑based speech-act models cannot embody situated action in response to complex contexts. As I argue in this dissertation, the fault is not in the idea of speech-act theory but rather in confining its application to sentence‑level meaning. Finally, Levinson’s argument about the computational properties of rule-based systems is mistaken. One could dispute matters such as convenience of representation, but Levinson’s observation that simple rules cannot explain some exchanges does not reasonably lead to the conclusion that this problem is an inherent one. For example, rule‑based systems are not inherently limited to modeling adjacency-pair exchanges.
Bowers and Churcher (1988), for example, have shown that a rule-based system modeling speech acts can account for complex situated communicative actions. They do so by extending the theory of illocutionary acts from individual performance to social accomplishment; illocution is understood in “dialogical” terms.
Another problem with speech‑act theory has been the difficulty with what are called “indirect” speech acts (Perrault & Cohen, 1980). The problem posed by indirect speech acts is that of explaining why an utterance which on its face appears to be one sort of act is nevertheless easily understood as an altogether different sort of act. Indirect acts are often the result of social conventions for politeness and seemly behavior. For example, “Gee, that cake sure looks good.” is apparently an act of assertion when the conversants both know the act is really a request: “Please give me a piece of that cake.” Understanding and explaining indirect acts is a major open problem for speech-act theory. The protocol analysis in Chapter V makes a modest gain in this area; the evidence suggests that indirect speech acts have utility in terms of efficiency of conversational communication.
There is also a continuing argument over whether conversants must have the ability to recognize illocutionary acts as such (Bach & Harnish, 1979; Cohen & Levesque, 1986), but the outcome of this debate is not critical to the points made in this dissertation. What is important about speech acts here, though, is a view of language as a series of acts with effects on listeners, or even as coöperative acts which have effects on a jointly constructed conversational structure. Construed broadly, speech acts can be seen as the principal units of inter‑agent interaction; they are the acts which build conversations. Winograd and Flores (1986) describe language as a form of human social action. Language as acts, they argue, creates a consensual domain: interlinked patterns of activity. While Austinian speech-act theory was developed to explain overt effects of sentence-level utterances, these qualities can be relaxed to include mental effects in the hearer and sub‑sentence level utterances. As I will show in this dissertation, such extensions of speech-act theory lead to plausible computational models for interactive discourse.
Extension of Speech-Act Theory
Tying speech acts to sentence-level utterances appears to be a limitation which does not take adequate account of the relationship between acts and language. Some acts are physical acts masquerading as speech acts. Others are speech acts which look like physical acts. For example, a light switch which reacts to loud noises might turn itself on when someone yells “Lights on!” but would also work if they yelled “I now pronounce you husband and wife.” Conversely, a cough might embody a request for someone to be quiet. In particular, not all effects which speakers intend to cause through their illocutionary acts are physical in nature like passing the salt. As we have seen, some illocutions can make changes in legal relations. Other illocutions have as their effect another speech act. Thus speaking “Do you have the time?” may lead to the hearer saying “Yes, it’s ten of two.” Indeed, as is centrally important to this dissertation, the perlocutionary effect of an utterance may involve the conversation itself. Thus, for example, “Shut up!” may be intended by the speaker to cause the hearer to stop talking. In a more pleasant and coöperative context, speaking “Go on.” may be intended by the speaker to cause the hearer to continue speaking (and maybe even effect some purely mental change in the hearer’s beliefs about the speaker’s level of comprehension). Such utterances can be viewed as meta-locutionary acts because they concern and affect the process of their own conversation. This dissertation, then, develops a framework for research into speech meta‑acts which handle the various functions of coördinating and producing discourse, as well as for research developing semantics and pragmatics for these meta-acts that are sufficiently complete to permit computational modeling of simple interaction in conversational discourse.
Mutuality of Knowledge
The model proposed in this dissertation is based on the idea that the meta-locutionary acts of participants in a conversation are directed toward maintenance of some set of mutual beliefs. Before tackling the problem of what knowledge is shared, though, the question of what constitutes sharing must first be addressed. This issue was studied by Clark and Marshall (1981) in their research on definite reference. They noted what they termed the mutual knowledge paradox: People clearly understand definite reference in a finite time, but this ought to take infinite time because to be certain of a reference the conversants are caught in an infinitely recursive evaluation of the form “B knows that A knows that B knows that A knows that p.”
Clark and Marshall resolved the mutual knowledge paradox through use of copresence
heuristics. In the simplest case, direct copresence creates mutual knowledge
when both parties simultaneously observe a domain feature and are aware that
the other party is also observing the feature. For example, two people sitting
across from each other at a dinner table observe simultaneously a candle in the
center of the table, thus also observing each other observe. Copresence heuristics
eliminate the infinite recursion of the mutual knowledge paradox by allowing
the parties to have direct knowledge of the conclusion. These direct copresence
heuristics are supplemented (and here the supplements may be far larger than
the original corpus) by indirect copresence heuristics that provide an
equivalent basis for concluding mutual knowledge. That is, direct copresence is
not necessary where the circumstances or the parties’ general knowledge about
each other provide a substitute for physical copresence. Clark and Marshall
described an extensive set of indirect copresences, the details of which are
far beyond the scope or needs of this dissertation. As an example, though, I
note the case where two conversants can infer that they are both members of a
In the case of interactive discourse, the notion of mutual knowledge through copresence heuristics applies directly to conversants’ sharing of knowledge about the conversation itself. In conversing, the direct copresence heuristic applies to the utterances of both parties because they hear both their own and the other’s utterances while in each other’s presence. Thus the conversants have mutual knowledge of their conversation (and they know it, because the mutuality is “on the table”). The conversants also share a directly copresent knowledge not only of the individual utterances but also of the contextual structure which the utterances create for the conversation. This, then, is what I mean generally when I refer to a shared model of interactive discourse. Note that this exposition covers only what is meant by sharing and does not reach the issues of what specifically is shared or what is significant in what is shared.
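The contrast between the unbounded regress of the mutual knowledge paradox and the finite test offered by direct copresence can be made concrete with a small sketch. This is an illustrative encoding only, not Clark and Marshall's formalization; the functions `nested_belief` and `directly_copresent`, and the representation of observations as (observer, target) pairs, are assumptions introduced here.

```python
def nested_belief(agents, proposition, depth):
    """Build the nested statement checked at a given depth of the regress.

    With agents ("A", "B") and depth 4 the result is
    "B knows that A knows that B knows that A knows that p".
    Certainty would require this check at every depth, without bound.
    """
    first, second = agents
    statement = proposition
    for level in range(depth):
        knower = first if level % 2 == 0 else second
        statement = f"{knower} knows that {statement}"
    return statement


def directly_copresent(observations, a, b, feature):
    """Direct copresence test (a sketch): a fixed, finite set of checks.

    `observations` is a set of (observer, target) pairs. Mutual knowledge
    is concluded when both parties observe the feature and each observes
    the other observing it.
    """
    required = {
        (a, feature),
        (b, feature),
        (a, f"{b} observing {feature}"),
        (b, f"{a} observing {feature}"),
    }
    return required <= set(observations)


if __name__ == "__main__":
    print(nested_belief(("A", "B"), "p", 4))
    dinner_table = {
        ("A", "candle"),
        ("B", "candle"),
        ("A", "B observing candle"),
        ("B", "A observing candle"),
    }
    print(directly_copresent(dinner_table, "A", "B", "candle"))
```

The same test carries over to conversation if the feature is taken to be an utterance that both conversants hear while in each other's presence: four checks stand in for the infinite chain of nested statements.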
Even though Clark and Marshall’s work was based on resolving a problem of understanding definite reference, their concept of mutual knowledge is straightforwardly extensible to other aspects of the domain and the conversants’ view of the domain. Moreover, even the nominally limited field of definite reference is broad enough to support reasonable inference about the role of shared structures in interactive discourse. When a conversant makes a reference to a (conversational) domain referent, the effect in the hearer might be a simple one of affording recall of or attention to some specific and insular thing. More likely, though, is that the referent is one part of a complex memorial structure. If there is a specific and definite referent with copresence, the speaker can reasonably infer the hearer’s shared knowledge. But even in such a case the referent is likely to have remindings and shadings of meaning for the hearer which differ from those of the speaker. To the extent that copresence heuristics are available for these remindings (community, shared experiences, “our song”) the remindings are part of the shared knowledge of the conversation; beyond this extent, the conversants’ knowledge is a matter for inference at best and more likely ignorance. That is, often there must be remindings about which the other party has little or no knowledge, and the parties presumably know this.
point here that the boundary between shared and unshared knowledge is not
sharp. Rather a field of imprecision and conjecture exists where shared
dissolves into unshared as the inferences of indirect copresence become
increasingly implausible. Thus a shared model of a conversation will be
clearest for the facts of the utterances themselves (because of direct
copresence) and increasingly indistinct for the memorial extensions of
reference and the semantics of the conversants’ actions. For another
conversant, as for any observer, the imprecision of ascertaining the meanings
of a conversant’s actions in response to illocutionary acts is exacerbated to
the extent that the action is a mental one. That is, some responsive actions
are physical—like opening a door or passing the salt; other actions are
Given these fundamental concepts of speech acts and mutuality, I can begin to address the research question around which this dissertation is centered. A vocabulary, though, does not in itself constitute a theory. In the next chapter, then, I examine the state of related research and the frontier at which my own problem is formed. Specifically, Chapter II looks at the problem of applying speech‑act theory to real‑world language, examines evidence for shared models of conversations, and discusses the role of repair in maintaining coherence. Chapter III presents the theory of meta-locutionary acts as a way of explaining coherence in terms of conversational acts analogous to domain-level illocutionary acts. Chapter IV discusses the methodology of a protocol study of human‑human conversation. In the study, the results of which are reported in Chapter V, a conversational protocol is analyzed in terms of meta-locutionary acts. Chapter VI presents a simulation of the protocol conversation using rule-based operators. The operators represent an application of the theory of meta-locutionary acts; they are able to use the theory to account for significant parts of the interaction observed in the protocol. Finally, Chapter VII summarizes the results of the dissertation research, discusses issues of intentionality which arose out of the research, and presents thoughts about related future work.
1. Speech-act theory is the source of the terms locution, illocution and perlocution. The spoken words are a locution; illocution is the rhetorical force of the utterance; and perlocution is the change wrought by the illocution. The term meta-locution is an invention of the author's which refers to speech about speech. A meta-locutionary act is an illocutionary act which has as its intended perlocutionary effect a change in the conversation in which the act itself occurs.