Survey of Related Work
The Integration Problem
In addressing the question of what language is, speech-act philosophy was spectacularly successful. It solved many problems in linking language to the world and explained why people use language. Speech‑act theory has been useful as a basis for some aspects of computational modeling of interactive discourse, particularly in matters relating to intention and planning (see, e.g., Power, 1979; Cohen & Perrault, 1979; Allen & Perrault, 1980; Cohen, 1981). Speech acts could explain in terms of goal-directed behavior how conversants coördinated when and what they said. Speech-act theory reintroduced communication and purpose into analysis of planning in language. It did not, however, prove as successful in explaining how language is used. That is, the feedback‑suffused protocol evidence of actual speakers was often difficult or impossible to explicate using speech acts (Cohen, 1984; Clark & Wilkes-Gibbs, 1986; Clark & Schaefer, in press). This meant that things like the coördination of real turn-taking could not be modeled with speech acts because the turns frequently did not contain utterances which could be classified as speech acts.
At the same time, socio-linguistic research (see, e.g., Schegloff, Sacks & Jefferson, 1977) was uncovering the very linguistic behavior which posed problems for speech-act theory. These researchers did not, however, propose computational models. Clark and Wilkes-Gibbs (1986) noted that for both sociologists and speech-act philosophers, the central issue is coördination: How do conversants coördinate on the content and timing of what is meant and understood? The two divergent research traditions had approached the same problems from different directions, with equal lack of success:
The issue cannot be resolved by either tradition alone. In the [speech‑act] tradition conversation is idealized as a succession of illocutionary acts—assertions, questions, promises—each uttered and understood clearly and completely [citations omitted]. Yet from the [sociological] tradition we know that many utterances remain incomplete and only partly understood until corrected or amplified in further exchanges. How are these two views to be reconciled? (Clark & Wilkes-Gibbs, 1986, p. 2)
This problem, which I call the integration problem, also bothered Cohen (1984) in his examination of the use of referring expressions in interactive discourse. According to Searle, to perform an illocutionary act an act of predication is required, and the predicate must be uttered. How then, Cohen asked, can one explain the following dialogue where there is no apparent predication for a completed utterance containing only a noun phrase:
A: Now, the small blue cap we talked about before?
A: Put that over the hole on the side of that tube .... (Cohen, 1984, p. 104)
The only apparent effect of A’s initial utterance can be to direct B’s attention to the referent. A does not say anything about the referent here. Equally perplexing, Cohen argues, are predications without expressed referents. For example, Sherlock Holmes might lean over a body on the floor and exclaim to Watson “Dead.” Conversely, Searle would claim that utterances like “There is a little yellow piece of rubber.” contain no act of referring at all and are simply predications. The problem, then, is again that the traditional view of speech acts cannot be reconciled with the linguistic evidence. Cohen noted that the requirement that the act of referring be jointly located with some predication in a sentence or illocutionary act is too restrictive. The functions of reference and predication can be embodied in separate utterances which have, in effect, different though related perlocutionary effects. His solution to this particular part of the problem was to break down Searle’s unitary acts into separate acts with separate functions. Thus he suggested that referring, in and of itself, is a kind of request that speakers make of hearers (i.e., to direct attention). But where lies the solution to the general problem of integrating speech acts with the actual character of interactive discourse? The principle of breaking down exchanges involving reference into segments characterized by their goals might be extended to the general phenomenon of interactive discourse by recognizing that each utterance fragment reflects its own proper purposes.
Characteristics of Conversation
The socio-linguistic line of research on interaction is part of a large field which is broadly known as conversational analysis. This research has described a wide range of conversational characteristics which are not directly representable in Austinian speech-act theory. These characteristic behaviors include laughter, fragmentation, turn-taking, correction, gaze, and nonverbal communication generally.
The prevalence in conversation of laughter and fragmentation was shown by Allen and Guy (1974). There are distinct regularities associated with these phenomena. For example, in conversations between a male and a female, the male tends to laugh twice as often as the female. Although fascinating, laughter is really beyond the scope of the present research. Fragmentation, however, is one of the central characteristics of language for which this dissertation attempts to account. This includes uncompleted sentences and words, plus non‑word utterances such as “uh,” “ah,” and “eh.” Fragmentation is sometimes associated with correction (Allen & Guy, 1974). More generally, fragments represent sub-lexical, incomplete sentential thoughts, or complete conversational contributions which are expressed in non-sentential form. These include, for example, referential and confirmatory utterances such as “OK, now, the small blue cap we talked about before?” (Cohen, 1984, p. 104) and “Uh-huh.” That is, while fragmentation sometimes results from correction, it is a general functional characteristic of natural discourse. Thus despite their part-formedness, fragments are used in utterances which are nevertheless understood by conversants.
Turn-taking (Sacks, Schegloff, & Jefferson, 1974; Duncan, 1980) is a control-oriented account of conversation that explicates patterns of mixed-initiative interaction in terms of conversational turns. This represents an organizational substrate for domain‑level, intentionality‑based interaction. It is maintained through a wide set of behaviors and acts, including nonverbal communication (Ekman & Friesen, 1981).
In conversation, repair of language and interaction has been observed in two forms: self- and other-correction. There is an apparent preference for self-correction (Schegloff et al., 1977; Clark & Wilkes-Gibbs, 1986), which leads to fragmentation of utterances as speakers in effect erase parts of their utterances and replace them with substitute phrases. As a sub-sentential phenomenon, self-correction is not accounted for by speech-act theory. Other-correction, when lexical, for example, presents the same difficulty.
The role of gaze in conversation has been the subject of extensive—though largely descriptive—analysis. Gaze performs multiple functions in human-human interaction (Argyle & Cook, 1976; Argyle, Ingham, Alkema, & McCallin, 1981), and can be described in terms of its temporal association with language use in dialogue (Beattie, 1981). Gaze was found to be organized in a coördinated system with the plans underlying speech and with the speech flow. Gaze was more highly associated with turn-yielding cues than simply with syntactic clause boundaries.
Gaze is a significant part of a broader range of behaviors generally considered as nonverbal communication. Extensive systems have been developed for noting physical states and actions associated with language (see, e.g., Birdwhistell, 1970; De Long, 1983). Indeed, there is much evidence in support of the proposition that nonverbal behaviors are part of language. For example, Hoffer
All of the emerging data seem to me to support the contention that linguistics and kinesics are infracommunicational systems. Only in their interrelationship with each other and with comparable systems from other sensory modalities are the emergent communication systems achieved. (Birdwhistell, 1970, p. 127)
Clearly, it would be difficult to explicate coherence in interactive discourse without taking account of these characteristics of language as it is used; this is the heart of the integration problem. Why, though, do conversants rely on these functions? In the next two sections, I address this question by looking at models of mixed-initiative conversation which would explain the use and necessity of these behaviors.
Evidence for Shared Models of Conversations
This section examines the role of feedback as a process control for interactive discourse by discussing the following questions: How closely does the hearer track the speaker’s utterances against their shared assumptions about the conversation? How do conversants minimize and/or correct expectation failures?
The evidence is strong for the proposition that conversants in interactive discourse share a model of their conversation. In informal terms, a weak version of this conjecture would be that the conversants must have at least some knowledge necessary to the conversation which is common to all conversants. A strong version is that the conversants are jointly creating a single (though possibly complex) intellectual product. Suchman (1987) characterizes conversation as an “ensemble” work:
Closer analyses of face‑to‑face communication indicate that conversation is not so much an alternating series of actions and reactions between individuals as it is a joint action accomplished through the participants’ continuous engagement in speaking and listening [references omitted]. (Suchman, 1987, p. 71)
Clearly, speakers of a language necessarily share lexical knowledge, even if their internal representations of the lexicon differ in their extension. To what extent, then, must conversants construct or share a mutual model of their jointly‑created conversation?
Implications of the Interactive Process for Shared Models
Observation of interactive discourse suggests that when conversants share information, their exchanges cover the range of conversational levels, from domain information to turn-taking. As has often been noted, one of the central questions of interactive discourse is how the conversants coördinate the timing of what is meant and understood (see e.g., Clark & Wilkes-Gibbs, 1986). Clearly, a conversational model common to the conversants would facilitate this coördination. This process has been characterized as one of coöperation. The fact of coöperation would seem to presuppose something to coöperate about. According to Clark and Wilkes-Gibbs (1986), Grice (1975) observed that conversants coöperate in their contributions to a conversation by directing their contributions toward the accepted purpose or direction of their exchange.
Even in the simplest cases, though, it is apparent that some kind of mutually understood model of the conversation is created. Consider the railway station protocols discussed by Allen and Perrault (1980):
Patron: The train to
Clerk: Gate 10. (Allen & Perrault, 1980, p. 442)
the notion of mutual knowledge developed by Clark and Marshall, the clerk and
the patron mutually know that the patron has made an inquiry about the train to
Cohen and Perrault (1979) also described a model for possible intentions underlying speech acts. Cohen and Perrault were interested in providing a theory of speech acts which, among other things, answered questions about changes the successful performance of a speech act makes in the speaker’s model of the hearer and in the hearer’s model of the speaker. They proposed a planning approach which provided formal criteria for defining speech acts in terms of intentions, abilities, and effects. Acts of requesting and informing were then specified as operators which can be used by a planning system. This model necessarily assumed a speaker’s model of hearer and vice versa. To the extent that these are self-referential we have at least the raw basis for a shared model, whether or not the authors recognized it as such.
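The idea of treating speech acts as operators for a planning system can be made concrete with a minimal sketch. The class name, the string-based propositions, and the particular preconditions and effects below are illustrative assumptions, much simplified from anything in Cohen and Perrault's formalism; the sketch shows only the general shape of an operator that is applicable when its preconditions hold in a model and that adds its effects to that model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechActOperator:
    """A speech act as a planning operator (hypothetical, simplified structure)."""
    name: str
    preconditions: frozenset
    effects: frozenset

    def applicable(self, state: frozenset) -> bool:
        # The act can be planned only if every precondition holds in the state.
        return self.preconditions <= state

    def apply(self, state: frozenset) -> frozenset:
        # Performing the act adds its effects to the state.
        if not self.applicable(state):
            raise ValueError(f"{self.name}: preconditions not satisfied")
        return state | self.effects


# Propositions are plain strings; their wording is an assumption for illustration.
INFORM = SpeechActOperator(
    name="INFORM(speaker, hearer, P)",
    preconditions=frozenset({"speaker believes P"}),
    effects=frozenset({"hearer believes speaker believes P"}),
)
REQUEST = SpeechActOperator(
    name="REQUEST(speaker, hearer, ACT)",
    preconditions=frozenset(
        {"speaker wants ACT", "speaker believes hearer can do ACT"}
    ),
    effects=frozenset({"hearer believes speaker wants ACT"}),
)

state = frozenset({"speaker believes P"})
after_inform = INFORM.apply(state)
```

Note that the effect of the hypothetical INFORM operator is a proposition about the hearer's model of the speaker, which is the kind of nested, mutually referring belief on which a shared model could be built.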
To be coherent, a conversation must be about something. It follows that the discourse segments which make up a conversation must have corresponding foci. The notion of focus as a characteristic of interactive discourse was developed by Grosz (1977). These studies were primarily based on task-oriented dialogues. In the context of shared models, a practical interpretation of Grosz’s work is that focus really means the thing on which the conversants are mutually focusing; otherwise the conversation would lose its coherence (absent recurring and unbelievable coïncidence). Grosz’s idea of focus space implies that this is a shared view or partitioning of the domain; indeed, discourse presumes mutual focus. Another way of looking at focus spaces might be to consider them as shared models of the real domain which provide necessary (1) common extensional meaning and (2) reference points for coherent discourse structure. Similarly, in the protocols studied by Clark and Wilkes-Gibbs (1986), the conversants (have to) find a mutually acceptable perspective. As I see this result, the parties have established an interpretive model of the real domain that is suitable for achieving their underlying intentions through discourse.
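One way to picture focus spaces as a shared partitioning of the domain is the following sketch. It is not Grosz's representation: the partition names, the entities in the second space, and the function names are all assumptions made for illustration. The point is only that a referent can be resolved mutually when it lies in the focus space that both conversants currently have active.

```python
# Hypothetical partition of a task domain into focus spaces.
# The "cap" entities echo Cohen's example; the "pump" entities are invented.
FOCUS_SPACES = {
    "cap": {"small blue cap", "hole", "tube"},
    "pump": {"plunger", "base"},
}


def in_focus(active_space: str) -> set:
    """Entities in the focus space a conversant currently has active."""
    return FOCUS_SPACES[active_space]


def mutual_focus(speaker_space: str, hearer_space: str) -> set:
    """Entities that are in focus for both conversants."""
    return in_focus(speaker_space) & in_focus(hearer_space)


def can_resolve(referent: str, speaker_space: str, hearer_space: str) -> bool:
    """A referent is mutually resolvable only if it is in the mutual focus."""
    return referent in mutual_focus(speaker_space, hearer_space)
```

Under these assumptions, `can_resolve("tube", "cap", "cap")` holds, while the same referent cannot be resolved if the hearer has shifted to the "pump" space without the speaker noticing.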
Of course, the role of focus and perspective need not be limited to physical domain objects, definite referents of any kind, or even pragmatics. It has been noted that a successful speech act is based on agreement or on continuing negotiation of what values should prevail (Jernudd & Thuan, 1983). That is, the conversants must arrange to share the meanings of their illocutionary acts, even if the process of understanding doesn’t require that the acts be recognized explicitly. In order to maintain coherence, then, conversants have (or are trying to obtain) a shared model of at least the semantics of the illocutionary acts. This sort of negotiation may also be applicable (and probably much more frequently) to the pragmatics of a conversation.
The Role of Meta-Locution in Conversation
Participants in interactive discourse presumably engage in their interaction because of underlying intentions. While they may have private intentions which motivate their communication, they present to each other apparent discourse purposes (Grosz & Sidner, 1986). Thus if the discourse purposes are not apparent, their meanings must be clarified. This secondary discourse is a collaborative process which is meta to the subject of the conversation. Similarly, negotiation of referential meanings becomes meta to the conversation. For the meanings of illocutionary acts, the evidence of negotiation is indirect. Jernudd and Thuan (1983) suggested that speech acts, to be successful in language, require agreement by the conversants on the meaning of the acts. They observed that partners in communication generally coöperate in the (meta-) communicative goal that the speaker’s speech act is identical to that understood by the hearer. In other words, if both conversants are striving to make sense of their conversation, they will try to get the speaker’s and hearer’s identification of the speaker’s act to match up.
In the same way that the semantics of illocutionary acts are subject to negotiation and agreement, so too is reference a collaborative process. This was shown experimentally by Clark and Wilkes-Gibbs (1986). The conversants must mutually accept that the hearer understands the reference before the conversation proceeds. As I understand this work relative to possible shared models of the conversation itself, there is an incremental process of building a mutually understood reference scheme that constitutes a shared model at least as to extensional meanings.
Consistent with an interpretation of Clark and Wilkes-Gibbs’s work showing the negotiation of the topic of a conversation (and thus suggesting a shared conversational model) is Grosz’s view that the knowledge of the participants in discourse can be characterized by a common structure. She suggested that one can model focus in discourse as a partitioning of a semantic net which encodes the domain of discourse (Grosz, 1977). This network represents one conversant’s model of the conversation, and it therefore includes that part of the conversation which the conversant believes constitutes the shared conversational structure. Of course, the conversants’ respective models need not be and are not likely to be identical. This is like
Grosz’s analysis looked at explicit indicators of shifts in focus; a similar analysis could be applied to other utterances or acts which have specific model-maintenance functions, and even to functions which conversationally propose lack of mutuality. Such model-maintaining utterances are what I have called meta-locutions, which correspondingly embody meta-illocutionary acts. In other words, the intended perlocutionary effects of such acts concern the process of conversation itself rather than the underlying discourse purposes. The utterances are in this sense meta-locutions because they are about the process of the very conversation in which they occur. For example, an utterance like “Go on ....” can be seen as a meta-locutionary act in which the illocutionary meaning is something like “I’m asserting that I don’t want to repair anything here and I’m letting you keep your turn.” Similarly, looking away while talking could be construed as a meta-locutionary act such as “I’m holding on to my turn.” Such functions of gaze in particular have been recognized as providing meta-information used in conversational control. People link use of their verbal‑auditory and nonverbal-visual channels:
(1) Since the vocal channel requires that people take turns to speak, signals must be used to negotiate turn‑taking; it would be difficult to contain these signals in the vocal channel—speakers would have to combine messages and meta-messages, and listeners would have to speak at the same time. Therefore the synchronizing signals are forced into the second channel. (2) While a person is speaking he needs feedback on how others are reacting; this could be provided by vocal comments, but that would involve double-talking, so feedback signals are also relegated to the second channel. (Argyle & Cook, 1976, p. 124)
The relatively broad range of behaviors which constitute communicative interaction, then, suggests a pervasive and central role for meta-locution in the control of conversation.
Interactive Discourse Requires Monitoring by All Conversants
Jernudd and Thuan (1983, p. 81) reasonably contended that language is usually expectation driven: “Norms of use are founded on expectations that users form. Obviously, interaction proceeds mainly in worn grooves and these generate reasonable expectations.” Applied to the maintenance of shared conversational models, this principle suggests that conversants often share a large part of their conversational model automatically or by default. A consequence of this is that if a conversation is to be coherent each conversant must have a set of expectations which is consistent with the others’. These sorts of structures have been recognized when conventionalized as analogous to the well-known script models of conversations (Cohen, 1984). Yet not all conversations, and maybe not even most, follow the script exactly. Otherwise we would find, contrary to experience, that we are always having the same conversations or fragments over and over again. This means that along with the shared set of expectations, conversants must detect and maintain the set of deviations from the well-worn expectational grooves. Thus Clark and Wilkes-Gibbs (1986) noted the prevalence of conversational feedback: the hearer lets the speaker know how things are going. This implies that the hearer has a model of what the speaker is trying to say. This process of monitoring thus apparently involves the hearer checking the actual utterances of the speaker for consistency against a set of expectations. The deviations from the expectations are, as I’ve observed, frequent. Moreover, no small set of expectations could possibly cover the variety of conversations in which we might and do engage. As a consequence the identification and selection of expectations also becomes important.
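The monitoring process described here, in which a hearer checks actual utterances against a set of expectations and keeps track of the deviations, might be sketched as follows. The act labels, the script, and the position-by-position comparison are assumptions chosen for brevity; a realistic monitor would need a far richer notion of expectation than a fixed sequence.

```python
def monitor(expected: list, observed: list) -> list:
    """Return (position, expected_act, observed_act) for each deviation.

    Acts are compared position by position, which is a deliberately naive
    stand-in for checking utterances against script-like expectations.
    """
    deviations = []
    for position, act in enumerate(observed):
        # Beyond the end of the script there is no expectation at all.
        anticipated = expected[position] if position < len(expected) else None
        if act != anticipated:
            deviations.append((position, anticipated, act))
    return deviations


# Hypothetical script and an exchange that departs from it.
script = ["greeting", "request", "answer", "thanks"]
exchange = ["greeting", "request", "clarification", "answer"]
found = monitor(script, exchange)
```

Here `found` records that the script is broken at the third act, where a clarification appears in place of the anticipated answer, and that everything after it is displaced as a result.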
Repair-Based Conversational Interaction
We thus see that for conversation to be coherent in a manner consistent with the observed process of conversational monitoring, the conversants must maintain adequate models of the discourse. To the extent that the conversants are having the same conversation (i.e., to the extent that the conversation is coherent), the model must be a shared one. This does not mean that the models must be identical. Rather, if conversants satisfice with respect to understanding, their conversation can be, for each of them, coherent to the extent that their models are believed to overlap. For example, the sort of conversation in which one person’s down-to-earth discussion is interpreted by the other conversant as a metaphor or parable is coherent for both participants, even though their models may share only an analogical structure. If a conversant detects too great a divergence between her model and the apparent track of the conversation, she may take remedial action to regain mutuality of conversational knowledge.
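The notion that conversants satisfice, tolerating partial overlap and taking remedial action only when divergence grows too great, can be illustrated with a small sketch. Representing a model as a set of propositions, measuring overlap as the ratio of shared to total propositions, and the tolerance value of 0.5 are all assumptions introduced here, not claims about how conversants actually assess mutuality.

```python
def overlap(own_model: set, apparent_model: set) -> float:
    """Share of propositions common to both models (1.0 if both are empty)."""
    union = own_model | apparent_model
    return len(own_model & apparent_model) / len(union) if union else 1.0


def needs_repair(own_model: set, apparent_model: set,
                 tolerance: float = 0.5) -> bool:
    """Signal remedial action when believed overlap drops below tolerance.

    The tolerance is a hypothetical parameter standing in for whatever
    counts as 'good enough' understanding for a satisficing conversant.
    """
    return overlap(own_model, apparent_model) < tolerance
```

With these assumptions, models that differ in one proposition out of three do not trigger repair, while models with nothing in common do.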
As previously noted, Grosz (1981) showed that in interactive discourse conversants have a pervasive assumption that they share a common focus. This approach in effect substitutes inference for actual mutual knowledge of focus. I note, though, that the word “assumption” may be distracting. The assumption of mutual focus is usually true. This phenomenon is associated with the highly predictive nature of discourse. The assumption fails only when the expectations are not met (and thus the focus turns out not to be mutual). How often does this occur? How do conversants minimize and/or correct these failures? Moreover, the mutual‑knowledge assumption applies not only to focus but to many (if not all) aspects of interactive discourse. Grosz specifically demonstrated the existence of the assumption for focus, but the factors which make the assumption occur with respect to focus are also present for most of the aspects of a shared model of conversation. Grosz observed that the speaker is always one step ahead of the hearer (simply because the speaker is speaking), and noted that communication only ensues if shifts in focus are in fact clearly indicated to the hearer. It is true that listeners’ predictions of what speakers say are sometimes (or even often) correct; however, listeners cannot confirm their understanding of their conversational model as mutual until their prediction has been realized. Grosz suggested that the main avenue for understanding this process is through mechanisms that distinguish the conversants’ beliefs and then reasoning about knowledge and beliefs. But generalizing the shared focus process to the shared model process, if the speaker is one step ahead of the hearer then how big are these steps? Keeping the steps small minimizes the size of the failures of expectation. This in turn keeps the mutual‑knowledge assumption true enough to obviate the need for an elaborate maintenance scheme.
Correction, Repair, and Feedback Generally Are Pervasive Phenomena of Interactive Discourse
In one sense, all interactive discourse is feedback. That is, the utterances of one conversant are recursively responsive to the utterances of the other (see
Conversation effectuating domain intentions
    Perlocutionary effect: domain-level actions and changes in belief structures
    Illocution: Austinian speech acts
    Locution: sentence‑level utterances
Conversation effectuating domain reference
    Perlocutionary effect: changes in extensional reference
    Illocution: attention‑directing speech acts
    Locution: interjected phrases, deixis
Conversation resolving illocutionary meanings
    Perlocutionary effect: agreement on meanings of language acts
    Illocution: indications of understanding or misunderstanding
    Locution: corroborative restatement, repetition
Conversation managing turn-taking
    Perlocutionary effect: agreement on who should be talking
    Illocution: interruption or indication of super‑level agreement
    Locution: start and stop signals such as directed gaze, gesture, nodding
Figure 2. Possible levels of conversational interaction. Each level represents interaction which maintains models of the levels above.
To illustrate the woven nature of these layers of interactive discourse, here is a brief excerpt from a protocol of an English-as-a-second-language lesson. The layers and the analysis are set out here for observational purposes rather than as a specific theory of linguistic interaction. This protocol is interesting because the context necessitates that the conversants arrive at new agreements about the labels and meanings and roles of various things, including illocutionary acts. The English-speaking teacher (T) and the non‑English‑speaking student (S) sit on opposite sides of a small table. Various cardboard tiles, depicting geometric shapes which are large or small, blue or red, circular or square, lie at one edge of the table.2 The teacher and the student engage in the following discourse (non‑verbal actions are described in brackets and emphasis is indicated by underlining):
(1) T: [Puts cards LBC LRC SRS in the center of the table.]
(2) T: First can you show me [Makes ‘pointing’ gestures.] the circle.
(3) T: Which one—
(4) T: [Glances down and up.] I’m sorry the square—
(5) T: which one is the square.
(6) S: [Looks confused. Looks at T, arms at side.]
(7) S: Square.
(8) T: Uh huh.
(9) S: [Points to LRC.]
(10) T: The square.
(11) S: [Points to SRS.]
(12) T: OK, that’s right.
(13) T: That’s a square. [Pointing to SRS.]
(14) T: This is a circle. [Pointing to LRC.]
(15) T: These are the circles. [Pointing to LBC and LRC.] (Novick, 1986, p. 1)
This exchange exhibits a number of interesting features which can present the reader a more concrete idea of the general role of layers of discourse and the phenomena they represent. These features include the rapid establishment of the meaning of “show” in (2); T’s self-correction in (4); T’s confirming repetition in (5); S’s initial failure to take his turn in (6); S’s indication of non-comprehension by repetition in (7); S’s purely deictic language acts in (9) and (11); T’s indication of failure by repetition in (10); and T’s holding on to her turn in (13), (14), and (15). We know that in this exchange a person is teaching English to a non-English speaker. Thus (2) encompasses both locutions and meta-locutions, and (6) is clearly some sort of communicative act but must be meta-locutionary. Interestingly, in the absence of deixis, none of these utterances can be considered standard Austinian speech acts. That is, contextually determined or physically indicated references stand in for the explicit references which would be needed for Austinian analysis. Rather, T and S demonstrate a kind of mutual control of their discourse through a heterogeneous mixture of acts, most of which appear to track the conversants’ comprehension and acceptance of previous acts.
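The features just enumerated can be collected into a small annotation table, which makes it easier to see how they are distributed over the exchange. The turn numbers, speakers, and features follow the protocol and the analysis above; the short labels, the dictionary layout, and the helper names are assumptions made for this sketch and are not part of the original coding.

```python
from collections import Counter

# Turn number -> (speaker, feature), following the analysis of the protocol.
# The label wording is shorthand invented for this sketch.
FEATURES = {
    2: ("T", "establishes meaning of 'show'"),
    4: ("T", "self-correction"),
    5: ("T", "confirming repetition"),
    6: ("S", "failure to take turn"),
    7: ("S", "non-comprehension by repetition"),
    9: ("S", "deictic act"),
    10: ("T", "failure indicated by repetition"),
    11: ("S", "deictic act"),
    13: ("T", "holding turn"),
    14: ("T", "holding turn"),
    15: ("T", "holding turn"),
}


def turns_with(feature: str) -> list:
    """Turn numbers annotated with the given feature, in order."""
    return sorted(n for n, (_, label) in FEATURES.items() if label == feature)


def count_by_speaker() -> Counter:
    """Number of annotated turns attributed to each speaker."""
    return Counter(speaker for speaker, _ in FEATURES.values())
```

For example, `turns_with("deictic act")` picks out the student's two pointing turns, and `count_by_speaker()` shows how many of the annotated turns belong to the teacher and how many to the student.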
though it turns out that we engage in this sort of feedback-saturated
conversational behavior every day, rarely are we conscious of it. Schegloff,
Why should feedback play such an important part in the process of interactive discourse? While some aspects of discourse are settled before a conversation begins, many others remain to be determined as part of the interaction itself. Aspects of discourse that are usually not the subject of correction or repair may nevertheless involve feedback, either through positive feedback indicating acceptance of normative values or in the exceptional case through repair. Thus conversants normally consider parts of discourse like the lexicon and the set of speech acts to be relatively fixed. They can indicate agreement (or at least not indicate disagreement) as long as they do not encounter new words or acts, or as long as previously encountered words or acts are used with their conventional meanings. Thus a successful speech act is based on agreement--the normal case--or on ongoing negotiation of what values shall prevail (Jernudd & Thuan, 1983). In other words, the conversants’ valuations, beliefs, and purposes may converge or conflict with each other’s. To the extent they converge, the discourse will manifest agreement; to the extent they diverge or are unclear, the discourse will manifest negotiation. Where, after all, do the meanings of things like speech acts come from? Conversants need to find out each other’s expectations of the meanings of speech acts, need to express their own such expectations, and need to find a way to agree on these. The extent to which a speaker is successful in producing a speech act depends on the extent to which the conversants agree it shall be so. This agreement depends on shared expectations of speaking and language. The conversants’ understanding of the speech act reflects the (historical) resolution of negotiation of fairly permanent expectations (Jernudd & Thuan, 1983).
Some aspects of discourse are of course not susceptible of normative predetermination. One such aspect is reference. It turns out, even in situations where both conversants can perceive the referents used in their discourse, that definite reference in interactive discourse is a collaborative process requiring actions by both speakers and hearers (Clark & Wilkes-Gibbs, 1986). Another aspect requiring feedback is the turn-taking behavior characteristic of interactive discourse described by Schegloff et al. (1977).
Other aspects of interactive discourse requiring feedback include most if not all of the structural qualities of discourse. The speaker’s generation process may even include a sort of self-feedback or “monitoring” as he listens to himself talk. With respect to repair‑oriented feedback, Jernudd and Thuan (1983) pointed out that kinds of feedback from hearers to speakers include production errors that escape the speaker’s monitor, nonreceipt of what was said, incomprehension, miscomprehension, disapproval, and perhaps more. With respect to positive feedback, Clark and Wilkes‑Gibbs (1986), along with many others, observed that sociologists have shown that when one person speaks, the others not only listen but let the speaker know they are understanding—with head nods, “yes’s,” “uh-huh’s,” and other so-called back‑channel responses.
The role of feedback seems to be linked directly to the process of language generation. Sociologists of language have observed that speakers have to have a repertoire of ways of following their own generational processes. This repertoire will involve speakers’ abilities to monitor, evaluate, and correct what they are producing even as the process takes place. They need a way of checking that what they are actually saying is consistent with what they intend to say. Additionally, they need to cope with the reactions of the hearers (Jernudd & Thuan, 1983). A large class of nonverbal behaviors is used by conversants for such feedback. Ekman and Friesen (1981) described a class of nonverbal behaviors which they termed regulators:
These are acts which maintain the back-and-forth nature of speaking and listening between two or more interactants. They tell the speaker to continue, repeat, elaborate, hurry up, become more interesting, less salacious, give the other a chance to talk, etc.... The most common regulator is the head nod, the equivalent of the verbal mm-hmm; other regulators include eye contacts, slight movements forward, small postural shifts, eyebrow raises, and a whole host of other nonverbal acts. (Ekman & Friesen, 1981, p. 90)
These behaviors convey feedback so intrinsic to interaction that conversation stops if one of the conversants suppresses them (Ekman & Friesen, 1981).
Schegloff et al. (1977) also pointed out that because of the overwhelming evidence for correction and repair in conversation, any adequate theory of the organization of natural language will have to account for how natural language handles its intrinsic troubles, including the organization of repair. In this view, repair (specifically, and, I suggest by extension, feedback generally) is an inherent part of the process of interactive language.
“Non-grammaticality” and apparent errors in discourse are thus not to be explained or erased by grammars of non-grammaticality that derive spoken language from a perfect formulation3. These characteristics of interaction are the result of performance and cannot be accounted for by extension of competence‑based sentential grammars. Rather, these “imperfections” are phenomena to be explained in and of themselves, and are thus useful objects of study in the search for scientific understanding of language.
Intention, Action, and Language
An enormous amount of work in natural language processing, and in artificial intelligence generally, assumes the existence and utility of human intentionality. This work suggests, more or less explicitly, that actions in the world are the result of humans' intentions. Speech-act theory itself is based on this sort of assumption because illocutionary acts are produced by speakers to achieve intended perlocutionary effects: use of language is a form of intentional action (Searle, 1969). Nevertheless, the relationship between intention, action and language is not well understood. For the research presented in this dissertation, two issues are particularly problematic: First, what (linguistic) behaviors are intentional? Second, how do people act on intentions to produce conversational interaction? I address each of these problems in turn.
Acts and Signals
Aside from the occasional case like someone crying out in surprise or fright, verbal acts are largely considered to be intentional. At the same time, there is a class of behaviors, including communicative behaviors, which are widely regarded as unintentional or unconscious. As I have discussed, there is a wide range of nonverbal behaviors in conversational interaction. These behaviors can be interpreted either as intentional acts or as unintentional signals. To some analysts, the majority of this massive stream of communication is unconscious on the part of the agent (Allen & Guy, 1974). To others, significant aspects of nonverbal behavior are directly intentional (Argyle & Cook, 1976; Birdwhistell, 1970). It is certainly true that the kinds of routinized behavior relevant to conversational control are in an indistinct zone with respect to intentional action. They seem to be on the periphery of awareness (Ekman & Friesen, 1981). Some communicative behaviors seem to be in the province of autonomic response; pupil dilation and contraction have been observed in response to informational content (Argyle & Cook, 1976). Nonverbal behaviors are also interpreted as involuntary because they convey or reveal things which the agent has no intention of communicating (Allen & Guy, 1974). Yet other behaviors seem to be part of the same action that we associate with production of an utterance; the rise or drop in pitch at the end of English sentences is invariably accompanied by a raising or lowering of the eyelids, head, or hands (Scheflen, 1980).
The issue is further complicated by the possibility that conversants can consciously display behaviors that would ordinarily be unconscious. This can be done for emphasis (e.g., looking up in frustration) or as a deception (e.g., looking blank to feign lack of prior knowledge of a reference).
From the standpoint of understanding nonverbal communication, the situation is equally perplexing. Actions which could be taken as cues are not always noted by the partner (Allen & Guy, 1974). For example, the extent of people’s attention or perception of gaze appears to vary widely (Argyle & Cook, 1976). In short, the role of intention in producing nonverbal communicative acts is an unsettled matter:
What is actually intentional and what is not need not by any means be the same as the way it is treated by others. Accordingly, although it is fruitless to try to decide what messages a person actually intends to convey and what he does not, how people treat each other in this regard should nevertheless be carefully attended to. That is, it is very important to consider what aspects of the flow of information participants treat as if they have been provided intentionally and what aspects they treat as if they are unintentional. As a corollary to this, it then becomes a matter of great interest to investigate which features actions must have to be treated as intentional and which they must have to be treated otherwise. To the best of my knowledge, this question remains one to be investigated systematically. (Kendon, 1981, p. 10).
There is no getting around the fact that intentionality is a difficult subject. How, then, is intention to be interpreted in a computational model of conversation? The characterization of much communicative behavior as unconscious is the product of introspective analysis in which the analyst cannot locate any specific intent or purpose for motor activity. In my view, this conclusion is the product of an ill-founded assumption that unconscious equals unintentional. In defining intentionality, for example, Ekman and Friesen (1981) specifically refer to the “deliberate” use of a nonverbal act to communicate a message to another informant, although they do note that it may not be possible to determine the intentionality of every instance of nonverbal behavior. To shed some modest light on this matter, I obtained from a variety of adult informants descriptions of their own processes of linguistic production. All felt that they had little or no conscious control over the process of actually producing speech; they could not explain how they talked. This experience, I feel, is the product of the routinized nature of linguistic production; people spend a great deal of time using language. Yet none of the informants would characterize their speech as involuntary. Even if they were unable to articulate their intentions, they surely had motivations for speaking, even on an utterance-by-utterance basis. There is no reason to distinguish the nonverbal acts associated with these utterances as any less the product of such motivations. It is not necessary to specify goals for the acts in order to describe them (
There is perhaps a reasonable analogy here between the production of language and the performance of other motor activity. If I walk from my desk to the bookshelf to get a book, I am performing an action in service of my intention to get the book. The overall intention may or may not be conscious, but certainly the individual actions which accomplish it--using my legs and feet, maintaining my balance--are not consciously performed. But neither are these actions involuntary; they are simply easy and routine. It is possible that I might move my leg reflexively if, for example, someone spilled ice-water on it. Similarly, I might blink if dust irritated my eyes. But in a purposive context, both moving my leg and blinking help me to achieve non-reflexive goals. They are the consequences of intention and constitute its embodiment. That is, while the act of walking to the bookshelf can be said to embody my intention to get the book, this “act” is a composite; it has no existence outside our interpretation of the sum of a large number of smaller acts which together produce it. Even if attenuated, intentionality must underlie each constituent sub-act. Thus while the distinction between voluntary and involuntary linguistic action is not clear, a reasonable model of conversation will interpret displayed behaviors in terms of intentional acts unless (1) they can be shown to be reflexive because of physical factors or (2) they do not occur in--or appear to be reasonably related to--a context of larger, intentional action.
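The interpretive rule stated above can be written as a small predicate. The following Python fragment is only a sketch: the `Behavior` record and its two boolean fields are hypothetical names introduced for illustration, and deciding in practice whether a behavior is physically reflexive or belongs to a larger intentional context is precisely the hard part that the sketch leaves to the observer.

```python
from dataclasses import dataclass


@dataclass
class Behavior:
    """A displayed behavior as an observer might record it (hypothetical fields)."""
    label: str
    reflexive_cause: bool          # physically triggered, e.g. dust irritating the eye
    in_intentional_context: bool   # occurs in, or is reasonably related to, a larger intentional action


def interpret_as_intentional(behavior: Behavior) -> bool:
    """Default to an intentional reading unless one of the two exceptions holds."""
    if behavior.reflexive_cause:
        return False  # exception (1): reflexive because of physical factors
    if not behavior.in_intentional_context:
        return False  # exception (2): no larger intentional action to relate it to
    return True


step = Behavior("step toward the bookshelf", reflexive_cause=False, in_intentional_context=True)
blink = Behavior("blink at dust", reflexive_cause=True, in_intentional_context=True)
```

Under these assumptions the step toward the bookshelf is read as intentional and the blink at dust is not, even though both occur while the agent is fetching the book.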
Planned vs. Situated Action
How do people's intentions produce conversational acts? More specifically, how does intention get translated into a sequence of acts that produce coherence in the organization of conversation? A strong thread in artificial intelligence has involved production of rationally organized actions through planning. Planning systems have been proposed for the production of text (McKeown, 1985) and for interactive conversation (Power, 1979; Hobbs & Evans, 1980; cf. Johnson & Robertson, 1981).4 Other computational models of interactive discourse suggest that conversants use similar processes which rely on depth-first tree search. Grosz (1981, 1982) implied this by using a stack‑based process for changing focus spaces in conversation. It should be noted, though, that Grosz does not at all claim that conversation is pre-planned. The structure of a discourse, she observed, tends to arise naturally out of the structure of the discourse task. It is the focus‑space structure, which models the conversation as it develops, that is stack-based. Reichman (1985), though, directly proposed an ATN-based model for conversational exchanges.
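To make the notion of a stack-based focus structure concrete, the following Python sketch shows the last-in, first-out discipline such a model imposes. It is a generic illustration and not a reconstruction of Grosz's system; the class name `FocusStack`, its methods, and the task names in the example are all hypothetical.

```python
from typing import List, Optional


class FocusStack:
    """Focus spaces held in strict last-in, first-out order (illustrative only)."""

    def __init__(self) -> None:
        self._spaces: List[str] = []

    def open(self, space: str) -> None:
        """Shift focus to a new space embedded in the current one."""
        self._spaces.append(space)

    def current(self) -> Optional[str]:
        """Return the space in focus, or None if nothing is open."""
        return self._spaces[-1] if self._spaces else None

    def return_to(self, space: str) -> List[str]:
        """Resume an earlier space; every space opened after it must be closed first."""
        if space not in self._spaces:
            raise ValueError(f"{space!r} is not an open focus space")
        closed = []
        while self._spaces[-1] != space:
            closed.append(self._spaces.pop())
        return closed


focus = FocusStack()
for task in ["assemble device", "attach part", "find tool"]:  # hypothetical task names
    focus.open(task)
closed = focus.return_to("assemble device")
```

Returning to the outermost task closes both embedded spaces, and they cannot be resumed without being reopened. This is the strict embedding that task-structured dialogue tends to respect and that freer conversation often does not.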
It is unlikely that simple planning or stack-based models of interactive discourse are adequate for modeling mixed-initiative conversational discourse. The importance of conversants being able to change the structure of their conversation in unforeseen, flexible ways is underlined by the observation that although some tasks produce neatly stacked discourse structures, everyday interaction requires an enormous variety of structures for which stack-like models are inadequate. The principal problem is that planning involves searching a state space, yet for most conversation the future states of the conversation are not reasonably calculable. For example, Birnbaum (1986) pointed out the case where a conversant refutes an argument on the grounds that the other conversant has used a supporting fact which is demonstrably false. To have planned this exchange from the beginning, the conversant's original state space would have had to include not only the universe of all relevant facts which support her opponent’s argument but also all other possible facts which are false as well. This unknown state space precludes the direct use of a planning model.
Power (1979) found that his stack-based planning system for conversation would run into problems because of (1) incompleteness and (2) insufficient flexibility in adjusting to changes in context:
Let us turn now to the second fault of the control stack as representation of the dialogue state: namely, insufficient explicitness. What this means is that the relations between elements of the dialogue state are not represented systematically.... The result is that the dialogue state can be interpreted just one way; it cannot be interpreted by several different procedures for several different purposes. The robots therefore cannot respond flexibly to unexpected turns in the conversation; an unexpected remark throws them completely. (Power, 1979, pp. 133-134).
To help solve this problem, Power proposed marking the elements of the dialogue state with explanatory relations that could be used inferentially to rework the plan. In effect, this would mean trying to re-plan the conversation at each state. This approach is not parsimonious and tries to graft skills for opportunism onto a fundamentally top-down structure. It does not really address the underlying issue of producing conversational organization from a system which is fundamentally flexible.
Cohen (1984), working with earlier research, reported that discourse analysis of human‑computer interaction reveals that users do not follow the strict embedding of subdialogues required by an ATN model. Rather, a more flexible “demand” model was needed. Cohen also reported research indicating that efficiency in referential communication is a function of user feedback. ATNs, as a stack-based method, are considered too rigid for even sentential grammars and thus are unlikely to be capable of representing dialogue processes (Frederking, 1988).
Indeed, the planning model has been characterized as a post-hoc rationalization of actions; it is an artifact of reasoning about actions rather than a mechanism for producing them (Suchman, 1987). Plans, Suchman suggests, are simply a restatement of intention.5 She proposes that the coherence of action is not adequately explained by either stored plans or scripts; rather, the organization of the conversants’ actions is an emergent property of moment-by-moment interactions between actions, and between conversants and their context. That is, global coherence is the result of situated application of locally meaningful operators. This extends Birnbaum’s (1986) notion of opportunistic planning.
This does not preclude the use by conversants of high-level reasoning about their context and actions they might take to achieve their goals. This is, after all, a large component of what we perceive as conscious thought. Therefore, as I understand the implications of Suchman's thesis, we can produce conscious plans (based on some known state-space), act on them, and then react to changed or unanticipated circumstances as needed. More typically, we do not formulate an explicit plan; rather, we take some initial action to achieve our goal, and thereby create a set of expectations about what will follow. It does not matter (from the standpoint of conversational control) if the expectations are not met because we can again produce from our intentions a new action which is responsive to the new situation. To the extent that our expectations are met, we can routinize the selection and application of operators.
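The cycle described here (act, form an expectation, observe, and act again from the same intention) can be sketched as a loop that never computes a sequence of acts in advance. The Python below is a minimal illustration under stated assumptions: the act labels, the response labels, and the functions `choose_act` and `run` are hypothetical and are not drawn from Suchman or from any implemented system.

```python
from typing import List, Tuple


def choose_act(goal: str, situation: str) -> Tuple[str, str]:
    """Pick one act for the current situation and the response it leads us to expect."""
    if situation == "comply":
        return ("acknowledge", "done")
    if situation == "clarify?":
        return (f"rephrase:{goal}", "comply")
    # At the start, or in any unanticipated situation, act directly on the goal.
    return (f"request:{goal}", "comply")


def run(goal: str, partner_responses: List[str]) -> List[Tuple[str, bool]]:
    """Act, observe, re-select. Each act depends only on the goal and the latest situation."""
    trace = []
    situation = "start"
    for response in partner_responses:
        act, expected = choose_act(goal, situation)
        trace.append((act, response == expected))  # was the expectation met?
        situation = response                       # the response becomes the new context
    return trace


trace = run("hand over the book", ["clarify?", "comply"])
```

In this run the first expectation is not met, since the partner asks for clarification instead of complying. Nothing has to be re-planned: the same goal applied to the new situation yields a rephrased request, which then succeeds.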
2. In the transcript, the notation refers to L(arge) or S(mall), B(lue) or R(ed), C(ircle) or S(quare) shapes. Thus the LBC is the large blue circle, the LRC is the large red circle, and the SRS is the small red square.
3. Cohen (1984) presents a brief account of grammar-based approaches to ill-formed input.
4. Appelt (1981, 1985) also proposed planning models for conversational discourse. This work, however, principally concerned intra-utterance planning rather than inter-utterance conversational planning.
5. Suchman also rejects speech-act models as begging the question of situated interpretation. As I attempt to show in this dissertation, speech acts and situated action are not incompatible. Speech acts can be viewed as the product of contextually sensitive operators.