CHAPTER II
Survey of Related Work
The Integration Problem
In
addressing the question of what language is,
speech-act philosophy was spectacularly successful. It solved many problems in
linking language to the world and explained why people use language. Speech‑act
theory has been useful as a basis for some aspects of computational modeling of
interactive discourse, particularly in matters relating to intention and
planning (see, e.g., Power, 1979; Cohen & Perrault, 1979; Allen &
Perrault, 1980; Cohen, 1981). Speech acts could explain in terms of
goal-directed behavior how conversants coördinated when and what they said. Speech-act
theory reintroduced communication and purpose into analysis of planning in
language. It did not, however, prove as successful in explaining how language
is used. That is, the feedback‑suffused protocol evidence of actual
speakers was often difficult or impossible to explicate using speech acts
(Cohen, 1984; Clark & Wilkes-Gibbs, 1986; Clark & Schaefer, in press). This
meant that things like the coördination of real turn-taking could not be
modeled with speech acts because the turns frequently did not contain
utterances which could be classified as speech acts.
At
the same time, socio-linguistic research (see e.g., Schegloff, Sacks &
Jefferson, 1977) was uncovering the very linguistic behavior which posed
problems for speech-act theory. However, these researchers did not propose computational
models. Clark and Wilkes-Gibbs (1986) noted that for both sociologists and
speech-act philosophers, the central issue is coördination: How do conversants
coördinate on the content and timing of what is meant and understood? The two
divergent research traditions had approached the same problems from different
directions, equally unsuccessful:
The issue cannot be resolved by either tradition alone. In the [speech‑act] tradition conversation is idealized
as a succession of illocutionary acts—assertions, questions, promises—each
uttered and understood clearly and completely [citations omitted]. Yet from the
[sociological] tradition we know that many utterances remain incomplete and
only partly understood until corrected or amplified in further exchanges. How
are these two views to be reconciled? (Clark & Wilkes-Gibbs, 1986, p. 2)
This
problem, which I call the integration problem, also bothered Cohen (1984) in
his examination of the use of referring expressions in
interactive discourse. According to Searle, to perform an illocutionary act an
act of predication is required, and the predicate must be uttered. How then,
Cohen asked, can one explain the following dialogue where there is no apparent
predication for a completed utterance containing only a noun phrase:
A: Now, the small blue cap we talked about before?
B: Uh‑huh.
A: Put that over the hole on the side of that tube .... (Cohen, 1984, p. 104)
The
only apparent effect of A’s initial utterance can be to direct B’s attention to
the referent. A does not say anything about the referent here. Equally
perplexing, Cohen argues, are predications without expressed referents. For
example, Sherlock Holmes might lean over a body on the floor and exclaim to
Watson “Dead.” Conversely, Searle would claim that utterances like “There is a
little yellow piece of rubber.” contain no act of referring at all and are
simply predications. The problem, then, is again that the traditional view of
speech acts cannot be reconciled with the linguistic evidence. Cohen noted that
the requirement that the act of referring be jointly located with some
predication in a sentence or illocutionary act is too restrictive. The
functions of reference and predication can be embodied in separate utterances
which have, in effect, different though related perlocutionary effects. His
solution to this particular part of the problem was to break down Searle’s
unitary acts into separate acts with separate functions. Thus he suggested that
referring, in and of itself, is a kind of request that speakers make of hearers
(i.e., to direct attention). But where lies the
solution to the general problem of integrating speech acts with the actual
character of interactive discourse? The principle of breaking down exchanges
involving reference into segments characterized by their goals might be
extended to the general phenomenon of interactive discourse by recognizing that
each utterance fragment reflects its own proper purposes.
Characteristics of Conversation
The
socio-linguistic line of research on interaction is part of a large field which
is broadly known as conversational analysis. This research has described a wide
range of conversational characteristics which are not directly representable in
Austinian speech-act theory. These characteristic behaviors include laughter,
fragmentation, turn-taking, correction, gaze, and nonverbal communication
generally.
The
prevalence in conversation of laughter and fragmentation was shown by Allen and
Guy (1974). There are distinct regularities associated with these phenomena. For
example, in conversations between a male and a female, the male tends to laugh
twice as often as the female. Although fascinating, laughter is really beyond
the scope of the present research. Fragmentation, however, is one of the
central characteristics of language for which this dissertation attempts to
account. This includes uncompleted sentences and words, plus non‑word
utterances such as “uh,” “ah,” and “eh.” Fragmentation is sometimes associated
with correction (Allen & Guy, 1974). More generally, fragments represent
sub-lexical, incomplete sentential thoughts, or complete conversational
contributions which are expressed in non-sentential form. These include, for
example, referential and confirmatory utterances such as “OK, now, the small
blue cap we talked about before?” (Cohen, 1984, p. 104) and
“Uh-huh.” That is, while fragmentation sometimes results from
correction, it is a general functional characteristic of natural discourse. Thus
despite their part-formedness, fragments are used in
utterances which are nevertheless understood by conversants.
Turn-taking
(Sacks, Schegloff, & Jefferson, 1974; Duncan, 1980) is a control-oriented
account of conversation that explicates patterns of mixed-initiative
interaction in terms of conversational turns. This represents an organizational
substrate for domain‑level, intentionality‑based interaction. It is
maintained through a wide set of behaviors and acts, including nonverbal
communication (Ekman & Friesen, 1981).
In
conversation, repair of language and interaction has been observed in two
forms: self- and other-correction. There is an apparent preference for
self-correction (Schegloff et al., 1977; Clark & Wilkes-Gibbs, 1986), which
leads to fragmentation of utterances as speakers in effect erase parts of their
utterances and replace them with substitute phrases. As a sub-sentential
phenomenon, self-correction is not accounted for by speech-act theory. Other-correction,
when lexical, for example, presents the same difficulty.
The
role of gaze in conversation has been the subject of extensive—though largely
descriptive—analysis. Gaze performs multiple functions in human-human
interaction (Argyle & Cook, 1976; Argyle, Ingham, Alkema, & McCallin,
1981), and can be described in terms of its temporal association with language
use in dialogue (Beattie, 1981). Gaze was found to be organized in a
coördinated system with the plans underlying speech and with the speech flow. Gaze
was more highly associated with turn-yielding cues than simply with syntactic
clause boundaries.
Gaze
is a significant part of a broader range of behaviors generally considered as
nonverbal communication. Extensive systems have been developed for noting
physical states and actions associated with language (see, e.g., Birdwhistell,
1970; De Long, 1983). Indeed, there is much evidence in support of the
proposition that nonverbal behaviors are part of language. For example, Hoffer
and
All of the emerging data seem to me to support the
contention that linguistics and kinesics are infracommunicational systems. Only
in their interrelationship with each other and with comparable systems from
other sensory modalities are the emergent communication systems achieved. (Birdwhistell,
1970, p. 127)
Clearly,
it would be difficult to explicate coherence in interactive discourse without
taking account of these characteristics of language as it is used; this is the
heart of the integration problem. Why, though, do conversants rely on these
functions? In the next two sections, I address this question by looking at
models of mixed-initiative conversation which would explain the use and
necessity of these behaviors.
Evidence for Shared Models of Conversations
This
section examines the role of feedback as a process control for interactive
discourse by discussing the following questions: How closely does the hearer
track the speaker’s utterances against their shared assumptions about the
conversation? How do conversants minimize and/or correct expectation failures?
The
evidence is strong for the proposition that conversants in interactive discourse
share a model of their conversation. In informal terms, a weak version of this
conjecture would be that the conversants must have at least some knowledge
necessary to the conversation which is common to all conversants. A strong
version is that the conversants are jointly creating a single (though possibly
complex) intellectual product. Suchman (1987) characterizes conversation as an
“ensemble” work:
Closer analyses of face‑to‑face
communication indicate that conversation is not so much an alternating series
of actions and reactions between individuals as it is a joint action
accomplished through the participants’ continuous engagement in speaking and
listening [references omitted]. (Suchman, 1987, p. 71)
Clearly,
speakers of a language necessarily share lexical knowledge, even if their
internal representations of the lexicon differ in their extension. To what
extent, then, must conversants construct or share a mutual model of their
jointly‑created conversation?
Implications of the Interactive Process for Shared Models
Observation
of interactive discourse suggests that when conversants share information,
their exchanges cover the range of conversational levels, from domain
information to turn-taking. As has often been noted, one of the central questions
of interactive discourse is how the conversants coördinate the timing of what
is meant and understood (see e.g., Clark & Wilkes-Gibbs, 1986). Clearly, a
conversational model common to the conversants would facilitate this
coördination. This process has been characterized as one of coöperation. The
fact of coöperation would seem to presuppose something to coöperate about. According
to Clark and Wilkes-Gibbs (1986), Grice (1975) observed that conversants
coöperate in their contributions to a conversation by directing their
contributions toward the accepted purpose or direction of their exchange.
Even
in the simplest cases, though, it is apparent that some kind of mutually
understood model of the conversation is created. Consider the railway station
protocols discussed by Allen and Perrault (1980):
Patron: The
Clerk: Gate 10. (Allen & Perrault, 1980, p. 442)
Using
the notion of mutual knowledge developed by Clark and Marshall, the clerk and
the patron mutually know that the patron has made an inquiry about the
Cohen
and Perrault (1979) also described a model for possible intentions underlying
speech acts. Cohen and Perrault were interested in providing a theory of speech
acts which, among other things, answered questions about changes the successful
performance of a speech act makes in the speaker’s model of the hearer and in
the hearer’s model of the speaker. They proposed a planning approach which
provided formal criteria for defining speech acts in terms of intentions,
abilities, and effects. Acts of requesting and informing were then specified as
operators which can be used by a planning system. This model necessarily
assumed a speaker’s model of the hearer and vice versa. To the extent that these
are self-referential we have at least the raw basis for a shared model, whether
or not the authors recognized it as such.
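The planning view of speech acts can be illustrated with a minimal sketch. It assumes a STRIPS-like encoding in which a state is a set of proposition strings; the `Operator` class, the predicate spellings, and the particular preconditions and effects chosen for `inform` and `request` are illustrative assumptions and are not Cohen and Perrault's formal definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A speech act treated as a planning operator (hypothetical encoding)."""
    name: str
    preconditions: frozenset
    effects: frozenset

    def applicable(self, state):
        # The act can be performed only if every precondition holds in the state.
        return self.preconditions <= state

    def apply(self, state):
        # Performing the act adds its effects to the state.
        if not self.applicable(state):
            raise ValueError(f"{self.name}: preconditions not satisfied")
        return frozenset(state) | self.effects


def inform(speaker, hearer, prop):
    # Assumed precondition: the speaker believes prop.
    # Assumed effect: the hearer believes that the speaker believes prop.
    return Operator(
        name=f"INFORM({speaker},{hearer},{prop})",
        preconditions=frozenset({f"believes({speaker},{prop})"}),
        effects=frozenset({f"believes({hearer},believes({speaker},{prop}))"}),
    )


def request(speaker, hearer, act):
    # Assumed precondition: the speaker wants the act done.
    # Assumed effect: the hearer believes that the speaker wants it done.
    return Operator(
        name=f"REQUEST({speaker},{hearer},{act})",
        preconditions=frozenset({f"wants({speaker},{act})"}),
        effects=frozenset({f"believes({hearer},wants({speaker},{act}))"}),
    )


state = frozenset({"believes(S,p)"})
state_after = inform("S", "H", "p").apply(state)
```

The nested `believes(H,believes(S,p))` term is the point of the sketch: the effect of the act is stated in terms of the hearer's model of the speaker, which is the raw material for a shared model.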
To
be coherent, a conversation must be about something. It follows that the
discourse segments which make up a conversation must have corresponding foci. The
notion of focus as a characteristic of interactive discourse was developed by
Grosz (1977). These studies were primarily based on task-oriented dialogues. In
the context of shared models, a practical interpretation of Grosz’s work is
that focus really means the thing on which the conversants are mutually
focusing; otherwise the conversation would lose its coherence (absent recurring
and unbelievable coïncidence). Grosz’s idea of focus space implies that this is
a shared view or partitioning of the domain; indeed, discourse presumes mutual
focus. Another way of looking at focus spaces might be to consider them as
shared models of the real domain which provide necessary (1) common extensional
meaning and (2) reference points for coherent discourse structure. Similarly,
in the protocols studied by Clark and Wilkes-Gibbs (1986), the conversants
(have to) find a mutually acceptable perspective. As I see this result, the
parties have established an interpretive model of the real domain that
is suitable for achieving their underlying
intentions through discourse.
Of
course, the role of focus and perspective need not be limited to physical
domain objects, definite referents of any kind, or even pragmatics. It has been
noted that a successful speech act is based on agreement or on continuing
negotiation of what values should prevail (Jernudd & Thuan, 1983). That is,
the conversants must arrange to share the meanings of their illocutionary acts,
even if the process of understanding doesn’t require that the acts be
recognized explicitly. In order to maintain coherence, then, conversants have
(or are trying to obtain) a shared model of at least the semantics of the
illocutionary acts. This sort of negotiation may also be applicable (and
probably much more frequently) to the pragmatics of a conversation.
The Role of Meta-Locution in Conversation
Participants
in interactive discourse presumably engage in their interaction because of
underlying intentions. While they may have private intentions which motivate
their communication, they present to each other apparent discourse purposes
(Grosz & Sidner, 1986). Thus if the discourse purposes are not apparent,
their meanings must be clarified. This secondary discourse is a collaborative
process which is meta to the subject of the
conversation. Similarly, negotiation of referential meanings becomes meta to the conversation. For the meanings of illocutionary
acts, the evidence of negotiation is indirect. Jernudd and Thuan (1983)
suggested that speech acts, to be successful in language, require agreement by
the conversants on the meaning of the acts. They observed that partners in
communication generally coöperate in the (meta-) communicative goal that the
speaker’s speech act is identical to that understood
by the hearer. In other words, if both conversants are striving to make sense
of their conversation, they will try to get the speaker’s and hearer’s
identification of the speaker’s act to match up.
In
the same way that the semantics of illocutionary acts are subject to
negotiation and agreement, so too is reference a collaborative process. This
was shown experimentally by Clark and Wilkes-Gibbs (1986). The conversants must
mutually accept that the hearer understands the reference before the
conversation proceeds. As I understand this work relative to possible shared
models of the conversation itself, there is an incremental process of building
a mutually understood reference scheme that constitutes a shared model at least
as to extensional meanings.
Consistent
with an interpretation of Clark and Wilkes-Gibbs’s work showing the negotiation
of the topic of a conversation (and thus suggesting a shared conversational
model) is Grosz’s view that the knowledge of the participants in discourse can
be characterized by a common structure. She suggested that one can model focus
in discourse as a partitioning of a semantic net which encodes the domain of
discourse (Grosz, 1977). This network represents one conversant's model of the
conversation, and it therefore includes that part of the conversation which the
conversant believes constitutes the shared conversational structure. Of course,
the conversants’ respective models need not be and are not likely to be
identical. This is like
Grosz’s
analysis looked at explicit indicators of shifts in focus; a similar analysis
could be applied to other utterances or acts which have specific
model-maintenance functions, and even to functions which conversationally
propose lack of mutuality. Such model-maintaining utterances are what I
have called meta-locutions, which correspondingly embody meta-illocutionary acts. In other words, the intended
perlocutionary effects of such acts concern the process of conversation itself
rather than the underlying discourse purposes. The utterances are in this sense
meta-locutions because they are about the process of the very conversation in
which they occur. For example, an utterance like “Go on ....” can be seen as a
meta-locutionary act in which the illocutionary meaning is something like “I’m
asserting that I don’t want to repair anything here and I’m letting you keep
your turn.” Similarly, looking away while talking could be construed as a
meta-locutionary act such as “I’m holding on to my turn.” Such functions of
gaze in particular have been recognized as providing meta-information used in
conversational control. People link use of their verbal‑auditory and
nonverbal-visual channels:
(1) Since the vocal channel requires that people
take turns to speak, signals must be used to negotiate turn‑taking; it
would be difficult to contain these signals in the vocal channel—speakers would
have to combine messages and meta-messages, and listeners would have to speak
at the same time. Therefore the synchronizing signals are forced into the
second channel. (2) While a person is speaking he needs feedback on how others
are reacting; this could be provided by vocal comments, but that would involve
double-talking, so feedback signals are also relegated to the second channel.
(Argyle & Cook, 1976, p. 124)
The
relatively broad range of behaviors which constitute
communicative interaction, then, suggests a pervasive and central role for meta-locution
in the control of conversation.
Interactive Discourse Requires Monitoring by All Conversants
Jernudd
and Thuan (1983, p. 81) reasonably contended that language is usually
expectation driven: “Norms of use are founded on expectations that users form. Obviously,
interaction proceeds mainly in worn grooves and these generate reasonable
expectations.” Applied to the maintenance of shared
conversational models, this principle suggests that conversants often share a large part
of their conversational model automatically or by default. A consequence of
this is that if a conversation is to be coherent each conversant must have a
set of expectations which is consistent with the others’. These sorts of
structures have been recognized when conventionalized as analogous to the
well-known script models of conversations (Cohen, 1984). Yet not all, and maybe
not even most, conversations follow the script exactly. Otherwise we
would find, contrary to experience, that we are always having the same
conversations or fragments over and over again. This means that along with the
shared set of expectations, conversants must detect and maintain the set of
deviations from the well-worn expectational grooves. Thus
Clark and Wilkes-Gibbs (1986) noted the prevalence of conversational feedback:
the hearer lets the speaker know how things are going. This implies that the
hearer has a model of what the speaker is trying to say. This process of
monitoring thus apparently involves the hearer checking the actual utterances
of the speaker for consistency against a set of expectations. The deviations
from the expectations are, as I’ve observed, frequent. Moreover, no small set
of expectations could possibly cover the variety of conversations in which we
might and do engage. As a consequence the identification and selection of
expectations also becomes important.
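The monitoring process described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: expectations and observed utterances are represented as simple feature dictionaries, and the function names `monitor` and `feedback`, the feature names, and the `repair:` label are hypothetical and introduced only for this example.

```python
def monitor(expected, observed):
    """Check an observed utterance against the hearer's expectations.

    Returns a list of (feature, expected_value, observed_value) tuples,
    one for each expected feature that the utterance fails to match.
    """
    deviations = []
    for feature, want in expected.items():
        got = observed.get(feature)
        if got != want:
            deviations.append((feature, want, got))
    return deviations


def feedback(deviations):
    """Choose a feedback act: acknowledge if nothing deviates, else repair."""
    if not deviations:
        return "uh-huh"
    # Address the first detected deviation.
    feature = deviations[0][0]
    return f"repair:{feature}"


expected = {"topic": "square", "speaker": "T"}
on_track = monitor(expected, {"topic": "square", "speaker": "T"})
off_track = monitor(expected, {"topic": "circle", "speaker": "T"})
```

In this sketch positive feedback is the default case and repair is the exception, which mirrors the observation that deviations are frequent but are detected against a background of shared expectations.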
Repair-Based Conversational Interaction
We
thus see that for conversation to be coherent in a manner consistent with the
observed process of conversational monitoring, the conversants must maintain
adequate models of the discourse. To the extent that the conversants are having
the same conversation (i.e., to the extent that the conversation is coherent),
the model must be a shared one. This does not mean that the models must be
identical. Rather, if conversants satisfice with
respect to understanding, their conversation can be, for each of them, coherent
to the extent that their models are believed to overlap. For example, the sort
of conversation in which one person’s down-to-earth discussion is interpreted
by the other conversant as a metaphor or parable is coherent for both
participants, even though their models may share only an analogical structure. If
a conversant detects too great a divergence between her model and the apparent
track of the conversation, she may take remedial action to regain mutuality of
conversational knowledge.
As
previously noted, Grosz (1981) showed that in interactive discourse conversants
have a pervasive assumption that they share a common focus. This approach in
effect substitutes inference for actual mutual knowledge of focus. I note,
though, that the word “assumption” may be distracting. The assumption of mutual
focus is usually true. This phenomenon is associated with the highly predictive
nature of discourse. The assumption fails only when the expectations are not
met (and thus the focus turns out not to be mutual). How often does this occur?
How do conversants minimize and/or correct these failures? Moreover, the mutual‑knowledge
assumption applies not only to focus but to many (if
not all) aspects of interactive discourse. Grosz specifically demonstrated the
existence of the assumption for focus, but the factors which make the
assumption occur with respect to focus are also present for most of the aspects
of a shared model of conversation. Grosz observed that the speaker is always
one step ahead of the hearer (simply because the speaker is speaking), and
noted that communication only ensues if shifts in focus are in fact clearly
indicated to the hearer. It is true that listeners’ predictions of what
speakers say are sometimes (or even often) correct; however, listeners cannot
confirm their understanding of their conversational model as mutual until their
prediction has been realized. Grosz suggested that the main avenue for
understanding this process is through mechanisms that distinguish the
conversants’ beliefs and then reasoning about knowledge and beliefs. But
generalizing the shared focus process to the shared model process, if the
speaker is one step ahead of the hearer then how big are these steps? Keeping
the steps small minimizes the size of the failures of expectation. This in turn
keeps the mutual knowledge assumption true enough to obviate the need for an
elaborate maintenance scheme.
Correction, Repair, and Feedback Generally Are Pervasive Phenomena of Interactive Discourse
In
one sense, all interactive discourse is feedback. That is, the utterances of
one conversant are recursively responsive to the utterances of the other (see
e.g.,
Conversation effectuating domain intentions
    Perlocutionary effect: domain-level actions and changes in belief structures
    Illocution: Austinian speech acts
    Locution: sentence-level utterances
Conversation effectuating domain reference
    Perlocutionary effect: changes in extensional reference
    Illocution: attention-directing speech acts
    Locution: interjected phrases, deixis
Conversation resolving illocutionary meanings
    Perlocutionary effect: agreement on meanings of language acts
    Illocution: indications of understanding or misunderstanding
    Locution: corroborative restatement, repetition
...
Conversation managing turn-taking
    Perlocutionary effect: agreement on who should be talking
    Illocution: interruption or indication of super-level agreement
    Locution: start and stop signals such as directed gaze, gesture, nodding
Figure 2. Possible levels of conversational interaction. Each level represents interaction which maintains
models of the levels above.
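The layering in Figure 2 can be written down as a small data structure. The sketch below is illustrative only: the `Level` class, its field names, and the helper `levels_maintaining` are assumptions introduced for this example, and the list contains only the four levels the figure names, not the levels it elides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """One level of conversational interaction, as listed in Figure 2."""
    purpose: str
    perlocutionary_effect: str
    illocution: str
    locution: str


# Ordered from top to bottom; the levels elided in the figure are not represented.
LEVELS = [
    Level("effectuating domain intentions",
          "domain-level actions and changes in belief structures",
          "Austinian speech acts",
          "sentence-level utterances"),
    Level("effectuating domain reference",
          "changes in extensional reference",
          "attention-directing speech acts",
          "interjected phrases, deixis"),
    Level("resolving illocutionary meanings",
          "agreement on meanings of language acts",
          "indications of understanding or misunderstanding",
          "corroborative restatement, repetition"),
    Level("managing turn-taking",
          "agreement on who should be talking",
          "interruption or indication of super-level agreement",
          "start and stop signals such as directed gaze, gesture, nodding"),
]


def levels_maintaining(index):
    """Return the levels listed below LEVELS[index].

    Following the figure caption, these are the levels whose interaction
    maintains models of the level at `index`.
    """
    return LEVELS[index + 1:]
```

Under this reading, the domain-intention level is maintained by every level beneath it, while the turn-taking level has nothing beneath it in the list.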
To
illustrate the woven nature of these layers of interactive discourse, here is a
brief excerpt from a protocol of an English-as-a-second-language lesson. The
layers and the analysis are set out here for observational purposes rather than
as a specific theory of linguistic interaction. This protocol is interesting
because the context necessitates that the conversants arrive at new agreements
about the labels and meanings and roles of various things, including
illocutionary acts. The English-speaking teacher (T) and the non‑English‑speaking
student (S) sit on opposite sides of a small table. Various cardboard tiles, depicting
geometric shapes which are large or small, blue or red, circular or square, lie
at one edge of the table.2 The teacher and the student engage in the
following discourse (non‑verbal actions are described in brackets and
emphasis is indicated by underlining):
(1) T: [Puts cards LBC LRC SRS in the center of the table.]
(2) T: First can you show me [Makes ‘pointing’ gestures.] the circle.
(3) T: Which one—
(4) T: [Glances down and up.] I’m sorry the square—
(5) T: which one is the square.
(6) S: [Looks confused. Looks at T, arms at side.]
(7) S: Square.
(8) T: Uh huh.
(9) S: [Points to LRC.]
(10) T: The square.
(11) S: [Points to SRS.]
(12) T: OK, that’s right.
(13) T: That’s a square. [Pointing to SRS.]
(14) T: This is a circle. [Pointing to LRC.]
(15) T: These are the circles. [Pointing to LBC and LRC.] (Novick, 1986, p. 1)
This
exchange exhibits a number of interesting features which can present the reader
a more concrete idea of the general role of layers of discourse and the phenomena
they represent. These features include the rapid establishment of the meaning
of “show” in (2); T’s self-correction in (4); T’s confirming repetition in (5);
S’s initial failure to take his turn in (6); S’s indication of
non-comprehension by repetition in (7); S’s purely deictic language acts in (9)
and (11); T’s indication of failure by repetition in (10); and T’s holding on
to her turn in (13), (14), and (15). We know that in this exchange a person is
teaching English to a non-English speaker. Thus (2) encompasses both locutions
and meta-locutions, and (6) is clearly some sort of communicative act but must
be meta-locutionary. Interestingly, in the absence of deixis, none of these
utterances can be considered standard Austinian speech acts. That is,
contextually determined or physically indicated references stand in for the
explicit references which would be needed for Austinian analysis. Rather, T and
S demonstrate a kind of mutual control of their discourse through a
heterogeneous mixture of acts, most of which appear to track the conversants’
comprehension and acceptance of previous acts.
Even
though it turns out that we engage in this sort of feedback-saturated
conversational behavior every day, rarely are we conscious of it. Schegloff,
Why
should feedback play such an important part in the process of interactive
discourse? While some aspects of discourse are settled before a conversation
begins, many others remain to be determined as part of the interaction itself. Aspects
of discourse that are usually not the subject of correction or repair may nevertheless
involve feedback, either through positive feedback indicating acceptance of
normative values or in the exceptional case through repair. Thus conversants
normally consider parts of discourse like the lexicon and the set of speech
acts to be relatively fixed. They can indicate agreement (or at least not
indicate disagreement) as long as they do not encounter new words or acts, or
as long as previously encountered words or acts are used with their
conventional meanings. Thus a successful speech act is based on agreement--the
normal case--or on ongoing negotiation of what values shall prevail (Jernudd
& Thuan, 1983). In other words, the conversants’ valuations, beliefs, and
purposes may converge or conflict with each other’s. To the extent they converge,
the discourse will manifest agreement; to the extent they diverge or are
unclear, the discourse will manifest negotiation. Where, after all, do the
meanings of things like speech acts come from? Conversants need to find out
each other’s expectations of the meanings of speech acts, need to express their
own such expectations, and need to find a way to agree on these. The extent to
which a speaker is successful in producing a speech act depends on the extent
to which the conversants agree it shall be so. This agreement depends on shared
expectations of speaking and language. The conversants’ understanding of the
speech act reflects the (historical) resolution of negotiation of fairly
permanent expectations (Jernudd & Thuan, 1983).
Some
aspects of discourse are of course not susceptible of normative
predetermination. One such aspect is reference. It turns out, even in
situations where both conversants can perceive the referents used in their
discourse, that definite reference in interactive discourse is a collaborative
process requiring actions by both speakers and hearers (Clark &
Wilkes-Gibbs, 1986). Another aspect requiring feedback is the turn-taking
behavior characteristic of interactive discourse described by Schegloff et al.
(1977).
Other
aspects of interactive discourse requiring feedback include most if not all of
the structural qualities of discourse. The speaker’s generation process may
even include a sort of self-feedback or “monitoring” as he listens to himself
talk. With respect to repair‑oriented feedback, Jernudd and Thuan (1983)
pointed out that kinds of feedback from hearers to speakers include production
errors that escape the speaker’s monitor, nonreceipt
of what was said, incomprehension, miscomprehension, disapproval, and perhaps
more. With respect to positive feedback, Clark and Wilkes‑Gibbs (1986),
along with many others, observed that sociologists have shown that when one
person speaks, the others not only listen but let the speaker know they are
understanding—with head nods, “yes’s,” “uh-huh’s,”
and other so-called back‑channel responses.
The
role of feedback seems to be linked directly to the process of language generation.
Sociologists of language have observed that speakers have to have a repertoire
of ways of following their own generational processes. This repertoire will
involve speakers’ abilities to monitor, evaluate, and correct what
they are producing even as the process takes place. They need a way of checking
that what they are actually saying is consistent with what they intend to say. Additionally,
they need to cope with the reactions of the hearers (Jernudd & Thuan,
1983). A large class of nonverbal behaviors is used by conversants for such
feedback. Ekman and Friesen (1981) described a class of nonverbal behaviors
which they termed regulators:
These are acts which maintain the back-and-forth
nature of speaking and listening between two or more interactants.
They tell the speaker to continue, repeat, elaborate, hurry up, become more
interesting, less salacious, give the other a chance to talk, etc.... The most
common regulator is the head nod, the equivalent of the verbal mm-hmm; other
regulators include eye contacts, slight movements forward, small postural
shifts, eyebrow raises, and a whole host of other nonverbal acts. (Ekman &
Friesen, 1981, p. 90)
These
behaviors convey feedback so intrinsic to interaction that conversation stops
if one of the conversants suppresses them (Ekman & Friesen, 1981).
Schegloff
et al. (1977) also pointed out that because of the overwhelming evidence for
correction and repair in conversation, any adequate theory of the organization
of natural language will have to account for how natural language handles its
intrinsic troubles, including the organization of repair. In this view, repair
(specifically, and, I suggest by extension, feedback generally) is an inherent
part of the process of interactive language.
“Non-grammaticality”
and apparent errors in discourse are thus not to be explained or erased by
grammars of non-grammaticality that derive spoken language from a perfect
formulation.3 These characteristics of interaction are the result of
performance and cannot be accounted for by extension of competence‑based
sentential grammars. Rather, these “imperfections” are phenomena to be
explained in and of themselves, and are thus useful
objects of study in the search for scientific understanding of language.
Intention, Action, and Language
An
enormous amount of work in natural language processing, and in artificial
intelligence generally, assumes the existence and utility of human
intentionality. This work suggests, more or less explicitly, that actions in
the world are the result of humans' intentions. Speech-act theory itself is
based on this sort of assumption because illocutionary acts are produced by
speakers to achieve intended perlocutionary effects: use of language is a form
of intentional action (Searle, 1969). Nevertheless, the relationship between
intention, action and language is not well understood. For the research
presented in this dissertation, two issues are particularly problematic: First,
what (linguistic) behaviors are intentional? Second, how do people act on
intentions to produce conversational interaction? I address each of these
problems in turn.
Acts and Signals
Aside
from the occasional case like someone crying out in surprise or fright, verbal
acts are largely considered to be intentional. At the same time, there is a
class of behaviors, including communicative behaviors, which are widely
regarded as unintentional or unconscious. As I have discussed, there is a wide
range of nonverbal behaviors in conversational interaction. These behaviors can
be interpreted either as intentional acts or as unintentional signals. To some
analysts, the majority of this massive stream of communication is unconscious
on the part of the agent (Allen & Guy, 1974). To others, significant
aspects of nonverbal behavior are directly intentional (Argyle & Cook,
1976; Birdwhistell, 1970). It is certainly true that the kinds of routinized
behavior relevant to conversational control are in an indistinct zone with
respect to intentional action. They seem to be on the periphery of awareness
(Ekman & Friesen, 1981). Some communicative behaviors seem to be in the
province of autonomic response; pupil dilation and contraction have been
observed in response to informational content (Argyle & Cook, 1976). Nonverbal
behaviors are also interpreted as involuntary because they convey or reveal
things which the agent has no intention of communicating (Allen & Guy,
1974). Yet other behaviors seem to be part of the same action that we associate
with production of an utterance; the rise or drop in pitch at the end of
English sentences is invariably accompanied by a raising or lowering of the
eyelids, head, or hands (Scheflen, 1980).
The
issue is further complicated by the possibility that conversants can
consciously display behaviors that would ordinarily be unconscious. This can be
done for emphasis (e.g., looking up in frustration) or as a deception (e.g.,
looking blank to feign lack of prior knowledge of a reference).
From
the standpoint of understanding nonverbal communication, the situation is
equally perplexing. Actions which could be taken as cues are not always noted
by the partner (Allen & Guy, 1974). For example, the extent of people’s attention
or perception of gaze appears to vary widely (Argyle & Cook, 1976). In
short, the role of intention in producing nonverbal communicative acts is an
unsettled matter:
What is actually intentional and what is not
need not by any means be the same as the way it is treated by others. Accordingly,
although it is fruitless to try to decide what messages a person actually
intends to convey and what he does not, how people treat each other in this
regard should nevertheless be carefully attended to. That is, it is very
important to consider what aspects of the flow of
information participants treat as if they have been provided intentionally and
what aspects they treat as if they are unintentional. As a corollary to this,
it then becomes a matter of great interest to investigate which features
actions must have to be treated as intentional and which they must have to be
treated otherwise. To the best of my knowledge, this question remains one to be
investigated systematically. (Kendon, 1981, p. 10).
There
is no getting around the fact that intentionality is a difficult subject. How,
then, is intention to be interpreted in a computational model of conversation? The
characterization of much communicative behavior as unconscious is the product
of introspective analysis in which the analyst cannot locate any specific
intent or purpose for motor activity. In my view, this conclusion is the
product of an ill-founded assumption that unconscious equals unintentional. In
defining intentionality, for example, Ekman and Friesen (1981) specifically
refer to the “deliberate” use of a nonverbal act to communicate a message to
another informant, although they do note that it may not be possible to
determine the intentionality of every instance of nonverbal behavior. To shed
some modest light on this matter, I obtained from a variety of adult informants
descriptions of their own processes of linguistic production. All felt that
they had little or no conscious control over the process of actually producing
speech; they could not explain how they talked. This experience, I feel, is the
product of the routinized nature of linguistic production; people spend a great
deal of time using language. Yet none of the informants would characterize
their speech as involuntary. Even if they were unable to articulate their
intentions, they surely had motivations for speaking, even on an
utterance-by-utterance basis. There is no reason to distinguish the nonverbal
acts associated with these utterances as any less the product of such
motivations. It is not necessary to specify goals for the acts in order to
describe them (
There
is perhaps a reasonable analogy here between the production of language and the
performance of other motor activity. If I walk from my desk to the bookshelf to
get a book, I am performing an action in service of my intention to get the
book. The overall intention may or may not be conscious, but certainly the
individual actions which accomplish it--using my legs and feet, maintaining my
balance--are not consciously performed. But neither are these actions
involuntary; they are simply easy and routine. It is possible that I might move
my leg reflexively if, for example, someone spilled ice-water on it. Similarly,
I might blink if dust irritated my eyes. But in a purposive context, both moving
my leg and blinking help me to achieve non-reflexive goals. They are the
consequences of intention and constitute its embodiment. That is, while the act
of walking to the bookshelf can be said to embody my intention to get the book,
this “act” is a composite; it has no existence outside our interpretation of
the sum of a large number of smaller acts which together produce it. Even if
attenuated, intentionality must underlie each constituent sub-act. Thus while
the distinction between voluntary and involuntary linguistic action is not
clear, a reasonable model of conversation will interpret displayed behaviors in
terms of intentional acts unless (1) they can be shown to be reflexive because of
physical factors or (2) they do not occur in--or appear to be reasonably
related to--a context of larger, intentional action.
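The interpretive default just stated can be sketched as a simple predicate. The following Python fragment is only an illustration under stated assumptions: the `Behavior` record, its two boolean fields, and the function `is_intentional_act` are hypothetical names introduced here, and the question of how either condition would actually be detected is left open.

```python
from dataclasses import dataclass


@dataclass
class Behavior:
    """A displayed behavior. All field names are hypothetical."""
    description: str
    reflexive_physical_cause: bool  # e.g., blinking because dust irritated the eye
    in_intentional_context: bool    # occurs in, or relates to, larger intentional action


def is_intentional_act(behavior: Behavior) -> bool:
    """Treat a behavior as an intentional act unless an exception applies."""
    if behavior.reflexive_physical_cause:
        return False  # exception (1): shown to be reflexive
    if not behavior.in_intentional_context:
        return False  # exception (2): no larger intentional context
    return True       # default: interpret as an intentional act


nod = Behavior("head nod while listening", False, True)
blink = Behavior("blink caused by dust", True, True)
```

Under these assumptions, `is_intentional_act(nod)` is true and `is_intentional_act(blink)` is false; the burden of proof falls on showing that a behavior is not an act.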
Planned vs. Situated Action
How
do people's intentions produce conversational acts? More specifically, how does
intention get translated into a sequence of acts that produce coherence in the
organization of conversation? A strong thread in artificial intelligence has
involved production of rationally organized actions through planning. Planning
systems have been proposed for the production of text (McKeown, 1985) and for
interactive conversation (Power, 1979; Hobbs & Evans, 1980; cf. Johnson
& Robertson, 1981).4 Other computational models of
interactive discourse suggest that conversants use similar processes which rely
on depth-first tree search. Grosz (1981, 1982) implied this by using a stack‑based
process for changing focus spaces in conversation. It should be noted, though,
that Grosz does not at all claim that conversation is pre-planned. The
structure of a discourse, she observed, tends to arise naturally out of the
structure of the discourse task. It is the focus‑space structure which
models the conversation as it develops that is stack-based. Reichman (1985),
though, directly proposed an ATN-based model for conversational exchanges.
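A stack-based treatment of focus can be illustrated with a short sketch. This is not Grosz's implementation; the class `FocusStack`, its methods, and the example topics are assumptions introduced only to show the discipline that a stack imposes. Only the most recently opened focus space is current, and an earlier space becomes current again only when everything opened after it has been closed.

```python
class FocusStack:
    """Hypothetical stack of focus spaces; topics are plain strings."""

    def __init__(self):
        self._spaces = []

    def open(self, topic):
        """Push a new focus space, which becomes the current one."""
        self._spaces.append(topic)

    def close(self):
        """Pop and return the current focus space."""
        return self._spaces.pop()

    @property
    def current(self):
        """The focus space on top of the stack, or None if the stack is empty."""
        return self._spaces[-1] if self._spaces else None

    def resume(self, topic):
        """Return to an earlier space by closing every space above it."""
        if topic not in self._spaces:
            raise ValueError(f"{topic!r} is not an open focus space")
        closed = []
        while self.current != topic:
            closed.append(self.close())
        return closed


focus = FocusStack()
focus.open("assemble the pump")
focus.open("attach the hose")
focus.open("find the clamp")
closed = focus.resume("assemble the pump")
```

In this sketch, resuming the first topic discards the two intervening spaces. That is the strict embedding a stack requires: a conversant cannot return to an earlier topic while keeping a later one open.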
It
is unlikely that simple planning or stack-based models of interactive discourse
are adequate for modeling mixed-initiative conversational discourse. The
importance of conversants being able to change the structure of their
conversation in unforeseen, flexible ways is underlined by the observation that
although some tasks produce neatly stacked discourse structures, everyday
interaction requires an enormous variety of structures for which stack-like
models are inadequate. The principal problem is that planning involves
searching a state space, yet for most conversation the future states of the
conversation are not reasonably calculable. For example, Birnbaum (1986)
pointed out the case where a conversant refutes an argument on the grounds that
the other conversant has used a supporting fact which is demonstrably false. To
have planned this exchange from the beginning, the conversant's original state
space would have had to include not only the universe of all relevant facts
which support her opponent’s argument but also all other possible facts which
are false as well. This unknown state space precludes the direct use of a
planning model.
Power
(1979) found that his stack-based planning system for conversation would run
into problems because of (1) incompleteness and (2) insufficient flexibility in
adjusting to changes in context:
Let us turn now to the second fault of the
control stack as representation of the dialogue state: namely, insufficient
explicitness. What this means is that the relations between elements of the
dialogue state are not represented systematically.... The result is that the
dialogue state can be interpreted just one way; it cannot be interpreted by
several different procedures for several different purposes. The robots
therefore cannot respond flexibly to unexpected turns in the conversation; an
unexpected remark throws them completely. (Power, 1979, pp.
133-134).
To
help solve this problem, Power proposed marking the elements of the dialogue
state with explanatory relations that could be used inferentially to rework the
plan. In effect, this would mean trying to re-plan the conversation at each
state. This approach is not parsimonious and tries to graft skills for
opportunism onto a fundamentally top-down structure. It does not really address the underlying
issue of producing conversational organization from a system which is
fundamentally flexible.
Cohen
(1984), working with earlier research, reported that discourse analysis of
human‑computer interaction reveals that users do not follow the strict
embedding of subdialogues required by an ATN model. Rather, a more flexible
“demand” model was needed. Cohen also reports research indicating that
efficiency in referential communication is a function of user feedback. ATNs,
as a stack-based method, are considered too rigid
for even sentential grammars and thus are unlikely to be capable of
representing dialogue processes (Frederking, 1988).
Indeed,
the planning model has been characterized as a post-hoc rationalization of
actions; it is an artifact of reasoning about actions rather than a mechanism
for producing them (Suchman, 1987). Plans, Suchman suggests, are simply a
restatement of intention.5 She proposes
that the coherence of action is not adequately explained by either stored plans
or scripts; rather, the organization of the conversants’ actions is an emergent
property of moment-by-moment interactions between actions, and between
conversants and their context. That is, global coherence is the result of
situated application of locally meaningful operators. This extends Birnbaum’s
(1986) notion of opportunistic planning.
This
does not preclude the use by conversants of high-level reasoning about their
context and actions they might take to achieve their goals. This is, after all,
a large component of what we perceive as conscious thought. Therefore, as I
understand the implications of Suchman's thesis, we can produce conscious plans
(based on some known state-space), act on them, and then react to changed or
unanticipated circumstances as needed. More typically, we do not formulate an
explicit plan; rather, we take some initial action to achieve our goal, and
thereby create a set of expectations about what will follow. It does not matter
(from the standpoint of conversational control) if the expectations are not met
because we can again produce from our intentions a new action which is
responsive to the new situation. To the extent that our expectations are met,
we can routinize the selection and application of operators.
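The contrast with planning can be made concrete with a minimal sketch of situated operator selection. Everything in it is an assumption made for illustration: the three operators (`ask`, `rephrase`, `acknowledge`), the dictionary used to represent the situation, and the scripted list of partner responses that stands in for an unpredictable partner. The point is only that each operator inspects the current situation, no future states are searched, and an unmet expectation simply leads to a different operator being selected on the next turn.

```python
from typing import Callable, Optional

# An operator looks only at the current situation and the goal. It returns an
# action if it applies locally, otherwise None. All names are hypothetical.
Operator = Callable[[dict, str], Optional[str]]


def ask(situation: dict, goal: str) -> Optional[str]:
    return None if situation["asked"] else "ask: " + goal


def rephrase(situation: dict, goal: str) -> Optional[str]:
    return "rephrase: " + goal if situation["partner"] == "confused" else None


def acknowledge(situation: dict, goal: str) -> Optional[str]:
    return "acknowledge" if situation["partner"] == "answered" else None


OPERATORS = [rephrase, acknowledge, ask]


def next_action(situation: dict, goal: str) -> Optional[str]:
    """Select the first operator that applies to the situation as it now stands."""
    for operator in OPERATORS:
        action = operator(situation, goal)
        if action is not None:
            return action
    return None


def converse(goal: str, partner_responses: list) -> list:
    """Act, observe the partner's response, and act again, with no lookahead."""
    situation = {"asked": False, "partner": None}
    responses = iter(partner_responses)
    trace = []
    while True:
        action = next_action(situation, goal)
        if action is None:
            break
        trace.append(action)
        if action == "acknowledge":
            break
        situation["asked"] = True
        situation["partner"] = next(responses, None)
    return trace


trace = converse("where is the book", ["confused", "answered"])
```

Here the initial question creates an expectation of an answer. When the partner is instead confused, the expectation is not met, and the rephrasing operator is applied in response to the new situation rather than as a step in a precomputed plan.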
Summary
In this chapter, I presented the integration
problem: resolving speech-act theory with the ragged character of real
conversation. Socio-linguistics has identified characteristics of linguistic
interaction that (1) are difficult to explain in speech-act theory and (2)
indicate the presence of meta-messages for conversational control. I then
discussed techniques and assumptions which conversants use to maintain
coherence. The evidence suggests that coherence is a product of a joint process
in which the parties judge their understanding with respect to a mutual model
of the conversation. When conversants detect significant differences between
their beliefs about the mutual model of the conversation, they use feedback and
repair to restore coherence. This process of feedback and repair can be viewed
as consisting of meta-locutionary acts which control the interaction. As a
result, the non-grammatical features of conversational interaction can be
interpreted as embodying meta-locutionary acts which promote coherence. Finally,
I discussed the role of intention in conversation, with particular attention to
issues of interpreting expression and conversational structure. While the
distinction between voluntary and involuntary linguistic action is not a clear
one, a reasonable model of conversation will interpret displayed behaviors as
acts. As to structure, the spontaneous nature of conversation suggests that
planner-based models should be disfavored; global coherence may be produced
through situated application of locally meaningful operators. In the next
chapter, I more fully develop the idea of meta-locutionary acts as a theory of
action in language.
2.
In the transcript, the notation refers to L(arge) or S(mall), B(lue) or
R(ed), C(ircle) or S(quare)
shapes. Thus the LBC is the large blue circle, the LRC is the large red circle,
and the SRS is the small red square.
3.
Cohen (1984) presents a brief account of grammar-based approaches to ill-formed
input.
4.
Appelt (1981, 1985) also proposed planning models for
conversational discourse. This work, however, principally concerned
intra-utterance planning rather than inter-utterance conversational planning.
5.
Suchman also rejects speech-act models as begging the question of situated
interpretation. As I attempt to show in this dissertation, speech acts and
situated action are not incompatible. Speech acts can be viewed as the product
of contextually sensitive operators.