Chapter IV
Methodology
In the preceding chapters I considered that the general question of coherence is
posed in the integration problem, examined a range of prior work which provided
foundations for attacking the problem, and proposed a theory of meta-locutionary
acts for explaining the control processes of conversation. In this chapter I
discuss the specifics of a study for assessing the theory. In particular, I
look at the effects to be studied, and the mechanics of the protocol analysis
of conversation. The computational modeling of the observed behaviors is
discussed in chapters V and VI.
Methodology in Cognitive Science
Although the field is not settled (Suppes, 1984), methodology in cognitive science has
produced some agreement on what constitutes adequate work. Unfortunately,
research into the processes of conversation has (at least) two major
methodological pitfalls. The first is the lure of introspection. Introspection
seems inexpensive and easy; virtually every human being communicates interactively
with others. But one cannot simply present one’s intuitions as a descriptive
study. It is precisely because mixed‑initiative discourse is so familiar
to us that its processes are transparent to the introspective observer. The
observed behaviors must be explained in the context of a theory that proposes a
coherent model (Miller, Polson, & Kintsch, 1984). What sort of theory is
adequate? Miller et al. (1984) suggest that there is a continuum between a
naked idea and an explicit process model. In the center of this continuum lie “middle
ground” theories in which structures and processes have been specified in
sufficient detail to support both the instantiation of the theory as a computer
program and qualitative experimental study. In this chapter, I explain how the
theory of meta‑locutionary acts can be validated in both regards; it thus
avoids the first pitfall. The second pitfall is that of failure to explain
behavior as a process. Although all studies must inherently look back at
behavior that is now--because it occurred in the past--static, a good model
will explain the behavior on the basis of real‑time factors: those which
were available in the actual context of the observed interaction (Swinney,
1984). Although this view has been criticized as overly restrictive (Charniak,
1984), its essential truth is that the model cannot embody in its explanation
the kinds of post‑perceptual processes which are available to observers
rather than to the participants. I have tried to address this concern through
(1) minimizing domain‑level, real‑world knowledge in the
experimental task, (2) using temporally‑significant acts to explain the
observed behaviors, and (3) trying to replicate the behaviors with a simulation
of the model.
Effects to be Studied
The eventual goal of this study is to test the theory of meta‑locution
through computational modeling of actual interactive discourse. Within the
general philosophical framework of the speech‑act approach to language,
most of the discourse theories we have discussed have looked at basically
sentence‑level (or utterance‑level) interaction. At a minimum, the
discourse unit of interest has been the noun phrase, where the noun phrase has
been characterized as an act of assertion or the subject of negotiation. As I
have tried to show in Chapter II, both Cohen (1984) and
I thus seek to examine sub‑domain levels of conversational interaction. As
the socio‑linguistic literature has described the features of
conversation which apparently lead to coherence, this study concentrates on
aspects of language which are related to these observations. Thus the model
should explain turn‑taking, negotiation of reference, and confirmation of
the mutuality of knowledge.
General Approach
I have argued that maintenance of a conversational model is clearly part of any
reasonably sophisticated approach to generational simulation of conversation,
yet our initial observations of conversations suggest that maintenance of the
model is a continuous, multi-level process of incremental addition and revision
rather than a post-utterance assessment. This set of behaviors of the speaker
and hearer is a) at a more finely grained level than that of the standard
illocutionary act and b) about the process of conversation itself. They are,
broadly speaking, locutions which embody illocutionary acts in the sense that
they are intended to produce a change of state in the world outside the
speaker, and are meta‑acts in the sense that in combination they effect
Austinian speech acts but individually are specifically directed to changes in
the state of the conversational model itself. That is, these are illocutionary
acts which correspond to intended perlocutionary effects performed on the
shared conversational structure. The feedback-suffused basis of coherence in
conversation is a “critical fact” (VanLehn, Brown, & Greeno, 1984) which
the theory of meta-locutionary acts should better explain than traditional
speech-act theory.
Given a theory of the meta-(il)locutionary act as a maintenance device for a multi‑layered
shared conversational model, what is the immediate program of research to be
followed? Cohen (1984) states that one should derive the illocutionary acts as
a rational strategy of action, given attributions of participants’ beliefs,
goals, and expectations at the point in the discourse in which the
illocutionary acts actually occurred. Accordingly, I first identify a set of
speech acts which (1) comprise illocutionary acts and (2) handle the micro‑tasks
of turn‑taking, negotiation of reference, and confirmation of the
mutuality of knowledge. This is accomplished through analysis of conversation
protocols at a sufficiently small granularity that the acts can be discerned. Second,
I develop, using these acts, a computational model of the belief structures of
the conversants. The structures, with a suitable set of operators, are
sufficient to account for most of the observed meta-locutionary behavior.
Study of Spoken, Face‑to‑Face Conversation
The effects of meta‑locutions, as I have defined them, are peculiar to real-time
interaction and especially to spoken conversation. However, not all discourse
is interactive or spoken, much less face-to-face. In fact, there are a number
of dimensions which characterize the various modalities of discourse. The
character of the discourse changes in ways corresponding to the characteristics
of the modality.
Why then should one conduct research into computational models of meta‑locutions
using spoken, face-to-face conversation? It appears that the most efficient (in
temporal terms, anyway) form of interactive communication for human beings is
spoken conversation. It is certainly easier. Why should users of interactive
computer systems be limited to the capabilities of computing technology circa
1965? Moreover, it may be difficult or impossible to understand the processes
of language and interaction without starting with face-to-face, spoken
conversation. While the precise nature of the contributions of verbal and
nonverbal communication to interaction is not known, it is ascertainable
through research:
This notion, though simple and intuitive,
carries a strong implication for research: we can fully understand language
only by examining its functioning as an aspect of face-to-face interaction, and
not by treating it as an autonomous entity, unsullied by contact with everyday
social processes, including co-occurring “nonverbal” actions. (Duncan, 1980, p. 67).
As for the advantages of conversation for interfaces, cross-modal research on
discourse as surveyed by Cohen (1984) found significant efficiencies in spoken
conversation. In particular, a series of cross-modality studies was conducted
by Chapanis, Ochsman, Parrish, & Weeks (1972, 1977). They found, inter
alia, that problems are solved twice as fast in vocal modalities as they are in
written ones, even though conversants use twice as many words when speaking. Cohen
(1984) also cites a thread of psychological research on cross-modal comparisons
of reference. These studies show that for spoken interaction the length of noun
phrases tends to decrease as subsequent references are made; this decrease is
not as sharp for non-interactive spoken modalities. Cohen concludes that these
results indicate that efficiency in referential communication is a function of
conversants’ feedback.
Many of the differences between narrative and interactive discourse are well known. People
looking at conversational language often remark on its apparently ill-formed
and “ungrammatical” qualities. Not only are what one often considers speakers’
mistakes more prevalent; the opportunities for repair are correspondingly
abundant. Thus Goodman suggests that the study of miscommunication is a necessary
task for building natural language understanding systems since any computer
capable of communicating with humans in natural language must be tolerant of
the complex, imprecise, or ill-devised utterances that people often use
(Goodman, 1986). Interactive systems which aspire to live in the real world of
feral language must be able to cope with its perplexing characteristics. Among
the factors affected by modality of discourse is the conversants’ ability or
opportunity to maintain their model of the conversation. Clearly, authors of
novels do not depend on real-time feedback from their readers to check
uncertainty in the readers’ models, nor can authors negotiate the meaning of
acts, lexical items, and references. Thus Jernudd and Thuan observe that
discourse represents a continuous process of accommodation among conversants.
The relationship between conversants can vary greatly, from the great distance of
written discourse to closeness of face-to-face conversation, which brings the
question of accommodation into sharp focus (Jernudd & Thuan, 1983). If one
is to study the shared model created by discourse and its maintenance through
negotiation and accommodation, then one should study conversational rather than
textual discourse. Other modalities of discourse, such as keyboard input, can
be interactive but not spoken. It turns out that the presence of speech itself
in place of written interactive language is significant for the structure and
processes of discourse. Cohen (1984) found that keyboard communication is
distinctly different from other modalities of discourse like face-to-face
conversation. In particular, among other differences, keyboard interaction
emphasized optimal packing of information into the smallest linguistic space. As
a result, keyboard communication alters the normal organization of discourse. Of
course, non‑textual information represents some sort of (presumably most
efficient) lexicalization of an underlying meta‑locutionary act. To the
extent that acts can be lexicalized in the alternate modality, the meta-locutionary
content can be transmitted. However, this may require a new vocabulary; unless
recognized through social acceptance or immediately negotiated, new lexemes will
not be understood.[8]
Modality-induced differences in discourse also include the way conversants use reference. For
example, voice‑only communication removes some of the forms of acts of
reference (e.g., deixis and common visual context) which keep the interaction
coherent. When these acts are not available, the interaction can easily break
down:
O.K., uh ... now, we need to attach the um ...
conduit to the motor ... the conduit is the uh ... the covering around the wire
that you ... uh ... were working with earlier. Um, there is a small part um ...
oh brother ... (Grosz, 1982, p. 88).
Accordingly, cross-modal studies have shown that different modalities of communication lead
to different uses of referring expressions. Analysis of protocols from teletype
and telephone interactions shows marked differences in the use of explicit
requests for identification. As a consequence, systems which understand spoken
language will have to be prepared for language which differs from that observed
in teletype interaction (Cohen, 1981). If the goal of research in natural
language processing is either (1) understanding how people actually converse or
(2) developing systems which converse with people, the evidence with respect to
differences in conversational characteristics from differences in modality
suggests that spoken conversation would be a fruitful area of study.
The Study
The methodological strategy of this study, then, is to examine conversational
interaction in some reasonable domain, and then to derive the underlying
illocutionary acts as a rational strategy of action, given attributions of the
participants’ beliefs, goals, and expectations at the points in the discourse
in which the illocutionary acts actually occurred (Cohen, 1984). I thus turn to
the particulars of the design of the study that instantiate this strategy.
Domain and Tasks
The first part of the empirical work involved development of a suitable domain for
the observed conversations. The general requirements were that the conversation
produce acts by the conversants which included turn‑taking, negotiation
of reference, and determination of mutuality of knowledge. In order to
facilitate ready experimental determination of mutuality, the domain should be
a simplified one in which as much of the mutuality as possible is created
through direct rather than indirect copresence. That is, if the knowledge
becomes mutual to the conversants during the conversation itself, the fact of
the mutuality can be more easily ascertained by an experimenter‑observer.
In contrast, if the subject of the conversation is something which is mutual
through indirect copresence, then the experimenter has virtually no basis from
which to determine the knowledge confirmed or the extent of the confirmation. Unfortunately,
extreme reduction in domain complexity may also lead to unrealistic interaction
that exhibits artificial effects. In reaching a balance here, one may simply
have to qualitatively evaluate the resulting conversation for verisimilitude.
A second factor in selecting the domain concerns deixis. I sought to reduce
confusion over gestures which were referential by minimizing circumstances
which called for deictic reference. This meant that a task such as jointly
building a structure would have to be limited to structures which were wholly
mental rather than physical or visual. The advantages of reduced confusion from
deictic gestures are not obtained without cost. DeLancey (personal
communication,
A third factor which influenced the choice of both the domain and the task was a
need to keep the conversations reasonably short--around two minutes at the
most. At the fine‑grained level of transcription which this study
requires, longer protocols would simply be impractical to transcribe. Experience
with the pilot study suggests that to transcribe fully ten seconds at the
appropriate level of detail takes about six hours. Thus forty seconds of
protocol requires three working days to transcribe. It would also be
possible to elicit longer conversations with greater structure and then choose
short sections for analysis. However, this would lead to more speculative
analysis in initializing the conversational models. Accordingly for this
initial study, the domain and task had to revolve around concepts and
activities that could have fairly rapid closure.
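The transcription budget behind this constraint can be made explicit. The following is a minimal sketch of the arithmetic, assuming the pilot-study rate of about six hours per ten seconds of protocol and an eight-hour working day. The eight-hour figure is an assumption not stated in the text (it is chosen because it reproduces the three-day estimate for forty seconds), and the function name is illustrative only.

```python
HOURS_PER_TEN_SECONDS = 6.0   # pilot-study estimate reported in the text
HOURS_PER_WORKING_DAY = 8.0   # assumption: not stated in the text


def transcription_days(protocol_seconds: float) -> float:
    """Estimate the working days needed to transcribe a protocol of the given length."""
    hours = protocol_seconds / 10.0 * HOURS_PER_TEN_SECONDS
    return hours / HOURS_PER_WORKING_DAY


if __name__ == "__main__":
    for seconds in (10, 40, 120):
        days = transcription_days(seconds)
        print(f"{seconds:>4} s of protocol -> {days:.1f} working days")
```

Under these assumptions the cost grows linearly with protocol length, so even a two-minute conversation implies on the order of nine working days of transcription if transcribed in full.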
Given these constraints, I chose for the domain the task of jointly reconstructing a
sequence of random letters. Thus a typical protocol involves two subjects, each
of whom has a copy of the same sequence of 15 letters. Some of the letters have
been replaced by blanks. The blanks do not overlap. Thus if one had both copies
of the sequence, the entire sequence could be determined. The sequences of
letters and the positions of the blanks were chosen randomly. Figure 7 depicts
the copies of the sequences which subjects received in one of the trials.
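The construction of such stimuli can be sketched in a few lines. This is only an illustration under stated assumptions: the text says that letters and blank positions were chosen randomly but does not describe the procedure, so the function name `make_copies`, the number of blanks per copy, and the use of a seeded generator are all hypothetical. The one property taken from the text is that no position is blank in both copies.

```python
import random
import string


def make_copies(length=15, blanks_a=4, blanks_b=3, seed=None):
    """Return (full, copy_a, copy_b): a random letter sequence and two
    copies whose blank positions are disjoint (hypothetical procedure)."""
    rng = random.Random(seed)
    full = [rng.choice(string.ascii_lowercase) for _ in range(length)]
    # Draw all blank positions in one sample so that no position
    # can be blank in both copies.
    positions = rng.sample(range(length), blanks_a + blanks_b)
    blank_a = set(positions[:blanks_a])
    blank_b = set(positions[blanks_a:])
    copy_a = ["_" if i in blank_a else ch for i, ch in enumerate(full)]
    copy_b = ["_" if i in blank_b else ch for i, ch in enumerate(full)]
    return full, copy_a, copy_b


if __name__ == "__main__":
    full, copy_a, copy_b = make_copies(seed=1)
    print("full:  ", " ".join(full))
    print("copy A:", " ".join(copy_a))
    print("copy B:", " ".join(copy_b))
```

Sampling the blank positions for both copies in a single draw is one simple way to guarantee that the blanks do not overlap; other procedures would serve equally well.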
The tasks were structured as follows: The
subjects were given, by random choice, their copies of a sequence. The subjects
then had one minute in which to memorize their respective sequences. After the
minute expired, the subjects turned over the pieces of paper on which the
sequences were printed and were instructed to jointly recall the entire
sequence. This task was repeated two additional times, so that for each pair of
subjects I obtained three conversations. In the third task, the copies of the
sequence were altered so that toward the end of the sequence the copies
differed by one letter. Copies of this kind are depicted in Figure 8.
Copy of sequence given to subject A:
_ i s u t w r q g l d _ _ _ _ g o
Copy of sequence given to subject B:
o _ s u t w r q g _ _ f w w d g o
Figure 7. Example of random‑letter
sequences used in protocols. Neither copy contains the entire sequence. The
blanks do not overlap. Therefore, the entire sequence can be reconstructed if
the information in both copies is used.
Copy of sequence given to subject A:
n _ _ b w _ e g u q t y v o x u e
Copy of sequence given to subject B:
_ i k b w f e g u q t y _ _ _ y e
Figure 8. Example of non‑identical
sequences used in third task. The copies of the sequences given to the subjects
differ toward the end. In this example, the difference occurs in the next-to-last
position. This stimulus was used to induce apparent failure of the subjects’
models of the conversation.
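The relationship between the two copies in Figures 7 and 8 can be checked mechanically. The sketch below is an illustration only, not part of the experimental procedure: the function name `merge_copies` and its return format are assumptions. It combines two copies position by position, reporting any position where both copies are blank and any position where the copies disagree.

```python
def merge_copies(copy_a: str, copy_b: str):
    """Combine two space-separated copies.

    Returns (merged, conflicts, unknown), where conflicts and unknown
    are lists of 1-based positions.
    """
    a, b = copy_a.split(), copy_b.split()
    if len(a) != len(b):
        raise ValueError("copies must have the same length")
    merged, conflicts, unknown = [], [], []
    for pos, (x, y) in enumerate(zip(a, b), start=1):
        if x == "_" and y == "_":
            unknown.append(pos)        # overlapping blanks: letter unrecoverable
            merged.append("_")
        elif x == "_":
            merged.append(y)
        elif y == "_" or x == y:
            merged.append(x)
        else:
            conflicts.append(pos)      # the copies disagree at this position
            merged.append(f"{x}/{y}")
    return merged, conflicts, unknown


# Copies as printed in Figures 7 and 8.
FIG7_A = "_ i s u t w r q g l d _ _ _ _ g o"
FIG7_B = "o _ s u t w r q g _ _ f w w d g o"
FIG8_A = "n _ _ b w _ e g u q t y v o x u e"
FIG8_B = "_ i k b w f e g u q t y _ _ _ y e"

if __name__ == "__main__":
    for name, a, b in (("Figure 7", FIG7_A, FIG7_B), ("Figure 8", FIG8_A, FIG8_B)):
        merged, conflicts, unknown = merge_copies(a, b)
        print(name, " ".join(merged), "conflicts:", conflicts, "unknown:", unknown)
```

For the Figure 7 copies the merge is complete and free of conflict; for the Figure 8 copies it reports a single disagreement in the next-to-last position, which is the discrepancy the third task relies on.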
The rationale for the three tasks is that the first provides familiarization to the
subjects, the second lets the subjects interact proficiently, and the third
uses the subjects’ expectations garnered in the first two tasks to induce
apparent failure of the subjects’ model of the conversation. The domain has the
needed characteristic that all of the knowledge will be directly copresent,
because the entire domain‑level knowledge structure is contained in the
sequences. Moreover, because the sequences are memorized, the resulting
conversations revolve entirely around mental rather than physical
constructions. The relatively short length of the sequences meant both 1) that
memorization and recall would be feasible tasks and 2) that the conversations
could conclude rapidly. These assumptions were confirmed in a series of pre‑experimental
tests.
Protocols
I conducted for this paper a pilot study with two sets of subjects. Each set
consisted of two persons who performed the three tasks described above. The
subjects sat roughly at a ninety‑degree angle to each other so as to face
each other and still be visible to the camera. An overhead view of the
situation is presented diagrammatically in Figure 9.
The subjects were recorded on VHS videotape during all instructions, preparations,
and the actual tasks. The subjects completed, from their subjective standpoint,
all of the tasks, although objectively they did fail on occasion to recall the
sequences accurately. However, their accuracy in the task was sufficient for the
analysis. The conversations typically lasted about two to three minutes. In all,
then, I obtained six protocols.

Figure 9. Overhead view of the experimental set-up
for the letter-sequence protocols.
Protocol Analysis of Conversational Interaction
In the preceding section, I discussed the rationale, design, and collection of the
protocol data for this research. I now turn to the analytical methods applied to
the protocols. Useful sections of the protocols were transcribed for both verbal
and non‑verbal behaviors. The verbal portions were transcribed with the aid
of the SoundCap™ digitizing program on an Apple Macintosh computer. The sounds were
resolved down to a precision of about a tenth of a second. Ambiguous words were
resolved by listening to the original videotape. Sounds that could not be resolved
into words were noted phonetically. Intonation was coded through punctuation. The
subjects’ physical actions were transcribed using an editing videotape recorder
with slow-motion control. The actions were resolved, where necessary, on a frame-by-frame
basis down to a thirtieth of a second. A notation for the physical actions was developed
to the extent necessary to impart the general nature of the actions observed; while
accuracy in timing is important, accuracy in the specifics of the motions is not
needed if the gist of the action can be discerned. This feature of the transcription
process is facilitated by the fact that the study does not examine nor does it make
claims about lexicalization of physical action (or of verbal acts either, for that
matter). A transcript of the protocol specifically analyzed in this paper is set
forth in Appendix A.
Once the verbal and physical behaviors had been transcribed, I then assigned identifiers
to the behaviors which I deemed could be considered acts. These included virtually
all of the behaviors observed. Only later in the study were behaviors declassified
as acts; the motivation for this approach is that it was better to over‑include
acts as significant rather than risk ignoring a behavior which might have been
significant within the context of the conversation. As Birdwhistell (1970) observed,
no behavior ever carries meaning in and of itself; it is the context which provides
the meaning, if any. For example, in the protocol reported in Appendix A, B’s behavior
of leaning back was initially coded as a separate act identified as Bi3; subsequent
analysis of the interaction suggested no reasonable basis for treating this action
as an act distinct from its other associated actions, and so the notation of Bi3
was deleted. As it turned out, though, very few behaviors were found insignificant.
Conversely, in some cases additional acts were noted where, on further analysis,
the behaviors appeared to require decomposition. In the protocol reported in Appendix
A, act Ai6a was added in this way.
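The over-inclusive coding procedure described above, in which every behavior starts out as an act and the coding is revised later, can be sketched as a small data structure. This is an illustration under stated assumptions and not the notation actually used in Appendix A: the class and function names are hypothetical, the existence of an act "Ai6" is inferred only from the identifier Ai6a, and the placeholder descriptions are not taken from the transcript.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CodedBehavior:
    act_id: str          # identifier such as "Bi3" or "Ai6a"
    description: str
    is_act: bool = True  # over-inclusive default: every behavior begins as an act


def declassify(coding, act_id):
    """Return a new coding in which the named behavior no longer counts as an act."""
    return [replace(b, is_act=False) if b.act_id == act_id else b for b in coding]


def decompose(coding, act_id, new_id, description):
    """Return a new coding with an added act placed directly after an existing one."""
    out = []
    for b in coding:
        out.append(b)
        if b.act_id == act_id:
            out.append(CodedBehavior(new_id, description))
    return out


# Hypothetical initial coding; only "Bi3" (B leaning back) is described in the text.
coding = [
    CodedBehavior("Bi3", "B leans back"),
    CodedBehavior("Ai6", "placeholder: behavior later found to need decomposition"),
]
coding = declassify(coding, "Bi3")
coding = decompose(coding, "Ai6", "Ai6a", "placeholder: act added on further analysis")
acts = [b.act_id for b in coding if b.is_act]

if __name__ == "__main__":
    print(acts)
```

Keeping declassified behaviors in the record, rather than deleting them outright, is a design choice of this sketch; it preserves the history of the coding decisions but differs from the text, where the notation of Bi3 was simply deleted.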
Summary
In assessing the validity of the theory of meta‑locutionary acts, the effects
to be studied should include turn-taking, negotiation of reference, and confirmation
of the mutuality of knowledge. Accordingly, a domain was developed in which these
meta‑conversational phenomena could be observed and modeled. The domain task
required face-to-face, spoken conversation between the conversants, maximized the
use of directly copresent knowledge, and minimized deictic gestures. The actual
task given the conversants in the experimental protocols--joint recall of a random
sequence of letters--met these requirements. Protocols were obtained for three trials
by each of two pairs of subjects. The protocols were transcribed for verbal and non-verbal
behaviors, and the behaviors were noted as possible meta‑locutionary acts.
In the next chapter I describe how the behaviors
were connected to the acts developed as part of the theory of meta‑locutionary
acts. These acts, and the conversational models which they affect, explicate the
conversation in meta‑locutionary terms.
8. Some new lexemes may be slowly evolving for textually based computer interaction.
For example, the smiley-face icon “:-)”, along with its variants, conveys the meta-locutionary
sense of satire or irony.