Pragmatics and Cognition, to appear
version of February 16, 2005
Nigel Ward
University of Texas at El Paso
Acknowledgements:
I thank Takeki Kamiyama for phonetic label checking, Gautam Keene and Andres
Tellez for pragmatic function labeling and discussion, and all those who let me
record their conversations. For general discussion I thank Daniel Jurafsky and
Kazutaka Maruyama. I would also like to thank Keikichi Hirose, the Japanese
Ministry of Education, the Sound Technology Promotion Foundation, the Nakayama
Foundation, the Inamori Foundation, the International Communications
Foundation, the Okawa Foundation and the National Science Foundation for
support. Most of this work was done at the University of Tokyo.
Nigel Ward
nigelward@acm.org
phone: 915-747-6827
fax: 915-747-5030
http://www.cs.utep.edu/nigel/
Computer Science, University of Texas at El Paso
El Paso, TX 79968-0518
Biographical Note: Nigel Ward received a Ph.D. from the University of California at Berkeley
in 1991. From 1991 to 2002 he was
with the University of Tokyo. His
primary research interest is human-computer interaction, especially sub-second
responsiveness in spoken dialog systems.
Abstract: Sounds like h-nmm, hh-aaaah, hn-hn,
unkay, nyeah, ummum, uuh, um-hm-uh-hm, um and uh-huh occur
frequently in American English conversation but have thus far escaped
systematic study. This article
reports a study of both the forms and functions of such tokens in a corpus of
American English conversations.
These sounds appear not to be lexical, in that they are productively
generated rather than finite in number, and in that the sound-meaning mapping
is compositional rather than arbitrary.
This implies that English bears within it a small specialized
sub-language which follows different rules from the language as a whole. The functions supported by this
sub-language complement those of main-channel English; they include low-overhead
control of turn-taking, negotiation of agreement, signaling of recognition and
comprehension, management of interpersonal relations such as control and
affiliation, and the expression of emotion, attitude, and affect.
1. Introduction
American English conversations are sprinkled with a large variety of non-lexical sounds,
as suggested by Table 1. Along
with such familiar items as oh, um, and uh-huh, there are a large number
of less common sounds such as h-nmm, hh-aaaah, hn-hn, unkay, nyeah, ummum,
uuh and um-hm-uh-hm.
Similar variety is also seen in Swedish (Allwood & Ahlsen 1999), German (Batliner et al.
1995) and Japanese (Ward 1998).
[INSERT TABLE 1 ABOUT HERE]
While
aspects of non-lexical items in conversation have been studied, the less common
sounds have mostly escaped notice.
In particular four basic questions have not been raised, much less
addressed: first, the reason for such a large variety of sounds, second, what
they all mean, third, their role in human communication, and fourth, their
cognitive status.
The
structure of this paper is as follows.
The first three sections illustrate the phenomena, survey the current
state of knowledge, explain the practical importance, and outline the overall
approach. Section 4 presents a
phonetic description and argues that most non-lexical conversational items,
including both the rare and the common forms, are productive combinations of 10
component sounds. Sections 5, 6,
and 8 present meanings for each of these component sounds and evaluate the
power of a Compositional Model, in which the meaning of a non-lexical token is
the sum of the meanings of the component sounds. The methods used to identify and check these meanings are
presented as they arise, but mostly in Sections 2, 5, and 7. Sections 9 and 10 explore how the model
helps clarify the role of non-lexical utterances in human communication and
their relationship to phenomena such as interjection and laughter. Section 11 summarizes.
2. The Need for an Integrative Account
For
several reasons an integrative account of non-lexical items in conversation is
needed. Although aspects of these
phenomena have been addressed by a large number of studies, undertaken with a
variety of aims, there has as yet been no attempt to integrate the findings. This section explains why it is worth doing
so.
First,
although there are many studies which have focused on one or a few of these
items, the big picture has been missing.
That is, there has been no attempt to explain how these items function
as a system, meaning that, for example, there is no account of how speakers can
choose among these items, especially the less common ones.
This
lack hinders the construction of more useful spoken-dialog systems, in that
non-lexical items have the potential to let spoken dialog systems give the user
better, more motivating feedback, to deliver information more efficiently and
smoothly, and in general to make human-computer interaction more pleasant (Schmandt 1994;
Shinozaki and Abe 1998; Thorisson 1996; Rajan et al. 2001; Iwase & Ward
1998; Ward 2000a; Ward & Tsukahara 2003). This lack also hampers learners of English as a second
language (Gardner 1998). Today
there is no model or resource that describes even approximately, for example,
the relation between uh and uh-huh, the ways in which the
meaning of uh-huh resembles and differs from that of uh-hn,
and when people use myeah
instead of yeah. Thus, as a
supplement to more detailed studies, a big-picture account would have great
practical value.
[INSERT TABLE 2 ABOUT HERE]
Second,
although there have been detailed studies of non-lexical utterances within
certain roles, especially disfluencies and back-channels, there has been little
work looking at the distribution of non-lexical items across such roles. This lack of category-spanning studies
is unfortunate since, as McCarthy (2003) notes, many of these sounds are
multi-functional. This is also seen
in Table 2: for example, oh occurs both as a back-channel and
turn-initially. An integrative
account has the potential to reveal broader generalizations.
Third,
although there have been on the one hand several phonetically sensitive studies
of non-lexical utterances, and on the other hand many pragmatically
sophisticated studies of their use in conversation and a few controlled
experiments, there has been little connection between the two: the phonetically
sensitive work has said little about those variations which are common in
conversation or cognitively significant, and conversely the work based on
conversation or dialog data has not paid much attention to phonetic variation. An integrative account, looking at
variations in form and variations in meaning together, has the potential to
improve our understanding of both aspects.
Ultimately,
of course, the reason to seek an integrative account lies in the hope that it will
be simpler overall.
3. Approach
To seek an integrative account it was
necessary to approach the phenomena in a novel way.
3.1 Working with a Mid-Size Corpus
The basic strategy adopted was to take a
mid-sized corpus of casual conversations and try to understand and explain
everything about all of the non-lexical utterances. By looking at all occurrences it was easier to notice the
relations between items and to examine items across a variety of functional and
positional roles.
Conversations were used, rather than
task-oriented dialogs or controlled dialog fragments, to allow the study of
diverse dialogs and rich interactions, giving a broader view of when and how
non-lexical utterances are used.
Analysis was limited to a mid-size
corpus, rather than a large one, in order to allow a reasonably thorough
examination of the phonetics and pragmatics of each occurrence. This also made it possible for all the
analysis to be done by listening directly to the data, without having to rely
on transcriptions.
A home-made corpus, rather than a standard one, was used
because the author was familiar with it, as the sound engineer recording the
conversations, as a friend or acquaintance of most of the conversants, and as a
participant in a few of the conversations. (The author's own non-lexical utterances were excluded from
the analysis.) The extra
information this gave was often helpful when interpreting ambiguous utterances.
The corpus used includes 13 different
speakers, male and female, all American, aged from 20 to 50ish, from a variety
of geographical areas. Most of the
conversations were recorded for another purpose (Ward & Tsukahara 200) and
participants were not informed of the interest in non-lexical utterances. In some cases people were brought together
to converse and be recorded; in other cases the conversations were already in
progress. All recordings had only
two speakers, and in most cases these two were doing nothing but conversing
with each other, although some conversations included interactions with other
people or pets, and one speaker was driving. Recording locations included the laboratory, living rooms, a
conference room, a hotel lobby, a restaurant, and a car. The relationships between conversants
ranged from relatives to close friends to acquaintances to strangers. Most conversations were recorded in
stereo with head-mounted microphones; one was a telephone conversation.
3.2 Looking at a Wide Variety of Items
Given this corpus, the first thing to do
was to identify all the non-lexical items. To avoid missing anything that might be relevant, the
initial definition was made inclusive.
Specifically, all sounds which were not laughter and not words were
labeled as non-lexical items. A
`word' was considered to be a sound having 1. a clear meaning, 2. the ability
to participate in syntactic constructions, and 3. a phonotactically normal
pronunciation. For example, uh-huh
is not a word since it has no referential meaning, has no syntactic affinities,
and has salient breathiness.
Although the distinction between words and non-lexical items is not
clear-cut, as will be seen, this gave a reasonable way to pick out an initial
set of sounds to examine.
To keep the scope manageable, attention
was limited to sounds which seemed at least in part directed at the
interlocutor, rather than being purely self-directed, even if the communicative
significance was not clear. This
ruled out stutters and inbreaths.
The corpus has 316 non-lexical items,
with one occurring about every 5 seconds on average.
3.3 Listening to the Data
Rather than working from transcripts, all
analysis was done by listening.
This probably helped focus attention on the interpersonal aspects of the
dialogs, rather than the information content. This research style was facilitated by the use of a
special-purpose software tool for the analysis of conversational phenomena,
didi (Ward 2003).
However, it being important to pay
attention to the detailed sounds of non-lexical items, these were labeled
phonetically. These labels were always
visible while listening.
The phonetic labeling was done using
normal English orthography, as discussed below. IPA was not used as it provides more detail than was needed,
potentially obscuring generalizations.
This is a common choice in studying dialog; for example, Trager (1958)
argued that the study of `vocal segregates' such as uh-uh, uh-huh,
and uh requires `less fine-grained' phonetic descriptions. The labels in the corpus included
annotations regarding prosody and voice, although this information is not shown
in this paper except where relevant.
The labels in the corpus are as seen in Table 1.
Due to concern that native knowledge of
English or theoretical predilections might bias phonetic judgments, about half
of the items, including all difficult cases, were labeled independently or
cross-checked by an advanced phonetics student with little experience of
conversational English and no knowledge of the hypotheses presented below. However no biases were found, and the
remaining items were labeled by the author alone.
3.4 Comparison to Alternative Approaches
Thus the method of analysis is
unusual. Moreover, as will be seen
in Section 5, it relies in part on subjective judgments. Although there are better established
and more powerful methods and theoretical frameworks, none of these seemed
quite appropriate for the task of attaining an integrative account of
non-lexical items. Thus the
approach taken here.
4. A Model of the Phonology
Revisiting Table 1, the variety of
non-lexical items is striking.
Phonological conditioning, a common cause of phonetic variety, can
provide little explanatory power here, since these items mostly occur in
isolation. This section shows how
most of the variation can be accounted for by a relatively simple model.
4.1 Intuitions about Non-lexical Expressions
Not only is the variety great, the set of
possible sounds in these roles appears not to be finite. For example, it would not be surprising
at all to hear the sound hm-ha-hn in conversation, or mm-ha-an,
or hm-haun and so on.
However, there are limits: not every possible non-lexical sound seems
likely to be used in conversation.
For example ziflug would seem a surprising novelty, and would be
downright weird in any of the functional positions typical for non-lexical
items. The existence of this
intuition --- that only certain non-lexical sounds are plausible in
conversation --- is a puzzle that has not previously been addressed.
There have, of course, been attempts to
describe the phonetics of such items by identifying all possible phonetic
components (Trager 1958; Poyatos 1975).
However the descriptive systems produced by these efforts cover wider
ranges of sounds, including moans, cries and belches, and so they do not help
with the task of circumscribing the set of conversational non-lexical items.
It is also possible to attempt to
describe the set of possible items in terms of a list. Although it is possible, for purposes
of linguistic theory, to postulate the existence of such a list, actually
making one is problematic. The
best attempts so far have been by researchers who are labeling corpora for
training speech recognizers, who of course have an immediate practical need for
some characterization of these sounds.
For example, the best current labeling of the largest conversation
corpus, Switchboard, uses a scheme (Hamaker et al. 1998) which specifies a
small finite list, where hesitations are represented with one of uh, ah, um,
hm and huh; `yes/no sounds’ are represented with one of uh-huh,
um-hum, huh-uh or hum-um `for
anything remotely resembling these sounds’; and `non-speech sounds during
conversations’ are represented with one of: `laughter’, `noise’ and
`vocalized-noise’. Comparison with
Table 1 reveals how much information is lost by using such a list. Moreover, no mere list can account for
intuitions about which sounds are
plausible: a description in terms of a list of 10 or 100 items gives no
explanation for why hum-ha-hn, but not ziflug, could be the 11th
or 101st observed token.
Of course a list-based model could be
embellished with descriptions of the permitted phonetic variations or sub-forms
--- as in Bolinger's discussion which starts with the claim that `Huh, hunh,
hm is [sic] our most versatile interjection’, and then turns around and
focuses on differences between these three forms. However such a hybrid approach seems unlikely to be concise
or to have much explanatory power.
Thus a satisfactory list-based account of
conversational non-lexical items seems likely to be elusive.
4.2 The Phonetic Components
I propose that many non-lexical
utterances in American English are formed compositionally from phonetic components
(leaving open the vexed question of whether these components are phonemes or
features (Marslen-Wilson & Warren 1994)). This claim is not without precedent: there are a number of
works which have, more or less independently, attempted to characterize variation
in non-lexical expressions in German, Japanese, and Swedish, and have done so
using tables of non-lexical items or lists of rules relating or distinguishing
different tokens (Ehlich 1986; Werner 1991; Takubo 1994; Takubo & Kinsui
1997; Kawamori et al. 1995; Shinozaki & Abe 1997; Ward 1998; Allwood &
Ahlsen 1999; Kokenawa et al. 2004).
These all imply the possibility of an analysis in terms of component
sounds.
This subsection describes the main
inventory of phonetic components in non-lexical conversational sounds in
American English.
• Schwa is often present, as seen in uh and uh-huh. (In conversation this is a schwa, although when stressed, in tokens produced in citation form, it appears as a lower back vowel.)
• An /a/ vowel can also be present, as seen in ah, which is distinct from schwa, at least for some speakers.
• An /o/ vowel occurs in some sounds, such as oh.
• An /e/ vowel occurs in yeah and occasionally elsewhere.
• /n/ and nasalization, of vowels or of the semivowel /j/, is a feature that can be present or absent, as seen in uh-hn (versus uh-huh), in uun (versus uh), in nyeah (versus yeah).
• /m/ can occur in isolation (mm) or as a component, as in um (versus uh), hm (versus huh) or myeah (versus yeah).
• /j/ occurs initially in yeah and variants thereof.
• /h/ occurs in isolation occasionally, as a noisy exhalation or a sigh. /h/ or breathiness is also present in items such as hm (versus mm), and in the back-channel uh-huh. Some such items involve breathiness throughout, others involve a consonantal /h/, while others are ambiguous between these two realizations.
• Tongue clicks occur often in isolation, and occasionally initially. (Specifically, there are cases where the click is followed by a voiced sound with no noticeable pause; the delay from the onset of the click to the onset of voicing ranged from 50 milliseconds to 170 milliseconds in the corpus for these cases.)
• Creaky voice (vocal fry) occurs often, including for example on aummm, yeah, okay, um, hm, aa. Creakiness sometimes spans the entire sound, but other times is present only towards the end.
[INSERT TABLE 3 ABOUT HERE]
The list above is summarized in Table
3. Although this summary may
suggest that these phonetic qualities are binary, for example nasalization
being either present or absent, it seems more likely that the phonetic
components are in fact non-categorical, involving `gradual, rather than binary,
oppositional character' (Jakobson & Waugh 1979). This explains how the set of non-lexical items generated can
be literally not finite.
It is also worth noting that the vowel
identifications are approximate.
Indeed, it is entirely likely that, as found for German hesitation
particles, `the vocalic portions ... have their own quality', distinct from
those used in lexical items (Patzold & Simpson 1995).
For expository convenience, this
phonological analysis is presented here, before the semantic analysis, although
in fact the set of relevant component sounds cannot be determined without
reference to meaning. Actually a
preliminary version of the semantic investigations described below was done
before the list of sound components was drawn up. This is why, for example, the inventory of sounds groups
together consonantal /h/ and breathiness, but not the nasals /m/ and /n/: the
first grouping, but not the second, has a consistent meaning, as will be seen.
[INSERT TABLE 4 ABOUT HERE]
The fact that this inventory of sounds is
fairly small makes it possible to concisely specify the phonetic values for all
the labels seen in Table 1. Thus
the non-obvious American English orthographic conventions for non-lexical items
are (slightly regularized) as summarized in Table 4. Other Englishes apparently have other conventions, for
example, British English uses er to represent a sound not unlike
American English uh (Biber et al. 1999). Further discussion of spelling appears elsewhere (Ward
2000b).
4.3 Rules for Combining Phonetic Components
The full phonological model includes the
above list of component sounds plus two rules for combining them.
The first way in which sounds are
combined is by superposition. For
example, a sound can be a schwa that is simultaneously also nasal and creaky.
The second way is concatenation. There are probably minor constraints on
this, for example /j/ and /e/ have very limited distributions, and click seems
to appear only initially. These
remain to be worked out.
There seems to be a tendency for these
sounds to have relatively few components, that is, the number of component
sounds in a non-lexical token generally is less than the average number of
phonemes in a word. There is also
a tendency, rather stronger, for the number of different sounds to be
few: most sounds have only one or two, and more than three is rare. This is also seen in the fact that these
sounds often involve repetition.
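As an informal illustration, which is not part of the original analysis, the two combination rules can be sketched in Python. The component labels, the function names superpose, concatenate and distinct_components, and the particular segmentation assumed for uh-huh are all inventions of this sketch; only the ten-component inventory and the two rules are taken from the model described above.

```python
# Illustrative sketch only: labels and function names are hypothetical.

# The ten component sounds listed in Section 4.2.
COMPONENTS = frozenset({
    "schwa", "a", "o", "e",   # vowels
    "n", "m", "j", "h",       # nasalization, /m/, /j/, /h/ or breathiness
    "click", "creaky",        # tongue click, creaky voice
})


def superpose(*components):
    """Rule 1: components sounding simultaneously form one segment."""
    segment = frozenset(components)
    unknown = segment - COMPONENTS
    if unknown:
        raise ValueError(f"not in the inventory: {sorted(unknown)}")
    return segment


def concatenate(*segments):
    """Rule 2: segments are strung together in time to form a token."""
    return tuple(segments)


def distinct_components(token):
    """Count the different component sounds used anywhere in a token."""
    return len(frozenset().union(*token))


# "A schwa that is simultaneously also nasal and creaky."
nasal_creaky_schwa = superpose("schwa", "n", "creaky")

# Assumed segmentation of uh-huh: schwa, then /h/, then schwa again.
uh_huh = concatenate(superpose("schwa"), superpose("h"), superpose("schwa"))

print(distinct_components(uh_huh))  # 2
```

In this representation uh-huh has three segments but only two different components, which matches the tendency toward few distinct sounds and toward repetition. A label outside the inventory, such as a plosive, is rejected with a ValueError, which is one crude way of expressing that the inventory is closed.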
4.4 The Power of the Phonological Model
The above components and rules constitute
a simple, first-pass model of the phonology of these sounds. In effect, this describes the space of
non-lexical utterances as based on `a phonological system which is different
from those employed in lexical items' (Patzold & Simpson 1995), although
the ultimate status of this phonological system remains to be determined.
However it is relatively easy to evaluate
the model for descriptive adequacy.
Ideally a model should generate all and only the non-lexical utterances
of English.
As far as generating only
non-lexical items, the model does reasonably well. The key explanatory factor is that the inventory of
component sounds excludes most of the phonemes present in lexical items,
including high vowels, plosives, and most fricatives. This provides a partial explanation for native speakers'
intuitions that only certain sounds are plausible as non-lexical items in
conversation. However this model
does overgenerate somewhat, although Section 7.3 explains how it can be
extended to reduce this.
As for generating all the
non-lexical items, the model also does fairly well. Evaluating it against the inventory of
grunts in the corpus, the phonological model accounts for 91% (=286/316). It achieves this performance because,
of course, it includes sound components not present in English lexical
items. However it does not account
for all the non-lexical items. The
exceptions fall into 4 categories.
First, there are 3 breath noises such as throat-clearings and noisy
inhalations. Second, there are 2 exclamations including rare sounds, namely achh
and yegh. Third, there are
5 items which only seem explicable as word fragments, extreme reductions or
dialectal items, such as i, nu and yei. Finally, there are 20 tokens with
phonemes missing from the model but normal for lexical English, including okay
and wow. This last set
includes items which are only marginally non-lexical, in the sense discussed in
Section 10.2, so it is not entirely surprising that the model fails to handle
them.
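The coverage figure can be rechecked from the counts just given. The following Python fragment is only a bookkeeping sketch: the variable names and category labels are this sketch's own, while the counts are those reported in the text.

```python
# Bookkeeping sketch; names are hypothetical, counts are from the text.
total_items = 316

exceptions = {
    "breath noises": 3,
    "exclamations including rare sounds": 2,
    "fragments, reductions or dialectal items": 5,
    "tokens with ordinary lexical phonemes": 20,
}

unaccounted = sum(exceptions.values())
accounted = total_items - unaccounted
coverage = accounted / total_items

print(f"{accounted}/{total_items} = {coverage:.0%}")  # 286/316 = 91%
```

The four exception categories sum to 30 items, leaving 286 that the model accounts for, so the categories and the reported 91% are mutually consistent.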
Thus, although the model is not perfect,
it accounts for rare non-lexical tokens and the common ones in the same
way. It is also more parsimonious
and explains intuitions better than the alternative, modeling these items with
a finite list of fixed forms. In
this sense, these sounds are truly non-lexical. Using this model as a base, subsequent sections extend the
analysis to deal with meaning and dialog roles.
5. Methods for Finding Sound-Meaning Correspondences
Thus it seems that these sounds can be
analyzed in terms of the composition of phonetic components. This leads inevitably to the question:
what do they mean? This is the
topic of this section.
Asking this question presupposes that
sound components of this size can bear meaning. While most morphemes are syllable-sized or larger units,
various studies have found a rich vein of sound-meaning mappings at a lower
level, or ``sound symbolism''.
That is, there exist phonesthemes, sounds which are smaller than normal
morphemes but still bear meaning.
The existence of such mappings is theoretically interesting in that they
violate Saussure's principle of the ``arbitrary nature of the sign'', which
postulates that the meaning of the whole cannot be predicted from the meanings
of the parts (de Saussure 1915/1959).
However there is a wealth of evidence that sound symbolism is often
productive in non-lexical items and also infuses large portions of the lexicon (Sapir
1929; Hinton et al. 1994; Abelin 1999; Magnus 2000). For example there appears to be a phonestheme common to
words like splash, crash, bash, and mash. In such cases some of the meaning of
the whole is predictable from the meanings of the component phonesthemes.
The specific mappings most commonly
identified in studies of sound-symbolism relate mostly to percepts, including
sounds, smells, tastes, feels, shapes, spatial configurations, and manners of
motion. Thus few of the mappings
previously identified seem relevant to non-lexical items with conversational
functions. There are a few
exceptions: some work has discussed or examined the possibility of a
sound-symbolic system operating in discourse particles and related items. Jakobson and Waugh (1979), Ameka (1992)
and Wharton (2003) have noted that sound symbolism may also be present in
interjections. Bolinger (1989), in
his discussion of exclamations and interjections, proposed specific meanings
for vowel height, vowel rounding, and various prosodic features in a variety of
non-lexical items, as detailed below.
Finally, Nenova et al. (2001) examined various non-lexical items in a
corpus of transcripts of task-oriented dialogs. Based on considerations of articulatory effort, they
proposed a distinction between `marked’ items, those which involve
nonsonorants, lengthening, multiple syllables or rounded, noncentral or tense
vowels, and `unmarked' items, those which are composed of only /m/ and
schwa. They showed that marked
items are more common as indicators of “dynamic participation”, as opposed to
the production of neutral back-channels during passive listening. The present paper goes beyond this
level of analysis to ascribe specific meanings to specific sounds.
The analysis methods used in this paper
combine and extend the methods used in these studies. Detailed discussion of the methodological issues appears
after an example of the analysis.
5.1 A First Example: /m/
In fillers, /m/ generally occurs while
the speaker is trying to decide whether to speak or trying to decide what to
say. This is illustrated in
Example 1, where the umm occurs before a substantial pause preceding a
restart of the explanation, in contrast to the uh, which occurs before
minor formulation difficulties.
There is a wealth of statistical and experimental evidence that uh
indicates a minor delay and um a major delay (Fox Tree 2001; Barr 2001;
Clark & Fox Tree 2002) although it may be that only speakers, not
listeners, make this distinction (Brennan & Williams 1995; Barr 2001). Also Smith and Clark (1993) have
observed, in the context of quizzes, that fillers um and am,
compared to uh and ah, generally seem to indicate more
thought. Also, the distributions
of uh, um and umm in Table 2 show that the presence of /m/
correlates with the tendency to appear as a filler, utterance-initial, rather
than as a simple disfluency.
Example 1: (discussing the effects of speaking rate on phonology)
1. E: going to be different than if they’re, uh, talking much more slowly,
2. X: um-hm
3. E: so, umm [3 second pause] so, uh, the stuff that we did at …
This meaning for /m/ is seen in back-channels also. The contemplation can be directed at various things, including trying to understand what the interlocutor is saying, trying to empathize with him, or trying to evaluate the truth or relevance of his statement. For example, in Example 2 M s