Non-Lexical Conversational Sounds in American English

 

Pragmatics and Cognition, to appear

version of February 16, 2005

 

 

 

Nigel Ward

University of Texas at El Paso

 

Acknowledgements: I thank Takeki Kamiyama for phonetic label checking, Gautam Keene and Andres Tellez for pragmatic function labeling and discussion, and all those who let me record their conversations.  For general discussion I thank Daniel Jurafsky and Kazutaka Maruyama.  I would also like to thank Keikichi Hirose, the Japanese Ministry of Education, the Sound Technology Promotion Foundation, the Nakayama Foundation, the Inamori Foundation, the International Communications Foundation, the Okawa Foundation and the National Science Foundation for support.  Most of this work was done at the University of Tokyo.

 

Nigel Ward 

nigelward@acm.org

phone: 915-747-6827

fax: 915-747-5030

http://www.cs.utep.edu/nigel/

Computer Science,

University of Texas at El Paso

El Paso, TX 79968-0518


Biographical Note: Nigel Ward received a Ph.D. from the University of California at Berkeley in 1991.  From 1991 to 2002 he was with the University of Tokyo.  His primary research interest is human-computer interaction, especially sub-second responsiveness in spoken dialog systems.


Abstract:  Sounds like h-nmm, hh-aaaah, hn-hn, unkay, nyeah, ummum, uuh, um-hm-uh-hm, um and uh-huh occur frequently in American English conversation but have thus far escaped systematic study.  This article reports a study of both the forms and functions of such tokens in a corpus of American English conversations.  These sounds appear not to be lexical, in that they are productively generated rather than finite in number, and in that the sound-meaning mapping is compositional rather than arbitrary.  This implies that English bears within it a small specialized sub-language which follows different rules from the language as a whole.  The functions supported by this sub-language complement those of main-channel English; they include low-overhead control of turn-taking, negotiation of agreement, signaling of recognition and comprehension, management of interpersonal relations such as control and affiliation, and the expression of emotion, attitude, and affect.


1. Introduction

 

American English conversations are sprinkled with a large variety of non-lexical sounds, as suggested by Table 1.  Along with such familiar items as oh, um, and uh-huh, there are a large number of less common sounds such as h-nmm, hh-aaaah, hn-hn, unkay, nyeah, ummum, uuh and um-hm-uh-hm.  Similar variety is also seen in Swedish (Allwood & Ahlsen 1999), German (Batliner et al. 1995) and Japanese (Ward 1998).

 

[INSERT TABLE 1 ABOUT HERE]

 

While aspects of non-lexical items in conversation have been studied, the less common sounds have mostly escaped notice.  In particular four basic questions have not been raised, much less addressed: first, the reason for such a large variety of sounds, second, what they all mean, third, their role in human communication, and fourth, their cognitive status.

 

The structure of this paper is as follows.  The first three sections illustrate the phenomena, survey the current state of knowledge, explain the practical importance, and outline the overall approach.  Section 4 presents a phonetic description and argues that most non-lexical conversational items, including both the rare and the common forms, are productive combinations of 10 component sounds.  Sections 5, 6, and 8 present meanings for each of these component sounds and evaluate the power of a Compositional Model, in which the meaning of a non-lexical token is the sum of the meanings of the component sounds.  The methods used to identify and check these meanings are presented as they arise, but mostly in Sections 2, 5, and 7.  Sections 9 and 10 explore how the model helps clarify the role of non-lexical utterances in human communication and their relationship to phenomena such as interjection and laughter.  Section 11 summarizes.

 

2. The Need for an Integrative Account

 

For several reasons an integrative account of non-lexical items in conversation is needed.  Although aspects of these phenomena have been addressed by a large number of studies, undertaken with a variety of aims, there has as yet been no attempt to integrate the findings.  This section explains why it is worth doing so.

 

First, although there are many studies which have focused on one or a few of these items, the big picture has been missing.  That is, there has been no attempt to explain how these items function as a system, meaning that, for example, there is no account of how speakers choose among these items, especially the less common ones.

 

This lack hinders the construction of more useful spoken-dialog systems, in that non-lexical items have the potential to let spoken dialog systems give the user better, more motivating feedback, to deliver information more efficiently and smoothly, and in general to make human-computer interaction more pleasant (Schmandt 1994; Shinozaki and Abe 1998; Thorisson 1996; Rajan et al. 2001; Iwase & Ward 1998; Ward 2000a; Ward & Tsukahara 2003).  This lack also hampers learners of English as a second language (Gardner 1998).  Today there is no model or resource that describes even approximately, for example, the relation between uh and uh-huh, the ways in which the meaning of uh-huh resembles and differs from that of uh-hn, and when people use myeah instead of yeah.  Thus, as a supplement to more detailed studies, a big-picture account would have great practical value.

 

[INSERT TABLE 2 ABOUT HERE]

 

Second, although there have been detailed studies of non-lexical utterances within certain roles, especially disfluencies and back-channels, there has been little work looking at the distribution of non-lexical items across such roles.  This lack of category-spanning studies is unfortunate since, as McCarthy (2003) notes, many of these sounds are multi-functional.  This is seen also in Table 2: for example, oh occurs both as a back-channel and turn-initially.  An integrative account has the potential to reveal broader generalizations.

 

Third, although there have been on the one hand several phonetically sensitive studies of non-lexical utterances, and on the other hand many pragmatically sophisticated studies of their use in conversation and a few controlled experiments, there has been little connection between the two: the phonetically sensitive work has said little about those variations which are common in conversation or cognitively significant, and conversely the work based on conversation or dialog data has not paid much attention to phonetic variation.  An integrative account, looking at variations in form and variations in meaning together, has the potential to improve our understanding of both aspects.

 

Ultimately, of course, the reason to seek an integrative account lies in the hope that it will be simpler overall.

 

3. Approach

 

To seek an integrative account it was necessary to approach the phenomena in a novel way.

 

3.1 Working with a Mid-Size Corpus

 

The basic strategy adopted was to take a mid-sized corpus of casual conversations and try to understand and explain everything about all of the non-lexical utterances.  By looking at all occurrences it was easier to notice the relations between items and to examine items across a variety of functional and positional roles.

 

Conversations were used, rather than task-oriented dialogs or controlled dialog fragments, to allow the study of diverse dialogs and rich interactions, giving a broader view of when and how non-lexical utterances are used.

 

Analysis was limited to a mid-size corpus, rather than a large one, in order to allow a reasonably thorough examination of the phonetics and pragmatics of each occurrence.  This also made it possible for all the analysis to be done by listening directly to the data, without having to rely on transcriptions.

 

A home-made corpus, rather than a standard one, was used because the author was familiar with it, as the sound engineer recording the conversations, as a friend or acquaintance of most of the conversants, and as a participant in a few of the conversations.  (The author's own non-lexical utterances were excluded from the analysis.)  The extra information this gave was often helpful when interpreting ambiguous utterances.

 

The corpus used includes 13 different speakers, male and female, all American, aged from 20 to 50ish, from a variety of geographical areas.  Most of the conversations were recorded for another purpose (Ward & Tsukahara 200) and participants were not informed of the interest in non-lexical utterances.  In some cases people were brought together to converse and be recorded; in others the conversations were already in progress.  All recordings had only two speakers, and in most cases these two were doing nothing but conversing with each other, although some conversations included interactions with other people or pets, and one speaker was driving.  Recording locations included the laboratory, living rooms, a conference room, a hotel lobby, a restaurant, and a car.  The relationships between conversants ranged from relatives to close friends to acquaintances to strangers.  Most conversations were recorded in stereo with head-mounted microphones; one was a telephone conversation.

 

3.2 Looking at a Wide Variety of Items

 

Given this corpus, the first thing to do was to identify all the non-lexical items.  To avoid missing anything that might be relevant, the initial definition was made inclusive.  Specifically, all sounds which were not laughter and not words were labeled as non-lexical items.  A `word' was considered to be a sound having 1. a clear meaning, 2. the ability to participate in syntactic constructions, and 3. a phonotactically normal pronunciation.  For example, uh-huh is not a word since it has no referential meaning, has no syntactic affinities, and has salient breathiness.  Although the distinction between words and non-lexical items is not clear-cut, as will be seen, this gave a reasonable way to pick out an initial set of sounds to examine.

 

To keep the scope manageable, attention was limited to sounds which seemed at least in part directed at the interlocutor, rather than being purely self-directed, even if the communicative significance was not clear.  This ruled out stutters and inbreaths.

 

The corpus has 316 non-lexical items, with one occurring about every 5 seconds on average.

 

3.3 Listening to the Data

 

Rather than working from transcripts, all analysis was done by listening.  This probably helped focus attention on the interpersonal aspects of the dialogs, rather than the information content.  This research style was facilitated by the use of a special-purpose software tool for the analysis of conversational phenomena, didi (Ward 2003).

 

However, it being important to pay attention to the detailed sounds of non-lexical items, these were labeled phonetically.  These labels were always visible while listening.

 

The phonetic labeling was done using normal English orthography, as discussed below.  IPA was not used as it provides more detail than was needed, potentially obscuring generalizations.  This is a common choice in studying dialog; for example, Trager (1958) argued that the study of `vocal segregates' such as uh-uh, uh-huh, and uh requires `less fine-grained' phonetic descriptions.  The labels in the corpus included annotations regarding prosody and voice, although this information is not shown in this paper except where relevant.  The labels in the corpus are as seen in Table 1.

 

Due to concern that native knowledge of English or theoretical predilections might bias phonetic judgments, about half of the items, including all difficult cases, were labeled independently or cross-checked by an advanced phonetics student with little experience of conversational English and no knowledge of the hypotheses presented below.  However no biases were found, and the remaining items were labeled by the author alone.

 

3.4 Comparison to Alternative Approaches

 

Thus the method of analysis is unusual.  Moreover, as will be seen in Section 5, it relies in part on subjective judgments.  Although there are better established and more powerful methods and theoretical frameworks, none of these seemed quite appropriate for the task of attaining an integrative account of non-lexical items.  Thus the approach taken here.

 

 

4. A Model of the Phonology

 

Revisiting Table 1, the variety of non-lexical items is striking.  Phonological conditioning, a common cause of phonetic variety, can provide little explanatory power here, since these items mostly occur in isolation.  This section shows how most of the variation can be accounted for by a relatively simple model.

 

4.1 Intuitions about Non-lexical Expressions

 

Not only is the variety great, the set of possible sounds in these roles appears not to be finite.  For example, it would not be surprising at all to hear the sound hm-ha-hn in conversation, or mm-ha-an, or hm-haun and so on.  However, there are limits: not every possible non-lexical sound seems likely to be used in conversation.  For example ziflug would seem a surprising novelty, and would be downright weird in any of the functional positions typical for non-lexical items.  The existence of this intuition --- that only certain non-lexical sounds are plausible in conversation --- is a puzzle that has not previously been addressed.

 

There have, of course, been attempts to describe the phonetics of such items by identifying all possible phonetic components (Trager 1958; Poyatos 1975).  However the descriptive systems produced by these efforts cover wider ranges of sounds, including moans, cries and belches, and so they do not help with the task of circumscribing the set of conversational non-lexical items.

 

It is also possible to attempt to describe the set of possible items in terms of a list.  Although it is possible, for purposes of linguistic theory, to postulate the existence of such a list, actually making one is problematic.  The best attempts so far have been by researchers who are labeling corpora for training speech recognizers, who of course have an immediate practical need for some characterization of these sounds.  For example, the best current labeling of the largest conversation corpus, Switchboard, uses a scheme (Hamaker et al. 1998) which specifies a small finite list, where hesitations are represented with one of uh, ah, um, hm and huh; `yes/no sounds' are represented with one of uh-huh, um-hum, huh-uh or hum-um `for anything remotely resembling these sounds'; and `non-speech sounds during conversations' are represented with one of: `laughter', `noise' and `vocalized-noise'.  Comparison with Table 1 reveals how much information is lost by using such a list.  Moreover, no mere list can account for intuitions about which sounds are plausible: a description in terms of a list of 10 or 100 items gives no explanation for why hm-ha-hn, but not ziflug, could be the 11th or 101st observed token.

 

Of course a list-based model could be embellished with descriptions of the permitted phonetic variations or sub-forms --- as in Bolinger's discussion which starts with the claim that `Huh, hunh, hm is [sic] our most versatile interjection', and then turns around and focuses on differences between these three forms.  However, such a hybrid approach seems unlikely to be concise or to have much explanatory power.

 

Thus a satisfactory list-based account of conversational non-lexical items seems likely to be elusive.

 

4.2 The Phonetic Components

 

I propose that many non-lexical utterances in American English are formed compositionally from phonetic components (leaving open the vexed question of whether these components are phonemes or features (Marslen-Wilson & Warren 1994)).  This claim is not without precedent: there are a number of works which have, more or less independently, attempted to characterize variation in non-lexical expressions in German, Japanese, and Swedish, and have done so using tables of non-lexical items or lists of rules relating or distinguishing different tokens (Ehlich 1986; Werner 1991; Takubo 1994; Takubo & Kinsui 1997; Kawamori et al. 1995; Shinozaki & Abe 1997; Ward 1998; Allwood & Ahlsen 1999; Kokenawa et al. 2004).  These all imply the possibility of an analysis in terms of component sounds.

 

This subsection describes the main inventory of phonetic components in non-lexical conversational sounds in American English.

 

•  Schwa is often present, as seen in uh and uh-huh.  (In conversation this is a schwa, although when stressed, in tokens produced in citation form, it appears as a lower back vowel.)

•  An /a/ vowel can also be present, as seen in ah, which is distinct from schwa, at least for some speakers.

•  An /o/ vowel occurs in some sounds, such as oh.

•  An /e/ vowel occurs in yeah and occasionally elsewhere.

•  /n/ and nasalization, of vowels or of the semivowel /j/, is a feature that can be present or absent, as seen in uh-hn (versus uh-huh), in uun (versus uh), and in nyeah (versus yeah).

•  /m/ can occur in isolation (mm) or as a component, as in um (versus uh), hm (versus huh) or myeah (versus yeah).

•  /j/ occurs initially in yeah and variants thereof.

•  /h/ occurs in isolation occasionally, as a noisy exhalation or a sigh.  /h/ or breathiness is also present in items such as hm (versus mm), and in the back-channel uh-huh.  Some such items involve breathiness throughout, others involve a consonantal /h/, while others are ambiguous between these two realizations.

•  Tongue clicks occur often in isolation, and occasionally initially.  (Specifically, there are cases where the click is followed by a voiced sound with no noticeable pause; the delay from the onset of the click to the onset of voicing ranged from 50 milliseconds to 170 milliseconds in the corpus for these cases.)

•  Creaky voice (vocal fry) occurs often, including for example on aummm, yeah, okay, um, hm, aa.  Creakiness sometimes spans the entire sound, but other times is present only towards the end.

 

 

[INSERT TABLE 3 ABOUT HERE]

 

The list above is summarized in Table 3.  Although this summary may suggest that these phonetic qualities are binary, for example nasalization being either present or absent, it seems more likely that the phonetic components are in fact non-categorical, involving `gradual, rather than binary, oppositional character' (Jakobson & Waugh 1979).  This explains how the set of non-lexical items generated can be literally not finite.

 

It is also worth noting that the vowel identifications are approximate.  Indeed, it is entirely likely that, as found for German hesitation particles, `the vocalic portions ... have their own quality', distinct from those used in lexical items (Patzold & Simpson 1995).

 

For expository convenience, this phonological analysis is presented here, before the semantic analysis, although in fact the set of relevant component sounds cannot be determined without reference to meaning.  Actually a preliminary version of the semantic investigations described below was done before the list of sound components was drawn up.  This is why, for example, the inventory of sounds groups together consonantal /h/ and breathiness, but not the nasals /m/ and /n/: the first grouping, but not the second, has a consistent meaning, as will be seen.

 

[INSERT TABLE 4 ABOUT HERE]

 

The fact that this inventory of sounds is fairly small makes it possible to concisely specify the phonetic values for all the labels seen in Table 1.  Thus the non-obvious American English orthographic conventions for non-lexical items are (slightly regularized) as summarized in Table 4.  Other Englishes apparently have other conventions; for example, British English uses er to represent a sound not unlike American English uh (Biber et al. 1999).  Further discussion of spelling appears elsewhere (Ward 2000b).

 

 

4.3 Rules for Combining Phonetic Components

 

The full phonological model includes the above list of component sounds plus two rules for combining them.

 

The first way in which sounds are combined is by superposition.  For example, a sound can be a schwa that is simultaneously also nasal and creaky.

 

The second way is concatenation.  There are probably minor constraints on this, for example /j/ and /e/ have very limited distributions, and click seems to appear only initially.  These remain to be worked out.
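The two combination rules can be sketched in code.  The following is only an illustrative data structure, not part of the analysis itself: the class and function names are hypothetical, the components are treated as binary although Section 4.2 suggests they are gradual, and the distributional constraints on /j/, /e/ and clicks are not modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class Component(Enum):
    """The ten component sounds listed in Section 4.2."""
    SCHWA = "schwa"
    A = "/a/"
    O = "/o/"
    E = "/e/"
    NASAL = "/n/ or nasalization"
    M = "/m/"
    J = "/j/"
    H = "/h/ or breathiness"
    CLICK = "tongue click"
    CREAKY = "creaky voice"


@dataclass(frozen=True)
class Segment:
    """Superposition: a set of components that sound at the same time."""
    components: FrozenSet[Component]


# Concatenation: a token is an ordered sequence of segments.
Token = Tuple[Segment, ...]


def superpose(*components: Component) -> Segment:
    """Combine components simultaneously into one segment."""
    return Segment(frozenset(components))


def concatenate(*segments: Segment) -> Token:
    """Combine segments one after another into a token."""
    return tuple(segments)


def distinct_components(token: Token) -> int:
    """Count how many different component sounds a token uses."""
    return len(set().union(*(seg.components for seg in token)))


# A schwa that is simultaneously nasal and creaky (one segment).
nasal_creaky_schwa = superpose(Component.SCHWA, Component.NASAL, Component.CREAKY)

# A rough rendering of "um": schwa followed by /m/ (two segments).
um = concatenate(superpose(Component.SCHWA), superpose(Component.M))
```

In this representation a token such as um uses two distinct components, in line with the tendency noted below for tokens to draw on only one or two different sounds.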

 

There seems to be a tendency for these sounds to have relatively few components, that is, the number of component sounds in a non-lexical token generally is less than the average number of phonemes in a word.  There is also a tendency, rather stronger, for the number of different sounds to be few: most sounds have only one or two, and more than three is rare.  This is also seen in the fact that these sounds often involve repetition.

 

4.4 The Power of the Phonological Model

 

The above components and rules constitute a simple, first-pass model of the phonology of these sounds.  In effect, this describes the space of non-lexical utterances as based on `a phonological system which is different from those employed in lexical items' (Patzold & Simpson 1995), although the ultimate status of this phonological system remains to be determined.

 

However it is relatively easy to evaluate the model for descriptive adequacy.  Ideally a model should generate all and only the non-lexical utterances of English.

 

As far as generating only non-lexical items, the model does reasonably well.  The key explanatory factor is that the inventory of component sounds excludes most of the phonemes present in lexical items, including high vowels, plosives, and most fricatives.  This provides a partial explanation for native speakers' intuitions that only certain sounds are plausible as non-lexical items in conversation.  However, this model does overgenerate somewhat, although Section 7.3 explains how it can be extended to reduce this.
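The exclusion argument can be made concrete with a crude filter over orthographic labels.  This is a sketch under strong assumptions: the letter set below is a hypothetical stand-in for the component inventory (Table 4, which gives the actual spelling conventions, is not reproduced here), clicks and creaky voice have no spelling at all, and a letter-level test cannot tell, for instance, the u that spells schwa from a high vowel, so it overgenerates more than the model does.

```python
# Hypothetical orthographic stand-ins for the component sounds:
# u/a/o/e for the vowels, n and m for the nasals, y for /j/, h for /h/.
ALLOWED_LETTERS = set("uaoenmyh")
SEPARATORS = set("-")


def is_generable(label: str) -> bool:
    """Return True if every letter of the label is in the assumed inventory."""
    letters = [ch for ch in label.lower() if ch not in SEPARATORS]
    return bool(letters) and all(ch in ALLOWED_LETTERS for ch in letters)


for label in ["hm-ha-hn", "um-hm-uh-hm", "nyeah", "ziflug", "okay", "wow"]:
    print(label, is_generable(label))
```

Under these assumptions hm-ha-hn and nyeah pass, while ziflug fails on its high vowel, fricatives, lateral and plosive, and okay and wow fail because they contain phonemes that the inventory lacks.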

 

As far as generating all the non-lexical items, the model does fairly well also.  Evaluating it against the inventory of grunts in the corpus, the phonological model accounts for 91% (=286/316).  It achieves this performance because, of course, it includes sound components not present in English lexical items.  However it does not account for all the non-lexical items.  The exceptions fall into 4 categories.  First, there are 3 breath noises such as throat-clearings and noisy inhalations.  Second, there are 2 exclamations including rare sounds, namely achh and yegh.  Third, there are 5 items which only seem explicable as word fragments, extreme reductions or dialectal items, such as i, nu and yei.  Finally, there are 20 tokens with phonemes missing from the model but normal for lexical English, including okay and wow.  This last set includes items which are only marginally non-lexical, in the sense discussed in Section 10.2, so it is not entirely surprising that the model fails to handle them.
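As a check on the arithmetic, the four exception categories can be tallied against the corpus total.  The counts are those given in the text; the variable names are illustrative only.

```python
TOTAL_ITEMS = 316  # non-lexical items in the corpus

# Items the phonological model does not account for, by category.
exceptions = {
    "breath noises": 3,
    "exclamations with rare sounds": 2,
    "word fragments, reductions, dialectal items": 5,
    "tokens with phonemes normal for lexical English": 20,
}

unexplained = sum(exceptions.values())
explained = TOTAL_ITEMS - unexplained
coverage = explained / TOTAL_ITEMS

print(f"{explained}/{TOTAL_ITEMS} = {coverage:.1%}")
```

The four categories sum to 30 items, leaving 286 of 316, or about 90.5%, which is the figure reported as 91% after rounding.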

 

Thus, although the model is not perfect, it accounts for rare non-lexical tokens and the common ones in the same way.  It is also more parsimonious and explains intuitions better than the alternative, modeling these items with a finite list of fixed forms.  In this sense, these sounds are truly non-lexical.  Using this model as a base, subsequent sections extend the analysis to deal with meaning and dialog roles.

 

5. Methods for Finding Sound-Meaning Correspondences

 

Thus it seems that these sounds can be analyzed in terms of the composition of phonetic components.  This leads inevitably to the question: what do they mean?  This is the topic of this section.

 

Asking this question presupposes that sound components of this size can bear meaning.  While most morphemes are syllable-sized or larger units, various studies have found a rich vein of sound-meaning mappings at a lower level, or ``sound symbolism''.  That is, there exist phonesthemes, sounds which are smaller than normal morphemes but still bear meaning.  The existence of such mappings is theoretically interesting in that they violate Saussure's principle of the ``arbitrary nature of the sign'', which postulates that the meaning of the whole cannot be predicted from the meanings of the parts (de Saussure 1915/1959).  However there is a wealth of evidence that sound symbolism is often productive in non-lexical items and also infuses large portions of the lexicon (Sapir 1929; Hinton et al. 1994; Abelin 1999; Magnus 2000).  For example there appears to be a phonestheme common to words like splash, crash, bash, and mash.  In such cases some of the meaning of the whole is predictable from the meanings of the component phonesthemes.

 

The specific mappings most commonly identified in studies of sound-symbolism relate mostly to percepts, including sounds, smells, tastes, feels, shapes, spatial configurations, and manners of motion.  Thus few of the mappings previously identified seem relevant to non-lexical items with conversational functions.  There are a few exceptions: some work has discussed or examined the possibility of a sound-symbolic system operating in discourse particles and related items.  Jakobson and Waugh (1979), Ameka (1992) and Wharton (2003) have noted that sound symbolism may also be present in interjections.  Bolinger (1989), in his discussion of exclamations and interjections, proposed specific meanings for vowel height, vowel rounding, and various prosodic features in a variety of non-lexical items, as detailed below.  Finally, Nenova et al. (2001) examined various non-lexical items in a corpus of transcripts of task-oriented dialogs.  Based on considerations of articulatory effort, they proposed a distinction between `marked' items, those which involve nonsonorants, lengthening, multiple syllables or rounded, noncentral or tense vowels, and `unmarked' items, those which are composed of only /m/ and schwa.  They showed that marked items are more common as indicators of ``dynamic participation'', as opposed to the production of neutral back-channels during passive listening.  The present paper goes beyond this level of analysis to ascribe specific meanings to specific sounds.
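The marked/unmarked distinction attributed above to Nenova et al. (2001) amounts to a disjunction of features, which can be sketched as follows.  The field names are hypothetical, and deciding which features a given token actually has is left to the analyst; nothing here reproduces their procedure.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class ItemFeatures:
    """Effort-related properties of one non-lexical item (all default to absent)."""
    has_nonsonorant: bool = False
    lengthened: bool = False
    multisyllabic: bool = False
    has_rounded_vowel: bool = False
    has_noncentral_vowel: bool = False
    has_tense_vowel: bool = False


def is_marked(item: ItemFeatures) -> bool:
    """An item is marked if it shows any one of the listed properties."""
    return any(astuple(item))


# An item made only of /m/ and schwa, with no lengthening, is unmarked.
plain = ItemFeatures()

# An item with a rounded vowel counts as marked.
rounded = ItemFeatures(has_rounded_vowel=True)
```

The point of the sketch is that this classification yields a single binary label per item, whereas the analysis in the present paper assigns a separate meaning to each component sound.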

 

The analysis methods used in this paper combine and extend the methods used in these studies.  Detailed discussion of the methodological issues appears after an example of the analysis.

 

5.1 A First Example: /m/

 

In fillers, /m/ generally occurs while the speaker is trying to decide whether to speak or trying to decide what to say.  This is illustrated in Example 1, where the umm occurs before a substantial pause preceding a restart of the explanation, in contrast to the uh, which occurs before minor formulation difficulties.  There is a wealth of statistical and experimental evidence that uh indicates a minor delay and um a major delay (Fox Tree 2001; Barr 2001; Clark & Fox Tree 2002) although it may be that only speakers, not listeners, make this distinction (Brennan & Williams 1995; Barr 2001).  Also Smith and Clark (1993) have observed, in the context of quizzes, that fillers um and am, compared to uh and ah, generally seem to indicate more thought.  Also, the distributions of uh, um and umm in Table 2 show that the presence of /m/ correlates with the tendency to appear as a filler, utterance-initial, rather than as a simple disfluency.

 

Example 1:   (discussing the effects of speaking rate on phonology)

1.  E: going to be different than if they’re, uh, talking much more slowly,

2.  X: um-hm

3.  E: so, umm [3 second pause] so, uh, the stuff that we did at …
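The contrast between uh and umm in Example 1 can be expressed as a minimal sketch of the Compositional Model, in which a token's meaning is assembled from the meanings of its components.  Only the gloss for /m/ comes from the discussion above; the table layout, the component names and the function are illustrative assumptions, and components without an entry simply contribute nothing here.

```python
from typing import Dict, Iterable, List

# Gloss taken from the discussion of /m/ above; other components are
# deliberately left out rather than guessed at.
COMPONENT_MEANINGS: Dict[str, str] = {
    "m": "contemplation",
}


def compose_meaning(components: Iterable[str]) -> List[str]:
    """Collect the glosses of a token's components, without repeats."""
    glosses = (COMPONENT_MEANINGS[c] for c in components if c in COMPONENT_MEANINGS)
    return list(dict.fromkeys(glosses))


print(compose_meaning(["schwa"]))            # uh
print(compose_meaning(["schwa", "m", "m"]))  # umm
```

On this sketch uh carries no contemplation gloss while umm does, matching the observation that the umm in Example 1 precedes a substantial pause and restart.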

 

This meaning for /m/ is seen in back-channels also.  The contemplation can be directed at various things, including trying to understand what the interlocutor is saying, trying to empathize with him, or trying to evaluate the truth or relevance of his statement.  For example, in Example 2 M s