A RESPONSIVE DIALOG SYSTEM

Nigel Ward and Wataru Tsukahara

To appear in Machine Conversations (provisional title), edited by Yorick Wilks. Springer Verlag.

Mechano-Informatic Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113 Japan
+81-3-3182-2111 ext. 6282, fax: +81-3-3815-8356
nigel@sanpo.t.u-tokyo.ac.jp http://www.sanpo.t.u-tokyo.ac.jp/~nigel/
tsuka@sanpo.t.u-tokyo.ac.jp http://www.sanpo.t.u-tokyo.ac.jp/~tsuka/

Abstract: Being responsive is important in dialog. In particular, back-channel feedback is essential to human conversations. Back-channel feedback is sometimes produced without thinking, in response to simple prosodic clues. A simple implementation of this behavior produces natural-sounding responses in conversation with live human subjects.

MOTIVATION

Modeling language as people really use it is an elusive goal. Today, thanks to advances in speech recognition, there exist dialog systems capable of understanding the meaning of user input and replying with appropriate information, but there are as yet no systems which interact naturally with humans. Two problems are:

1. Priority is given to understanding and responding accurately; but in human dialog, being responsive and interactive is also important.

2. The granularity of interaction is the sentence; but in human dialog, interaction happens frequently, in real time, often with overlapping utterances.

Given that such responsiveness is important to human language use, the question arises: how do we build systems with these abilities? The obvious approach is to add these abilities to a meaning-based speech system. An alternative approach is to take these abilities as a basic foundation, and to layer meaning-based processing on top of this, subsumption-style (Ward 1997).

PHENOMENON

Back-channel feedback, also called ``listener responses'', is produced by one participant as a response that does not interfere with utterances by the other participant (Ward and Tsukahara, submitted). In American English `yeah', `mm' and `uh-huh' are typical back-channels; in Japanese `un' is most typical. Producing back-channel feedback is essential to being a good conversation partner; if it is lacking, the conversation tends to die out. Back-channel feedback is an example of ``responsiveness'', which is important in spoken dialog between humans, and probably also in human-computer systems (Johnstone et al. 1995; Ward 1997).

ANALYSIS

Many have sought the perceptual clue that tells a participant ``it's now time to produce back-channel feedback''. It has often been speculated that this clue from the speaker would be prosodic, rather than involving meaning. In search of this clue we examined the prosodic environments of back-channel feedback in corpora of natural Japanese and English conversations. Potential clues considered included pitch contours, vowel lengthening or speaking-rate slowdown, volume increase or decrease on final syllables, a low pitch point, and gross energy level changes, following suggestions in the literature, as discussed in Ward (1996) and Ward and Tsukahara (submitted). None of these appeared to correlate strongly with whether back-channel feedback was produced. There was, however, one good clue: a region of low pitch. While we have no real proof yet that this is actually a trigger for a reflex response in people, it is certainly usable in spoken dialog systems.

It is commonly thought that silence (at the end of a speaker's turn) is a major clue for back-channel feedback. This is probably true in business-like transactions between strangers, but not in more casual interactions. In the latter, the low-pitch clue accounts both for back-channel feedback produced after the speaker has paused or stopped, and for that which overlaps the speaker's continued utterance.
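To make the clue concrete, here is a minimal sketch of low-pitch-region detection. It is not our implementation: it assumes a pitch track sampled every 10 ms by some external pitch tracker, with unvoiced frames reported as None, and the function name and frame size are invented. The specific threshold and duration values are those given under RESULTS below.

    # Minimal sketch: find regions where pitch stays below the speaker's
    # Nth-percentile pitch level for at least min_ms milliseconds.
    def low_pitch_regions(pitch_track, percentile=28, min_ms=110, frame_ms=10):
        """Yield (start_ms, end_ms) spans of sufficiently long low-pitch regions.
        pitch_track: one pitch value (Hz) per frame, None when unvoiced."""
        voiced = sorted(p for p in pitch_track if p is not None)
        if not voiced:
            return
        threshold = voiced[int(len(voiced) * percentile / 100)]
        start = None
        for i, p in enumerate(pitch_track):
            if p is not None and p < threshold:
                if start is None:
                    start = i
            else:
                if start is not None and (i - start) * frame_ms >= min_ms:
                    yield (start * frame_ms, i * frame_ms)
                start = None
        if start is not None and (len(pitch_track) - start) * frame_ms >= min_ms:
            yield (start * frame_ms, len(pitch_track) * frame_ms)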
RESULTS

Using correspondence to corpus data as the criterion, we sought the rule which best models human behavior. For Japanese, the best we have found so far is as follows. Upon detection of
--- a region of pitch less than the 28th-percentile pitch level,
--- continuing for at least 110 milliseconds,
--- coming after at least 700 ms of speech,
--- provided you have not output back-channel feedback within the preceding 1.0 seconds,
then 350 ms later you should produce back-channel feedback.

Testing the predictions of this rule against the corpus of human conversations gives a coverage of 50% (half of the back-channels were predicted) at an accuracy of 34% (a third of the predictions were correct), over all speakers and all dialog types. Performance was better for friendly, attentive listeners and for conversation portions that involved narrative or explanation. About half of the false predictions seem to be due to inter-speaker differences; thus the rule does much better when judged as a model of a specific speaker (Ward and Tsukahara, submitted).

For English speakers, the best prediction rule has somewhat different parameters. Upon detection of
--- a region of pitch less than the 23rd-percentile pitch level,
--- continuing for at least 120 milliseconds,
--- coming after at least 700 ms of speech,
--- provided you have not output back-channel feedback within the preceding 0.9 seconds,
then 700 ms later you should produce back-channel feedback.

For the English corpus, this achieved a coverage of 39% (141/360) at an accuracy of 19% (141/735).
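The full rule can be sketched in code, building on the low_pitch_regions() sketch above. The numeric parameters are those stated in the text; the 10 ms frame size, the way "700 ms of speech" is approximated (total voiced time so far), the way the refractory period is measured, and all names are our assumptions, not the authors' implementation.

    # Sketch of the complete prediction rule, for both parameter sets.
    JAPANESE = dict(percentile=28, min_ms=110, min_speech_ms=700,
                    refractory_ms=1000, delay_ms=350)
    ENGLISH = dict(percentile=23, min_ms=120, min_speech_ms=700,
                   refractory_ms=900, delay_ms=700)

    def backchannel_times(pitch_track, frame_ms=10, percentile=28, min_ms=110,
                          min_speech_ms=700, refractory_ms=1000, delay_ms=350):
        """Return the times (ms) at which the rule predicts back-channel feedback."""
        def speech_before(t_ms):
            # crude approximation: total voiced time observed up to t_ms
            n = t_ms // frame_ms
            return sum(frame_ms for p in pitch_track[:n] if p is not None)

        outputs, last_trigger = [], -10**9
        for start_ms, end_ms in low_pitch_regions(pitch_track, percentile,
                                                  min_ms, frame_ms):
            trigger = start_ms + min_ms       # all rule conditions first hold here
            if (speech_before(trigger) >= min_speech_ms
                    and trigger - last_trigger >= refractory_ms):
                outputs.append(trigger + delay_ms)   # respond delay_ms later
                last_trigger = trigger
        return outputs

For example, backchannel_times(track, **ENGLISH) applies the English parameters to a pitch track.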
EXPERIMENTS

We built a system to find out how well the above rules would perform in live conversation. There were two critical issues.

One issue was which back-channels to produce. For Japanese, it turned out to be acceptable to always produce `un', the most common and most neutral back-channel, with falling pitch. For English, `uh-huh' and `mm' were acceptable. Since always producing the same token sounded mechanical, we used two in alternation, or three with random selection (see the sketch at the end of this section).

Another issue was how to get people to try to interact naturally with the system. The only solution was to fool them into thinking they were interacting with a person, so we used a human decoy to jump-start the conversation. For the initial experiments we used a partition, so that the subject couldn't see when it was the system that was responding; later we ran the system over the telephone. The outputs of the system were recordings of back-channels produced by the decoy, not synthesized speech. To make it impossible for subjects to distinguish between the decoy's live voice and the system's contributions, we distorted both slightly by over-amplifying them.

We hypothesized that back-channel feedback produced according to the rule would sound natural, and would permit the conversation to proceed normally. Conversely, we hypothesized that inappropriate back-channel feedback would seem unfriendly or unnatural, be annoying, or kill the conversation. We found that back-channel feedback in response to low-pitch regions did indeed sound natural.

In one run of twenty-odd Japanese subjects, only two suspected that the back-channels were artificial; the vast majority were surprised when told that the decoy had handed the conversation over to the computer. Unfortunately this result was not significant, since even in the control experiments, with back-channels produced at random, most subjects did not notice anything odd or different, no matter how hard we probed in post-conversation interviews. Indeed, in many cases they hadn't even noticed whether back-channels were present. However, third-party judges listening to the conversations generally could distinguish the low-pitch-based back-channels from the randomly produced ones; the former sounded natural and the latter sounded odd, with clear cases of inappropriate back-channels and of inappropriate silences when a back-channel was called for.

We surmise that our subjects were generally so busy speaking that they had only minimal attention to pay to back-channel feedback. Also, to the extent that they did notice back-channel feedback, there is probably a human tendency to be generous and uncritical in interpreting a dialog partner's responses and response patterns. It is of course possible that more sensitive metrics, either subjective, such as impressions of how interested or friendly the listener was, or objective, such as average utterance length, would reveal a difference between the effects of appropriate and inappropriate back-channel feedback on the speaker. However, this will not be easy, since the effects of manipulating back-channel feedback are complex and context-dependent (Siegman 1976).

We also tried a different experimental procedure, in which the subject knew that his partner might be a machine, and had to guess, in each trial, whether it was a human or a machine. We found here that slight differences in sound quality between the live and pre-recorded back-channels, more than timing, were what people were sensitive to. Also, the problem of factoring out the effects of variations in topic across runs made it difficult to get meaningful results.

It is interesting to note that people in general are not prepared to chat naturally with a machine. In demonstrations, where subjects knew that their conversation partner was a machine, we sometimes found that the subject, after putting on the microphone, would challenge the system to respond, with no result, and then turn to the experimenter and make some comment, only to have the system then chime in with a perfectly appropriate back-channel.

In experiments with a couple dozen subjects in English, run over the telephone as part of the Elsnet Speech System Olympics at EuroSpeech 97, we found two main factors affecting the success of the system. The first was the ability to get the other person talking; this depended on whether the subject was talkative and on the decoy's success at putting them at ease and leading them on to a suitable topic of conversation. The other factor was native language; the system generally worked poorly or not at all for speakers whose native language was not a Germanic one.
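On the output side, the live system need only cycle through a small set of pre-recorded tokens and play one whenever the rule fires. A minimal sketch under stated assumptions: the play function is a hypothetical hook into whatever audio framework is used, and the filenames are invented.

    import itertools
    import threading

    def make_responder(play, tokens=("un-a.wav", "un-b.wav"), delay_ms=350):
        """Return a zero-argument trigger: each call schedules playback of
        the next pre-recorded token (two tokens in alternation, as in the
        experiments), delay_ms after the rule fires."""
        cycle = itertools.cycle(tokens)
        def trigger():
            threading.Timer(delay_ms / 1000, play, [next(cycle)]).start()
        return trigger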
Here are transcripts of two sessions. D is the decoy, S is the subject, and C is the computer. Computer responses are also decorated with asterisks.

Example 1:
D: Okay, so let's see, I'll hit return. Say something.
S: Nani ga iimasu. [in Japanese: roughly, ``What shall I say?'']
D: Okay, great, let's just speak English, because all I want is your pitch range.
S: Yeah that's fine.
D: And let's talk for a minute and um.
S: Okay, because I think my pitch range is . . .
S: Shall I keep on talking?
D: Yeah, please. So how's the weather, in England? Is it better than here?
S: It's certainly cooler, that's for sure.
C: *mm*
S: Probably it's better, I don't know. I saw the um forecast on the TV last night,
C: *mm*
S: and it's something like 15 centigrade, which is on the cooler side I think, isn't it.
C: *mm*
S: I'm not sure what it is here for us
C: *mm*
S: 59, 58 something I think, about that. Fairly cool.
C: *mm*
S: But um, over here I find, I have to pace myself carefully, because I start sweating,
C: *mm*
S: before I get tired.
C: *mm*
S: [laughs]
D: Well that's the humidity more than the, uh temperature, I think.
S: Yeah, that's right, yes . . .
D: Okay, so we have your pitch down. So, um that was, like a normal English conversation?
S: I think so, yeah.
D: Nothing strange about it.
S: No. Except the fact we can't see each other, but that's nothing. And I'm being videotaped of course. [laughs]
D: Okay, great. So in fact, in about 10 places, it was the system that said `mm'.
S: Oh right. I didn't notice, I didn't notice . . .

The next transcript is of a failure. This was perhaps because the subject was not a native speaker, and perhaps because she was suspicious from the start.

Example 2:
D: Okay, so. So tell me, is Rick Alterman still there? I guess.
S: Is who?
D: Rick Alterman.
S: Oh yeah definitely, because he's in the department.
C: *mm*
D: So what's he up to?
S: It's really hard to say
C: *mm*
S: I don't think I
C: *mm*
S: can really quite define his work,
C: *mm*
S: but um,
C: *mm*
S: you know he's. Why are you doing that?
C: *mm*
S: [laughs] What's that sound?
C: *mm*

SIGNIFICANCE

We have demonstrated a system that can keep up its end of a conversation without doing speech recognition or understanding. More generally, the impression of naturalness in spoken conversation can be achieved, in large part, by simply following the prosodic and gaze-given cues of the interlocutor, as seen also in some other work (Schmandt 1994; Thorisson 1994; Iwase 1998). We are planning to look for more such cues. So far we have tentatively identified some of the factors that affect the choice of which word or grunt to produce as back-channel feedback (Ward 1998; Tsukahara 1998). Ultimately we plan to combine simple reflex-type responsiveness with recognition and understanding. Our near-term aim is to build a system that will interact truly naturally with people in a simple verbal game.
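One possible shape for such a combination, in the subsumption style suggested under MOTIVATION, is a stack of layers in which reflex-type responsiveness always runs and meaning-based processing, once it exists, can override it. The following is purely illustrative: the class and function names are invented, and this is not a design described in the paper.

    class ReflexLayer:
        """Base layer: back-channel feedback from prosodic cues alone."""
        def react(self, frame):
            # would run the low-pitch rule sketched earlier and return a
            # token such as 'un' when it fires; stubbed out here
            return None

    class UnderstandingLayer:
        """Higher layer: recognition and understanding (future work)."""
        def react(self, frame):
            return None      # no meaning-based response available yet

    def respond(layers, frame):
        # Higher layers subsume lower ones: the first layer (listed top
        # first) that produces a response wins; otherwise stay silent.
        for layer in layers:
            response = layer.react(frame)
            if response is not None:
                return response
        return None

    # e.g. respond([UnderstandingLayer(), ReflexLayer()], frame)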
ACKNOWLEDGEMENTS

We thank the many students who have helped with this work, and the Sound Technology Promotion Foundation, the Hayao Nakayama Foundation, and the Japanese Ministry of Education for support.

REFERENCES

Tatsuya Iwase. Yuza ni Awaseta Taiwa Peesu no Chosetsu (Adjusting the Pace of Conversation to Suit the User). Proceedings of the 4th Annual Meeting of the (Japanese) Association for Natural Language Processing, 1998.

Anne Johnstone, Umesh Berry, Tina Nguyen and Alan Asper. There was a Long Pause: Influencing Turn-taking Behaviour in Human-Human and Human-Computer Dialogs. International Journal of Human-Computer Studies, 42, pp. 383--411, 1995.

Chris Schmandt. Computers and Communication. Van Nostrand Reinhold, New York, 1994.

Aron W. Siegman. Do Noncontingent Interviewer Mm-hmms Facilitate Interviewee Productivity? Journal of Consulting and Clinical Psychology, 44, pp. 171--182, 1976.

Kristinn R. Thorisson. Face-to-Face Communication with Computer Agents. Working Notes, AAAI Spring Symposium on Believable Agents, pp. 86--90, 1994.

Wataru Tsukahara. Purosodi oyobi Bunmyaku Joho o Mochiita Ooto no Sentaku/Chosetsu no Kokoromi (Selecting and Adapting Confirmations in Response to Prosodic Indications and Contextual Factors). Proceedings of the 4th Annual Meeting of the (Japanese) Association for Natural Language Processing, 1998.

Nigel Ward. Using Prosodic Clues to Decide When to Produce Back-channel Utterances. International Conference on Spoken Language Processing, pp. 1728--1731, 1996.

Nigel Ward. Responsiveness in Dialog and Priorities for Language Research. Systems and Cybernetics, 28, pp. 521--533, 1997.

Nigel Ward. The Relationship between Sound and Meaning in Japanese Back-channel Grunts. Proceedings of the 4th Annual Meeting of the (Japanese) Association for Natural Language Processing, 1998.

Nigel Ward and Wataru Tsukahara. Production of Back-Channel Feedback in Japanese may involve a Prosodically Triggered Reflex. Language, submitted.