CHI 2000 Workshop on Natural Language Interfaces
What makes a Natural Language Interface natural?
Consistency as a Social Rule
Li Gong
Department of Communication
Stanford University
Stanford, CA 94305-2050 USA
+1 650 968 6685
ligong@stanford.edu
INTRODUCTION
The kernel of natural language interfaces (NLI)
is naturalness, which also poses the largest barrier to their development.
NLI research and development has traditionally focused on identifying
the linguistic rules of natural language and transplanting them into computer
systems. A more subtle aspect of naturalness, however, resides in the social
side of the NLI and has yet to receive much attention from researchers and
practitioners. A natural language interface may implement all the linguistic
rules of human language correctly, but if it violates the social rules of
human interaction, its naturalness will be heavily discounted.
Reeves and Nass’s (1996) Media Equation theory
claims that the user’s interaction with computers and other new media is
fundamentally social. Through dozens of experiments, they demonstrated that
regular users automatically and unconsciously apply social rules to their
interaction with computers and other new media, including plain-text computer
interfaces. To name a few, the rules that were found evident in human-computer
interaction include politeness, reciprocity, flattery, and gender stereotyping.
The Media Equation phenomenon should be even more evident in NLI, because the
use of natural language makes the interaction more closely resemble natural
human-human conversation.
Recent years have seen growing emphasis on
NLI with speech input and output, as speech synthesis and recognition have made
substantial progress and increasingly intersected with NLI (Olive, 1997;
White, 1990). A speech-based natural language interface is an attractive
form of NLI because it frees the user’s hands and eyes. Speech, an ability
acquired earlier in child development, is also usually easier than composing
and reading text. An open question, then, is which social rules are relevant
and important in NLI, particularly in speech-based interfaces.
In this paper, I present an experiment
that provides evidence for consistency as an important social rule, in this
case consistency between the visual and auditory channels of speech output.
Scientists studying speech perception have
long found that consistent visual cues facilitate speech perception and
understanding (Massaro, 1998). Presenting a talking face in the interface has
gained a lot of attention in recent years because it integrates the visual and
auditory channels and promises greater vividness and liveliness. Yet its social
aspect is largely unexplored.
Thanks to technological advances, designers
now have choices in speech output: a very human option and a much more
machine-like one. High-quality recorded
speech, facilitated by new recording and compression technologies and dramatic
increases in disk space, reproduces speech comparable to that of
a person in face-to-face conversation.
The machine-like option is synthesized speech, also known as
text-to-speech (TTS). Although this
technology produces comprehensible content, the speech is obviously produced by
a machine rather than a person: synthesized speech lacks both the clarity and
the prosody of normal human speech (Olive, 1997).
In the face arena, high-quality video is
generally too memory-intensive, except in very small bursts, to make videotaped
faces a viable option (Sproull, Subramani, Kiesler, Walker, & Waters,
1996). Also, current technology cannot yet lip-synchronize a pre-videotaped
human face with arbitrary text on the fly.
Instead, synthesized faces have proven to be much more compact and practical
than full-motion videos. Synthesized faces can now perform very well in
real-time, especially with respect to lip-synch and emotion manifestation
(Massaro, 1998). Unfortunately,
synthesized faces, like their speech counterparts, are readily and
automatically distinguished from both real faces and videos.
Given the options delineated above, what should
designers do, particularly if the goal is to make the interface social,
human-like and natural? Should the synthesized face be paired with a recorded
human voice or with a synthesized voice? How would these two combinations
compare to the voices alone, without the face? Would the face enhance the
user’s social feel of the voice interface regardless of the type of voice?
Two approaches seem relevant here: maximization
and consistency. The maximization approach suggests that one
should adopt the most human-like option for each dimension, i.e., the recorded
voice and the synthesized face in this case. The logic underlying this approach
is that 1) each dimension is responded to independently; and 2) the values of
the dimensions add up linearly, without interacting with one another.
A more subtle viewpoint stresses consistency and argues that independent
maximization of each dimension does not necessarily
lead to linear addition and an overall more human-like and natural experience.
Instead, mismatch or inconsistency between dimensions may actually
undermine the user’s feeling of
socialness and naturalness of the interface. The designer should make sure that
the dimensions are at the same level of
humanness, even if tools are available to “improve” a particular
dimension. Under this view, the synthesized face should be combined only with
synthesized voice and not with recorded voice, because the recorded voice is
clearly that of a human and the synthesized face is clearly not. There is a large
body of evidence in the social psychology literature that in social situations,
people prefer to interact with individuals who behave consistently, even if
consistently undesirably, as compared to individuals who behave inconsistently
(Fiske & Taylor, 1991). For example,
nonverbal behavior that is inconsistent with the verbal content cues deception
(Ekman & Friesen, 1969). In the
human-computer interaction arena, a study has shown that users become disturbed
by inconsistencies between bodily personality cues of a stick figure (extroversion
or introversion) and verbal personality cues in the text of the stick figure
(also extroversion or introversion) (Isbister & Nass, 1998).
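The contrast between the two approaches can be written in the notation of a standard two-factor linear model. This is only a notational sketch, not something taken from the studies cited above: Y (the user's social response), the subscripts v (voice) and f (face), and the effect symbols are all assumptions introduced for illustration.

```latex
% v: voice (synthesized or recorded); f: face (none or synthesized face)
\[
  Y_{vf} = \mu + \alpha_v + \beta_f + (\alpha\beta)_{vf}
\]
% Maximization: the dimensions add up without interacting.
\[
  (\alpha\beta)_{vf} = 0 \quad \text{for all } v, f
\]
% Consistency: adding the face helps the matching (synthesized) voice
% and hurts the mismatched (recorded) voice, i.e. a cross-over interaction.
\[
  Y_{\text{synth},\,\text{face}} - Y_{\text{synth},\,\text{none}} > 0
  > Y_{\text{rec},\,\text{face}} - Y_{\text{rec},\,\text{none}}
\]
```

Under this reading, the two arguments differ only in what they predict for the interaction term, which is what a factorial experiment can test.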
METHOD
To critically test the maximization and the consistency arguments, a 2
(synthesized speech vs. recorded speech) x 2 (synthesized face vs. no face)
between-subjects experiment was conducted. (“Face” in this paper refers to the
computer-synthesized face.)
Participants
Participants were 48 students enrolled in a large communication course at a university. To
avoid potential difficulties in understanding the synthesized speech, all
participants were native English speakers. The participants received course
credit for participating in the experiment. They were told the purpose of the
experiment was to test a computer-based interviewing system. The participants
were randomly assigned to the four conditions, with gender balanced across
conditions.
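A minimal sketch of how random assignment with gender balanced across the four conditions could be carried out. The paper does not describe the actual procedure, so the dealing scheme, the function name `assign_balanced`, and the 24/24 gender split in the roster are all assumptions for illustration.

```python
import random
from collections import defaultdict

# The four cells of the 2 (voice) x 2 (face) design described in the text.
CONDITIONS = [
    ("synthesized speech", "face"),
    ("synthesized speech", "no face"),
    ("recorded speech", "face"),
    ("recorded speech", "no face"),
]


def assign_balanced(participants, conditions=CONDITIONS, seed=None):
    """Randomly assign participants to conditions, balancing gender.

    `participants` is a list of (participant_id, gender) pairs. Within each
    gender, participants are shuffled and then dealt to the conditions in
    rotation, so each condition gets a near-equal share of each gender.
    """
    rng = random.Random(seed)
    by_gender = defaultdict(list)
    for pid, gender in participants:
        by_gender[gender].append(pid)

    assignment = {}
    for gender in sorted(by_gender):
        ids = by_gender[gender]
        rng.shuffle(ids)
        order = list(conditions)
        rng.shuffle(order)  # randomize which conditions receive any leftovers
        for i, pid in enumerate(ids):
            assignment[pid] = order[i % len(order)]
    return assignment


# Hypothetical roster: the 24/24 gender split is an assumption, not study data.
roster = [(f"P{i:02d}", "F" if i < 24 else "M") for i in range(48)]
assignment = assign_balanced(roster, seed=2000)
```

With this hypothetical roster, each of the four conditions ends up with six women and six men.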
Procedure
Each participant completed the experiment individually in a media lab. Upon arrival,
they were asked to read the Informed Consent Form and were assured that the
information they submitted in the study would be kept completely confidential. After they
signed the consent form and read the instructions on the computer screen, they
completed the practice round with the assistance of the experimenter. The
purpose of the practice round was to demonstrate how to: 1) use the mouse to
answer questions on a Likert-type scale, 2) type information into a text box, and
3) use the “Submit” and “Repeat” buttons.
After the practice round, the experimenter left the room. In the first round of the computer-based interview, the computer
(via the assigned modality) asked a series of 20 standard questions that
assessed socially desirable responding. After each question, the participants
indicated their answers by using the mouse to click on a response button on a
1-7 scale. The second round consisted of nine standard, open-ended questions
that assessed the level of self-disclosure. The participants typed their
answers in a text box; when done, they clicked the “Submit” button. For both
rounds, there was a “Repeat” button that, when pressed, had the computer repeat
the question. After they finished the experiment, the participants left the
room and were then thanked and debriefed by the experimenter.
Manipulation
The CSLU Toolkit software was used to create the stimuli and to run the experiment.
The Festival TTS engine in the Toolkit was used to provide synthesized speech.
For recorded speech, we recorded the voice of an adult American male. In the
face conditions, we employed the “Baldi” face provided with the Toolkit. The
face was placed on the left side of the screen and was 17.8 cm high and
12.5 cm wide. We synchronized “Baldi” with both the synthesized speech and
the recorded speech using the Toolkit. The interfaces were presented
on a 43.2 cm diagonal monitor screen.
Measures
Users’ social feeling of the interface was measured by two constructs: socially
desirable responding and self-disclosure. The rationale is that the greater the
social feeling, the greater the pressure for impression management, and also the
greater the willingness to disclose about oneself, because disclosure is a highly social act
(Sproull et al., 1996; Moon, 1998). Socially desirable responding was measured
by the BIDR-Impression Management (IM) subscale (Kroner & Weekes, 1996).
The original 20 BIDR items were first-person statements, for example, “I
sometimes tell lies if I have to”. To suit the interview nature of this
study, the items were adapted to “Do you” or “Have you”
questions, such as “Do you sometimes tell lies if you have to?” The
original 1-7 Likert-type scale was retained (1 = “not true”, 7 = “very true”).
The responses to the 20 BIDR-IM questions were averaged to form an IM index
(Cronbach’s α = .80). A higher value on the IM index indicates a greater tendency for
impression management. The computer also recorded the time that participants took to
answer each BIDR question as a measure of how seriously they treated the
task.
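The IM index and its reliability coefficient can be sketched as follows. This is an illustrative implementation of the textbook Cronbach's alpha formula, not the analysis code used in the study. The function names and the toy data are hypothetical, and the sketch assumes any reverse-keyed items have already been recoded.

```python
from statistics import mean, variance


def im_index(item_scores):
    """Average one participant's item responses (each 1-7) into an IM index."""
    return mean(item_scores)


def cronbach_alpha(responses):
    """Cronbach's alpha for a participants x items matrix of scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])                  # number of items
    items = list(zip(*responses))          # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Made-up responses from four participants on three items (not study data).
# The items move in lockstep, so alpha reaches its maximum of 1.
toy = [
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
    [5, 6, 7],
]
alpha = cronbach_alpha(toy)
indices = [im_index(row) for row in toy]
```

In the study itself, the matrix would have 48 rows (participants) and 20 columns (BIDR-IM items).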
Self-disclosure
was measured by Moon’s (1998) nine open-ended self-disclosure questions, for
example, “What do you dislike about your physical appearance?” There were
two indices of this measure. The amount of self-disclosure was the average
number of words, across the nine items, in the participant’s responses. The
reliability of this index was very high (α
= .85). The depth of self-disclosure was rated by two independent judges on a
five-point Likert-type scale (1 = “low intimacy”, 5 = “high intimacy”). The
inter-rater reliability was .74; disagreements were resolved by
averaging. The assigned value for a given participant was the average depth of
response across the nine items; the reliability of the index was very high (α = .86).
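The two self-disclosure indices can be sketched in a few lines. Counting words by splitting on whitespace is an assumption, since the paper does not say how words were counted. The function names and example answers are hypothetical.

```python
from statistics import mean


def disclosure_amount(answers):
    """Average number of whitespace-separated words across the answers."""
    return mean(len(answer.split()) for answer in answers)


def disclosure_depth(ratings_judge_a, ratings_judge_b):
    """Average depth across items, after averaging the two judges per item."""
    per_item = [(a + b) / 2 for a, b in zip(ratings_judge_a, ratings_judge_b)]
    return mean(per_item)


# Hypothetical answers and 1-5 intimacy ratings for three items (not study data).
answers = ["I wish I were taller", "Nothing really", "My temper sometimes"]
judge_a = [3, 1, 4]
judge_b = [4, 1, 4]

amount = disclosure_amount(answers)         # word counts 5, 2, 3
depth = disclosure_depth(judge_a, judge_b)  # per-item means 3.5, 1.0, 4.0
```

In the study, each participant would contribute nine answers and nine pairs of ratings.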
RESULTS
Full-factorial ANOVAs were conducted on all the dependent measures.
The voice and face factors showed consistent cross-over interaction effects on
all of the measures, supporting the consistency hypothesis and challenging the
maximization argument.
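For readers less familiar with factorial ANOVA, the following is a minimal sketch of how the F ratios for a balanced 2 x 2 between-subjects design are computed. It is not the software used in the study; the function name and toy scores are hypothetical, and equal cell sizes are assumed. With 48 participants in four cells, the error term has 48 - 4 = 44 degrees of freedom, which matches the F(1, 44) values reported here.

```python
from statistics import mean


def anova_2x2(cells):
    """F ratios for a balanced 2 x 2 between-subjects design.

    `cells[i][j]` is the list of scores for level i of factor A (voice) and
    level j of factor B (face); every cell must hold the same number of scores.
    Returns (F_A, F_B, F_interaction, df_error); each effect has 1 df.
    """
    n = len(cells[0][0])
    if any(len(cells[i][j]) != n for i in range(2) for j in range(2)):
        raise ValueError("this sketch assumes equal cell sizes")

    cell_mean = [[mean(cells[i][j]) for j in range(2)] for i in range(2)]
    grand = mean(cell_mean[i][j] for i in range(2) for j in range(2))
    a_mean = [mean(cell_mean[i]) for i in range(2)]
    b_mean = [mean(cell_mean[i][j] for i in range(2)) for j in range(2)]

    ss_a = 2 * n * sum((m - grand) ** 2 for m in a_mean)
    ss_b = 2 * n * sum((m - grand) ** 2 for m in b_mean)
    ss_ab = n * sum(
        (cell_mean[i][j] - a_mean[i] - b_mean[j] + grand) ** 2
        for i in range(2) for j in range(2)
    )
    ss_error = sum(
        (x - cell_mean[i][j]) ** 2
        for i in range(2) for j in range(2) for x in cells[i][j]
    )
    df_error = 4 * (n - 1)
    ms_error = ss_error / df_error
    return ss_a / ms_error, ss_b / ms_error, ss_ab / ms_error, df_error


# Made-up scores with a pure cross-over pattern (not study data):
# rows = voice (synthesized, recorded), columns = face (no face, face).
toy_cells = [
    [[1, 3], [5, 7]],
    [[5, 7], [1, 3]],
]
f_voice, f_face, f_interaction, df_error = anova_2x2(toy_cells)
```

In the toy data the face raises scores for one voice and lowers them for the other by the same amount, so both main effects vanish and only the interaction F is non-zero.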
For BIDR-Impression Management (IM), there was a significant cross-over
interaction, F(1, 44) = 4.3, p < .05 (see Figure 1). Participants who interacted with the
synthesized face speaking with the synthesized voice exhibited greater
impression management than those who only heard the synthesized voice without
the face. However, the opposite pattern appeared for the recorded voice:
participants who interacted with the synthesized face speaking with the
recorded voice showed less impression
management than those who only heard the recorded voice without the synthesized
face.
Similar cross-over interaction effects were observed with respect to
the amount of self-disclosure, F(1, 44) = 4.6, p < .05, and to
the depth of self-disclosure, F(1, 44) = 13.9, p < .001. Participants disclosed more information
about themselves and the disclosure was more intimate when the interface was
the synthesized voice with synthesized face than when it was the synthesized
voice alone (see Figures 2 and 3). On the contrary, they disclosed less about
themselves and the disclosure was less intimate when the recorded voice was
combined with the synthesized face than when the interface incorporated the
recorded voice alone (see Figures 2 and 3).
Further evidence in support of consistency was found in the significant
cross-over interaction effect for the average time that the participants spent
in answering BIDR questions, F(1, 44) = 12.7, p < .001 (we did
not include time on the open-ended questions, because time would a priori be
correlated with the amount of disclosure). The participants spent more time on
BIDR questions when the interface was the synthesized face with the synthesized
voice than when there was the synthesized voice alone. Conversely, they spent
less time when the interface included the recorded voice with the synthesized
face as compared to when the interface was the recorded voice alone (see Figure
4). There was also a main effect of voice: participants spent more time with
the synthesized speech than with the recorded speech, presumably because the
former is harder to process, F(1, 44) = 86.0, p < .001.


Figure 1: Comparison of means in BIDR-IM.
Figure 2: Comparison of means in the amount of self-disclosure.
Figure 3: Comparison of means in the depth of self-disclosure.
Figure 4: Comparison of average time spent on BIDR questions (in seconds).
In sum, while a synthesized face enhanced the user’s social feeling of the
synthesized speech, it undermined the
social feeling of recorded human speech.
Consistency seems to explain this discrepancy well: the synthesized
face is consistent with the synthesized voice but clearly inconsistent with
the recorded voice. The recorded voice is clearly a human’s voice, while the
synthesized face is clearly not that of a human. Their mismatch may lead the user
to feel strange, disturbed, mistrustful, and less social.
DISCUSSION
It is very difficult to overcome the designer’s natural impulse to make each
dimension of a multi-dimensional interface as technologically advanced and
“human-like” as possible. Because
different research communities make advances at different rates and at
different times, it seems to be in everyone’s interest to provide the “best of
breed” for each dimension of the technology (Nass & Mason, 1991). Unfortunately,
the present research shows that allowing each modality (and its interested
parties) to “show off” without matching
and balancing the modalities against one another leads to an experience that is
less human-like and less social for the
user. Because humans value social consistency, likely for evolutionary reasons
(Reeves & Nass, 1996), an interface that manifests a consistent social feel
may be more desirable than one that maximizes dimensions independently. Thus,
the principle that a good interface
should be consistent (see, e.g., Norman, 1988; Shneiderman, 1998) might have a
social as well as an ergonomic basis.
The present study demonstrated that a face, when it is consistent with the voice,
indeed enhances the social feel of the interface, in line with the face-voice
enhancement effect in the traditional perception and performance arena. When
the face is inconsistent with the voice, however, it harms the social feel of
the interface, even though it is still well lip-synchronized. This points to
the importance of social aspects in developing and assessing computer
interfaces including NLI, in addition to the matter of technical perfection.
The technological aspect of NLI, because it is well acknowledged and studied,
may not pose the largest barrier in the development of NLI. The social aspects
of NLI, which could be easily overlooked, deserve more attention and provide
enormous opportunity as well as challenge in enhancing the social feel and
naturalness of the interface.
Of course, consistency between voice and face is just one pairing in an interface.
As NLI and computer interfaces in general exhibit more skills, abilities, and
manifestations that are associated with humans, what are other critical domains
of consistency? For example, should the
language level of the computer match that of the user in an NLI? Should text
output be matched with text input and speech output with speech input? Should
the feedback of the computer be
adaptive to and consistent with the mood of the user (e.g., more positive feedback
and less critical feedback when the user is in a low mood)? The list can go on
and on because human-computer interaction is as complex and intricate as
human-human interaction. Moreover, consistency is just one of the social
principles that need to be respected in designing interfaces. Further and more
extensive exploration of the social aspects of NLI not only provides more
guidelines for making a natural language interface more social and natural but
also provides insight into human-computer interaction in general. The NLI
community is in a particularly advantageous position to achieve this goal
because social aspects of human-computer interaction are more salient in an
interface driven by natural language.
REFERENCES
Ekman, P. & Friesen, W. V. Nonverbal leakage and clues to deception.
Psychiatry, 32, 88-95, 1969.
Fiske, S. T. & Taylor, S. E. Social Cognition. New York: McGraw-Hill,
1991.
Isbister, K. & Nass, C. Personality
in conversational characters: Building better digital interaction partners
using knowledge about human personality preferences and perceptions. Proceedings of the WECC Conference, Lake
Tahoe, CA, 1998.
Kroner, D. G. & Weekes, J. R. Balanced
inventory of desirable responding: Factor structure, reliability, and validity
with an offender sample. Personality and
Individual Differences, 21(3), 323-333, 1996.
Massaro, D. W. Perceiving Talking Faces: From Speech Perception to a Behavioral Principle.
Cambridge, MA: MIT Press, 1998.
Moon, Y. Intimate self-disclosure
exchanges: Using computers to build reciprocal relationships with consumers. Working paper for Harvard Business School,
1998.
Nass, C. & Mason, L. On the study of
technology and task: A variable-based approach. In J. Fulk & C. Steinfield
(Eds.), Organizations and Communication
Technology, Newbury Park, CA: Sage, 1991.
Norman, D. The Design of Everyday Things. New York: Currency Doubleday, 1988.
Olive, J. P. “The talking computer”:
Text-to-speech synthesis. In D. G. Stork (Ed.), HAL’s Legacy: 2001’s Computer as Dream and Reality. Cambridge, MA:
MIT Press, 1997.
Reeves, B. & Nass, C. The Media Equation: How People Treat Computers,
Television, and New Media like Real People and Places. New York: Cambridge
University Press/CSLI, 1996.
Shneiderman, B. Designing the User Interface: Strategies for Effective HCI (3rd ed.).
Reading, MA: Addison Wesley Longman, 1998.
Sproull, L., Subramani, M., Kiesler, S.,
Walker, J. H. & Waters, K. When the interface is a face. Human-Computer Interaction, 11, 97-124,
1996.
White, G. Natural language
understanding and speech recognition. Communications
of the ACM, 33(8), 72-82, 1990.