CHI 2000 Workshop on Natural Language Interfaces

The Hague, The Netherlands, April 3, 2000

 

 

What makes a Natural Language Interface natural?
Consistency as a Social Rule

 


Li Gong

 

Department of Communication
Stanford University


Stanford, CA 94305-2050 USA
+1 650 968 6685

ligong@stanford.edu

 

 

 

INTRODUCTION

 

The core of a natural language interface (NLI) is naturalness, which is also the largest barrier to developing one. NLI research and development has traditionally focused on identifying the linguistic rules of natural language and transplanting them into computer systems. A more subtle aspect of naturalness, however, resides in the social side of the NLI and has yet to receive much attention from researchers and practitioners. A natural language interface may implement all the linguistic rules of human language correctly, but if it violates the social rules of human interaction, its naturalness will be severely compromised.

Reeves and Nass’s (1996) Media Equation theory claims that users’ interaction with computers and other new media is fundamentally social. Through dozens of experiments, they demonstrated that ordinary users automatically and unconsciously apply social rules to their interactions with computers and other new media, including plain-text computer interfaces. The rules found to operate in human-computer interaction include politeness, reciprocity, flattery, and gender stereotyping, among others. The Media Equation phenomenon should be even more evident in NLI because the use of natural language makes the interaction more closely resemble natural human-human conversation.

Recent years have seen growing emphasis on NLI with speech input and output, as speech synthesis and recognition have made substantial progress and increasingly intersected with NLI (Olive, 1997; White, 1990). A speech-based natural language interface is a desirable form of NLI because it frees the user’s hands and eyes. Speech, an ability acquired earlier in child development, is also usually easier than composing and reading text. The open question, then, is which social rules are relevant and important in NLI, and particularly in speech-based interfaces.

In this paper, I present an experiment showing that consistency is an important social rule, in this case in visual and auditory speech output. Researchers studying speech perception have long found that consistent visual cues facilitate speech perception and understanding (Massaro, 1998). Presenting a talking face in the interface has gained considerable attention in recent years because it integrates the visual and auditory channels and promises greater vividness and liveliness. Yet its social aspect is largely unexplored.

Thanks to technological advances, designers now have choices in speech output: a very human option and a much more machine-like one. High-quality recorded speech, facilitated by new recording and compression technologies and dramatic increases in disk space, reproduces speech that is comparable to that of a person in face-to-face conversation. The machine-like option is synthesized speech, also known as text-to-speech (TTS). Although this technology produces comprehensible content, the speech is obviously produced by a machine rather than a person: synthesized speech lacks both the clarity and the prosody of normal human speech (Olive, 1997).

In the face arena, high-quality video is generally too memory-intensive, except in very small bursts, to make videotaped faces a viable option (Sproull, Subramani, Kiesler, Walker, & Waters, 1996). Current technology also cannot lip-synchronize a pre-videotaped human face with arbitrary text on the fly. Synthesized faces, by contrast, have proven much more compact and practical than full-motion video. They can now perform very well in real time, especially with respect to lip-synch and the display of emotion (Massaro, 1998). Unfortunately, synthesized faces, like their speech counterparts, are readily and automatically distinguished from both real faces and videos.

Given the options delineated above, what should designers do, particularly if the goal is to make the interface social, human-like and natural? Should they pair the synthesized face with a recorded human voice or with a synthesized voice? How would these two combinations compare to the voices alone, without the face? Would the face enhance the user’s social feel of the voice interface regardless of the type of voice?

Two approaches seem relevant here: maximization and consistency. The maximization approach suggests that one should adopt the most human-like option for each dimension, i.e., the recorded voice and the synthesized face in this case. The logic underlying this approach is: 1) each dimension is responded to independently; and 2) the values of each dimension linearly add up without interaction between each other.

A more subtle viewpoint stresses consistency and argues that independent maximization of each dimension does not necessarily lead to linear addition and an overall more human-like and natural experience. Instead, a mismatch or inconsistency between dimensions may actually undermine the user’s feeling of socialness and naturalness of the interface. The designer should make sure that the dimensions are at the same level of humanness, even if tools are available to “improve” a particular dimension. Under this view, the synthesized face should only be combined with the synthesized voice and not with the recorded voice, because the recorded voice is clearly that of a human and the synthesized face is clearly not. There is a large body of evidence in the social psychology literature that in social situations, people prefer to interact with individuals who behave consistently, even if consistently undesirably, over individuals who behave inconsistently (Fiske & Taylor, 1991). For example, nonverbal behavior that is inconsistent with the verbal content cues deception (Ekman & Friesen, 1969). In the human-computer interaction arena, a study has shown that users are disturbed by inconsistencies between the bodily personality cues of a stick figure (extroversion or introversion) and the verbal personality cues in the stick figure’s text (also extroversion or introversion) (Isbister & Nass, 1998).

 

METHOD

 

To critically test the maximization and the consistency arguments, a 2 (synthesized speech vs. recorded speech) x 2 (synthesized face vs. no face) between-subjects experiment was conducted. (“Face” in this paper refers to the computer-synthesized face.)

Participants

Participants were 48 students enrolled in a large communication course at a university. To avoid potential difficulties in understanding the synthesized speech, all participants were native English speakers. The participants received course credit for participating in the experiment. They were told the purpose of the experiment was to test a computer-based interviewing system. The participants were randomly assigned to the four conditions, with gender balanced across conditions. 
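The assignment procedure described above can be illustrated with a minimal sketch. The paper does not say how randomization was implemented or what the gender split was, so the roster, the 24/24 split, and the function name assign_balanced are hypothetical; the sketch only shows one common way to randomize within gender so that each of the four cells receives equal numbers.

```python
import random
from collections import Counter

# The four cells of the 2 (voice) x 2 (face) between-subjects design.
CONDITIONS = [
    ("synthesized speech", "face"),
    ("synthesized speech", "no face"),
    ("recorded speech", "face"),
    ("recorded speech", "no face"),
]


def assign_balanced(participants, seed=None):
    """Randomly assign (participant_id, gender) pairs to conditions.

    Participants are shuffled within each gender and then dealt
    round-robin to the conditions, which keeps gender balanced across cells.
    """
    rng = random.Random(seed)
    by_gender = {}
    for pid, gender in participants:
        by_gender.setdefault(gender, []).append(pid)

    assignment = {}
    for ids in by_gender.values():
        rng.shuffle(ids)
        for i, pid in enumerate(ids):
            assignment[pid] = CONDITIONS[i % len(CONDITIONS)]
    return assignment


# Hypothetical roster: 24 women and 24 men (the split is an assumption).
roster = [(f"P{i:02d}", "F" if i < 24 else "M") for i in range(48)]
assignment = assign_balanced(roster, seed=0)
cell_sizes = Counter(assignment.values())
```

With 48 participants this yields 12 per cell, which is consistent with the error degrees of freedom of 44 reported in the Results.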

Procedure

Each participant completed the experiment individually in a media lab. Upon arrival, they were asked to read the Informed Consent Form and were assured that the information they submitted in the study was completely confidential. After they signed the consent form and read the instructions on the computer screen, they completed the practice round with the assistance of the experimenter. The purpose of the practice round was to demonstrate how to: 1) use the mouse to answer questions on a Likert-type scale, 2) type information into a text box, and 3) use the “Submit” and “Repeat” buttons.

After the practice round, the experimenter left the room.  In the first round of the computer-based interview, the computer (via the assigned modality) asked a series of 20 standard questions that assessed socially desirable responding. After each question, the participants indicated their answers by using the mouse to click on a response button on a 1-7 scale. The second round consisted of nine standard, open-ended questions that assessed the level of self-disclosure. The participants typed their answers in a text box; when done, they clicked the “Submit” button. For both rounds, there was a “Repeat” button that, when pressed, had the computer repeat the question. After they finished the experiment, the participants left the room and were then thanked and debriefed by the experimenter.

Manipulation

The CSLU Toolkit software was used to create the stimuli and to run the experiment. The Festival TTS engine in the Toolkit was used to provide synthesized speech. For recorded speech, we recorded the voice of an adult American male. In the face conditions, we employed the “Baldi” face provided with the Toolkit.  The face was placed on the left side of the screen.  The face was 17.8 cm high and 12.5 cm wide.  We synchronized “Baldi” with both the synthesized speech and the recorded speech using the Toolkit.  The interfaces were presented on a 43.2 cm diagonal monitor screen. 

Measures

Users’ social feeling of the interface was measured by two constructs: socially desirable responding and self-disclosure. The rationale is that the greater the social feeling, the more pressure there is for impression management, and the greater the willingness to disclose about oneself, because disclosure is a highly social act (Sproull et al., 1996; Moon, 1998). Socially desirable responding was measured by the BIDR-Impression Management (IM) subscale (Kroner & Weekes, 1996). The original 20 BIDR items were first-person statements, for example, "I sometimes tell lies if I have to". To suit the interview nature of this study, the items were adapted to "Do you" or "Have you" questions, such as "Do you sometimes tell lies if you have to?" The original 1-7 Likert-type scale was retained (1 = “not true”, 7 = “very true”). The responses to the 20 BIDR-IM questions were averaged to form an IM index (Cronbach’s α = .80). A higher value on the IM index indicates a greater tendency toward impression management. The computer also recorded the time that participants took to answer each BIDR question as a measure of how seriously they treated the task.
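As an illustration of how an averaged index and its reliability coefficient are typically computed, the following is a minimal sketch using the standard formula for Cronbach’s alpha. The function names and the toy responses are hypothetical, not the study’s data, and any reverse-scoring of items (which the paper does not describe) is left out.

```python
from statistics import mean, variance


def index_score(item_responses):
    """Average one participant's item responses into a single index."""
    return mean(item_responses)


def cronbach_alpha(rows):
    """Cronbach's alpha for a participants-by-items table of scores.

    rows: one list of item scores per participant, all of equal length.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up responses on a 1-7 scale: 4 participants x 3 items.
toy_responses = [
    [5, 4, 6],
    [2, 3, 2],
    [6, 6, 7],
    [3, 3, 4],
]
toy_indices = [index_score(row) for row in toy_responses]
toy_alpha = cronbach_alpha(toy_responses)
```

In the study the table would have 48 rows and 20 columns; the toy table is kept small only so the arithmetic is easy to follow.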

Self-disclosure was measured by Moon's (1998) nine open-ended self-disclosure questions, for example, "What do you dislike about your physical appearance?" There were two indices of this measure. The amount of self-disclosure was the average number of words, across the nine items, in the participant’s responses. The reliability of this index was very high (α = .85). The depth of self-disclosure was rated by two independent judges on a five-point Likert-type scale (1 = “low intimacy”, 5 = “high intimacy”). The inter-rater reliability was .74; disagreements were resolved by averaging. The assigned value for a given participant was the average depth of response across the nine items; the reliability of this index was also very high (α = .86).
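The two self-disclosure indices described above can be sketched as follows. The paper does not state how words were counted, so whitespace splitting is an assumption, and the function names and example values are hypothetical.

```python
from statistics import mean


def disclosure_amount(answers):
    """Average number of words per answer.

    Words are counted by splitting on whitespace (an assumption).
    """
    return mean(len(answer.split()) for answer in answers)


def disclosure_depth(judge_a, judge_b):
    """Average depth rating across items.

    For each item the two judges' 1-5 ratings are averaged first, which
    also resolves any disagreement between them; the item-level values
    are then averaged across items.
    """
    return mean((a + b) / 2 for a, b in zip(judge_a, judge_b))


# Made-up example with two items instead of nine.
amount = disclosure_amount(["I am not sure", "My nose is too big I think"])
depth = disclosure_depth([2, 4], [3, 4])
```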

 

RESULTS

                       

Full-factorial ANOVAs were conducted on all the dependent measures. The voice and face factors showed consistent cross-over interaction effects on all of the measures, supporting the consistency hypothesis and challenging the maximization argument.
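For readers less familiar with the analysis, the interaction test in a balanced 2 x 2 between-subjects design can be sketched in a few lines. This is an illustration under stated assumptions, not the analysis code used in the study: the function name interaction_f and the cell labels are hypothetical, and the scores are made-up toy numbers chosen only to produce a cross-over pattern.

```python
from statistics import mean


def interaction_f(cells):
    """F statistic for the voice x face interaction in a balanced 2 x 2 design.

    cells maps (voice, face) to a list of scores, with equal n in every cell.
    Returns (F, (df_interaction, df_error)).
    """
    keys = [("synth", "face"), ("synth", "none"),
            ("rec", "face"), ("rec", "none")]
    n = len(cells[keys[0]])
    if any(len(cells[k]) != n for k in keys):
        raise ValueError("this sketch assumes equal cell sizes")

    means = {k: mean(cells[k]) for k in keys}

    # Interaction contrast: the effect of adding the face under the
    # synthesized voice minus the effect of adding it under the recorded voice.
    contrast = (means[keys[0]] - means[keys[1]]) - (means[keys[2]] - means[keys[3]])
    ss_interaction = n * contrast ** 2 / 4

    # Error term: pooled within-cell sum of squares.
    ss_error = sum((x - means[k]) ** 2 for k in keys for x in cells[k])
    df_error = 4 * (n - 1)

    return ss_interaction / (ss_error / df_error), (1, df_error)


# Toy cross-over data, 3 scores per cell (NOT the study's data).
toy_cells = {
    ("synth", "face"): [5, 6, 7],
    ("synth", "none"): [3, 4, 5],
    ("rec", "face"): [3, 4, 5],
    ("rec", "none"): [5, 6, 7],
}
f_value, df = interaction_f(toy_cells)
```

For the toy numbers the function returns F = 12.0 with (1, 8) degrees of freedom. With 12 participants per cell, as in this experiment, the error degrees of freedom are 4 x 11 = 44, matching the F(1, 44) values reported below. Under the maximization argument the contrast would be near zero, because the face would add the same amount regardless of the voice.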

For BIDR-Impression Management (IM), there was a significant cross-over interaction, F(1, 44) = 4.3, p < .05 (see Figure 1). Participants who interacted with the synthesized face speaking with the synthesized voice exhibited greater impression management than those who only heard the synthesized voice without the face. However, the opposite pattern emerged for the recorded voice. The participants who interacted with the synthesized face speaking with the recorded voice showed less impression management than those who only heard the recorded voice without the synthesized face.

Similar cross-over interaction effects were observed for the amount of self-disclosure, F(1, 44) = 4.6, p < .05, and for the depth of self-disclosure, F(1, 44) = 13.9, p < .001. Participants disclosed more information about themselves, and the disclosure was more intimate, when the interface was the synthesized voice with the synthesized face than when it was the synthesized voice alone (see Figures 2 and 3). In contrast, they disclosed less about themselves, and the disclosure was less intimate, when the recorded voice was combined with the synthesized face than when the interface used the recorded voice alone (see Figures 2 and 3).


 

Further evidence in support of consistency was found in the significant cross-over interaction effect for the average time that the participants spent answering BIDR questions, F(1, 44) = 12.7, p < .001 (we did not include time on the open-ended questions, because time would a priori be correlated with the amount of disclosure). The participants spent more time on BIDR questions when the interface was the synthesized face with the synthesized voice than when it was the synthesized voice alone. Conversely, they spent less time when the interface included the recorded voice with the synthesized face than when the interface was the recorded voice alone (see Figure 4). Participants also spent more time with the synthesized speech than with the recorded speech, a function of the greater difficulty of processing the former, F(1, 44) = 86.0, p < .001.

 


 


Figure 1: Comparison of means in BIDR-IM.

Figure 2: Comparison of means in the amount of self-disclosure.

Figure 3: Comparison of means in the depth of self-disclosure.

Figure 4: Comparison of average time spent on BIDR questions (in seconds).

 

In sum, while a synthesized face enhanced the user’s social feeling of the synthesized speech, it undermined the social feeling of recorded human speech. Consistency explains this discrepancy well: the synthesized face is consistent with the synthesized voice but clearly inconsistent with the recorded voice. The recorded voice is clearly a human’s voice, while the synthesized face is clearly not that of a human. This mismatch may lead the user to feel strange, disturbed, mistrustful, and less social.

 

DISCUSSION

 

It is very difficult to overcome the designer’s natural impulse to make each dimension of a multi-dimensional interface as technologically advanced and “human-like” as possible. Because different research communities make advances at different rates and at different times, it seems to be in everyone’s interest to provide the “best of breed” for each dimension of the technology (Nass & Mason, 1991). Unfortunately, the present research shows that allowing each modality (and its interested parties) to “show off” without matching and balancing it against the others leads to an experience that is less human-like and less social for the user. Because humans value social consistency, likely for evolutionary reasons (Reeves & Nass, 1996), an interface that manifests a consistent social feel may be more desirable than one that maximizes dimensions independently. Thus, the principle that a good interface should be consistent (see, e.g., Norman, 1988; Shneiderman, 1998) might have a social as well as an ergonomic basis.

The present study demonstrated that a face, when it is consistent with the voice, indeed enhances the social feel of the interface, in line with the face-voice enhancement effect in the traditional perception and performance arena. When the face is inconsistent with the voice, however, it harms the social feel of the interface, even though it is still well lip-synchronized. This points to the importance of social aspects in developing and assessing computer interfaces including NLI, in addition to the matter of technical perfection. The technological aspect of NLI, because it is well acknowledged and studied, may not pose the largest barrier in the development of NLI. The social aspects of NLI, which could be easily overlooked, deserve more attention and provide enormous opportunity as well as challenge in enhancing the social feel and naturalness of the interface.

Of course, consistency between voice and face is just one pairing in an interface. As NLI and computer interfaces in general exhibit more skills, abilities, and manifestations that are associated with humans, what are the other critical domains of consistency? For example, should the language level of the computer match that of the user in an NLI? Should text output be matched with text input and speech output with speech input? Should the computer’s feedback adapt to and remain consistent with the user’s mood (e.g., more positive and less critical feedback when the user is in a low mood)? The list could go on, because human-computer interaction is as complex and intricate as human-human interaction. Consistency is also just one of the social principles that interface design needs to respect. Further and more extensive exploration of the social aspects of NLI not only provides more guidelines for making a natural language interface more social and natural but also offers insight into human-computer interaction in general. The NLI community is in a particularly advantageous position to achieve this goal because the social aspects of human-computer interaction are more salient in an interface driven by natural language.

 

REFERENCES

 

         Ekman, P. & Friesen, W. V. Nonverbal leakage and clues to deception. Psychiatry, 32, 88-95, 1969.

         Fiske, S. T. & Taylor, S. E. Social Cognition. New York: McGraw-Hill, 1991.

         Isbister, K. & Nass, C. Personality in conversational characters: Building better digital interaction partners using knowledge about human personality preferences and perceptions. Proceedings of the WECC Conference, Lake Tahoe, CA, 1998.

         Kroner, D. G. & Weekes, J. R. Balanced inventory of desirable responding: Factor structure, reliability, and validity with an offender sample. Personality and Individual Differences, 21(3), 323-333, 1996.

         Massaro, D. W. Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. Cambridge, MA: MIT Press, 1998.

         Moon, Y. Intimate self-disclosure exchanges: Using computers to build reciprocal relationships with consumers. Working paper for Harvard Business School, 1998.

         Nass, C. & Mason, L. On the study of technology and task: A variable-based approach. In J. Fulk & C. Steinfeld (Eds.), Organizations and Communication Technology, Newbury Park, CA: Sage, 1991.

         Norman, D. The Design of Everyday Things. New York: Currency Doubleday, 1988.

         Olive, J. P. “The talking computer”: Text-to-speech synthesis. In D. G. Stork (Ed.), HAL’s Legacy: 2001’s Computer as Dream and Reality. Cambridge, MA: MIT Press, 1997.

         Reeves, B. & Nass, C. The Media Equation: How People Treat Computers, Television, and New Media like Real People and Places. New York: Cambridge University Press/CSLI, 1996.

         Shneiderman, B. Designing the User Interface: Strategies for Effective Human-Computer Interaction (3rd ed.). Reading, MA: Addison Wesley Longman, 1998.

         Sproull, L., Subramani, M., Kiesler, S., Walker, J. H. & Waters, K. When the interface is a face. Human-Computer Interaction, 11, 97-124, 1996.

         White, G. Natural language understanding and speech recognition. Communications of the ACM, 33(8), 72-82, 1990.