CHI 2000 Workshop on Natural Language Interfaces

The Hague, The Netherlands, April 3, 2000

 

Shifting the design philosophy of spoken natural language dialogue: From invisible to transparent systems


Laurent KARSENTY

ARAMIIHS-IRIT, University Paul Sabatier
118 route de Narbonne, 31062 Toulouse, France


 

Why transparent systems?

It is often argued that the more invisible a system is, the more users can concentrate on their immediate objectives and appreciate the system. Until recently, spoken natural language interfaces seemed the most promising way to reach this aim: the hope was that, by using spoken natural language, users would not have to learn any specific language to interact with computers. They would simply speak, as naturally as possible, and the computers would understand and respond to them in the most appropriate way.

This vision gave rise to a design philosophy: spoken language interfaces had to be able to process all the naturally occurring language phenomena (speech disfluencies, anaphors, indirectness of speech acts, synonyms, etc.) and exhibit human-like behaviors (open-ended questions, politeness, cooperativeness, etc.). A great effort was then devoted to research in speech recognition, natural language processing and computational linguistics. The goal - whether it had been implicitly or explicitly stated - was to obtain "perfect performance" in terms of speech recognition and natural language processing. In the following, we will refer to this design philosophy as the "invisible computer philosophy".

Following this design philosophy, promising prototypes appeared in the laboratories. However, once the usability of these prototypes was tested, unexpected difficulties appeared. For internal or external reasons, users frequently did not produce the kind of fluent but constrained speech that a speech recognizer has been trained to process (Hauptmann & Rudnicky, 1988, Karis & Dobroth, 1991, Tucker & Jones, 1991). Users experienced serious difficulties in correcting the system's misrecognitions and misunderstandings, which were - and still are - relatively frequent (Karsenty, 1999). Users formed wrong expectations as to the system's capabilities (Boyce & Gorin, 1996). More surprisingly, it was found that many users refrained from speaking naturally. At the same time, it was observed that these users were often unable to identify the single words or expressions that the system expected (Yankelovich, 1996). Related findings reported that users, especially during their first contact with a new system, built an incorrect cognitive model of how the system works and how to interact with it (Kamm, 1994, Karsenty, 1997). The problem is that, since users seem to adapt spontaneously to what they assume the machine's capabilities to be (Luzzati & Neel, 1989, Spitz, 1991), the use of an incorrect cognitive model of the system inevitably results in unexpected and problematic user behaviors.

All these findings led to the conclusion that it would not be reasonable to expect a spoken natural language system to understand every type of user response. Researchers realized that users should not speak as freely as in human conversation. To be successful, the system's prompts had to constrain the users' speech while providing a feeling of "natural dialogue" (Hansen, Novick & Sutton, 1996). A number of studies were then conducted to determine which types of prompts were the most likely to elicit the expected responses from users (e.g., Zoltan-Ford, 1991, Kamm, 1994, Oviatt, Cohen & Wang, 1994).

It is worth noting that the implications of these findings go beyond the design of prompts. All the studies mentioned earlier highlighted that the goal of designing a strictly invisible system did not match the major need of the users: every user needs to build a cognitive model of the system, in order to understand the system's behaviors and plan or anticipate the next steps in a dialogue. In the context of spoken human-computer dialogue, this need is intensified by the fact that the performance of current speech processing systems is still unreliable. Responding to this user need with invisible computer systems can only lead to poor performance, especially because the majority of users have little experience with spoken interfaces: hence, most of them build an incorrect cognitive model of the system and produce, as a result, unexpected and problematic behaviors.

A shift in spoken interface design is required: rather than seeking a perfect performance of speech recognizers and natural language processors, the goal should preferably be to help users build an internal model of the system's functions, capabilities and limitations. In other words, a new design philosophy is required, one that promotes transparency (Maass, 1983) in spoken human-computer dialogue (referred to as "transparent computer philosophy" in the following text).

A transparent system should help users to determine the services offered by the system. It should help users know not only what to say (Yankelovich, 1996) but also when and how to say it. It should provide users with positive and negative evidence of what the system has understood and done (Brennan & Hulteen, 1995) and, if necessary, make the limitations of the system apparent (Kamm, Litman & Walker, 1998). In brief, a transparent system, through its behavior, "makes it easy for users to build up an internal model of the system" (Maass, 1983, p. 25). It should be, in a certain way, "visible from the inside". The main issue raised by this design philosophy, in the context of a system using natural language as its input and output medium, is to determine which linguistic behaviors are necessary and sufficient to reach this goal, knowing that long and/or complex messages are excluded from the design space.

In the following, I present an illustrative example of how transparency may be achieved in a specific dialog context and how it may improve spoken human-computer dialogue.

Comparing the invisible vs. transparent computer philosophies: an example

Study context

In the context of a collaborative effort conducted with the CNET (French Research Center on Telecommunications), my work aims to define and test the necessary transparency strategies for spoken natural language dialogue. An application was chosen to carry out this study: AGS, a directory of Audiotel servers for job searches and weather bulletins developed by the CNET on the basis of ARTIMIS technology (Sadek et al., 1996).

AGS uses speaker-independent continuous speech recognition, natural language understanding and text-to-speech synthesis. AGS handles a vocabulary of approximately 1000 words. A first prototype of AGS (AGS97 in the following) was particularly representative of those systems built on the invisible computer philosophy: it offered, at the start of the dialogue, only a polite open-ended prompt ("What can I do for you?"). It burdened users neither with systematic comprehension feedback nor with prompts listing the possible answers and actions at each step of the dialogue. Finally, its comprehension error messages were simple and polite ("Excuse me?"). After a series of usability tests, AGS97 evolved into a new version called AGS99, which exhibited many transparency strategies. We then conducted a new study aimed at comparing AGS97 with AGS99. Given the limited length of this text and the particular purpose of this workshop, I will not present all the results of this study. I will only mention the results relating to the first prompt of the systems.

Description and rationales of the transparency strategy used for the first prompt

Unlike AGS97, AGS99 recalls its function by introducing itself: "Welcome to AGS, a directory of Audiotel job search and weather servers." AGS99 then asks a question, "What would you like?", which is more direct than the one used by AGS97, "What can I do for you?". These changes were made for two reasons:

Test procedure

The test focused on the job search directory of AGS. Twenty-eight users, with various professional occupations, ages and levels of computer expertise, took part in this test. Each user was placed at a desk with a telephone and an information sheet describing the system. It is worth stressing that this information sheet, whether it concerned AGS97 or AGS99, described among other things the system's functions. The test was conducted on an individual basis. Each user tested only one version of AGS (14 users tested AGS97 and 14 others tested AGS99) and made three different calls with three different requests. Users freely chose each request on the basis of a "broad" instruction given by the tester. The dialogues were audio-recorded and transcribed. Each dialogue with AGS was followed by an interview, which allowed us to investigate the way the system's responses were interpreted by users. Finally, each user was asked to fill in a satisfaction questionnaire.

Performance indicators

To compare the quality of the AGS97 and AGS99 first prompts, users' responses to these prompts were coded in two ways:

·  Types of initial user request formulation: this coding was done to determine whether AGS97 and AGS99 users behaved the same way at the outset of the dialogue. We operated on a dual hypothesis: with the AGS99 prompt, users would produce fewer inappropriate requests with respect to the system's function; they should also produce less complex requests, since the AGS97 prompt ("What can I do for you?") could convey a false, too human image of the system. Four types of user responses, covering all of the responses recorded, were identified:

1.      Naming the job search service: all responses such as "I would like information on job searches" or just "Jobs" were thus codified.

2.      Naming the job search service with a specification of a target parameter: this category includes responses such as "I would like servers for computer programmers" or "I would like information on jobs in the Midi-Pyrénées region."

3.      Naming the job search service with a specification of two target parameters: this category deals with responses that specify both an occupational field and a region, for instance: "I would like job search servers for teaching positions in the Paris area."

4.      Requests containing one or more irrelevant pieces of information: this category dealt with requests that were appropriate with respect to the system's function but that contained at least one irrelevant term. The irrelevant information could be a number of things: naming a person for whom the search was being done (e.g., "I would like job search information for my husband"), mention of the user's level of education, or that of the person for whom the search was being done (e.g., "I am looking for information for someone who has a B.S. in mechanical engineering…"), or the reason for the job search (e.g., "I'm looking for a job in engineering for someone who wants to change jobs").

·  Types of comprehension of the user's request: this coding was used to determine whether AGS99 users managed to make themselves understood correctly more often than AGS97 users when they initially formulated their request. To this end, we coded as "correct comprehension" those requests that the system understood correctly, as "partial comprehension" those requests in which only some of the target terms were understood by the system, and as "miscomprehension" all other requests.
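The two coding schemes described above can be sketched as a small data structure. This is only an illustrative sketch in Python, not the procedure used in the study (the text does not say how the coding was carried out). The names `CodedRequest`, `request_type` and `comprehension` are hypothetical, as is the assumption that each request is annotated with its target terms and with the terms the system understood. Giving category 4 precedence over the parameter count is also an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class RequestType(Enum):
    SERVICE_ONLY = 1       # e.g. "Jobs"
    SERVICE_PLUS_ONE = 2   # service + occupational field OR region
    SERVICE_PLUS_TWO = 3   # service + occupational field AND region
    IRRELEVANT_INFO = 4    # at least one irrelevant term


class Comprehension(Enum):
    MISCOMPREHENSION = 1
    PARTIAL = 2
    CORRECT = 3


@dataclass
class CodedRequest:
    """Hypothetical annotation record for one initial user request."""
    target_terms: set            # e.g. {"jobs", "teaching", "Paris"}
    parameters: int              # number of target parameters: 0, 1 or 2
    has_irrelevant_info: bool    # judged by the human coder
    understood_terms: set = field(default_factory=set)


def request_type(req: CodedRequest) -> RequestType:
    # Assumption: irrelevant information takes precedence over the
    # number of target parameters.
    if req.has_irrelevant_info:
        return RequestType.IRRELEVANT_INFO
    by_parameters = {
        0: RequestType.SERVICE_ONLY,
        1: RequestType.SERVICE_PLUS_ONE,
        2: RequestType.SERVICE_PLUS_TWO,
    }
    return by_parameters[req.parameters]


def comprehension(req: CodedRequest) -> Comprehension:
    # Only the target terms count; anything else the system "heard"
    # is ignored in this simplified sketch.
    understood = req.target_terms & req.understood_terms
    if understood == req.target_terms:
        return Comprehension.CORRECT
    if understood:
        return Comprehension.PARTIAL
    return Comprehension.MISCOMPREHENSION
```

For instance, a request for teaching positions in the Paris area in which the system only picked up "jobs" and "teaching" would be coded as type 3 with partial comprehension.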

Results

Figure 1: Types of initial user request formulation with AGS97 and AGS99

Type of system                                       AGS97           AGS99
                                            Number       %  Number       %
1. Naming the job search service                19    45.2      40    95.2
2. Naming jobs + 1 target parameter             12    28.6       2     4.8
3. Naming jobs + 2 target parameters             4     9.5       0       0
4. Requests with irrelevant information          7    16.7       0       0
Total                                           42     100      42     100

Figure 2: Types of comprehension related to the user's initial formulation of the request with AGS97 and AGS99

Type of system                                       AGS97           AGS99
                                            Number       %  Number       %
1. Miscomprehension                             14    33.3       6    14.3
2. Partial comprehension                        12    28.6       0       0
3. Correct comprehension                        16    38.1      36    85.7
Total                                           42     100      42     100
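The percentages in Figures 1 and 2 follow directly from the counts: each system received 42 initial requests (14 users making three calls each). As a check, the short Python sketch below recomputes them from the counts reported in the two figures. The names `COUNTS` and `percentages` are illustrative only.

```python
# Counts copied from Figures 1 and 2, in row order.
COUNTS = {
    "request_type": {              # Figure 1, rows 1 to 4
        "AGS97": [19, 12, 4, 7],
        "AGS99": [40, 2, 0, 0],
    },
    "comprehension": {             # Figure 2, rows 1 to 3
        "AGS97": [14, 12, 16],
        "AGS99": [6, 0, 36],
    },
}


def percentages(counts):
    """Return each count as a percentage of the column total, to one decimal."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]
```

Calling `percentages(COUNTS["comprehension"]["AGS99"])`, for example, gives the AGS99 percentage column of Figure 2.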

The results of this dual codification are presented in figures 1 and 2. We can notice that:

Discussion

The AGS97 and AGS99 first prompts did not have the same effect on the users’ linguistic behavior. Yet, if we only take into account the question asked by each system - "What can I do for you?" for AGS97 and "What would you like?" for AGS99 - it is hard to conceive how these questions alone can explain the differences noticed, in particular the nearly unanimous choice on the part of AGS99 users to make a simple request for the desired service, whereas AGS97 users show a greater diversity in the formulation of their initial request. The context in which the question is presented must therefore play a particularly important role.

We might hypothesize that the welcome message of AGS99 led most users to interpret the system's question "What would you like?" as a question meaning: "Would you like jobs or weather?" or "Would you like job search servers or weather servers?" As for the AGS97 welcome context, it was unable to orient the interpretation of the question "What can I do for you?" in any other direction than "What service can I provide for you?" or "What information would you like?" According to this hypothesis, AGS97 users formulated requests with several - and sometimes irrelevant - pieces of information not because it is "natural" to formulate such complex requests but because they had not clearly identified the information expected by the system.

An analysis of users’ comments made during interviews seems to confirm this hypothesis. Many AGS99 users said they understood the system prompt "What would you like?" as a request for a choice between the weather and jobs. On the other hand, several AGS97 users expressed the trouble they had in deciding on an answer to the question "What can I do for you?": they emphasized "the vagueness of this question", or, for some users, "the great effort required to figure out what request to make, knowing the machine cannot understand everything". No AGS99 user expressed any difficulty in answering the first system prompt.

This analysis leads us to believe that users of spoken natural language systems are not at ease with open questions and prefer, to some extent at least, their answers to be oriented by the system's questions. This analysis also leads us to believe that, without changing the open form of the questions the system asks, one can radically change their meaning and consequently affect the users’ linguistic behavior by defining an adequate transparency context.

Conclusion

During the last decade, a major discovery was made in the field of spoken natural language interfaces: we realized that the meaning of the expression "speaking naturally" was not "to speak with no constraints" but the opposite. Users need some constraints to determine the most effective way to talk to a machine. A transparent system can meet this requirement by making its functions, capabilities and limitations "visible".

This paper presented an example of how transparency may improve spoken natural language dialogue. However, the good results associated with the tested transparency strategy must not conceal the mass of unresolved issues: how and when to provide users with a clear indication of the system's limitations, knowing that these limitations concern the databases it can exploit and its language processing capabilities? How to help users identify the system's understanding of their requests without using explicit confirmation strategies? How to help users easily understand the system's comprehension errors and know the most effective way to correct them? Which explicit user requests for transparency should be expected and made possible in a spoken human-computer dialogue? Finally, in the study presented in this paper, the system could provide users with only two services (jobs and weather). In this case, it was not tedious for users to listen to a prompt listing both these services. It would be another problem if one had to design a similar prompt for a system offering, let us say, 10 or 20 services.

These are just a few of the issues raised by the transparent computer design philosophy.

Acknowledgments

The author acknowledges support from France Telecom for funding this research. Thanks to the research team at the National Research Center on Telecommunications, in particular Philippe Bretier, Alain Cozannet, Franck Panaget and David Sadek.

References

Boyce S.J. & Gorin A.L. (1996). User Interface Issues for Natural Spoken Dialog Systems. Proceedings of the ISSD'96 International Symposium on Spoken Dialog (pp. 65-68), Philadelphia.

Brennan S.E. & Hulteen E.A. (1995). Interaction and feedback in a spoken language system: a theoretical framework. Knowledge-Based Systems, 8(2-3), 143-151.

Hansen B., Novick D.G. & Sutton S. (1996). Systematic Design of Spoken Prompts. Proceedings of CHI'96, pp. 157-164, Vancouver, April 13-18. ACM Press.

Hauptmann A.G. & Rudnicky A.I. (1988). Talking to computers: An empirical investigation. International Journal of Man-Machine Studies, 28, 583-604.

Kamm C. (1994). User Interfaces for Voice Applications. In: Roe D.B. & Wilpon J.G. (eds.) Voice communication between humans and machines (pp. 422-442). Washington: National Academy Press.

Kamm C., Litman D.J. & Walker M.A. (1998). From novice to expert: The effect of tutorial on user expertise with spoken dialogue systems. In: Mannell R.H. & Robert-Ribes J. (eds.) Proceedings of ICSLP'98 (pp. 1211-1214), Australian Speech Science and Technology Association, Inc.

Karis D. & Dobroth K.M. (1991). Automating services with speech recognition over the public switched telephone network: Human Factors considerations. IEEE Journal on Selected Areas in Communications, 9(4), 574-585.

Karsenty L. (1997). Cognitive analysis of user behaviors in spoken human-computer dialogue. Research report IRIT, Contract France Télécom, CNET n°96-75A3, Oct. 1997 (in French).

Karsenty L. (1999). Application of the transparency principle to human-computer telephone dialogues. Research report IRIT, Contract France Télécom, CNET n°99-1B-079, October 1999.

Luzzati D., & Neel F. (1989). Dialogue behaviour induced by the machine. Proceedings of the European Conference on Speech Communication Technology (EUROSPEECH'89), 601-604. CEP.

Maass S. (1983). Why systems transparency? In: T.R.G. Green, S.J. Payne & G.C. van der Veer (eds.) The Psychology of Computer Use, pp. 19-28, London: Academic Press.

Oviatt S.L., Cohen P.R. & Wang M.Q. (1994). Toward interface design for human language technology: Modality and structure as determinants of linguistic complexity. Speech Communication, 15, 283-300.

Sadek D., Ferrieux A., Cozannet A., Bretier P., Panaget F. (1996). Effective human-computer cooperative spoken dialogue: The AGS demonstrator. Proceedings of ICSLP’96, Philadelphia, 1996.

Spitz J. (1991). Collection and analysis of data from real users: Implications for speech recognition/understanding systems. Proceedings of the Fourth DARPA Speech and Natural Language Workshop (pp. 164-169), Pacific Grove, California, Feb. 1991. Morgan Kaufmann.

Tucker P. & Jones D.M. (1991). Voice as Interface: An Overview. International Journal of Human-Computer Interaction, 3(2), 145-170.

Yankelovich N. (1996). How do users know what to say? ACM Interactions, 3(6), 32-43.

Zoltan-Ford E. (1991). How to get people to say and type what computers can understand. International Journal of Man-Machine Studies, 34, 527-547.