CHI 2000 Workshop on Natural Language Interfaces
The Hague, The Netherlands, April 3, 2000
Usability Evaluation of Voice Operated Information Systems: Analysis of Dialogues with Real Customers
Jacques Terken, Mieke Beers
IPO, Center for User-System Interaction
Eindhoven University of Technology
j.m.b.terken@tue.nl
1. Introduction
One of the main problems in "speech-only" spoken dialogue systems is explaining the functionality of the system to the user: what can be done at each turn in the dialogue. If the user does not know which responses are appropriate at each stage in the dialogue, the system will have to deal with many inappropriate responses that it cannot handle well: utterances containing hesitations and restarts, and out-of-domain requests in which the user requests information that is beyond the system's competence.
The most obvious solution to avoid this problem is to make the system behave in a rather directive way: in each turn the system indicates precisely what information is expected from the user. However, this often results in somewhat clumsy and artificial dialogues.
Several solutions have been tried out to avoid the disadvantages of directiveness, involving the idea of zooming: the system asks a question for information at a fairly general level, and if the user one way or the other cannot produce an appropriate response, the system proceeds with a more specific question containing hints about the kind of answers that are appropriate at this stage of the dialogue (Kamm and Walker, 1997; Sturm, Den Os and Boves, 1999; Veldhuijzen Van Zanten, 1998). In this way, more experienced users may respond appropriately to the general questions, while less experienced users may wait until they get hints. Still, in the initial phases of each sub-dialogue the less experienced users are likely to give inappropriate answers, which may drive the system into behavior that is incomprehensible to the user.
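The zooming idea can be sketched in a few lines of Python. This is only an illustration of the general principle, not the implementation of any of the cited systems: the prompt texts, the function name `ask_with_zooming` and the callbacks `get_reply` and `is_usable` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical prompt ladder: a general question first, then one with hints.
PROMPTS: List[str] = [
    "How can I help you?",
    "Please name your departure and arrival station, "
    "for example: from Eindhoven to Amsterdam.",
]


def ask_with_zooming(
    prompts: List[str],
    get_reply: Callable[[str], str],
    is_usable: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Ask increasingly specific prompts until a usable reply arrives.

    Returns the usable reply and the index of the prompt that elicited it,
    or (None, len(prompts)) if every prompt failed.
    """
    for level, prompt in enumerate(prompts):
        reply = get_reply(prompt)
        if is_usable(reply):
            return reply, level
    return None, len(prompts)
```

An experienced user would answer at level 0, while a less experienced user only produces a usable answer at level 1, after the hint. The inappropriate reply given at level 0 is exactly the kind of input the system still has to cope with.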
Another problem concerns the evaluation of spoken dialogue systems. For instance, in order to evaluate the different approaches with respect to the explanation of the functionality, potential end users are invited to the laboratory in which they go through a number of scenarios. The advantages are obvious. Scenarios can be designed such that they focus on issues of interest; and the same scenario can be executed by different subjects so as to meet the need of replicability and experimental control. In addition, the objective data about system performance are supplemented by subjective judgments obtained on the basis of interviews or questionnaires completed by the subjects. The disadvantage is also obvious. Subjects who are invited to the laboratory are paid for participation and are not really in need of the information to be obtained. As a result, they may be much more benevolent and cooperative than real users who use the service to obtain information they really need and who pay for the service. Furthermore, the scenarios in the laboratory studies give guidance concerning what information to obtain from the system, whereas in real situations users may be less certain about what information can be obtained. As a result, the laboratory studies may in fact give a flattering view of the state-of-the-art.
In the present paper we present an analysis of dialogues collected from real customers with a speech-only spoken dialogue system in which the dialogue is mainly system-driven. Approximately 60 percent of the calls are successful in the sense that they end with a delivery of the requested information. Of the failed dialogues, approximately 15 percent fail because the callers require information which is outside the domain of the system. We contend that even with a system-driven approach it is quite difficult for users to identify the functionality of the system, and that users are in serious need of support in this respect. We suggest that the development of multi-modal interfaces is one way of providing such support. Secondly, we contend that the occurrence of out-of-domain requests by the user may be seriously underestimated by laboratory studies, and that real-life evaluations are a necessary addition to laboratory evaluations.
2. Method
2353 dialogues collected from callers of a public transport information service were transcribed and annotated for analysis. The primary interest was in identifying apparent causes of failure, and in verifying, on the basis of analyses of successful dialogues, whether such "causes" indeed give a high likelihood of dialogue failure. For that reason we did not take a random sample of dialogues, but instead specified the ratio of successful to failed dialogues to be analysed in advance. For practical reasons we ended up with a total of 2353 dialogues, divided into 1278 failed dialogues and 1075 successful ones. Relevant characteristics such as the occurrence of recognition errors, the occurrence of mixed initiative, out-of-domain requests and so forth were coded by trained annotators. Also, the apparent immediate cause of dialogue failure was annotated.
The system has been operational since February 1998. Currently, people calling the operator service are offered the automatic service if they are in a waiting queue and need just information about train departure and arrival times. The system guides the user through the dialogue by prompting him to provide the parameter values for the requested information. The user may give more information than requested in the prompt, which means that the system can handle mixed initiative. Once all parameter values are known to the system, it accesses the database and translates the query result into a spoken travel advice. At each moment the user may quit the dialogue by dialing 9, which will connect him to the human operator.
Two mechanisms are applied to improve the efficiency of the dialogue. In the first place, the system assumes by default that the time mentioned is the departure time and that the period of the day at which the call takes place is the frame of reference for interpreting time expressions ("at eight o'clock" may be 8 a.m. or 8 p.m., and without defaults the system would have to check explicitly). Thus, the user needs to spend extra turns overruling the defaults only in cases where they do not apply. In the second place, the system does not explicitly verify information that has been extracted from the last utterance, but just includes what it has recognized in the prompt for the next parameter, as in the following example:
System: From where to where do you want to travel?
User: From Eindhoven to Amsterdam
System: When do you want to travel from Eindhoven to Amsterdam?
User: Tomorrow
By not responding to the presupposition in the second system utterance the user implicitly confirms that the information is correct. Thus, the user needs to spend extra turns correcting a mistake only if the system has made a recognition error.
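The two mechanisms can be illustrated with a minimal Python sketch. It is not the deployed system's code: the slot names, the function names `resolve_hour` and `next_prompt`, and the wording of the time prompt are assumptions. In particular, reading "the period of the day" as the a.m. or p.m. half in which the call takes place is only one possible interpretation of the default.

```python
from typing import Dict, Optional


def resolve_hour(spoken_hour: int, call_hour: int) -> int:
    """Map a 12-hour clock value onto 24-hour time.

    Assumption: the half of the day in which the call takes place
    (call_hour < 12 means a.m.) decides between a.m. and p.m.
    """
    if not 1 <= spoken_hour <= 12:
        raise ValueError("expected an hour between 1 and 12")
    base = spoken_hour % 12  # 12 o'clock becomes 0 before the offset
    return base + 12 if call_hour >= 12 else base


def next_prompt(slots: Dict[str, str]) -> Optional[str]:
    """Return the next question, echoing what was recognised so far.

    The echo is the implicit verification: the user only has to react
    if the echoed values are wrong.
    """
    origin = slots.get("origin")
    destination = slots.get("destination")
    if origin is None or destination is None:
        return "From where to where do you want to travel?"
    route = f"from {origin} to {destination}"
    if "date" not in slots:
        return f"When do you want to travel {route}?"
    if "time" not in slots:
        return f"At what time do you want to travel {route}?"
    return None  # all parameter values known: query the database
```

With these assumptions, "eight o'clock" in a call placed at 10:00 resolves to hour 8 and in a call placed at 19:00 to hour 20, and the second prompt of the example dialogue is produced once both station names have been filled in.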
3. Results and Discussion
Causes for failure
Table 1 gives the causes of dialogue failure for the failed dialogues.
Table 1: Frequency distribution of causes of failure in failed dialogues

| Cause | Failed dialogues | % |
|---|---|---|
| Total | 1278 | 100 |
| Recognition error | 609 | 48 |
| Out-of-domain request | 186 | 15 |
| Mixed initiative | 13 | 1 |
| Default overruling | 71 | 5 |
| Silence | 33 | 3 |
| Selfcorrection | 12 | 1 |
| Interruption of call | 196 | 15 |
| Other | 158 | 12 |
As can be seen, about half (48 percent) of the failed dialogues failed because of recognition errors, and 15 percent foundered on out-of-domain requests. Also, in 15 percent the user simply interrupted the call for no immediate reason. Closer analysis of the logs of these dialogues may give an indication of the motivation for doing so. Twelve percent were not finished for a variety of other reasons.
Although recognition errors and out-of-domain requests are responsible for the majority of failures, they need not necessarily be fatal. There might be an even larger number of recognition errors and out-of-domain requests that are adequately handled by error recovery strategies. Therefore, we need to evaluate the seriousness of the main problems by looking at the way they are dealt with in cases where they do not result in dialogue failure. By definition, this is the case in successful dialogues.
The relevant data are given in Table 2.
Table 2: Frequency distribution of recognition errors and out-of-domain requests in successful and failed dialogues

| | Successful (n) | Successful (%) | Failed (n) | Failed (%) |
|---|---|---|---|---|
| Total | 1075 | | 1278 | |
| No Error, no Out of Domain | 889 | 83% | 267 | 21% |
| Errors | 178 | 17% | 808 | 63% |
| Out of Domain | 4 | <1% | 162 | 13% |
| Error + Out of Domain | 4 | <1% | 41 | 3% |
As can be seen from Table 2, less than 1 percent (8/1075) of the successful dialogues contain out-of-domain requests, whereas 16 percent (203/1278) of the failed dialogues contain out-of-domain requests. Across successful and failed dialogues we find 211 out-of-domain requests. As already shown in Table 1, in 186 failed dialogues the out-of-domain request was the cause of failure. This amounts to the observation that the occurrence of an out-of-domain request gives a very high probability of dialogue failure.
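The arithmetic behind this paragraph can be retraced with a short Python sketch. The counts are taken from Tables 1 and 2; the dictionary layout and variable names are illustrative. Because the ratio of successful to failed dialogues was fixed in advance, the resulting shares describe this sample and should not be read as field estimates.

```python
# Counts copied from Table 2 (rows) for successful and failed dialogues.
TABLE2 = {
    "successful": {"none": 889, "error": 178, "ood": 4, "error+ood": 4},
    "failed": {"none": 267, "error": 808, "ood": 162, "error+ood": 41},
}
OOD_AS_CAUSE_OF_FAILURE = 186  # from Table 1


def ood_count(outcome: str) -> int:
    """Dialogues containing an out-of-domain request, with or without errors."""
    row = TABLE2[outcome]
    return row["ood"] + row["error+ood"]


ood_successful = ood_count("successful")          # 4 + 4 = 8
ood_failed = ood_count("failed")                  # 162 + 41 = 203
ood_total = ood_successful + ood_failed           # 211

# Share of dialogues with an out-of-domain request that ended in failure,
# and share in which the request itself was annotated as the cause.
share_in_failed = ood_failed / ood_total
share_as_cause = OOD_AS_CAUSE_OF_FAILURE / ood_total

print(f"{ood_total} requests, {share_in_failed:.0%} in failed dialogues, "
      f"{share_as_cause:.0%} annotated as cause of failure")
```

In this sample, roughly 96 percent of the dialogues with an out-of-domain request failed, and in roughly 88 percent the request was annotated as the cause of failure.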
It may be noted that so far we have used only a rather broad definition of out-of-domain requests, which includes both cases which are really beyond the domain of discourse and cases which are within the domain of discourse but where the user requests information for a non-existent train station. Of the 211 out-of-domain requests, 111 cases were of the former type and 100 of the latter (non-existent train stations). The former type can be handled by providing the user with a better model of the domain of discourse, but the latter cannot: these cases require that the user has precise knowledge of the ontology constituting the domain of discourse. Currently both situations are handled in the same way. The system maps the incoming speech onto the items in the vocabulary and proceeds as usual. It will for instance substitute the name of an existing station for a non-existing one. If the NLP-component cannot parse the utterance, the system will reply "I didn't understand". Both types of system behavior are quite uninformative as to the real problem. Obviously, a better way of handling the different types of out-of-domain requests will improve the performance of the system.
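Why forced mapping onto the vocabulary is uninformative can be shown with a toy Python sketch. String similarity from the standard library stands in for acoustic matching here, which is a strong simplification. The station list, the function names and the threshold of 0.6 are hypothetical and say nothing about how the actual recogniser works.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence

# Toy vocabulary; the real system knows far more station names.
STATIONS = ["Eindhoven", "Amsterdam", "Utrecht"]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def forced_match(heard: str, vocabulary: Sequence[str]) -> str:
    """Always return some vocabulary item, however poor the fit."""
    return max(vocabulary, key=lambda station: similarity(heard, station))


def match_or_reject(
    heard: str, vocabulary: Sequence[str], threshold: float = 0.6
) -> Optional[str]:
    """Return the best item only if it fits well enough, otherwise None.

    A None result would let the dialogue manager say that the station
    is unknown instead of silently substituting another one.
    """
    best = forced_match(heard, vocabulary)
    return best if similarity(heard, best) >= threshold else None
```

With forced matching, an unknown name always comes back as some existing station, which is the behavior described above. A rejection threshold is one conceivable way to tell the two situations apart, at the price of rejecting some poorly recognised but valid names.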
Recognition error and failure
As was shown in Table 1, recognition errors are the main cause of dialogue failure. However, from Tables 1 and 2 we can derive that 182 out of 1075 successful dialogues also contain recognition errors. Moreover, 849 failed dialogues contain recognition errors, but in only 609 of these is the recognition error the cause of dialogue failure; in the remaining 240 the error is not the cause of failure. This means that a recognition error does not need to be fatal. Apparently there are ways to recover from errors. In order to identify factors which distinguish fatal from non-fatal errors and to identify potentially successful error recovery strategies, we first looked at potential cumulative effects of errors. Table 3 shows the frequencies of successful and failed dialogues containing different numbers of errors in the same dialogue.
Table 3: Frequency distribution of successful and failed dialogues as a function of number of errors per dialogue

| Number of errors per dialogue | Successful (n) | Successful (%) | Failed (n) | Failed (%) |
|---|---|---|---|---|
| Total | 1075 | | 1278 | |
| None | 893 | 83% | 429 | 34% |
| 1 | 128 | 12% | 309 | 24% |
| 2 | 35 | 3% | 305 | 24% |
| 3 | 13 | 1% | 140 | 11% |
| > 3 | 6 | 1% | 95 | 7% |
As can be seen, 95 percent of the successful dialogues contain zero or one errors, with the majority containing zero errors. The picture is different for the failed dialogues. Here, only 58 percent contain zero or one errors, and 42 percent contain two or more errors. We may tentatively conclude that the occurrence of an error gives a high likelihood that there will be more errors in the remainder of the dialogue, and that the dialogue may ultimately fail.
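The percentages quoted here, and the counts of dialogues with recognition errors used earlier in this section, can be recomputed from Table 3 with a small Python sketch. The list layout and the helper names are illustrative; the counts are those of Tables 1 and 3.

```python
# Table 3 counts, indexed by number of errors per dialogue: 0, 1, 2, 3, >3.
TABLE3 = {
    "successful": [893, 128, 35, 13, 6],
    "failed": [429, 309, 305, 140, 95],
}
ERROR_AS_CAUSE_OF_FAILURE = 609  # from Table 1


def share_at_most_one_error(counts: list) -> float:
    """Fraction of dialogues containing zero or one recognition error."""
    return sum(counts[:2]) / sum(counts)


def dialogues_with_errors(counts: list) -> int:
    """Number of dialogues containing at least one recognition error."""
    return sum(counts[1:])


successful_share = share_at_most_one_error(TABLE3["successful"])  # 1021 / 1075
failed_share = share_at_most_one_error(TABLE3["failed"])          # 738 / 1278

successful_with_errors = dialogues_with_errors(TABLE3["successful"])  # 182
failed_with_errors = dialogues_with_errors(TABLE3["failed"])          # 849
non_fatal_in_failed = failed_with_errors - ERROR_AS_CAUSE_OF_FAILURE  # 240

print(f"zero or one error: {successful_share:.0%} of successful, "
      f"{failed_share:.0%} of failed dialogues")
```

The figures agree with Table 2: 182 successful and 849 failed dialogues contain recognition errors, and in 240 of the latter the error was not annotated as the cause of failure.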
One may ask why errors lead to dialogue failure in some cases but not in others. There may be several explanations. In the first place, the failed dialogues may relate to situations where users request information concerning items that pose difficulties for the recogniser, such as infrequent monosyllabic station names. In this case, the system repeatedly makes an error on the same parameter, for instance it does not manage to get a station name right even after several attempts by the user. In the second place, failed dialogues may stem from users who are difficult for the system to handle, for instance because their pronunciation diverges considerably from the materials on which the speech recogniser was trained. Here, we would expect that the system may get the station name right after an initial recognition error, but subsequently fail again on the travel date and/or the departure or arrival time. In the third place, there may be a difference between successful and unsuccessful error recovery strategies. However, the distribution of the data in Table 3 seems to argue against such an explanation, for the following reason. If the difference between successful and failed dialogues were indeed a matter of successful and unsuccessful error recovery strategies, we would expect to find a more even distribution of successful and failed dialogues in the rows of Table 3: an explanation in terms of error recovery would not affect the frequency of occurrence of multiple errors, but only the way they are dealt with once they occur. Before more definite conclusions can be drawn, however, we need to look more closely into the precise data to see which situations apply.
Comparison with laboratory studies
As explained earlier, the data analyzed in the previous section consist of dialogues with real customers. These results can therefore be compared with the results of an earlier laboratory study involving the same system (Weegels, 1999). This study aimed at comparing different dialogue strategies, implemented in two different systems. Twenty subjects used both systems, going through eight scenarios, four for each system. The success rate in the laboratory study was relatively high compared to what has been observed in real-life situations. For the system that was also the focus of the current study, the success rate in the laboratory study was 78%, compared to approximately 60% in the field situation. In the present context, we are mainly interested in the occurrence of out-of-domain requests. Only one out of thirty failed dialogues failed because of an out-of-domain request in the laboratory study, compared to 15 percent in the field situation (cf. Table 1). A likely explanation is that the users in the laboratory studies were less likely to come up with out-of-domain requests. In the first place the scenarios specify which station names to use. In the second place the scenarios may have shaped the users' expectations through their implicit guidance concerning the functionality of the system.
Thus, we conclude that laboratory studies may give a flattering impression with respect to the occurrence of out-of-domain requests, while precisely these types of utterances have a high probability of resulting in dialogue failure.
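The size of the gap can be made explicit with a few lines of Python. The counts are those reported above and in Table 1; the variable names are illustrative. With only thirty failed laboratory dialogues, the laboratory share is necessarily a coarse estimate.

```python
# Failed dialogues whose failure was attributed to an out-of-domain request.
LAB_OOD_FAILURES, LAB_FAILED = 1, 30            # laboratory study
FIELD_OOD_FAILURES, FIELD_FAILED = 186, 1278    # field data, Table 1

lab_share = LAB_OOD_FAILURES / LAB_FAILED
field_share = FIELD_OOD_FAILURES / FIELD_FAILED
ratio = field_share / lab_share

print(f"laboratory: {lab_share:.1%}, field: {field_share:.1%}, "
      f"field/laboratory ratio: {ratio:.1f}")
```

On these counts, out-of-domain requests account for a share of failures in the field that is more than four times the share observed in the laboratory.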
4. Conclusion and prospects
We have analysed data from real life situations and compared them with findings from earlier laboratory studies involving the same system. Although conclusions are tentative because further analyses are needed on detailed aspects of the data, two conclusions seem already supported by the current analyses. In the first place, success or failure does not appear to be a matter of successful or unsuccessful user strategies for error recovery but rather a matter of speaker characteristics or requests for information that is difficult to handle due to the precise properties of the recogniser. In the second place, laboratory studies seriously underestimate the occurrence of out-of-domain situations, while precisely those situations are quite likely to result in dialogue failure.
Obviously, out-of-domain situations are related to the inherent problem of explaining to the user the precise functionality of the system in speech-only spoken dialogue systems. Starting from the observation that even cellular phones have small screens, we are currently exploring the possibility of using the visual modality to inform the user about the domain model and the functionality of the system.
Acknowledgement
The current study was made possible by grant 305.00.532 from the Netherlands Organisation for Scientific Research (NWO). KPN Telecom is gratefully acknowledged for making available the dialogue materials. Loe Boves and Mieke Weegels are gratefully acknowledged for support during various stages of the research.
References
Kamm, C. and Walker, M. (1997), "Design and evaluation of spoken dialogue systems", in 1997 IEEE Workshop on Automatic speech recognition and understanding proceedings (eds. S. Furui, B.H. Wang and W. Chou), IEEE Signal Processing Society.
Sturm, J., Den Os, E. and Boves, L. (1999), "Issues in spoken dialogue systems: Experiences with the Dutch ARISE system", in Proceedings of ESCA workshop on interactive dialogue in multi-modal systems, Kloster Irrsee, Germany.
Veldhuijzen Van Zanten, G. (1998), "Adaptive mixed-initiative dialogue management", in Proceedings of IVTTA'98: IEEE 4th workshop on interactive voice technology for telecommunications applications, Turin, Italy.
Weegels, M. (1999), "Usability evaluation of voice-operated information services", Internal report nr. 79, Priority Programme Language and Speech technology. IPO, Center for Research on User-System Interaction, Eindhoven University of Technology, The Netherlands.