CHI 2000 Workshop on Natural Language Interfaces

The Hague, The Netherlands, April 3, 2000

 

A Prototyping Tool for Telephone Speech Applications
(or, keeping your costs low and your users happy)

Cathy Pearl

Nuance Communications

Introduction
Speech recognition applications are an increasingly common example of natural language understanding in the real world. Many of these applications run over the telephone and are used for tasks such as getting stock quotes, making travel arrangements, and acting as personal assistants. Unlike open-ended problems such as search engines and dictation, these applications provide a more structured framework.

The media is full of stories about the future of speech recognition: surf the web in your car, browse through the latest newspaper headlines, or get a reminder to send your spouse flowers and then carry out the transaction at a local flower shop. With the increasing amount of time people spend on their cell phones in their cars, the possibilities are endless. Unfortunately, trying to punch in the name of the movie "Breakin’ 2: Electric Boogaloo" on your touch-tone keypad while driving on the freeway is perhaps not the safest idea.

Speech recognition will become an increasingly important user interface as hands-free technology improves. Some countries in Europe have introduced laws prohibiting cell phone usage while driving, which will make hands-free models essential. Good natural language understanding is key to making these applications succeed. Telephone-based speech recognition systems also provide the user with less feedback than a visual interface such as web surfing on your computer. Because there is only audio feedback, the way the dialog is put together is essential to guiding users and decreasing their level of frustration.

As speech recognition applications become more commonplace, it is important that the NL community create tools to standardize the design and creation of these applications, which will speed up development and cut costs. One of the problems with designing for speech recognition is that it is difficult to assess the dialog flow of an application by looking at a written description, such as a dialog design document. When the application goes into its first pilot, errors in the dialog are discovered that are costly and time-consuming to fix at that stage.

A dialog design document is one of the primary methods for designing directed speech recognition applications. It outlines each state of the application, the behavior of the state, what prompts are played, how errors are handled, and what to do next, depending on what the user says. The nature of this directed dialog approach is such that at any given state there are a limited number of things the user can say. There is flexibility in how the user can say these things, but only a certain number of options. Such a document is very useful as a specification for writing code, but it does not give the reader an understanding of how the dialog will actually sound.
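To make the contents of such a document concrete, here is a minimal Python sketch of one way a state could be written down as data. The class name, field names, and the flight and sweepstakes prompts are illustrative assumptions, not part of any particular toolkit or of the tool proposed in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DialogState:
    """One state of a directed dialog (hypothetical structure)."""
    name: str
    prompt: str
    # Maps what the user asked for (an intent label) to the next state's name.
    transitions: Dict[str, str] = field(default_factory=dict)
    # Default error handling, overridable per state.
    no_match_prompt: str = "I'm sorry, I didn't understand."
    no_input_prompt: str = "I'm sorry, I didn't hear you."

def dangling_transitions(states):
    """Return (state, intent, target) triples whose target state is undefined."""
    return [
        (state.name, intent, target)
        for state in states.values()
        for intent, target in state.transitions.items()
        if target not in states
    ]

# Illustrative states; only the main menu and car prompts come from the text.
STATES = {
    "main_menu": DialogState(
        "main_menu",
        "Welcome to TravelFlo. Would you like to rent a car, "
        "book a flight, or enter our sweepstakes?",
        {"rent_car": "rent_car", "book_flight": "book_flight",
         "sweepstakes": "sweepstakes"},
    ),
    "rent_car": DialogState("rent_car", "What type of car would you like?"),
    "book_flight": DialogState("book_flight", "Where would you like to fly?"),
    "sweepstakes": DialogState("sweepstakes", "Please say your phone number."),
}

print(dangling_transitions(STATES))
```

A structure like this can be checked mechanically for missing states, but, like the written document, it still says nothing about how the dialog sounds.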

Application development typically includes one or more pilot testing phases, which allow users to interact with the system and find problems. Pilots are a time-consuming process which can require modification of grammars, re-recording of prompts, and recognition tuning. Much of this work could be avoided if appropriate testing were done in the initial stages.

Many applications use a different grammar for each recognition state, which must contain all of the words and phrases that are allowable. In addition to basic commands, these grammars must take into account fillers (‘um’, ‘uh’, ‘please’, ‘I want to’, etc.). Having a well-tuned grammar is essential for good speech recognition. If the grammar is too sparse, users will not be able to phrase their requests flexibly. If the grammar covers too many things users will rarely or never say, recognition suffers.
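As a rough illustration of commands plus fillers, the following Python sketch strips a small filler list from a transcribed utterance and looks up what remains. Real recognition grammars are written in a recognizer's own grammar format; the filler list, the command table, and the function names here are assumptions made for the example.

```python
import re

# Hypothetical filler phrases and commands for a single recognition state.
FILLERS = ("um", "uh", "please", "i want to", "i'd like to")
COMMANDS = {
    "rent a car": "rent_car",
    "book a flight": "book_flight",
    "enter the sweepstakes": "sweepstakes",
}

def normalize(utterance):
    """Lowercase, drop punctuation, and remove filler phrases."""
    text = utterance.lower().replace("\u2019", "'")
    text = re.sub(r"[^a-z' ]", " ", text)
    # Remove longer fillers first so multi-word fillers are matched whole.
    for filler in sorted(FILLERS, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(filler) + r"\b", " ", text)
    return " ".join(text.split())

def interpret(utterance):
    """Return the command label, or None if the utterance is out of grammar."""
    return COMMANDS.get(normalize(utterance))

print(interpret("Uh, I want to rent a car, please"))
print(interpret("yes"))
```

Every filler or command added to the tables widens what users can say, which is the coverage side of the trade-off described above; a real recognizer also pays for that width in accuracy.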

The directed dialog method can lend itself to smooth, conversational systems. The difficulty is in 1) wording prompts to elicit appropriate responses from users; and 2) predicting what users will say when using the system.

Although experience with designing and deploying applications is one way to help with these issues, every application presents a unique challenge. Usually, the first time the dialog flow is heard (complete with recorded prompts) is during the project’s pilot phase. During the pilot phase, valuable data is obtained by logging real users’ utterances and analyzing them with respect to the current grammars.
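The kind of analysis meant here can be sketched in a few lines of Python: count the transcribed utterances logged at one state and report which fall outside that state's grammar. The function name is hypothetical, and the grammar and utterances are made-up illustrations, not data from any pilot.

```python
from collections import Counter

def out_of_grammar_report(logged, grammar):
    """Return (out-of-grammar rate, [(utterance, count), ...]) for one state."""
    counts = Counter(u.strip().lower() for u in logged)
    oog = Counter({u: n for u, n in counts.items() if u not in grammar})
    total = sum(counts.values())
    rate = sum(oog.values()) / total if total else 0.0
    return rate, oog.most_common()

# Illustrative inputs only.
grammar = {"cancel", "main menu"}
logged = ["cancel", "back up", "Back up", "main menu", "go to the top"]

rate, missed = out_of_grammar_report(logged, grammar)
print(rate, missed)
```

Frequent out-of-grammar phrases are candidates for addition to the grammar, or a sign that the prompt is eliciting the wrong kind of response.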

It is at this point that major dialog changes are made, which leads to re-recording prompts and enhancing and tuning grammars. This can be an expensive and time-consuming process. Many of the problems discovered could have been caught at an earlier stage if the designer had been able to listen to the dialog flow using a prototyping tool.

A Prototyping Tool
To catch problems before the application has even been coded up, a prototyping tool could be used. A simple Wizard of Oz (WOZ) approach would allow the designer to quickly construct the dialog flow and be able to run experiments with real users. The tool is used as a preliminary test to find awkward and ambiguous points in the dialog, as well as to gauge what users will say. When creating the actual grammars, the designer will have real data about what users will be saying.

No recognition needs to take place. Instead, a prompt is played and the user will respond. Based on what the user says, the experimenter can determine what the next state should be. The user will see a continuous dialog flow, regardless of what is said.

First, the designer would create a flowchart outlining all of the application states. For each state, the designer would then record a prompt. Default behaviors for error states are also added, such as a prompt for "I’m sorry, I didn’t understand" and "I’m sorry, I didn’t hear you." These are available to the experimenter at any state.

Once the flowchart is complete and the prompts recorded, the designer can quickly listen to different application paths. This step will eliminate major flow errors which might have been missed when viewing the transitions on paper.
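One way to make sure no path is overlooked when listening is to enumerate the paths mechanically. The Python sketch below walks a flowchart stored as a dictionary; the state names and the `confirm` state are illustrative assumptions, and loops back to an already visited state are simply not followed.

```python
# Hypothetical flowchart: each state maps to the states reachable from it.
FLOW = {
    "main_menu": ["rent_car", "book_flight", "sweepstakes"],
    "rent_car": ["confirm"],
    "book_flight": ["confirm"],
    "sweepstakes": [],
    "confirm": [],
}

def paths(flow, start, path=None):
    """Yield every loop-free path from `start` to a state with no way forward."""
    path = (path or []) + [start]
    # Skip states already on this path so cycles do not recurse forever.
    next_states = [s for s in flow.get(start, []) if s not in path]
    if not next_states:
        yield path
        return
    for state in next_states:
        yield from paths(flow, state, path)

for p in paths(FLOW, "main_menu"):
    print(" -> ".join(p))
```

Each printed path is a sequence of prompts the designer can play back in order.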

The next step is using the system with real users. Users should be chosen who will accurately represent the real users of the system. Each subject will be presented with the first prompt, which could be something such as "Welcome to TravelFlo. Would you like to rent a car, book a flight, or enter our sweepstakes?" Depending on what the user says, the person running the experiment will click on the next appropriate state. In this example, the user might say "uh, rent a car", and the experimenter would click on the "Rent Car" state. The prompt for this state would be played ("What type of car would you like?") and so on. Alternatively, an error state may be chosen. Users' responses are recorded and can be analyzed at a later time.
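The session described above can be sketched as a small Python class. This is an illustration under assumptions rather than the tool itself: prompts are plain strings standing in for recorded audio, the experimenter's click is a method call, and all class, method, and state names are hypothetical.

```python
# Error prompts available to the experimenter at any state.
ERROR_PROMPTS = {
    "no_match": "I'm sorry, I didn't understand.",
    "no_input": "I'm sorry, I didn't hear you.",
}

# Strings stand in for recorded prompts.
PROMPTS = {
    "main_menu": "Welcome to TravelFlo. Would you like to rent a car, "
                 "book a flight, or enter our sweepstakes?",
    "rent_car": "What type of car would you like?",
}

class WozSession:
    """Tracks the current state and logs what each subject said."""

    def __init__(self, prompts, start):
        self.prompts = prompts
        self.state = start
        self.log = []  # (state, utterance, experimenter's choice)

    def current_prompt(self):
        return self.prompts[self.state]

    def choose(self, utterance, choice):
        """Record the utterance and return the next prompt to play."""
        if choice not in ERROR_PROMPTS and choice not in self.prompts:
            raise KeyError("unknown state: " + choice)
        self.log.append((self.state, utterance, choice))
        if choice in ERROR_PROMPTS:
            return ERROR_PROMPTS[choice]  # stay in the same state
        self.state = choice
        return self.current_prompt()

session = WozSession(PROMPTS, "main_menu")
print(session.current_prompt())
print(session.choose("uh, rent a car", "rent_car"))
```

The log pairs each utterance with the state in which it was spoken, which is the raw material for writing that state's grammar later.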

The fact that no recognition is occurring is important. At this stage, it is still being determined what will be part of each grammar. Because the designer does not have to spend time creating and modifying grammars, or worrying about what is or is not being recognized, the experiment cycle is much shorter. Usability is often thought by companies to be a "nice if you have the time" phase in the development cycle, so anything that can make the process shorter and less expensive will help.

Examples
The previous example shows how something that seems straightforward may not be. After the phrase "Would you like to rent a car," some users will immediately respond with "yes", which would result in an out-of-grammar response. I have worked on a project in which this situation came up. The dialog specification was reviewed by many people, but the problem was not discovered until the first pilot, when real people were calling the system. The prompts had to be re-recorded, which was expensive because recording was outsourced to a professional voice company, and the dialog had to be modified as well.

Another example is the way in which people say phone numbers. Even if the prompt is "Please say a ten digit number," projects have shown that people will begin with a one, or say the seven digits followed by "area code six five oh."
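To show what accommodating these variants might involve, here is a small Python sketch that maps a transcribed utterance to a ten-digit string, accepting a leading "one" and a trailing or leading "area code" phrase. The function names and the sample number are assumptions for illustration; a deployed system would handle this in its grammar rather than in post-processing.

```python
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}

def to_digits(words):
    """Convert digit words to a digit string; raises KeyError on other words."""
    return "".join(DIGITS[w] for w in words)

def parse_phone_number(utterance):
    """Return a ten-digit string, or None if the utterance does not fit."""
    text = " ".join(utterance.lower().split())
    if text.startswith("area code "):
        text = text[len("area code "):]
    local, sep, area = text.partition(" area code ")
    try:
        if sep:
            # "seven digits, area code three digits": area code goes first.
            digits = to_digits(area.split()) + to_digits(local.split())
        else:
            digits = to_digits(text.split())
    except KeyError:
        return None
    # Drop a leading long-distance "one".
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None

print(parse_phone_number("five five five one two one two area code six five oh"))
```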

Oftentimes, people will use different wordings than are initially considered by a designer. Designers are frequently engineers, and are very comfortable with words like ‘cancel’ and ‘main menu’. I have seen real users say ‘back up’ and ‘go to the top’ instead.

Overall flow is also an issue. Will users become confused when they are dumped back at the main menu after placing a call to someone who left a voice mail? Will it be too difficult for them to navigate through the menus to change their personal greeting?

More abstract things can be tested as well. Prompts can have very different styles; a stock quote application might be best served with a formal voice and wording, whereas a pizza delivery application might sound better with a casual style and plenty of slang. With a WOZ tool, the designer can easily record different sets of prompts and see how users respond.

These things will be discovered during the first pilot, but at this point it becomes a much more difficult process to change the application and have prompts re-recorded professionally. Doing this work up front allows the designer to make changes quickly and without affecting the people who are integrating the application.

Experimentation
In addition to initial testing of an application that is planned to be built, the WOZ environment allows designers to experiment with new methodologies that might otherwise be too expensive. Rather than write a full-blown application that includes recognition and is robust, the WOZ approach allows a researcher to test particular concepts.

A crucial part of speech recognition applications is error recovery. Errors can occur for two reasons: either the user said something that is not in the grammar, or the system failed to recognize it. In either case, the way the application handles the error is important if you are interested in having the user try your system again.

Prompt tone and wording make a difference in a user’s perception of a system. What would the results be if the system said, "What did you say?" versus "I’m sorry, I didn’t understand you," or even "Pardon?" Although the choice might not affect the functionality of the system, how users feel about their experience affects whether they will be return customers.

Another important area in dialog design is helping users when they are lost. Proper wording of help prompts is crucial to giving users a sense of where they are in an application and how to get where they need to go.

Conclusion
Natural language understanding has made important advances in recent years. Part of its success is due to the increased presence of speech recognition applications. The telephone is a natural mechanism for speech recognition; people are more comfortable speaking on the phone than to the computer on their desktop.

Directed applications provide a more robust framework for natural language than open-ended ones. It’s easier to find out where someone wants to eat dinner than it is to ask "What would you like to find on the Web?" To make a successful directed application, the dialog flow must be smooth. Prompts must guide the user to say information the system needs, and the grammars must encompass what users will most commonly say.

Cost and time are barriers to creating these speech recognition applications. During an application's pilot phases, problems are discovered and addressed. However, making changes at this point is difficult, and time to fix problems is often not included in the schedule. Any work up front that can eliminate basic problems will save time and money, and will keep initial users less frustrated. Usability tests can be conducted within a predetermined time and built into the project's schedule.

The surest way to get your users to never call your system again is by giving them a frustrating experience. If users have a bad experience with one speech recognition system, they will be that much more reluctant to try out a new one, even if it’s ten times better. If we want natural language interfaces to succeed, we need tools to create intuitive, robust applications.

Before writing the application and signing off on a dialog design specification, a WOZ prototyping tool can be used to do basic testing. Real users can respond to prompts and give valid responses that may not have been considered. Real users will uncover awkward areas in the dialog flow, or show confusion when prompts do not give enough information.

Such a tool could also be used for experimentation with new methods, such as error recovery and prompt style. All of these experiments and studies can provide valuable information for the next time an application is designed.

The practical side is that customers need to roll out products, and do not have the time to continuously tweak the application after the pilot stage. Every pilot will uncover new issues and problems, but in reality the product must be delivered, imperfections and all. Using a prototyping tool is an up-front cost that can be built into the development cycle with a set completion date. With tools like these, we can improve the development of natural language applications and create an experience a user will want to repeat.