CHI 2000 Workshop on Natural Language Interfaces
The Hague, The Netherlands, April 3, 2000
A Prototyping Tool for Telephone Speech Applications
(or, keeping your costs low and your users happy)
Cathy Pearl
Nuance Communications
Introduction
Speech recognition applications are an
increasingly common example of natural language understanding in the real
world. Many of these applications are carried out over the telephone, and are
being used for such tasks as getting stock quotes, making travel arrangements,
and acting as personal assistants. These types of applications differ from open-ended
problems such as search engines and dictation, and provide a more structured
framework.
The media is full of stories about the future
of speech recognition: surf the web in your car, browse through the latest
newspaper headlines, or get a reminder to send your spouse flowers and then carry
out the transaction at a local flower shop. With the increasing amount of time
people spend on their cell phones in their cars, the possibilities are endless.
Unfortunately, trying to punch in the name of the movie "Breakin’ 2:
Electric Boogaloo" on your touch-tone keypad while driving on the freeway
is perhaps not the safest idea.
Speech recognition will become an
increasingly important user interface as hands-free technology improves. Some
countries in Europe have introduced laws prohibiting cell phone usage while
driving, which will make hands-free models essential. Good natural language
understanding is key to making these applications succeed. Telephone-based
speech recognition systems also provide the user with less feedback than a
visual interface such as web surfing on your computer. Because there is only
audio feedback, the way the dialog is put together is essential to guiding
users and decreasing their level of frustration.
As speech recognition applications become
more commonplace, it is important that the NL community create tools to
standardize the design and creation of these applications, which will speed up
the development time and cut costs. One of the problems with designing for
speech recognition is the difficulty in assessing what the dialog flow of the
application will be by looking at a written description, such as a dialog
design document. When the application goes into its first pilot, errors in
the dialog are discovered which are costly and time-consuming to fix at that
stage.
A dialog design document is one of the
primary methods for designing directed speech recognition applications. It
outlines each state of the application, the behavior of the state, what prompts
are played, how errors are handled, and what to do next, depending on what the
user says. The nature of this directed dialog approach is such that at any
given state there are a limited number of things the user can say. There is
flexibility in how the user can say these things, but only a certain
number of options. A dialog design document outlines the different states of an
application and is very useful as a specification for writing code, but it does
not give the designer an understanding of how the dialog will actually sound.
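As an illustration only, the contents of a dialog design document can be
sketched as a small data structure. The field names, the state names, and the
undefined_targets helper below are assumptions made for this sketch; they are
not taken from any particular specification format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DialogState:
    """One state of a directed dialog, as a design document might describe it."""
    name: str
    prompt: str        # what the caller hears on entering the state
    error_prompt: str  # played when the caller is not understood
    # Maps what the caller asked for to the name of the next state.
    transitions: Dict[str, str] = field(default_factory=dict)


# Hypothetical two-state fragment of a travel application.
SPEC = {
    "MainMenu": DialogState(
        name="MainMenu",
        prompt="Would you like to rent a car or book a flight?",
        error_prompt="I'm sorry, I didn't understand.",
        transitions={"rent_car": "RentCar", "book_flight": "BookFlight"},
    ),
    "RentCar": DialogState(
        name="RentCar",
        prompt="What type of car would you like?",
        error_prompt="I'm sorry, I didn't understand.",
    ),
}


def undefined_targets(spec):
    """Return transition targets that have no state definition yet."""
    return sorted(
        target
        for state in spec.values()
        for target in state.transitions.values()
        if target not in spec
    )
```

A structure like this can be checked mechanically (here, "BookFlight" is
referenced but never defined), yet it still says nothing about how the dialog
sounds to a caller.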
Application development typically includes
one or more pilot testing phases, which allow users to interact with the system
and find problems. Pilots are a time-consuming process which can require
modification of grammars, re-recording of prompts, and recognition tuning. Much
of this work could be avoided if appropriate testing were done in the initial
stages.
Many applications use a different grammar for
each recognition state, which must contain all possible words and phrases that
are allowable. In addition to basic commands, these grammars must take into
account fillers (‘um’, ‘uh’, ‘please’, ‘I want to’, etc.). Having a
well-tuned grammar is essential for good speech recognition. If the grammar is
too sparse, users will not be able to be flexible in their requests. If the
grammar covers too many things users will rarely or never say, recognition
suffers.
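The role of fillers can be sketched with a toy lookup. This is not how a
recognizer compiles or searches a grammar; it assumes the utterance has
already been transcribed to text, and the filler list, the grammar entries,
and the interpret function are hypothetical names chosen for illustration.

```python
import re

# Hypothetical filler words and phrases that callers wrap around a command.
FILLERS = ["i want to", "i'd like to", "please", "um", "uh"]

# Hypothetical grammar for one recognition state: core phrase -> meaning.
MAIN_MENU_GRAMMAR = {
    "rent a car": "rent_car",
    "book a flight": "book_flight",
}


def interpret(utterance, grammar, fillers=FILLERS):
    """Strip fillers from an utterance and look up what remains in the grammar.

    Returns the meaning, or None when the utterance is out of grammar.
    """
    # Lowercase and replace punctuation with spaces.
    text = re.sub(r"[^a-z' ]", " ", utterance.lower())
    # Remove longer fillers first so multi-word fillers are matched whole.
    for filler in sorted(fillers, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(filler) + r"\b", " ", text)
    core = " ".join(text.split())
    return grammar.get(core)
```

With these assumptions, "Uh, I want to rent a car, please" maps to rent_car,
while a bare "yes" falls outside the grammar and returns None.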
The directed dialog method can lend itself to
smooth, conversational systems. The difficulty is in 1) wording prompts to
elicit appropriate responses from users; and 2) predicting what users will say
when using the system.
Although experience with designing and
deploying applications is one way to help with these issues, every application
presents a unique challenge. Usually, the first time the dialog flow is heard
(complete with recorded prompts) is during the project’s pilot phase. During
the pilot phase, valuable data is obtained by logging real users’ utterances
and analyzing them with respect to the current grammars.
It is at this point that major dialog changes
are made, which leads to re-recording prompts and enhancing and tuning
grammars. This can be an expensive and time-consuming process. Many of the
problems discovered could have been caught at an earlier stage, if the designer had
been able to listen to the dialog flow using a prototyping tool.
A Prototyping Tool
To catch problems before the
application has even been coded up, a prototyping tool could be used. A simple
Wizard of Oz (WOZ) approach would allow the designer to quickly construct the
dialog flow and be able to run experiments with real users. The tool is used as
a preliminary test to find awkward and ambiguous points in the dialog, as well
as to gauge what users will say. When creating the actual grammars, the
designer will have real data about what users will be saying.
No recognition needs to take place. Instead,
a prompt is played and the user will respond. Based on what the user says, the
experimenter can determine what the next state should be. The user will hear a
continuous dialog flow, regardless of what is said.
First, the designer would create a flowchart
outlining all of the application states. For each state, the designer would then
record a prompt. Default behaviors for error states are also added, such as a
prompt for "I’m sorry, I didn’t understand" and "I’m sorry, I didn’t
hear you." These are available to the experimenter at any state.
Once the flowchart is complete and the
prompts recorded, the designer can quickly listen to different application
paths. This step will eliminate major flow errors which might have been missed
when viewing the transitions on paper.
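One way such a tool might help the designer listen to every application path
is to enumerate the paths through the flowchart. The sketch below assumes the
flowchart is stored as a mapping from each state to its possible successors;
the state names and the all_paths function are hypothetical.

```python
# Hypothetical flowchart: state name -> names of the states that can follow it.
FLOW = {
    "Welcome": ["RentCar", "BookFlight"],
    "RentCar": ["ConfirmCar"],
    "BookFlight": ["ConfirmFlight"],
    "ConfirmCar": [],
    "ConfirmFlight": [],
}


def all_paths(flow, start):
    """Return every path from start to a state with no unvisited successors.

    States already on the current path are not revisited, so a loop in the
    flowchart (such as a return to the main menu) ends the path instead of
    recursing forever.
    """
    paths = []

    def walk(state, path):
        path = path + [state]
        successors = [s for s in flow.get(state, []) if s not in path]
        if not successors:
            paths.append(path)
            return
        for nxt in successors:
            walk(nxt, path)

    walk(start, [])
    return paths
```

Each returned path is a sequence of prompts the designer could play back in
order, which is the listening pass described above.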
The next step is using the system with real
users. Users should be chosen who will accurately represent the real users of
the system. Each subject will be presented with the first prompt, which could
be something such as "Welcome to TravelFlo. Would you like to rent a car,
book a flight, or enter our sweepstakes?" Depending on what the user says,
the person running the experiment will click on the next appropriate state. In
this example, the user might say "uh, rent a car", and the
experimenter would click on the "Rent Car" state. The prompt for this
state would be played ("What type of car would you like?") and so on.
Alternatively, an error state may be chosen. Users' responses are recorded and
can be analyzed at a later time.
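The session just described can be sketched in a few lines. This is a minimal
illustration, not the design of an actual tool: prompt text stands in for
recorded audio, the experimenter's click is modeled as a method call, and the
WozSession class and its method names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Default error prompts, available to the experimenter at any state.
ERROR_PROMPTS = {
    "no_match": "I'm sorry, I didn't understand.",
    "no_input": "I'm sorry, I didn't hear you.",
}


@dataclass
class WozSession:
    """Tracks one subject's run through the prototype; no recognition is involved."""
    prompts: Dict[str, str]  # state name -> prompt text (stands in for audio)
    state: str               # name of the current state
    # Each entry is (state, what the subject said, experimenter's choice).
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def current_prompt(self) -> str:
        return self.prompts[self.state]

    def choose(self, user_said: str, next_state: str) -> str:
        """Record the response and move to the state the experimenter picked."""
        if next_state not in self.prompts:
            raise KeyError(f"unknown state: {next_state}")
        self.log.append((self.state, user_said, next_state))
        self.state = next_state
        return self.current_prompt()

    def error(self, user_said: str, kind: str) -> str:
        """Record the response, stay in the same state, return an error prompt."""
        self.log.append((self.state, user_said, kind))
        return ERROR_PROMPTS[kind]


session = WozSession(
    prompts={
        "Welcome": "Welcome to TravelFlo. Would you like to rent a car, "
                   "book a flight, or enter our sweepstakes?",
        "RentCar": "What type of car would you like?",
    },
    state="Welcome",
)
# The subject says "uh, rent a car" and the experimenter picks "RentCar".
reply = session.choose("uh, rent a car", "RentCar")
```

The log keeps every response alongside the experimenter's choice, which is
the raw material for building grammars later.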
The fact that no recognition is occurring is
important. At this stage, it's still being determined what will be part of each
grammar. Because the designer does not have to spend time creating and
modifying grammars, or worrying about what is or is not being recognized, the
experiment cycle is much shorter. Usability testing is often regarded by companies
as a "nice if you have the time" phase in the development cycle,
so anything that can make the process shorter and less expensive will help.
Examples
The previous example shows how
something that seems straightforward may not be. After the phrase "Would
you like to rent a car," some users will immediately respond with
"yes", which would result in an out-of-grammar response. I have worked
on a project in which this situation came up. The dialog specification was
reviewed by many people, but the problem was not discovered until the first
pilot when real people were calling the system. Fixing it required modifying the
dialog and re-recording prompts, which was expensive because the recording was
outsourced to a professional voice company.
Another example is the way in which people
say phone numbers. Even if the prompt is "Please say a ten-digit
number," projects have shown that people will begin with a one, or say the
seven digits followed by "area code six five oh."
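To make the two variations concrete, the sketch below normalizes a
transcribed utterance into ten digits. It assumes the words are already
available as text and covers only the two patterns mentioned above; the
normalize_phone function and the word table are hypothetical, and a deployed
system would handle these variations in its grammar instead.

```python
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def normalize_phone(utterance):
    """Turn a transcribed phone-number utterance into ten digits, or None.

    Handles two variations: a leading "one", and seven digits followed by
    "area code" and three digits.
    """
    words = utterance.lower().replace(",", " ").split()
    if "area" in words:
        split = words.index("area")
        if words[split + 1:split + 2] != ["code"]:
            return None
        local, area = words[:split], words[split + 2:]
    else:
        local, area = words, []
    try:
        local_digits = "".join(DIGITS[w] for w in local)
        area_digits = "".join(DIGITS[w] for w in area)
    except KeyError:
        return None  # the utterance contains a word that is not a digit
    number = area_digits + local_digits
    if len(number) == 11 and number.startswith("1"):
        number = number[1:]  # drop the leading "one"
    return number if len(number) == 10 else None
```

Under these assumptions, "five five five one two one two area code six five
oh" and "one six five oh five five five one two one two" both normalize to
the same ten digits.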
Oftentimes, people will use different
wordings than are initially considered by a designer. Designers are frequently
engineers, and are very comfortable with words like ‘cancel’ and ‘main menu’. I
have seen real users say ‘back up’ and ‘go to the top’ instead.
Overall flow is also an issue. Will users
become confused when they are dumped back at the main menu after placing a call
to someone who left a voice mail? Will it be too difficult for them to navigate
through the menus to change their personal greeting?
More abstract things can be tested as well.
Prompts can have very different styles; a stock quote application might be best
served with a formal voice and wording, whereas a pizza delivery application
might sound better with a casual style and plenty of slang. With a WOZ tool,
the designer can easily record different sets of prompts and see how users
respond.
These things will be discovered during the
first pilot, but at this point it becomes a much more difficult process to
change the application and have prompts re-recorded professionally. Doing this
work up front allows the designer to make changes quickly and without affecting
the people who are integrating the application.
Experimentation
In addition to initial testing of an
application that is planned to be built, the WOZ environment allows designers
to experiment with new methodologies that might otherwise be too expensive.
Rather than write a full-blown application that includes recognition and is
robust, the WOZ approach allows a researcher to test particular concepts.
A crucial part of speech recognition
applications is error recovery. Errors can occur for two reasons: either the
user said something that is not in the grammar, or the system failed to
recognize it. In either case, the way the application handles the error is
important if you are interested in having the user try your system again.
Prompt tone and wording make a difference
in a user’s perception of a system. What would the results be if the system
said, "What did you say?" versus "I’m sorry, I didn’t understand
you," or even "Pardon?" Although it might not affect the
functionality of the system, how the user feels about his or her experience is
important to whether they will be a return customer.
Another important area in dialog design is
helping users when they are lost. Proper wording of help prompts is crucial to
giving users a sense of where they are in an application, and how to get to
where they need to go.
Conclusion
Natural language understanding has
made important advances in recent years. Part of its success
is due to the increased presence of speech recognition applications. The
telephone is a natural mechanism for speech recognition; people are more
comfortable speaking on the phone than to the computer on their desktop.
Directed applications provide a more robust
framework for natural language than open-ended ones. It’s easier to find out
where someone wants to eat dinner than it is to ask "What would you like
to find on the Web?" To make a successful directed application, the dialog
flow must be smooth. Prompts must guide the user to say information the system
needs, and the grammars must encompass what users will most commonly say.
A barrier to creating these speech
recognition applications is cost and time. During an application's pilot
phases, problems are discovered and addressed. However, making changes at this
point is difficult, and time to fix problems is often not included in the
schedule. Any work up front that can eliminate basic problems will save time
and money, and will keep initial users less frustrated. Usability tests
can be conducted within a predetermined time, and be built into the project's
schedule.
The surest way to get your users to never
call your system again is by giving them a frustrating experience. If users
have a bad experience with one speech recognition system, they will be that
much more reluctant to try out a new one, even if it’s ten times better. If we
want natural language interfaces to succeed, we need tools to create intuitive,
robust applications.
Before writing the application and signing
off on a dialog design specification, a WOZ prototyping tool can be used to do basic
testing. Real users can respond to prompts and give valid responses that may
not have been considered. Real users will uncover awkward areas in the dialog
flow, or show confusion when prompts do not give enough information.
Such a tool could also be used for
experimentation with new methods, such as error recovery and prompt style. All
of these experiments and studies can provide valuable information for the next
time an application is designed.
The practical side is that customers need to
roll out products, and do not have the time to continuously tweak the
application after the pilot stage. Every pilot will uncover new issues and
problems, but in reality the product must be delivered, imperfections and all.
Using a prototyping tool is an up-front cost that can be built into the
development cycle and given a set completion date. With tools like these, we
can improve the development of natural language applications and create an
experience a user will want to repeat.