CHI 2000 Workshop on Natural Language Interfaces
The Hague, The Netherlands, April 3, 2000
Anoop K. Sinha
Graduate Student (Ph.D. candidate), Advisor: James Landay
Group for User Interface Research
EECS Department
University of California, Berkeley
Berkeley, CA 94720-1776 USA
+1 510-642-3437
aks@cs.berkeley.edu
We see the lack of design tool support as a major
barrier to natural language adoption (in particular for speech) in user
interfaces. My research is focused
on creating design tools to support prototyping, modeling, and demonstrating
multi-modal interfaces. One of our
recent concerns in this area has been working on ways to support designers in
prototyping and defining natural speech input for their applications. We have worked on a couple of different
approaches, including work-in-progress on a card-based speech prototyping tool
and a sample approach to the speech input grammar problem. The latter project is described below.
Towards Automatic Speech Input Grammar Generation for Natural Language Interfaces
For designers who are adding speech
control to an existing graphical user interface, defining “what the user can
say,” also known as the speech recognition input grammar, is often an onerous
and manual task. We believe that
systematic techniques and new tools can assist this process. We describe a semi-automatic
methodology to define an initial speech input grammar that involves collecting
structured transcripts from Wizard of Oz-based user tests. We have successfully
used this methodology to create a speech input grammar for an e-mail management
task.
To create grammars for speech user
interfaces, some design guides recommend modeling the conversational dialog
found in interviews or Wizard of Oz (WOz) studies [1]. These guides note that substantial
designer creativity is required to fine-tune a speech grammar [1, 2, 3]. Though this tuning cannot be eliminated,
the WOz study can be systematized, making the process more accessible to
designers who are not speech user interface specialists. We have demonstrated
that a carefully structured WOz study can lead to transcripts that can be used
to directly generate an initial speech input grammar.
A well-defined speech input grammar can
greatly impact usability.
Incorporating users’ input phrase preferences can make speech control
more inviting. However, an individual
designer is extremely unlikely to anticipate all desired input phrases. Our
method incorporates a set of users’ preferences directly into the grammar.
A well-defined set of input phrases can
also aid the performance of the speech recognizer, yielding far
better input accuracy than unrestricted recognition.
WOz studies in this methodology are
run so that the audio transcripts from the sessions are
chatter-free and suitable to run through a grammar generator (see Figure 1).

Each participant sequentially steps
through a set of pre-defined tasks. The tasks are the operations that invite
speech control (e.g., open the message, create an e-mail). The wizard processes each participant’s
spoken command phrase from a remote terminal; that preferred phrase is captured
in the transcript.
In our demonstration of this grammar
creation process, six users were asked to control Microsoft Outlook via speech
input. Tasks included reading,
creating, and finding specific e-mail messages.
Transcripts generated from traditional
speak-aloud methods are unusable for input grammar generation. In particular, the participant’s voice
input channel needs to be dedicated to the speech control task [1]. Furthermore, any conversation
between the facilitator and the participant would show up in the transcript and
prevent automated generation of the speech grammar.
We used Microsoft NetMeeting as a
successful, low-cost Wizard of Oz (WOz) tool for remote control of the
participants’ application (i.e., Microsoft Outlook). The participants were instructed to assume that the computer
had perfect speech recognition.
Though they realized that a real recognizer would be less accurate, all
participants responded that they felt this set-up successfully modeled the way
that they would control a computer system via voice. Since the participant and the wizard could use a text-based
chat to communicate, this set-up also allowed a chatter-free speech transcript
to be created, which was then used to create the actual grammar.
The transcripts from the various
participant sessions were manually created. Individual commands, generally
separated by a significant pause in speech, were placed on separate lines.
Ideally, a dictation recognizer could
generate the participants’ transcripts.
We attempted to use IBM’s ViaVoice Continuous Speech recognizer, but did
not get adequate results.
Dictation transcription during the user session would help to further
automate the input grammar generation, though the wizard or the participant
would still need to signal corrections and command boundaries.
We wrote a simple text processing routine that
took the transcripts from our user sessions and generated a simple context free
grammar from them. Such a routine
could utilize a number of different automatic optimizations on the input
grammar:
· Filtering: Eliminating non-command utterances such as “umm…” and “hmm…”.
· Parameterization: Specific names, dates, or other entities used in the WOz session can be parameterized to more generic forms.
· The “the” problem: Articles, such as “the,” can precede almost any noun, and the grammar needs to reflect this.
We performed the above
optimizations manually for the grammar we generated.
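As a minimal sketch of how such a routine might automate the three optimizations, assuming a transcript with one command per line: the word lists, the slot names (<name>, <date>), the [<article>] notation, and the BNF-style output are all hypothetical choices for illustration, not the format or data used in our study. The sketch handles only part of the “the” problem: an article that was spoken becomes optional, but adding optional articles before nouns where none was spoken would require part-of-speech information.

```python
import re

# All word lists below are hypothetical placeholders, not data from the study.
FILLERS = {"um", "umm", "umms", "uh", "hmm"}
ARTICLES = {"the", "a", "an"}
SLOTS = {
    "<name>": {"james", "anoop"},
    "<date>": {"today", "yesterday", "monday"},
}

def normalize(line):
    """Turn one transcript line into a grammar pattern (a string of tokens)."""
    pattern = []
    for token in re.findall(r"[a-z0-9']+", line.lower()):
        if token in FILLERS:
            continue  # Filtering: drop non-command sounds
        if token in ARTICLES:
            # The "the" problem (partial): a spoken article becomes optional
            if not pattern or pattern[-1] != "[<article>]":
                pattern.append("[<article>]")
            continue
        for slot, words in SLOTS.items():
            if token in words:
                token = slot  # Parameterization: specific entity -> generic slot
                break
        pattern.append(token)
    return " ".join(pattern)

def build_grammar(transcript_lines):
    """Collect the distinct non-empty patterns, one alternative per pattern."""
    patterns = {normalize(line) for line in transcript_lines}
    patterns.discard("")
    return sorted(patterns)

def format_rule(patterns, name="<command>"):
    """Render the alternatives as a single BNF-style rule."""
    return name + " ::= " + " | ".join(patterns)

transcript = ["Umm open the message", "open message", "find mail from James", "hmm"]
print(format_rule(build_grammar(transcript)))
```

Keeping each distinct pattern as its own alternative preserves every phrasing a participant actually used, which leaves the later manual tuning free to merge or prune alternatives.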
To test our grammar, we performed
an informal test with four users on an instrumented Microsoft Outlook
application. We enabled speech control
through the Microsoft Speech API (SAPI) using IBM’s ViaVoice Voice Command
recognition engine. We chose four
users who had also participated in the WOz study, expecting that they
would repeat the phrases that they gave during the WOz session.
Informally, about two out of
every three phrases that the users attempted were covered by the speech input
grammar. This means that we had
decent coverage of input phrases using the method that we have described. Certainly, it provided us with an excellent
starting point from which we could further tune the grammar.
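Coverage here is the fraction of attempted phrases that the grammar accepts. A minimal sketch of that calculation follows, assuming the grammar is available as a flat list of accepted phrases and that matching ignores case and extra whitespace; the function name and the sample phrases are hypothetical, not data from our test.

```python
def coverage(grammar_phrases, attempted_phrases):
    """Fraction of attempted phrases found in the grammar (case-insensitive)."""
    def canon(phrase):
        # Lowercase and collapse runs of whitespace
        return " ".join(phrase.lower().split())

    known = {canon(p) for p in grammar_phrases}
    attempts = [canon(p) for p in attempted_phrases]
    if not attempts:
        return 0.0  # No attempts: report zero rather than divide by zero
    covered = sum(1 for p in attempts if p in known)
    return covered / len(attempts)

# Illustrative phrases only: two of the three attempts are in the grammar.
grammar = ["open message", "read mail", "find mail"]
attempts = ["Open message", "read  mail", "delete it"]
print(round(coverage(grammar, attempts), 2))
```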
Participants in the test
commented that they thought the recognition performance was reasonable. Three out of four participants gave the
overall system a rating of at least 7 out of 10 for potential to assist them in
their everyday e-mail management tasks.
The other participant gave a 3, but volunteered that he would give an 8
for potential if it were a system for e-mail management on a handheld device.
In our application, we provided a
visual window with the available phrases. This phrase window was uniformly
disliked. Users wanted the system to
understand their initial and preferred phrases.
Our experiment lends support
to the feasibility of automatic input grammar generation. Much of the future work on this topic will center on
improving the grammar generation process: better coverage and further
automation. One can imagine a
programming by demonstration [4] experiment designed to capture the tasks and
the input grammar while the user is operating the application using a GUI. We will attempt to record a transcript
similarly structured to the one we have described above to capture the user’s
preferred phrases during demonstration.
We have described a method that can assist
designers in generating a speech input grammar and provided a path to further
automate the process. Such a
process can be used by anyone needing to prototype a speech input grammar.
1. Yankelovich, N. (forthcoming) “Using Natural Dialogs as the Basis for Speech Interface Design” in Susann Luperfoy (ed.) Automated Spoken Dialog Systems. MIT Press.
2. Martin, P. “The ‘Casual Cashmere Diaper Bag’: Constraining Speech Recognition Using Examples.” Proceedings of the Association for Computational Linguistics, Madrid, July 1997.
3. Yankelovich, N. “How Do Users Know What to Say?” ACM Interactions, Vol. III, Number 6, November-December 1996.
4. Cypher, A. (ed.) Watch What I Do: Programming by Demonstration. MIT Press, Cambridge, MA, 1993.