CHI 2000 Workshop on Natural Language Interfaces

The Hague, The Netherlands, April 3, 2000

 

 

Anoop K. Sinha

Graduate Student (Ph.D. candidate), Advisor: James Landay

Group for User Interface Research

EECS Department

University of California, Berkeley
 Berkeley, CA 94720-1776  USA

+1 510-642-3437
aks@cs.berkeley.edu

 

POSITION SUMMARY

We see the lack of design tool support as a major barrier to the adoption of natural language (speech in particular) in user interfaces.  My research focuses on creating design tools to support prototyping, modeling, and demonstrating multi-modal interfaces.  One of our recent concerns in this area has been finding ways to support designers in prototyping and defining natural speech input for their applications.  We have worked on a couple of different approaches, including work-in-progress on a card-based speech prototyping tool and a sample approach to the speech input grammar problem.  The latter project is described below.

 

 

Towards Automatic Speech Input Grammar Generation for Natural Language Interfaces

 

ABSTRACT

For designers who are adding speech control to an existing graphical user interface, defining “what the user can say,” also known as the speech recognition input grammar, is often an onerous and manual task.  We believe that systematic techniques and new tools can assist this process.  We describe a semi-automatic methodology to define an initial speech input grammar that involves collecting structured transcripts from Wizard of Oz-based user tests. We have successfully used this methodology to create a speech input grammar for an e-mail management task. 

 

INTRODUCTION

To create grammars for speech user interfaces, some design guides recommend modeling the conversational dialog found in interviews or Wizard of Oz (WOz) studies [1].  These guides note that substantial designer creativity is required to fine-tune a speech grammar [1, 2, 3].  Though this tuning cannot be eliminated, the WOz study can be systematized, making the process more accessible to designers who are not speech user interface specialists. We have demonstrated that a carefully structured WOz study can lead to transcripts that can be used to directly generate an initial speech input grammar.

 

MOTIVATION

A well-defined speech input grammar can greatly impact usability.  Incorporating users’ input phrase preferences can make speech control more inviting.  However, an individual designer is extremely unlikely to anticipate all desired input phrases. Our method incorporates a set of users’ preferences directly into the grammar.

A well-defined set of input phrases can also aid the performance of the speech recognizer, yielding far better input accuracy than unrestricted recognition.

 

PROCESS

WOz studies in this methodology are conducted so that the audio transcripts from the sessions are chatter-free and suitable to run through a grammar generator (see Figure 1).

Figure 1.  Multiple participants are each run through a Wizard of Oz study that leads to a highly structured transcript.  The transcripts are post-processed into a speech input grammar.

 

Tasks

Each participant sequentially steps through a set of pre-defined tasks. The tasks are the operations that invite speech control (e.g., open the message, create an e-mail).  The wizard processes each participant’s spoken command phrase from a remote terminal; that preferred phrase is captured in the transcript.

In our demonstration of this grammar creation process, six users were asked to control Microsoft Outlook via speech input.  Tasks included reading, creating, and finding specific e-mail messages.

 

Speak Aloud Problem

Transcripts generated from traditional speak aloud methods are unusable for input grammar generation.  In particular, the participant’s voice input channel needs to be dedicated to the speech control task [1].  Furthermore, any conversation between the facilitator and the participant would show up in the transcript and prevent automated generation of the speech grammar.

 

Wizard of Oz Set-up

We used Microsoft NetMeeting as a successful, low-cost Wizard of Oz (WOz) tool for remote control of the participants’ application (i.e., Microsoft Outlook).  The participants were instructed to assume that the computer had perfect speech recognition.  Though they realized that a real recognizer would be less accurate, all participants responded that they felt this set-up successfully modeled the way that they would control a computer system via voice.  Because the participant and the wizard could communicate through a text-based chat, this set-up also allowed a chatter-free speech transcript to be created, which was then used to create the actual grammar.

 

Generating the Transcript

The transcripts from the various participant sessions were manually created. Individual commands, generally separated by a significant pause in speech, were placed on separate lines. 

Ideally, a dictation recognizer could generate the participants’ transcripts.  We attempted to use IBM’s ViaVoice Continuous Speech recognizer, but did not get adequate results.  Dictation transcription during the user session would help to further automate the input grammar generation, though the wizard or the participant would still need to signal corrections and command boundaries.
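As a rough illustration of the command-boundary step, suppose a recognizer reported a start and end time for each word. That is an assumption: our ViaVoice attempt did not produce usable output, and our transcripts were segmented by hand. Under that assumption, commands could be split wherever the silence between two words exceeds a threshold. The function name, the (word, start, end) tuple layout, and the one-second threshold below are hypothetical.

```python
def split_commands(timed_words, pause_threshold=1.0):
    """Group words into commands, one command per returned string.

    timed_words: list of (word, start_seconds, end_seconds) in spoken order
                 (hypothetical recognizer output).
    pause_threshold: silence, in seconds, treated as a command boundary
                     (an assumed value that would need tuning).
    """
    commands = []
    current = []
    last_end = None
    for word, start, end in timed_words:
        # A long enough gap since the previous word closes the current command.
        if current and start - last_end > pause_threshold:
            commands.append(" ".join(current))
            current = []
        current.append(word)
        last_end = end
    if current:
        commands.append(" ".join(current))
    return commands


if __name__ == "__main__":
    # Made-up timings for illustration only.
    session = [
        ("open", 0.0, 0.3), ("message", 0.4, 0.9),
        ("create", 3.0, 3.4), ("e-mail", 3.5, 4.0),
    ]
    for line in split_commands(session):
        print(line)
```

A wizard or participant would still need to mark corrections, since a pause alone cannot distinguish a finished command from a hesitation.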

 

Generating the Grammar

We wrote a simple text processing routine that took the transcripts from our user sessions and generated a simple context free grammar from them.  Such a routine could utilize a number of different automatic optimizations on the input grammar:

·         Filtering: Eliminating non-command phrases such as “umm” and “hmm.”

·         Parameterization: Replacing specific names, dates, or other entities used in the WOz session with more generic forms.

·         The “the” problem: Articles, such as “the,” can precede almost any noun, and the grammar needs to reflect this.

We performed the above optimizations manually for the grammar we generated.
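The three optimizations could be automated along the following lines. This is a minimal sketch rather than the routine we actually used: the filler, noun, and parameter tables are hypothetical stand-ins, and the output notation (square brackets for an optional article, angle brackets for a parameter slot) is illustrative and not tied to any recognizer's grammar format.

```python
import re

# Hypothetical word lists; a real tool would derive or configure these.
FILLERS = {"um", "umm", "uh", "hmm", "er"}
ARTICLES = {"the", "a", "an"}
NOUNS = {"message", "e-mail", "inbox", "folder"}
PARAMETERS = {"james": "<name>", "monday": "<date>"}


def tokenize(line):
    """Lowercase a transcript line and return its word tokens."""
    return re.findall(r"[a-z0-9][a-z0-9'-]*", line.lower())


def command_to_rule(line):
    """Turn one transcript line into one grammar alternative."""
    # Filtering: drop fillers. Articles are dropped too, then re-added
    # as optional so that both spoken variants map to the same rule.
    tokens = [t for t in tokenize(line)
              if t not in FILLERS and t not in ARTICLES]
    rule = []
    for tok in tokens:
        if tok in NOUNS:
            rule.append("[the]")  # the "the" problem: optional article
        rule.append(PARAMETERS.get(tok, tok))  # parameterization
    return " ".join(rule)


def build_grammar(transcript_lines):
    """Collect the distinct alternatives, in order of first appearance."""
    alternatives = []
    for line in transcript_lines:
        rule = command_to_rule(line)
        if rule and rule not in alternatives:
            alternatives.append(rule)
    return "<command> ::= " + "\n            | ".join(alternatives)


if __name__ == "__main__":
    # Made-up transcript lines, one command per line.
    print(build_grammar([
        "Umm... open the message",
        "open message",
        "hmm",
        "Find e-mail from James",
    ]))
```

Dropping articles and re-inserting an optional one before each known noun lets “open the message” and “open message” collapse into a single alternative, at the cost of needing a noun list.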

 

USER STUDY

To test our grammar, we performed an informal test with four users on an instrumented Microsoft Outlook application.  We enabled speech control through the Microsoft Speech API (SAPI) using IBM’s ViaVoice Voice Command recognition engine.  We chose four users who also participated in the WOz study with the thought that these users would repeat the phrases that they gave during the WOz session. 

Informally, about two out of every three phrases that the users attempted were covered by the speech input grammar.  This suggests that the method we have described yields reasonable coverage of input phrases.  At a minimum, it provided us with a useful starting point from which we could further tune the grammar.
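A coverage figure of this kind could be computed mechanically if the attempted phrases were logged. The sketch below assumes rules written as space-separated tokens in which “[the]” marks an optional article and a token in angle brackets is a slot that matches any single word. That notation, the function names, and the sample phrases are hypothetical and are not data from our test.

```python
import re

ARTICLES = {"the", "a", "an"}


def content_words(phrase):
    """Lowercase a phrase and return its words with articles removed."""
    words = re.findall(r"[a-z0-9][a-z0-9'-]*", phrase.lower())
    return [w for w in words if w not in ARTICLES]


def matches(rule, phrase):
    """True if the phrase fits the rule, ignoring optional articles."""
    rule_tokens = [t for t in rule.split() if t != "[the]"]
    words = content_words(phrase)
    if len(rule_tokens) != len(words):
        return False
    # A slot such as <name> accepts any single word; others must be equal.
    return all(r.startswith("<") or r == w
               for r, w in zip(rule_tokens, words))


def coverage(rules, attempted_phrases):
    """Fraction of attempted phrases matched by at least one rule."""
    if not attempted_phrases:
        return 0.0
    covered = sum(any(matches(r, p) for r in rules)
                  for p in attempted_phrases)
    return covered / len(attempted_phrases)


if __name__ == "__main__":
    # Made-up rules and phrases for illustration only.
    rules = ["open [the] message", "find [the] e-mail from <name>"]
    attempted = [
        "Open the message",
        "find e-mail from Anoop",
        "delete this message",
        "reply to James",
    ]
    print(coverage(rules, attempted))
```

Phrases that no rule matches are the natural candidates to review when tuning the grammar.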

Participants in the test commented that they thought the recognition performance was reasonable.  Three out of four participants gave the overall system a rating of at least 7 out of 10 for potential to assist them in their everyday e-mail management tasks.  The other participant gave a 3, but volunteered that he would give an 8 for potential if it were a system for e-mail management on a handheld device.

In our application, we provided a visual window listing the available phrases. This phrase window was uniformly disliked.  Users wanted the system to understand their initial, preferred phrases.

 

FUTURE WORK

Our experiment supports the feasibility of automatic input grammar generation.  Much of the future work on this topic will center on improving the grammar generation process: better coverage and further automation.  One can imagine a programming by demonstration [4] experiment designed to capture the tasks and the input grammar while the user is operating the application using a GUI.  We will attempt to record a transcript structured similarly to the one we have described above to capture the user’s preferred phrases during demonstration.

CONCLUSION

We have described a method that can assist designers in generating a speech input grammar and provided a path to further automate the process.  Such a process can be used by anyone needing to prototype a speech input grammar.

REFERENCES

1.     Yankelovich, N. (forthcoming) “Using Natural Dialogs as the Basis for Speech Interface Design” in Susann Luperfoy (ed.) Automated Spoken Dialog Systems. MIT Press.

2.     Martin, P.  The “Casual Cashmere Diaper Bag”: Constraining Speech Recognition Using Examples.  Proceedings of the Association for Computational Linguistics, Madrid, July 1997.

3.     Yankelovich, N.  “How Do Users Know What to Say?”  ACM Interactions, Vol. III, Number 6, November-December 1996.

4.     Cypher, A. (ed.)  Watch What I Do - Programming by Demonstration. MIT Press, Cambridge, MA, 1993.