Nigel Ward: Current and Recent Projects
Using Prosodic Information to Improve Search in Audio
If only search in audio archives were as simple as search in text. While human speech is intrinsically challenging to process, owing to the variety of deliveries and pronunciations, it does bring an additional point of leverage: prosodic information. Thus we wish to go beyond words and also use the information in the way people say things. We do this in two ways.
First, we have applied Principal Component Analysis to time-spread prosodic features as a way to reduce the dimensionality. We thus map each moment in a dialog to a point in a vector space. We have found that point pairs that are close in this vector space are frequently similar, in terms of the dialog activities (planning, complaining, explaining, and so on), in terms of the stance (new, urgent, factual, etc., or just chat) and in terms of topic. Using proximity in this space as an indicator of similarity, we are building support for query-by-example search and also, in combination with lexical features, for traditional searchbox queries.
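The mapping and proximity search described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the feature matrix is random stand-in data, and the function names, the number of retained components, and the use of Euclidean distance are all assumptions made for the example.

```python
import numpy as np


def fit_pca(features, n_components):
    """Return (mean, components) for a matrix with one row per dialog moment."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(features, mean, components):
    """Map each moment to a point in the reduced vector space."""
    return (features - mean) @ components.T


def query_by_example(points, query_index, k):
    """Indices of the k moments closest to the query moment (Euclidean distance)."""
    dists = np.linalg.norm(points - points[query_index], axis=1)
    order = np.argsort(dists)
    return [int(i) for i in order if i != query_index][:k]


# Hypothetical stand-in data: 200 moments, 12 time-spread prosodic features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 12))

mean, components = fit_pca(features, n_components=4)
points = project(features, mean, components)
neighbors = query_by_example(points, query_index=0, k=5)
```

Here `neighbors` holds the moments that would be returned for a query-by-example search seeded with moment 0. Combining this with lexical features for searchbox queries is outside the scope of the sketch.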
Second, we have used supervised techniques to explore the role of prosody in conveying stance, that is, attitudes and related pragmatic functions. Working with 14 aspects of stance as they occur in radio news stories in English and Mandarin, and using a model based on time-spread prosodic features and the aggregation of local estimates, we found that many aspects of stance were at least somewhat predictable. Results were significantly better than chance for most stance aspects, including, for English: good, typical, local, urgent, new information, and relevant to a large group.
Responsive Prosodic Behaviors for Interactive Systems
Spoken language is an attractive way for people to interact with autonomous intelligent systems. When systems talk to people, speech can convey not only lexical information but also meta information, such as whether the information requires immediate attention or is just background, how well the system understands the user's goals and situation, and whether the system needs to provide more information or is done for the moment.
In human-human dialog such meta-information is mostly conveyed by prosody: subtle variations in the pitch, energy, rate and timing within utterances. Unfortunately, giving agents this sort of expressiveness today requires either pre-recorded prompts or hand-crafted synthesized utterances, and neither is flexible enough for systems operating in contexts where the possible configurations of communication needs are not known ahead of time.
This project is building models of prosody to support the creation of prosodically appropriate utterances in dynamic domains. Automatic methods will be developed to infer prosodic behaviors from dialog datasets, enabling rapid development of models for new tasks, domains, and user populations. The methods and models will be evaluated on their ability to accurately model observed human behavior and on their ability to make a system a more effective collaborator for humans. This work will also inform the design of better speech synthesizers.
Preliminary work was supported by the National Science Foundation as IIS-1449093: Eager: Preliminaries to the Development of Responsive Prosodic Behaviors for Interactive Systems, 2014-2016. In collaboration with Saiful Abu.
Methods for Identifying Non-Native Differences in Dialog Prosody
Language learners often have difficulty with prosody, especially with the prosodic forms used in dialog activities, but there are no diagnostic tools for dialog prosody. We are developing methods that work directly on unannotated non-native dialog data to automatically produce a listing of the prosodic constructions on which the non-natives are weak. We first create models of both native and non-native prosodic behavior in terms of pragmatic constructions, derived using Principal Component Analysis. The constructions involving weakness are then automatically identified as those native constructions for which there is no close non-native counterpart, as measured by the cosine distance over the loadings of the component features. So far this method has been applied to 90 minutes of dialog behavior by six advanced native-Spanish learners of English, successfully discovering both minor differences and major deficits.
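The matching step can be illustrated with a small sketch. It assumes each construction is represented by its vector of feature loadings. The function names, the threshold value, and the toy vectors are hypothetical. The sketch works with cosine similarity (one minus cosine distance) and takes its absolute value, on the assumption that the sign of a principal component is arbitrary; the project's method may treat signs differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two loading vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def unmatched_constructions(native, non_native, threshold):
    """Indices of native constructions with no close non-native counterpart."""
    weak = []
    for i, native_vec in enumerate(native):
        # Assumption: component sign is arbitrary, so compare absolute similarity.
        best = max(abs(cosine_similarity(native_vec, other)) for other in non_native)
        if best < threshold:
            weak.append(i)
    return weak


# Toy loadings over three features (illustrative values, not real data).
native = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
non_native = [[0.9, 0.1, 0.0], [0.0, -1.0, 0.1]]

weak = unmatched_constructions(native, non_native, threshold=0.8)
print(weak)  # [2]
```

In this toy case the third native construction has no non-native vector pointing in a similar direction, so it is the one flagged as a candidate weakness.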
Automating the Discovery of Dialog Patterns
Building better dialog systems requires a better understanding of the low-level details of human communication. However, the dynamics of interaction at the extreme time-scales characteristic of swift dialog are not accessible to casual observation. Progress here depends on tools for systematically analyzing these patterns of behavior. In recent years excellent freeware tools for audio data transcription, phonetic analysis, and speech manipulation have appeared; however, none work well for dialog. We need tools that directly support search, comparison, hypothesis formulation, and hypothesis evaluation for dialog phenomena; this is essential to advancing scientific understanding and to engineering highly responsive systems.
We have built a toolkit for this kind of analysis including methods for semi-automatically identifying important dialog cues and patterns from conversation data in any language. We are currently extending this toolkit and applying it to new languages. Ultimately we hope to discover universal properties of prosody, true for all human languages.
With Paola Gallardo, Luis Ramirez, Alexandro Vega, Joshua McCartney, Huanchao Li, Tianyu Zhau, Tatsuya Kawahara, and Stefan Benus.
Improving Video Chat using Gaze Prediction
Video chat requires significant bandwidth. While previous work has advanced compression using models of human perception, we propose to use a model of human behavior: specifically, to avoid wastefully sending full-quality video frames during times of gaze aversion.
Previous work has shown that people tend to look away from their interlocutor around the time they are formulating utterances, taking the floor, and starting to speak. Going beyond generalities and statistics, we are building predictive models that use prosodic and other information to predict whether or not a person in dialog will be looking at the interlocutor 200 to 800 milliseconds in the future.
The project will determine the achievable prediction accuracy and thus the possible reduction in transmission when using various kinds of sensor data. It will also examine the human factors of such systems, including the subjective and interpersonal perceptions that arise when the video feed freezes due to incorrect predictions.
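The relation between prediction accuracy and transmission savings can be illustrated with a back-of-the-envelope model. Everything here is an assumption made for the sketch: the function and parameter names, the idea that low-quality frames cost a fixed fraction of full-quality ones, and the input values, which are illustrative and not measurements from the project.

```python
def low_quality_time(p_averted, recall, false_alarm_rate):
    """Fraction of time the sender transmits reduced-quality video."""
    correct = p_averted * recall                   # aversion correctly predicted
    mistaken = (1 - p_averted) * false_alarm_rate  # viewer is actually looking
    return correct + mistaken


def expected_bandwidth_fraction(p_averted, recall, false_alarm_rate, low_quality_ratio):
    """Expected bandwidth relative to always sending full quality."""
    low = low_quality_time(p_averted, recall, false_alarm_rate)
    return (1 - low) + low * low_quality_ratio


def visible_degradation_fraction(p_averted, false_alarm_rate):
    """Fraction of time a viewer who is looking sees degraded video."""
    return (1 - p_averted) * false_alarm_rate


# Illustrative inputs only: gaze averted 30% of the time, 80% of that
# predicted correctly, 5% false alarms, low-quality frames at 20% of the cost.
bandwidth = expected_bandwidth_fraction(0.3, 0.8, 0.05, 0.2)
degraded = visible_degradation_fraction(0.3, 0.05)
```

Under these made-up inputs the model gives about 78% of the original bandwidth, with degraded video visible about 3.5% of the time. The second figure is the quantity that the human-factors questions above would need to weigh against the savings.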
The project will further examine how these patterns of attention and gaze in dialog vary from person to person and among groups. Eventually we will empirically derive a model of the dimensions of variation in gaze-aversion behavior, develop an algorithm for rapid adaptation to an unknown speaker, and test this algorithm's utility for improving gaze-aversion prediction.
In collaboration with Chelsey Jurado. Funding from the UT Transform Program.