Basics, bottlenecks and bossiness: being economical with complicated truths in dialogue
Ellen Gurman Bard
Social Speech Recognition
Humans are social beings and interact with and influence each other in subtle and complex ways, one of the most important of these being spoken discourse. While conversations between individuals may start out clumsy, and therefore cognitively taxing, over time they often become easier as an alignment process takes hold. This process allows participants to meet their communicative goals more efficiently. Most large vocabulary speech recognition systems operate on speech corpora that originated from conversations between individuals. In the implementation of such systems, however, each side of a conversation is excised and processed independently of its original context. This eliminates any benefit that might be available from the cooperative nature of discourse. In this work, a new approach to speech recognition is suggested in which the multiple sides of a conversation in a dialog or meeting are processed and decoded jointly rather than independently. We introduce a practical initial implementation of this approach that demonstrates improvements in both language model perplexity and speech recognition word error rate on conversational telephone speech.
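The claimed benefit of joint processing can be illustrated with a toy language-model experiment. In the sketch below (the mini-corpus, smoothing scheme, and vocabulary are all invented for illustration and are not the paper's actual system), a bigram model trained on a time-interleaved stream of both conversation sides assigns lower perplexity to the conversation than a model trained on each side excised from context:

```python
import math
from collections import defaultdict

# Hypothetical mini-corpus: each turn is (start_time, side, words).
turns = [
    (0.0, "A", "hello how are you".split()),
    (1.5, "B", "fine thanks how are you".split()),
    (3.0, "A", "fine thanks".split()),
]

def train_bigram(histories):
    """Count bigrams over one or more word streams."""
    counts = defaultdict(lambda: defaultdict(int))
    for hist in histories:
        for prev, word in zip(hist, hist[1:]):
            counts[prev][word] += 1
    return counts

def perplexity(counts, hist, vocab_size):
    """Bigram perplexity with add-one smoothing over a fixed vocabulary."""
    logp, n = 0.0, 0
    for prev, word in zip(hist, hist[1:]):
        c = counts[prev]
        logp += math.log((c[word] + 1) / (sum(c.values()) + vocab_size))
        n += 1
    return math.exp(-logp / n)

# Independent processing: each side of the conversation is a separate stream.
side = {s: [w for _, sd, ws in turns if sd == s for w in ws] for s in "AB"}
# Joint processing: interleave the turns by time, so that cross-speaker
# context (e.g. an answer echoing the question's words) survives.
joint = [w for _, _, ws in sorted(turns) for w in ws]

vocab = len(set(joint))
ppl_indep = perplexity(train_bigram(side.values()), joint, vocab)
ppl_joint = perplexity(train_bigram([joint]), joint, vocab)
```

On this toy data the jointly trained model captures the cross-speaker transition at the turn boundary and scores the conversation with lower perplexity than the side-by-side model.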
Multiparty Turn-Taking: Models, Implementation, and Studies
Dan Bohus and Eric Horvitz
We outline several challenges and opportunities for endowing dialog systems with competencies for interacting with multiple people in open, dynamic, and relatively unconstrained environments. We present a representation and methodology for modeling the multiparty turn-taking process and show how we use the models in a working dialog system. The approach harnesses components for tracking the conversational dynamics in multiparty interactions, for making floor control decisions, and for rendering these decisions into appropriate behaviors. We describe a set of experiments that demonstrate how the proposed approach enables an embodied conversational agent to participate in multiparty interactions, to handle a diversity of natural turn-taking phenomena (e.g. multiparty floor management, barge-ins, restarts, and continuations), and to shape the multiparty conversational dynamics. Finally, we discuss results and lessons learned, as well as current and future planned work.
Predictive models for embodied dialogue amongst dialect speakers
When do people speak, and why?
Herbert H. Clark
Here's roughly what I will argue (this is NOT an abstract!): Traditional accounts of timing in dialogue assume that it is the dialogue that determines the timing of what people say. The Sacks-Schegloff rules of turn taking are an excellent example. But this is backwards. People time what they say as needed by the larger joint activity they are engaged in--playing a game, assembling a piece of furniture, carrying out a business transaction, etc. In each joint activity, people time their utterances to deal with the next step of the joint activity. The dialogue is secondary to the current joint activity, not the other way around. I will present examples from real joint activities to illustrate principles that account for these claims. (Well, as much of this as 20 minutes allows.)
Bimodal communication of depression severity by face and voice
Jeffrey F Cohn
Current methods of assessing psychopathology depend almost entirely on verbal report (clinical interview or questionnaire) of patients, their family, or caregivers. They lack systematic and efficient ways of incorporating behavioral observations that may be powerful indices of psychological disorder often outside of conscious awareness. In two studies, we investigated whether depression severity is perceptually salient to observers, can be quantified automatically via automated facial image and acoustic analysis, and the extent to which attenuated bimodal expression influences the dynamics of face-to-face interaction.
The architecture of anticipation: modelling projection in turn-taking
Jan de Ruiter
While listening, dialogue participants have to perform three tasks largely in parallel: a) predict when the current turn will end, and also b) predict how it will end, because they c) need to prepare the content and timing of their response. The main question I am working on is: how can one appropriately *model* these phenomena? I will briefly describe a mathematical (stochastic) model of the temporal properties of such a turn-taking system. But in order to actually *implement* anticipatory turn-taking in a real cognitive system (i.e., for simulations, or for use in artificial agents) there are two principled approaches, which I call I) the Gamble And Pray (GAP) architecture, and II) the Parallel Universe Tracking (PUT) architecture. I will briefly discuss these two options, and present some processing-oriented and empirical arguments that may help in choosing the right architecture.
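The trade-off between the two architectures can be caricatured in a small simulation (all numbers here are invented for illustration, not taken from the talk): GAP commits to the single most probable projected end point and pays a re-preparation cost whenever it guesses wrong, while PUT keeps a response in preparation for every candidate end point and is therefore always ready, at a higher processing cost:

```python
import random

random.seed(0)

# Hypothetical setup: a turn ends at one of three projected completion
# points; the listener needs PREP_TIME seconds to have a response ready.
CANDIDATE_ENDS = [1.0, 1.6, 2.2]   # seconds (assumed, for illustration)
PROBS = [0.5, 0.3, 0.2]            # listener's predictive distribution
PREP_TIME = 0.4

def gamble_and_pray(true_end):
    """Commit to the single most probable end point; if wrong, re-prepare."""
    guess = CANDIDATE_ENDS[PROBS.index(max(PROBS))]
    if guess == true_end:
        return 0.0        # response ready exactly on time
    return PREP_TIME      # caught out: gap while re-preparing the response

def parallel_universe_tracking(true_end):
    """Track all candidate end points in parallel; always ready on time."""
    return 0.0            # no gap, at a higher (unmodelled) processing cost

def mean_gap(strategy, n=10000):
    """Average inter-turn gap produced by a strategy over simulated turns."""
    total = 0.0
    for _ in range(n):
        true_end = random.choices(CANDIDATE_ENDS, weights=PROBS)[0]
        total += strategy(true_end)
    return total / n

gap_gap = mean_gap(gamble_and_pray)
put_gap = mean_gap(parallel_universe_tracking)
```

Under these assumed numbers GAP produces a gap on roughly half of the turns, while PUT never does; the empirical question the talk raises is whether human processing pays PUT's parallel-tracking cost or GAP's occasional delays.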
Social verticality, behavioral cues, and group interaction
This talk will discuss work on automatic estimation of aspects of the so-called vertical dimension of social relationships, including dominance and status, in the context of small-group meetings in multisensor spaces. These phenomena help structure the behavior of individuals and groups, and have effects on the creation and maintenance of relations, on job satisfaction, and on how work gets done (or not) in organizations. Our methods characterize people by a number of nonverbal communicative cues. The talk will discuss how we have built on and extended classical work in social psychology to quantify speaking activity, body activity, and attention from audio and video signals, and how we have handled both multiple modalities and conversational context.
But is your predictive model any good? Exploring the impact of `affective signals' in human/virtual-human interaction
Jon Gratch and Stacy Marsella
Turn-initial position in responses to polar questions
Several considerations suggest that turn-initial position is a strategic site for the shaping of recipient expectations and understandings of what is to come. Most importantly, turn-initial position is structurally located at the intersection of 'prior turn' and 'next turn' (Schegloff 1996), while the timing of turn-transition (de Ruiter et al 2006; Stivers et al 2009) also suggests a positive value for the 'front-loading' of significant elements of turns (Levinson, forthcoming). This paper reviews a number of basic items that recurrently appear in turn-initial position and serve as significant signposts for recipient understandings and predictions of what is to follow.
Conversation analysis based on reactive tokens in poster sessions
I will introduce our project on multi-modal analysis of poster sessions, which will hopefully provide a scheme of interaction-based speech indexing and a model of intelligent conversational agents. We have recorded a number of poster sessions with multi-modal sensors, including motion-capturing and eye-tracking systems, and annotated them with multi-modal information such as gazing, nodding and pointing. We currently focus on backchannels, or non-lexical reactive tokens, made by the audience during the session. We have found that (1) particular syllabic and prosodic patterns of reactive tokens are related to the interest level of the audience, (2) their timing is coordinated between the presenter and the audience, and (3) "hot spots" associated with the reactive tokens are useful cues for indexing the sessions.
Conversation robot recognizing and expressing paralinguistic information
The computational study of back-channel behavior (tentative)
Advances in machine analysis of facial behaviour: dynamic and spontaneous facial expressions
We have developed the technology of reality mining, which analyzes wearable sensor data to extract subtle patterns that predict future human behavior. These predictive patterns begin with biological "honest signals," human behaviors that evolved from ancient primate signaling mechanisms, and which are major factors in human decision making in situations ranging from job interviews to first dates. By using data from mobile phones, electronic ID badges, or digital media to track these honest signals, we can create a `god's-eye' view of how the people in organizations interact, and even `see' the rhythms of interaction for everyone in a city.
Optimizing the responsiveness of spoken dialog systems
The challenges in giving up and taking turns
The turn taking literature provides a lot of material about how turn taking proceeds. But it doesn't tell us a great deal about what goes wrong in turn taking, other than the case of overlapped speech. There are other types of problems, namely false ends of turns, and late looking. Furthermore, the existing literature doesn't give us a detailed picture of what the participants are doing when these things happen. In this talk, I will discuss recent work based on annotations of 14 pairs of conversants as they take turns in a set of question-and-answer conversations. I will show short video clips of their problems in turn taking and discuss the features of the interactions at and near the turn taking moment. These examples are instructive for learning to predict turn taking and for understanding what the limits on that prediction processing might be (both for people and computers).
Speaker Differences: Insights from Higher-Level Features in Automatic Speaker Recognition
This talk will propose that recent findings on "higher-level features" (features beyond cepstral differences associated with the voice) in speaker recognition technology may have implications for human interaction models. The studies, which have used thousands of hours, thousands of speakers, and millions of trials of conversational speech, have found surprising benefit from adding a range of higher-level features to standard voice-related features. For human communication researchers, the speaker recognition paradigm offers a useful methodology for discovering speaker differences that are discriminative. After a brief overview on how to use speaker recognition in this way, I'll describe some systems that model higher-level features, and then seek your feedback on how such approaches might be used in future human communication research.
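Score-level fusion is one standard way of adding higher-level features to a cepstral baseline system. The sketch below, with invented per-trial scores and assumed fusion weights (none of it from the talk's actual systems), shows a weighted-sum fusion of two subsystem scores improving trial accuracy over the cepstral scores alone:

```python
# Hypothetical per-trial scores from two subsystems (higher = same speaker).
trials = [
    # (cepstral_score, higher_level_score, same_speaker)
    (2.1, 1.8, True),
    (1.9, 0.2, True),
    (0.8, 1.5, True),
    (0.4, 0.3, False),
    (1.8, -0.9, False),
    (-0.2, 0.1, False),
]

def accuracy(scores, labels, threshold):
    """Fraction of trials decided correctly at a fixed decision threshold."""
    return sum((s > threshold) == y for s, y in zip(scores, labels)) / len(labels)

labels = [y for _, _, y in trials]
cepstral = [c for c, _, _ in trials]
# Weighted-sum fusion; the 0.7/0.3 weights are assumed for illustration
# (in practice they would be tuned on held-out trials).
fused = [0.7 * c + 0.3 * h for c, h, _ in trials]

acc_cepstral = accuracy(cepstral, labels, threshold=1.0)
acc_fused = accuracy(fused, labels, threshold=1.0)
```

On these invented trials the higher-level scores rescue the two trials the cepstral system gets wrong, which is the qualitative pattern the talk reports from much larger experiments.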
Context, Expectations and Predictions in Mixed-Initiative, Multiparty Multi-strategy, Multimodal Dialogue
In this talk I will give an overview of the dialogue model of the virtual humans from the ICT MRE and SASO projects. The focus will be on the models of conversations, turn-taking, initiative, and conversational roles; in particular, how context from the information state representation and from ongoing and previous utterances is used to expect and recognize dialogue acts of various sorts, and how the information state model is updated.
Social signal processing for conversations: roles, conflicts and personality
Dialog Prediction for a General Model of Turn-Taking
Nigel Ward and Olac Fuentes
Today there are solutions for some specific turn-taking problems, but no general model. We show how turn-taking can be reduced to two more general problems, prediction and selection. We also discuss the value of predicting not only future speech/silence but also prosodic features, thereby handling not only turn-taking but ``turn-shaping''. To illustrate how such predictions can be made, we trained a neural network predictor. This was adequate to support some specific turn-taking decisions and was modestly accurate overall.
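A minimal stand-in for such a predictor (not the authors' network: the features, data, and model below are synthetic and invented for illustration) is a single logistic-regression unit trained by gradient descent to predict whether the speaker will still be speaking 500 ms ahead, from two frame-level prosodic features:

```python
import math
import random

random.seed(1)

# Synthetic frames: "speaking 500 ms later" correlates with high energy
# and with the absence of falling pitch (final lowering often precedes
# turn ends). All distributions are assumed, for illustration only.
def make_frame():
    speaking_later = random.random() < 0.5
    energy = random.gauss(1.0 if speaking_later else 0.4, 0.2)
    falling_pitch = random.gauss(0.2 if speaking_later else 0.9, 0.2)
    return (energy, falling_pitch), speaking_later

data = [make_frame() for _ in range(400)]

# A one-unit "network" (logistic regression) trained by stochastic
# gradient descent on the cross-entropy loss.
w = [0.0, 0.0]
b = 0.0
lr = 0.5
for _ in range(30):
    for (x1, x2), y in data:
        p = 1 / (1 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
        err = p - y              # gradient of cross-entropy w.r.t. logit
        w[0] -= lr * err * x1
        w[1] -= lr * err * x2
        b -= lr * err

# Training-set accuracy of the speech/silence prediction.
correct = 0
for (x1, x2), y in data:
    p = 1 / (1 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
    correct += (p > 0.5) == y
acc = correct / len(data)
```

The learned weights take the expected signs (positive for energy, negative for falling pitch); extending the output from a binary speech/silence label to predicted prosodic feature values is what moves such a predictor from turn-taking toward ``turn-shaping''.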