Interspeech 2026 Challenge

Transfer of Pragmatic Intent in Speech-to-Speech Translation

Overview

Current S2ST systems do not prioritize pragmatic fidelity, and thus fall short of supporting people conversing across languages. We are offering this task to promote and evaluate technical advances on this problem.

Teams will provide system-output translations for the utterances in the evaluation set, which will be taken from English and Spanish conversations. We will return scores from two automatic evaluation metrics along with human evaluation results.

Data

The data consists of Spanish-English utterance pairs, in which one member is taken from a conversation and the other is a pragmatically faithful reenactment in the other language.

Data samples are available here and here.

Data for system tuning is also here. We will additionally provide transcripts for this data by December 15.

Evaluation data will be matched for recording conditions, recording protocol, and speaker demographics.

Baseline Systems

While we expect that most participating teams will use their own systems, possibly tuned for this task, we will provide a Jupyter notebook with a baseline system by November 30 for those who want a starting point.

Submissions

We will provide a portal for teams to upload their submissions, ready for a dry run by December 15.

Publication Plan

Participating teams are encouraged to submit system descriptions to Interspeech 2026 (deadline February 25) or elsewhere. We are also planning to submit a challenge-summary paper for the conference.

Details

Motivation and Aims of the Challenge

Current speech-to-speech translation (S2ST) systems work well for many purposes, but are less well suited to dialog (Liebling, CHI 2020). In particular, their outputs are usually insensitive to the interpersonal and pragmatic goals underlying the source utterances. One reason for these limitations is that current evaluation methods focus only on semantic fidelity and output naturalness.

We accordingly propose a new evaluation, in terms of pragmatic fidelity. For example, given the input did you see her? with its specific prosody (perhaps showing breathless interest, encouraging the interlocutor to continue his story, and displaying sensitivity to the complex emotions he’s feeling after a break-up), the output should be judged not only on how well it translates the lexical content, but also on how well it conveys these elements of intent and stance.

Through this challenge task, we plan to:

  1. Obtain a baseline of how well current systems succeed at this aspect of translation.
  2. Support the development of ways for systems to do better at this aspect of translation.
  3. Discover what types of pragmatic function are not well handled by existing methods, aiming to inform linguistic inquiry, model design, and new data-resource collection.
  4. Indirectly, advance the design and tuning of metrics for S2ST.
  5. Indirectly, support the development of new methods for representing and reasoning about pragmatic intents as they develop and surface in conversation.

Overall, we aim to jumpstart a new direction in speech-to-speech translation research. While novel, it aligns with the current wide interest in expanding the scope of S2ST, as seen, for example, in the recent flowering of work on speaker-property transfer and transfer of emotion and expressiveness through prosody.

Over the long term, this will enable the evaluation and development of systems able to translate pragmatic intent, and thereby better support speakers who need to communicate more than just specific information: to convey feelings, stances, and intentions; to organize the flow and topic structure of dialog; to establish shared understandings and rapport; and so on. Such systems will thus support many use cases that are currently ill-served.

Rules for Participation

Anyone is welcome to participate. The Technical Committee will be available to support participants needing help, from getting started through the final submission.

Systems will take as input a folder of about 500 English utterances extracted from dialog and output a folder of Spanish utterances, and conversely for Spanish to English.
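
To make the expected interface concrete, here is a minimal Python sketch of a batch wrapper; the folder layout, the WAV naming convention, and the run_s2st placeholder are illustrative assumptions, not the official submission format.

    # Minimal sketch of the batch interface: for each source-language utterance
    # in the input folder, run the team's S2ST system and write a same-named
    # output file. Folder layout and naming are assumptions, not the official
    # submission format.

    from pathlib import Path

    def run_s2st(input_wav: Path, output_wav: Path, target_lang: str) -> None:
        """Placeholder for the team's speech-to-speech translation system."""
        raise NotImplementedError("plug in your S2ST system here")

    def translate_folder(in_dir: str, out_dir: str, target_lang: str) -> None:
        out_path = Path(out_dir)
        out_path.mkdir(parents=True, exist_ok=True)
        for wav in sorted(Path(in_dir).glob("*.wav")):
            run_s2st(wav, out_path / wav.name, target_lang)

    if __name__ == "__main__":
        # e.g., English evaluation set in, Spanish translations out
        translate_folder("eval_english", "outputs_spanish", target_lang="es")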

An overall performance report will be released, with team names redacted, unless otherwise agreed.

Metrics

Evaluation will be done with four methods:
  1. Human Quantitative Evaluation. A panel of 5-6 judges will independently score the pragmatic fidelity of each translation on a scale from 1 to 5, using an adaptation of an existing protocol. As this is time-consuming, we will probably do this for only 50-80 utterances per submission.
  2. Black-Box Metric. Outputs will be scored by their similarity to the human-generated gold translation, according to Segura’s similarity metric (Interspeech 2024), which computes the cosine similarity between the representations of two utterances in terms of 103 HuBERT features for English or 101 for Spanish, selected to maximize the match to a collection of pragmatic similarity judgments. Importantly, we are here measuring pragmatic appropriateness directly, rather than via aspects of the prosody, as done, for example, by AutoPCP or F0 DTW. (A computational sketch follows this list.)
  3. Explainable (Marco) Metric. This metric is a weighted Euclidean distance over roughly 100 prosodic features that explainably characterize the pragmatically important prosody of each utterance (also sketched after this list). Importantly, we are here measuring essentially all aspects of prosody, not just the events or features considered important by classical models.
  4. Human Qualitative Evaluation. A focus group of bilinguals will listen to about 20 outputs for each system, considering the source utterances and the reference human translations. We will apply qualitative-inductive methods to identify common strengths and weaknesses, for each system and also across all systems.
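
To make the two automatic metrics concrete, here is a minimal Python sketch of the computations involved; the function names, the assumption of precomputed feature vectors, and the placeholder weights are ours, not the official scoring code.

    # Hedged sketch of the two automatic metrics, assuming the relevant
    # feature vectors have already been extracted.

    import numpy as np

    def blackbox_similarity(feats_output: np.ndarray, feats_reference: np.ndarray) -> float:
        """Black-Box Metric: cosine similarity between selected HuBERT-derived
        feature vectors (103 dims for English, 101 for Spanish); higher is better."""
        return float(np.dot(feats_output, feats_reference) /
                     (np.linalg.norm(feats_output) * np.linalg.norm(feats_reference)))

    def marco_distance(prosody_output: np.ndarray, prosody_reference: np.ndarray,
                       weights: np.ndarray) -> float:
        """Explainable (Marco) Metric: weighted Euclidean distance over roughly
        100 interpretable prosodic features; smaller is better."""
        diff = prosody_output - prosody_reference
        return float(np.sqrt(np.sum(weights * diff ** 2)))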

For the first three metrics we will report a topline: the ratings of human translations produced by an independent bilingual speaker. For the black-box and explainable metrics, we will also report two baselines: directly transferring the prosody of the source, and using the prosody of a randomly selected target-language utterance.

Datasets

The evaluation will be done over fresh data collected using the IRB-approved “Dialogs Reenacted across Languages” (DRAL) protocol. This data will consist of paired Spanish-English utterances, closely matched for communicative intent. These are created by having bilinguals engage in real, spontaneous conversations and subsequently re-create selected utterances in their other language. Utterances are mostly 1-4 seconds long and exhibit great pragmatic variety. This data will likely be released after the challenge is over.

The only matched training data will be 2893 English-Spanish audio pairs, collected using the same protocol and already available at https://www.cs.utep.edu/nigel/dral/ and through the Linguistic Data Consortium (LDC2024S08), plus third-party reenactment-style translations of these utterances. These were all collected under IRB approval and there are no restrictions on their release or use. While these pairs should be useful for tuning meta-parameters and the like, we expect that systems will mostly be trained on other resources, perhaps monolingual dialog data and bilingual monologue data.

Timeline

Dec 10-15 Dry-run evaluation to test workflow, for all participating teams
Jan 10 Teams submit outputs
Jan 15 Teams submit brief system descriptions
Jan 15 Technical Committee reports results to teams
Feb 25 Organizing Committee submits Challenge Overview paper to Interspeech; participating teams also submit their papers
Sep 27 – Oct 1 Presentations of accepted papers at Interspeech in Sydney

Conditions, Tentative

We will run several conditions, determined based on participant interest. These may include:

1. Unlimited. This is likely to appeal to well-resourced teams. There will be no limits on training data, with proprietary data in particular also allowed. Input information will include the full left context, with channel-separated audio from both speakers, and aligned transcriptions. Outputs will be fully realized utterances.

2. Circumscribed. This is likely to appeal to smaller teams. Training will be restricted to publicly available training data.

3. Entry-Level. This is intended to enable participation by teams lacking sophisticated machine-learning expertise, large data and computational resources, or, in particular, speech-synthesis experience. We hope to see novel ideas emerge from experiments unencumbered by the need to build full systems, and thereby free to address a selected key challenge, such as discourse-marker translation or prosody transfer, perhaps built as independent modules. We accordingly aim to support participation by teams with new modeling ideas, including linguistically-inspired models and explainable models. In this condition, training will be limited to the DRAL data releases, with use of other data resources allowed only via off-the-shelf pretrained models or publicly-available systems. The input will be only utterance-internal information, without context: specifically, the audio, a precomputed characterization of the audio in terms of 110 prosodic features, and an aligned transcript. The output will be a specification for a target-language utterance in terms of the same set of 110 features (see the sketch below). In this condition, evaluation will be only in terms of the Explainable (Marco) Metric, perhaps modified based on participant input.
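
As a rough illustration of the Entry-Level interface, the Python sketch below reads, for each utterance, a 110-dimensional prosodic feature vector and writes a target-language feature specification of the same dimensionality; it simply copies the source features, which corresponds to the direct prosody-transfer baseline mentioned under Metrics. The file formats and names are assumptions for illustration, not the challenge specification.

    # Sketch of the Entry-Level I/O, under assumed file formats: one CSV per
    # utterance containing a single row of 110 feature values. This trivial
    # "system" copies the source prosodic features unchanged, i.e., the direct
    # prosody-transfer baseline.

    import csv
    from pathlib import Path

    N_FEATURES = 110  # the challenge's fixed prosodic feature set

    def translate_features(source_features: list[float]) -> list[float]:
        """Replace with a real model mapping source-language prosody to a
        target-language prosody specification."""
        assert len(source_features) == N_FEATURES
        return source_features  # identity mapping = direct-transfer baseline

    def process_folder(in_dir: str, out_dir: str) -> None:
        out_path = Path(out_dir)
        out_path.mkdir(parents=True, exist_ok=True)
        for feat_file in sorted(Path(in_dir).glob("*.csv")):
            with open(feat_file) as f:
                source = [float(x) for x in next(csv.reader(f))]
            target = translate_features(source)
            with open(out_path / feat_file.name, "w", newline="") as f:
                csv.writer(f).writerow(target)

    if __name__ == "__main__":
        process_folder("entry_level_inputs", "entry_level_outputs")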

The figure shows how all conditions will share the same data. Stages/modules in boxes will be run by us. The clouds show what the participants will contribute. Not shown, to avoid clutter, are the qualitative evaluations and the fact that the S2ST systems’ output will also be evaluated according to the Explainable (Marco) Metric.

Organizing Committee

Technical Committee

Acknowledgment

This task is supported in part by an NSF-funded project: “Modeling Prosody for Speech-to-Speech Translation,” IIS-2348085.

Notes on Scope

Because we aim to keep this challenge focused and well-defined, we do not plan to also score systems on semantic-fidelity metrics, and, in the Entry-Level condition, we do not plan to evaluate lexical appropriateness or prosody-content compatibility.

We plan to eventually release the ratings according to all four metrics, to support future fine-tuning to better match human judgments and to support investigation of the relative advantages of automated versus human-in-the-loop metrics, single- versus multiple-reference metrics, and so on.

Links

Interspeech 2026