Current S2ST systems do not prioritize pragmatic fidelity, and thus do not ideally support people conversing across languages. We are offering this task to promote and evaluate technical advances on this problem.
Teams will provide system-output translations for the utterances in the evaluation set, which will be taken from English and Spanish conversation. We will provide scores from two automatic evaluation metrics as well as human evaluation results.
Data is in the form of Spanish-English utterance pairs, each consisting of an utterance taken from conversation and a pragmatically-faithful reenactment in the other language.
Data samples are available here and here.
Data for system tuning is also here. We will additionally provide transcripts for this data by December 15.
Evaluation data will be matched for recording conditions, recording protocol, and speaker demographics.
While we expect most participating teams will use their own systems, possibly tuned for this task, we will provide a Jupyter notebook with a baseline system by November 30 for those who want a starting point.
We will provide a portal for teams to upload their submissions, ready for a dry run by December 15.
Participating teams are encouraged to submit system descriptions to Interspeech 2026 (deadline February 25) or elsewhere. We are also planning to submit a challenge-summary paper for the conference.
Current speech-to-speech translation (S2ST) systems work well for many purposes, but are less well suited for dialog (Liebling, CHI 2020). In particular, their outputs are usually insensitive to the interpersonal and pragmatic goals underlying the source utterances. One reason for these limitations is that evaluation methods currently focus only on semantic fidelity and output naturalness.
We accordingly propose a new evaluation, in terms of pragmatic fidelity. For example, given the input did you see her? with its specific prosody --- perhaps showing breathless interest, encouraging the interlocutor to continue his story, and displaying sensitivity to the complex emotions he’s feeling after a break-up --- the output should be judged not only on its ability to translate the lexical content, but also these elements of intent and stance.
Through this challenge task, we plan to:
Overall, we aim to jumpstart a new direction in speech-to-speech translation research. While novel, it aligns with the current wide interest in expanding the scope of S2ST, as seen, for example, in the recent flowering of work on speaker-property transfer and transfer of emotion and expressiveness through prosody.
Over the long term, this will enable the evaluation and development of systems able to translate pragmatic intent, and thereby better support speakers who need to communicate more than just specific information: to convey feelings, stances, and intentions, to organize the flow and topic structure of dialog, to establish shared understandings and rapport, and so on. These systems will thus support many use cases that are currently ill-served.
Anyone is welcome to participate. The Technical Committee will be available to support participants needing help, from getting started through the final submission.
Systems will take as input a folder of about 500 English utterances extracted from dialog and output a folder of Spanish utterances, and conversely for Spanish to English.
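As a rough illustration of this batch interface, the sketch below walks a folder of input utterances and writes one same-named output file per input. The folder layout and the translate_utterance callable stand in for each team's own S2ST system and are not part of the task specification.

    from pathlib import Path
    from typing import Callable

    def translate_folder(in_dir: str, out_dir: str,
                         translate_utterance: Callable[[Path], bytes]) -> None:
        """Apply a team's S2ST system to every utterance file in in_dir,
        writing one same-named output file per input."""
        in_path, out_path = Path(in_dir), Path(out_dir)
        out_path.mkdir(parents=True, exist_ok=True)
        for wav in sorted(in_path.glob("*.wav")):
            translated_audio = translate_utterance(wav)  # target-language audio bytes
            (out_path / wav.name).write_bytes(translated_audio)

    # e.g., for the English-to-Spanish direction (paths are illustrative):
    #   translate_folder("eval/english", "submission/spanish", my_en_to_es_system)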
An overall performance report will be released, with team names redacted, unless otherwise agreed.
For the first three metrics we will report a topline: the ratings of human translations produced by an independent bilingual speaker. For the explainable and black-box metrics, we will also report two baselines: directly transferring the prosody of the source, and using the prosody of a randomly-selected target-language utterance.
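For concreteness, the following sketch expresses these two baselines in terms of per-utterance prosodic feature vectors; the array layout and function names are our own illustration, not a prescribed implementation.

    import numpy as np

    def direct_transfer_baseline(source_prosody: np.ndarray) -> np.ndarray:
        """Use the source utterance's prosody unchanged as the target specification."""
        return source_prosody.copy()

    def random_utterance_baseline(target_language_prosody: np.ndarray,
                                  rng: np.random.Generator) -> np.ndarray:
        """Use the prosody of a randomly selected target-language utterance."""
        idx = rng.integers(len(target_language_prosody))
        return target_language_prosody[idx]

    # e.g. rng = np.random.default_rng(0)
    #      spec = random_utterance_baseline(es_prosody_matrix, rng)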
The evaluation will be done over fresh data collected using the IRB-approved “Dialogs Reenacted across Languages” (DRAL) protocol. This will consist of paired Spanish-English utterances, closely matched for communicative intent. These are created by having bilinguals engage in real, spontaneous conversations and subsequently re-create selected utterances in their other language. Utterances are mostly 1-4 seconds in length and exhibit great pragmatic variety. This data will likely be released after the challenge is over.
The only matched training data available will be 2893 English-Spanish audio pairs, collected using the same protocol and already available at https://www.cs.utep.edu/nigel/dral/ and through the Linguistic Data Consortium (LDC2024S08), plus third-party re-enactment-style translations of these utterances. These were all collected under IRB approval and there are no restrictions on their release or use. While these should be useful for tuning meta-parameters and the like, we expect systems to be trained mostly on other resources, perhaps monolingual dialog data and bilingual monologue data.
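As a rough illustration of how such paired data might be used for meta-parameter tuning, the sketch below iterates over English-Spanish pairs, assuming a metadata file that maps each English fragment to its Spanish re-enactment; the column names and file layout here are illustrative, not the corpus's actual format.

    import csv
    from pathlib import Path

    def load_pairs(metadata_csv: str, audio_root: str):
        """Yield (english_wav, spanish_wav) path pairs for meta-parameter tuning."""
        root = Path(audio_root)
        with open(metadata_csv, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield root / row["en_audio"], root / row["es_audio"]

    # e.g. for en_wav, es_wav in load_pairs("dral_pairs.csv", "dral_audio"):
    #          tune_on_pair(en_wav, es_wav)   # hypothetical tuning step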
Dec 10-15 Dry-run evaluation to test workflow, for all participating teams
Jan 10 Teams submit outputs
Jan 15 Teams submit brief system descriptions
Jan 15 Technical Committee reports results to teams
Feb 25 Organizing Committee submits Challenge Overview paper to Interspeech; participating teams also submit their papers
Sep 27 – Oct 1 Presentations of accepted papers at Interspeech in Sydney
1. Unlimited. This is likely to appeal to well-resourced teams. There will be no limits on training data, with proprietary data in particular also allowed. Input information will include the full left context, with channel-separated audio from both speakers, and aligned transcriptions. Outputs will be fully realized utterances.
2. Circumscribed. This is likely to appeal to smaller teams. Training will be restricted to publicly available training data.
3. Entry-Level. This is intended to enable participation by teams lacking sophisticated machine learning, large data and computational resources, and, in particular, speech synthesis experience. We hope to see novel ideas emerge from experiments unencumbered by the need to build full systems, and thereby free to work on a selected key challenge, such as discourse marker translation or prosody transfer, perhaps built as independent modules. We accordingly aim to support participation by teams with new modeling ideas, including linguistically-inspired models and explainable models. In this condition, training will be limited to the DRAL data releases, with use of other data resources allowed only via off-the-shelf pretrained models or publicly available systems. The input will be only utterance-internal information, without context: specifically, the audio, a precomputed characterization of the audio in terms of 110 prosodic features, and an aligned transcript. Output will be a specification for a target-language utterance in terms of the same set of 110 features (a minimal interface sketch appears after this list). In this condition, evaluation will be only in terms of the Explainable (Marco) Metric, perhaps modified based on participant input.
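To make the Entry-Level interface concrete, here is a minimal sketch assuming the 110 prosodic features are provided as one vector per utterance; the exact file formats and feature definitions will follow the DRAL releases, and all names below are illustrative.

    import numpy as np

    N_PROSODIC_FEATURES = 110

    def specify_target_utterance(audio_path: str,
                                 source_prosody: np.ndarray,
                                 aligned_transcript: str) -> np.ndarray:
        """Map one source utterance (audio, 110 prosodic features, aligned transcript)
        to a 110-feature specification of the target-language utterance."""
        assert source_prosody.shape == (N_PROSODIC_FEATURES,)
        target_prosody = np.zeros(N_PROSODIC_FEATURES)
        # ... a participating module would fill target_prosody here, e.g. with a
        # learned regressor or linguistically-motivated rules ...
        return target_prosody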
The figure shows how all conditions will share the same data. Stages/modules in boxes will be run by us. The clouds show what the participants will contribute. Not shown, to avoid clutter, are the Qualitative Evaluations and the fact that the S2ST systems’ output will also be evaluated according to Criterion 3.
We plan to eventually release the ratings according to the 4 metrics, to support future fine-tuning to better match human judgments and to support investigation of the relative advantages of automated versus human-in-the-loop metrics, single versus multiple-reference metrics, and so on.