Dialogs Re-enacted Across Languages (DRAL) Corpus

DRAL is a bilingual speech corpus of parallel utterances, built from recorded conversations and fragments of those conversations re-enacted in a different language. It is intended as a research resource, especially for training and evaluating speech-to-speech translation models and systems.

DRAL is described in our technical report: Dialogs Re-enacted Across Languages, Version 2, Nigel G. Ward, Jonathan E. Avila, Emilia Rivas, Divette Marco. Some initial analyses of this data are described in our Interspeech 2023 paper.

DRAL is available through the Linguistic Data Consortium under catalog number LDC2024S08. We have dedicated DRAL to the public domain (CC0), so there is no copyright, and you can also download it directly as DRAL-16kHz.tgz (11 GB). If you need higher audio quality, you can alternatively download the 48 kHz versions.
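As a quick sanity check after downloading, the archive can be inspected without unpacking it. The following is a minimal sketch using only the Python standard library. It assumes that the audio inside the archive is stored as plain PCM WAV files, which is not stated above, and the function name `summarize_wavs` is hypothetical, not part of any DRAL tooling.

```python
import io
import tarfile
import wave


def summarize_wavs(archive_path):
    """Return (number of WAV members, total duration in seconds) for a .tgz archive.

    Assumption: audio members are PCM WAV files with names ending in ".wav".
    """
    count = 0
    total_seconds = 0.0
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not a WAV file.
            if not (member.isfile() and member.name.lower().endswith(".wav")):
                continue
            # Read the member into memory so the wave module can seek in it.
            data = archive.extractfile(member).read()
            with wave.open(io.BytesIO(data), "rb") as wav:
                total_seconds += wav.getnframes() / wav.getframerate()
            count += 1
    return count, total_seconds
```

For example, `summarize_wavs("DRAL-16kHz.tgz")` would report how many WAV files the download contains and their combined duration. Note that this total would include the full conversations and recording sessions as well as the short matched pairs.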

The releases include 2893 short matched Spanish-English pairs (more than 2 hours in total) taken from 104 conversations with 70 unique participants. There are also some illustrative, lower-quality pairs in Bengali-English, Japanese-English, and French-English. All are packaged together with the full original conversations and the full re-enactment recording sessions.

In addition, we have a test set, consisting of about a third as much data again, which is held out for now in anticipation of use in a shared task.

Some sample pairs:

More examples, with context

Our data collection procedure is explained in a very short movie. (Erratum: the post-processing actually takes more than 10 minutes.)

The post-processing scripts are in Jonathan Avila's DRAL repository on GitHub.

The OneDrive working repository (UTEP-internal).

Thanks: Georgina Bugarini