DRAL is a bilingual speech corpus of parallel utterances, built from recorded conversations and fragments of them re-enacted in a different language. It is intended as a resource for research, especially for training and evaluating speech-to-speech translation models and systems. We dedicate this corpus to the public domain; there is no copyright (CC0).
DRAL is described in a new technical report: Dialogs Re-enacted Across Languages, Version 2, Nigel G. Ward, Jonathan E. Avila, Emilia Rivas, Divette Marco. Some initial analyses of this data are described in our Interspeech 2023 paper.
The releases include 2893 short matched Spanish-English pairs (more than 2 hours) taken from 104 conversations with 70 unique participants. There are also some illustrative, lower-quality pairs in Bengali-English, Japanese-English, and French-English. All are packaged together with the full original conversations and the full re-enactment recording sessions.
Download DRAL-16kHz.tgz (11 GB)
If you need higher audio quality, you can instead download the 48kHz versions. In addition, we have a test set, about a third the size of the released data, held out for now in anticipation of use in a shared task.
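Once the archive is downloaded, it can be unpacked with any standard tar tool. As a minimal sketch in Python, assuming DRAL-16kHz.tgz is in the current directory, the following extracts it and counts the files by extension. The helper names `extract_archive` and `count_by_extension` and the output directory name are illustrative only, and nothing here assumes a particular layout inside the archive.

```python
import tarfile
from collections import Counter
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir and return that path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest


def count_by_extension(root):
    """Count regular files under root, grouped by lowercase file extension."""
    return Counter(
        p.suffix.lower() for p in Path(root).rglob("*") if p.is_file()
    )


# Example usage (assumed local file name and hypothetical output directory):
#   root = extract_archive("DRAL-16kHz.tgz", "DRAL-16kHz")
#   for ext, n in sorted(count_by_extension(root).items()):
#       print(ext or "(no extension)", n)
```

Counting by extension is just a quick way to check that the extraction completed and to see which kinds of files the release contains before you write any loading code.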
Some sample pairs:
Our data collection procedure is explained in a very short movie. (Erratum: the post-processing actually takes more than 10 minutes.)
The post-processing scripts are in Jonathan Avila's DRAL repository on GitHub.
Thanks: Georgina Bugarini