Data Collection for the Similar Segments in Social Speech Task

Technical Report UTEP-CS-13-58

Nigel G. Ward, Steven D. Werner

Department of Computer Science, University of Texas at El Paso

Abstract: Information retrieval systems rely heavily on models of similarity, but for spoken dialog such models currently use mostly standard textual-content similarity. As part of the MediaEval Benchmarking Initiative, we have created a new corpus to support development of similarity models for spoken dialog. This corpus includes 26 casual dialogs among members of two semi-cohesive groups, totaling about 5 hours, with 1889 labeled regions grouped into 227 sets, each of which annotators judged to be similar enough to share a tag. This technical report brings together information about this corpus and its intended uses.

