These results will be presented at MediaEval 2013, a workshop to be held October 18-19, in conjunction with ACM Multimedia 2013 in Barcelona, and will appear in the proceedings.
February 21: potential participants contact Nigel Ward.
late March: familiarization pack sent to participants, so they can examine sample data and file formats, try the evaluation code, plan their strategy, and if necessary suggest fine-tuning of the challenge.
May 25: training data released to participants, so they can refine their algorithms and tune their systems.
July 15: test data released, consisting of new queries and new data.
September 5: search results on test data due.
September 18: final performance results provided to participants.
September 28: workshop paper submissions due.
October 18-19: MediaEval 2013 workshop: sharing of what worked and what didn't, discussion of implications and next steps.
Given a query (in the form of a region of one of the dialogs), the
system should return a set of pointers to hypothesized similar
regions. The ideal system will return the onset points of all such
similar regions, as identified by the human annotators, and no other
regions.
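Concretely, one might think of the system's interface along these lines. This is a minimal sketch only: the representation of regions as (dialog id, start, end) triples and the function name are assumptions for illustration; the actual file formats are specified in the familiarization pack.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Region:
        dialog_id: str   # which dialog the region comes from
        start: float     # onset, in seconds
        end: float       # offset, in seconds

    def find_similar(query: Region, corpus) -> List[Region]:
        """Return hypothesized similar regions. An ideal system returns the
        onsets of exactly the regions the human annotators judged similar."""
        raise NotImplementedError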
How will performance be evaluated?
This is described in section 4 of the paper.
The documentation, data, and metadata are available on the download page.
Probably the obvious way to build a baseline system would be to gather all the words in the query region, then find all regions elsewhere in the corpus that densely contain those words or similar words. Any traditional IR technique would work for this, although probably with modifications to deal with the special properties of spoken language (noisy, with interlocutor-track information and prosodic features also available), and with the lack of a segmentation into "documents", meaning that the system can return regions from anywhere in the corpus and of any size.
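A bare-bones version of that baseline might look roughly like the following. This is a sketch only: it assumes word-level transcripts are available as (word, start-time) pairs per dialog, and the window and step sizes are arbitrary choices, not part of the task definition.

    from collections import Counter

    def baseline_search(query_words, transcripts, window=60.0, step=10.0, top_n=20):
        """Score fixed-length windows by how densely they contain query words.
        transcripts: dict mapping dialog_id to a list of (word, start_time) pairs."""
        query_counts = Counter(w.lower() for w in query_words)
        candidates = []
        for dialog_id, words in transcripts.items():
            if not words:
                continue
            t_end = words[-1][1]
            t = 0.0
            while t < t_end:
                in_window = [w.lower() for w, s in words if t <= s < t + window]
                if in_window:
                    hits = sum(min(query_counts[w], c)
                               for w, c in Counter(in_window).items() if w in query_counts)
                    score = hits / len(in_window)   # density of query words in the window
                    candidates.append((score, dialog_id, t))
                t += step
        candidates.sort(reverse=True)
        return [(dialog_id, onset) for _, dialog_id, onset in candidates[:top_n]]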
No. In fact, the tags are there as comments only. The system you build will not be able to rely on the tags being there or meaning anything. The meaning of each similarity-set is just the set of regions in that set. And in particular, the test set will include as queries regions which were not seen in the training set, and which may not relate to any of the tags seen in the training set. We think this is realistic. For example, our campus recently had a bomb threat, the first in 10 years, so nothing like that is discussed anywhere in the corpus; even so, we'd want a system running today to be able to find other regions of talk about campus security issues if a segment related to a bomb threat were submitted as a query.
Training a classifier for each tag would be a poor strategy, since the tagset is not fixed. The goal of the task is to find similar segments, and the similarity-sets are provided as examples of what counts as similar in this corpus, for these users. If you use these to build and refine a general similarity metric, then that metric can be used for any retrieval request. For example, if you use a vector-space model, then for any query (e.g. a couple of utterances about the bomb threat) you can find some speech regions that are close to it in the vector space, and return those. In a real system you'd probably use a nearest-neighbors algorithm to find these quickly, but for this task, given the small corpus and the lack of a real-time requirement, exhaustive search will probably be just fine. (But you're right to note that having only 1 example of some tags is not useful for anything; in the actual data release we'll aim to have 5-15 examples for most tags.)
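As an illustration, a plain TF-IDF vector space with exhaustive cosine search could be set up as below. This is a sketch under the assumption that candidate regions have already been cut from the corpus by some windowing scheme and transcribed into strings; the function and parameter names are illustrative, not part of the task. The similarity-sets then serve as tuning data for the metric (feature choices, weights) rather than as class labels.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_regions(query_text, region_texts, top_n=20):
        """region_texts: list of transcript strings, one per candidate region.
        Returns the indices of the candidates most similar to the query."""
        vectorizer = TfidfVectorizer()
        region_vecs = vectorizer.fit_transform(region_texts)
        query_vec = vectorizer.transform([query_text])
        sims = cosine_similarity(query_vec, region_vecs)[0]   # exhaustive search
        ranked = sorted(range(len(region_texts)), key=lambda i: sims[i], reverse=True)
        return ranked[:top_n]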
While most teams will want to split the training set themselves into one part for training and one part for tuning, we aren't imposing any specific partition.
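If it helps, one simple way to carve out a tuning portion is just to hold out a random subset of the training dialogs; the 80/20 split below is an arbitrary choice, not a recommendation.

    import random

    def split_train_tune(dialog_ids, tune_fraction=0.2, seed=0):
        """Split a list of training dialog identifiers into (train, tune)."""
        ids = sorted(dialog_ids)
        random.Random(seed).shuffle(ids)
        n_tune = max(1, int(len(ids) * tune_fraction))
        return ids[n_tune:], ids[:n_tune]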
The test data will be pristine, so there is no risk. For training purposes, though, we decided to also release the pilot recordings and labelings, thinking that participants would like to have as much data as possible. The metadata shows which files these are, so it's possible to exclude them from training if desired.
As described in step 7 of the Annotators Guide, each similarity set was assigned a value from 0 to 3. These numbers may be useful in the training process, since similarity sets with higher values may be more informative and valuable, and participants may want to tune their parameters to perform best on the higher-valued sets.
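For instance, one hypothetical way to use the values during tuning is to weight each similarity set's contribution to an overall tuning score by its 0-3 value, so that parameter choices are pushed toward performing well on the higher-valued sets. The per-set scores and values below are assumed to be dictionaries keyed by set id.

    def weighted_tuning_score(score_per_set, value_per_set):
        """Average per-set scores, weighting each set by its 0-3 value."""
        total_weight = sum(value_per_set[s] for s in score_per_set)
        if total_weight == 0:
            return 0.0
        return sum(score_per_set[s] * value_per_set[s]
                   for s in score_per_set) / total_weight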