These results will be presented at MediaEval 2013, a workshop to be held October 18-19, in conjunction with ACM Multimedia 2013 in Barcelona, and will appear in the proceedings.
February 21: potential participants contact Nigel Ward.
late March: familiarization pack sent to participants, so they can examine sample data and file formats, try the evaluation code, plan their strategy, and if necessary suggest fine-tuning of the challenge.
May 25: training data released to participants, so they can refine their algorithms and tune their systems.
July 15: test data released, consisting of new queries and new data.
September 5: search results on test data due.
September 18: final performance results provided to participants.
September 28: workshop paper submissions due.
October 18-19: MediaEval 2013 workshop: sharing of what worked and what didn't, discussion of implications and next steps.
Given a query (in the form of a region of one of the dialogs), the
system should return a set of pointers to hypothesized similar
regions. The ideal system will return the onset points of all such
similar regions, as identified by the human annotators, and no other
regions.
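Concretely, one might think of the system's interface along these lines. This is a minimal sketch only: the representation of regions as (dialog id, start, end) triples and the function name are assumptions for illustration; the actual file formats are specified in the familiarization pack.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Region:
        dialog_id: str   # which dialog the region comes from
        start: float     # onset, in seconds
        end: float       # offset, in seconds

    def find_similar(query: Region, corpus) -> List[Region]:
        """Return hypothesized similar regions. An ideal system returns the
        onsets of exactly the regions the human annotators judged similar."""
        raise NotImplementedError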
How will performance be evaluated?
This is described in section 4 of the paper.
The documentation, data, and metadata are available on the download page.
Probably the obvious way to build a baseline system would be to gather all the words in the query region, then find all regions elsewhere in the corpus that densely contain those words or similar words. Any traditional IR technique would work for this, although probably with modifications to deal with the special properties of spoken language (noisy, with interlocutor-track information and prosodic features also available), and with the lack of a segmentation into "documents", meaning that the system can return regions from anywhere in the corpus and of any size.
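A bare-bones version of that baseline might look roughly like the following. This is a sketch only: it assumes word-level transcripts are available as (word, start-time) pairs per dialog, and the window and step sizes are arbitrary choices, not part of the task definition.

    from collections import Counter

    def baseline_search(query_words, transcripts, window=60.0, step=10.0, top_n=20):
        """Score fixed-length windows by how densely they contain query words.
        transcripts: dict mapping dialog_id to a list of (word, start_time) pairs."""
        query_counts = Counter(w.lower() for w in query_words)
        candidates = []
        for dialog_id, words in transcripts.items():
            if not words:
                continue
            t_end = words[-1][1]
            t = 0.0
            while t < t_end:
                in_window = [w.lower() for w, s in words if t <= s < t + window]
                if in_window:
                    hits = sum(min(query_counts[w], c)
                               for w, c in Counter(in_window).items() if w in query_counts)
                    score = hits / len(in_window)   # density of query words in the window
                    candidates.append((score, dialog_id, t))
                t += step
        candidates.sort(reverse=True)
        return [(dialog_id, onset) for _, dialog_id, onset in candidates[:top_n]]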
No. In fact, the tags are there as comments only. The system you build will not be able to rely on the tags being there or meaning anything. The meaning of each similarity-set is just the set of regions in that set. And in particular, the test set will include as queries regions which were not seen in the training set, and which may not relate to any of the tags seen in the training set. We think this is realistic. For example, our campus recently had a bomb threat, the first in 10 years, so nothing like that is discussed anywhere in the corpus; even so, we'd want a system running today to be able to find other regions of talk about campus security issues if a segment related to a bomb threat were submitted as a query.
Training a classifier for each tag would be a poor strategy, since the tagset is not fixed. The goal of the task is to find similar segments, and the similarity-sets are provided as examples of what counts as similar in this corpus, for these users. If you use these to build and refine a general similarity metric, then that metric can be used for any retrieval request. For example, if you use a vector-space model, then for any query (e.g. a couple of utterances about the bomb threat) you can find some speech regions that are close to it in the vector space, and return those. In a real system you'd probably use a nearest-neighbors algorithm to find these quickly, but for this task, given the small corpus and the lack of a real-time requirement, exhaustive search will probably be just fine. (But you're right to note that having only 1 example of some tags is not useful for anything; in the actual data release we'll aim to have 5-15 examples for most tags.)
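As an illustration, a plain TF-IDF vector space with exhaustive cosine search could be set up as below. This is a sketch under the assumption that candidate regions have already been cut from the corpus by some windowing scheme and transcribed into strings; the function and parameter names are illustrative, not part of the task. The similarity-sets then serve as tuning data for the metric (feature choices, weights) rather than as class labels.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_regions(query_text, region_texts, top_n=20):
        """region_texts: list of transcript strings, one per candidate region.
        Returns the indices of the candidates most similar to the query."""
        vectorizer = TfidfVectorizer()
        region_vecs = vectorizer.fit_transform(region_texts)
        query_vec = vectorizer.transform([query_text])
        sims = cosine_similarity(query_vec, region_vecs)[0]   # exhaustive search
        ranked = sorted(range(len(region_texts)), key=lambda i: sims[i], reverse=True)
        return ranked[:top_n]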
While most teams will want to split the training set themselves into one part for training and one part for tuning, we aren't imposing any specific partition.
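If it helps, one simple way to carve out a tuning portion is just to hold out a random subset of the training dialogs; the 80/20 split below is an arbitrary choice, not a recommendation.

    import random

    def split_train_tune(dialog_ids, tune_fraction=0.2, seed=0):
        """Split a list of training dialog identifiers into (train, tune)."""
        ids = sorted(dialog_ids)
        random.Random(seed).shuffle(ids)
        n_tune = max(1, int(len(ids) * tune_fraction))
        return ids[n_tune:], ids[:n_tune]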
The test data will be pristine, so there is no risk. For training purposes, though, we decided to also release the pilot recordings and labelings, thinking that participants would like to have as much data as possible. The metadata shows which files these are, so it's possible to exclude them from training if desired.
As described in step 7 of the Annotators Guide, each similarity set was assigned a value from 0 to 3. These numbers may be useful in the training process, since similarity sets with higher values may be more informative and valuable, and participants may want to tune their parameters to perform best on the higher-valued sets.
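For instance, one hypothetical way to use the values during tuning is to weight each similarity set's contribution to an overall tuning score by its 0-3 value, so that parameter choices are pushed toward performing well on the higher-valued sets. The per-set scores and values below are assumed to be dictionaries keyed by set id.

    def weighted_tuning_score(score_per_set, value_per_set):
        """Average per-set scores, weighting each set by its 0-3 value."""
        total_weight = sum(value_per_set[s] for s in score_per_set)
        if total_weight == 0:
            return 0.0
        return sum(score_per_set[s] * value_per_set[s]
                   for s in score_per_set) / total_weight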