Similar Segments of Social Speech Task
GUIDE TO INTERPRETING THE METRICS
Nigel Ward, July 30, 2013

The basic evaluation philosophy for this task is described in the workshop task-overview paper, The Similar Segments in Social Speech Task. In short, the idea is to use a simulation of user behavior to indicate how useful any similar-region-suggesting system would be, and a single overall quality metric is proposed. Please refer to that paper for the description. These notes discuss the realism and stability of the main measures, explain the normalization factors, and describe all the various measures.

1. Validation, Stability, and Region-Length Effects

To validate the computation of these metrics, I created various small test sets and verified that the computations worked as intended. I also tested both a random baseline and a "clever" reference system, described below, on both the trainingset queries and the testset queries, and I varied a few of the evaluation parameters.

Across these experiments, the only unexpected influence of importance was region length. For example, the random algorithm did surprisingly well on the trainingset data, mostly because many of the tagged regions were very long (as discussed in the Annotation Notes document) and some of the tagsets were quite large. For example, a tag like "entertainment" could apply to a large fraction of the entire corpus. Such general tags were not envisaged when the task was designed; however, they do not seem unrealistic or inappropriate. Longer regions work slightly to the advantage of the random algorithm, in that points selected at random tend to fall more often in longer regions than in short ones, and both the coverage and benefit values increase to the extent that longer content is found.
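The pull that long regions exert on a random baseline is easy to see with a quick simulation. This is a sketch for illustration only, not part of the scoring code; the corpus length and region boundaries below are invented. The point is simply that a uniformly random jump-in point lands in a region with probability proportional to that region's length.

```python
import random

# Illustration: random jump-in points hit a region in proportion to its
# length, so long regions systematically favor a random baseline.
rng = random.Random(0)
CORPUS_LEN = 3600.0                  # one hour of audio (invented)
long_region = (0.0, 300.0)           # a 300-second region
short_region = (1000.0, 1030.0)      # a 30-second region

def hit_rate(region, trials=200_000):
    """Fraction of uniformly random points that land inside the region."""
    start, end = region
    hits = sum(start <= rng.uniform(0, CORPUS_LEN) < end for _ in range(trials))
    return hits / trials

long_rate = hit_rate(long_region)    # close to 300/3600
short_rate = hit_rate(short_region)  # close to 30/3600
# The long region attracts roughly ten times as many random points.
```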
Although this does not seem entirely inappropriate, it reduces the incentive for systems to find many similar regions, and instead lets them score well by finding just one very long region. Fortunately there were fewer such very-long regions in the testset, so this issue is moot.

The prevalence of longer regions did lead us to abandon the "scan-back" action included in the original user-behavior simulation. This modeled the idea that a user encountering a jump-in point which took her to the middle of a useful region might then seek back to find the start of that region and listen from the start. For long regions and/or regions with diffuse content, like "entertainment", this seemed unrealistic; a user would probably just listen from the jump-in point to the end, and then go on to the next jump-in point, to find reasonable content with less hassle. The user-behavior simulator was simplified accordingly.

2. Adjustment Factors

To estimate the maximum achievable performance, I built a "clever" reference algorithm that uses the information in the other tagsets. Specifically, given a query region, the algorithm looks through all regions in all other tagsets and finds the region which overlaps the query most closely. It then returns, as jump-in points, the onsets of all other regions in the same tagset as this strongly-overlapping region. If there are fewer than 20 such regions, it does a new scan to find the next most overlapping region, and uses the regions in its tagset to generate more jump-in points. It continues until 20 jump-in points have been generated or until there are no regions whose amount of overlap is 40% or more of the sum of the durations of the overlapping pair of regions. Thus this algorithm exploits the information provided by the other taggers. Of course, no realistic similarity system would have access to such information; however, the algorithm is useful for estimating the upper bound on system performance.
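The clever algorithm's selection loop can be sketched as follows. This is a simplified reconstruction from the description above, not the actual reference implementation; the data layout, with each tagset given as a list of (start, end) pairs in seconds, is an assumption made for the example.

```python
def duration(region):
    start, end = region
    return end - start

def overlap(a, b):
    """Seconds of overlap between two (start, end) regions."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def clever_jump_ins(query, tagsets, max_points=20, min_ratio=0.40):
    """Repeatedly pick the not-yet-used tagset holding the region that most
    strongly overlaps the query, and emit the onsets of that tagset's other
    regions as jump-in points (a sketch of the reference algorithm)."""
    points, used_tags = [], set()
    while len(points) < max_points:
        best_tag, best_region, best_ov = None, None, 0.0
        for tag, regions in tagsets.items():
            if tag in used_tags:
                continue
            for region in regions:
                ov = overlap(query, region)
                # overlap must be >= 40% of the summed durations of the pair
                if ov >= min_ratio * (duration(query) + duration(region)) and ov > best_ov:
                    best_tag, best_region, best_ov = tag, region, ov
        if best_tag is None:      # no remaining region overlaps strongly enough
            break
        used_tags.add(best_tag)
        for region in tagsets[best_tag]:
            if region != best_region and len(points) < max_points:
                points.append(region[0])
    return points
```

For instance, if the query coincides with one region of a tagset, the onsets of that tagset's other regions come back as the jump-in points.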
In particular, even on the trainingset, replete with long regions, this algorithm attained a raw F-measure of only .43, clearly a long way from 1.00. Thus, as noted in the task-overview paper, adjustments are needed. In essence, adjustments are needed because tagsets identify some similar regions but not all; they never let us know for sure that a putative result is *not* similar to the query region. Thus the raw measures severely understate the actual utility. As an extreme example, imagine that an annotator tagged the exact same region with two different tags, perhaps 'ai-class' and 'favorite professors'. A query with that region could validly return either other ai-class regions or other favorite-professor regions, but the scoring algorithm will pick one of the two tagsets and use it to evaluate all results returned, meaning that half of them will be counted as false alarms, unjustly. Indeed there are a few such multiply-tagged regions in the data, so this is not hypothetical.

In principle we could overcome these problems by having human judges evaluate, post-hoc, the quality of each jump-in point. This would give us explicit judgments of non-similarity, but unfortunately it is not affordable. Therefore we continue to rely on the tagsets, but adjust the raw scores upwards. The adjustments depend on the exact corpus subset and query set. For the testset data, the Searcher Utility Ratio is divided by 0.290, which is the raw score obtained by the clever algorithm. The recall is divided by 0.275, which is the value the clever algorithm would have obtained had it given jump-in points for all queries (not just the 67% it did answer) at the same recall level that it achieved for the ones it did answer (18.4%).
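In code, applying these testset adjustments is just a rescaling so that the clever algorithm's ceiling maps to 1.0. This is a sketch, not score5.py itself, and the raw input scores below are invented for illustration.

```python
# Testset adjustment factors, from the clever reference algorithm.
UTILITY_CEILING = 0.290   # raw searcher-utility ratio of the clever algorithm
RECALL_CEILING = 0.275    # roughly 0.184 / 0.67: its recall on the answered
                          # queries, extrapolated to all queries

def adjusted_scores(raw_utility, raw_recall):
    """Scale raw scores so the clever algorithm's ceiling maps to 1.0."""
    return raw_utility / UTILITY_CEILING, raw_recall / RECALL_CEILING

# Hypothetical raw system scores, for illustration only:
adj_utility, adj_recall = adjusted_scores(0.145, 0.110)
```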
3. Explanation of Measures

While the F-measure is the primary measure of system quality, there are other measures which can help in understanding the various strengths of the systems, and some of these are included in the output of score5.py.

The "naive precision" is the fraction of jump-in points that matched a region in the same tagset as the query, without being converted to seconds and without penalties for jump-in points being early or late.

The "average seconds early" and "average seconds late" are reported because it is helpful to know where the jump-in points are falling: it is better for a jump-in point to fall before a target-region onset than after it, and better for it to be close to the region onset than far away.

The raw recall figure here is, as described in the task-overview paper, not the traditional fraction of total relevant segments retrieved, but the number of seconds of relevant data that a (simulated) user could get from these jump-in points within the 120-second per-query time, divided by the total number of seconds that an ideal system could deliver in that time.

Finally, the F-measure is computed. While the searcher utility ratio by itself is probably the most meaningful measure, a system could score very highly on this metric by generating only one jump-in point for one query, which would not be very useful in most scenarios. Accordingly the recall factor is also incorporated in the final score, combined with utility as an F-measure. However, utility is the most important component, so the F-measure is weighted to favor it 9 to 1.
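One plausible reading of this 9-to-1 weighting is a weighted harmonic mean of the two components (the classic weighted-F form); the exact formula used by score5.py may differ, so treat this as a sketch.

```python
def weighted_f(utility, recall, utility_weight=0.9):
    """Weighted harmonic mean of adjusted utility and recall, favoring
    utility 9 to 1.  A sketch of the combination described above; the
    exact form in score5.py may differ."""
    if utility <= 0.0 or recall <= 0.0:
        return 0.0
    return 1.0 / (utility_weight / utility + (1.0 - utility_weight) / recall)

# With equal components the weighting is invisible:
balanced = weighted_f(0.5, 0.5)          # 0.5
# A high-utility, low-recall system scores far better than the reverse,
# reflecting the intent that utility dominate the final score:
utility_heavy = weighted_f(0.8, 0.1)     # about 0.47
recall_heavy = weighted_f(0.1, 0.8)      # about 0.11
```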