Similar Segments of Social Speech Task
GUIDE TO INTERPRETING THE METRICS
Nigel Ward, July 30, 2013

The basic evaluation philosophy for this task is described in the workshop task-overview paper, The Similar Segments in Social Speech Task. In short, the idea is to use a simulation of user behavior to indicate how useful any similar-region-suggesting system would be, and a single overall quality metric is proposed. Please refer to that paper for the description. These notes discuss the realism and stability of the main measures, explain the normalization factors, and describe all the various measures.

1. Validation, Stability, and Region-Length Effects

To validate the computation of these metrics, I created various small test sets and verified that the computations worked as intended. I also tested both a random baseline and a "clever" reference system, described below, on both the trainingset queries and the testset queries, and I varied a few of the evaluation parameters.

Across these experiments, the only unexpected influence of importance was region length. For example, the random algorithm did surprisingly well on the trainingset data, mostly because many of the tagged regions were very long (as discussed in the Annotation Notes document) and some of the tagsets were quite large. For example, a tag like "entertainment" could apply to a large fraction of the entire corpus. Such general tags were not envisaged when the task was designed; however, they do not seem unrealistic or inappropriate. Longer regions work slightly to the advantage of the random algorithm, in that points selected at random tend to fall more often in longer regions than in short ones, and both the coverage and benefit values increase to the extent that longer content is found.
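The pull that long regions exert on a random baseline is easy to see with a quick simulation. This is a sketch for illustration only, not part of the scoring code; the corpus length and region boundaries below are invented. The point is simply that a uniformly random jump-in point lands in a region with probability proportional to that region's length.

```python
import random

# Illustration: random jump-in points hit a region in proportion to its
# length, so long regions systematically favor a random baseline.
rng = random.Random(0)
CORPUS_LEN = 3600.0                  # one hour of audio (invented)
long_region = (0.0, 300.0)           # a 300-second region
short_region = (1000.0, 1030.0)      # a 30-second region

def hit_rate(region, trials=200_000):
    """Fraction of uniformly random points that land inside the region."""
    start, end = region
    hits = sum(start <= rng.uniform(0, CORPUS_LEN) < end for _ in range(trials))
    return hits / trials

long_rate = hit_rate(long_region)    # close to 300/3600
short_rate = hit_rate(short_region)  # close to 30/3600
# The long region attracts roughly ten times as many random points.
```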
Although this does not seem entirely inappropriate, it reduces the incentive for systems to find many similar regions, and instead lets them score well by finding just one very long region. Fortunately there were fewer such very-long regions in the testset, so this issue is moot.

The prevalence of longer regions did lead us to abandon the "scan-back" action included in the original user-behavior simulation. This modeled the idea that a user encountering a jump-in point which took her to the middle of a useful region might then seek back to find the start of that region and listen from the start. For long regions and/or regions with diffuse content, like "entertainment", this seemed unrealistic; a user would probably just listen from the jump-in point to the end, and then go on to the next jump-in point, to find reasonable content with less hassle. The user-behavior simulator was simplified accordingly.

2. Adjustment Factors

To estimate the maximum achievable performance, I built a "clever" reference algorithm that uses the information in the other tagsets. Specifically, given a query region, the algorithm looks through all regions in all other tagsets and finds the region which overlaps the query most closely. It then returns, as jump-in points, the onsets of all other regions in the same tagset as this strongly-overlapping region. If there are fewer than 20 such regions, it does a new scan to find the next most overlapping region, and uses the regions in its tagset to generate more jump-in points. It continues until 20 jump-in points have been generated or until there are no regions whose amount of overlap is 40% or more of the sum of the durations of the overlapping pair of regions. Thus this algorithm exploits the information provided by the other taggers. Of course, no realistic similarity system would have access to such information; however, the algorithm is useful for estimating the upper bound on system performance.
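The clever algorithm's selection loop can be sketched as follows. This is a simplified reconstruction from the description above, not the actual reference implementation; the data layout, with each tagset given as a list of (start, end) pairs in seconds, is an assumption made for the example.

```python
def duration(region):
    start, end = region
    return end - start

def overlap(a, b):
    """Seconds of overlap between two (start, end) regions."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def clever_jump_ins(query, tagsets, max_points=20, min_ratio=0.40):
    """Repeatedly pick the not-yet-used tagset holding the region that most
    strongly overlaps the query, and emit the onsets of that tagset's other
    regions as jump-in points (a sketch of the reference algorithm)."""
    points, used_tags = [], set()
    while len(points) < max_points:
        best_tag, best_region, best_ov = None, None, 0.0
        for tag, regions in tagsets.items():
            if tag in used_tags:
                continue
            for region in regions:
                ov = overlap(query, region)
                # overlap must be >= 40% of the summed durations of the pair
                if ov >= min_ratio * (duration(query) + duration(region)) and ov > best_ov:
                    best_tag, best_region, best_ov = tag, region, ov
        if best_tag is None:      # no remaining region overlaps strongly enough
            break
        used_tags.add(best_tag)
        for region in tagsets[best_tag]:
            if region != best_region and len(points) < max_points:
                points.append(region[0])
    return points
```

For instance, if the query coincides with one region of a tagset, the onsets of that tagset's other regions come back as the jump-in points.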
In particular, even on the trainingset, replete with long regions, this algorithm attained a raw F-measure of only .43, clearly a long way from 1.00. Thus, as noted in the task-overview paper, adjustments are needed. In essence, adjustments are needed because tagsets identify some similar regions but not all; they never let us know for sure that a putative result is *not* similar to the query region. Thus the raw measures severely understate the actual utility. As an extreme example, imagine that an annotator tagged the exact same region with two different tags, perhaps 'ai-class' and 'favorite professors'. A query with that region could validly return either other ai-class regions or other favorite-professor regions, but the scoring algorithm will pick one of the two tagsets and use it to evaluate all results returned, meaning that half of them will be counted as false alarms, unjustly. Indeed there are a few such multiply-tagged regions in the data, so this is not hypothetical.

In principle we could overcome these problems by having human judges evaluate, post-hoc, the quality of each jump-in point. This would give us explicit judgments of non-similarity, but unfortunately it is not affordable. Therefore we continue to rely on the tagsets, but adjust the raw scores upwards. The adjustments depend on the exact corpus subset and query set. For the testset data, the Searcher Utility Ratio is divided by 0.290, which is the raw score obtained by the clever algorithm. The recall is divided by 0.275, which is the value the clever algorithm would have obtained had it given jump-in points for all queries (not just the 67% it did answer) at the same recall level that it achieved for the ones it did answer (18.4%).
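In code, applying these testset adjustments is just a rescaling so that the clever algorithm's ceiling maps to 1.0. This is a sketch, not score5.py itself, and the raw input scores below are invented for illustration.

```python
# Testset adjustment factors, from the clever reference algorithm.
UTILITY_CEILING = 0.290   # raw searcher-utility ratio of the clever algorithm
RECALL_CEILING = 0.275    # roughly 0.184 / 0.67: its recall on the answered
                          # queries, extrapolated to all queries

def adjusted_scores(raw_utility, raw_recall):
    """Scale raw scores so the clever algorithm's ceiling maps to 1.0."""
    return raw_utility / UTILITY_CEILING, raw_recall / RECALL_CEILING

# Hypothetical raw system scores, for illustration only:
adj_utility, adj_recall = adjusted_scores(0.145, 0.110)
```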
3. Explanation of Measures

While the F-measure is the primary measure of system quality, there are other measures which can help in understanding the various strengths of the systems, and some of these are included in the output of score5.py.

The "naive precision" is the fraction of jump-in points that matched a region in the same tagset as the query, without being converted to seconds and without penalties for jump-in points being early or late.

The "average seconds early" and "average seconds late" are reported because it is helpful to know where the jump-in points are falling: it is better for a jump-in point to fall before a target-region onset than after it, and better for it to be close to the region onset than far away.

The raw recall figure here is, as described in the task-overview paper, not the traditional fraction of total relevant segments retrieved, but the number of seconds of relevant data that a (simulated) user could get from these jump-in points within the 120-second per-query time, divided by the total number of seconds that an ideal system could deliver in that time.

Finally, the F-measure is computed. While the searcher utility ratio by itself is probably the most meaningful measure, a system could score very highly on this metric by generating only one jump-in point for one query, which would not be very useful in most scenarios. Accordingly the recall factor is also incorporated in the final score, combined with utility as an F-measure. However, utility is the most important component, so the F-measure is weighted to favor it 9 to 1.
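One plausible reading of this 9-to-1 weighting is a weighted harmonic mean of the two components (the classic weighted-F form); the exact formula used by score5.py may differ, so treat this as a sketch.

```python
def weighted_f(utility, recall, utility_weight=0.9):
    """Weighted harmonic mean of adjusted utility and recall, favoring
    utility 9 to 1.  A sketch of the combination described above; the
    exact form in score5.py may differ."""
    if utility <= 0.0 or recall <= 0.0:
        return 0.0
    return 1.0 / (utility_weight / utility + (1.0 - utility_weight) / recall)

# With equal components the weighting is invisible:
balanced = weighted_f(0.5, 0.5)          # 0.5
# A high-utility, low-recall system scores far better than the reverse,
# reflecting the intent that utility dominate the final score:
utility_heavy = weighted_f(0.8, 0.1)     # about 0.47
recall_heavy = weighted_f(0.1, 0.8)      # about 0.11
```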