Labelers Guide, Draft 1, May 22, 2012, Nigel Ward and Karen Richart


Welcome to ISG.  

Our research project is aiming to reveal the details of how people in
dialog interact, and in particular the moment-by-moment variation in
the significance of what they are saying.  

Sometimes what people say is critical, and other times they are
producing ums and ahs that have little meaning ... but not none.

You will be labeling conversations for importance.  Each conversation
is a telephone conversation between two people.  You will hear these
in stereo, with each speaker in a separate track.  Some dialogs have
noise, echo, or bleeding, but just try to ignore it.

You will need to do two things: split each track of the dialog
into segments, and assign an importance label to each segment.

We have created a tool, "dede," that will let you do these things.
[Demonstration of how to use dede.]

Importance labeling is subjective, meaning that your opinions may
differ from ours.  That's okay.  Although we will spot-check some of
your labels, to see how they differ from our judgments, generally we
will be using your judgments as-is, to build our system, so please be
thoughtful.

For us, importance means how important it would be to the listener in
the dialog.  Thus, for example, when you're labeling what the Left
speaker said, think about how important it would be for the listener
in the Right track to hear those words clearly.

Importance includes at least four aspects.

A. conveying content.  For example, if the speaker says he's from
   Dallas, the word "Dallas" is important information, as it's likely
   to come up later in the dialog.  Sometimes you can infer the
   importance from the word itself: for example, the word "is" is
   usually predictable from context and carries little information,
   whereas the word "shortstop" is rarer and generally more
   information-rich.

B. helping the listener predict what will come next.  For
   example, if the speaker says "um", that can indicate that he's
   thinking of a word, so the next word may be a long, important one,
   so the listener should be prepared.  Similarly, if a listener says
   "uh-huh", that's important because it tells the other person that
   it's okay to go on.

C. suggesting to the listener how to respond. For example, if the
   speaker says "Arizona's beautiful" in an enthusiastic tone of voice,
   then it's important for the listener to pick up the implication
   that he should probably express agreement or somehow say something
   on the topic of Arizona.

D. other information.  For example, "Hello" has little meaning, but is
   important for revealing the speaker's gender, age, etc.  Similarly,
   the sound of a child crying in the background doesn't mean
   anything, but helps the listener understand the speaker's situation
   and likely mental state. 

When you make your judgments, listen not only to the words but also
to the way the words are said.  For example, stressed words and words
spoken at higher volume are often the more important ones.

It may help to think about the clarity needed for the listener to
correctly get the meaning of the word.  For example, "three" should be
transmitted clearly so as not to be confused with "free", but "and uh"
probably doesn't need to be that clear.

Also label things that are not words, for example loud inbreaths,
which are often important (by criterion B) as indications that the
speaker is about to start a turn.  Laughter and even coughs etc. may also
have some importance.

Please label on a scale from 0 to 5.  

5 is for unusually important words, for example stressed words or
words that are somehow important to the dialog.  

4 is for most words in fluent speech, for example the word "live" in
"I live in Dallas", which brings some meaning but is not so critical.

3 is for somewhat less important things, for example word repetitions,
as in "I went to, drove to Houston", where "went to" is less
important.  Backchannels (uh-huh) and laughter will usually be at
this level.  Connecting words such as a stretched-out "and" said while
the person decides what to say next may also be at this level.

2 is for even less important things.  For example, many inbreaths
will be at this level.

1 is for things with almost no value, for example background noise.  

0 is usually just pure silence.  If you omit a label for a region it
will automatically be counted as 0.

Finally, if you're unsure at any point, just put a question mark after
the label, for example "4?".  Later on we'll ask you about these, and
maybe refine our descriptions of the levels to make them clearer in
the future.
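
For anyone who later processes the labels, the conventions above (a
0-to-5 scale, an optional trailing question mark, and a missing label
counting as 0) could be read roughly as in the Python sketch below.
This is only an illustration: the name parse_label and the pair it
returns are assumptions made here, not part of dede.

```python
def parse_label(text):
    """Parse a label such as "4" or "4?" into (level, unsure).

    Hypothetical helper, not part of dede.  An empty or missing
    label is treated as level 0, following the rule for omitted labels.
    """
    if text is None or text.strip() == "":
        return 0, False
    text = text.strip()
    # A trailing question mark marks a label the labeler was unsure about.
    unsure = text.endswith("?")
    if unsure:
        text = text[:-1].strip()
    if not text.isdigit():
        raise ValueError(f"not a valid importance label: {text!r}")
    level = int(text)
    if not 0 <= level <= 5:
        raise ValueError("importance level must be between 0 and 5")
    return level, unsure
```

With this sketch, parse_label("4?") gives level 4 flagged as unsure,
and parse_label("") gives level 0.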


Before you assign the importance values, you will need to break up the
speech into regions.  Each region will probably be one or two words,
although if a speaker continues in the same tone of voice, and with
the same content density, then it's okay to have several seconds (up
to about a dozen words) all in one region.  Please align the region
boundaries roughly with word boundaries, to within 30 milliseconds or
so.  Occasionally you may wish to split one word into two regions, for
example if the first syllable is loud and clear and the remainder of
the word is mumbled as if unimportant.
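
As a rough picture of what the finished labels for one track amount
to, the regions can be thought of as a list of (start, end, level)
entries, with any stretch not covered by a region counting as 0.  The
Python sketch below assumes that representation; the tuple layout, the
name importance_at, and the example times are all hypothetical and do
not describe dede's actual file format.

```python
def importance_at(regions, t):
    """Return the importance level at time t (in seconds).

    regions is a list of (start, end, level) tuples for one track;
    this layout is an assumption for illustration only.  Times not
    covered by any region count as 0.
    """
    for start, end, level in regions:
        if start <= t < end:
            return level
    return 0


# Made-up example: two adjacent regions, a gap, then a short region.
left_track = [(0.00, 0.42, 4), (0.42, 0.95, 5), (1.60, 1.85, 2)]
```

Here importance_at(left_track, 0.5) falls in the second region, while
importance_at(left_track, 1.2) falls in the unlabeled gap and so
counts as 0.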


Steps:

First, listen to the entire dialog, to get a sense for what the
speakers are saying.

Second, go through the left track, second by second, splitting the
speech into regions and labeling each one.

Third, do the same thing for the right track.

Then go on to the next file. 


Dede command summary:
  f
  b 
  s
  m 
...