Using Prosody to Spot Location Mentions

Speech Prosody 2020.

Gerardo Cervantes, Nigel G. Ward

Abstract: Identifying location mentions in speech is important for many information retrieval and information extraction tasks. Most commonly this is done with speech recognition and a gazetteer, but here we explore the value of prosody for this task. While previous work has explored the use of prosody for spotting named entities, including locations, the specific value of prosody for finding locations in spontaneous speech is not known. Using the Switchboard corpus and LSTM modeling we obtain modest performance. Further, we identify specific prosodic features and configurations that tend to mark locations in American English.

paper, code, video, thesis

Supplementary Information

The figure shows the correlations of various features with the presence (1/0) of a location mention overlapping the frame at 0 milliseconds. The drawing conventions are explained in detail in Chapter 9 of Prosodic Constructions in English Conversation. For each curve, the thin black horizontal line represents a correlation of 0; values above it are positive correlations and values below it are negative. The y axis is not shown, but the highest positive correlation was .027, for the strength of evidence for narrow pitch (indirectly represented in the "pitch range" curve) over the region from 1600 to 800 milliseconds before the frame in question, and the lowest was -.026, for the strength of evidence for syllabic lengthening from 200 milliseconds before the frame in question up to that frame. "Lengthening" is estimated from the cepstral flux, and this feature is also an ambiguous indicator of articulatory precision, so the "low lengthening" before and around location frames may also indicate articulatory precision in that region. The full listing of correlations is here.
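As an illustration of the kind of statistic plotted in the figure, the following is a minimal sketch, not the code used for the paper. It computes the Pearson correlation between a per-frame feature, averaged over a window at a given offset, and the 1/0 location label of the frame at 0 milliseconds. The function names, the 10-millisecond frame step, and the use of a window mean are all assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(xs)
    if n < 2:
        return math.nan
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan  # a constant sequence has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def window_correlation(feature, labels, start_ms, end_ms, frame_ms=10):
    """Correlate the mean of `feature` over [t+start_ms, t+end_ms] with labels[t].

    `feature` and `labels` are per-frame sequences of equal length;
    `labels` holds 1 for frames overlapping a location mention, else 0.
    Negative offsets refer to frames before frame t.
    frame_ms=10 is an assumed frame step, not taken from the paper.
    """
    lo = start_ms // frame_ms
    hi = end_ms // frame_ms
    n = len(feature)
    xs, ys = [], []
    for t in range(n):
        a, b = t + lo, t + hi
        if a < 0 or b >= n:
            continue  # skip frames whose window falls outside the recording
        window = feature[a:b + 1]
        xs.append(sum(window) / len(window))
        ys.append(labels[t])
    return pearson(xs, ys)
```

Under these assumptions, a call such as `window_correlation(narrow_pitch, labels, -1600, -800)` would correspond to the region from 1600 to 800 milliseconds before the frame, where `narrow_pitch` is a hypothetical per-frame feature track.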

Here are two audio clips, each centered on a frame that closely matches this pattern. One is a location mention and the other is not (under our definition, although it is a deictic expression).

  • ... Red River ... The most location-like frame in a 4-minute conversation is on the second syllable of this phrase, at 119.53 seconds in conversation sw02280.
  • ...down here... The most location-like frame in a 10-minute conversation is on the second syllable of this phrase, at 302.09 seconds in conversation sw02316.
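Picking out the most location-like frame of a conversation and reporting its time can be sketched as follows. This assumes a list of per-frame scores is already available from some model or pattern matcher; the function name and the 10-millisecond frame step are hypothetical and are not taken from the paper.

```python
def most_location_like_frame(scores, frame_ms=10):
    """Return (frame_index, time_in_seconds) of the highest-scoring frame.

    `scores` holds one location-likeness value per frame.
    frame_ms=10 is an assumed frame step.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, best * frame_ms / 1000.0
```

With a 10-millisecond step, a best frame index of 11953 would map to 119.53 seconds.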

