Which Prosodic Features Matter Most for Pragmatics?

Nigel G. Ward, Divette Marco, Olac Fuentes

ICASSP 2025

Abstract:   We investigate which prosodic features matter most in conveying pragmatic functions. We use the problem of predicting human perceptions of pragmatic similarity among utterance pairs to evaluate the utility of prosodic features of different types. We find evidence that the duration-related features are most important, that pitch-related features are much less important and less adequate, and that complete modeling will require additional acoustic and prosodic features, including nasality and phonetic reduction. These findings can guide future basic research in prosody, and suggest how to improve speech synthesis evaluation, among other applications.

Paper

Overview Video

English Audio Illustrations

Are Pitch Features Enough?

The following utterance pair is one of many where the pitch-only model performed much worse than the all-feature model:

The pitch-only model judged the two clips as very similar, contrary to the human judges' perceptions. In terms of acoustic-prosodic properties, the model seems to suffer here most saliently from not modeling differences in nasality and in speaking-rate variation. In terms of pragmatic functions, it seems insensitive to the negative valence and the turn-hold intention that are present in one clip but not the other.

Which Features are Missing From Even the Best Model?

The following examples are audio pairs that the all-feature model rated as much more similar than the judges did.

Across many such examples, we observe that the all-feature model seems to miss significant differences in: nasality; pause frequency, length, and location; words with versus without laughter; the phonetic form of non-lexical utterances; phonetic reduction such as devoicing; stressing of specific words; vibrato; falsetto; non-lexical sighs; uses of glottal stops; ejectives; strong harmonicity; speaking rate variations; and breathiness.

What Other Factors are Involved?

Conversely, the following examples are pairs that the all-feature model rated as much less similar than the judges did.

These seem to involve individual- and gender-based variant prosodic forms for conveying the same meaning. Other examples often exhibit non-significant differences in pacing and pause placement.
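One way to surface pairs like those discussed above is to rank utterance pairs by the gap between the model's similarity rating and the human rating. The following is a minimal sketch only, not the procedure used in the paper: the record layout, the function name `split_by_residual`, the threshold, and every score are hypothetical, and it assumes model and human ratings are on the same scale.

```python
# Hypothetical records: (pair_id, model_similarity, human_similarity).
# All values are made up for illustration; none come from the paper.
pairs = [
    ("pair_a", 4.2, 2.0),
    ("pair_b", 1.5, 3.8),
    ("pair_c", 3.0, 3.1),
    ("pair_d", 4.5, 4.4),
]


def split_by_residual(records, threshold=1.0):
    """Return (overrated, underrated) pair ids, largest disagreement first.

    A pair is "overrated" when the model's rating exceeds the human rating
    by at least `threshold`, and "underrated" in the opposite case.
    """
    residuals = [(pid, model - human) for pid, model, human in records]
    over = sorted((r for r in residuals if r[1] >= threshold),
                  key=lambda r: -r[1])
    under = sorted((r for r in residuals if r[1] <= -threshold),
                   key=lambda r: r[1])
    return [pid for pid, _ in over], [pid for pid, _ in under]


overrated, underrated = split_by_residual(pairs)
print("model much higher than judges:", overrated)
print("model much lower than judges:", underrated)
```

Listening to the pairs at the top of each list is then a manual step; the code only selects candidates.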

Spanish (Results that didn't Fit into the Paper)

Feature Importance

Table 3: Pearson’s correlation between each model’s predictions and the human judgments.
Model                        Correlation
Linear Regression            0.57
KNN Regression               0.67
Random Forest Regression     0.73
cosine over selected HuBERT  0.72
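
The evaluation metric in Table 3 can be illustrated with a short sketch. The code below is not the authors' evaluation script; it only shows, in plain Python, how Pearson's correlation between predicted and human similarity ratings is computed, along with a cosine similarity between two feature vectors of the kind the last row of the table refers to. The function names and all input values are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / norm


# Made-up ratings for five utterance pairs (illustration only).
model_predictions = [1.2, 2.8, 3.1, 4.0, 4.6]
human_judgments = [1.0, 3.0, 2.5, 4.5, 4.0]
print(round(pearson(model_predictions, human_judgments), 2))
```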

Table 4: Feature types, ordered by importance in the random forest regression model, together with the correlation achieved by a model using only features of that type.
Feature            Importance  Correlation
speaking rate      40.6%       0.66
creakiness         29.8%       0.61
pitch wideness     6.4%        0.47
intensity          4.2%        -0.12
peak disalignment  4.0%        0.17
pitch lowness      3.8%        0.25
lengthening        3.4%        0.36
pitch highness     3.2%        0.09
pitch narrowness   2.4%        -0.04
CPPS               2.2%        0.08
4 pitch features   15.8%       0.41
all 10 features    100.0%      0.73
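
As a consistency check on Table 4, the per-type importances can be summed directly. The sketch below uses only the importance values from the table. It assumes that the "4 pitch features" row covers the four types whose names begin with "pitch", which is inferred from the names and is consistent with the reported 15.8%.

```python
# Importance percentages copied from Table 4.
importance = {
    "speaking rate": 40.6,
    "creakiness": 29.8,
    "pitch wideness": 6.4,
    "intensity": 4.2,
    "peak disalignment": 4.0,
    "pitch lowness": 3.8,
    "lengthening": 3.4,
    "pitch highness": 3.2,
    "pitch narrowness": 2.4,
    "CPPS": 2.2,
}

# Assumption: the pitch group is the four types named "pitch ...".
pitch_total = sum(v for name, v in importance.items()
                  if name.startswith("pitch"))
overall_total = sum(importance.values())

# Round to one decimal place to absorb floating-point error.
print(round(pitch_total, 1))
print(round(overall_total, 1))
```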