ICASSP 2025
Abstract:   We investigate which prosodic features matter most in conveying pragmatic functions. We use the problem of predicting human perceptions of pragmatic similarity among utterance pairs to evaluate the utility of prosodic features of different types. We find evidence that the duration-related features are most important, that pitch-related features are much less important and less adequate, and that complete modeling will require additional acoustic and prosodic features, including nasality and phonetic reduction. These findings can guide future basic research in prosody, and suggest how to improve speech synthesis evaluation, among other applications.
The following utterance pair is one of many where the pitch-only model performed much worse than the all-feature model:
The pitch-only model judged the two clips as very similar, contrary to the human judges' perceptions. In terms of acoustic-prosodic properties, the pitch-only model appears to suffer here most notably from not modeling differences in nasality and speaking-rate variation. In terms of pragmatic functions, it appears insensitive to the negative valence and the turn-hold intention that are present in one clip but not the other.
The following examples are audio pairs which the all-feature model rated much higher than the judges.
Across many such examples, we observe that the all-feature model seems to miss significant differences in: nasality; pause frequency, length, and location; words with vs. without laughter; phonetic form of non-lexical utterances; phonetic reduction such as devoicing; stressing of specific words; vibrato; falsetto; non-lexical sighs; uses of glottal stops; ejectives; strong harmonicity; speaking rate variations; and breathiness.
Conversely, the following examples are pairs which the all-feature model rated much lower than the judges.
These seem to involve individual- and gender-based variant prosodic forms for conveying the same meaning. Other examples often exhibit non-significant differences in pacing and pause placement.
Model | Correlation |
---|---|
Linear Regression | 0.57 |
KNN Regression | 0.67 |
Random Forest Regression | 0.73 |
cosine over selected HuBERT | 0.72 |
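
The correlations in the table above compare model outputs with human similarity judgments. As a minimal sketch of how such a score could be computed for a cosine-based model, the following Python assumes Pearson correlation (the table does not state which coefficient was used) and uses made-up feature vectors and ratings. The names `cosine_similarity` and `pearson`, and all toy values, are illustrative assumptions, not the authors' code or data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data (assumption): each entry is a pair of per-utterance feature vectors.
pairs = [
    ([1.0, 0.0, 2.0], [1.0, 0.1, 1.9]),
    ([1.0, 0.0, 2.0], [0.0, 1.0, 0.5]),
    ([0.5, 0.5, 0.5], [0.4, 0.6, 0.5]),
]
# Toy human similarity ratings (assumption), one per pair.
human_ratings = [4.5, 2.0, 4.0]

# Model prediction for each pair is the cosine similarity of its two vectors.
predicted = [cosine_similarity(a, b) for a, b in pairs]
r = pearson(predicted, human_ratings)
print(f"correlation with human ratings: {r:.2f}")
```

The same `pearson` function could score any of the regression models in the table, given their predicted similarities for each pair.
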

Feature | Importance | Correlation |
---|---|---|
speaking rate | 40.6% | 0.66 |
creakiness | 29.8% | 0.61 |
pitch wideness | 6.4% | 0.47 |
intensity | 4.2% | -0.12 |
peak disalignment | 4.0% | 0.17 |
pitch lowness | 3.8% | 0.25 |
lengthening | 3.4% | 0.36 |
pitch highness | 3.2% | 0.09 |
pitch narrowness | 2.4% | -0.04 |
CPPS | 2.2% | 0.08 |
4 pitch features | 15.8% | 0.41 |
all 10 features | 100.0% | 0.73 |
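
The two summary rows can be checked against the individual rows: the importances of the four features whose names begin with "pitch" sum to 15.8%, and all ten sum to 100.0%. The short Python sketch below reproduces that arithmetic from the table values. Whether the authors derived the "4 pitch features" row by simple summation is an assumption, and the dictionary name `importance` is illustrative.

```python
# Importance values (percent) copied from the table above.
importance = {
    "speaking rate": 40.6,
    "creakiness": 29.8,
    "pitch wideness": 6.4,
    "intensity": 4.2,
    "peak disalignment": 4.0,
    "pitch lowness": 3.8,
    "lengthening": 3.4,
    "pitch highness": 3.2,
    "pitch narrowness": 2.4,
    "CPPS": 2.2,
}

# Select the pitch features by name prefix.
pitch_features = [name for name in importance if name.startswith("pitch ")]

# Round to one decimal to absorb floating-point noise.
pitch_total = round(sum(importance[name] for name in pitch_features), 1)
overall_total = round(sum(importance.values()), 1)

print(f"{len(pitch_features)} pitch features: {pitch_total}%")
print(f"all {len(importance)} features: {overall_total}%")
```
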