Speech Synthesis Workshop 2025, to appear
Abstract: We aim to improve the suitability of speech synthesis output for applications that are situated, embodied, and/or involve rich user interaction. For such purposes, better control of prosody is a priority. Basic research on prosody has found that voice quality features, notably creakiness and breathiness, and probably also nasality, play central roles in conveying various pragmatic functions. This paper investigates the extent to which proper control of these three features can improve the perceived suitability of synthesized speech. Participants used the voice conversion tool VoiceQualityVC to make fine-grained adjustments to parameters affecting perceived voice quality and nasality. Working with utterances taken from a corpus of collaborative gameplay, they were able to modify synthesized speech to better match how they thought it should sound. A subsequent perception experiment showed that these adjusted utterances were rated as more suitable than the baseline. These findings demonstrate both the potential value and the feasibility of exploiting more prosody-related parameters in speech synthesis.
Original and Manipulated Audio Examples: Clips (842KB), Description
Appendix: Audio of the Examples (1:25)