HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS
Dake Guo1, Xinfa Zhu1, Liumeng Xue2, Tao Li1, Yuanjun Lv1, Yuepeng Jiang1, Lei Xie1
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
1. Abstract
Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with highly dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node to the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we apply hierarchical supervision from acoustic prosody to each node of the graph to capture prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech.
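To make the virtual-global-node idea concrete, below is a minimal sketch in plain PyTorch (not the authors' implementation; the class name, layer choices, and single-round attention aggregation are illustrative assumptions). It appends one learned global node to a sentence's word-node graph, connects it to every word node, and runs attention-weighted message passing so that word nodes and the global node exchange prosodic information.

```python
# Minimal sketch: word-level graph + one virtual global node, updated with a
# single round of attention-weighted message passing. Hypothetical code, not
# the HiGNN-TTS implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphWithGlobalNode(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)       # message transform
        self.attn = nn.Linear(2 * dim, 1)    # edge attention score
        self.global_init = nn.Parameter(torch.zeros(1, dim))  # learned global node

    def forward(self, word_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """
        word_feats: (N, dim) word-node features for one sentence
        adj:        (N, N) 0/1 adjacency among word nodes (e.g., syntax edges)
        returns:    (N + 1, dim) updated features; index N is the global node
        """
        n, _ = word_feats.shape
        # Append the virtual global node and connect it to every word node,
        # strengthening the interconnection of otherwise distant words.
        h = torch.cat([word_feats, self.global_init], dim=0)       # (N+1, dim)
        full = torch.zeros(n + 1, n + 1)
        full[:n, :n] = adj
        full[n, :n] = full[:n, n] = 1.0                            # global <-> all words
        full.fill_diagonal_(1.0)                                   # self-loops
        # Attention-weighted aggregation over connected neighbors.
        src = h.unsqueeze(0).expand(n + 1, -1, -1)                 # sender features
        dst = h.unsqueeze(1).expand(-1, n + 1, -1)                 # receiver features
        scores = self.attn(torch.cat([dst, src], dim=-1)).squeeze(-1)
        scores = scores.masked_fill(full == 0, float("-inf"))      # mask non-edges
        alpha = F.softmax(scores, dim=-1)
        return h + alpha @ self.msg(h)                             # residual update
```

In this sketch the global node's output row plays the role of a sentence-level summary that a contextual attention mechanism could then attend over across neighboring sentences, extending prosody modeling from intra-sentence to inter-sentence scope.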
2. Demos
Compared Methods
- FS2-BERT [1]: a FastSpeech2-based model that incorporates context information through cross-sentence BERT embeddings.
- HCE [2]: a FastSpeech2-based model that uses a Hierarchical Context Encoder (HCE) to predict sentence-level style embeddings from hierarchical context information.
- ATCE [3]: a FastSpeech2-based model with Acoustic and Text Context Encoders (ATCE), which uses both text and audio context to obtain contextual prosody representations.
- HiGNN-TTS: the proposed method.