HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS
Dake Guo1, Xinfa Zhu1, Liumeng Xue2, Tao Li1, Yuanjun Lv1, Yuepeng Jiang1, Lei Xie1
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi'an, China
2School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
1. Abstract
Recent advances in text-to-speech, particularly those based on Graph Neural Networks (GNNs), have significantly improved the expressiveness of short-form synthetic speech. However, generating human-parity long-form speech with highly dynamic prosodic variations is still challenging. To address this problem, we expand the capabilities of GNNs with a hierarchical prosody modeling approach, named HiGNN-TTS. Specifically, we add a virtual global node to the graph to strengthen the interconnection of word nodes and introduce a contextual attention mechanism to broaden the prosody modeling scope of GNNs from intra-sentence to inter-sentence. Additionally, we apply hierarchical supervision from acoustic prosody to each node of the graph to capture prosodic variations with a high dynamic range. Ablation studies show the effectiveness of HiGNN-TTS in learning hierarchical prosody. Both objective and subjective evaluations demonstrate that HiGNN-TTS significantly improves the naturalness and expressiveness of long-form synthetic speech.
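To make the virtual-global-node idea concrete, below is a minimal sketch in plain PyTorch (not the authors' implementation; the class name, layer choices, and single-round attention aggregation are illustrative assumptions). It appends one learned global node to a sentence's word-node graph, connects it to every word node, and runs attention-weighted message passing so that word nodes and the global node exchange prosodic information.

```python
# Minimal sketch: word-level graph + one virtual global node, updated with a
# single round of attention-weighted message passing. Hypothetical code, not
# the HiGNN-TTS implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphWithGlobalNode(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)       # message transform
        self.attn = nn.Linear(2 * dim, 1)    # edge attention score
        self.global_init = nn.Parameter(torch.zeros(1, dim))  # learned global node

    def forward(self, word_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """
        word_feats: (N, dim) word-node features for one sentence
        adj:        (N, N) 0/1 adjacency among word nodes (e.g., syntax edges)
        returns:    (N + 1, dim) updated features; index N is the global node
        """
        n, _ = word_feats.shape
        # Append the virtual global node and connect it to every word node,
        # strengthening the interconnection of otherwise distant words.
        h = torch.cat([word_feats, self.global_init], dim=0)       # (N+1, dim)
        full = torch.zeros(n + 1, n + 1)
        full[:n, :n] = adj
        full[n, :n] = full[:n, n] = 1.0                            # global <-> all words
        full.fill_diagonal_(1.0)                                   # self-loops
        # Attention-weighted aggregation over connected neighbors.
        src = h.unsqueeze(0).expand(n + 1, -1, -1)                 # sender features
        dst = h.unsqueeze(1).expand(-1, n + 1, -1)                 # receiver features
        scores = self.attn(torch.cat([dst, src], dim=-1)).squeeze(-1)
        scores = scores.masked_fill(full == 0, float("-inf"))      # mask non-edges
        alpha = F.softmax(scores, dim=-1)
        return h + alpha @ self.msg(h)                             # residual update
```

In this sketch the global node's output row plays the role of a sentence-level summary that a contextual attention mechanism could then attend over across neighboring sentences, extending prosody modeling from intra-sentence to inter-sentence scope.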
2. Demos
Compared Methods
- FS2-BERT [1]: a FastSpeech2-based model that incorporates context information through cross-sentence BERT embeddings.
- HCE [2]: a FastSpeech2-based model that uses a Hierarchical Context Encoder (HCE) to predict sentence-level style embeddings from hierarchical context information.
- ATCE [3]: a FastSpeech2-based model with Acoustic and Text Context Encoders (ATCE), which uses both text and audio context to obtain contextual prosody representations.
- HiGNN-TTS: the proposed method.