
Neural Codec Language Models and Non-Autoregressive Models Explained

DATE POSTED: December 16, 2024
Table of Links

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

2 RELATED WORK

2.1 Neural Codec Language Models

Conventional sequence-to-sequence autoregressive TTS models, such as Tacotron [79], paved the way for modern speech synthesis. TransformerTTS [53] first adopted a Transformer network for TTS, and VTN [28] likewise uses a Transformer network for VC. However, these autoregressive models suffer from slow inference and a lack of robustness, owing to the challenge of aligning text and acoustic representations and the difficulty of predicting a continuous acoustic representation. Recently, neural audio codec models [16], [89] have replaced conventional acoustic representations with highly compressed audio codes from which the original waveform can be reconstructed. Vall-E [78] was the first neural codec language model for speech synthesis, combining discrete audio units with language models. By scaling the dataset up to 60,000 h, Vall-E could perform in-context learning using a neural audio codec. However, it shares the limitations of autoregressive TTS models, namely slow inference and a lack of robustness. Furthermore, such models depend heavily on their pre-trained neural audio codec, resulting in low-quality audio. To overcome this limitation, high-quality neural audio codec models, such as HiFi-Codec [85] and DAC [41], have been investigated. In addition, SPEAR-TTS [31] and Make-A-Voice [23] introduced a hierarchical speech synthesis framework from semantic to acoustic tokens to reduce the gap between text and speech. Moreover, to reduce inference time and improve the robustness of autoregressive methods, SoundStorm [6] proposed parallel audio generation methods that generate the tokens of a neural audio codec. UniAudio [86] presented a multi-scale Transformer architecture to reduce the computational complexity of long audio sequences.
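To make the notion of a "discrete audio unit" concrete, the following is a minimal sketch of nearest-neighbor vector quantization: each continuous frame is replaced by the index of its closest codebook vector, and decoding looks the vector up again. The codebook, the frame values, and the function names are hypothetical and chosen for illustration only; this is not the quantizer of any codec cited above, which learn their codebooks and are considerably more elaborate.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(frames: Sequence[Sequence[float]],
           codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous frame to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def decode(tokens: Sequence[int],
           codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each discrete token by its codebook vector (lossy reconstruction)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: three 2-D codebook entries and three frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

tokens = encode(frames, codebook)          # discrete units a language model could predict
reconstructed = decode(tokens, codebook)   # approximate frames recovered from the tokens
```

A language model trained on such token sequences predicts integers rather than continuous spectra, which is the property the codec language models above rely on; the reconstruction quality is bounded by the codebook, which is why a weak codec limits audio quality.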

2.2 Non-autoregressive Models

For fast and robust speech synthesis, FastSpeech [68] introduced a duration predictor to synthesize speech in parallel, which significantly improved robustness by addressing failure modes of autoregressive models such as repeating and skipping words. To alleviate the one-to-many mapping problem in non-autoregressive speech synthesis, FastSpeech 2 [67] adopted a variance adaptor that reflects pitch and energy information. However, these models require an external duration extractor to align text and speech. Glow-TTS [34] introduced monotonic alignment search and a normalizing flow to learn the text-speech alignment and train the TTS model simultaneously; it also intersperses a blank token between phoneme tokens to increase robustness. VITS [35] combined the TTS model and a neural vocoder using a VAE in an end-to-end TTS framework, with the aim of improving the quality of synthetic speech. NaturalSpeech [75] achieved human-level quality in single-speaker TTS by introducing a bidirectional normalizing flow and adopting differentiable duration modeling and phoneme pre-training. Moreover, HierSpeech [48] leveraged a self-supervised speech representation in end-to-end speech synthesis, which significantly reduced the information gap between text and speech and thus addressed speech mispronunciations. In addition, HierVST [45] utilized a hierarchical VAE for zero-shot voice style transfer, which significantly improved voice style transfer performance in end-to-end speech synthesis models without any labels. ZS-TTS [37], WavThruVec [71], and VQTTS [17] utilized a self-supervised speech representation as an intermediate acoustic representation for robust speech synthesis. NANSY++ [9] introduced a unified speech synthesizer for various voice applications such as TTS, VC, singing voice synthesis, and voice control. Some studies [26] have combined parallel TTS with LLM-based prosody modeling for expressive speech synthesis.
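Two of the mechanisms mentioned above are easy to illustrate in isolation: expanding phonemes by their durations so that all output frames can be produced in parallel, and interspersing a blank token between phoneme tokens. The sketch below is a simplified illustration under stated assumptions, not the implementation of any cited model: the function names, the `"_"` blank symbol, and the example durations are hypothetical, and placing a blank at the start and end of the sequence is an assumption of this sketch.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def intersperse_blank(tokens: Sequence[T], blank: T) -> List[T]:
    """Insert a blank token around and between the given tokens."""
    out: List[T] = [blank]
    for tok in tokens:
        out.extend([tok, blank])
    return out


def length_regulate(tokens: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each token by its duration (in frames) to get a frame-level sequence."""
    if len(tokens) != len(durations):
        raise ValueError("tokens and durations must have the same length")
    frames: List[T] = []
    for tok, dur in zip(tokens, durations):
        frames.extend([tok] * dur)
    return frames


# Hypothetical example: two phonemes with made-up durations of 2 and 3 frames.
phonemes = ["HH", "AY"]
with_blanks = intersperse_blank(phonemes, "_")
frames = length_regulate(phonemes, [2, 3])
```

Once the frame-level sequence is fixed by the durations, every output frame can be computed independently of the others, which is what removes the step-by-step decoding loop of autoregressive models.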


:::info This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

:::

:::info Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).

:::
