1. Abstract
Recently, language models (LMs) have flourished in natural language processing and computer vision. These models generate high-fidelity text or image outputs and can be extended to various tasks. In contrast, speech generative models still struggle with speech quality and task generalization. In this paper, we present Vec-Tok Speech, an extensible framework that supports multiple speech generation tasks and generates expressive, high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain the acoustic details needed for high-fidelity speech reconstruction, while semantic tokens capture the linguistic content of speech and serve as the input to LMs. Based on the proposed speech codec, Vec-Tok Speech leverages LMs to carry out the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is used to improve the efficiency of the LMs. Vec-Tok Speech can be used for monolingual or cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech enhancement, and beyond. Experiments on 50k hours of speech show that Vec-Tok Speech handles multiple speech generation tasks and performs better than other SOTA models.
2. TTS Demos
2.1 Style prompt and speaker prompt
In this section, we demonstrate the effect of the style prompt and the speaker prompt by generating speech from different prompts.
English
Text: The army found the people in poverty and left them in comparative wealth.
Chinese
Text: 而当下的中国产品不仅追求性价比,更需要对世界有影响力的。
2.2 Comparison with Bark and Vall-E-X
In this section, we compare our system with Bark and Vall-E-X.
2.3 Ablation of BPE encoding
In this section, we demonstrate the effect of BPE, which improves the stability of generation for longer sentences.
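To illustrate how BPE shortens the token sequences an LM has to model, the following is a minimal sketch of greedy BPE merge learning over integer token sequences. It is not the Vec-Tok Speech implementation: the function names, the toy token IDs, and the `first_new_id` parameter are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_pairs(seqs):
    """Count adjacent token pairs across all sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges.

    Returns (merges, encoded_seqs), where each merge is ((left, right), new_id).
    `first_new_id` is assumed to be larger than any ID in the base vocabulary.
    """
    seqs = [list(s) for s in seqs]
    merges = []
    for step in range(num_merges):
        counts = count_pairs(seqs)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # nothing repeats, so further merges would not help
            break
        new_id = first_new_id + step
        merges.append((pair, new_id))
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs


# Hypothetical semantic-token sequences (IDs are made up for illustration).
toy_tokens = [[5, 5, 7, 5, 5, 7], [5, 5, 7, 3], [5, 5, 2]]
merges, encoded = learn_bpe(toy_tokens, num_merges=2, first_new_id=100)
print("merges: ", merges)
print("encoded:", encoded)
print("length: ", sum(map(len, toy_tokens)), "->", sum(map(len, encoded)))
```

Frequently repeated token pairs are collapsed into single units, so the same utterance is represented by fewer LM steps. Shorter sequences are one plausible reason for the improved stability on long sentences shown in this ablation.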
3. Voice Conversion Demos
In this section, we show the voice conversion performance of Vec-Tok Speech.
4. S2ST Demos
In this section, we show the speech-to-speech translation performance of Vec-Tok Speech.
5. Other Applications
In this section, we demonstrate the results of the applications mentioned in Section 3.4 of the paper.
5.1 Speech denoising
5.2 Bandwidth extension
5.3 Speaker de-identification
5.4 Speaker anonymization
6. Speech Reconstruction
In this section, we demonstrate the reconstruction quality of our codec. Note that the source speech is sampled at 16 kHz and the reconstructed speech at 24 kHz.