Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on
Dialogue Generative Spoken Language Model

Abstract:

Recent efforts in spoken dialogue modeling aim to synthesize spoken dialogue without relying on intermediate transcription, thereby preserving the wealth of non-textual information inherent in speech. However, this approach faces a challenge when speakers talk simultaneously: it requires stereo dialogue data with the speakers recorded on separate channels, a notably scarce resource. To address this, we developed a pipeline that transforms single-channel dialogue data into pseudo-stereo data. This expanded our training set from a mere 2,000 hours to 17,600 hours, significantly enriching the diversity and quality of the available training examples. We show that including this pseudo-stereo data improves the performance of spoken dialogue language models. Additionally, we explored using discrete units from different speech foundation models for spoken dialogue generation.
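The core idea of the pseudo-stereo pipeline, once each speaker's activity in the mono recording is known, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-speaker activity masks are assumed to come from an external diarization or source-separation front end, and `make_pseudo_stereo` is a hypothetical helper name.

```python
import numpy as np

def make_pseudo_stereo(mono, spk_a_mask, spk_b_mask):
    """Place each speaker's speech on its own channel.

    mono:       (T,) single-channel waveform
    spk_*_mask: (T,) activity masks (0/1 per sample), assumed to be
                produced by a diarization / separation front end.
    Returns a (T, 2) pseudo-stereo waveform with speaker A on the
    left channel and speaker B on the right channel.
    """
    stereo = np.zeros((mono.shape[0], 2), dtype=mono.dtype)
    stereo[:, 0] = mono * spk_a_mask  # speaker A -> left channel
    stereo[:, 1] = mono * spk_b_mask  # speaker B -> right channel
    return stereo
```

In practice the masking step would operate on separated sources rather than the raw mixture, so that overlapping speech survives on both channels; the sketch only shows the channel-assignment step.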

About this demo page:

Each audio sample is 30 seconds long and in stereo. The first 10 seconds are the prompt, and the remaining 20 seconds are generated by our model.

We evaluate two sets of prompts, taken from either the Fisher corpus or podcasts.


Fisher Prompts


Ground Truth

Train on Fisher

WavLM base+
WavLM large
HuBERT base
HuBERT large*
HuBERT large ft

Train on Fisher and Pseudo-Stereo Audio

WavLM base+
WavLM large
HuBERT base
HuBERT large*
HuBERT large ft

Podcast Prompts


Ground Truth

Train on Fisher

WavLM base+
WavLM large
HuBERT base
HuBERT large*
HuBERT large ft

Train on Fisher and Pseudo-Stereo Audio

WavLM base+
WavLM large
HuBERT base
HuBERT large*
HuBERT large ft


We built this sample page based on the HiFi-GAN demo page.