Abstract

We present Char2Wav, an end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoder-decoder model with attention. Its encoder is a bidirectional recurrent neural network (RNN) that accepts text or phonemes as input, and its decoder is an RNN with attention that produces vocoder acoustic features. The neural vocoder is a conditional extension of SampleRNN that generates raw waveform samples from these intermediate representations. Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.
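
As a rough illustration of what one attention step in the reader's decoder could look like, here is a minimal NumPy sketch. It uses plain dot-product attention as a stand-in, since the abstract does not say which attention mechanism Char2Wav uses. The function names, the dimensions, and the random inputs are all hypothetical and are not taken from the Char2Wav code.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(encoder_states, decoder_state):
    """One dot-product attention step (an assumed stand-in mechanism).

    encoder_states: (T, d) array, one row per input character or phoneme.
    decoder_state:  (d,) array, the decoder RNN's current hidden state.
    Returns (weights, context) with shapes (T,) and (d,).
    """
    scores = encoder_states @ decoder_state   # similarity per input position
    weights = softmax(scores)                 # normalized over input positions
    context = weights @ encoder_states        # weighted sum of encoder states
    return weights, context

# Hypothetical sizes: 5 input symbols, hidden size 8.
rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((5, 8))
decoder_state = rng.standard_normal(8)
weights, context = attention_step(encoder_states, decoder_state)
```

In a full reader, the context vector would be fed to the decoder RNN together with its previous output to predict the next frame of vocoder features, and those features would then condition the neural vocoder.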

Results

Comments

For now, we only have samples for Char2Wav in Spanish. We will complete the website with more models as they finish training. Unfortunately, we don't have unlimited GPU power :'(
Samples are not cherry-picked: we select 10 random sentences from the test set, which the model has never seen.
Some of the samples fail to complete (check the last sample from Blizzard). Our intuition is that this is caused by a failure of the attention mechanism. In general, this model is hard to train and requires a few tricks. More details are coming soon.

Char2Wav: End-to-End Speech Synthesis

Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio

Mexican Spanish

Dimex-100

English

VCTK

Blizzard

German

Pavoque