Generative Modelling of Sequential Data

Abstract

Autoregressive convolutional neural networks such as WaveNet are powerful models that have recently achieved state-of-the-art results on both text-to-speech tasks and language modelling. In spite of this, they have so far been unable to generate coherent speech samples when learnt from audio alone. The original configuration of WaveNet uses repeated blocks of dilated convolutions to reach a receptive field of 300 ms. In this work, we test hypotheses about the role of WaveNet's receptive field in learning to generate coherent speech unconditionally, i.e. without auxiliary signals such as text. We also examine the usefulness of the learned representations for the downstream task of automatic speech recognition. By transforming the input data into stacks of multiple audio samples per timestep, we increase the receptive field to up to 5 seconds. We find that enlarging the receptive field alone is insufficient to generate coherent samples. We also provide evidence that WaveNets create representations of speech that are helpful in downstream tasks. Finally, we find that WaveNets lack the capability to model natural language and argue that this is the limiting factor for direct speech generation.
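The receptive-field figures quoted above can be sketched with a small calculation. This is a hedged sketch, not the paper's code: it assumes the standard WaveNet configuration (kernel size 2, dilations doubling from 1 to 2^(L-1) within each of N blocks) and a 16 kHz sample rate, and it reads the S/N/L/C notation used in the sample lists below as stack size, number of blocks, layers per block, and channels, which is an assumption on my part:

```python
def receptive_field_samples(n_blocks, n_layers, kernel_size=2):
    """Receptive field of a stack of dilated causal convolutions, in timesteps.

    Assumes each block uses dilations 1, 2, 4, ..., 2**(n_layers - 1),
    as in the original WaveNet.
    """
    dilation_sum = n_blocks * (2 ** n_layers - 1)
    return (kernel_size - 1) * dilation_sum + 1


def receptive_field_seconds(n_blocks, n_layers, stack_size=1, sample_rate=16000):
    """Receptive field in seconds of audio.

    Stacking `stack_size` raw audio samples into one model timestep
    multiplies the temporal span covered by the same architecture.
    """
    return receptive_field_samples(n_blocks, n_layers) * stack_size / sample_rate


# N=5 blocks of L=10 layers, no stacking: ~0.32 s, the ~300 ms baseline.
print(receptive_field_seconds(5, 10, stack_size=1))
# Same architecture with S=16 samples per timestep: ~5.1 s.
print(receptive_field_seconds(5, 10, stack_size=16))
```

Under these assumptions, the S=16 configurations in the lists below are the ones that reach the roughly 5-second receptive field mentioned in the abstract.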

WaveNet Samples:

LibriSpeech

WaveNet (S=8 N=5 L=10 C=32) trained on LibriSpeech 100h-clean - Generated Sample

TIMIT

WaveNet (S=2 N=5 L=10 C=32) trained on TIMIT - Generated Sample
WaveNet (S=2 N=5 L=10 C=64) trained on TIMIT - Generated Sample
WaveNet (S=2 N=5 L=10 C=92) trained on TIMIT - Generated Sample
WaveNet (S=4 N=5 L=10 C=32) trained on TIMIT - Generated Sample
WaveNet (S=4 N=5 L=10 C=64) trained on TIMIT - Generated Sample
WaveNet (S=4 N=5 L=10 C=92) trained on TIMIT - Generated Sample
WaveNet (S=8 N=5 L=10 C=32) trained on TIMIT - Generated Sample
WaveNet (S=8 N=5 L=10 C=64) trained on TIMIT - Generated Sample
WaveNet (S=8 N=5 L=10 C=92) trained on TIMIT - Generated Sample
WaveNet (S=16 N=5 L=10 C=32) trained on TIMIT - Generated Sample
WaveNet (S=16 N=5 L=10 C=64) trained on TIMIT - Generated Sample
WaveNet (S=16 N=5 L=10 C=92) trained on TIMIT - Generated Sample

WaveNet Reconstructions:

LibriSpeech

WaveNet (S=8 N=5 L=10 C=32) trained on LibriSpeech 100h-clean - Reconstruction

TIMIT

WaveNet (S=2 N=5 L=10 C=32) trained on TIMIT - Reconstruction
WaveNet (S=2 N=5 L=10 C=64) trained on TIMIT - Reconstruction
WaveNet (S=2 N=5 L=10 C=92) trained on TIMIT - Reconstruction
WaveNet (S=4 N=5 L=10 C=32) trained on TIMIT - Reconstruction
WaveNet (S=4 N=5 L=10 C=64) trained on TIMIT - Reconstruction
WaveNet (S=4 N=5 L=10 C=92) trained on TIMIT - Reconstruction
WaveNet (S=8 N=5 L=10 C=32) trained on TIMIT - Reconstruction
WaveNet (S=8 N=5 L=10 C=64) trained on TIMIT - Reconstruction
WaveNet (S=8 N=5 L=10 C=92) trained on TIMIT - Reconstruction
WaveNet (S=16 N=5 L=10 C=32) trained on TIMIT - Reconstruction
WaveNet (S=16 N=5 L=10 C=64) trained on TIMIT - Reconstruction
WaveNet (S=16 N=5 L=10 C=92) trained on TIMIT - Reconstruction

Magnus Berg Sletfjerding
Machine Learning Engineer

My research interests include probabilistic models, generative models and causality.
