Generative Modelling of Sequential Data

Abstract

Autoregressive convolutional neural networks such as WaveNet are powerful models that have recently achieved state-of-the-art results on both text-to-speech tasks and language modelling. In spite of this, they have so far been unable to generate coherent speech samples when learnt from audio alone. The original configuration of WaveNet uses repeated blocks of dilated convolutions to reach a receptive field of 300 ms. In this work, we test hypotheses about the role of WaveNet's receptive field in learning to generate coherent speech unconditionally, i.e. without auxiliary signals such as text. We also examine the usefulness of the learned representations for the downstream task of automatic speech recognition. By transforming the input data into stacks of multiple audio samples per timestep, we increase the receptive field to up to 5 seconds. We find that enlarging the receptive field alone is insufficient to generate coherent samples. We also provide evidence that WaveNets create representations of speech that are helpful in downstream tasks. Finally, we find that WaveNets lack the capability to model natural language and argue that this is the limiting factor for direct speech generation.
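The receptive-field figures quoted above can be sketched with a small calculation. This is a hedged sketch, not the paper's code: it assumes the standard WaveNet configuration (kernel size 2, dilations doubling from 1 to 2^(L-1) within each of N blocks) and a 16 kHz sample rate, and it reads the S/N/L/C notation used in the sample lists below as stack size, number of blocks, layers per block, and channels, which is an assumption on my part:

```python
def receptive_field_samples(n_blocks, n_layers, kernel_size=2):
    """Receptive field of a stack of dilated causal convolutions, in timesteps.

    Assumes each block uses dilations 1, 2, 4, ..., 2**(n_layers - 1),
    as in the original WaveNet.
    """
    dilation_sum = n_blocks * (2 ** n_layers - 1)
    return (kernel_size - 1) * dilation_sum + 1


def receptive_field_seconds(n_blocks, n_layers, stack_size=1, sample_rate=16000):
    """Receptive field in seconds of audio.

    Stacking `stack_size` raw audio samples into one model timestep
    multiplies the temporal span covered by the same architecture.
    """
    return receptive_field_samples(n_blocks, n_layers) * stack_size / sample_rate


# N=5 blocks of L=10 layers, no stacking: ~0.32 s, the ~300 ms baseline.
print(receptive_field_seconds(5, 10, stack_size=1))
# Same architecture with S=16 samples per timestep: ~5.1 s.
print(receptive_field_seconds(5, 10, stack_size=16))
```

Under these assumptions, the S=16 configurations in the lists below are the ones that reach the roughly 5-second receptive field mentioned in the abstract.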

WaveNet Samples:

LibriSpeech

WaveNet (S=8 N=5 L=10 C=32) trained on LibriSpeech 100h-clean - Generated Sample

TIMIT

WaveNet (S=2 N=5 L=10 C=32) trained on TIMIT - Generated Sample
WaveNet (S=2 N=5 L=10 C=64) trained on TIMIT - Generated Sample
WaveNet (S=2 N=5 L=10 C=92) trained on TIMIT - Generated Sample
WaveNet (S=4 N=5 L=10 C=32) trained on TIMIT - Generated Sample
WaveNet (S=4 N=5 L=10 C=64) trained on TIMIT - Generated Sample
WaveNet (S=4 N=5 L=10 C=92) trained on TIMIT - Generated Sample
WaveNet (S=8 N=5 L=10 C=32) trained on TIMIT - Generated Sample
WaveNet (S=8 N=5 L=10 C=64) trained on TIMIT - Generated Sample
WaveNet (S=8 N=5 L=10 C=92) trained on TIMIT - Generated Sample
WaveNet (S=16 N=5 L=10 C=32) trained on TIMIT - Generated Sample
WaveNet (S=16 N=5 L=10 C=64) trained on TIMIT - Generated Sample
WaveNet (S=16 N=5 L=10 C=92) trained on TIMIT - Generated Sample

WaveNet Reconstructions:

LibriSpeech

WaveNet (S=8 N=5 L=10 C=32) trained on LibriSpeech 100h-clean - Reconstruction

TIMIT

WaveNet (S=2 N=5 L=10 C=32) trained on TIMIT - Reconstruction
WaveNet (S=2 N=5 L=10 C=64) trained on TIMIT - Reconstruction
WaveNet (S=2 N=5 L=10 C=92) trained on TIMIT - Reconstruction
WaveNet (S=4 N=5 L=10 C=32) trained on TIMIT - Reconstruction
WaveNet (S=4 N=5 L=10 C=64) trained on TIMIT - Reconstruction
WaveNet (S=4 N=5 L=10 C=92) trained on TIMIT - Reconstruction
WaveNet (S=8 N=5 L=10 C=32) trained on TIMIT - Reconstruction
WaveNet (S=8 N=5 L=10 C=64) trained on TIMIT - Reconstruction
WaveNet (S=8 N=5 L=10 C=92) trained on TIMIT - Reconstruction
WaveNet (S=16 N=5 L=10 C=32) trained on TIMIT - Reconstruction
WaveNet (S=16 N=5 L=10 C=64) trained on TIMIT - Reconstruction
WaveNet (S=16 N=5 L=10 C=92) trained on TIMIT - Reconstruction

Magnus Berg Sletfjerding
Machine Learning Engineer

My research interests include probabilistic models, generative models and causality.
