Synthesizing Realistic Audio using WaveGANs

Dev Parikh
14 min read · Aug 8, 2022


An image of an audio waveform.

Audio is one of the most complex forms of representing information: it varies in loudness, duration, and timing. Audio is also one of the most important forms of communication around the world, mainly through speech but also through music. As Machine Learning progresses at a rapid rate, making advancements in Natural Language Processing to understand text and in Computer Vision to understand visual data, it’s important that we also create ML models that understand and create audio.

That’s exactly what I am going to talk about in this article: a project that synthesizes audio using a variation of Deep Convolutional Generative Adversarial Networks (DCGANs) known as WaveGAN.

Understanding what GANs are and how they function:

GAN stands for Generative Adversarial Network, a model made up of 2 networks: the Generator and the Discriminator. The Generator takes an input (I will explain what the input is later) and attempts to replicate the real dataset, while the Discriminator attempts to differentiate between the Generator’s output and the real data that is passed to it.

A representation of what a GAN looks like.

The random input vector is a latent vector drawn from a certain probability distribution. In the case of WaveGAN, that distribution is a uniform distribution over [-1, 1]. Latent vectors are sampled from this space, and throughout training, feedback from the loss function shapes how the Generator maps these noise vectors into outputs.

This feedback comes from WGAN-GP (the loss function used by WaveGAN), which uses the Wasserstein Distance and aims to minimize the distance between the generated data distribution and the real data distribution.
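As a tiny illustration (my own sketch, not code from the project), sampling a batch of latent vectors from this uniform prior looks like the snippet below; the batch size and the 100-dimensional latent space are assumptions chosen for the example:

```python
import numpy as np

# Illustrative sizes only: 64 latent vectors, each 100-dimensional.
batch_size = 64
latent_dim = 100

# Each latent vector is drawn from a uniform distribution over [-1, 1].
z = np.random.uniform(low=-1.0, high=1.0, size=(batch_size, latent_dim))
print(z.shape)  # (64, 100)
```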

The Generator’s goal is to create outputs that fool the Discriminator into thinking they are real data, and the Discriminator’s goal is to differentiate between real and generated data as reliably as possible.

As the name suggests, the 2 networks compete with each other in a zero-sum game, each trying to become better than the other. Each network’s loss function is designed to improve it at the other’s expense.

How the training process for GANs works:

GANs are trained in an unsupervised manner, which means there are no labels on the training data. The Generator and Discriminator both learn over many iterations, with optimization applied to both models so that they improve throughout this process.

The exact training procedure for the Generator and Discriminator differs slightly between implementations. It’s important that this training process remains as stable as possible, because the networks are designed such that a large imbalance in their accuracies will cause the model to collapse.

The image above shows the Discriminator being updated each time the Generator is updated. But in many GAN variants there can be 5 updates to the Discriminator for every update to the Generator, and this is the setting WaveGAN uses.

The parameter on the # of updates for D(Discriminator) for every G(Generator) update.

You might think that this would cause an imbalance in the accuracies of the networks, but because the model uses Wasserstein GAN + Gradient Penalty, a better Discriminator (more updates = better predictions) leads to a better Generator: the Discriminator provides more of the gradient information that is critical for the Generator to produce realistic outputs.
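To make this update schedule concrete, here is a minimal PyTorch sketch of the training loop with hypothetical stand-in networks and a simplified WGAN-style loss (the gradient penalty is omitted here); it is not the project’s actual training code, only an illustration of the 5-to-1 update ratio:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in networks, only to show the update schedule.
latent_dim, audio_len = 100, 16384
G = nn.Sequential(nn.Linear(latent_dim, audio_len), nn.Tanh())
D = nn.Sequential(nn.Linear(audio_len, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))

n_critic = 5      # Discriminator updates per Generator update
batch_size = 16

def sample_real_batch():
    # Placeholder for loading a batch of real audio clips.
    return torch.randn(batch_size, audio_len)

for step in range(100):
    # --- Update the Discriminator n_critic times ---
    for _ in range(n_critic):
        real = sample_real_batch()
        z = torch.rand(batch_size, latent_dim) * 2 - 1   # Uniform(-1, 1)
        fake = G(z).detach()
        # Simplified WGAN critic loss: minimizing it pushes real scores up
        # and fake scores down.
        loss_D = D(fake).mean() - D(real).mean()
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

    # --- Update the Generator once ---
    z = torch.rand(batch_size, latent_dim) * 2 - 1
    loss_G = -D(G(z)).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```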

Why did I use WaveGAN specifically to synthesize audio?

WaveGAN is not the only model that can synthesize audio. There are many other models in the generative modelling field, and in Natural Language Processing as well.

For example, this paper suggests the use of VQ-VAEs (Vector Quantized Variational Autoencoders) and compares them to another type of GAN for music synthesis called GANSynth.

The main reason for doing this project was not specifically for the WaveGAN model, but rather for learning about Generative Models and the capabilities that they have.

Most basic implementations in this space work mainly with images, like generating images of fake celebrities, generating fake art, or generating fake rooms in homes. I didn’t want to do something as common as those projects, and wanted to work with a new type of data: audio.

Throughout the process of building this project, I realized that Generative Models are very good at solving certain types of problems, specifically problems related to creating a “realistic” copy of some information.

GANs are not the only interesting models in this space: there are also VAEs (Variational Autoencoders), which are very good at learning latent representations of information, and Diffusion Models, which can recover information from an image buried under Gaussian noise.

Now that we understand how GANs work, let’s look at how Deep Convolutional Generative Adversarial Networks (DCGANs) work, since they are the backbone of WaveGAN.

Understanding Deep Convolutional Generative Adversarial Networks:

An image of DCGAN’s Generator.

This is an image of a Deep Convolutional Generative Adversarial Network’s Generator. You might have noticed that the network is composed entirely of Convolutional Layers, or in the case of the Generator, Transposed Convolutional Layers.

This means that DCGANs are, by design, built for applications where the data is visual, most commonly images.

One of the most famous datasets, MNIST Handwritten Digits, is a set of handwritten digits from 0–9, and DCGANs have been implemented to generate those handwritten digits.

DCGANs have also been implemented for generating decently accurate images of human faces, as you can see below.

Another very common application of DCGANs is generating abstract art, which produces outstanding results when the model is engineered well.

Images of applications of DCGANs.

Architecture Details of DCGANs:

Here is the architecture of the Generator:

Layer 1: Input Layer — Takes a latent vector as input

Layer 2: Dense Layer

Layer 3: Reshape Layer — Reshapes the data from the input layer

Layer 4: ReLU Activation Function

Layer 5: Transposed Convolutional Layer(Stride of 2, and Convolutional filters of 5x5 pixels)

Layer 6: ReLU Activation Function

Layer 7: Batch Normalization

Layer 8: Transposed Convolutional Layer 2D(Stride of 2, and Convolutional filters of 5x5 pixels)

Layer 9: ReLU Activation Function

Layer 10: Batch Normalization

Layer 11: Transposed Convolutional Layer 2D(Stride of 2, and Convolutional filters of 5x5 pixels)

Layer 12: ReLU Activation Function

Layer 13: Batch Normalization

Layer 14: Transposed Convolutional Layer 2D(Stride of 2, and Convolutional filters of 5x5 pixels)

Layer 15: ReLU Activation Function

Layer 16: Batch Normalization

Layer 17: Tanh Activation Function(Output Layer)
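For readers who prefer code, here is a rough PyTorch sketch of a Generator in this style; the channel counts, the 64x64 output size, and the batch-norm-before-ReLU ordering are assumptions made for the example rather than the exact DCGAN configuration:

```python
import torch
import torch.nn as nn

# Illustrative DCGAN-style Generator: latent vector -> 64x64 RGB image.
class DCGANGenerator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 1024 * 4 * 4)        # Dense layer
        self.net = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(1024, 512, kernel_size=5, stride=2,
                               padding=2, output_padding=1),  # 4x4 -> 8x8
            nn.BatchNorm2d(512), nn.ReLU(),
            nn.ConvTranspose2d(512, 256, kernel_size=5, stride=2,
                               padding=2, output_padding=1),  # 8x8 -> 16x16
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, kernel_size=5, stride=2,
                               padding=2, output_padding=1),  # 16x16 -> 32x32
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, kernel_size=5, stride=2,
                               padding=2, output_padding=1),  # 32x32 -> 64x64
            nn.Tanh(),                                        # Output in [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 1024, 4, 4)                   # Reshape layer
        return self.net(x)

g = DCGANGenerator()
img = g(torch.rand(8, 100) * 2 - 1)   # latent vectors from Uniform(-1, 1)
print(img.shape)                      # torch.Size([8, 3, 64, 64])
```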

Here is the architecture of the Discriminator:

Layer 1: Input Layer — Takes in real data or generated data

Layer 2: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 3: LeakyReLU Activation Function

Layer 4: Batch Normalization

Layer 5: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 6: LeakyReLU Activation Function

Layer 7: Batch Normalization

Layer 8: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 9: LeakyReLU Activation Function

Layer 10: Batch Normalization

Layer 11: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 12: LeakyReLU Activation Function

Layer 13: Batch Normalization

Layer 14: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 15: LeakyReLU Activation Function

Layer 16: Reshape Layer

Layer 17: Dense Layer(Output Layer)
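And a matching sketch of the Discriminator side; the channel counts are again illustrative, and I use a stride of 2 here so that the spatial dimensions work out for a 64x64 input:

```python
import torch
import torch.nn as nn

# Illustrative DCGAN-style Discriminator: 64x64 RGB image -> realness score.
class DCGANDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2),    # 64 -> 32
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2),  # 32 -> 16
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, kernel_size=5, stride=2, padding=2), # 16 -> 8
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, kernel_size=5, stride=2, padding=2), # 8 -> 4
            nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
        )
        self.fc = nn.Linear(512 * 4 * 4, 1)    # Dense output layer

    def forward(self, x):
        h = self.net(x).flatten(1)             # Reshape (flatten) layer
        return self.fc(h)

d = DCGANDiscriminator()
print(d(torch.randn(8, 3, 64, 64)).shape)      # torch.Size([8, 1])
```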

The main differences between traditional GANs and DCGANs: pooling layers are replaced by strided convolutions (in the Discriminator) and fractional-strided convolutions (in the Generator); Batch Normalization is used throughout the Generator and Discriminator; fully connected hidden layers are replaced with Convolutional and Transposed Convolutional 2D layers; ReLU activation is used throughout the Generator except for the output layer, which uses Tanh; and LeakyReLU is used as the activation function throughout the Discriminator.

Knowing the architecture of DCGAN and its difference from traditional GANs will help to build a good understanding of how WaveGANs work and their architecture.

Understanding WaveGAN:

As I mentioned before, WaveGAN is based on DCGAN, and roughly 90% of the architecture remains the same. However, specific details of the model have been changed to better suit audio data.

In DCGAN’s Generator, there are Transposed Convolutional 2D layers that iteratively upsample an input vector into a high-resolution image. They have a stride of 2 and a filter size of 5x5 pixels.

In WaveGAN, the Generator uses Transposed Convolutional 1D layers, because audio is one-dimensional: just audio samples over time. For the same reason it uses a one-dimensional filter of 25 samples (instead of a 5x5 two-dimensional filter), giving it a larger receptive field and allowing it to process audio more efficiently. The stride of the transposed convolutional layers is also doubled from 2 to 4.
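A quick shape check illustrates this change. In the standalone PyTorch snippet below (the channel counts are arbitrary), a Transposed Convolutional 1D layer with a length-25 kernel and a stride of 4 upsamples a sequence by a factor of four:

```python
import torch
import torch.nn as nn

# One WaveGAN-style upsampling step: a 1D transposed convolution with a
# length-25 kernel and stride 4 (channel counts chosen for illustration).
upsample = nn.ConvTranspose1d(in_channels=512, out_channels=256,
                              kernel_size=25, stride=4,
                              padding=11, output_padding=1)

x = torch.randn(1, 512, 16)    # (batch, channels, time steps)
y = upsample(x)
print(y.shape)                 # torch.Size([1, 256, 64]) -- 4x longer
```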

The image below shows a comparison between the details of the convolutional layers of DCGAN and WaveGAN.

The WaveGAN architecture does not include batch normalization in the Generator or the Discriminator, as the paper found it caused instability during training.

The model also uses a loss function that has proven to train more successfully than the traditional binary cross-entropy loss used in vanilla GANs: Wasserstein GAN + Gradient Penalty (WGAN-GP). In the next section, I will explain how this loss function improves on the traditional binary loss and why it is popularly used in GANs.

The authors also found that an algorithm known as Phase Shuffle solves a major problem that can undermine the Discriminator’s ability to differentiate between generated and real data in a meaningful way. Periodic patterns are common in audio, and the artifacts produced by the Generator’s upsampling occur at particular phases, where they are perceived as pitched noise at characteristic frequencies throughout the audio.

Due to this flaw, the Discriminator can learn to reject generated data based purely on those periodic artifact patterns and their phases, which arise as the latent vector is upscaled into a waveform, rather than on the overall quality of the audio.

Phase Shuffle randomly shifts the phase of the feature map at every convolutional layer in the Discriminator by an offset drawn uniformly from [-n, n] samples, where n is a hyperparameter, filling in the missing samples by reflection padding.

This is an image that shows how Phase Shuffle works.
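Here is a minimal sketch of how Phase Shuffle could be implemented; it is my own paraphrase of the operation described in the paper, not the authors’ code:

```python
import torch
import torch.nn.functional as F

def phase_shuffle(x, n=2):
    """Shift the feature maps along the time axis by a random offset k drawn
    uniformly from [-n, n], filling the gap with reflection padding."""
    if n == 0:
        return x
    k = int(torch.randint(-n, n + 1, (1,)))    # random shift amount
    if k == 0:
        return x
    length = x.shape[-1]
    if k > 0:
        # Pad k reflected samples on the left, then drop k samples on the right.
        return F.pad(x, (k, 0), mode="reflect")[..., :length]
    # Pad |k| reflected samples on the right, then drop |k| samples on the left.
    return F.pad(x, (0, -k), mode="reflect")[..., -length:]

feats = torch.randn(4, 64, 1024)               # (batch, channels, time)
print(phase_shuffle(feats, n=2).shape)         # torch.Size([4, 64, 1024])
```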

Model Architecture of WaveGAN:

Generator Architecture:

Layer 1: Input Layer — Takes in a latent vector in uniform distribution (-1,1)

Layer 2: Dense Layer

Layer 3: Reshape Layer

Layer 4: ReLU Activation Function

Layer 5: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 6: ReLU Activation Function

Layer 7: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 8: ReLU Activation Function

Layer 9: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 10: ReLU Activation Function

Layer 11: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 12: ReLU Activation Function

Layer 13: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 14: Tanh Activation Function(Output Layer)
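Putting the list above into code, a WaveGAN-style Generator might look roughly like the PyTorch sketch below; the channel counts follow the paper’s smallest model size, and both they and the 16384-sample output length should be treated as assumptions for the example rather than my exact implementation:

```python
import torch
import torch.nn as nn

# WaveGAN-style Generator sketch: latent vector -> ~1 second of 16 kHz audio
# (16384 samples). Channel counts are illustrative (d = 64).
class WaveGANGenerator(nn.Module):
    def __init__(self, latent_dim=100, d=64):
        super().__init__()
        self.d = d
        self.fc = nn.Linear(latent_dim, 16 * 16 * d)    # Dense layer

        def up(cin, cout):
            # Kernel of 25 samples, stride 4; padding chosen so length grows 4x.
            return nn.ConvTranspose1d(cin, cout, kernel_size=25, stride=4,
                                      padding=11, output_padding=1)

        self.net = nn.Sequential(
            nn.ReLU(),
            up(16 * d, 8 * d), nn.ReLU(),   # 16 -> 64 samples
            up(8 * d, 4 * d), nn.ReLU(),    # 64 -> 256
            up(4 * d, 2 * d), nn.ReLU(),    # 256 -> 1024
            up(2 * d, d), nn.ReLU(),        # 1024 -> 4096
            up(d, 1), nn.Tanh(),            # 4096 -> 16384, output waveform
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 16 * self.d, 16)        # Reshape layer
        return self.net(x)

g = WaveGANGenerator()
audio = g(torch.rand(2, 100) * 2 - 1)   # latent vectors from Uniform(-1, 1)
print(audio.shape)                      # torch.Size([2, 1, 16384])
```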

Discriminator Architecture:

Layer 1: Input Layer — Takes as input generated data from Generator or real data

Layer 2: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 3: LeakyReLU Activation Function

Layer 4: Phase Shuffle(n=2)

Layer 5: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 6: LeakyReLU Activation Function

Layer 7: Phase Shuffle(n=2)

Layer 8: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 9: LeakyReLU Activation Function

Layer 10: Phase Shuffle(n=2)

Layer 11: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 12: LeakyReLU Activation Function

Layer 13: Phase Shuffle(n=2)

Layer 14: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 15: LeakyReLU Activation Function

Layer 16: Reshape Layer

Layer 17: Dense Layer(Output layer)

The input of the Generator is a latent vector drawn from a uniform distribution over (-1, 1). The training data I used for my implementation is called SC09, which is an audio dataset of people saying the digits 0 through 9.

The output of the Generator is audio waveforms upsampled from the input latent vectors. These waveforms are saved and played back by audio software.

The input of the Discriminator is either a batch of generated data or a batch of training data. The output of the network is 2 values that represent the realness of the given data.

How WaveGAN’s loss functions work:

WaveGAN’s original paper recommends using WGAN-GP, the Wasserstein GAN loss with a Gradient Penalty, as the model’s loss function.

Before I explain how WGAN works and why it’s better, I need to explain how DCGAN’s loss functions work, their shortcomings, and how WGAN improves on them.

A graph that visually and mathematically represents the Kullback–Leibler Divergence.

The image above shows the Kullback–Leibler Divergence, which measures the difference between the probability distributions P and Q; the Jensen-Shannon Divergence is a symmetrized version of it.

The Jensen-Shannon Divergence is the metric underlying the vanilla GAN loss functions that are used to backpropagate through the Discriminator and Generator.

The Discriminator predicts a probability between 0 and 1 that a given piece of data comes from the real dataset. For real data, its loss pushes it to maximize 𝔼x∼pr(x)[log D(x)], where x is drawn from the real data distribution pr. For a fake sample G(z), where the latent vector z ∼ pz(z) is drawn from the noise distribution pz, it is pushed to maximize 𝔼z∼pz(z)[log(1 − D(G(z)))].

The Generator is trained to make the Discriminator assign a high probability to generated samples, which means minimizing 𝔼z∼pz(z)[log(1 − D(G(z)))].

A min-max function represents the interactions between the Generator and Discriminator loss functions.

The loss functions of the Discriminator and Generator can be described as a min-max game, in which the two networks maximize and minimize the same objective, as I explained above.
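Written out cleanly, the min-max objective of the vanilla GAN is:

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_r(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The Discriminator D tries to maximize this value while the Generator G tries to minimize it.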

Since both networks have loss functions that compete with each other in a zero-sum game, there is a significant threat of a large imbalance in the accuracy of the networks.

This problem becomes even worse because of how the Jensen-Shannon Divergence behaves: the less the probability distributions P and Q overlap, the less gradient information the Discriminator gives to the Generator, which limits how much the Generator can improve and move its output distribution closer to the real distribution, which is exactly what the loss function is trying to achieve.

As you can see in the image above, after a certain point in a vanilla GAN the gradients hit 0, completely vanishing and giving no useful information to the Generator.

Eventually, as this problem persists, the model would end up collapsing or a mode collapse would happen where the Generator outputs the same data every time.

Hyperparameter tuning is the usual way to address this imbalance in the Generator’s and Discriminator’s accuracy: if the model and loss function configurations are right, training stays closer to equilibrium. The problem is that this process is very time-consuming and takes a lot of computing power to run many experiments.

Even then, hyperparameter tuning may not be enough to stabilize the model, because the root cause is a flaw in the design of the loss functions themselves, which can’t be fixed unless new loss functions are used.

Introducing Wasserstein GANs

Wasserstein GANs are GANs that use the Wasserstein Distance instead of the Jensen-Shannon Divergence in their loss functions. The Wasserstein distance differs from the JS Divergence in that it measures the effort required to transform one distribution into the other.

An intuitive way to think about it is to picture each distribution as a pile of earth (soil) on a patch of land M. The minimum cost of turning one pile into the other is the amount of soil moved multiplied by the mean distance it has to travel.

This picture is a common way to visualize how the Wasserstein Distance works, which is why it’s also referred to as the Earth Mover’s distance.

The function above defines the p-th Wasserstein distance between 2 probability distributions in the metric space Pp(M).
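For reference, the p-th Wasserstein distance between two distributions μ and ν on a metric space (M, d) is usually written as:

```latex
W_p(\mu, \nu) =
  \left( \inf_{\gamma \in \Gamma(\mu, \nu)}
         \mathbb{E}_{(x, y) \sim \gamma}\big[\, d(x, y)^p \,\big]
  \right)^{1/p}
```

Here Γ(μ, ν) is the set of all joint distributions (couplings) whose marginals are μ and ν; intuitively, each coupling is one plan for moving the pile of earth from one distribution onto the other.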

This fundamental difference, with the JS Divergence measuring how much two probability distributions overlap and the Wasserstein distance measuring the effort it takes to move one distribution onto the other, affects both the stability of GAN training and the performance of the Generator and Discriminator.

When using Wasserstein distance-based loss functions, there is more gradient information for both the Discriminator and the Generator. So even if one of the models is not performing as well as the other, it still receives useful gradient information that it can backpropagate to improve the network.

For example, if the Generator starts to create samples in a distribution far from the real data distribution, the Discriminator returns large gradients to the Generator so it can adjust the weights and biases in its network. As the Generator improves at creating data closer to the real distribution, it gets smaller gradients from the Discriminator. But in that case, the Discriminator gets larger gradients due to the mistakes it makes in differentiating between real and generated data.

How the Discriminator and Generator loss functions work:

The difference between a traditional GAN Discriminator and a WGAN Discriminator (often called a Critic) is that the traditional Discriminator predicts a probability between 0 and 1 that given data is real. Instead, the WGAN Critic measures the realness of data, giving a higher (or lower, depending on the implementation) score to real data and a lower (or higher) score to generated data.

The Generator loss is based on: D(G(z))

The Discriminator loss is based on: D(x) - D(G(z))

The Generator’s loss function tries to maximize (or minimize, depending on the sign convention) the average realness score that the Discriminator gives to generated data, denoted D(G(z)). The goal is to penalize the Generator according to how the average realness score of its generated data compares to the realness score of real data.

The Discriminator’s loss function takes the mean realness score for real data, denoted D(x), and for fake data, denoted D(G(z)), and pushes the model to maximize the score given to real data while minimizing the score given to generated data.

The average scores are used because in every training step, a batch of latent vectors is sampled to generate data and an equally sized batch of real data is sampled from the training set; the batch size varies by implementation.
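In code, the batch-averaged WGAN losses, together with the gradient penalty term that WGAN-GP adds, might look like the PyTorch sketch below; the penalty weight of 10 follows common practice, and the stand-in critic is only there to make the example runnable:

```python
import torch
import torch.nn as nn

def gradient_penalty(D, real, fake, gp_weight=10.0):
    """WGAN-GP penalty: push the critic's gradient norm toward 1 on points
    interpolated between real and generated samples."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    return gp_weight * ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def critic_loss(D, real, fake):
    # Batch-averaged form of the objective above, negated so that a minimizing
    # optimizer ends up maximizing D(x) - D(G(z)). `fake` should be detached
    # from the Generator's graph when updating the Discriminator.
    return D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)

def generator_loss(D, fake):
    # The Generator tries to maximize the realness score of its samples.
    return -D(fake).mean()

# Tiny usage example with stand-in data and a stand-in critic.
D = nn.Sequential(nn.Linear(128, 1))
real = torch.randn(16, 128)
fake = torch.randn(16, 128)
print(critic_loss(D, real, fake).item(), generator_loss(D, fake).item())
```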

By design, when the Discriminator loss is smaller, the Generator loss would be larger, and when the Discriminator loss is larger the Generator loss is smaller.

When one network gets a larger loss, it gets larger gradients, allowing it to make larger changes and improve substantially, while the other network gets smaller gradients and improves only slightly. In the next epoch, the network that had the larger loss will likely have higher accuracy, causing it to have a smaller loss and the other network to have a larger one.

This process continues throughout training: when one network performs better than the other in an epoch, the other network will likely outperform it in the next epoch thanks to the changes made during backpropagation.

Explanation of the code for this project:

If you want to see a full explanation of the code behind my implementation check my video on this project.

That was my article on synthesizing audio through WaveGANs! I learned a lot throughout the process of building this, not just about WaveGAN but about Generative Modelling as a whole and the loss functions used to train these models (probably the deepest I have ever gone into loss functions).

Stay on the lookout for further reads on the book Dive into Deep Learning; I am currently going through the textbook, which builds up from fundamentals and explains DL concepts at a low level.
