Synthesizing Realistic Audio using WaveGANs

Dev Parikh
14 min read · Aug 8, 2022


An image of an audio waveform.

Audio is one of the most complex forms of representing information: it varies in loudness, duration, and timing. Audio is also one of the most important forms of communication around the world, mainly through speech but also through music. As Machine Learning progresses at a rapid rate, making advancements in Natural Language Processing to understand text and in Computer Vision to understand visual data, it’s important that we also create ML models that understand and create audio.

That’s exactly what I am going to talk about in this article: a project that synthesizes audio using a variation of Deep Convolutional Generative Adversarial Networks (DCGANs) known as WaveGAN.

Understanding what GANs are and how they function:

GAN stands for Generative Adversarial Network, a model made up of 2 networks: the Generator and the Discriminator. The Generator takes an input (I will explain what the input is later) and attempts to replicate the real dataset, while the Discriminator attempts to differentiate between the Generator’s output and the real data that is passed to it.

A representation of what a GAN looks like.

The random input vector is a latent vector drawn from a certain probability distribution. In the case of WaveGAN, that distribution is a uniform distribution over [-1, 1]. Latent vectors are sampled from this space, and throughout training, feedback from the loss function shapes how the Generator maps these noise vectors into outputs.

This feedback comes from WGAN-GP (the loss function used by WaveGAN), which uses the Wasserstein Distance and aims to minimize the distance between the generated data distribution and the real data distribution.
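As a tiny illustration (my own sketch, not code from the project), sampling a batch of latent vectors from this uniform prior looks like the snippet below; the batch size and the 100-dimensional latent space are assumptions chosen for the example:

```python
import numpy as np

# Illustrative sizes only: 64 latent vectors, each 100-dimensional.
batch_size = 64
latent_dim = 100

# Each latent vector is drawn from a uniform distribution over [-1, 1].
z = np.random.uniform(low=-1.0, high=1.0, size=(batch_size, latent_dim))
print(z.shape)  # (64, 100)
```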

The Generator’s goal is to create outputs that fool the Discriminator into thinking they are real data, and the Discriminator’s goal is to differentiate between real and generated data as reliably as possible.

As the name suggests, the 2 networks compete with each other in a zero-sum game, each trying to become better than the other. Each network’s loss function is designed to improve it at the other’s expense.

How the training process for GANs works:

GANs are trained in an unsupervised manner, which means there are no labels on the training data. The Generator and Discriminator both learn over many iterations, with optimization applied to both models so that they improve throughout this process.

The exact training procedure for the Generator and Discriminator differs slightly between implementations. It’s important that this training process remains as stable as possible, because the networks are designed such that a large imbalance in their accuracies will cause the model to collapse.

The image above shows the Discriminator being updated each time the Generator is updated. But in many GAN variants there can be 5 updates to the Discriminator for every update to the Generator, and this is the setting WaveGAN uses.

The parameter on the # of updates for D(Discriminator) for every G(Generator) update.

You might think that this would cause an imbalance in the accuracies of the networks, but because the model uses Wasserstein GAN + Gradient Penalty, a better Discriminator (more updates = better predictions) leads to a better Generator: the Discriminator provides more of the gradient information that is critical for the Generator to produce realistic outputs.
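To make this update schedule concrete, here is a minimal PyTorch sketch of the training loop with hypothetical stand-in networks and a simplified WGAN-style loss (the gradient penalty is omitted here); it is not the project’s actual training code, only an illustration of the 5-to-1 update ratio:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in networks, only to show the update schedule.
latent_dim, audio_len = 100, 16384
G = nn.Sequential(nn.Linear(latent_dim, audio_len), nn.Tanh())
D = nn.Sequential(nn.Linear(audio_len, 1))

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))

n_critic = 5      # Discriminator updates per Generator update
batch_size = 16

def sample_real_batch():
    # Placeholder for loading a batch of real audio clips.
    return torch.randn(batch_size, audio_len)

for step in range(100):
    # --- Update the Discriminator n_critic times ---
    for _ in range(n_critic):
        real = sample_real_batch()
        z = torch.rand(batch_size, latent_dim) * 2 - 1   # Uniform(-1, 1)
        fake = G(z).detach()
        # Simplified WGAN critic loss: minimizing it pushes real scores up
        # and fake scores down.
        loss_D = D(fake).mean() - D(real).mean()
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

    # --- Update the Generator once ---
    z = torch.rand(batch_size, latent_dim) * 2 - 1
    loss_G = -D(G(z)).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```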

Why did I use WaveGAN specifically to synthesize audio?

WaveGAN is not the only model that can synthesize audio. There are many other models in the generative modelling field, and in Natural Language Processing as well.

For example, this paper suggests the use of VQ-VAEs (Vector Quantized Variational Autoencoders) and compares them to another type of GAN for music synthesis called GANSynth.

The main reason for doing this project was not specifically for the WaveGAN model, but rather for learning about Generative Models and the capabilities that they have.

Most basic implementations in this space work mainly with images, like generating images of fake celebrities, generating fake art, or generating fake rooms in homes. I didn’t want to do something as common as those projects, and wanted to work with a new type of data: audio.

Throughout the process of building this project, I realized that Generative Models are very good at solving certain types of problems, specifically problems related to creating a “realistic” copy of some information.

GANs are not the only interesting models in this space: there are also VAEs (Variational Autoencoders), which are very good at learning latent representations of information, and Diffusion Models, which can recover information from an image buried under Gaussian noise.

Now that we understand how GANs work, let’s look at how Deep Convolutional Generative Adversarial Networks (DCGANs) work, since they are the backbone of WaveGAN.

Understanding Deep Convolutional Generative Adversarial Networks:

An image of DCGAN’s Generator.

This is an image of a Deep Convolutional Generative Adversarial Network’s Generator. You might have noticed that the network is composed entirely of Convolutional Layers, or in the case of the Generator, Transposed Convolutional Layers.

This means that DCGANs are, by design, built for applications where the data is visual, most commonly images.

One of the most famous datasets, MNIST Handwritten Digits, is a set of handwritten digits from 0–9, and DCGANs have been implemented to generate those handwritten digits.

DCGANs have also been implemented for generating decently accurate images of human faces, as you can see below.

Another very common application of DCGANs is generating abstract art, which produces outstanding results when the model is engineered well.

Images of applications of DCGANs.

Architecture Details of DCGANs:

Here is the architecture of the Generator:

Layer 1: Input Layer — Takes a latent vector as input

Layer 2: Dense Layer

Layer 3: Reshape Layer — Reshapes the data from the input layer

Layer 4: ReLU Activation Function

Layer 5: Transposed Convolutional Layer(Stride of 2, and Convolutional filters of 5x5 pixels)

Layer 6: ReLU Activation Function

Layer 7: Batch Normalization

Layer 8: Transposed Convolutional Layer 2D(Stride of 2, and Convolutional filters of 5x5 pixels)

Layer 9: ReLU Activation Function

Layer 10: Batch Normalization

Layer 11: Transposed Convolutional Layer 2D(Stride of 2, and Convolutional filters of 5x5 pixels)

Layer 12: ReLU Activation Function

Layer 13: Batch Normalization

Layer 14: Transposed Convolutional Layer 2D(Stride of 2, and Convolutional filters of 5x5 pixels)

Layer 15: ReLU Activation Function

Layer 16: Batch Normalization

Layer 17: Tanh Activation Function(Output Layer)
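For readers who prefer code, here is a rough PyTorch sketch of a Generator in this style; the channel counts, the 64x64 output size, and the batch-norm-before-ReLU ordering are assumptions made for the example rather than the exact DCGAN configuration:

```python
import torch
import torch.nn as nn

# Illustrative DCGAN-style Generator: latent vector -> 64x64 RGB image.
class DCGANGenerator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 1024 * 4 * 4)        # Dense layer
        self.net = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(1024, 512, kernel_size=5, stride=2,
                               padding=2, output_padding=1),  # 4x4 -> 8x8
            nn.BatchNorm2d(512), nn.ReLU(),
            nn.ConvTranspose2d(512, 256, kernel_size=5, stride=2,
                               padding=2, output_padding=1),  # 8x8 -> 16x16
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, kernel_size=5, stride=2,
                               padding=2, output_padding=1),  # 16x16 -> 32x32
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, kernel_size=5, stride=2,
                               padding=2, output_padding=1),  # 32x32 -> 64x64
            nn.Tanh(),                                        # Output in [-1, 1]
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 1024, 4, 4)                   # Reshape layer
        return self.net(x)

g = DCGANGenerator()
img = g(torch.rand(8, 100) * 2 - 1)   # latent vectors from Uniform(-1, 1)
print(img.shape)                      # torch.Size([8, 3, 64, 64])
```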

Here is the architecture of the Discriminator:

Layer 1: Input Layer — Takes in real data or generated data

Layer 2: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 3: LeakyReLU Activation Function

Layer 4: Batch Normalization

Layer 5: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 6: LeakyReLU Activation Function

Layer 7: Batch Normalization

Layer 8: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 9: LeakyReLU Activation Function

Layer 10: Batch Normalization

Layer 11: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 12: LeakyReLU Activation Function

Layer 13: Batch Normalization

Layer 14: Convolutional 2D layer(Stride=4 and filter size=5x5 pixels)

Layer 15: LeakyReLU Activation Function

Layer 16: Reshape Layer

Layer 17: Dense Layer(Output Layer)
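And a matching sketch of the Discriminator side; the channel counts are again illustrative, and I use a stride of 2 here so that the spatial dimensions work out for a 64x64 input:

```python
import torch
import torch.nn as nn

# Illustrative DCGAN-style Discriminator: 64x64 RGB image -> realness score.
class DCGANDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2),    # 64 -> 32
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2),  # 32 -> 16
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, kernel_size=5, stride=2, padding=2), # 16 -> 8
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, kernel_size=5, stride=2, padding=2), # 8 -> 4
            nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
        )
        self.fc = nn.Linear(512 * 4 * 4, 1)    # Dense output layer

    def forward(self, x):
        h = self.net(x).flatten(1)             # Reshape (flatten) layer
        return self.fc(h)

d = DCGANDiscriminator()
print(d(torch.randn(8, 3, 64, 64)).shape)      # torch.Size([8, 1])
```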

The main differences between traditional GANs and DCGANs: pooling layers are replaced by strided convolutions (in the Discriminator) and fractional-strided convolutions (in the Generator); Batch Normalization is used throughout the Generator and Discriminator; fully connected hidden layers are replaced with Convolutional and Transposed Convolutional 2D layers; ReLU activation is used throughout the Generator except for the output layer, which uses Tanh; and LeakyReLU is used as the activation function throughout the Discriminator.

Knowing the architecture of DCGAN and its difference from traditional GANs will help to build a good understanding of how WaveGANs work and their architecture.

Understanding WaveGAN:

As I mentioned before, WaveGAN is based on DCGAN, and roughly 90% of the architecture remains the same. However, specific details of the model have been changed to better suit audio data.

In DCGAN’s Generator, there are Transposed Convolutional 2D layers that iteratively upsample an input vector into a high-resolution image. They have a stride of 2 and a filter size of 5x5 pixels.

In WaveGAN, the Generator uses Transposed Convolutional 1D layers, because audio is one-dimensional: just audio samples over time. For the same reason it uses a one-dimensional filter of 25 samples (instead of a 5x5 two-dimensional filter), giving it a larger receptive field and allowing it to process audio more efficiently. The stride of the transposed convolutional layers is also doubled from 2 to 4.
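A quick shape check illustrates this change. In the standalone PyTorch snippet below (the channel counts are arbitrary), a Transposed Convolutional 1D layer with a length-25 kernel and a stride of 4 upsamples a sequence by a factor of four:

```python
import torch
import torch.nn as nn

# One WaveGAN-style upsampling step: a 1D transposed convolution with a
# length-25 kernel and stride 4 (channel counts chosen for illustration).
upsample = nn.ConvTranspose1d(in_channels=512, out_channels=256,
                              kernel_size=25, stride=4,
                              padding=11, output_padding=1)

x = torch.randn(1, 512, 16)    # (batch, channels, time steps)
y = upsample(x)
print(y.shape)                 # torch.Size([1, 256, 64]) -- 4x longer
```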

The image below shows a comparison between the details of the convolutional layers of DCGAN and WaveGAN.

The WaveGAN architecture does not include batch normalization in the Generator or the Discriminator, as the paper found it caused instability during training.

The model also uses a loss function that has proven to train more successfully than the traditional binary cross-entropy loss used in vanilla GANs: Wasserstein GAN + Gradient Penalty (WGAN-GP). In the next section, I will explain how this loss function improves on the traditional binary loss and why it is popularly used in GANs.

The authors also found that an algorithm known as Phase Shuffle solves a major problem that can undermine the Discriminator’s ability to differentiate between generated and real data in a meaningful way. Periodic patterns are common in audio, and the artifacts produced by the Generator’s upsampling occur at particular phases, where they are perceived as pitched noise at characteristic frequencies throughout the audio.

Due to this flaw, the Discriminator can learn to reject generated data based purely on those periodic artifact patterns and their phases, which arise as the latent vector is upscaled into a waveform, rather than on the overall quality of the audio.

Phase Shuffle randomly shifts the phase of the feature map at every convolutional layer in the Discriminator by an offset drawn uniformly from [-n, n] samples, where n is a hyperparameter, filling in the missing samples by reflection padding.

This is an image that shows how Phase Shuffle works.
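Here is a minimal sketch of how Phase Shuffle could be implemented; it is my own paraphrase of the operation described in the paper, not the authors’ code:

```python
import torch
import torch.nn.functional as F

def phase_shuffle(x, n=2):
    """Shift the feature maps along the time axis by a random offset k drawn
    uniformly from [-n, n], filling the gap with reflection padding."""
    if n == 0:
        return x
    k = int(torch.randint(-n, n + 1, (1,)))    # random shift amount
    if k == 0:
        return x
    length = x.shape[-1]
    if k > 0:
        # Pad k reflected samples on the left, then drop k samples on the right.
        return F.pad(x, (k, 0), mode="reflect")[..., :length]
    # Pad |k| reflected samples on the right, then drop |k| samples on the left.
    return F.pad(x, (0, -k), mode="reflect")[..., -length:]

feats = torch.randn(4, 64, 1024)               # (batch, channels, time)
print(phase_shuffle(feats, n=2).shape)         # torch.Size([4, 64, 1024])
```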

Model Architecture of WaveGAN:

Generator Architecture:

Layer 1: Input Layer — Takes in a latent vector in uniform distribution (-1,1)

Layer 2: Dense Layer

Layer 3: Reshape Layer

Layer 4: ReLU Activation Function

Layer 5: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 6: ReLU Activation Function

Layer 7: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 8: ReLU Activation Function

Layer 9: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 10: ReLU Activation Function

Layer 11: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 12: ReLU Activation Function

Layer 13: Transposed Convolutional 1D layer(Stride=4, Kernel Size=25 samples)

Layer 14: Tanh Activation Function(Output Layer)
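Putting the list above into code, a WaveGAN-style Generator might look roughly like the PyTorch sketch below; the channel counts follow the paper’s smallest model size, and both they and the 16384-sample output length should be treated as assumptions for the example rather than my exact implementation:

```python
import torch
import torch.nn as nn

# WaveGAN-style Generator sketch: latent vector -> ~1 second of 16 kHz audio
# (16384 samples). Channel counts are illustrative (d = 64).
class WaveGANGenerator(nn.Module):
    def __init__(self, latent_dim=100, d=64):
        super().__init__()
        self.d = d
        self.fc = nn.Linear(latent_dim, 16 * 16 * d)    # Dense layer

        def up(cin, cout):
            # Kernel of 25 samples, stride 4; padding chosen so length grows 4x.
            return nn.ConvTranspose1d(cin, cout, kernel_size=25, stride=4,
                                      padding=11, output_padding=1)

        self.net = nn.Sequential(
            nn.ReLU(),
            up(16 * d, 8 * d), nn.ReLU(),   # 16 -> 64 samples
            up(8 * d, 4 * d), nn.ReLU(),    # 64 -> 256
            up(4 * d, 2 * d), nn.ReLU(),    # 256 -> 1024
            up(2 * d, d), nn.ReLU(),        # 1024 -> 4096
            up(d, 1), nn.Tanh(),            # 4096 -> 16384, output waveform
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 16 * self.d, 16)        # Reshape layer
        return self.net(x)

g = WaveGANGenerator()
audio = g(torch.rand(2, 100) * 2 - 1)   # latent vectors from Uniform(-1, 1)
print(audio.shape)                      # torch.Size([2, 1, 16384])
```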

Discriminator Architecture:

Layer 1: Input Layer — Takes as input generated data from Generator or real data

Layer 2: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 3: LeakyReLU Activation Function

Layer 4: Phase Shuffle(n=2)

Layer 5: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 6: LeakyReLU Activation Function

Layer 7: Phase Shuffle(n=2)

Layer 8: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 9: LeakyReLU Activation Function

Layer 10: Phase Shuffle(n=2)

Layer 11: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 12: LeakyReLU Activation Function

Layer 13: Phase Shuffle(n=2)

Layer 14: Convolutional 1D Layer(Stride=4, Kernel Size=25 samples)

Layer 15: LeakyReLU Activation Function

Layer 16: Reshape Layer

Layer 17: Dense Layer(Output layer)

The input of the Generator is a latent vector drawn from a uniform distribution over (-1, 1). The training data I used for my implementation is called SC09, which is an audio dataset of people saying the digits 0 through 9.

The output of the Generator is audio waveforms upsampled from the input latent vectors. These waveforms are saved and played back by audio software.

The input of the Discriminator is either a batch of generated data or a batch of training data. The output of the network is 2 values that represent the realness of the given data.

How WaveGAN’s loss functions work:

WaveGAN’s original paper recommends using WGAN-GP, the Wasserstein GAN loss with a Gradient Penalty, as the model’s loss function.

Before I explain how WGAN works and why it’s better, I need to explain how DCGAN’s loss functions work, their shortcomings, and how WGAN improves on them.

A graph that visually and mathematically represents the Kullback–Leibler Divergence.

The image above shows the Kullback–Leibler Divergence, which measures the difference between the probability distributions P and Q; the Jensen-Shannon Divergence is a symmetrized version of it.

The Jensen-Shannon Divergence is the metric underlying the vanilla GAN loss functions that are used to backpropagate through the Discriminator and Generator.

The Discriminator predicts a probability between 0 and 1 that a given piece of data comes from the real dataset. For real data, its loss pushes it to maximize 𝔼x∼pr(x)[log D(x)], where x is drawn from the real data distribution pr. For a fake sample G(z), where the latent vector z ∼ pz(z) is drawn from the noise distribution pz, it is pushed to maximize 𝔼z∼pz(z)[log(1 − D(G(z)))].

The Generator is trained to make the Discriminator assign a high probability to generated samples, which means minimizing 𝔼z∼pz(z)[log(1 − D(G(z)))].

A min-max function represents the interactions between the Generator and Discriminator loss functions.

The loss functions of the Discriminator and Generator can be described as a min-max game, in which the two networks maximize and minimize the same objective, as I explained above.
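Written out cleanly, the min-max objective of the vanilla GAN is:

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_r(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The Discriminator D tries to maximize this value while the Generator G tries to minimize it.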

Since both networks have loss functions that compete with each other in a zero-sum game, there is a significant threat of a large imbalance in the accuracy of the networks.

This problem becomes even worse because of how the Jensen-Shannon Divergence behaves: the less the probability distributions P and Q overlap, the less gradient information the Discriminator gives to the Generator, which limits how much the Generator can improve and move its output distribution closer to the real distribution, which is exactly what the loss function is trying to achieve.

As you can see in the image above, after a certain point in a vanilla GAN the gradients hit 0, completely vanishing and giving no useful information to the Generator.

Eventually, as this problem persists, the model would end up collapsing or a mode collapse would happen where the Generator outputs the same data every time.

Hyperparameter tuning is the usual way to address this imbalance in the Generator’s and Discriminator’s accuracy: if the model and loss function configurations are right, training stays closer to equilibrium. The problem is that this process is very time-consuming and takes a lot of computing power to run many experiments.

Even then, hyperparameter tuning may not be enough to stabilize the model, because the root cause is a flaw in the design of the loss functions themselves, which can’t be fixed unless new loss functions are used.

Introducing Wasserstein GANs

Wasserstein GANs are GANs that use the Wasserstein Distance instead of the Jensen-Shannon Divergence in their loss functions. The Wasserstein distance differs from the JS Divergence in that it measures the effort required to transform one distribution into the other.

An intuitive way to think about it is to picture each distribution as a pile of earth (soil) on a patch of land M. The minimum cost of turning one pile into the other is the amount of soil moved multiplied by the mean distance it has to travel.

This picture is a common way to visualize how the Wasserstein Distance works, which is why it’s also referred to as the Earth Mover’s distance.

The function above defines the p-th Wasserstein distance between 2 probability distributions in the metric space Pp(M).
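For reference, the p-th Wasserstein distance between two distributions μ and ν on a metric space (M, d) is usually written as:

```latex
W_p(\mu, \nu) =
  \left( \inf_{\gamma \in \Gamma(\mu, \nu)}
         \mathbb{E}_{(x, y) \sim \gamma}\big[\, d(x, y)^p \,\big]
  \right)^{1/p}
```

Here Γ(μ, ν) is the set of all joint distributions (couplings) whose marginals are μ and ν; intuitively, each coupling is one plan for moving the pile of earth from one distribution onto the other.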

This fundamental difference, with the JS Divergence measuring how much two probability distributions overlap and the Wasserstein distance measuring the effort it takes to move one distribution onto the other, affects both the stability of GAN training and the performance of the Generator and Discriminator.

When using Wasserstein distance-based loss functions, there is more gradient information for both the Discriminator and the Generator. So even if one of the models is not performing as well as the other, it still receives useful gradient information that it can backpropagate to improve the network.

For example, if the Generator starts to create samples in a distribution far from the real data distribution, the Discriminator returns large gradients to the Generator so it can adjust the weights and biases in its network. As the Generator improves at creating data closer to the real distribution, it gets smaller gradients from the Discriminator. But in that case, the Discriminator gets larger gradients due to the mistakes it makes in differentiating between real and generated data.

How the Discriminator and Generator loss functions work:

The difference between a traditional GAN Discriminator and a WGAN Discriminator (often called a Critic) is that the traditional Discriminator predicts a probability between 0 and 1 that given data is real. Instead, the WGAN Critic measures the realness of data, giving a higher (or lower, depending on the implementation) score to real data and a lower (or higher) score to generated data.

The Generator loss is based on: D(G(z))

The Discriminator loss is based on: D(x) - D(G(z))

The Generator’s loss function tries to maximize (or minimize, depending on the sign convention) the average realness score that the Discriminator gives to generated data, denoted D(G(z)). The goal is to penalize the Generator according to how the average realness score of its generated data compares to the realness score of real data.

The Discriminator’s loss function takes the mean realness score for real data, denoted D(x), and for fake data, denoted D(G(z)), and pushes the model to maximize the score given to real data while minimizing the score given to generated data.

The average scores are used because in every training step, a batch of latent vectors is sampled to generate data and an equally sized batch of real data is sampled from the training set; the batch size varies by implementation.
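In code, the batch-averaged WGAN losses, together with the gradient penalty term that WGAN-GP adds, might look like the PyTorch sketch below; the penalty weight of 10 follows common practice, and the stand-in critic is only there to make the example runnable:

```python
import torch
import torch.nn as nn

def gradient_penalty(D, real, fake, gp_weight=10.0):
    """WGAN-GP penalty: push the critic's gradient norm toward 1 on points
    interpolated between real and generated samples."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    return gp_weight * ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def critic_loss(D, real, fake):
    # Batch-averaged form of the objective above, negated so that a minimizing
    # optimizer ends up maximizing D(x) - D(G(z)). `fake` should be detached
    # from the Generator's graph when updating the Discriminator.
    return D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)

def generator_loss(D, fake):
    # The Generator tries to maximize the realness score of its samples.
    return -D(fake).mean()

# Tiny usage example with stand-in data and a stand-in critic.
D = nn.Sequential(nn.Linear(128, 1))
real = torch.randn(16, 128)
fake = torch.randn(16, 128)
print(critic_loss(D, real, fake).item(), generator_loss(D, fake).item())
```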

By design, when the Discriminator loss is smaller, the Generator loss would be larger, and when the Discriminator loss is larger the Generator loss is smaller.

When one network gets a larger loss, it gets larger gradients, allowing it to make larger changes and improve substantially, while the other network gets smaller gradients and improves only slightly. In the next epoch, the network that had the larger loss will likely have higher accuracy, causing it to have a smaller loss and the other network to have a larger one.

This process continues throughout training: when one network performs better than the other in an epoch, the other network will likely outperform it in the next epoch thanks to the changes made during backpropagation.

Explanation of the code for this project:

If you want to see a full explanation of the code behind my implementation check my video on this project.

That was my article on synthesizing audio through WaveGANs! I learned a lot throughout the process of building this, not just about WaveGAN but about Generative Modelling as a whole and the loss functions used to train these models (probably the deepest I have ever gone into loss functions).

Stay on the lookout for further reads on the book Dive into Deep Learning; I am currently going through the textbook, which builds up from fundamentals and explains DL concepts at a low level.
