At first sight, one could ask why this works at all. A single sample, e.g. 0.13, could just as easily have been drawn from a Gaussian distribution as from any other, such as a uniform distribution. Given only that one sample, the Discriminator cannot tell the sources apart. But since the Generator produces batch_size samples at a time, the samples as a whole will only follow the specific pattern of reality if the Generator actually models it well. Looking at many samples, the mean prediction becomes accurate: the mean over a batch of samples tells the truth about whether the Generator was able to reproduce the behavior of reality.
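To make that concrete, here is a small NumPy sketch; the Gaussian parameters and the batch size of 128 are illustrative assumptions, not the exact values from the article. A single number does not reveal its source distribution, but the statistics of a whole batch do.

import numpy as np

rng = np.random.default_rng(0)

single = 0.13                                   # plausible under N(0, 1) and U(-1, 1) alike
gaussian_batch = rng.normal(0.0, 1.0, size=128)
uniform_batch = rng.uniform(-1.0, 1.0, size=128)

# Batch-level statistics separate the two sources clearly; this is the kind
# of signal the Discriminator can pick up on once it sees many samples.
print("gaussian batch: mean=%.2f std=%.2f" % (gaussian_batch.mean(), gaussian_batch.std()))
print("uniform batch:  mean=%.2f std=%.2f" % (uniform_batch.mean(), uniform_batch.std()))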
In the end, once the Generator is fully trained and has reached an acceptable level of quality, the Discriminator is not used anymore.
By extending the training method with some code that snapshots different properties of the current state of the networks, the training process can be visualized. Stitching those snapshots together into a series, in mp4 format, lets the training process play out like a movie.
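A minimal sketch of how such a snapshot series could be turned into an mp4, assuming matplotlib with an available ffmpeg backend; the snapshot structure and its keys (real, fake) are hypothetical placeholders for whatever the actual training loop records.

import matplotlib.pyplot as plt
from matplotlib.animation import FFMpegWriter

# Inside the (hypothetical) training loop, once per epoch:
# snapshots.append({"real": real_samples, "fake": generated_samples})
snapshots = []

def render_movie(snapshots, path="gan_training.mp4", fps=10):
    # Replay the recorded snapshots frame by frame and write them as an mp4.
    fig, ax = plt.subplots()
    writer = FFMpegWriter(fps=fps)
    with writer.saving(fig, path, dpi=100):
        for snap in snapshots:
            ax.clear()
            ax.hist(snap["real"], bins=50, color="yellow", alpha=0.6, label="real")
            ax.hist(snap["fake"], bins=50, color="green", alpha=0.6, label="generated")
            ax.legend(loc="upper right")
            writer.grab_frame()
    plt.close(fig)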
In total, four different visualizations are drawn during the training.
First, on the top left, sample distributions from reality and from the Generator are shown. The distribution of real, Gaussian-distributed samples is drawn as a yellow histogram; the samples from the Generator are drawn as a green histogram. Throughout the training process, the green distribution approximates the yellow one.
The training progress is also visible on the top right, where the loss/error of the models decreases after each epoch: the Discriminator loss in blue and the Generator loss in red.
For the Discriminator, also in blue, the accuracy after each training epoch is drawn on the bottom left. As can be seen there, the accuracy approaches 0.5 as the GAN training progresses, i.e. the Discriminator can no longer do better than guessing.
Finally, the bottom right image shows the mapping the whole GAN structure tries to learn: a mapping from the latent space, which is uniformly distributed as U(-1, 1), to the target space. To draw that curve, which is essentially the nonlinear function the Generator learns, values from the latent space are fed in on the x axis and the Generator's outputs are drawn on the y axis as the target space.
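A hedged sketch of how one such four-panel frame could be drawn; the history lists, the sample arrays and the Keras-style generator.predict call are assumptions standing in for the article's actual objects.

import numpy as np
import matplotlib.pyplot as plt

def plot_snapshot(real_samples, fake_samples, d_losses, g_losses, d_accs, generator):
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Top left: real (yellow) vs. generated (green) sample histograms.
    axes[0, 0].hist(real_samples, bins=50, color="yellow", alpha=0.6, label="real")
    axes[0, 0].hist(fake_samples, bins=50, color="green", alpha=0.6, label="generated")
    axes[0, 0].legend()

    # Top right: loss per epoch, Discriminator in blue, Generator in red.
    axes[0, 1].plot(d_losses, color="blue", label="D loss")
    axes[0, 1].plot(g_losses, color="red", label="G loss")
    axes[0, 1].legend()

    # Bottom left: Discriminator accuracy per epoch, ideally settling around 0.5.
    axes[1, 0].plot(d_accs, color="blue")
    axes[1, 0].axhline(0.5, linestyle="--", color="gray")

    # Bottom right: the learned mapping from latent space U(-1, 1) to target space.
    z = np.linspace(-1, 1, 200).reshape(-1, 1)
    y = generator.predict(z)   # assumes a Keras-like predict(); adapt to the framework used
    axes[1, 1].plot(z, y)

    fig.tight_layout()
    return fig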
So far, all references were made to the original GAN paper [1]. At its core, the original paper describes the process, and it was shown in the previous article that BCE loss can be used during training to realize the theory from the paper. The paper shows that it is possible to model a real distribution by creating an auxiliary discriminator network, which estimates reality through samples. It further shows that there is a convergence point at which the generative model's distribution reaches the real underlying distribution; this is achieved by minimizing the Jensen-Shannon divergence. A minimal BCE-based training step is sketched below.
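As a reminder of how this connects to the BCE training from the previous article, here is a minimal PyTorch-style sketch; the article itself may use a different framework, and generator, discriminator, the optimizers and the data source are assumed to exist, with a sigmoid output on the Discriminator. The Discriminator is pushed towards 1 on real and 0 on generated samples, while the Generator is pushed towards making the Discriminator output 1.

import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(generator, discriminator, g_opt, d_opt, real_batch, latent_dim=1):
    batch_size = real_batch.size(0)
    ones = torch.ones(batch_size, 1)
    zeros = torch.zeros(batch_size, 1)

    # Discriminator step: real samples should be classified as 1, generated ones as 0.
    z = torch.rand(batch_size, latent_dim) * 2 - 1        # latent noise from U(-1, 1)
    fake_batch = generator(z).detach()
    d_loss = bce(discriminator(real_batch), ones) + bce(discriminator(fake_batch), zeros)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: try to fool the Discriminator into outputting 1.
    z = torch.rand(batch_size, latent_dim) * 2 - 1
    g_loss = bce(discriminator(generator(z)), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    return d_loss.item(), g_loss.item()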
Some issues that remain are vanishing gradients and mode collapse.
After crunching through this one, the next recommended read is the Wasserstein GAN paper [2]. Therein, the above problems are tackled.
[1] Generative Adversarial Networks - The original paper
[2] Wasserstein GAN - The original paper