Did you learn at all?

Multiple roads lead to GANs. Nevertheless, the view from a min-max standpoint, together with the decision to use neural networks for both adversarial models, leaves open the question of how to train them. Backpropagation is available thanks to that prior decision, but what is the necessary loss/objective?

In the original GAN paper by Goodfellow et al. [1, p. 3], the log function is applied to scale the Discriminator's \( D \) output, and this scaled output then enters the objective. In that sense, the log is applied as a loss function to the Discriminator output \( D(x_{real}) \) of a sample \( x_{real} \). When the sample comes from the original, real data, the Discriminator's aim is to produce a value close to \( 1 \), i.e. a high probability of being real. The log-transform of that is \( log(1) = 0 \). If the prediction is not yet optimal, say \( 0.9 \), the outcome is \( log(0.9) \approx -0.105 \). Maximizing that outcome therefore optimizes the network to correctly assign values close to \( 1 \) to real input samples.

To get the same behaviour for a sample \( x_{fake} \) from the Generator, the synthetic, fake ones, \( log(1-D(x_{fake})) \) should be applied, because the Discriminator function \( D \) would ideally produce a value close to \( 0 \) as the probability for a fake input sample \( x_{fake} \). Then \( log(1-0) = 0 \), with the same properties as for the sample from reality. The prediction lies between \( 0 \) and \( 1 \), so if, for example, \( D(x_{fake}) = 0.9 \), then \( log(1-0.9) = log(0.1) \) is strongly negative. As for the real data samples, the bigger the outcome, in other words the closer to zero, the better the Discriminator model is. Again, maximizing the outcome optimizes the model.
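To make the two terms concrete, here is a tiny numeric sketch. The Discriminator outputs are made-up values for illustration, not outputs of a trained model:

```python
import math

# Hypothetical discriminator outputs, chosen only for illustration.
d_real = 0.9   # D(x_real): ideally close to 1
d_fake = 0.1   # D(x_fake): ideally close to 0

term_real = math.log(d_real)      # log(0.9) ≈ -0.105; maximal (0) at D(x_real) = 1
term_fake = math.log(1 - d_fake)  # log(0.9) ≈ -0.105; maximal (0) at D(x_fake) = 0

# A confidently wrong output on a fake sample is heavily penalized:
bad = math.log(1 - 0.9)           # log(0.1) ≈ -2.303
```

Both terms are at most \( 0 \), so pushing them towards zero, i.e. maximizing, improves the Discriminator on both kinds of input.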

So far, the objective would be: \( obj = log(D(x_{real})) + log(1-D(x_{fake})) \). Up to this point, the objective holds for the Discriminator training, into which synthetic samples \( x_{fake} \) from a Generator and real samples \( x_{real} \) are fed. This loss value \( obj \) is usually written as an expectation \( \mathbb{E} \), and since batchwise training is applied, the objective averages over a batch of training samples. Samples \( x \) are drawn either from the data distribution \( p_{data} \) or from the approximated data distribution \( p_g \) induced by the Generator model. The formula above therefore becomes: \( E = \mathbb{E}_{x\sim p_{data}}(log(D(x))) + \mathbb{E}_{x \sim p_g}(log(1-D(x))) \)

With regard to backpropagation, this develops further into an algorithm by maximizing that objective: \( max \,V(D) = \mathbb{E}_{x\sim p_{data}}(log(D(x))) + \mathbb{E}_{x\sim p_g}(log(1-D(x))) \). Last but not least, a sample \( x \) drawn from \( p_g \) corresponds to a sample \( z \) drawn from the latent space \( p_z = U(-1, 1) \) and passed through \( G \) to produce the synthetic sample. The function \( V \), to be maximized by adapting the parameters of the Discriminator network \( D \), therefore reads: \( max \,V(D) = \mathbb{E}_{x\sim p_{data}}(log(D(x))) + \mathbb{E}_{z\sim p_z}(log(1-D(G(z)))) \)
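As a sanity check of the notation, \( V(D) \) can be estimated by Monte Carlo with batch averages. The \( D \) and \( G \) below are toy stand-in functions chosen here for illustration, not trained networks:

```python
import math
import random

random.seed(0)

def G(z: float) -> float:
    """Toy stand-in generator: maps a latent value to a sample."""
    return 2.0 * z

def D(x: float) -> float:
    """Toy stand-in discriminator: squashes its input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

N = 10_000
real = [random.gauss(0.0, 1.0) for _ in range(N)]       # x ~ p_data
latent = [random.uniform(-1.0, 1.0) for _ in range(N)]  # z ~ p_z = U(-1, 1)

# V(D) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
V = (sum(math.log(D(x)) for x in real) / N
     + sum(math.log(1.0 - D(G(z))) for z in latent) / N)
```

Since both log terms are negative, any such estimate of \( V(D) \) is below zero; a better Discriminator pushes it towards zero.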

With regard to implementation, this is nothing else than BCE as the loss function. Binary Cross Entropy is a loss function for binary outcomes of neural networks, typically \( 0 \) and \( 1 \). It is calculated for a batch of \( N \) samples from a dataset, where each input \( x_i \) has a corresponding label \( y_i \): \( BCE = -\frac{1}{N} \sum_{i=1}^{N}{y_i \cdot log(p(x_i)) + (1-y_i) \cdot log(1-p(x_i))} \). If sample \( x_i \) has label \( y_i=0 \), the first part of the term cancels out and the second term, with \( (1-0) = 1 \), contributes to the average, i.e. the BCE. In the other case, where sample \( x_i \) has label \( y_i = 1 \), the second term cancels out accordingly. The summation and average over \( N \) finally corresponds to \( \mathbb{E} \). For the final implementation, BCE with the right labels, \( y_i = 0\,\forall\,x_i \sim p_z \) and \( y_i = 1\,\forall\,x_i \sim p_{data} \), leads to the proposed loss function from the original paper.
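The formula can be written down directly. This is a pure-Python sketch of the definition above (PyTorch's torch.nn.BCELoss computes the same quantity on tensors):

```python
import math

def bce(ps: list[float], ys: list[int]) -> float:
    """Binary cross entropy, exactly as in the formula above."""
    N = len(ps)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(ps, ys)) / N

# y_i = 1: only the first term survives; y_i = 0: only the second.
loss_real = bce([0.8], [1])   # -log(0.8)
loss_fake = bce([0.8], [0])   # -log(1 - 0.8)
```

The made-up prediction \( 0.8 \) shows the cancellation: with label \( 1 \) only \( -log(p) \) remains, with label \( 0 \) only \( -log(1-p) \).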

To arrive at an algorithm, the maximization in the Discriminator training is defined by \( max\,V(D) = \frac{1}{N} \sum_{i=1}^{N}{y_i \cdot log(p(x_i)) + (1-y_i) \cdot log(1-p(x_i))} \) where \( y_i \in \{y_{data}, y_{fake}\} \), \( y_{data}=1, y_{fake}=0 \). For real samples \( x_{data} \) with \( y_{data}=1 \), the second term cancels out and, as in the derived GAN loss, the average is calculated. For synthetic samples \( x_{fake} \) with \( y_{fake}=0 \), the first part cancels out and the average is calculated in BCE as in the expected GAN loss. The only remaining difference is the negative multiplier in BCE. This is because most optimization heads towards minimization: maximizing a loss equals minimizing the negative of that loss, and maximizing from a negative value towards zero equals minimizing the corresponding positive amount towards zero. In the context of the final implementation in code, this leads to \( min\,V(D) = - \frac{1}{N} \sum_{i=1}^{N}{y_i \cdot log(p(x_i)) + (1-y_i) \cdot log(1-p(x_i))} \). In the Discriminator class this turns into a new property called self.criterion. To backpropagate that criterion, another property self.optimizer is also included. Multiple options are possible here; the Adam optimizer is considered a sensible default.
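That the sign really is the only difference can be checked numerically. With labels \( 1 \) for real and \( 0 \) for fake outputs (the Discriminator values below are made up for illustration), the BCE over the combined batch equals the negated batch-averaged objective:

```python
import math

d_real = [0.9, 0.8]  # hypothetical D outputs on real samples (y = 1)
d_fake = [0.2, 0.1]  # hypothetical D outputs on fake samples (y = 0)

ps = d_real + d_fake
ys = [1, 1, 0, 0]
N = len(ps)

bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for p, y in zip(ps, ys)) / N

# Batch-averaged GAN objective V(D) on the same outputs:
V = (sum(math.log(p) for p in d_real)
     + sum(math.log(1 - p) for p in d_fake)) / N

# bce == -V, so minimizing BCE maximizes V(D).
```
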

              
              import torch
              from torch import nn, Tensor


              class Discriminator(nn.Module):
                  """ Decides if a sample is drawn from
                      Gaussian distribution.
                  """
                  def __init__(self, batch_size: int):
                      super(Discriminator, self).__init__()
                      self.batch_size = batch_size
                      self.lr = 0.0005
                      self.input_size = 1
                      self.output_size = 1
                      self.k = 1

                      self.fc = nn.Linear(self.input_size, 8)
                      self.fc2 = nn.Linear(8, self.output_size)
                      self.lrelu = nn.LeakyReLU(negative_slope=0.1)
                      self.sigmoid = nn.Sigmoid()

                      self.optimizer = torch.optim.Adam(self.parameters(),
                                                        lr=self.lr)
                      self.criterion = torch.nn.BCELoss()

                  def forward(self, x: Tensor) -> Tensor:
                      output = self.fc(x)
                      output = self.lrelu(output)
                      output = self.fc2(output)
                      output = self.sigmoid(output)

                      return output
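One Discriminator update could then look like the following sketch. It assumes the Discriminator and a Generator with the interface shown in this article; the helper name train_discriminator_step and the label tensors are choices made here, not taken from the paper:

```python
import torch
from torch import nn, Tensor

def train_discriminator_step(D: nn.Module, G: nn.Module,
                             x_real: Tensor, z: Tensor) -> float:
    """One step of min -V(D) via BCE with labels real = 1, fake = 0."""
    D.optimizer.zero_grad()
    with torch.no_grad():                 # G stays fixed during D's update
        x_fake = G(z)
    y_real = torch.ones(x_real.size(0), 1)
    y_fake = torch.zeros(x_fake.size(0), 1)
    loss = (D.criterion(D(x_real), y_real)
            + D.criterion(D(x_fake), y_fake))
    loss.backward()
    D.optimizer.step()
    return loss.item()
```

Detaching the Generator via torch.no_grad() reflects that only \( D \)'s parameters are adapted in this maximization step.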

              
            

The original GAN paper by Goodfellow et al. [1, p. 3] indicates that the same formula can be used for the Generator training by minimizing \( V(D, G) \). In that case \( D \) is fixed and only \( G \) is modifiable through backpropagation. For that training, only synthetic data \( x_{fake} \) with \( y_{fake}=0 \) is involved, so the first term cancels out. Nevertheless, the function term itself remains the same. With a constant Discriminator, improving the Generator then minimizes the loss: \( min\, V(D, G) = \mathbb{E}_{z \sim p_z}(log(1-D(G(z)))) \), since the term \( (1-D(G(z))) \) gets closer to zero as the Generator outputs more realistic samples, and \( log(0) \) heads towards negative infinity. So minimizing the loss function \( V(D, G) \) by adapting the parameters of \( G \) means improving the Generator model.

As BCE is used, there is still the issue of the negative prefix, which turns the minimization into a maximization, and the default optimizers in PyTorch are not prepared for that. This is why an adaptation has to be made during the training itself: the synthetic labels are flipped from \( y_{fake} = 0 \) to \( y_{fake} = 1 \). This cancels the second term instead of the first one, and it follows: \( min\, V(G, D) = - \frac{1}{N} \sum_{i=1}^{N}{y_i \cdot log(p(x_i))} \). This minimizes \( -log(D(G(z))) \): for the optimal Discriminator \( D^* \), \( D(x_{fake}) \) would return values close to \( 0 \), so \( log(D(G(z))) \) heads towards \( -\infty \), and due to the negative prefix the loss heads towards \( +\infty \). Minimizing from there towards \( 0 \) means the Discriminator returns values closer to \( 1 \) as the Generator learns, since \( log(1) = 0 \). By changing the label during Generator training, the same BCE loss can be used to implement the original paper's loss. The Generator model has a property self.criterion; the label adaptation to \( 1 \) instead of \( 0 \) has to be done when implementing the actual training in a GAN class.

              
              class Generator(nn.Module):
                  """ Maps from Uniform distribution to Gaussian distribution.
                      Latent space of 1D vector input to 1D vector output.
                  """
                  def __init__(self, batch_size: int):
                      super(Generator, self).__init__()
                      self.batch_size = batch_size
                      self.lr = 0.0005
                      self.input_size = 1
                      self.output_size = 1

                      self.fc = nn.Linear(self.input_size, 8)
                      self.fc2 = nn.Linear(8, 12)
                      self.fc4 = nn.Linear(12, self.output_size)
                      self.lrelu = nn.LeakyReLU(negative_slope=0.1)

                      self.criterion = torch.nn.BCELoss()
                      self.optimizer = torch.optim.Adam(self.parameters(),
                                                        lr=self.lr)

                  def forward(self, x: Tensor) -> Tensor:
                      output = self.fc(x)
                      output = self.lrelu(output)
                      output = self.fc2(output)
                      output = self.lrelu(output)
                      output = self.fc4(output)
                      output = self.lrelu(output)

                      return output
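The flipped-label trick then makes the Generator update a plain BCE minimization as well. Again a sketch against the interfaces above; the function name train_generator_step is a choice made here:

```python
import torch
from torch import nn, Tensor

def train_generator_step(G: nn.Module, D: nn.Module, z: Tensor) -> float:
    """One step of min -(1/N) sum log D(G(z)), via BCE with flipped labels."""
    G.optimizer.zero_grad()
    x_fake = G(z)
    y_flipped = torch.ones(z.size(0), 1)   # y_fake flipped from 0 to 1
    loss = G.criterion(D(x_fake), y_flipped)
    loss.backward()                        # gradients flow back through D ...
    G.optimizer.step()                     # ... but only G is updated here
    return loss.item()
```

Note that \( D \) is used in the forward pass but its optimizer is never stepped, which is exactly the "fixed \( D \)" assumption of this minimization.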

              
            
In the above section, \( D \) has been fixed the whole time during minimization, so that only the Generator network updates. The difference from the theoretical description of the problem, i.e. the min-max problem, is that in the practical implementation the maximization part also has to be transformed into a minimization, so that the same gradient-descent algorithms can be used; in the final implementation BCE serves as the loss function. Beyond that, the theoretical description is summarized by the formula from the paper: \( min_G\, max_D\, V(D, G) = \mathbb{E}_{x\sim p_{data}}(log(D(x))) + \mathbb{E}_{z\sim p_z}(log(1-D(G(z)))) \)
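Putting both updates together, a minimal self-contained training loop might look as follows. The tiny nn.Sequential stand-ins replace the classes above only to keep the sketch short; real data comes from \( N(0, 1) \), the latent space is \( U(-1, 1) \), and \( k = 1 \) Discriminator steps are taken per Generator step, as in the original algorithm:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny stand-ins for the Discriminator and Generator classes above.
D = nn.Sequential(nn.Linear(1, 8), nn.LeakyReLU(0.1),
                  nn.Linear(8, 1), nn.Sigmoid())
G = nn.Sequential(nn.Linear(1, 8), nn.LeakyReLU(0.1), nn.Linear(8, 1))
opt_d = torch.optim.Adam(D.parameters(), lr=0.0005)
opt_g = torch.optim.Adam(G.parameters(), lr=0.0005)
bce = nn.BCELoss()

batch, k = 32, 1
ones = torch.ones(batch, 1)
zeros = torch.zeros(batch, 1)

for step in range(100):
    for _ in range(k):                    # k discriminator updates
        x_real = torch.randn(batch, 1)    # x ~ p_data = N(0, 1)
        z = torch.rand(batch, 1) * 2 - 1  # z ~ p_z = U(-1, 1)
        opt_d.zero_grad()
        # min -V(D): BCE with y_real = 1, y_fake = 0; detach() fixes G
        loss_d = bce(D(x_real), ones) + bce(D(G(z).detach()), zeros)
        loss_d.backward()
        opt_d.step()

    z = torch.rand(batch, 1) * 2 - 1      # one generator update
    opt_g.zero_grad()
    loss_g = bce(D(G(z)), ones)           # flipped label y_fake = 1
    loss_g.backward()
    opt_g.step()
```

The detach() call in the Discriminator step plays the same role as the fixed \( G \) in the derivation, and the flipped label in the Generator step implements the sign trick discussed above.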

[1] I. J. Goodfellow et al., "Generative Adversarial Nets", Advances in Neural Information Processing Systems 27, 2014 (the original GAN paper).