monellimankeimontenegro

I cannot see that anymore!

The paper - Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network [1] - addresses the question of how to recover fine texture details when enlarging an image. In computer vision this task is called super-resolution (SR), and the authors refer to SR from a single image as SISR. There are also approaches that recover a high-resolution (HR) image from multiple images, but these are covered neither here nor in the paper.

A standard approach for a neural network would be to feed in the low-resolution (LR) image and compare the output with the HR image by minimizing the mean squared error (MSE). As an objective function, the MSE has some disadvantages. It is a pixel-wise comparison: each pixel of the network's output image is compared with its ground truth, the errors are squared, summed up, and then divided by the number of pixels. The higher the value, the greater the difference between the model prediction and reality. The problem is that this does not take the relations between pixels into account, which leads to blurry output images. That is why the paper takes a different approach and combines an adversarial loss from a GAN with a content loss. The latter aims at perceptual similarity rather than pixel-wise similarity, that is, it is designed so that the output looks similar to the ground truth to the human eye. Together, these two losses form what the paper calls the perceptual loss. The results of this approach were evaluated with a mean opinion score (MOS).
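The pixel-wise MSE described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `mse`, the representation of an image as a list of rows of grayscale intensities, and the tiny 2x2 example images are assumptions made here, not something taken from the paper.

```python
def mse(prediction, target):
    """Pixel-wise mean squared error between two equally sized grayscale images.

    Both images are lists of rows of pixel intensities (an assumption of this sketch).
    """
    if len(prediction) != len(target):
        raise ValueError("images must have the same number of rows")
    squared_errors = []
    for pred_row, target_row in zip(prediction, target):
        if len(pred_row) != len(target_row):
            raise ValueError("rows must have the same number of pixels")
        for pred_pixel, target_pixel in zip(pred_row, target_row):
            # Compare each predicted pixel with its ground truth and square the error.
            squared_errors.append((pred_pixel - target_pixel) ** 2)
    # Sum the squared errors and divide by the number of pixels.
    return sum(squared_errors) / len(squared_errors)


# Hypothetical 2x2 "images": a sharp checkerboard texture as ground truth,
# the same texture shifted by one pixel, and a uniformly gray (blurry) image.
ground_truth = [[0.0, 1.0], [1.0, 0.0]]
sharp_but_shifted = [[1.0, 0.0], [0.0, 1.0]]
blurry = [[0.5, 0.5], [0.5, 0.5]]

print(mse(ground_truth, ground_truth))       # 0.0
print(mse(sharp_but_shifted, ground_truth))  # 1.0
print(mse(blurry, ground_truth))             # 0.25
```

In this toy case the uniformly gray image scores better than the sharp but shifted texture, which hints at why a network trained only on MSE tends to produce blurry outputs.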

Fig.1 - Combination of adversarial loss and content loss for perceptual loss.
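
Read as a formula, the combination of the two losses might be written as a weighted sum. This is only a sketch: the symbols \( l_{perceptual} \), \( l_{content} \), \( l_{adversarial} \) and the weight \( \lambda \) are notation assumed here, and the exact definitions and weighting should be taken from the paper [1].

\[ l_{perceptual} = l_{content} + \lambda \, l_{adversarial} \]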


A term that is probably less familiar is MOS. It is the arithmetic mean of the individual ratings that a group of N people give to an audio, audio-visual or visual product, where each rating \( R_n \) typically ranges between 1 and 5. The mean of those N ratings is the MOS value: \( MOS = \frac{1}{N} \sum_{n=1}^{N} R_n \)
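
As a small worked example, the MOS can be computed as follows. The function name `mean_opinion_score` and the ratings are made up for illustration, and the 1 to 5 range check reflects the typical scale mentioned above rather than a fixed rule.

```python
def mean_opinion_score(ratings):
    """Arithmetic mean of N individual ratings, each expected to lie in 1..5."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1..5 scale")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from N = 5 people for one super-resolved image.
example_ratings = [4, 5, 3, 4, 4]
print(mean_opinion_score(example_ratings))  # 4.0
```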

[1] Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network - The original paper