GAN Comparison
Abstract
Generative adversarial networks (GANs) have gained significant attention in recent years. A number
of GAN variants have been proposed and have been utilized in many applications. Despite a large
number of applications and developments in GANs, few works have studied the metrics that evaluate
GANs’ performance.
In this paper, we present a comprehensive analysis of the most commonly used evaluation metrics
for measuring the performance of GANs. We define each metric mathematically and analyze its pros and cons in the
context of GANs. Based on our analysis,
we observe that defining an appropriate metric for evaluating GAN’s performance is still an open
problem, not only for fair model comparison but also for understanding, improving, and developing
generative models. Overall, this study suggests that the choice of feature space in which to compute
various metrics is crucial. In addition, we suggest building a code repository of evaluation metrics that enables
comparative empirical and analytical studies of the available measures, so that models can be benchmarked under the
same conditions using more than one metric in future work.
1 Introduction
Generative adversarial networks (GANs) are a recently developed technique for semi-supervised and unsupervised
learning. They achieve this by implicitly modelling high-dimensional data distributions. Goodfellow et al. [1] first
introduced the adversarial process for learning generative models. The fundamental aspect of a GAN is the two-person
min-max zero-sum game, in which one player's gain comes at an equivalent loss to the other. Here, the two players
correspond to the two networks of the GAN, called the discriminator and the generator.
The main objective of the discriminator is to determine whether a sample comes from the real distribution or the fake
one, whereas the generator aims to deceive the discriminator by producing samples that mimic the real distribution.
The discriminator outputs the probability that a given sample is real: a value close to one indicates that the sample is
likely real, while a value close to zero indicates a fake sample. A probability near 0.5 indicates an optimal solution, in
which the discriminator is unable to differentiate fake from real samples.
Generative models can be divided into two categories, namely explicit and implicit. Explicit models assume access to
the model's likelihood function, whereas implicit models use a sampling mechanism to generate data. Explicit models
include variational auto-encoders (VAEs) [2] and PixelCNN [3]; implicit models include GANs.
Generative adversarial networks (GANs) [1] have been studied extensively in recent years. Besides producing
surprisingly plausible images of faces [4][5] and bedrooms [6][7], they have also been innovatively applied in, for
example, semi-supervised learning [3][8], image-to-image translation [9], and simulated image refinement [10]. But,
despite the availability of a plethora of GAN models, the way these models are evaluated is still largely qualitative,
very often resorting to manual inspection of the visual fidelity of generated images. Such manual inspection is
time-consuming, subjective, and potentially misleading.
Several evaluation metrics have been defined for measuring the performance of GAN models. In GANs, the objective
function for the generator and the discriminator usually measures how well they are doing relative to the opponent.
For example, we measure how well the generator is fooling the discriminator. This, however, is not a good metric for
measuring image quality or diversity.
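For reference, the standard two-player objective introduced by Goodfellow et al. [1] (reproduced here in its usual form) is

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

where D tries to maximize this value and G tries to minimize it; the evaluation problem discussed in this paper is
precisely that this training objective does not quantify the quality or diversity of the samples produced by G.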
Nowadays, researchers focus on quantitative evaluation of GAN models along with qualitative metrics. Both qualitative
and quantitative metrics have their own pros and cons. Quantitative metrics are less subjective, but they do not directly
map to how humans perceive and judge generated images. These issues, along with other problems such as the variety of
probability criteria and the lack of a perceptually meaningful image similarity measure, have made evaluating generative
models difficult and challenging [11].
However, there is no globally accepted agreement regarding the best GAN evaluation metrics. Some researchers have
proposed benchmarks for GANs [12]. Such efforts are beneficial for extending research on GAN evaluation metrics and
for analyzing their pros and cons.
In its general architecture, a generative adversarial network has two networks, called the discriminator and the
generator and denoted as D and G, respectively.
1. The Generator (G): G is a network that generates images from random noise Z; the generated images are denoted
G(z). The input is commonly Gaussian noise, i.e. a random point in latent space. The parameters of both the G and D
networks are updated iteratively during GAN training; in fact, the parameters of D remain static while the G network is
being trained. The output of the G network is labelled as the fake distribution and given to the D network as input for
classification. The error is computed between the output of the discriminator and the label of the sample image, and is
propagated back to update the weights of the G network. A few constraints can be imposed on the inputs of the G
network and added in its last layer. In addition, noise can also be added to the hidden layers. There is no limit on the
dimensionality of the input Z of the G network.
2. The Discriminator (D): D is a discriminant network that determines whether a given image belongs to the real
distribution or not. It receives an input image X and produces an output D(x) representing the probability that X
belongs to the real distribution. An output of 1 indicates a real image, while an output of 0 indicates that the image
belongs to the fake distribution. During this part of training, the G network remains static. The D network takes a real
image as well as a fake image generated by the G network as input and computes the error in its label prediction. The
weights of the D network are then updated on the basis of the back-propagated error. A minimal sketch of this
alternating training procedure is given after this list.
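To make the alternating update described in the two items above concrete, the following is a minimal, illustrative
PyTorch sketch (not taken from any of the cited papers); the network sizes, the 784-dimensional data, the batch size,
and the optimizer settings are assumptions chosen purely for illustration.

import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not prescribed by the text)
noise_dim, data_dim, batch = 64, 784, 32

G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    # Update D: real samples are labelled 1, generated samples are labelled 0; G is held fixed.
    z = torch.randn(batch, noise_dim)
    fake = G(z).detach()
    d_loss = bce(D(real_batch), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Update G: D is held fixed, and G tries to make D output 1 on generated samples.
    z = torch.randn(batch, noise_dim)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example: one training step on a random stand-in for a batch of real data in [-1, 1].
d_l, g_l = train_step(torch.rand(batch, data_dim) * 2 - 1)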
• Conditional GAN (CGAN) [14]: GANs can be extended to a conditional model if both the G and D networks are
conditioned on some extra information y, addressing the original model's limitation of depending only on random
variables [14]. y can be any kind of auxiliary information, such as class labels or data from other modalities. The
conditional information is added by feeding y into both the D and G networks as an additional input layer.
In the G network, the prior input noise pz(z) and y are combined in a joint hidden representation, and the adversarial
training framework allows considerable flexibility in how this hidden representation is composed [14]. In the D
network, x and y are presented together as inputs to the discriminant function. The objective function of the two-player
minimax game becomes Eq. 2.
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \quad (2)
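As a hedged illustration of how y can be fed to both networks, the sketch below simply concatenates a one-hot label
vector to the inputs of G and D; the layer sizes, the number of classes, and the concatenation strategy are assumptions
chosen for illustration rather than details taken from [14].

import torch
import torch.nn as nn

noise_dim, data_dim, n_classes = 64, 784, 10   # illustrative sizes (assumptions)

# Conditional generator: input is [z, one_hot(y)].
G = nn.Sequential(nn.Linear(noise_dim + n_classes, 128), nn.ReLU(),
                  nn.Linear(128, data_dim), nn.Tanh())
# Conditional discriminator: input is [x, one_hot(y)].
D = nn.Sequential(nn.Linear(data_dim + n_classes, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())

def generate(z, y):
    y_onehot = torch.nn.functional.one_hot(y, n_classes).float()
    return G(torch.cat([z, y_onehot], dim=1))

def discriminate(x, y):
    y_onehot = torch.nn.functional.one_hot(y, n_classes).float()
    return D(torch.cat([x, y_onehot], dim=1))

# Example: generate and score a small batch conditioned on class label 3.
z = torch.randn(8, noise_dim)
y = torch.full((8,), 3, dtype=torch.long)
scores = discriminate(generate(z, y), y)   # probabilities that the samples are "real" given y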
There are some essential and desirable characteristics of GAN evaluation metrics, as defined below. These characteristics
serve as a meta-measure for evaluating and comparing GAN performance. An effective GAN evaluation metric should
possess the following characteristics.
1. It should favour models that generate samples which are difficult to distinguish from real ones (i.e., high-fidelity samples).
2. It should be sensitive to over-fitting of the model.
3. It should be able to assess models with respect to disentangled latent spaces as well as latent-space continuity.
4. It should have well-defined boundary values.
5. It must be sensitive to image distortions and transformations.
6. It must agree with human perceptual judgments and human rankings of models, and
7. It must have a low sample and computational complexity.
Visual investigation of images by humans is the most common and intuitive way to evaluate GANs [17], but it falls short
in many ways. Firstly, it is costly and biased to examine the quality of generated images with human vision; it is also
difficult to reproduce and does not fully reflect the capacity of a model. Secondly, human inspection has high variance,
which makes it necessary to average over a large number of subjects. Thirdly, an evaluation based on samples can be
biased towards models that overfit and is therefore a poor indicator of a good density model in a log-likelihood sense
[18]. The following methods have been used for measuring the performance of GANs qualitatively.
1. Nearest Neighbors: In order to detect over-fitting, some generated samples are traditionally shown next to their
nearest neighbours in the training set. There are two concerns with such an examination.
Firstly, nearest neighbours are typically determined on the basis of Euclidean distance, which is sensitive to minor
perceptual perturbations; this is a well-known phenomenon in the psychophysics literature. It is trivial to generate
samples that are visually almost identical to a training image yet have a large Euclidean distance from it [18].
Secondly, a model that stores (transformed) training images can trivially pass the nearest-neighbour over-fitting test.
This problem can be alleviated by choosing the nearest neighbours based on perceptual measures and by showing more
than one nearest neighbour (a feature-space sketch of such a check is given after this list).
2. Rating and Preference Judgment: These types of experiments invite subjects to rate models in terms
of the fidelity of their generated images. For example, Snell et al. [19] studied whether observers prefer reconstructions
produced by perceptually optimized networks or by pixelwise-loss optimized networks. Participants were shown image
triplets with the original (reference) image in the centre and the SSIM- and MSE-optimized reconstructions on either
side, with the locations counterbalanced.
3. Rapid Scene Categorization: These metrics are based on the fact that human beings are able to report certain
characteristics of a scene at a short glance [20]. To obtain a quantitative measure of image quality, Denton et al. [17]
invited volunteers to distinguish their generated images from real images. They varied the viewing time from 50 ms to
2000 ms and concluded that their model was better than the original GAN [1], since it did better at fooling the subjects
(the lower bound here is 0% and the upper bound is 100%).
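The nearest-neighbour check mentioned in item 1 can be made less sensitive to pixel-level perturbations by searching in
the feature space of a pre-trained CNN rather than in pixel space. The sketch below is one minimal way to do this; the
choice of ResNet-18, the layer used, and the assumption that images are already resized and normalized are illustrative
assumptions, not prescriptions from the text.

import torch
import torchvision.models as models

# A pre-trained CNN used as a fixed feature extractor (ResNet-18 is an assumed choice).
net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
net.fc = torch.nn.Identity()          # strip the classifier, keep the 512-d features
net.eval()

@torch.no_grad()
def features(images):                  # images: (N, 3, 224, 224), already preprocessed
    return net(images)

@torch.no_grad()
def nearest_training_neighbours(generated, training, k=5):
    # Indices of the k nearest training images (in feature space) for each generated image;
    # near-duplicates of training images suggest memorization / over-fitting.
    g, t = features(generated), features(training)
    dists = torch.cdist(g, t)          # pairwise Euclidean distances between feature vectors
    return dists.topk(k, largest=False).indices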
Several quantitative metrics have been proposed for measuring the performance of GANs; they are described below.
1. Parzen Window Density: Parzen window estimation, or kernel density estimation (KDE), is a well-known method for
estimating the density function of a distribution from samples. For a probability kernel K (most often an isotropic
Gaussian) and i.i.d. samples X_1, X_2, ..., X_n, the density at a point x is estimated as
p(x) \approx \frac{1}{Z} \sum_{i=1}^{n} K(x - X_i),
where Z is a normalizing constant.
The Parzen window approach to density estimation takes a finite set of images generated by a model and uses them as
the centroids of a Gaussian mixture. The constructed Parzen window mixture is then used to compute a log-likelihood
score on a set of test examples.
Wu et al. [21] suggested using annealed importance sampling (AIS) [22] to estimate log-likelihoods under a Gaussian
observation model with fixed variance. The key drawback of this approach is the assumption of a Gaussian observation
model, which may not work well in high-dimensional spaces. They observed that AIS is two orders of magnitude more
accurate than KDE and is accurate enough for comparing generative models.
Due to the limitations of this metric, it is difficult to address even simple questions such as whether GANs are simply
memorizing training examples, or whether they are missing important modes of the data distribution.
2. Inception Score (IS): This metric was proposed by Salimans et al. [23] and is a widely used score for GAN evaluation.
It employs a pre-trained neural network to capture desirable properties of generated samples, namely that they are
highly classifiable and diverse with respect to class labels. It measures the average KL divergence between the
conditional label distribution p(y | x) of a sample and the marginal distribution p(y) obtained from all samples (see the
sketch after this list). It favours a low entropy of p(y | x) but a large entropy of p(y).
\exp(\mathbb{E}_x[\mathrm{KL}(p(y \mid x) \,\|\, p(y))]) = \exp\!\big(H(y) - \mathbb{E}_x[H(y \mid x)]\big) \quad (3)
where p(y | x) is the conditional label distribution for image x estimated using a pre-trained Inception model, and p(y)
is the marginal distribution. The IS shows a reasonable correlation with the quality and diversity of generated images
[23], and the IS over real images can serve as an upper bound.
However, IS has several limitations. It favours a "memory GAN" that stores all training samples and is thus unable to
detect over-fitting [24]. It fails to detect a model that has collapsed onto one bad mode. Since IS uses an Inception
model trained on ImageNet with many object classes, it may favour models that generate good objects rather than
realistic images. It only considers Pg and ignores Pr, so manipulations such as mixing in natural images from an
entirely different distribution can deceive the score. As a result, it may favour models that simply learn sharp and
diversified images instead of capturing Pr.
3. Mode Score: Che et al. [25] suggested this metric, which addresses an important drawback of the IS: it ignores the
prior distribution of the ground-truth labels.
\exp\!\big(\mathbb{E}_x[\mathrm{KL}(p(y \mid x) \,\|\, p(y^{train}))] - \mathrm{KL}(p(y) \,\|\, p(y^{train}))\big) \quad (4)
where p(y^{train}) is the empirical distribution of labels computed from the training data. The Mode Score adequately
reflects the variety and visual quality of generated images.
4. AM Score: Zhou et al. [26] argued that the entropy term on y in the IS is not suitable when the data are not evenly
distributed over classes. To take y^{train} into account, they proposed replacing H(y) with the KL divergence between
y^{train} and y. The AM score is then defined as:
\mathrm{KL}(p(y^{train}) \,\|\, p(y)) + \mathbb{E}_x[H(y \mid x)] \quad (5)
The AM score contains two terms. The first is minimized when y^{train} is close to y, and the second is minimized
when the predicted class distribution for sample x (i.e. p(y | x)) has low entropy. Thus, the smaller the AM score, the
better.
5. Fréchet Inception Distance (FID): Heusel et al. [27] defined the FID, which embeds a set of generated images into a
feature space given by a specific layer of the Inception network (or any CNN). Viewing the embedding layer as a
continuous multivariate Gaussian, the mean and covariance are estimated for both the generated data and the real data.
The Fréchet distance between these two Gaussians (a.k.a. the Wasserstein-2 distance) is then used to quantify the
quality of generated samples:
\mathrm{FID}(r, g) = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big) \quad (6)
where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the mean and covariance of the real-data and model distributions,
respectively. A lower FID means a smaller distance between the synthetic and real data distributions (see the sketch
after this list). FID performs well in terms of discriminability, robustness, and computational efficiency. It appears to
be a good measure, even though it only takes into account the first two moments of the distributions. However, it
assumes that the features follow a Gaussian distribution, which is often not guaranteed.
6. Maximum Mean Discrepancy (MMD): Fortet et al. [28] defined the MMD, which measures the dissimilarity between
two probability distributions Pr and Pg using samples drawn independently from each. A lower MMD means that Pg is
closer to Pr. MMD can be regarded as a two-sample test since, as in the classifier two-sample test, it tests whether one
model or another is closer to the true data distribution [29]. Such hypothesis tests allow choosing one evaluation
measure over another. The kernel MMD [30] measures the (squared) MMD between Pr and Pg for some fixed
characteristic kernel function k.
7. Image Retrieval Performance: Wang et al. [31] proposed image retrieval metrics to evaluate GANs. The main idea is
to investigate which images in the dataset are badly modelled by a network. Images from a held-out test set, as well as
generated images, are represented using a discriminatively trained CNN [32]. The nearest neighbours of generated
images in the test dataset are then retrieved. To evaluate the quality of the retrieval results, the following metrics have
been proposed. The first metric considers d^k_{i,j}, the distance of the j-th nearest image generated by method k to
test image i, and d^k_j = \{d^k_{1,j}, ..., d^k_{n,j}\}, the set of j-th-nearest distances to all n test images (j is often
set to 1). The Wilcoxon signed-rank test is then used to test the hypothesis that the median of the difference between
the two nearest-distance distributions produced by two generators is zero, in which case they are equally good. If they
are not equal, the test can be used to assess which method is statistically better. The second metric considers d^t_j,
the distribution of the j-th nearest distance of the training images to the test dataset. Since the training and test sets
are drawn from the same dataset, the distribution d^t_j can be considered the optimal distribution that a generator
could attain.
8. Generative Adversarial Metric (GAM): Im et al. [15] proposed to compare two GANs by having them engage in a
battle against each other, swapping discriminators or generators across the two models. GAM then measures the
relative performance of the two GANs by computing a likelihood ratio between the two models.
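To make the two most widely used quantitative scores above concrete, the following NumPy/SciPy sketch computes the
Inception Score from predicted class probabilities and the FID from feature means and covariances (Eqs. 3 and 6). In
practice both would be fed with the outputs of a pre-trained Inception network; here those outputs are assumed to be
given as arrays, and random stand-ins are used in the example.

import numpy as np
from scipy import linalg

def inception_score(probs, eps=1e-12):
    # probs: (N, C) class probabilities p(y|x) from a pre-trained classifier.
    # IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ).
    p_y = probs.mean(axis=0, keepdims=True)                   # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

def fid(feat_real, feat_gen):
    # feat_*: (N, D) feature embeddings (e.g. Inception pool3 activations).
    # FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
    mu_r, mu_g = feat_real.mean(0), feat_gen.mean(0)
    s_r = np.cov(feat_real, rowvar=False)
    s_g = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(s_r @ s_g).real                    # drop tiny imaginary parts
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(s_r + s_g - 2.0 * covmean))

# Example with random stand-in data (real use: classifier probabilities / network features).
rng = np.random.default_rng(0)
print(inception_score(rng.dirichlet(np.ones(10), size=100)))
print(fid(rng.normal(size=(200, 64)), rng.normal(size=(200, 64)) + 0.1))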
Some evaluation methods, such as MS-SSIM, aim to assess the diversity of the generated samples regardless of the data
distribution. While they are able to detect severe cases of mode collapse, these methods fall short in measuring how
well a generator captures the true data distribution.
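As a hedged illustration of a diversity check in this spirit, the sketch below averages the similarity of randomly chosen
pairs of generated images; a higher average similarity indicates lower diversity and, in the extreme, mode collapse. It
uses single-scale SSIM from scikit-image as a stand-in for MS-SSIM, and the pairing protocol is an assumption made
purely for illustration.

import numpy as np
from skimage.metrics import structural_similarity

def mean_pairwise_ssim(images, n_pairs=100, seed=0):
    # Diversity proxy: average SSIM over random pairs of generated images
    # (higher mean similarity = lower diversity). images: (N, H, W), values in [0, 1].
    rng = np.random.default_rng(seed)
    pairs = [rng.choice(len(images), size=2, replace=False) for _ in range(n_pairs)]
    scores = [structural_similarity(images[i], images[j], data_range=1.0) for i, j in pairs]
    return float(np.mean(scores))

# Example with random stand-in images (real use: samples drawn from a generator).
imgs = np.random.default_rng(1).random((50, 64, 64))
print(mean_pairwise_ssim(imgs))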
Kernel MMD works surprisingly well when it operates in the feature space of a pre-trained ResNet. It is consistently
able to distinguish generated or noise images from real images, and both its sample complexity and computational
complexity are low. Given these advantages, even though MMD is biased, it is still recommended.
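A minimal NumPy sketch of the (biased) squared kernel MMD estimate is given below; the Gaussian RBF kernel and
the bandwidth sigma are assumptions, and in the setting recommended above the rows of X and Y would be pre-trained
ResNet feature vectors of real and generated images, respectively.

import numpy as np

def squared_mmd_rbf(X, Y, sigma=1.0):
    # Biased estimate of squared kernel MMD between samples X ~ Pr and Y ~ Pg,
    # with RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    def rbf(A, B):
        d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma ** 2))
    return float(rbf(X, X).mean() + rbf(Y, Y).mean() - 2 * rbf(X, Y).mean())

# Example: feature vectors of real vs. generated samples (random stand-ins here).
rng = np.random.default_rng(0)
real, gen = rng.normal(size=(200, 128)), rng.normal(loc=0.2, size=(200, 128))
print(squared_mmd_rbf(real, gen))   # values closer to 0 mean Pg is closer to Pr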
References
[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville,
and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages
2672–2680, 2014.
[2] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[3] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier
gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2642–2651.
JMLR.org, 2017.
[4] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional
generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[5] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond
pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[6] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
[7] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training
of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
[8] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders.
arXiv preprint arXiv:1511.05644, 2015.
[9] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional
adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
1125–1134, 2017.
[10] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning
from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 2107–2116, 2017.
[11] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
[12] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The gan landscape: Losses,
architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.
[13] A Karazeev. Generative adversarial networks (gans): Engine and applications, 2017.
[14] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784,
2014.
[15] Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. Generating images with recurrent
adversarial networks. arXiv preprint arXiv:1602.05110, 2016.
[16] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable
representation learning by information maximizing generative adversarial nets. In Advances in neural information
processing systems, pages 2172–2180, 2016.
[17] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid
of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
[18] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv
preprint arXiv:1511.01844, 2015.
[19] Jake Snell, Karl Ridgeway, Renjie Liao, Brett D Roads, Michael C Mozer, and Richard S Zemel. Learning to
generate images with perceptual similarity metrics. In 2017 IEEE International Conference on Image Processing
(ICIP), pages 4277–4281. IEEE, 2017.
[20] Aude Oliva. Gist of the scene. In Neurobiology of attention, pages 251–256. Elsevier, 2005.
[21] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoder-based
generative models. arXiv preprint arXiv:1611.04273, 2016.
[22] Radford M Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
[23] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved
techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
[24] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from
visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
[25] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial
networks. arXiv preprint arXiv:1612.02136, 2016.
[26] Zhiming Zhou, Han Cai, Shu Rong, Yuxuan Song, Kan Ren, Weinan Zhang, Yong Yu, and Jun Wang. Activation
maximization generative adversarial nets. arXiv preprint arXiv:1703.02000, 2017.
[27] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing
Systems, pages 6626–6637, 2017.
[28] Robert Fortet and Edith Mourier. Convergence de la repartition empirique vers la repartition theorique. In Annales
scientifiques de Ecole Normale Superieure, volume 70, pages 267–285, 1953.
[29] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Scholkopf, et al. Kernel mean embedding
of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
[30] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Scholkopf, and Alexander Smola. A kernel
two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[31] Yaxing Wang, Lichao Zhang, and Joost van de Weijer. Ensembles of generative adversarial networks. arXiv
preprint arXiv:1612.00991, 2016.
[32] Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[33] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error
visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[34] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a
large-scale study. In Advances in neural information processing systems, pages 700–709, 2018.