

An Analysis of Evaluation Metrics of GANs

A Preprint

Hamed Alqahtani, Postgraduate Student, Macquarie University, [email protected]
Manolya Kavakli-Thorne, Associate Professor, Macquarie University, [email protected]
Gulshan Kumar, Assistant Professor, SBSSTC, Ferozepur, [email protected]

December 11, 2019

Abstract

Generative adversarial networks (GANs) have gained significant attention in recent years. A number
of GAN variants have been proposed and have been utilized in many applications. Despite a large
number of applications and developments in GANs, few works have studied the metrics that evaluate
GANs’ performance.
In this paper, we present a comprehensive analysis of the most commonly used evaluation metrics for measuring the performance of GANs. We define each metric, explain it mathematically, and analyze its pros and cons in the context of GANs. Based on our analysis, we observe that defining an appropriate metric for evaluating GAN performance is still an open problem, not only for fair model comparison but also for understanding, improving, and developing generative models. Overall, this study suggests that the choice of feature space in which to compute the various metrics is crucial. In addition, we suggest creating a code repository of evaluation metrics that enables comparative empirical and analytical studies of the available measures, so that models can be benchmarked under the same conditions using more than one metric in the future.

1 Introduction

Generative adversarial networks (GANs) are a recently developed technique for learning in both semi-supervised and unsupervised modes. These networks achieve this by implicitly modelling high-dimensional distributions of data. Goodfellow et al. [1] first introduced the adversarial process for learning generative models. The fundamental aspect of a GAN is a min-max two-person zero-sum game, in which one player's gain comes at the equivalent loss of the other player. Here, the players correspond to the two networks of the GAN, called the discriminator and the generator. The main objective of the discriminator is to determine whether a sample belongs to the real distribution or a fake distribution, whereas the generator aims to deceive the discriminator by generating a fake sample distribution. The discriminator outputs the probability that a given sample is real: a value close to one indicates that the sample is likely real, while a value close to zero indicates that it is likely fake. A probability value near 0.5 indicates an optimal solution, in which the discriminator is unable to differentiate fake samples from real ones.
GAN models can be divided into two categories, namely explicit and implicit. Explicit models assume access to the model likelihood function, whereas implicit models utilize a sampling mechanism to generate data. Explicit models include variational auto-encoders (VAEs) [2] and PixelCNN [3]; implicit models include GANs.
Generative adversarial networks (GANs) [1] have been studied extensively in recent years. Besides producing surprisingly plausible images of faces [4][5] and bedrooms [6][7], they have also been innovatively applied in, for example, semi-supervised learning [3][8], image-to-image translation [9], and simulated image refinement [10]. But, despite the availability of a plethora of GAN models, the way these models are evaluated is still largely qualitative, very often resorting to manual inspection of the visual fidelity of generated images. Such manual inspection is time-consuming, subjective, and potentially misleading.

Several evaluation metrics have been defined for measuring the performance of GAN models. In GANs, the objective
function for the generator and the discriminator usually measures how well they are doing relative to the opponent.
For example, we measure how well the generator is fooling the discriminator; this is not a good metric for measuring image quality or diversity.
More recently, researchers have focused on quantitative evaluation of GAN models alongside qualitative metrics. Both qualitative and quantitative metrics have their own pros and cons. Quantitative metrics are less subjective, but they do not map directly to how humans perceive and judge generated images. These issues, along with other problems such as the variety of probability criteria and the lack of a perceptually meaningful image similarity measure, have made evaluating generative models difficult and challenging [11].
However, there is no globally accepted agreement regarding the best GAN evaluation metric. Some researchers have proposed benchmarks for GANs [12]. These research efforts are beneficial for extending our understanding of GAN evaluation metrics and analyzing their pros and cons.

2 GAN and its Variants


As discussed earlier, Goodfellow et al. [1] introduced the adversarial process for learning generative models. The fundamental aspect of a GAN is a min-max two-person zero-sum game, in which one player's gain comes at the equivalent loss of the other player. The players correspond to the two networks of the GAN, called the discriminator and the generator: the discriminator determines whether a sample belongs to the real distribution or a fake distribution, whereas the generator aims to deceive the discriminator by generating a fake sample distribution. The general architecture of a GAN is shown in Figure 1.

Figure 1: The general architecture of GAN.

In this general architecture, a generative adversarial network consists of two networks, a discriminator and a generator, denoted D and G respectively.
1. The Generator (G): G is a network used to generate images from random noise Z; the generated images are denoted G(z). The input is commonly Gaussian noise, i.e., a random point in latent space. The parameters of both the G and D networks are updated iteratively during the training process of the GAN; in particular, the parameters of D are held fixed while training the G network. The output of the G network is labelled as fake and given to the D network as input for classification. The error is computed between the output of the discriminator and the label of the sample image, and is propagated back to update the weights of the G network. A few constraints may be imposed on the input of the G network, which can be added in the last layer of this network; in addition, noise can also be added to the hidden layers. There is no limit on the dimensionality of the input Z of the G network.
2. The Discriminator (D): D is a discriminative network that determines whether a given image belongs to the real distribution or not. It receives an input image X and produces the output D(x), representing the probability that X belongs to the real distribution. An output of 1 indicates a real image, while an output of 0 indicates that the image belongs to the fake distribution. During the training of D, the G network remains static. The D network takes a real image as well as a fake image generated by the G network as input, computes the error in its label prediction, and updates its weights on the basis of the back-propagated error.


The objective function of this two-player minimax game is given in Eq. 1.

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]   (1)


The major difference between discriminative and generative algorithms is that discriminative networks learn the
boundaries between classes (as the Discriminator does) while generative networks learn the distribution of classes (as
the Generator does) [13].
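To make the objective concrete, the following minimal sketch (PyTorch is an assumed framework; the function name gan_value and the epsilon for numerical stability are illustrative) estimates the value V(D, G) of Eq. 1 from a batch of discriminator outputs on real and generated samples.

```python
import torch

def gan_value(d_real_logits: torch.Tensor, d_fake_logits: torch.Tensor,
              eps: float = 1e-8) -> torch.Tensor:
    """Monte Carlo estimate of V(D, G) in Eq. (1) from raw discriminator logits."""
    d_real = torch.sigmoid(d_real_logits)   # D(x) for real samples x ~ p_data
    d_fake = torch.sigmoid(d_fake_logits)   # D(G(z)) for generated samples
    # E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    return torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean()
```

In training, D ascends this quantity while G descends it; in practice G often maximizes log D(G(z)) instead, to avoid vanishing gradients early in training, as noted in [1].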
With the passage of time, several GAN models have been developed. Some important developments are as follows.

• Conditional GAN (CGAN) [14]: GANs can be extended to a conditional model if both the G and D networks are conditioned on some extra information y, addressing the original model's limitation of depending only on random variables [14]. y could be any kind of auxiliary information, such as class labels or data from other modalities. The conditional information is added by feeding y into both the D and G networks as an additional input layer, as sketched in the example after this list.
In the G network, the prior input noise p_z(z) and y are combined in a joint hidden representation, and the adversarial training framework allows for considerable flexibility in how this hidden representation is composed [14]. In the D network, x and y are presented as inputs to a discriminative function. The objective function of the two-player minimax game then becomes Eq. 2.
\min_G \max_D V(D, G) = \mathbb{E}_{x,y \sim p_{data}(x,y)}[\log D(x, y)] + \mathbb{E}_{y \sim p_y,\, z \sim p_z(z)}[\log(1 - D(G(z, y), y))]   (2)


• Deep Convolutional Generative Adversarial Networks (DCGAN): Radford et al. [4] proposed a new class of CNNs, called Deep Convolutional Generative Adversarial Networks (DCGANs), with certain architectural constraints. These constraints involve three modifications to standard CNN architectures:
– Removing fully-connected hidden layers and replacing the pooling layers with strided convolutions on
the discriminator and fractional-strided convolutions on the generator
– Using batch normalization on both the generative and discriminative models
– Using ReLU activations in every layer of the generative model except the last layer and LeakyReLU
activations in all layers of the discriminative model
• Adversarial Autoencoders (AAE): Makhzani et al. [8] proposed the adversarial autoencoder, a probabilistic autoencoder that uses a GAN to perform variational inference by matching the aggregated posterior of the autoencoder's hidden code vector with an arbitrary prior distribution. The adversarial autoencoder is trained with dual objectives: a traditional reconstruction error criterion, and an adversarial training criterion that matches the aggregated posterior distribution of the latent representation to an arbitrary prior distribution. After training, the encoder learns to convert the data distribution to the prior distribution, while the decoder learns a deep generative model that maps the imposed prior to the data distribution.
• Generative Recurrent Adversarial Networks (GRAN): Im et al. [15] proposed a recurrent generative model, showing that unrolling gradient-based optimization yields a recurrent computation that creates images by incrementally adding to a visual “canvas”. Here, an “encoder” convolutional network extracts a code from the current “canvas”; the resulting code and the code for the reference image are fed to a “decoder”, which decides on an update to the “canvas”.
• Information Maximizing Generative Adversarial Networks (InfoGAN): Information maximizing GANs
(InfoGANs) [16] are an information-theoretic extension of GANs that are able to learn disentangled features
in a completely unsupervised manner. A disentangled representation is one which explicitly represents the
salient features of a data instance and can be useful for tasks such as face recognition and object recognition.
Here, InfoGANs modify the objective of GANs to learn meaningful representations by maximizing the mutual
information between a fixed small subset of GAN’s noise variables and observations.
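To illustrate the conditioning used by the CGAN described above, here is a minimal sketch (PyTorch-style; the helper name condition_inputs and the use of a one-hot label encoding are illustrative assumptions, not the authors' implementation) of how y is appended to the inputs of both networks.

```python
import torch
import torch.nn.functional as F

def condition_inputs(z: torch.Tensor, x_flat: torch.Tensor,
                     labels: torch.Tensor, num_classes: int):
    """Append a one-hot encoding of y to the generator input (noise z)
    and to the flattened discriminator input x, as in a conditional GAN."""
    y = F.one_hot(labels, num_classes=num_classes).float()
    g_input = torch.cat([z, y], dim=1)        # generator sees (z, y)
    d_input = torch.cat([x_flat, y], dim=1)   # discriminator sees (x, y)
    return g_input, d_input
```

Class labels are only one choice of y; any auxiliary modality can be embedded and concatenated in the same way.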

3 Desirable Characteristics of GANs Evaluation Metrics


With the development of GAN architectures and their increasing use in numerous real-life applications, several qualitative and quantitative metrics have been proposed for evaluating the performance of GANs. However, there still exists no globally accepted benchmark metric for evaluating the performance of a GAN architecture in all aspects.

Nevertheless, there are some essential and desirable characteristics of GAN evaluation metrics, as defined below. These characteristics enable a meta-measurement for evaluating and comparing GAN performance. An effective GAN evaluation metric should possess the following characteristics.

1. It should be able to distinguish generated samples from real ones, favouring models that generate high-fidelity samples.
2. It should be sensitive to over-fitting of the model.
3. It should be able to handle disentangled latent spaces as well as latent-space continuity.
4. It should have well-defined boundary values.
5. It must be sensitive to image distortions and transformations.
6. It must agree with human perceptual judgments and human rankings of models, and
7. It must have a low sample and computational complexity.

4 GAN Evaluation Metrics


This section describes the GAN evaluation metrics along with their definitions, advantages, and disadvantages. The GAN evaluation metrics can be categorized as quantitative or qualitative; a detailed description of these metrics is provided below.

4.1 Qualitative metrics

Visual investigation of images by humans is the most common and intuitive way to evaluate GANs [17], but it falls short in many ways. Firstly, it is costly and biased to examine the quality of generated images with human vision; it is also difficult to reproduce and does not fully reflect the capacity of models. Secondly, human inspection has a variance that makes it necessary to average over a large number of subjects. Thirdly, an evaluation based on samples could be biased towards models that over-fit and is therefore a poor indicator of a good density model in a log-likelihood sense [18]. The following methods have been used for measuring the performance of GANs qualitatively.

1. Nearest Neighbors: To detect over-fitting, some generated samples are traditionally compared against their nearest neighbours in the training set. There are two concerns with such an examination.
Firstly, nearest neighbours are typically determined on the basis of Euclidean distance, which is sensitive to minor perceptual perturbations; this is a well-known phenomenon in the psychophysics literature. It is trivial to generate samples that are visually almost identical to a training image but have a large Euclidean distance from it [18].
Secondly, a model that stores (transformed) training images can trivially pass the nearest-neighbor over-fitting test. This problem can be alleviated by choosing the nearest neighbours based on perceptual measures and by showing more than one nearest neighbour, as sketched in the example after this list.
2. Rating and Preference Judgment: These experiments invite subjects to rate models in terms of the fidelity of their generated images. For example, Snell et al. [19] studied whether observers prefer reconstructions produced by perceptually-optimized networks or by pixelwise-loss-optimized networks. Participants were shown image triplets with the original (reference) image in the centre and the SSIM- and MSE-optimized reconstructions on either side, with the locations counterbalanced.
3. Rapid Scene Categorization: These metrics are based on the fact that human beings can report certain characteristics of a scene after a short glance [20]. To obtain a quantitative measure of image quality, Denton et al. [17] invited volunteers to differentiate their generated images from real images, varying the viewing time from 50 ms to 2000 ms. They concluded that their model was better than the original GAN [1], since it did better at fooling the subjects (the lower bound here is 0% and the upper bound is 100%).
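As referenced in the nearest-neighbor discussion above, a minimal sketch of retrieving nearest training neighbours in a perceptual feature space follows (NumPy; the feature matrices are assumed to come from any pre-trained network, and the function name is illustrative).

```python
import numpy as np

def perceptual_nearest_neighbors(gen_feats: np.ndarray, train_feats: np.ndarray,
                                 k: int = 3) -> np.ndarray:
    """For each generated sample, return the indices of its k nearest training
    samples under Euclidean distance in the chosen feature space."""
    d2 = (np.sum(gen_feats ** 2, axis=1)[:, None]
          + np.sum(train_feats ** 2, axis=1)[None, :]
          - 2.0 * gen_feats @ train_feats.T)
    return np.argsort(d2, axis=1)[:, :k]
```

Showing several neighbours per sample (k > 1) makes memorization of training images easier to spot.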

4.2 Quantitative metrics

Several quantitative metrics have been proposed for measuring the performance of GANs, as described below.

1. Parzen Window Density: Parzen window estimation, or kernel density estimation (KDE), is a well-known method for estimating the density function of a distribution from samples. For a probability kernel K (most often an isotropic Gaussian) and i.i.d. samples X_1, X_2, ..., X_n, the density at x is estimated as p(x) \approx \frac{1}{Z} \sum_{i=1}^{n} K(x - X_i), where Z is a normalizing constant.


The Parzen window approach to density estimation takes a finite set of images generated by a model and uses them as the centroids of a Gaussian mixture. The constructed Parzen window mixture is then used to compute a log-likelihood score on a set of test examples; a minimal sketch is given below.
Wu et al. [21] suggested utilizing annealed importance sampling (AIS) [22] to estimate log-likelihoods using a Gaussian observation model with a fixed variance. The key drawback of this approach is the assumption of a Gaussian observation model, which may not work well in high-dimensional spaces. They observed that AIS is two orders of magnitude more accurate than KDE, and is accurate enough for comparing generative models.
Due to the limitations of this metric, it is difficult to address even simple questions such as whether GANs are simply memorizing training examples, or whether they are missing important modes of the data distribution.
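A minimal NumPy sketch of the Parzen window log-likelihood with an isotropic Gaussian kernel follows; the bandwidth sigma is a free parameter, usually chosen on a validation set, and the function name is illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_x: np.ndarray, gen_x: np.ndarray, sigma: float) -> np.ndarray:
    """Log-density of each test point under a Gaussian Parzen window
    centered on the generated samples gen_x of shape (n, d)."""
    diff = test_x[:, None, :] - gen_x[None, :, :]            # (n_test, n_gen, d)
    log_kernels = -0.5 * np.sum((diff / sigma) ** 2, axis=2)
    n, d = gen_x.shape
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return logsumexp(log_kernels, axis=1) - log_norm
```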
2. Inception Score (IS): This metric, proposed by Salimans et al. [23], is a widely used score for GAN evaluation. It employs a pre-trained neural network to capture desirable properties of generated samples, namely that they are highly classifiable and diverse with respect to class labels. It measures the average KL divergence between the conditional label distribution p(y | x) of samples and the marginal distribution p(y) obtained from all samples. It favors a low entropy of p(y | x) but a large entropy of p(y).

\exp(\mathbb{E}_x[KL(p(y \mid x) \,\|\, p(y))]) = \exp(H(y) - \mathbb{E}_x[H(y \mid x)])   (3)

where p(y | x) is the conditional label distribution for image x estimated using a pre-trained Inception model, and p(y) is the marginal distribution. The IS shows a reasonable correlation with the quality and diversity of generated images [23], and the IS over real images can serve as an upper bound.
However, the IS has several limitations. It favors a “memory GAN” that stores all training samples and is thus unable to detect over-fitting [24]. It fails to detect a model that has been trapped in a single bad mode. Since the IS uses an Inception model trained on ImageNet with many object classes, it may favor models that generate good objects rather than realistic images. It only considers P_g and ignores P_r; manipulations such as mixing in natural images from an entirely different distribution can deceive this score. As a result, it may favor models that simply learn sharp and diversified images, instead of P_r. A computation sketch from classifier outputs is given below.
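Given the softmax outputs p(y | x) of a pre-trained classifier on N generated images, Eq. 3 can be computed as in the following NumPy sketch (the averaging over several splits used in practice is omitted; the names are illustrative).

```python
import numpy as np

def inception_score(p_yx: np.ndarray, eps: float = 1e-12) -> float:
    """p_yx: (N, C) class probabilities for generated images from a pre-trained classifier."""
    p_y = p_yx.mean(axis=0, keepdims=True)        # marginal p(y)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))               # exp(E_x[KL(p(y|x) || p(y))])
```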
3. Mode score: Che et al. [25] suggested this metric, which addresses an important drawback of the IS: ignoring the prior distribution of the ground-truth labels.

\exp(\mathbb{E}_x[KL(p(y \mid x) \,\|\, p(y^{train}))] - KL(p(y) \,\|\, p(y^{train})))   (4)

where p(y^{train}) is the empirical distribution of labels computed from the training data. The mode score adequately reflects the variety and visual quality of generated images; a sketch is given below.
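The mode score additionally requires the empirical label distribution of the training data; a NumPy sketch following Eq. 4 as reconstructed above (both terms inside the exponential) is given below.

```python
import numpy as np

def mode_score(p_yx: np.ndarray, p_y_train: np.ndarray, eps: float = 1e-12) -> float:
    """p_yx: (N, C) class probabilities for generated images; p_y_train: (C,) label prior."""
    p_y = p_yx.mean(axis=0)
    kl_cond = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y_train + eps)), axis=1).mean()
    kl_marg = np.sum(p_y * (np.log(p_y + eps) - np.log(p_y_train + eps)))
    return float(np.exp(kl_cond - kl_marg))
```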
4. AM score: Zhou et al. [26] argued that the entropy term on y in the IS is not suitable when the data are not evenly distributed over classes. To take y^{train} into account, they proposed replacing H(y) with the KL divergence between y^{train} and y. The AM score is then defined as:

KL(p(y^{train}) \,\|\, p(y)) + \mathbb{E}_x[H(y \mid x)]   (5)

The AM score contains two factors: the first is minimized when y^{train} is close to y, and the second is minimized when the predicted class label for sample x (i.e., y | x) has low entropy. Thus, the smaller the AM score, the better.
5. Fréchet Inception Distance (FID): Heusel et al. [27] defined the FID, which embeds a set of generated images into a feature space given by a specific layer of an Inception network (or any CNN). Viewing the embedding layer as a continuous multivariate Gaussian, the mean and covariance are estimated for both the generated data and the real data. The Fréchet distance between these two Gaussians (a.k.a. the Wasserstein-2 distance) is then used to quantify the quality of generated samples, as in the following equation.
FID(r, g) = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)   (6)

where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the mean and covariance of the real data and model distributions, respectively.
A lower FID means a smaller distance between the synthetic and real data distributions. FID performs well in terms of discriminability, robustness, and computational efficiency. It appears to be a good measure, even though it only takes into consideration the first two moments of the distributions. However, it assumes that the features follow a Gaussian distribution, which is often not guaranteed. A sketch of the computation from feature statistics is given below.
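A sketch of Eq. 6 from feature activations follows (NumPy/SciPy; the choice of feature extractor and the handling of numerically induced imaginary parts are implementation assumptions).

```python
import numpy as np
from scipy import linalg

def fid(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    """Eq. (6) computed from (N, d) feature activations of real and generated images."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                 # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean))
```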
6. Maximum Mean Discrepancy (MMD): Fortet et al. [28] defined the MMD, which measures the dissimilarity between two probability distributions P_r and P_g using samples drawn independently from each. A lower MMD hence means that P_g is closer to P_r.

MMD can be regarded as a two-sample test since, as in a classifier two-sample test, it tests whether one model or another is closer to the true data distribution [29]. Such hypothesis tests allow choosing one evaluation measure over another. The kernel MMD [30] measures the (squared) MMD between P_r and P_g for some fixed characteristic kernel function k; a sketch with a Gaussian kernel is given below.
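A sketch of the (biased) squared kernel MMD estimate with a Gaussian kernel follows; the kernel bandwidth sigma is a free parameter and the function names are illustrative.

```python
import numpy as np

def _gaussian_kernel(a: np.ndarray, b: np.ndarray, sigma: float) -> np.ndarray:
    d2 = (np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_biased(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased estimate of the squared kernel MMD between samples x ~ P_r and y ~ P_g."""
    k_xx = _gaussian_kernel(x, x, sigma).mean()
    k_yy = _gaussian_kernel(y, y, sigma).mean()
    k_xy = _gaussian_kernel(x, y, sigma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

As noted in Section 5, computing the kernel in the feature space of a pre-trained network rather than in pixel space tends to work better.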
7. Image Retrieval Performance: Wang et al. [31] proposed image retrieval metrics to evaluate GANs. The main idea is to investigate which images in the dataset are badly modeled by a network. Images from a held-out test set as well as generated images are represented using a discriminatively trained CNN [32], and the nearest neighbors of generated images in the test dataset are retrieved. To evaluate the quality of the retrieval results, the following metrics have been proposed. The first metric considers d^k_{i,j} to be the distance of the j-th nearest image generated by method k to test image i, and d^k_j = {d^k_{1,j}, ..., d^k_{n,j}} to be the set of j-th-nearest distances to all n test images (j is often set to 1). The Wilcoxon signed-rank test is then used to test the hypothesis that the median of the difference between the nearest-distance distributions of two generators is zero, in which case they are equally good. If they are not equal, the test can be used to assess which method is statistically better. The second metric considers d^t_j to be the distribution of the j-th nearest distance of the training images to the test dataset. Since the train and test sets are drawn from the same dataset, the distribution d^t_j can be considered the optimal distribution that a generator could attain. A sketch of the first test is given below.
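A sketch of the first test follows (NumPy/SciPy; the features are assumed to come from the discriminatively trained CNN mentioned above, and the function names are illustrative).

```python
import numpy as np
from scipy.stats import wilcoxon

def nearest_gen_distance(test_feats: np.ndarray, gen_feats: np.ndarray) -> np.ndarray:
    """Distance from each test image to its nearest generated image (j = 1),
    computed in the CNN feature space."""
    d2 = (np.sum(test_feats ** 2, axis=1)[:, None]
          + np.sum(gen_feats ** 2, axis=1)[None, :]
          - 2.0 * test_feats @ gen_feats.T)
    return np.sqrt(np.maximum(d2, 0.0)).min(axis=1)

def compare_generators(test_feats, gen_feats_a, gen_feats_b):
    """Wilcoxon signed-rank test on paired nearest-distance differences of two generators;
    the null hypothesis is that the median difference is zero (equally good generators)."""
    diffs = (nearest_gen_distance(test_feats, gen_feats_a)
             - nearest_gen_distance(test_feats, gen_feats_b))
    return wilcoxon(diffs)
```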
8. Generative Adversarial Metric (GAM): Im et al. [15] proposed comparing two GANs by having them engage in a battle against each other, swapping discriminators or generators across the two models. GAM measures the relative performance of two GANs via the likelihood ratio of the two swapped models, defined as in the following equation.

\frac{p(x \mid y = 1; M_1')}{p(x \mid y = 1; M_2')} = \frac{p(y = 1 \mid x; D_1)\, p(x; G_2)}{p(y = 1 \mid x; D_2)\, p(x; G_1)}   (7)
where M_1' and M_2' are the swapped pairs (D_1, G_2) and (D_2, G_1), p(x | y = 1; M) is the likelihood of x being generated from the data distribution p(x) by model M, and p(y = 1 | x; D) indicates that discriminator D considers x to be a real sample.
GAM suffers from two main caveats: (a) it requires the two discriminators to have approximately similar performance on a calibration dataset, which can be difficult to satisfy in practice, and (b) it is expensive to compute because it has to be computed for all pairs of models.
9. Image Quality Measures: Some researchers have proposed using measures from the image quality assessment literature for training and evaluating GANs, as described below.
• SSIM: Wang et al. [33] proposed the single-scale SSIM metric, a well-characterized perceptual similarity measure that aims to discount aspects of an image that are not important for human perception. It compares corresponding pixels and their neighborhoods in two images, denoted x and y, using three quantities: luminance (I), contrast (C), and structure (S).

I(x, y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}   (8)

C(x, y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}   (9)

S(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}   (10)

The variables \mu_x, \mu_y, \sigma_x, and \sigma_y denote the means and standard deviations of pixel intensity in a local image patch centered at either x or y, and \sigma_{xy} denotes the sample correlation coefficient between corresponding pixels in the patches centered at x and y. The constants C_1, C_2, and C_3 are small values added for numerical stability. The three quantities are combined to form the SSIM score:

SSIM(x, y) = I(x, y)^{\alpha}\, C(x, y)^{\beta}\, S(x, y)^{\gamma}   (11)
• PSNR: This measures the peak signal-to-noise ratio between two monochrome images I and K in order to assess the quality of a generated image compared to its corresponding real image. The higher the PSNR (in dB), the better the quality of the generated image. It is computed as:

PSNR(I, K) = 10 \log_{10}\!\left(\frac{MAX_I^2}{MSE_{I,K}}\right)   (12)

= 20 \log_{10}(MAX_I) - 10 \log_{10}(MSE_{I,K})   (13)

where

MSE_{I,K} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} (I(i, j) - K(i, j))^2   (14)

and MAX_I is the maximum possible pixel value of the image. This score can be used when a reference image is available, for example when training conditional GANs using paired data.
• Sharpness Difference (SD): This measures the loss of sharpness during image generation. It is computed as:

SD(I, K) = 10 \log_{10}\!\left(\frac{MAX_I^2}{GRADS_{I,K}}\right)   (15)

Odena et al. [3] used MS-SSIM to evaluate the diversity of generated images. The intuition is that image pairs with a higher MS-SSIM appear more similar than pairs with a lower MS-SSIM. They measured the MS-SSIM scores of 100 randomly chosen pairs of images within a given class: the higher (lower) the diversity within a class, the lower (higher) the mean MS-SSIM score. Simple sketches of PSNR and a whole-image SSIM are given below.
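Simple NumPy sketches of PSNR (Eq. 12) and a whole-image simplification of SSIM (Eq. 11 with alpha = beta = gamma = 1 and C3 = C2/2; the standard metric uses local windows) follow; the constants assume 8-bit images.

```python
import numpy as np

def psnr(ref: np.ndarray, gen: np.ndarray, max_val: float = 255.0) -> float:
    """Eq. (12): peak signal-to-noise ratio (in dB) between a reference and a generated image."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def global_ssim(x: np.ndarray, y: np.ndarray, c1: float = 6.5025, c2: float = 58.5225) -> float:
    """Whole-image SSIM; c1 = (0.01 * 255)^2 and c2 = (0.03 * 255)^2 for 8-bit images."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cxy + c2)) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```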
10. Precision, Recall, and F-score: Lucic et al. [34] proposed computing precision, recall, and F1 score to quantify the degree of over-fitting in GANs. Intuitively, precision measures the quality of the generated samples, whereas recall measures the proportion of the reference distribution covered by the learned distribution. They argue that the IS only captures precision, since it does not penalize a model for not producing all modes of the data distribution; rather, it only penalizes the model for not producing all classes. The FID score, on the other hand, captures both precision and recall.
To approximate these scores for a model, Lucic et al. [34] proposed using toy datasets for which the data manifold is known and the distances of generated samples to the manifold can be computed. An example of such a dataset is the manifold of convex shapes.
Precision is defined as the fraction of generated samples whose distance to the manifold is below a certain threshold. Recall, on the other hand, is given by the fraction of test samples whose L2 distance to G(z) is below the threshold. If the samples from the model distribution P_g are (on average) close to the manifold, its precision is high. Similarly, high recall implies that the generator can recover (i.e., generate something close to) any sample from the manifold, thus capturing most of the manifold.
The major drawback of these scores is that they are impractical for real images, where the data manifold is unknown, so their use is limited to evaluations on synthetic data. A sketch of the threshold-based scores is given below.
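A sketch of the threshold-based precision, recall, and F1 described above follows (NumPy; the distance arrays are assumed to be precomputed on a toy dataset with a known manifold, and the function name is illustrative).

```python
import numpy as np

def precision_recall_f1(dist_gen_to_manifold: np.ndarray,
                        dist_test_to_gen: np.ndarray,
                        threshold: float):
    """Precision: fraction of generated samples within `threshold` of the known manifold.
    Recall: fraction of test samples within `threshold` of some generated sample."""
    precision = float(np.mean(dist_gen_to_manifold <= threshold))
    recall = float(np.mean(dist_test_to_gen <= threshold))
    f1 = 0.0 if precision + recall == 0.0 else 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```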

5 Pros and Cons


Based on the above analysis, the advantages and inherent limitations of the most significant evaluation metrics can be summarized, along with the conditions under which they produce meaningful results. Some metrics enable us to study the problem of over-fitting, perform model selection on GAN models, and compare GAN models without resorting to human evaluation based on selected samples.
There is no consensus regarding the best score. Different scores assess various aspects of the image generation process, and it is unlikely that a single score can cover all aspects. Nevertheless, some measures seem more plausible than others. Quality metrics such as nearest-neighbor visualizations or rapid categorization tasks may favor models that over-fit. Overall, the main challenge is to have a measure that evaluates both diversity and visual fidelity simultaneously: the former implies that all modes are covered, while the latter implies that the generated samples should have high likelihood.
Parzen windows estimation of likelihood favors trivial models and is irrelevant to visual fidelity of samples. Further, it
fails to approximate the true likelihood in high dimensional spaces or to rank models.
Two widely accepted scores, the IS and FID, rely on pre-trained deep networks to represent and statistically compare original and generated samples. The IS does show a reasonable correlation with the quality and diversity of generated images, which explains its wide usage in practice. However, it is ill-posed, mostly because it only evaluates P_g as an image generation model rather than its similarity to P_r. Blunt violations, such as mixing in natural images from an entirely different distribution, completely deceive the IS. As a result, it may encourage models to simply learn sharp and diversified images, instead of P_r.


Some evaluation methods, such as MS-SSIM, aim to assess the diversity of the generated samples regardless of the data distribution. While able to detect severe cases of mode collapse, these methods fall short in measuring how well a generator captures the true data distribution.
Kernel MMD works surprisingly well when it operates in the feature space of a pre-trained ResNet. It is always able to identify generated/noise images from real images, and both its sample complexity and computational complexity are low. Given these advantages, even though MMD is biased, it is still recommended.

6 Summary and Future Research Directions


This paper presented an analysis of the most significant and commonly used GAN evaluation metrics, highlighting their pros and cons. It can be observed that defining an appropriate metric for evaluating GAN performance is still an open problem, not only for fair model comparison but also for understanding, improving, and developing generative models. Recently, Lucic et al. [34] found no empirical evidence in favour of GAN models that claimed superiority over the original GAN. In this regard, borrowing from other fields such as natural scene statistics and cognitive vision can be rewarding.
Overall, this study suggests that the choice of feature space in which to compute the various metrics is crucial. In addition, we suggest creating a code repository of evaluation metrics that enables comparative empirical and analytical studies of the available measures, so that models can be benchmarked under the same conditions using more than one metric in the future.

References
[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville,
and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages
2672–2680, 2014.
[2] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[3] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier
gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2642–2651.
JMLR.org, 2017.
[4] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional
generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[5] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond
pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[6] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
[7] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training
of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777, 2017.
[8] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders.
arXiv preprint arXiv:1511.05644, 2015.
[9] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional
adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
1125–1134, 2017.
[10] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning
from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 2107–2116, 2017.
[11] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
[12] Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The gan landscape: Losses,
architectures, regularization, and normalization. arXiv preprint arXiv:1807.04720, 2018.
[13] A Karazeev. Generative adversarial networks (gans): Engine and applications, 2017.
[14] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784,
2014.


[15] Daniel Jiwoong Im, Chris Dongjoo Kim, Hui Jiang, and Roland Memisevic. Generating images with recurrent
adversarial networks. arXiv preprint arXiv:1602.05110, 2016.
[16] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable
representation learning by information maximizing generative adversarial nets. In Advances in neural information
processing systems, pages 2172–2180, 2016.
[17] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid
of adversarial networks. In Advances in neural information processing systems, pages 1486–1494, 2015.
[18] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv
preprint arXiv:1511.01844, 2015.
[19] Jake Snell, Karl Ridgeway, Renjie Liao, Brett D Roads, Michael C Mozer, and Richard S Zemel. Learning to
generate images with perceptual similarity metrics. In 2017 IEEE International Conference on Image Processing
(ICIP), pages 4277–4281. IEEE, 2017.
[20] Aude Oliva. Gist of the scene. In Neurobiology of attention, pages 251–256. Elsevier, 2005.
[21] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoder-based
generative models. arXiv preprint arXiv:1611.04273, 2016.
[22] Radford M Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
[23] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved
techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
[24] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from
visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
[25] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial
networks. arXiv preprint arXiv:1612.02136, 2016.
[26] Zhiming Zhou, Han Cai, Shu Rong, Yuxuan Song, Kan Ren, Weinan Zhang, Yong Yu, and Jun Wang. Activation
maximization generative adversarial nets. arXiv preprint arXiv:1703.02000, 2017.
[27] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing
Systems, pages 6626–6637, 2017.
[28] Robert Fortet and Edith Mourier. Convergence de la repartition empirique vers la repartition theorique. In Annales
scientifiques de Ecole Normale Superieure, volume 70, pages 267–285, 1953.
[29] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Scholkopf, et al. Kernel mean embedding
of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
[30] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Scholkopf, and Alexander Smola. A kernel
two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[31] Yaxing Wang, Lichao Zhang, and Joost van de Weijer. Ensembles of generative adversarial networks. arXiv
preprint arXiv:1612.00991, 2016.
[32] Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[33] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error
visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[34] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are gans created equal? a
large-scale study. In Advances in neural information processing systems, pages 700–709, 2018.
