
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Variational Deep Embedding:
An Unsupervised and Generative Approach to Clustering
Zhuxi Jiang¹, Yin Zheng², Huachun Tan¹, Bangsheng Tang³, Hanning Zhou³
¹ Beijing Institute of Technology, Beijing, China
² Tencent AI Lab, Shenzhen, China
³ Hulu LLC., Beijing, China
{zjiang, tanhc}@bit.edu.cn, [email protected], [email protected], [email protected]
Abstract
Clustering is among the most fundamental tasks in machine learning and artificial intelligence. In this paper, we propose Variational Deep Embedding (VaDE), a novel unsupervised generative clustering approach within the framework of the Variational Auto-Encoder (VAE). Specifically, VaDE models the data generative procedure with a Gaussian Mixture Model (GMM) and a deep neural network (DNN): 1) the GMM picks a cluster; 2) from which a latent embedding is generated; 3) then the DNN decodes the latent embedding into an observable. Inference in VaDE is done in a variational way: a different DNN is used to encode observables into latent embeddings, so that the evidence lower bound (ELBO) can be optimized using the Stochastic Gradient Variational Bayes (SGVB) estimator and the reparameterization trick. Quantitative comparisons with strong baselines are included in this paper, and experimental results show that VaDE significantly outperforms the state-of-the-art clustering methods on 5 benchmarks from various modalities. Moreover, owing to VaDE's generative nature, we show its capability of generating highly realistic samples for any specified cluster, without using supervised information during training.

Figure 1: The diagram of VaDE. The data generative process of VaDE is as follows: 1) a cluster is picked from a GMM model; 2) a latent embedding is generated based on the picked cluster; 3) DNN f(z; θ) decodes the latent embedding into an observable x. An encoder network g(x; φ) is used to maximize the ELBO of VaDE.

1 Introduction

Clustering is the process of grouping similar objects together, one of the most fundamental tasks in machine learning and artificial intelligence. Over the past decades, a large family of clustering algorithms has been developed and successfully applied to numerous real-world tasks [Ng et al., 2002; Ye et al., 2008; Yang et al., 2010; Xie et al., 2016]. Generally speaking, there is a dichotomy of clustering methods: similarity-based clustering and feature-based clustering. Similarity-based clustering builds models upon a distance matrix, an N × N matrix that measures the distance between each pair of the N samples. One of the most famous similarity-based clustering methods is Spectral Clustering (SC) [Von Luxburg, 2007], which leverages the Laplacian spectrum of the distance matrix to reduce dimensionality before clustering. Similarity-based clustering methods have the advantage that domain-specific similarity or kernel functions can be easily incorporated into the models, but they suffer from scalability issues due to the super-quadratic running time of computing spectra.

Different from similarity-based methods, a feature-based method takes an N × D matrix as input, where N is the number of samples and D is the feature dimension. One popular feature-based clustering method is K-means, which aims to partition the samples into K clusters so as to minimize the within-cluster sum of squared errors. Another representative feature-based clustering model is the Gaussian Mixture Model (GMM), which assumes that the data points are generated from a Mixture-of-Gaussians (MoG) and whose parameters are optimized by the Expectation Maximization (EM) algorithm. One advantage of GMM over K-means is that a GMM can generate samples by estimating the data density. Although K-means, GMM and their variants [Ye et al., 2008; Liu et al., 2010] have been extensively used, learning representations that are well suited for clustering remains largely unexplored.
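As a concrete illustration of the point that a GMM fitted by EM can act as a generative model, the following minimal sketch (using scikit-learn on made-up toy data; the dataset and variable names are assumptions for illustration only) clusters feature vectors and then draws new samples from the estimated density:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy feature matrix: N samples with D features (stand-in for real data).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(5.0, 1.0, size=(500, 2))])

# Fit a Mixture-of-Gaussians with EM; each component is one cluster.
gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
gmm.fit(X)

labels = gmm.predict(X)            # hard cluster assignments
posteriors = gmm.predict_proba(X)  # soft assignments p(c|x)

# Unlike K-means, the fitted density can generate new samples.
X_new, c_new = gmm.sample(10)
```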


Recently, deep learning has achieved widespread success in numerous machine learning tasks [Krizhevsky et al., 2012; Zheng et al., 2014b; Szegedy et al., 2015; Zheng et al., 2014a; He et al., 2016; Zheng et al., 2015; 2016], where learning good representations with deep neural networks (DNNs) lies at the core. Taking a similar approach, it is conceivable to conduct clustering analysis on good representations instead of raw data points. In a recent work, Deep Embedded Clustering (DEC) [Xie et al., 2016] was proposed to simultaneously learn feature representations and cluster assignments by deep neural networks. Although DEC performs well in clustering, similar to K-means it cannot model the generative process of the data and hence is not able to generate samples. Some recent works, e.g. VAE [Kingma and Welling, 2014], GAN [Goodfellow et al., 2014], PixelRNN [Oord et al., 2016], InfoGAN [Chen et al., 2016] and PPGN [Nguyen et al., 2016], have shown that neural networks can be trained to generate meaningful samples. The motivation of this work is to develop a clustering model based on neural networks that 1) learns good representations that capture the statistical structure of the data, and 2) is capable of generating samples.

In this paper, we propose a clustering framework, Variational Deep Embedding (VaDE), that combines VAE [Kingma and Welling, 2014] and a Gaussian Mixture Model for clustering tasks. VaDE models the data generative process by a GMM and a DNN f: 1) a cluster is picked by the GMM; 2) from which a latent representation z is sampled; 3) DNN f decodes z into an observation x. Moreover, VaDE is optimized by using another DNN g to encode observed data x into a latent embedding z, so that the Stochastic Gradient Variational Bayes (SGVB) estimator and the reparameterization trick [Kingma and Welling, 2014] can be used to maximize the evidence lower bound (ELBO). VaDE generalizes VAE in that a Mixture-of-Gaussians prior replaces the single Gaussian prior. Hence, VaDE is by design more suitable for clustering tasks¹. Specifically, the main contributions of the paper are:

• We propose an unsupervised generative clustering framework, VaDE, that combines VAE and GMM together.
• We show how to optimize VaDE by maximizing the ELBO using the SGVB estimator and the reparameterization trick.
• Experimental results show that VaDE outperforms the state-of-the-art clustering models on 5 datasets from various modalities by a large margin.
• We show that VaDE can generate highly realistic samples for any specified cluster, without using supervised information during training.

The diagram of VaDE is illustrated in Figure 1.

2 Related Work

Recently, it has been found that learning good representations plays an important role in clustering tasks. For example, DEC [Xie et al., 2016] was proposed to learn feature representations and cluster assignments simultaneously by deep neural networks. In fact, DEC learns a mapping from the observed space to a lower-dimensional latent space, where it iteratively optimizes the KL divergence to minimize the within-cluster distance of each cluster. DEC achieved impressive performance on clustering tasks. However, the feature embedding in DEC is designed specifically for clustering and fails to uncover the real underlying structure of the data, which prevents the model from being extended to tasks beyond clustering, such as generating samples.

Deep generative models have recently attracted much attention because they can capture the data distribution with neural networks, from which unseen samples can be generated. GAN and VAE are among the most successful deep generative models of recent years. Both are appealing unsupervised generative models, and their variants have been extensively studied and applied in various tasks such as semi-supervised classification [Kingma et al., 2014; Maaløe et al., 2016; Salimans et al., 2016; Makhzani et al., 2016; Abbasnejad et al., 2016], clustering [Makhzani et al., 2016] and image generation [Radford et al., 2016; Dosovitskiy and Brox, 2016].

For example, [Abbasnejad et al., 2016] proposed to use a mixture of VAEs for semi-supervised classification tasks, where the mixing coefficients of the VAEs are modeled by a Dirichlet process to adapt the model capacity to the input data. SB-VAE [Nalisnick and Smyth, 2016] also applied Bayesian nonparametric techniques to VAE, deriving a stochastic latent dimensionality from a stick-breaking prior and achieving good performance on semi-supervised classification tasks. VaDE differs from SB-VAE in that the cluster assignment and the latent representation are jointly considered in the Gaussian mixture prior, whereas SB-VAE separately models the latent representation and the class variable, and thus fails to capture the dependence between them. Additionally, VaDE does not need class labels during training, while SB-VAE requires labeled data due to its semi-supervised setting. Among the variants of VAE, the Adversarial Auto-Encoder (AAE) [Makhzani et al., 2016] can also perform unsupervised clustering. Different from VaDE, AAE uses a GAN to match the aggregated posterior with the prior of the VAE, which makes its training procedure much more complex than that of VaDE. We compare AAE with VaDE in the experiments.

Similar to VaDE, [Nalisnick et al., 2016] proposed DLGMM to combine VAE and GMM. The crucial difference, however, is that VaDE uses a Mixture-of-Gaussians prior to replace the single Gaussian prior of VAE, which is naturally suitable for clustering tasks, while DLGMM uses a mixture of Gaussian distributions as the approximate posterior of VAE and does not model the class variable. Hence, VaDE generalizes VAE to clustering tasks, whereas DLGMM is designed to improve the capacity of the original VAE and is not suitable for clustering by design. The recently proposed GM-CVAE [Shu et al., 2016] also combines VAE with GMM; however, the GMM in GM-CVAE is used to model the transitions between video frames, which is the main difference from VaDE.

¹ Although VaDE can also be used for unsupervised feature learning or semi-supervised learning tasks, we focus only on clustering in this work.


3 Variational Deep Embedding

In this section, we describe Variational Deep Embedding (VaDE), a model for the probabilistic clustering problem within the framework of the Variational Auto-Encoder (VAE).

3.1 The Generative Process

Since VaDE is an unsupervised generative approach to clustering, we first describe its generative process. Specifically, suppose there are K clusters; an observed sample x ∈ R^D is generated by the following process:

1. Choose a cluster c ∼ Cat(π)
2. Choose a latent vector z ∼ N(µ_c, σ_c² I)
3. Choose a sample x:
   (a) If x is binary:
       i. Compute the expectation vector µ_x:
          µ_x = f(z; θ)   (1)
       ii. Choose a sample x ∼ Ber(µ_x)
   (b) If x is real-valued:
       i. Compute µ_x and σ_x²:
          [µ_x; log σ_x²] = f(z; θ)   (2)
       ii. Choose a sample x ∼ N(µ_x, σ_x² I)

where K is a predefined parameter, π_k is the prior probability of cluster k with π ∈ R_+^K and Σ_{k=1}^{K} π_k = 1, Cat(π) is the categorical distribution parametrized by π, µ_c and σ_c² are the mean and variance of the Gaussian distribution corresponding to cluster c, I is the identity matrix, f(z; θ) is a neural network with input z parametrized by θ, and Ber(µ_x) and N(µ_x, σ_x² I) are the multivariate Bernoulli and Gaussian distributions parametrized by µ_x and (µ_x, σ_x), respectively. The generative process is depicted in Figure 1.

According to the generative process above, the joint probability p(x, z, c) can be factorized as

p(x, z, c) = p(x|z) p(z|c) p(c),   (3)

since x and c are independent conditioned on z. The probabilities are defined as:

p(c) = Cat(c|π)   (4)
p(z|c) = N(z | µ_c, σ_c² I)   (5)
p(x|z) = Ber(x|µ_x) or N(x | µ_x, σ_x² I)   (6)

3.2 Variational Lower Bound

A VaDE instance is tuned to maximize the likelihood of the given data points. Given the generative process in Section 3.1, by using Jensen's inequality, the log-likelihood of VaDE can be written as

log p(x) = log ∫_z Σ_c p(x, z, c) dz ≥ E_{q(z,c|x)}[ log ( p(x, z, c) / q(z, c|x) ) ] = L_ELBO(x),   (7)

where L_ELBO is the evidence lower bound (ELBO) and q(z, c|x) is the variational posterior that approximates the true posterior p(z, c|x). In VaDE, we assume q(z, c|x) to be a mean-field distribution that factorizes as

q(z, c|x) = q(z|x) q(c|x).   (8)

Then, according to Equations 3 and 8, L_ELBO(x) in Equation 7 can be rewritten as:

L_ELBO(x) = E_{q(z,c|x)}[ log ( p(x, z, c) / q(z, c|x) ) ]
          = E_{q(z,c|x)}[ log p(x, z, c) − log q(z, c|x) ]
          = E_{q(z,c|x)}[ log p(x|z) + log p(z|c) + log p(c) − log q(z|x) − log q(c|x) ]   (9)

In VaDE, similar to VAE, we use a neural network g to model q(z|x):

[µ̃; log σ̃²] = g(x; φ)   (10)
q(z|x) = N(z; µ̃, σ̃² I)   (11)

where φ is the parameter of network g.

By substituting the terms in Equation 9 with Equations 4, 5, 6 and 11, and using the SGVB estimator and the reparameterization trick, L_ELBO(x) can be rewritten as²

L_ELBO(x) = (1/L) Σ_{l=1}^{L} Σ_{i=1}^{D} [ x_i log µ_x^(l)|_i + (1 − x_i) log(1 − µ_x^(l)|_i) ]
          − (1/2) Σ_{c=1}^{K} γ_c Σ_{j=1}^{J} ( log σ_c²|_j + σ̃²|_j / σ_c²|_j + (µ̃|_j − µ_c|_j)² / σ_c²|_j )
          + Σ_{c=1}^{K} γ_c log(π_c / γ_c) + (1/2) Σ_{j=1}^{J} (1 + log σ̃²|_j)   (12)

where L is the number of Monte Carlo samples in the SGVB estimator, D is the dimensionality of x and µ_x^(l), x_i is the i-th element of x, J is the dimensionality of µ_c, σ_c², µ̃ and σ̃², ∗|_j denotes the j-th element of ∗, K is the number of clusters, π_c is the prior probability of cluster c, and γ_c denotes q(c|x) for simplicity.

² This is the case when the observation x is binary. For the real-valued case, the ELBO can be obtained in a similar way.

In Equation 12, we compute µ_x^(l) as

µ_x^(l) = f(z^(l); θ),   (13)

where z^(l) is the l-th sample from q(z|x) by Equation 11, used to produce the Monte Carlo estimate. According to the reparameterization trick, z^(l) is obtained by

z^(l) = µ̃ + σ̃ ∘ ε^(l),   (14)

where ε^(l) ∼ N(0, I), ∘ denotes element-wise multiplication, and µ̃, σ̃ are computed by Equation 10.
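For concreteness, here is a hedged NumPy sketch of a single-sample (L = 1) evaluation of Equation 12 using the reparameterization of Equation 14. It assumes diagonal covariances and a binary x; the weights γ_c (defined by Equation 16 below) and the encoder/decoder outputs are passed in as plain arrays, and the function name is my own rather than the authors':

```python
import numpy as np

def vade_elbo_single_sample(x, pi, mu_c, var_c, mu_t, var_t, gamma, decode, rng):
    """One-sample Monte Carlo estimate of Equation 12 (binary x, diagonal covariances).

    x:      (D,) binary observation
    pi:     (K,) prior cluster probabilities
    mu_c:   (K, J) cluster means,  var_c: (K, J) cluster variances
    mu_t:   (J,) encoder mean mu~, var_t: (J,) encoder variance sigma~^2
    gamma:  (K,) q(c|x), see Equation 16
    decode: function z -> Bernoulli means mu_x, standing in for f(z; theta)
    """
    eps = rng.randn(*mu_t.shape)
    z = mu_t + np.sqrt(var_t) * eps                        # Equation 14
    mu_x = np.clip(decode(z), 1e-10, 1 - 1e-10)            # Equation 13

    # Reconstruction term (first line of Equation 12, with L = 1).
    recon = np.sum(x * np.log(mu_x) + (1 - x) * np.log(1 - mu_x))

    # -1/2 * sum_c gamma_c * sum_j (log var_c + var~/var_c + (mu~ - mu_c)^2 / var_c)
    kl_gauss = -0.5 * np.sum(
        gamma[:, None] * (np.log(var_c) + var_t / var_c + (mu_t - mu_c) ** 2 / var_c))

    # sum_c gamma_c log(pi_c / gamma_c)  +  1/2 * sum_j (1 + log var~)
    kl_cat = np.sum(gamma * np.log(pi / np.clip(gamma, 1e-10, None)))
    entropy = 0.5 * np.sum(1 + np.log(var_t))

    return recon + kl_gauss + kl_cat + entropy
```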

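Looking back at the generative process of Section 3.1, sampling from the fitted model is equally direct. The sketch below is a minimal NumPy illustration of Equations 1–6 with a stand-in decoder and placeholder parameters (all names and values are assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
K, J, D = 10, 10, 784                      # clusters, latent dim, observed dim

pi = np.full(K, 1.0 / K)                   # prior cluster probabilities pi_k
mu_c = rng.randn(K, J)                     # per-cluster Gaussian means
var_c = np.ones((K, J))                    # per-cluster diagonal variances sigma_c^2

def f_decoder(z):
    """Stand-in for the DNN f(z; theta): maps z to Bernoulli means mu_x."""
    W = rng.randn(J, D) * 0.01             # placeholder weights, for illustration only
    return 1.0 / (1.0 + np.exp(-z @ W))    # sigmoid output in (0, 1)

# 1) Choose a cluster c ~ Cat(pi)
c = rng.choice(K, p=pi)
# 2) Choose a latent vector z ~ N(mu_c, sigma_c^2 I)
z = mu_c[c] + np.sqrt(var_c[c]) * rng.randn(J)
# 3) Decode and sample a binary observation x ~ Ber(mu_x)
mu_x = f_decoder(z)
x = (rng.rand(D) < mu_x).astype(np.float32)
```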

We now describe how to formulate γ_c = q(c|x) in Equation 12 in order to maximize the ELBO. Specifically, L_ELBO(x) can be rewritten as:

L_ELBO(x) = E_{q(z,c|x)}[ log ( p(x, z, c) / q(z, c|x) ) ]
          = ∫_z Σ_c q(c|x) q(z|x) [ log ( p(x|z) p(z) / q(z|x) ) + log ( p(c|z) / q(c|x) ) ] dz
          = ∫_z q(z|x) log ( p(x|z) p(z) / q(z|x) ) dz − ∫_z q(z|x) D_KL( q(c|x) ‖ p(c|z) ) dz   (15)

In Equation 15, the first term has no relationship with c and the second term is non-negative. Hence, to maximize L_ELBO(x), D_KL(q(c|x) ‖ p(c|z)) ≡ 0 should be satisfied. As a result, we use the following equation to compute q(c|x) in VaDE:

q(c|x) = p(c|z) ≡ p(c) p(z|c) / Σ_{c'=1}^{K} p(c') p(z|c')   (16)

By using Equation 16, the information loss induced by the mean-field approximation can be mitigated, since p(c|z) captures the relationship between c and z. It is worth noting that p(c|z) is only an approximation to q(c|x), but we find it works well in practice³.

³ We approximate q(c|x) by: 1) sampling a z^(i) ∼ q(z|x); 2) computing q(c|x) = p(c|z^(i)) according to Equation 16.

Once training is done by maximizing the ELBO w.r.t. the parameters {π, µ_c, σ_c, θ, φ}, c ∈ {1, ..., K}, a latent representation z can be extracted for each observed sample x by Equations 10 and 11, and the cluster assignments can be obtained by Equation 16.

3.3 Understanding the ELBO of VaDE

In this section, we provide some intuition about the ELBO of VaDE. More specifically, the ELBO in Equation 7 can be further rewritten as:

L_ELBO(x) = E_{q(z,c|x)}[ log p(x|z) ] − D_KL( q(z, c|x) ‖ p(z, c) )   (17)

The first term in Equation 17 is the reconstruction term, which encourages VaDE to explain the dataset well. The second term is the Kullback-Leibler divergence from the Mixture-of-Gaussians (MoG) prior p(z, c) to the variational posterior q(z, c|x), which regularizes the latent embedding z to lie on a MoG manifold.

To demonstrate the importance of the KL term in Equation 17, we first train an Auto-Encoder (AE) with the same network architecture as VaDE, and then apply GMM on the latent representations from the learned AE, since a VaDE model without the KL term is almost equivalent to an AE. We refer to this model as AE+GMM. We also show the performance of using GMM directly on the observed space (GMM), using VAE on the observed space and then applying GMM on the latent space from VAE (VAE+GMM)⁴, as well as the performances of LDMGI [Yang et al., 2010], AAE [Makhzani et al., 2016] and DEC [Xie et al., 2016], in Figure 2. The fact that VaDE significantly outperforms AE+GMM (without the KL term) and VAE+GMM confirms the importance of the regularization term and the advantage of jointly optimizing VAE and GMM in VaDE. We also illustrate the clusters and how they change over training epochs on the MNIST dataset in Figure 3, where we map the latent representations z into 2D space by t-SNE [Maaten and Hinton, 2008].

⁴ In this case, VAE and GMM are optimized separately.

Figure 2: Clustering accuracy over number of epochs during training on MNIST. We also illustrate the best performances of DEC, AAE, LDMGI and GMM. It is better to view the figure in color.

Figure 3: Illustration of how the data are clustered in the latent space learned by VaDE during training on MNIST (panels at epochs 0, 1, 5, 50, 120 and the final epoch, with clustering accuracies 11.35%, 55.63%, 72.40%, 84.59%, 90.76% and 94.46%, respectively). Different colors indicate different ground-truth classes, and the clustering accuracy at the corresponding epoch is reported in brackets. The latent representations become more and more suitable for clustering as training proceeds, which is also reflected by the increasing clustering accuracy.
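The cluster-assignment rule of Equation 16 amounts to a softmax over per-cluster log-joint densities. A minimal NumPy sketch for diagonal Gaussians, using the log-sum-exp trick for numerical stability (the function and argument names are my own):

```python
import numpy as np

def cluster_posterior(z, pi_prior, mu_c, var_c):
    """q(c|x) ~= p(c|z) = p(c) p(z|c) / sum_c' p(c') p(z|c')  (Equation 16).

    z:        (J,) latent code sampled from q(z|x)
    pi_prior: (K,) prior cluster probabilities
    mu_c:     (K, J) cluster means,  var_c: (K, J) diagonal cluster variances
    """
    # log N(z | mu_c, var_c I) for every cluster, computed element-wise.
    log_pdf = -0.5 * np.sum(
        np.log(2 * np.pi * var_c) + (z - mu_c) ** 2 / var_c, axis=1)   # (K,)
    log_joint = np.log(pi_prior) + log_pdf                             # log p(c) + log p(z|c)
    log_joint -= log_joint.max()                                       # log-sum-exp trick
    gamma = np.exp(log_joint)
    return gamma / gamma.sum()                                         # normalized gamma_c
```

Hard cluster labels are then simply the argmax of the returned gamma vector.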


4 Experiments

In this section, we evaluate the performance of VaDE on 5 benchmarks from different modalities: MNIST [LeCun et al., 1998], HHAR [Stisen et al., 2015], Reuters-10K [Lewis et al., 2004], Reuters [Lewis et al., 2004] and STL-10 [Coates et al., 2011]. We provide quantitative comparisons of VaDE with other clustering methods including GMM, AE+GMM, VAE+GMM, LDMGI [Yang et al., 2010], AAE [Makhzani et al., 2016] and the strong baseline DEC [Xie et al., 2016]. We use the same network architecture as DEC for a fair comparison. The experimental results show that VaDE achieves state-of-the-art performance on all these benchmarks. Additionally, we also provide quantitative comparisons with other variants of VAE on the discriminative quality of the latent representations. The code of VaDE is available at https://github.com/slim1017/VaDE.

4.1 Datasets Description

The following datasets are used in our empirical experiments; their statistics are summarized in Table 1.

• MNIST: The MNIST dataset consists of 70000 handwritten digit images. The images are centered and of size 28 by 28 pixels. We reshaped each image into a 784-dimensional vector.
• HHAR: The Heterogeneity Human Activity Recognition (HHAR) dataset contains 10299 sensor records from smartphones and smartwatches. All samples are partitioned into 6 categories of human activities, and each sample is of 561 dimensions.
• REUTERS: The original Reuters dataset contains around 810000 English news stories labeled with a category tree. Following DEC, we used 4 root categories (corporate/industrial, government/social, markets, and economics) as labels and discarded all documents with multiple labels, which results in a 685071-article dataset. We computed tf-idf features on the 2000 most frequent words to represent all articles. Similar to DEC, a random subset of 10000 documents is sampled, referred to as Reuters-10K, since some spectral clustering methods (e.g. LDMGI) cannot scale to the full Reuters dataset.
• STL-10: The STL-10 dataset consists of color images of 96-by-96 pixel size. There are 10 classes with 1300 examples each. Since clustering directly from raw pixels of high-resolution images is rather difficult, we extracted features of the STL-10 images with ResNet-50 [He et al., 2016], which were then used to evaluate VaDE and all baselines. More specifically, we applied 3 × 3 average pooling over the last feature map of ResNet-50, yielding features of dimensionality 2048.

Dataset        # Samples   Input Dim   # Clusters
MNIST          70000       784         10
HHAR           10299       561         6
REUTERS-10K    10000       2000        4
REUTERS        685071      2000        4
STL-10         13000       2048        10

Table 1: Dataset statistics.

4.2 Experimental Setup

As mentioned before, the same network architecture as DEC is adopted by VaDE for a fair comparison. Specifically, the architectures of f and g in Equation 1 and Equation 10 are 10-2000-500-500-D and D-500-500-2000-10, respectively, where D is the input dimensionality. All layers are fully connected. The Adam optimizer [Kingma and Ba, 2015] is used to maximize the ELBO of Equation 9, and the mini-batch size is 100. The learning rate for MNIST, HHAR, Reuters-10K and STL-10 is 0.002 and decreases every 10 epochs with a decay rate of 0.9; the learning rate for Reuters is 0.0005 with a decay rate of 0.5 every epoch. As for the generative process in Section 3.1, the multivariate Bernoulli distribution is used for the MNIST dataset, and the multivariate Gaussian distribution is used for the others. The number of clusters is fixed to the number of classes for each dataset, similar to DEC; we vary the number of clusters in Section 4.6.

Similar to other VAE-based models [Sønderby et al., 2016; Kingma and Salimans, 2016], VaDE suffers from the problem that the reconstruction term in Equation 17 can be so weak at the beginning of training that the model gets stuck in an undesirable local minimum or saddle point, from which it is hard to escape. In this work, pretraining is used to avoid this problem. Specifically, we use a Stacked Auto-Encoder to pretrain the networks f and g. All data points are then projected into the latent space z by the pretrained network g, where a GMM is applied to initialize the parameters {π, µ_c, σ_c}, c ∈ {1, ..., K}. In practice, a few epochs of pretraining are enough to provide a good initialization of VaDE. We find that VaDE is not sensitive to hyperparameters after pretraining; hence, we did not spend much effort tuning them.

4.3 Quantitative Comparison

Following DEC, the performance of VaDE is measured by unsupervised clustering accuracy (ACC), which is defined as:

ACC = max_{m ∈ M} ( Σ_{i=1}^{N} 1{l_i = m(c_i)} ) / N

where N is the total number of samples, l_i is the ground-truth label, c_i is the cluster assignment obtained by the model, and M is the set of all possible one-to-one mappings between cluster assignments and labels. The best mapping can be obtained by the Kuhn-Munkres algorithm [Munkres, 1957]. Similar to DEC, we perform 10 random restarts when initializing all clustering models and pick the result with the best objective value. For LDMGI, AAE and DEC, we use the same configurations as in their original papers. Table 2 compares the performance of VaDE with the other baselines over all datasets. It can be seen that VaDE outperforms all these baselines by a large margin on all datasets. Specifically, on the MNIST, HHAR, Reuters-10K, Reuters and STL-10 datasets, VaDE achieves ACC of 94.46%, 84.46%, 79.83%, 79.38% and 84.45%, outperforming DEC with relative improvements of 12.05%, 5.76%, 7.41%, 4.96% and 4.75%, respectively.
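The ACC metric above can be computed exactly with the Hungarian (Kuhn-Munkres) algorithm. A small sketch using SciPy is shown below; it is an illustrative implementation rather than the authors' code, and the function name is my own:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_true, labels_pred):
    """Unsupervised clustering accuracy: best one-to-one mapping between
    predicted clusters and ground-truth labels (Kuhn-Munkres)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n_classes = max(labels_true.max(), labels_pred.max()) + 1

    # Contingency matrix: cost[i, j] = #samples in predicted cluster i with true label j.
    cost = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(labels_true, labels_pred):
        cost[p, t] += 1

    # Maximizing matched counts == minimizing negated counts.
    row_ind, col_ind = linear_sum_assignment(-cost)
    return cost[row_ind, col_ind].sum() / labels_true.size
```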


Method      MNIST    HHAR     REUTERS-10K   REUTERS   STL-10
GMM         53.73    60.34    54.72         55.81     72.44
AE+GMM      82.18    77.67    70.13         70.98     79.83
VAE+GMM     72.94    68.02    69.56         60.89     78.86
LDMGI       84.09†   63.43    65.62         N/A       79.22
AAE         83.48    83.77    69.82         75.12     80.01
DEC         84.30†   79.86    74.32         75.63†    80.62
VaDE        94.46    84.46    79.83         79.38     84.45

†: Taken from [Xie et al., 2016].

Table 2: Clustering accuracy (%) comparison on all datasets.

We also compare VaDE with SB-VAE [Nalisnick and Smyth, 2016] and DLGMM [Nalisnick et al., 2016] on the discriminative power of the latent representations, since these two baselines cannot perform clustering. Following SB-VAE, the discriminative power of each model's latent representations is assessed by running a k-Nearest Neighbors (kNN) classifier on the latent representations of MNIST. Table 3 shows the error rate of the kNN classifier on the latent representations. It can be seen that VaDE outperforms SB-VAE and DLGMM significantly⁵. Note that although VaDE can learn discriminative representations of samples, the training of VaDE is totally unsupervised; hence, we did not compare VaDE with supervised models.

⁵ We use the same network architecture for VaDE and SB-VAE in Table 3 for a fair comparison. Since there is no code available for DLGMM, we take the DLGMM numbers directly from [Nalisnick et al., 2016]. Note that [Nalisnick and Smyth, 2016] has already shown that the performance of SB-VAE is comparable to DLGMM.

Method    k=3     k=5     k=10
VAE       18.43   15.69   14.19
DLGMM     9.14    8.38    8.42
SB-VAE    7.64    7.25    7.31
VaDE      2.20    2.14    2.22

Table 3: MNIST test error rate (%) for kNN on the latent space.

4.4 Generating Samples by VaDE

One major advantage of VaDE over DEC [Xie et al., 2016] is that it is by nature a generative clustering model and can generate highly realistic samples for any specified cluster (class). In this section, we provide some qualitative comparisons on generating samples among VaDE, GMM, VAE and the state-of-the-art generative method InfoGAN [Chen et al., 2016].

Figure 4 illustrates the samples generated for classes 0 to 9 of MNIST by GMM, VAE, InfoGAN and VaDE, respectively. It can be seen that the digits generated by VaDE are smooth and diverse. Note that the classes of the samples from VAE cannot be specified. We can also see that the performance of VaDE is comparable with InfoGAN.

Figure 4: The digits generated by (a) GMM, (b) VAE, (c) InfoGAN and (d) VaDE. Except for (b), digits in the same row come from the same cluster.

4.5 Visualization of Learned Embeddings

In this section, we visualize the representations learned by VAE, DEC and VaDE on the MNIST dataset. To this end, we use t-SNE [Maaten and Hinton, 2008] to reduce the dimensionality of the latent representation z from 10 to 2, and plot 2000 randomly sampled digits in Figure 5. The first row of Figure 5 illustrates the ground-truth labels for each digit, where different colors indicate different labels. The second row of Figure 5 demonstrates the clustering results, where correctly clustered samples are colored green and incorrect ones red.

From Figure 5 we can see that the original VAE, which uses a single Gaussian prior, does not perform well in clustering tasks. It can also be observed that the embeddings learned by VaDE are better than those of VAE and DEC, since the number of incorrectly clustered samples is smaller. Furthermore, the samples incorrectly clustered by VaDE are mostly located at the border of each cluster, where confusing samples usually appear. In contrast, many of the samples incorrectly clustered by DEC appear in the interior of the clusters, which indicates that DEC fails to preserve the inherent structure of the data. Some mistakes made by DEC and VaDE are also marked in Figure 5.

Figure 5: Visualization of the embeddings learned by VAE, DEC and VaDE on MNIST, respectively. The first row illustrates the ground-truth labels for each digit, where different colors indicate different labels. The second row demonstrates the clustering results, where correctly clustered samples are colored green and incorrect ones red. GT:4 means the ground-truth label of the digit is 4, DEC:4 means DEC assigns the digit to the cluster of 4, VaDE:4 denotes that the assignment by VaDE is 4, and so on. It is better to view the figure in color.
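A visualization of this kind can be reproduced roughly as follows, assuming an (N, 10) array of latent means from a trained encoder and the corresponding ground-truth labels; this is an illustrative sketch, not the authors' plotting code:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_space(z_mean, labels, n_points=2000, seed=0):
    """Project 10-D latent codes to 2-D with t-SNE and color by ground truth."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(z_mean), size=min(n_points, len(z_mean)), replace=False)
    z_2d = TSNE(n_components=2, random_state=seed).fit_transform(z_mean[idx])
    plt.figure(figsize=(6, 6))
    plt.scatter(z_2d[:, 0], z_2d[:, 1], c=labels[idx], cmap='tab10', s=5)
    plt.colorbar(label='ground-truth class')
    plt.tight_layout()
    plt.show()
```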


4.6 The Impact of the Number of Clusters

So far, the number of clusters for VaDE has been set to the number of classes of each dataset, which is prior knowledge. To demonstrate VaDE's representation power as an unsupervised clustering model, we deliberately choose different numbers of clusters K. Each row in Figure 6 illustrates the samples from one cluster grouped by VaDE on the MNIST dataset, where K is set to 7 and 14 in Figure 6(a) and Figure 6(b), respectively. We can see that, if K is smaller than the number of classes, digits with similar appearances are clustered together, such as 9 and 4, or 3 and 8, in Figure 6(a). On the other hand, if K is larger than the number of classes, some digits fall into sub-classes, such as a fatter 0 and a thinner 0, or an upright 1 and an oblique 1, in Figure 6(b).

Figure 6: Clustering MNIST with different numbers of clusters ((a) 7 clusters, (b) 14 clusters). We illustrate samples belonging to each cluster by rows.

5 Conclusion

In this paper, we proposed Variational Deep Embedding (VaDE), which embeds the probabilistic clustering problem into a Variational Auto-Encoder (VAE) framework. VaDE models the data generative procedure by a GMM model and a neural network, and is optimized by maximizing the evidence lower bound (ELBO) of the log-likelihood of the data with the SGVB estimator and the reparameterization trick. We compared the clustering performance of VaDE with strong baselines on 5 benchmarks from different modalities, and the experimental results showed that VaDE outperforms the state-of-the-art methods by a large margin. We also showed that VaDE can generate highly realistic samples conditioned on cluster information without using any supervised information during training. Note that although we use a MoG prior for VaDE in this paper, other mixture models can also be adopted flexibly in this framework, which we leave as future work.

Acknowledgments

We thank the School of Mechanical Engineering of BIT (Beijing Institute of Technology) and the Collaborative Innovation Center of Electric Vehicles in Beijing for their support. This work was supported by the National Natural Science Foundation of China (61620106002, 61271376). We also thank the anonymous reviewers.

References

[Abbasnejad et al., 2016] Ehsan Abbasnejad, Anthony Dick, and Anton van den Hengel. Infinite variational autoencoder for semi-supervised learning. arXiv preprint arXiv:1611.07800, 2016.
[Chen et al., 2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
[Coates et al., 2011] Adam Coates, Andrew Y. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, 2011.
[Dosovitskiy and Brox, 2016] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.
[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[Kingma and Ba, 2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[Kingma and Salimans, 2016] Diederik P. Kingma and Tim Salimans. Improving variational autoencoders with inverse autoregressive flow. In NIPS, 2016.
[Kingma and Welling, 2014] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[Kingma et al., 2014] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.


[Lewis et al., 2004] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004.
[Liu et al., 2010] Jialu Liu, Deng Cai, and Xiaofei He. Gaussian mixture model with local consistency. In AAAI, 2010.
[Maaløe et al., 2016] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. In ICML, 2016.
[Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
[Makhzani et al., 2016] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In NIPS, 2016.
[Munkres, 1957] James Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 1957.
[Nalisnick and Smyth, 2016] Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197, 2016.
[Nalisnick et al., 2016] Eric Nalisnick, Lars Hertel, and Padhraic Smyth. Approximate inference for deep latent Gaussian mixtures. 2016.
[Ng et al., 2002] Andrew Y. Ng, Michael I. Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. In NIPS, 2002.
[Nguyen et al., 2016] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
[Oord et al., 2016] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[Radford et al., 2016] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[Salimans et al., 2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.
[Shu et al., 2016] Rui Shu, James Brofos, Frank Zhang, Hung Hai Bui, Mohammad Ghavamzadeh, and Mykel Kochenderfer. Stochastic video prediction with conditional density estimation. In ECCV Workshop on Action and Anticipation for Visual Learning, 2016.
[Sønderby et al., 2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In NIPS, 2016.
[Stisen et al., 2015] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, 2015.
[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[Von Luxburg, 2007] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
[Xie et al., 2016] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
[Yang et al., 2010] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing, 2010.
[Ye et al., 2008] Jieping Ye, Zheng Zhao, and Mingrui Wu. Discriminative k-means for clustering. In NIPS, 2008.
[Zheng et al., 2014a] Yin Zheng, Richard S. Zemel, Yu-Jin Zhang, and Hugo Larochelle. A neural autoregressive approach to attention-based recognition. International Journal of Computer Vision, 113(1):67–79, 2014.
[Zheng et al., 2014b] Yin Zheng, Yu-Jin Zhang, and Hugo Larochelle. Topic modeling of multimodal data: An autoregressive approach. In CVPR, pages 1370–1377, 2014.
[Zheng et al., 2015] Yin Zheng, Yu-Jin Zhang, and Hugo Larochelle. A deep and autoregressive approach for topic modeling of multimodal data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[Zheng et al., 2016] Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. A neural autoregressive approach to collaborative filtering. In ICML, pages 764–773, 2016.
