Variational Deep Embedding
[...] the core. Taking a similar approach, it is conceivable to conduct clustering analysis on good representations, instead of raw data points. In a recent work, Deep Embedded Clustering (DEC) [Xie et al., 2016] was proposed to simultaneously learn feature representations and cluster assignments by deep neural networks. Although DEC performs well in clustering, similar to K-means, DEC cannot model the generative process of data and hence is not able to generate samples. Some recent works, e.g. VAE [Kingma and Welling, 2014], GAN [Goodfellow et al., 2014], PixelRNN [Oord et al., 2016], InfoGAN [Chen et al., 2016] and PPGN [Nguyen et al., 2016], have shown that neural networks can be trained to generate meaningful samples. The motivation of this work is to develop a clustering model based on neural networks that 1) learns good representations that capture the statistical structure of the data, and 2) is capable of generating samples.

In this paper, we propose a clustering framework, Variational Deep Embedding (VaDE), that combines VAE [Kingma and Welling, 2014] and a Gaussian Mixture Model (GMM) for clustering tasks. VaDE models the data generative process by a GMM and a DNN f: 1) a cluster is picked by the GMM; 2) from which a latent representation z is sampled; 3) the DNN f decodes z to an observation x. Moreover, VaDE is optimized by using another DNN g to encode observed data x into the latent embedding z, so that the Stochastic Gradient Variational Bayes (SGVB) estimator and the reparameterization trick [Kingma and Welling, 2014] can be used to maximize the evidence lower bound (ELBO). VaDE generalizes VAE in that a Mixture-of-Gaussians prior replaces the single Gaussian prior. Hence, VaDE is by design more suitable for clustering tasks¹. Specifically, the main contributions of the paper are:

• We propose an unsupervised generative clustering framework, VaDE, that combines VAE and GMM together.
• We show how to optimize VaDE by maximizing the ELBO using the SGVB estimator and the reparameterization trick;
• Experimental results show that VaDE outperforms the state-of-the-art clustering models on 5 datasets from various modalities by a large margin;
• We show that VaDE can generate highly realistic samples for any specified cluster, without using supervised information during training.

The diagram of VaDE is illustrated in Figure 1.

¹ Although people can use VaDE for unsupervised feature learning or semi-supervised learning tasks, we only focus on clustering tasks in this work.

2 Related Work
Recently, learning good representations has been found to play an important role in clustering tasks. For example, DEC [Xie et al., 2016] was proposed to learn feature representations and cluster assignments simultaneously by deep neural networks. In fact, DEC learns a mapping from the observed space to a lower-dimensional latent space, where it iteratively optimizes the KL divergence to minimize the within-cluster distance of each cluster. DEC achieved impressive performance on clustering tasks. However, the feature embedding in DEC is designed specifically for clustering and fails to uncover the real underlying structure of the data, which deprives the model of the ability to extend to tasks beyond clustering, such as generating samples.

Deep generative models have recently attracted much attention in that they can capture the data distribution by neural networks, from which unseen samples can be generated. GAN and VAE are among the most successful deep generative models of recent years. Both are appealing unsupervised generative models, and their variants have been extensively studied and applied to various tasks such as semi-supervised classification [Kingma et al., 2014; Maaløe et al., 2016; Salimans et al., 2016; Makhzani et al., 2016; Abbasnejad et al., 2016], clustering [Makhzani et al., 2016] and image generation [Radford et al., 2016; Dosovitskiy and Brox, 2016].

For example, [Abbasnejad et al., 2016] proposed to use a mixture of VAEs for semi-supervised classification tasks, where the mixing coefficients of these VAEs are modeled by a Dirichlet process to adapt the model's capacity to the input data. SB-VAE [Nalisnick and Smyth, 2016] also applied Bayesian nonparametric techniques to VAE, deriving a stochastic latent dimensionality from a stick-breaking prior and achieving good performance on semi-supervised classification tasks. VaDE differs from SB-VAE in that the cluster assignment and the latent representation are jointly considered in the Gaussian mixture prior, whereas SB-VAE separately models the latent representation and the class variable, which fails to capture the dependence between them. Additionally, VaDE does not need class labels during training, while labels are required by SB-VAE due to its semi-supervised setting. Among the variants of VAE, the Adversarial Auto-Encoder (AAE) [Makhzani et al., 2016] can also perform unsupervised clustering. Different from VaDE, AAE uses a GAN to match the aggregated posterior with the prior of the VAE, which makes its training procedure much more complex than that of VaDE. We compare AAE with VaDE in the experiments.

Similar to VaDE, [Nalisnick et al., 2016] proposed DLGMM to combine VAE and GMM together. The crucial difference, however, is that VaDE uses a mixture-of-Gaussians prior to replace the single Gaussian prior of VAE, which is suitable for clustering tasks by nature, while DLGMM uses a mixture of Gaussians as the approximate posterior of the VAE and does not model the class variable. Hence, VaDE generalizes VAE to clustering tasks, whereas DLGMM is used to improve the capacity of the original VAE and is not suitable for clustering tasks by design. The recently proposed GM-CVAE [Shu et al., 2016] also combines VAE with a GMM. However, the GMM in GM-CVAE is used to model the transitions between video frames, which is the main difference from VaDE.
3 Variational Deep Embedding
In this section, we describe Variational Deep Embedding (VaDE), a model for the probabilistic clustering problem within the framework of the Variational Auto-Encoder (VAE).

3.1 The Generative Process
Following the generative process described above (a cluster c is picked by the GMM, a latent representation z is sampled from that cluster, and z is decoded into an observation x by the DNN f), the joint probability can be factorized as:

p(x, z, c) = p(x|z) p(z|c) p(c),    (3)

since x and c are independent conditioned on z. And the probabilities are defined as:

p(c) = Cat(c|π)    (4)
p(z|c) = N(z | µ_c, σ_c² I)    (5)
p(x|z) = Ber(x|µ_x) or N(x | µ_x, σ_x² I)    (6)
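To make the generative process concrete, the sketch below draws one sample according to Equations 3-6. It is a minimal Python/NumPy illustration under assumed inputs (GMM parameters π, µ_c, σ_c and a decoder f standing in for the DNN of Equation 1); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_vade(pi, mu_c, sigma_c, f_decoder):
    """One draw from the VaDE generative process (Eqs. 3-6), as a sketch.

    pi        : (K,)   mixture weights, p(c) = Cat(c | pi)
    mu_c      : (K, J) cluster means in latent space
    sigma_c   : (K, J) cluster standard deviations (diagonal covariance)
    f_decoder : maps a latent z of shape (J,) to Bernoulli means mu_x of shape (D,)
    """
    c = rng.choice(len(pi), p=pi)                                   # pick a cluster from the GMM prior
    z = mu_c[c] + sigma_c[c] * rng.standard_normal(mu_c.shape[1])   # z ~ N(mu_c, sigma_c^2 I)
    mu_x = f_decoder(z)                                             # decode z with the DNN f
    x = rng.binomial(1, mu_x)                                       # x ~ Ber(mu_x) for binary data
    return c, z, x
```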
3.2 Variational Lower Bound
A VaDE instance is tuned to maximize the likelihood of the given data points. Given the generative process in Section 3.1, by using Jensen's inequality, the log-likelihood of VaDE can be written as:

log p(x) = log ∫_z Σ_c p(x, z, c) dz ≥ E_{q(z,c|x)}[ log p(x, z, c) / q(z, c|x) ] = L_ELBO(x),    (7)

where L_ELBO is the evidence lower bound (ELBO) and q(z, c|x) is the variational posterior that approximates the true posterior p(z, c|x). In VaDE, we assume q(z, c|x) to be a mean-field distribution that can be factorized as:

q(z, c|x) = q(z|x) q(c|x).    (8)

[...] σ̃², and ∗|_j denotes the j-th element of ∗, K is the number of clusters, π_c is the prior probability of cluster c, and γ_c denotes q(c|x) for simplicity.

In Equation 12, we compute µ_x^(l) as

µ_x^(l) = f(z^(l); θ),    (13)

where z^(l) is the l-th sample from q(z|x) (Equation 11) used to produce the Monte Carlo samples. According to the reparameterization trick, z^(l) is obtained by

z^(l) = µ̃ + σ̃ ◦ ε^(l),    (14)

where ε^(l) ∼ N(0, I), ◦ denotes element-wise multiplication, and µ̃, σ̃ are derived by Equation 10.

We now describe how to formulate γ_c ≜ q(c|x) in Equation 12 to maximize the ELBO. Specifically, L_ELBO(x) can [...]

² This is the case when the observation x is binary. For the real-valued situation, the ELBO can be obtained in a similar way.
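As a companion to Equations 13 and 14, the following PyTorch-style sketch shows how the reparameterized Monte Carlo samples that feed the SGVB estimator can be drawn. The encoder g and decoder f are placeholders with assumed interfaces (g returning µ̃ and log σ̃² as in Equation 10); this is an illustration, not the paper's released code.

```python
import torch

def reparameterized_samples(x, encoder_g, decoder_f, num_samples=1):
    """Draw z^(l) = mu~ + sigma~ o eps^(l) (Eq. 14) and decode to mu_x^(l) (Eq. 13).

    encoder_g(x) is assumed to return (mu_tilde, log_var_tilde) as in Eq. 10;
    decoder_f(z) is assumed to return the Bernoulli means mu_x.
    """
    mu_tilde, log_var_tilde = encoder_g(x)        # Eq. 10: [mu~; log sigma~^2] = g(x; phi)
    sigma_tilde = torch.exp(0.5 * log_var_tilde)
    samples = []
    for _ in range(num_samples):                  # L Monte Carlo samples for the SGVB estimator
        eps = torch.randn_like(sigma_tilde)       # eps^(l) ~ N(0, I)
        z = mu_tilde + sigma_tilde * eps          # Eq. 14 (element-wise product)
        samples.append(decoder_f(z))              # Eq. 13: mu_x^(l) = f(z^(l); theta)
    return samples
```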
Dataset       # Samples   Input Dim   # Clusters
MNIST         70000       784         10
HHAR          10299       561         6
REUTERS-10K   10000       2000        4
REUTERS       685071      2000        4
STL-10        13000       2048        10

Table 1: Dataset statistics

[...] VAE+GMM, LDMGI [Yang et al., 2010], AAE [Makhzani et al., 2016] and the strong baseline DEC [Xie et al., 2016]. We use the same network architecture as DEC for a fair comparison. The experimental results show that VaDE achieves state-of-the-art performance on all these benchmarks. Additionally, we also provide quantitative comparisons with other variants of VAE on the discriminative quality of the latent representations. The code of VaDE is available at https://github.com/slim1017/VaDE.

4.1 Datasets Description
The following datasets are used in our empirical experiments.

• MNIST: The MNIST dataset consists of 70000 handwritten digits. The images are centered and of size 28 by 28 pixels. We reshaped each image to a 784-dimensional vector.
• HHAR: The Heterogeneity Human Activity Recognition (HHAR) dataset contains 10299 sensor records from smart phones and smart watches. All samples are partitioned into 6 categories of human activities and each sample is of 561 dimensions.
• REUTERS: There are around 810000 English news stories labeled with a category tree in the original Reuters dataset. Following DEC, we used 4 root categories: corporate/industrial, government/social, markets, and economics as labels and discarded all documents with multiple labels, which results in a 685071-article dataset. We computed tf-idf features on the 2000 most frequent words to represent all articles. Similar to DEC, a random subset of 10000 documents is sampled, which is referred to as Reuters-10K, since some spectral clustering methods (e.g. LDMGI) cannot scale to the full Reuters dataset.
• STL-10: The STL-10 dataset consists of color images of 96-by-96 pixel size. There are 10 classes with 1300 examples each. Since clustering directly from raw pixels of high-resolution images is rather difficult, we extracted features of the STL-10 images with ResNet-50 [He et al., 2016], which were then used to test the performance of VaDE and all baselines. More specifically, we applied a 3 × 3 average pooling over the last feature map of ResNet-50, so the dimensionality of the features is 2048 (a feature-extraction sketch is given after this list).
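As a concrete illustration of the STL-10 feature extraction described above, the sketch below uses torchvision's ImageNet-pretrained ResNet-50 as an assumed stand-in for the backbone of [He et al., 2016]; the weight choice and preprocessing are illustrative rather than the authors' exact pipeline.

```python
import torch
import torchvision

# Assumed stand-in backbone: torchvision's ImageNet-pretrained ResNet-50.
backbone = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
# Keep everything up to the last convolutional feature map (drop avgpool and fc).
feature_map = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

@torch.no_grad()
def stl10_features(images):
    """images: (N, 3, 96, 96) float tensor, already normalized.

    For 96x96 inputs the last ResNet-50 feature map is (N, 2048, 3, 3);
    a 3x3 average pooling then yields 2048-d feature vectors.
    """
    fmap = feature_map(images)
    pooled = torch.nn.functional.avg_pool2d(fmap, kernel_size=3)
    return pooled.flatten(1)          # (N, 2048)
```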
4.2 Experimental Setup
As mentioned before, the same network architecture as DEC is adopted by VaDE for a fair comparison. Specifically, the architectures of f and g in Equation 1 and Equation 10 are 10-2000-500-500-D and D-500-500-2000-10, respectively, where D is the input dimensionality. All layers are fully connected. The Adam optimizer [Kingma and Ba, 2015] is used to maximize the ELBO of Equation 9, and the mini-batch size is 100. The learning rate for MNIST, HHAR, Reuters-10K and STL-10 is 0.002 and decreases every 10 epochs with a decay rate of 0.9; the learning rate for Reuters is 0.0005 with a decay rate of 0.5 for every epoch. As for the generative process in Section 3.1, the multivariate Bernoulli distribution is used for the MNIST dataset, and the multivariate Gaussian distribution is used for the others. The number of clusters is fixed to the number of classes for each dataset, similar to DEC. We will vary the number of clusters in Section 4.6.

Similar to other VAE-based models [Sønderby et al., 2016; Kingma and Salimans, 2016], VaDE suffers from the problem that the reconstruction term in Equation 17 would be so weak at the beginning of training that the model might get stuck in an undesirable local minimum or saddle point, from which it is hard to escape. In this work, pretraining is used to avoid this problem. Specifically, we use a Stacked Auto-Encoder to pretrain the networks f and g. Then all data points are projected into the latent space z by the pretrained network g, where a GMM is applied to initialize the parameters of {π, µ_c, σ_c}, c ∈ {1, · · · , K}. In practice, a few epochs of pretraining are enough to provide a good initialization of VaDE. We find that VaDE is not sensitive to hyperparameters after pretraining. Hence, we did not spend a lot of effort to tune them.
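A minimal sketch of this setup is given below, assuming PyTorch and scikit-learn as stand-ins: the fully connected D-500-500-2000-10 encoder g and 10-2000-500-500-D decoder f, plus GMM initialization of {π, µ_c, σ_c} from latent codes of the pretrained encoder. The layer sizes follow the text above; the activations, the single-output latent head, and the pretraining loop itself are simplifying assumptions, not the released implementation.

```python
import torch
from sklearn.mixture import GaussianMixture

def mlp(dims):
    """Fully connected stack with ReLU between hidden layers (the activation is an assumption)."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(torch.nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(torch.nn.ReLU())
    return torch.nn.Sequential(*layers)

D, K = 784, 10                              # e.g. MNIST: input dimensionality and number of clusters
encoder_g = mlp([D, 500, 500, 2000, 10])    # D-500-500-2000-10 (outputs a 10-d latent code)
decoder_f = mlp([10, 2000, 500, 500, D])    # 10-2000-500-500-D

def init_gmm_from_pretrained(encoder_g, data):
    """After SAE pretraining, fit a GMM on the latent codes to initialize pi, mu_c, sigma_c."""
    with torch.no_grad():
        z = encoder_g(data).numpy()
    gmm = GaussianMixture(n_components=K, covariance_type="diag").fit(z)
    return gmm.weights_, gmm.means_, gmm.covariances_ ** 0.5   # pi, mu_c, sigma_c
```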
4.3 Quantitative Comparison
Following DEC, the performance of VaDE is measured by the unsupervised clustering accuracy (ACC), which is defined as:

ACC = max_{m∈M} ( Σ_{i=1}^{N} 1{l_i = m(c_i)} ) / N,

where N is the total number of samples, l_i is the ground-truth label, c_i is the cluster assignment obtained by the model, and M is the set of all possible one-to-one mappings between cluster assignments and labels. The best mapping can be obtained by the Kuhn-Munkres algorithm [Munkres, 1957].

Similar to DEC, we perform 10 random restarts when initializing all clustering models and pick the result with the best objective value. As for LDMGI, AAE and DEC, we use the same configurations as in their original papers. Table 2 compares the performance of VaDE with the other baselines over all datasets. It can be seen that VaDE outperforms all these baselines by a large margin on all datasets. Specifically, on the MNIST, HHAR, Reuters-10K, Reuters and STL-10 datasets, VaDE achieves ACC of 94.46%, 84.46%, 79.83%, 79.38% and 84.45%, which outperforms DEC with a relative increase ratio of 12.05%, 5.76%, 7.41%, 4.96% and 4.75%, respectively.

We also compare VaDE with SB-VAE [Nalisnick and Smyth, 2016] and DLGMM [Nalisnick et al., 2016] on the discriminative power of the latent representations, since these two baselines cannot do clustering tasks. Following SB-VAE, the discriminative power of the models' latent representations is assessed by running a k-Nearest Neighbors classifier (kNN) on the latent representations of MNIST. Table 3 [...]
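For reference, the ACC above can be computed with SciPy's implementation of the Kuhn-Munkres (Hungarian) algorithm, as in the sketch below; this is a standard reference implementation of the formula, not code from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels_true, labels_pred):
    """Unsupervised clustering accuracy: best one-to-one mapping between cluster ids
    and ground-truth labels, found with the Kuhn-Munkres (Hungarian) algorithm."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n_classes = max(labels_true.max(), labels_pred.max()) + 1
    # cost[i, j] counts samples assigned to cluster i whose ground-truth label is j
    cost = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(labels_true, labels_pred):
        cost[p, t] += 1
    # Maximizing matched counts is equivalent to minimizing the negated counts
    row_ind, col_ind = linear_sum_assignment(-cost)
    return cost[row_ind, col_ind].sum() / labels_true.size
```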
Figure 5: Visualization of the embeddings learned by VAE, DEC and VaDE on MNIST, respectively. The first row illustrates the ground-truth labels for each digit, where different colors indicate different labels. The second row demonstrates the clustering results, where correctly clustered samples are colored green and incorrect ones red. GT:4 means the ground-truth label of the digit is 4, DEC:4 means DEC assigns the digit to the cluster of 4, VaDE:4 denotes that the assignment by VaDE is 4, and so on. The figure is best viewed in color.

[...] classes, digits with similar appearances will be clustered together, such as 9 and 4, or 3 and 8, in Figure 6(a). On the other hand, if K is larger than the number of classes, some digits will fall into sub-classes by VaDE, such as the fatter 0 and thinner 0, and the upright 1 and oblique 1 in Figure 6(b).

5 Conclusion
In this paper, we proposed Variational Deep Embedding (VaDE), which embeds the probabilistic clustering problem into a Variational Auto-Encoder (VAE) framework. VaDE models the data generative procedure by a GMM and a neural network, and is optimized by maximizing the evidence lower bound (ELBO) of the log-likelihood of the data with the SGVB estimator and the reparameterization trick. We compared the clustering performance of VaDE with strong baselines on 5 benchmarks from different modalities, and the experimental results showed that VaDE outperforms the state-of-the-art methods by a large margin. We also showed that VaDE could generate highly realistic samples conditioned on cluster information without using any supervised information during training. Note that although we use a MoG prior for VaDE in this paper, other mixture models can also be adopted in this framework flexibly, which will be our future work.

Acknowledgments
We thank the School of Mechanical Engineering of BIT (Beijing Institute of Technology) and the Collaborative Innovation Center of Electric Vehicles in Beijing for their support. This work was supported by the National Natural Science Foundation of China (61620106002, 61271376). We also thank the anonymous reviewers.

References
[Abbasnejad et al., 2016] Ehsan Abbasnejad, Anthony Dick, and Anton van den Hengel. Infinite variational autoencoder for semi-supervised learning. arXiv preprint arXiv:1611.07800, 2016.
[Chen et al., 2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
[Coates et al., 2011] Adam Coates, Andrew Y Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, 2011.
[Dosovitskiy and Brox, 2016] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.
[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[Kingma and Ba, 2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[Kingma and Salimans, 2016] Diederik P Kingma and Tim Salimans. Improving variational autoencoders with inverse autoregressive flow. In NIPS, 2016.
[Kingma and Welling, 2014] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[Kingma et al., 2014] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[Lewis et al., 2004] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004.
[Liu et al., 2010] Jialu Liu, Deng Cai, and Xiaofei He. Gaussian mixture model with local consistency. In AAAI, 2010.
[Maaløe et al., 2016] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. In ICML, 2016.
[Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
[Makhzani et al., 2016] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In NIPS, 2016.
[Munkres, 1957] James Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 1957.
[Nalisnick and Smyth, 2016] Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197, 2016.
[Nalisnick et al., 2016] Eric Nalisnick, Lars Hertel, and Padhraic Smyth. Approximate inference for deep latent Gaussian mixtures. 2016.
[Ng et al., 2002] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. In NIPS, 2002.
[Nguyen et al., 2016] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
[Oord et al., 2016] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[Radford et al., 2016] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[Salimans et al., 2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.
[Shu et al., 2016] Rui Shu, James Brofos, Frank Zhang, Hung Hai Bui, Mohammad Ghavamzadeh, and Mykel Kochenderfer. Stochastic video prediction with conditional density estimation. In ECCV Workshop on Action and Anticipation for Visual Learning, 2016.
[Sønderby et al., 2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In NIPS, 2016.
[Stisen et al., 2015] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, 2015.
[Szegedy et al., 2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[Von Luxburg, 2007] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
[Xie et al., 2016] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
[Yang et al., 2010] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing, 2010.
[Ye et al., 2008] Jieping Ye, Zheng Zhao, and Mingrui Wu. Discriminative k-means for clustering. In NIPS, 2008.
[Zheng et al., 2014a] Yin Zheng, Richard S Zemel, Yu-Jin Zhang, and Hugo Larochelle. A neural autoregressive approach to attention-based recognition. International Journal of Computer Vision, 113(1):67-79, 2014.
[Zheng et al., 2014b] Yin Zheng, Yu-Jin Zhang, and H. Larochelle. Topic modeling of multimodal data: An autoregressive approach. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1370-1377, June 2014.
[Zheng et al., 2015] Y. Zheng, Yu-Jin Zhang, and H. Larochelle. A deep and autoregressive approach for topic modeling of multimodal data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PP(99):1-1, 2015.
[Zheng et al., 2016] Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. A neural autoregressive approach to collaborative filtering. In Proceedings of the 33rd International Conference on Machine Learning, pages 764-773, 2016.