Vaze Generalized Category Discovery CVPR 2022 Paper

This paper introduces a new setting called 'Generalized Category Discovery' (GCD), which aims to categorize unlabelled images in a dataset that may include both known and novel classes. The authors propose a method leveraging contrastively trained vision transformers and semi-supervised k-means clustering to assign labels effectively, addressing limitations of existing recognition methods. The approach is evaluated rigorously on various datasets, demonstrating substantial improvements over established baselines.


Generalized Category Discovery

Sagar Vaze⋆  Kai Han†  Andrea Vedaldi⋆  Andrew Zisserman⋆

⋆ Visual Geometry Group, Department of Engineering Science, University of Oxford
† The University of Hong Kong
{sagar,vedaldi,az}@robots.ox.ac.uk [email protected]

[Figure 1 graphic. Left panel, "Setting: Generalized Category Discovery": example images, some labelled (Elephant, Frog, Bird) and the rest marked "?". Right panel, "Method": (1) feature extraction with a vision transformer (input image, image patches, ViT, 768-D image embedding); (2) supervised contrastive (left) and self-supervised contrastive (right) learning; (3) semi-supervised k-means clustering.]

Figure 1. We present a new setting: ‘Generalized Category Discovery’ and a method to tackle it. Our setting can be succinctly described
as: given a dataset, a subset of which has class labels, categorize all unlabelled images in the dataset. The unlabelled images may come
from labelled or novel classes. Our method leverages contrastively trained vision transformers to assign labels directly through clustering.

Abstract

In this paper, we consider a highly general image recognition setting wherein, given a labelled and unlabelled set of images, the task is to categorize all images in the unlabelled set. Here, the unlabelled images may come from labelled classes or from novel ones. Existing recognition methods are not able to deal with this setting, because they make several restrictive assumptions, such as the unlabelled instances only coming from known – or unknown – classes, and the number of unknown classes being known a priori. We address the more unconstrained setting, naming it ‘Generalized Category Discovery’, and challenge all these assumptions. We first establish strong baselines by taking state-of-the-art algorithms from novel category discovery and adapting them for this task. Next, we propose the use of vision transformers with contrastive representation learning for this open-world setting. We then introduce a simple yet effective semi-supervised k-means method to cluster the unlabelled data into seen and unseen classes automatically, substantially outperforming the baselines. Finally, we also propose a new approach to estimate the number of classes in the unlabelled data. We thoroughly evaluate our approach on public datasets for generic object classification and on fine-grained datasets, leveraging the recent Semantic Shift Benchmark suite. Code: https://www.robots.ox.ac.uk/~vgg/research/gcd

1. Introduction

Consider an infant sitting in a car and observing the world. Object instances will pass the car and, for some of these, the infant may have been told their category (‘that is a dog’, ‘that is a car’) and be able to recognize them. There will also be instances that the infant has not seen before (cats and bicycles) and, having seen a number of these instances, we might expect the infant’s visual recognition system to cluster these into new categories.

This is the problem that we consider in this work: given an image dataset where only some images are labelled with their categories, assign a category label to each of the remaining
images, possibly using new categories not observed in the labelled set. We term this problem Generalized Category Discovery (GCD), and suggest that this is a realistic use case for many machine vision applications: whether that is recognizing products in a supermarket; pathologies in medical images; or vehicles in autonomous driving. In these and other realistic vision settings, it is often impossible to know if new images come from labelled or novel categories.

In contrast, consider the limitations of existing image recognition settings. In image classification, one of the most widely studied problems, all of the training images come with class labels. Furthermore, all images at test time come from the same classes as the training set. Semi-supervised learning (SSL) [7] introduces the problem of learning from unlabelled data, but still assumes that all unlabelled images come from the same set of classes as the labelled ones. More recently, the tasks of open-set recognition (OSR) [38] and novel-category discovery (NCD) [19] have tackled open-world settings in which the images at test time may belong to new classes. However, OSR aims only to detect test-time images which do not belong to one of the classes in the labelled set, but does not require any further classification amongst these detected images. Meanwhile, in NCD, the closest setting to the one tackled in this work, methods learn from labelled and unlabelled images, and aim to discover new classes in the unlabelled set. However, NCD still makes the limiting assumption that all of the unlabelled images come from new categories, which is usually unrealistic.

In this paper, we tackle Generalized Category Discovery in a number of ways. Firstly, we establish strong baselines by taking representative methods from NCD and applying them to this task. To do this, we adapt their training and inference mechanisms to account for our more general setting, as well as retrain them with a more robust backbone architecture. We show that existing NCD methods are prone to overfit the labelled classes in this generalized setting.

Next, observing the potential for NCD methods to overfit their classification heads to the labelled classes, we propose a simple but effective method for recognition by clustering. Our key insight is to leverage the strong ‘nearest neighbour’ classification property of vision transformers along with contrastive learning. We propose the use of contrastive training and a semi-supervised k-means clustering algorithm to recognize images without a parametric classifier. We show that these proposed methods substantially outperform the established baselines, both on generic object recognition datasets and, particularly, on more challenging fine-grained benchmarks. For the latter evaluations, we leverage the recently proposed Semantic Shift Benchmark suite [45], which was designed for the task of identifying semantic novelty.

Finally, we propose a solution to a challenging and under-investigated problem in image recognition: estimating the number of categories in unlabelled data. Almost all methods, including purely unsupervised ones, assume the knowledge of the number of categories, a highly unrealistic assumption in the real world. We propose an algorithm which leverages the labelled set to tackle this problem.

Our contributions can be summarized as follows: (i) the formalization of Generalized Category Discovery (GCD), a new and realistic setting for image recognition; (ii) the establishment of strong baselines by adapting state-of-the-art techniques from standard novel category discovery to this task; (iii) a simple but effective method for GCD, which uses contrastive representation learning and clustering to directly provide class labels, and outperforms the baselines substantially; (iv) a novel method for estimating the number of categories in unlabelled data, a largely understudied problem; and (v) rigorous evaluation on standard image recognition datasets as well as the recent Semantic Shift Benchmark suite [45].

2. Related work

Our work relates to prior work on semi-supervised learning, open-set recognition, and novel category discovery, which we briefly review next.

Semi-supervised learning. A number of methods [7, 33, 37, 40, 49] have been proposed to tackle the problem of semi-supervised learning (SSL). SSL assumes that the labelled and unlabelled instances come from the same set of classes. The objective is to learn a robust classification model leveraging both the labelled and unlabelled data during training. Amongst existing methods, consistency-based approaches appear to be popular and effective, such as LadderNet [36], PI model [29], and Mean-teacher [43]. Recently, with the success of self-supervised learning, methods have also been proposed to improve SSL by augmenting the methods with self-supervised objectives [37, 49].

Open-set recognition. The problem of open-set recognition (OSR) is formalized in [38], with the objective being to classify unlabelled instances from the same semantic classes as the labelled data, while detecting test instances from unseen classes. OpenMax [3] is the first deep learning method to approach this problem with Extreme Value Theory. GANs are often employed to generate adversarial samples to train an open-set classifier, e.g., [14, 25, 32]. Several methods have been proposed to train models such that images with large reconstruction error are regarded as open-set samples [34, 41, 48]. There are also methods that learn prototypes for the labelled classes, and identify the images from unknown classes by the distances to the prototypes [8, 9, 39]. More recently, [8, 9] proposed to learn reciprocal points which describe ‘otherness’ with respect to the labelled classes. [50] jointly trains a flow-based density estimator and a classification-based encoder for OSR. Finally, Vaze et al. [45] study the correlation between the closed-set and open-set performance, showing that state-of-the-art OSR results can be obtained by boosting the closed-set accuracy of the standard cross-entropy baseline.

Novel category discovery. The problem of novel category discovery (NCD) is formalized in DTC [19]. Earlier methods that could be applied to this problem include KCL [21] and MCL [22], both of which maintain two models trained with labelled data and unlabelled data respectively, for general task transfer learning. AutoNovel (aka RankStats) [17, 18] tackles the NCD problem with a three-stage method. The model is first trained with self-supervision on all data for low-level representation learning. Then, it is further trained with full supervision on labelled data to capture higher level semantic information. Finally, a joint learning stage is carried out to transfer knowledge from the labelled to unlabelled data with ranking statistics. Zhao and Han [51] propose a model with two branches, one for global feature learning and the other for local feature learning, such that dual ranking statistics and mutual learning are conducted with these two branches for better representation learning and new class discovery. OpenMix [53] mixes the labelled and unlabelled data to prevent the model from over-fitting for NCD. NCL [52] extracts and aggregates the pairwise pseudo-labels for the unlabelled data with contrastive learning and generates hard negatives by mixing the labelled and unlabelled data in the feature space for NCD. Jia et al. [23] propose an end-to-end NCD method for single- and multi-modal data with contrastive learning and winner-takes-all hashing. A unified cross-entropy loss is introduced in UNO [13] to allow the model to be trained on labelled and unlabelled data jointly, by swapping the pseudo-labels from labelled and unlabelled classification heads.

Finally, we highlight the work by Girish et al. [15] that tackles a similar setting to GCD but for the task of GAN attribution instead of image recognition, as well as the concurrent work by Cao et al. [4] that tackles a similar setting for image recognition under the name Open World Semi-Supervised Learning. Different to our setting, they do not leverage large-scale pretraining or demonstrate performance on the Semantic Shift Benchmark, which better isolates the problem of detecting semantic novelty.

3. Generalized category discovery

We first formalize the task of Generalized Category Discovery (GCD). In short, we consider the problem of classifying images in a dataset, a subset of which has known class labels. The task is to assign class labels to all remaining images, using classes that may or may not be observed in the labelled images (see Fig. 1, left).

Formally, we define GCD as follows. We consider a dataset $\mathcal{D}$ comprising two parts $\mathcal{D}_L = \{(x_i, y_i)\}_{i=1}^{N} \in \mathcal{X} \times \mathcal{Y}_L$ and $\mathcal{D}_U = \{(x_i, y_i)\}_{i=1}^{M} \in \mathcal{X} \times \mathcal{Y}_U$, where $\mathcal{Y}_L \subset \mathcal{Y}_U$. During training, the model does not have access to the labels in $\mathcal{D}_U$, and is tasked with predicting them at test time. Furthermore, we assume access to a validation set, $\mathcal{D}_V = \{(x_i, y_i)\}_{i=1}^{N'} \in \mathcal{X} \times \mathcal{Y}_L$, which is disjoint from the training set and contains images from the same classes as the labelled set. This formalization allows us to clearly see the distinction with the novel category discovery setting. NCD assumes $\mathcal{Y}_L \cap \mathcal{Y}_U = \emptyset$ and existing methods rely on this prior knowledge during training.

In this section, we describe the methods we propose to tackle GCD. First, we describe our approach to the problem. Leveraging recent progress in self-supervised representation learning, we propose a simple but effective approach based on contrastive learning, with classification performed by a semi-supervised k-means algorithm. Next, we develop a method to estimate the number of categories in the unlabelled data – a challenging task that is understudied in the literature. Finally, we build two strong baselines for GCD by modifying state-of-the-art NCD methods, RankStats [18] and UNO [13], to fit with our setting.

3.1. Our approach

The key insight of our approach for image recognition in an open-world setting is to remove the need for parametric classification heads. Instead, we perform clustering directly in the feature space of a deep network (see Fig. 1, right). Classification heads (typically, linear classifiers on top of a learned embedding) are best trained with the cross-entropy loss, which has been shown to be susceptible to noisy labels [12]. Furthermore, when training a linear classifier for unlabelled classes, a typical method is to generate (noisy) pseudo-labels for the unlabelled instances. This would suggest that parametric heads are susceptible to performance deterioration on the unlabelled classes. Finally, we note that, by necessity, classification heads must be trained from scratch, which further makes them vulnerable to overfitting on the labelled classes.

Meanwhile, self-supervised contrastive learning has been widely used as pre-training to achieve robust representations in NCD [23, 52]. Furthermore, when combined with vision transformers, it generates models which are good nearest neighbour classifiers [6]. Inspired by this, we find that contrastively training a ViT model allows us to directly cluster in the model’s feature space, thereby removing the need for a linear head which could lead to overfitting. Specifically, we train the representation with a noise contrastive loss [16] on all images without using any labels. This is important because it avoids overfitting the features to the subset of classes that are (partially) labelled. We add a further supervised contrastive component [24] for the labelled instances to make use of the labelled data (see Fig. 1, middle row on the right).
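To make the formalization of Sec. 3 concrete, the following is a minimal sketch (an illustration only, not the authors' released code) of how a labelled set $\mathcal{D}_L$ and an unlabelled set $\mathcal{D}_U$ with $\mathcal{Y}_L \subset \mathcal{Y}_U$ could be built from a fully annotated list of class ids. The function name `make_gcd_split`, its arguments and the toy labels are hypothetical; the default 50% labelled fraction follows the Data paragraph of Sec. 4.1.

```python
import random
from collections import defaultdict


def make_gcd_split(labels, labelled_classes, labelled_fraction=0.5, seed=0):
    """Return (d_l, d_u): sample indices of the labelled and unlabelled sets.

    labels           -- one class id per sample (used only to build the split)
    labelled_classes -- the class set Y_L; every other class is novel
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    d_l, d_u = [], []
    for y, idxs in by_class.items():
        if y in labelled_classes:
            idxs = idxs[:]
            rng.shuffle(idxs)
            n_lab = int(len(idxs) * labelled_fraction)
            d_l.extend(idxs[:n_lab])   # labelled part of an 'Old' class
            d_u.extend(idxs[n_lab:])   # rest of the 'Old' class is unlabelled
        else:
            d_u.extend(idxs)           # 'New' classes are entirely unlabelled
    return sorted(d_l), sorted(d_u)


# Toy data: four classes with four samples each; classes 0 and 1 are labelled.
labels = [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4
d_l, d_u = make_gcd_split(labels, labelled_classes={0, 1})
y_l = {labels[i] for i in d_l}
y_u = {labels[i] for i in d_u}
```

In this toy split `y_l` is a proper subset of `y_u`, which is the GCD assumption; under the NCD assumption the two class sets would instead be disjoint.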
3.1.1 Representation learning

We use, for all methods, a vision transformer (ViT-B-16) [11] pretrained with DINO [6] self-supervision on (unlabelled) ImageNet [10] as our backbone. This is motivated firstly because the DINO model is a strong nearest neighbour classifier, which suggests non-parametric clustering in its feature space would work well. Secondly, self-supervised vision transformers have demonstrated the attractive quality of learning to attend to salient parts of an object without human annotation. We find this feature to be useful for this task, because which object parts are important for classification is likely to transfer well from the labelled to the unlabelled categories (see Sec. 4.5).

Finally, we wish to reflect a realistic and practical setting. In the NCD literature, it is standard to train a ResNet-18 [20] backbone from scratch for the target task. However, in a real-world setting, a model is often initialized with large-scale pretrained weights to optimize performance (often ImageNet supervised pretraining). In order to avoid conflicts with our experimental setting (which assumes a finite labelled set), we use self-supervised ImageNet weights. To enhance the representation such that it is more tailored for the labelled and unlabelled data we have, we further fine-tune the representation on our target data jointly with supervised contrastive learning on the labelled data, and unsupervised contrastive learning on all the data.

Formally, let $x_i$ and $x'_i$ be two views (random augmentations) of the same image in a mini-batch $B$. The unsupervised contrastive loss is written as:

$$\mathcal{L}^u_i = -\log \frac{\exp(z_i \cdot z'_i / \tau)}{\sum_n \mathbb{1}_{[n \neq i]} \exp(z_i \cdot z_n / \tau)}, \tag{1}$$

where $z_i = \phi(f(x_i))$ and $\mathbb{1}_{[n \neq i]}$ is an indicator function evaluating to 1 iff $n \neq i$, and $\tau$ is a temperature value. $f$ is the feature backbone, and $\phi$ is a multi-layer perceptron (MLP) projection head.

The supervised contrastive loss is written as:

$$\mathcal{L}^s_i = -\frac{1}{|\mathcal{N}(i)|} \sum_{q \in \mathcal{N}(i)} \log \frac{\exp(z_i \cdot z_q / \tau)}{\sum_n \mathbb{1}_{[n \neq i]} \exp(z_i \cdot z_n / \tau)}, \tag{2}$$

where $\mathcal{N}(i)$ denotes the indices of other images having the same label as $x_i$ in the mini-batch $B$. Finally, we construct the total loss over the batch as:

$$\mathcal{L}^t = (1 - \lambda) \sum_{i \in B} \mathcal{L}^u_i + \lambda \sum_{i \in B_L} \mathcal{L}^s_i \tag{3}$$

where $B_L$ corresponds to the labelled subset of $B$ and $\lambda$ is a weight coefficient. Using the labels only in a contrastive framework, rather than in a cross-entropy loss, means that unlabelled and labelled data are treated similarly. The supervised contrastive component is only used to nudge the network towards a semantically meaningful representation, thereby minimizing overfitting on the labelled classes.

3.1.2 Label assignment with semi-supervised k-means

Given the learned representation for the data, we can now assign class or cluster labels for each unlabelled data point, either from the labelled classes or unseen new classes. Instead of performing this parametrically as is common in NCD (and risking overfitting to the labelled data), we propose to use a non-parametric method. Namely, we propose to modify the classic k-means into a constrained algorithm by forcing the assignment of the instances in $\mathcal{D}_L$ to the correct cluster based on their ground-truth labels. Note, here we assume knowledge of the number of clusters, $k$. We tackle the problem of estimating this parameter in Sec. 3.2. The initial $|\mathcal{Y}_L|$ centroids for $\mathcal{D}_L$ are obtained based on the ground-truth class labels, and an additional $|\mathcal{Y}_U \setminus \mathcal{Y}_L|$ (number of new classes) initial centroids are obtained from $\mathcal{D}_U$ with k-means++ [1], constrained on the centroids of $\mathcal{D}_L$. During each centroid update and cluster assignment cycle, instances from the same class in $\mathcal{D}_L$ are always forced to have the same cluster assignment, while each instance in $\mathcal{D}_U$ can be assigned to any cluster based on the distance to different centroids. After the semi-supervised k-means converges, each instance in $\mathcal{D}_U$ can be assigned a cluster label. We provide a clear diagram of this in Appendix B.

3.2. Estimating the class number in unlabelled data

Here, we tackle the problem of finding the number of classes in the unlabelled data. In the NCD and unsupervised clustering settings, prior knowledge of the number of categories in the dataset is often assumed, but this is unrealistic in the real world given that the labels themselves are unknown. To estimate the number of categories in $\mathcal{D}_U$, we leverage the information available in $\mathcal{D}_L$. Specifically, we perform k-means clustering on the entire dataset, $\mathcal{D}$, before evaluating clustering accuracy on only the labelled subset (see Sec. 4.1 for the metric’s definition).

Clustering accuracy is evaluated by running the Hungarian algorithm [28] to find the optimal assignment between the set of cluster indices and ground truth labels. If the number of clusters is higher than the total number of classes, the extra clusters are assigned to the null set, and all instances assigned to those clusters are said to have been predicted incorrectly. Conversely, if the number of clusters is lower than the number of classes, extra classes are assigned to the null set, and all instances with those ground truth labels are said to have been predicted incorrectly. Thus, we assume that, if the clustering (across $\mathcal{D}$) is performed with $k$ too high or too low, then this will be reflected in a sub-optimal clustering accuracy on $\mathcal{D}_L$. In other words, we assume clustering accuracy on the labelled set will be maximized when $k = |\mathcal{Y}_L \cup \mathcal{Y}_U|$. This intuition leads us to use the clustering accuracy as a ‘black box’ scoring function, $ACC = f(k; \mathcal{D})$, which we optimize with Brent’s algorithm to find the optimal $k$. Different to the method in [18],
which exhaustively iterates through all possible values of k, Table 1. Datasets used in our experiments. We show the number
we find that the black-box optimization allows our method of classes in the labelled and unlabelled sets (|YL |, |YU |), as well
to scale to datasets with many categories. as the number of images (|DL |, |DU |).
CIFAR10 CIFAR100 ImageNet-100 CUB SCars Herb19
Finally, we highlight that labelled sets with different
|YL | 5 80 50 100 98 341
granularities would induce different estimates of the num-
|YU | 10 100 100 200 196 683
ber of classes. However, we suggest that the labelled set de-
|DL | 12.5k 20k 31.9k 1.5k 2.0k 8.9k
fines the system of categorization — that the granularity of a |DU | 37.5k 30k 95.3k 4.5k 6.1k 25.4k
real-world dataset is not an intrinsic property of the images,
but rather a framework imposed by the labels. For instance, ing instances from these classes, along with all instances
in Stanford Cars, the dataset could equally be labelled at the from the other classes, constitute DU . We further construct
‘Manufacturer’, ‘Model’ or ‘Variant’ level, with the catego- the validation set for the labelled classes from the test or
rization system defined by the labels assigned. validation split of each dataset.
We first demonstrate results on three generic object
3.3. Two strong baselines recognition datasets: CIFAR10 [27], CIFAR100 [27] and
We adapt two methods from the nearest image recog- ImageNet-100 [10]. ImageNet-100 refers to the ImageNet
nition sub-field, novel category discovery (NCD), for our dataset with 100 classes randomly subsampled. These
generalized category discovery (GCD) task. RankStats [18] datasets establish the methods’ performance on well-known
is widely used as a competitive baseline for novel category datasets in the standard image recognition literature.
discovery, while UNO [13] is to the best of our knowledge We further evaluate on the recently proposed Semantic
the state-of-the-art method for NCD. Shift Benchmark [45] (SSB, including CUB [46] and Stan-
Baseline: RankStats+ RankStats trains two classifiers ford Cars [26]), as well as on Herbarium19 [42]. SSB pro-
on top of a shared feature representation: the first head vides fine-grained evaluation datasets with clear ‘axes of
is fed instances from the labelled set and is trained with semantic variation’ and further provides categories for DU
the cross-entropy loss, while the second head sees only in- which are delineated from DL in a semantically coherent
stances from unlabelled classes (again, in the NCD setting, fashion. Thus, the user can be confident that the recognition
the labelled and unlabelled classes are disjoint). In order to system is identifying new classes based on a true semantic
adapt RankStats to GCD, we train it with a single classifi- signal, rather than simply responding to low-level distribu-
cation head for the total number of classes in the dataset. tional shifts in the data, as may be the case for generic ob-
We then train the first |YL | elements of the head with the ject recognition datasets. The long-tailed nature of Herbar-
cross-entropy loss, and train the entire head with the binary ium19 adds an additional challenge to the evaluation.
cross-entropy loss with pseudo-labels. The fine-grained datasets further reflect many real-world
Baseline: UNO+ Similarly to RankStats, UNO is trained use cases for image recognition systems, which are de-
with classification heads for labelled and unlabelled data. ployed in constrained environments with many similar ob-
The model is then trained in a SwAV-like manner [5]. First, jects (e.g. products in a supermarket, traffic monitoring,
multiple views (random augmentations) of a batch are gen- or animal tracking in the wild). In fact, the Herbarium19
erated and fed to the same model. For the labelled images in dataset itself represents a real-world use case for GCD:
the batch, the labelled head is trained with the cross-entropy while we are aware of roughly 400k species of plants, and
loss using the ground truth labels. For the unlabelled im- estimate that there are around 80k yet to be discovered,
ages, predictions (logits from the unlabelled head) are gath- it currently takes roughly 35 years from plant collection
ered for a given view and used as pseudo-labels with which to plant species description if performed manually [42].
to optimize the loss from other views. To adapt this mech- We summarize the dataset splits used in our evaluations
anism, we simply concatenate both the labelled and unla- in Tab. 1, and provide more details in Appendix A.
belled heads, thus allowing generated pseudo-labels for the Evaluation protocol For each dataset, we train the mod-
unlabelled samples to belong to any class in the dataset. els on D (without access to the ground truth labels in DU ).
At test-time, we measure the clustering accuracy between
4. Experiments the ground truth labels yi and the model’s predictions ŷi as:

4.1. Experimental setup 1 X


M
ACC = max 1{yi = p(ŷi )} (4)
Data We demonstrate results on six datasets in our pro- p2P(YU ) M
i=1
posed setting. For each dataset, we take the training set
and sample a set of classes for which we have labels during Here, M = |DU | and P(YU ) is the set of all permutations
training. We further sub-sample 50% of the images from of the class labels in the unlabelled set. Our main metric is
these classes to constitute the labelled set DL . The remain- ACC on ‘All’ instances, indicating image recognition accu-

7496
racy across the entire unlabelled set DU . We further report outperform our method, but this is at the expense of ACC
values for both the ‘Old’ classes subset (instances in DU on the ‘New’ categories. We also found that, if the baselines
belonging to classes in YL ) and ‘New’ classes subset (in- were trained for longer, they would begin to sacrifice ACC
stances in DU belonging to classes in YU \ YL ). on ‘Old’ categories for ACC on ‘New’ ones, but that best
The maximum over the set of permutations is computed overall performance was achieved using early stopping by
via the Hungarian optimal assignment algorithm [28]. Im- monitoring the performance on the validation set.
portantly, we compute the Hungarian assignment only once,
across all categories YU , and measure classification accu- Table 2. Results on generic image recognition datasets.
CIFAR10 CIFAR100 ImageNet-100
racy on ‘Old’ and ‘New’ subsets only afterwards. The in-
Classes All Old New All Old New All Old New
teraction between when the Hungarian assignment is per-
k-means [30] 83.6 85.7 82.5 52.0 52.2 50.8 72.7 75.5 71.3
formed, and the resultant ACC on the subsets, can be un- RankStats+ 46.8 19.2 60.5 58.2 77.6 19.3 37.1 61.6 24.8
intuitive and is elaborated upon in Appendix E. UNO+ 68.6 98.3 53.8 69.5 80.6 47.2 70.3 95.0 57.9
Ours 91.5 97.9 88.2 70.8 77.6 57.0 74.1 89.8 66.3
Implementation details All methods are trained with a
ViT-B-16 backbone with DINO pre-trained weights, and
use the output [CLS] token as the feature representation.
Table 3. Results on SSB [45] and Herbarium19 [42].
All methods were trained for 200 epochs, with the best
CUB Stanford Cars Herbarium19
model selected using accuracy on the validation set. We
Classes All Old New All Old New All Old New
fine-tune the final transformer block for all methods.
k-means [30] 34.3 38.9 32.1 12.8 10.6 13.8 12.9 12.9 12.8
For our method, we fine-tune the final block of the vision RankStats+ 33.3 51.6 24.2 28.3 61.8 12.1 27.9 55.8 12.8
UNO+ 35.1 49.0 28.1 35.5 70.5 18.6 28.3 53.7 14.7
transformer with an initial learning rate of 0.1 which we de-
Ours 51.3 56.6 48.7 39.0 57.6 29.9 35.4 51.0 27.0
cay with a cosine annealed schedule. We use a batch size of 128 and λ = 0.35 in the loss (see Eq. (3)). Furthermore, following standard practice in self-supervised learning, we project the model’s output through a non-linear projection head before applying the contrastive loss. We use the same projection head as in [6] and discard it at test-time. For the baselines from NCD, we follow the original implementations and learning schedules as far as possible, referring to the original papers for details [13, 18].

Finally, in order to estimate k, we run our k-estimation method on DINO features extracted from each of the considered benchmarks. We run Brent’s algorithm on a constrained domain for k, with the minimum set at $|\mathcal{Y}_L|$ and the maximum set at 1000 classes for all datasets.

4.2. Comparison with the baselines

We report results for all compared methods in Tab. 2 and Tab. 3. As an additional baseline, we also report results when running k-means directly on top of raw DINO features (reported as k-means). Tab. 2 presents results on the generic object recognition datasets, while Tab. 3 shows results on SSB and Herbarium19. We further show results on the FGVC-Aircraft [31] evaluation from SSB in Appendix D.

Overall (across ‘All’ instances in $\mathcal{D}_U$), our method outperforms RankStats+ and UNO+ baselines on the standard image recognition datasets by 9.3% in absolute terms, and 11.5% in proportional terms. Meanwhile, on the more challenging fine-grained evaluations, our method outperforms the baselines by 8.9% in absolute terms and 27.0% in proportional terms.

We find that on categories with labelled examples (‘Old’ classes), the baselines which use parametric classifiers can

4.3. Estimating the number of classes

We report results on estimating the number of classes in Tab. 4. We find that on the generic object recognition datasets, we can come very close to the ground truth number of categories in the unlabelled set, with a maximum error of 10%. On the fine-grained datasets, we report an average discrepancy of 18.9%. We note the highly challenging nature of these datasets, with many constituent classes which are visually similar.

4.4. Ablation study

In Tab. 5, we inspect the contributions of the various elements of our proposed approach. Specifically, we identify the importance of the following components of the method: ViT backbone; contrastive fine-tuning (regular and supervised); and semi-supervised k-means clustering.

ViT Backbone. Rows (1) and (2) show the effect of the ViT model for the clustering task, as (1) and (2) represent a ResNet-50 model and ViT-B-16 trained with DINO respectively. The ResNet model performs nearly 20% worse aggregated over ‘Old’ and ‘New’ classes. To disambiguate this from the general capacity of the architecture, note that the ImageNet linear probe discrepancy (the standard evaluation protocol for self-supervised models) is roughly 3% [6]. Meanwhile, the discrepancy in their k-NN accuracies on

Table 4. Estimation of the number of classes in unlabelled data.

               CIFAR10   CIFAR100   ImageNet-100   CUB    SCars   Herb19
Ground truth   10        100        100            200    196     683
Ours           9         100        109            231    230     520
Error          10%       0%         9%             16%    15%     28%
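As a rough illustration of the k-estimation step described in Sec. 4.1 and reported in Tab. 4, the sketch below searches a bounded integer domain for the k that maximises a clustering-quality score. It is not the authors' implementation: a plain ternary search stands in for Brent's algorithm, the score is assumed to be strictly unimodal in k, and `estimate_k`, `score_fn` and `toy_score` are hypothetical names introduced here. In a real pipeline `score_fn(k)` could, for example, cluster the extracted features into k groups and return a quality measure.

```python
def estimate_k(score_fn, k_min, k_max):
    """Return (best_k, n_evaluations) for the integer k in [k_min, k_max]
    with the highest score_fn(k).

    Assumption: score_fn is strictly unimodal on the interval
    (strictly rising up to a single peak, then strictly falling).
    """
    cache = {}

    def score(k):
        # Each evaluation may be expensive (e.g. a full clustering run), so memoise.
        if k not in cache:
            cache[k] = score_fn(k)
        return cache[k]

    lo, hi = k_min, k_max
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if score(m1) < score(m2):
            lo = m1 + 1  # the peak lies strictly to the right of m1
        else:
            hi = m2 - 1  # the peak lies strictly to the left of m2
    best = max(range(lo, hi + 1), key=score)
    return best, len(cache)


def toy_score(k):
    # Hypothetical stand-in for a clustering-quality score peaking at k = 100.
    return -abs(k - 100)


# Bounded domain: lower limit plays the role of |Y_L|, upper limit is 1000.
k_hat, n_evals = estimate_k(toy_score, k_min=80, k_max=1000)
```

The bounded domain mirrors the setup in the text (a lower limit given by the number of labelled classes and an upper limit of 1000), and the memoised search needs only a few dozen score evaluations rather than one per candidate k.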

[Figure 2 panels, left to right: ResNet50 (DINO), ViT (DINO), ViT (Ours)]

Figure 2. t-SNE visualization of instances in CIFAR10 for features generated by a ResNet-50 and ViT model trained with DINO self-supervision on ImageNet, and a ViT model after fine-tuning with our approach.

Table 5. Ablation study on the different components of our approach.

      ViT        Contrastive   Sup. Contrastive   Semi-Sup   CIFAR100             Herbarium19
      Backbone   Loss          Loss               k-means    All    Old    New    All    Old    New
(1)   ✗          ✗             ✗                  ✗          34.0   34.8   32.4   12.1   12.5   11.9
(2)   ✓          ✗             ✗                  ✗          52.0   52.2   50.8   12.9   12.9   12.8
(3)   ✓          ✓             ✗                  ✗          54.6   54.1   53.7   14.3   15.1   13.9
(4)   ✓          ✗             ✓                  ✗          60.5   72.2   35.0   17.8   22.7   15.4
(5)   ✓          ✓             ✓                  ✗          71.1   78.3   56.6   28.7   32.1   26.9
(6)   ✓          ✓             ✓                  ✓          73.0   76.2   66.5   35.4   51.0   27.0
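The ACC values in Tab. 5 depend on how predicted clusters are matched to ground-truth classes. Below is a minimal sketch of that evaluation, assuming as many clusters as classes. An exhaustive search over permutations is used here in place of the Hungarian algorithm [28]: both return an optimal one-to-one assignment, but the exhaustive version is only feasible for a handful of classes. The names `best_assignment` and `clustering_acc` and the toy labels are illustrative, not the authors' code.

```python
from itertools import permutations


def best_assignment(y_true, y_pred, n_classes):
    """Return {cluster_id: class_id} maximising the number of correct instances."""
    # counts[c][t]: how many instances of ground-truth class t fell into cluster c
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, c in zip(y_true, y_pred):
        counts[c][t] += 1
    # Exhaustive search over one-to-one assignments (tiny n_classes only).
    best = max(
        permutations(range(n_classes)),
        key=lambda perm: sum(counts[c][perm[c]] for c in range(n_classes)),
    )
    return dict(enumerate(best))


def clustering_acc(y_true, y_pred, mapping, subset=None):
    """Fraction of instances whose mapped cluster equals the ground-truth class."""
    idx = list(range(len(y_true)) if subset is None else subset)
    return sum(mapping[y_pred[i]] == y_true[i] for i in idx) / len(idx)


# Toy data: three classes, nine instances; classes 0 and 1 play the role of 'Old'.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 0, 0, 0, 0, 2, 2, 1]

# One assignment computed over 'All' instances, then reused for the 'Old' subset.
mapping = best_assignment(y_true, y_pred, n_classes=3)
old_idx = [i for i, t in enumerate(y_true) if t in (0, 1)]
acc_all = clustering_acc(y_true, y_pred, mapping)
acc_old = clustering_acc(y_true, y_pred, mapping, subset=old_idx)
```

Here the assignment is computed once over all instances and reused for the subset. Calling `best_assignment` separately on the ‘Old’ and ‘New’ instances would correspond to the independent evaluation discussed in Sec. 4.4, where a cluster may be reused across the two subsets.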

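The gains quoted in the ablation discussion are absolute differences between rows of Tab. 5, i.e. percentage points of ACC. The short check below recomputes them from the table values; the dictionary layout and the helper `gain` are introduced here purely for illustration.

```python
# ACC (%) copied from Tab. 5, rows (2), (5) and (6), as (All, Old, New) per dataset.
table5 = {
    "CIFAR100": {2: (52.0, 52.2, 50.8), 5: (71.1, 78.3, 56.6), 6: (73.0, 76.2, 66.5)},
    "Herbarium19": {2: (12.9, 12.9, 12.8), 5: (28.7, 32.1, 26.9), 6: (35.4, 51.0, 27.0)},
}
SUBSETS = ("All", "Old", "New")


def gain(dataset, row_from, row_to):
    """Percentage-point change in ACC between two ablation rows, per subset."""
    before, after = table5[dataset][row_from], table5[dataset][row_to]
    return {s: round(a - b, 1) for s, b, a in zip(SUBSETS, before, after)}


# Row (2) -> (5): adding both contrastive losses on top of raw DINO ViT features.
contrastive = {d: gain(d, 2, 5) for d in table5}
# Row (5) -> (6): adding semi-supervised k-means.
semi_sup = {d: gain(d, 5, 6) for d in table5}
```

Read this way, the rounded figures in the text (19 and 16 points from contrastive fine-tuning, 2 and 7 points from semi-supervised k-means on ‘All’ classes) follow directly from the table, as does the small drop on the CIFAR100 ‘Old’ classes.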
ImageNet is roughly 9% [6], suggesting why the ViT model performs so much better for the clustering task.

Contrastive fine-tuning. Rows (2)-(5) show the effects of introducing different combinations of contrastive fine-tuning on the target dataset. We find that including any of the contrastive methods alone gives relatively marginal improvements over using the raw DINO features. We find that the full benefit is only realized when combining the self-supervised and supervised contrastive losses on the target dataset. Specifically, the combination of the contrastive losses allows us to boost aggregated clustering accuracy by a further 19% on CIFAR100 and by 16% on Herbarium19 (more than doubling the ACC in this case).

Semi-supervised k-means. Further performance gains can be realized with semi-supervised clustering. Across ‘All’ classes, we observe a 2% and 7% increase in ACC on CIFAR100 and Herbarium19 respectively. On Herbarium19, ACC on the ‘Old’ classes is improved by 19%. Interestingly, it appears that semi-supervised k-means slightly hurts performance on the ‘Old’ classes on CIFAR100. We suggest that this is an artefact of the Hungarian algorithm, which opts to assign some ‘clean’ clusters to the ‘New’ ground truth categories, to maximize overall ACC. This can be observed in the 10% boost provided by the semi-supervised method on the ‘New’ classes in CIFAR100. Furthermore, we found that if we performed the Hungarian algorithm on ‘Old’ and ‘New’ instances independently (allowing the reuse of clean clusters during evaluation), semi-supervised k-means improved ACC on all data subsets. We refer to Appendix E for more details on the interaction between the reported ACC and the Hungarian assignment.

Summary. Overall, we find that none of the components of our method are individually sufficient to achieve good performance across our benchmark datasets. Specifically, the combination of a vision transformer backbone and contrastive fine-tuning facilitates strong k-means clustering directly in the feature space of the model. The semi-supervised k-means algorithm further allows us to guide the clustering process with labels and achieve better ACC, especially on the ‘New’ classes in the fine-grained datasets.

We further illustrate this point in the t-SNE visualizations in Fig. 2, performed on the CIFAR10 dataset. We show t-SNE projections of raw ResNet-50 and ViT DINO features, as well as those of our model. For the ResNet-50 features, points from the same class are generally projected close to each other, indicating that they are likely to be separable given a simple transformation (e.g. a linear probe). However, they do not form clear clusters, hinting at these features’ poor down-stream clustering performance. In contrast, the ViT features form far clearer clusters, which are further distinguished when trained with our approach.

4.5. Qualitative results

Finally, we visualize the attention mechanism of our model to better understand its performance. Specifically, in Fig. 3 we look at how the final multi-head attention layer attends to different spatial locations when supporting the output [CLS] token (which we use as our feature representation). We show this both for the pre-trained DINO model and after training with our method. We visualize the attention maps for images from the ‘Old’ and ‘New’ classes for Stanford Cars and CUB.

[Figure 3 panels: DINO-ViT before fine-tuning (left) and ViT after fine-tuning w/ ours (right); columns Head1, Head2, Head3 for each model; rows Stanford Cars and CUB]

Figure 3. Attention visualizations for the DINO-ViT model before (left) and after (right) fine-tuning with our approach. For Stanford Cars and CUB, we show an image from the ‘Old’ (first row for each dataset) and ‘New’ classes (second row for each dataset). Our model learns to specialize attention heads (shown as columns) to different semantically meaningful parts, which can transfer between the labelled and unlabelled categories. The model’s heads learn ‘Windshield’, ‘Headlight’ and ‘Wheelhouse’ for the cars, and ‘Beak’, ‘Head’ and ‘Belly’ for the birds. For both models, we select heads with as focused attention as possible. Recommended viewing in color with zoom.

It is demonstrated in [6] that different attention heads in the DINO model focus on different regions of an image, without the need for human annotation. We find this to be the case, with different heads attending to disjoint regions of the image and typically focusing on important parts. However, after training with our method, we find heads to be more specialized to semantic parts, displaying more concentrated and local attention. In this way, we suggest the model learns to attend to a set of parts which are transferable between the ‘Old’ and ‘New’ classes, which allows it to better generalize knowledge from the labelled data.

5. Conclusion

In this paper, we have proposed a new setting for image recognition, ‘Generalized Category Discovery’ (GCD). We highlight three take-home messages from this work: first, GCD is a challenging and realistic setting for image recognition; second, GCD removes limiting assumptions in existing image recognition sub-fields such as novel category discovery and open-set recognition; and third, while parametric classifiers tend to overfit to labelled classes in the generalized setting, direct clustering of features from contrastively trained ViTs proves to be a surprisingly good method for classification.

Acknowledgements. We would like to thank Liliane Momeni for invaluable help with figures in this work. This research is funded by a Facebook AI Research Scholarship, a Royal Society Research Professorship RP\R1\191132, and the EPSRC Programme Grant VisualAI EP/T028572/1.

References

[1] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM symposium on Discrete algorithms, 2007.
[2] Yuki M. Asano, Christian Rupprecht, Andrew Zisserman, and Andrea Vedaldi. PASS: An ImageNet replacement for self-supervised pretraining without humans. NeurIPS Track on Datasets and Benchmarks, 2021.
[3] Abhijit Bendale and Terrance E. Boult. Towards open set deep networks. In CVPR, 2016.
[4] Kaidi Cao, Maria Brbic, and Jure Leskovec. Open-world semi-supervised learning. In International Conference on Learning Representations, 2022.
[5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[7] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006.
[8] Guangyao Chen, Peixi Peng, Xiangqian Wang, and Yonghong Tian. Adversarial reciprocal points learning for open set recognition. IEEE TPAMI, 2021.
[9] Guangyao Chen, Limeng Qiao, Yemin Shi, Peixi Peng, Jia Li, Tiejun Huang, Shiliang Pu, and Yonghong Tian. Learning open set network with discriminative reciprocal points. In ECCV, 2020.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[12] Lei Feng, Senlin Shu, Zhuoyi Lin, Fengmao Lv, Li Li, and Bo An. Can cross entropy loss be robust to label noise? In IJCAI, 2020.
[13] Enrico Fini, Enver Sangineto, Stéphane Lathuilière, Zhun Zhong, Moin Nabi, and Elisa Ricci. A unified objective for novel class discovery. In ICCV, 2021.
[14] Zongyuan Ge, Sergey Demyanov, and Rahil Garnavi. Generative openmax for multi-class open set classification. In BMVC, 2017.
[15] Sharath Girish, Saksham Suri, Saketh Rambhatla, and Abhinav Shrivastava. Towards discovery and attribution of open-world gan generated images. In ICCV, 2021.
[16] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[17] Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. Automatically discovering and learning new visual categories with ranking statistics. In ICLR, 2020.
[18] Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, and Andrew Zisserman. Autonovel: Automatically discovering and learning novel visual categories. IEEE TPAMI, 2021.
[19] Kai Han, Andrea Vedaldi, and Andrew Zisserman. Learning to discover novel visual categories via deep transfer clustering. In ICCV, 2019.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] Yen-Chang Hsu, Zhaoyang Lv, and Zsolt Kira. Learning to cluster in order to transfer across domains and tasks. In ICLR, 2018.
[22] Yen-Chang Hsu, Zhaoyang Lv, Joel Schlosser, Phillip Odom, and Zsolt Kira. Multi-class classification without multi-class labels. In ICLR, 2019.
[23] Xuhui Jia, Kai Han, Yukun Zhu, and Bradley Green. Joint representation learning and novel category discovery on single- and multi-modal data. In ICCV, 2021.
[24] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.
[25] Shu Kong and Deva Ramanan. Opengan: Open-set recognition via open data generation. ICCV, 2021.
[26] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), 2013.
[27] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[28] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 1955.
[29] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
[30] James MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[31] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[32] Lawrence Neal, Matthew Olson, Xiaoli Fern, Weng-Keen Wong, and Fuxin Li. Open set learning with counterfactual images. In ECCV, 2018.
[33] Avital Oliver, Augustus Odena, Colin Raffel, Ekin D. Cubuk, and Ian J. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In NeurIPS, 2018.
[34] Poojan Oza and Vishal M. Patel. C2ae: Class conditioned auto-encoder for open-set recognition. In CVPR, 2019.

[35] Vinay Uday Prabhu and Abeba Birhane. Large image datasets: A pyrrhic win for computer vision? In Proc. WACV, 2020.
[36] Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, and Tapani Raiko. Semi-supervised learning with ladder networks. In NeurIPS, 2015.
[37] Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Semi-supervised learning with scarce annotations. arXiv, 2019.
[38] Walter J. Scheirer, Anderson Rocha, Archana Sapkota, and Terrance E. Boult. Towards open set recognition. IEEE TPAMI, 2013.
[39] Yu Shu, Yemin Shi, Yaowei Wang, Tiejun Huang, and Yonghong Tian. P-odn: Prototype-based open deep network for open set recognition. Scientific Reports, 2020.
[40] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.
[41] Xin Sun, Zhenning Yang, Chi Zhang, Guohao Peng, and Keck-Voon Ling. Conditional gaussian distribution learning for open set recognition. In CVPR, 2020.
[42] Kiat Chuan Tan, Yulong Liu, Barbara Ambrose, Melissa Tulig, and Serge Belongie. The herbarium challenge 2019 dataset. In Workshop on Fine-Grained Visual Categorization, 2019.
[43] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[45] Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need. In International Conference on Learning Representations, 2022.
[46] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[47] Kaiyu Yang, Jacqueline Yau, Li Fei-Fei, Jia Deng, and Olga Russakovsky. A study of face obfuscation in imagenet. CoRR, abs/2103.06191, 2021.
[48] Ryota Yoshihashi, Wen Shao, Rei Kawakami, Shaodi You, Makoto Iida, and Takeshi Naemura. Classification-reconstruction learning for open-set recognition. In CVPR, 2019.
[49] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised semi-supervised learning. In ICCV, 2019.
[50] Hongjie Zhang, Ang Li, Jie Guo, and Yanwen Guo. Hybrid models for open set recognition. In ECCV, 2020.
[51] Bingchen Zhao and Kai Han. Novel visual category discovery with dual ranking statistics and mutual knowledge distillation. In NeurIPS, 2021.
[52] Zhun Zhong, Enrico Fini, Subhankar Roy, Zhiming Luo, Elisa Ricci, and Nicu Sebe. Neighborhood contrastive learning for novel class discovery. In CVPR, 2021.
[53] Zhun Zhong, Linchao Zhu, Zhiming Luo, Shaozi Li, Yi Yang, and Nicu Sebe. Openmix: Reviving known knowledge for discovering novel visual categories in an open world. In CVPR, 2021.
