Multilabel Image Classification Via Feature/Label Co-Projection

This article introduces a novel approach for multilabel image classification that combines direct label recognition with contextual information through feature/label co-projection. The method effectively captures label correlations by embedding both image features and labels into a shared latent space, achieving superior performance on COCO and PASCAL VOC benchmarks. The proposed model is efficient at test time, requiring minimal additional weights compared to traditional ConvNet architectures.


This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS 1

Multilabel Image Classification via Feature/Label Co-Projection

Shiping Wen, Weiwei Liu, Yin Yang, Pan Zhou, Member, IEEE, Zhenyuan Guo, Zheng Yan, Yiran Chen, and Tingwen Huang, Fellow, IEEE

Abstract: This article presents a simple and intuitive solution for multilabel image classification, which achieves competitive performance on the popular COCO and PASCAL VOC benchmarks. The main idea is to capture how humans perform this task: we recognize both labels (i.e., objects and attributes) and the correlation of labels at the same time. Here, label recognition is performed by a standard ConvNet pipeline, whereas label correlation modeling is done by projecting both labels and image features extracted by the ConvNet to a common latent vector space. Specifically, we carefully design the loss function to ensure that: 1) labels and features that co-appear frequently are close to each other in the latent space and 2) conversely, labels/features that do not appear together are far apart. This information is then combined with the original ConvNet outputs to form the final prediction. The whole model is trained end-to-end, with no additional supervised information other than the image-level supervised information. Experiments show that the proposed method consistently outperforms previous approaches on COCO and PASCAL VOC in terms of mAP, macro/micro precision, recall, and F-measure. Further, our model is highly efficient at test time, with only a small number of additional weights compared to the base model for direct label recognition.

Index Terms: Deep learning, label embedding, multilabel classification, neural network.

Manuscript received January 31, 2019; revised May 16, 2019; accepted January 13, 2020. This work was supported in part by the Natural Science Foundation of China under Grant 61673187, and in part by the National Priorities Research Program through Qatar National Research Fund under Grant NPRP 9-466-1-103. This article was recommended by Associate Editor W. Hsu. (Corresponding author: Yin Yang.)
Shiping Wen is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]).
Weiwei Liu is with the School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]).
Yin Yang is with the College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar (e-mail: [email protected]).
Pan Zhou is with the School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]).
Zhenyuan Guo is with the College of Mathematics and Econometrics, Hunan University, Changsha 410082, China (e-mail: [email protected]).
Zheng Yan is with the Centre for Artificial Intelligence, University of Technology Sydney, Ultimo, NSW 2007, Australia.
Yiran Chen is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA (e-mail: [email protected]).
Tingwen Huang is with the Science Program, Texas A&M University at Qatar, Doha, Qatar (e-mail: [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSMC.2020.2967071

I. INTRODUCTION

MULTILABEL image classification is a fundamental task in computer vision with numerous applications [1]–[6]. In this task, each input image is associated with a set of labels, where the universe of all possible labels is given, but the number of labels matching an image is often not known beforehand, and can vary from image to image. For example, in Fig. 1(a), the image clearly matches labels such as “person,” “tennis racket,” and “tennis ball.” The output of multilabel classification is usually represented as a binary vector, in which each bit indicates the presence or absence of a label in the given image.

There has been a plethora of methods for multilabel image classification. Yet, few of them reflect how humans approach this problem. To illustrate, consider Fig. 1(b), which covers up the left half of the image in Fig. 1(a). To a human, the image here presents a context (e.g., from the pose of the man and the position of his racket) that strongly suggests the existence of a tennis ball. The photograph would be rather unsatisfying if it did not show a ball, and downright bizarre if, instead of a ball, there were a sheep or the face of a celebrity at the left side of the image. Meanwhile, the context alone may be insufficient to identify all matching labels. Fig. 1(a), for instance, also matches the label “chair,” which is not obvious from the context, and needs to be recognized from its own visual features. However, a machine algorithm understands data quite differently from a human being. It is much harder for the algorithm to recognize small objects such as the ball than to identify large objects such as the person. In order to relate image data to the corresponding labels, we propose to map the image and its labels to the same latent space. In the latent space, we can explicitly model the label correlation information.

In this article, we propose a novel solution that captures the above intuitions, and combines direct label recognition with image feature extraction, as illustrated in Fig. 1(c). Specifically, a ConvNet pipeline extracts features from the input image, which are fed to a fully connected layer for direct label recognition. Meanwhile, these image features, as well as the labels associated with the image, are projected to a common vector space through embedding. There is a certain correlation among labels that often appear in the same image. Therefore, in this latent space, we require that: 1) the projection of image features should be close to those of the associated labels, as well as features from images associated with the correlated labels and 2) conversely, the projected
2168-2216 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: SRM University. Downloaded on March 28,2021 at 11:31:02 UTC from IEEE Xplore. Restrictions apply.

image features should be far apart from labels that are not associated with the image. These requirements are enforced through our well-designed loss function, which also includes the classification loss of the final predictions. In our implementation, the final prediction is simply the sum of the direct predictions and the features learned from the latent space.

Fig. 1. (a) Input image associated with labels person, tennis racket, and tennis ball. (b) Right half of the same image, from which the presence of the tennis ball can be inferred. (c) Proposed neural network pipeline that combines both direct predictions from a ConvNet and contextual information extracted by projecting image features and labels to a common vector space.

A naive approach for multilabel classification is to construct a binary classifier for each label [7], which disregards the correlation among labels completely. Similarly, methods based on region proposals, e.g., HCP [8], improve the accuracy of direct label recognition by focusing on relevant image patches; yet, these methods fail to capture label semantics information. A refined solution by Zhu et al. [2] applied visual attention to model spatial and semantic correlations between labels. None of these methods, however, fully exploits label correlation. In our implementation, we use the plain-old ConvNet [9] for direct label recognition; the above techniques could potentially further enhance the accuracy of our model.

Among methods that aim to model the label dependencies, earlier attempts mainly focus on utilizing label correlations (e.g., [10]–[29]) as auxiliary information. One problem with this idea is that it fails to capture visual correlation: for instance, the label person by itself is not strongly correlated with tennis ball, but the specific visual features of the person (e.g., his pose and attire) in Fig. 1(b) do suggest the presence of a tennis ball. Recent work by Yeh et al. [30] performed label embedding through an autoencoder, and additionally projected ConvNet features to the embedding space through canonical correlation analysis (CCA). This approach, however, does not contain a direct label recognition module. Finally, another line of work applies recurrent networks, e.g., [4] and [31], which recognize labels sequentially, e.g., first a person, then a tennis racket, and third a tennis ball. Earlier labels then provide context for later ones. Intuitively, humans normally do not identify objects or attributes sequentially, except for solving puzzles. Instead, we construct a holistic mental picture of the image context, as in the proposed solution.

We have experimentally evaluated the proposed solution on the popular COCO [32] and PASCAL VOC [33] benchmark datasets. The results demonstrate that our solution consistently and significantly outperforms existing methods on various metrics, including mean average precision (mAP), micro/macro precision, recall, and F-measure. Finally, our solution is highly efficient at test time, since it only introduces 2048 × C additional weights to the base ConvNet model, where C is the number of possible labels.

II. RELATED WORK

Multilabel classification is a fundamental problem in machine learning, with a wide range of applications in computer vision, text topic categorization, music retrieval, and gene analysis. One strategy to approach multilabel classification is to transform the problem to multiple single-label classification tasks (e.g., [34]–[36]), which can be either binary or multiclass. Those methods can be categorized as first-order strategies and ignore correlation among labels. There are also second-order strategy [36]–[39] and high-order strategy methods [40]–[43]. Other methods adapt single-label classifiers, such as decision trees [44], boosting [45], K-nearest neighbors [46], and neural networks [47]. These methods, however, are not designed for large-scale image classification problems and fail to exploit label correlation.

In addition, other researchers proposed to relate image features and label domain data in a latent space and learn label correlation in that latent space. To achieve this, C2AE [30] introduces a DNN architecture into CCA and the autoencoder model. C2AE [30] builds the embedding space through an autoencoder on the labels, and then projects the image features extracted by a ConvNet to the same latent space via CCA. C2AE also lacks a direct label recognition module and assumes that the number of labels associated with an image is known in advance. Further, methods based on embedding are also used in image retrieval [48], visual-semantic embedding [49], and natural language tasks [50]. Unlike these embedding methods, however, our solution performs relation learning through our designed ranking loss.

The idea of modeling context by constructing a latent vector space for labels and image features has also been explored in previous methods, e.g., using SVD [51], compressed sensing [52], and SLEEC [1]. A common problem with these earlier approaches is that they lack a modern, ConvNet-based direct label recognition module. As explained in Section I, not all labels can be inferred from the context (such as chair in Fig. 1), and direct recognition is necessary for such labels.

Deep learning provides a new, feasible solution for large-scale multilabel image classification. Most deep learning methods design a CNN-RNN architecture to solve multilabel classification by learning semantic information or capturing global dependencies among learned features [3], [4], [31], [53]. HCP [8] follows an object detection pipeline that generates region proposals, and applies a classifier to each region proposal for multilabel classification. WSD [54] proposed to improve multilabel classification performance by distilling knowledge from a weakly supervised detection task without



Fig. 2. Overview of the proposed solution for multilabel image classification. Orange squares represent neural network layers, and blue cubes denote the feature maps output by a network pipeline. The proposed network consists of three modules: feature extractor, feature/label co-projector, and classifier. The feature extractor outputs a feature map as a 14 × 14 × 2048 tensor. A subsequent convolution layer then generates the new image features as inputs to the feature/label co-projector, which embeds these features and the labels associated with the image to the same latent space, shown as a gray circle. Dots with different colors represent different embedded data: red and blue ones are embedded positive and negative labels, respectively, whereas orange/gray dots are embedded positive/negative image features, respectively. Green (resp., blue) arrows indicate that data should be away from (resp., close to) each other, as embodied in the proposed loss function.
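
As a quick sanity check on the tensor sizes named in the Fig. 2 caption, the following sketch traces the shapes through the pipeline with plain arithmetic. It assumes a 448 × 448 input (the size reported in the experiments), a total ResNet-101 stride of 32, and a projection to one channel per label; the function names are illustrative and are not taken from the authors' code.

```python
# Shape bookkeeping only; no deep learning library is needed.
# Assumptions: 448x448 input, total backbone stride of 32, 2048 output channels.

def feature_map_shape(input_size, total_stride=32, channels=2048):
    """Shape (H, W, channels) of the last convolutional feature map."""
    side = input_size // total_stride
    return (side, side, channels)


def projected_shape(feature_shape, num_labels):
    """Shape after a convolution that maps the channels to one map per label."""
    height, width, _ = feature_shape
    return (height, width, num_labels)


def extra_test_time_weights(num_labels, channels=2048):
    """The 2048 x C additional weights the paper reports at test time."""
    return channels * num_labels


if __name__ == "__main__":
    fx = feature_map_shape(448)
    for name, num_labels in [("PASCAL VOC", 20), ("MS COCO", 80)]:
        print(name, fx, projected_shape(fx, num_labels),
              extra_test_time_weights(num_labels))
```

Under these assumptions the feature map is 14 × 14 × 2048, and the extra weight count grows linearly with the number of labels C.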

bounding box. SRN [2] used spatial regularization to learn attention maps for multilabel recognition. To further exploit label correlation information, DDPP [55] proposed the DPP module to capture label correlations while incorporating external knowledge about label co-occurrence. CorrLog [56] explicitly modeled the pairwise correlation between labels and improved the performance of multilabel recognition. Further, CGL [57] formulated the multilabel problem as a conditional graphical lasso inference problem and focused on image features when exploiting label correlation. Label correlation has therefore become a hot topic for the multilabel problem.

To summarize, previous methods, to our knowledge, miss either explicit context construction, or a ConvNet-based direct label recognition module; meanwhile, many of them require the knowledge of the number of labels associated with the image. The proposed solution, presented next, combines both context and direct recognition, and can identify an arbitrary number of labels from an image.

III. PROPOSED SOLUTION

The proposed solution contains three main components: 1) a feature extractor; 2) a feature/label co-projector that maps both image features and labels to the same latent vector space; and 3) a classifier that combines direct label recognition results from the feature extractor with contextual information extracted from the latent vector space. Fig. 2 shows the overall architecture of the proposed framework.

The feature extractor extracts visual features from the input image, which can be performed with a standard ConvNet pipeline commonly used for single-label image recognition tasks. These features can be viewed as abstract representations of visual contents in the image. From these features, we can build a direct label recognizer for each label, e.g., with a fully connected layer on top of the visual features. In addition, features from deeper neural network layers have richer semantic information and are more abstract.

The feature/label co-projector is responsible for embedding the convolutional image features and the corresponding labels, as explained in Section I. Both projectors can be viewed as encoders. The feature/label co-projector takes the feature extractor's features and the labels as inputs. Specifically, the projector component maps both visual features and one-hot-encoded labels to the same latent space. In the latent space, we can explicitly model label correlation. Then, a metric learning method is used to force the distance between correlated embedding vectors from image features and labels to be smaller than that between noncorrelated ones. Meanwhile, our well-designed constrained ranking loss ensures that the mapping correctly reflects the semantic relationships between images and labels. Finally, we use the output of the image feature embedding network as part of the features for label prediction.

Finally, the classifier combines direct label recognition results (one confidence value per class) with the image context from this latent mapping. In our implementation, the combination is an element-wise sum for simplicity. The whole model can be trained end-to-end with no additional data other than the images and ground truth labels in the training set.

A. Feature Extractor

As explained earlier, the feature extractor can be implemented with any standard ConvNet pipeline for single-label classification. Our implementation employs ResNet-101 [9], which achieves


competitive performance (7.1% top-5 error) on the ImageNet dataset. We remove the last pooling layer and the last classification layer and use the feature map from the last convolution layer as the input for our classification and embedding branches.

Formally, let $D = \{x_1, x_2, \ldots, x_i, \ldots, x_n\}^{d \times n}$ denote the set of images with corresponding labels $Y = \{y_1, y_2, \ldots, y_i, \ldots, y_n\}$, $y \in \{0, 1\}^{C \times n}$, where $y_i$ is a $C$-dimensional label vector for image $x_i$. Meanwhile, let $d$, $C$, and $n$ denote the image data dimension, total label number, and image dataset size, respectively; $y_{il}$ is 1 when $x_i$ has the $l$th label, and 0 otherwise. We feed the image $x$ to the feature extractor $f_{\mathrm{cnn}}$ to get the image features $F_x$

$$F_x = f_{\mathrm{cnn}}(x; \theta_{\mathrm{cnn}}), \quad F_x \in \mathbb{R}^{14 \times 14 \times 2048}. \tag{1}$$

B. Feature/Label Embedding

The embedding components of our solution capture the correlations between the image and its labels, as well as between different labels and features from different images. For this purpose, we design two mapping networks that embed visual features and labels, respectively, to the common latent space. The projections of features and labels in this space are then adjusted through back-propagation, using the proposed constrained ranking loss function, detailed later in Section IV.

Following common embedding network designs, we design a convolution network for projecting visual features from the feature extractor, and another pipeline consisting of fully connected layers for projecting one-hot-encoded labels. In general, our framework can work with any such projection pipelines, and our specific implementation is detailed later in Section V-A. In particular, for image feature projection, we first use a convolution layer $f_{\mathrm{conv}}$ to map the image feature $F_x$. The role of the convolutional layer is to turn image features $F_x \in \mathbb{R}^{14 \times 14 \times 2048}$ into a form $F_{\mathrm{conv}} \in \mathbb{R}^{14 \times 14 \times C}$ that is easier to optimize and understand. Each channel of $F_{\mathrm{conv}}$ represents the corresponding object class feature. If the label is included in the image, the corresponding channel has a larger activation.

Formally, let $f_{im}$ and $f_l$ denote the convolution network (for image feature projection) and the fully connected network (for label projection), respectively. We can get the embedding representations $F_e$ and $L_e$ as follows:

$$F_e = f_{im}(f_{\mathrm{conv}}(F_x; \theta_{\mathrm{conv}}); \theta_{im}), \quad F_e \in \mathbb{R}^{C} \tag{2}$$

$$F_{\mathrm{conv}} = f_{\mathrm{conv}}(F_x; \theta_{\mathrm{conv}}), \quad F_{\mathrm{conv}} \in \mathbb{R}^{14 \times 14 \times C} \tag{3}$$

$$L_e = f_l(y; \theta_l), \quad L_e \in \mathbb{R}^{C}. \tag{4}$$

During training, the projected vectors are adjusted through the proposed constrained ranking loss function, elaborated in Section IV. Intuitively, in the latent space, we aim to move the projections of the image closer to the projections of its associated labels (which we call positive labels), and away from the projections of labels not associated with the image (negative labels). Meanwhile, labels that are semantically correlated are moved close together through the training process, as are semantically correlated image features.

Finally, we fuse the features from the image and the mapping network to calculate the final prediction. Specifically, in the main classification module, we use the fusion of global max pooling and global average pooling to reduce the dimension of the image feature, followed by a fully connected layer that computes the initial prediction. We add the outputs of the two branches together to get the final predicted confidence $P \in \mathbb{R}^{N \times C}$. Max pooling can find the activation of small objects in the image, whereas average pooling can find the activation of bigger objects. The fusion of both poolings helps to find all labeled objects. For our channel-wise pooling, global max pooling is employed

$$P = f_{\mathrm{pool}}(f_{\mathrm{conv}}(F_x; \theta_{\mathrm{conv}})) + f_c(f_{\mathrm{pool}}(F_x); \theta_c). \tag{5}$$

IV. LOSS FUNCTION

A. Multilabel Soft Margin

In order to optimize our proposed framework, we use the Multi Label Soft Margin classification loss and the constrained ranking loss as our loss function. First, multilabel classification can be viewed as a one-to-many classification problem between the image and its labels. Note that we assume the general setting where the number of labels corresponding to each image is unknown. Previous works such as [58] incorporate a label decision module into the model, which estimates the optimal confidence thresholds for each visual concept. The Multi Label Soft Margin loss uses 0 as the label threshold instead of estimating the label thresholds. This makes it easier to optimize and more stable.

Specifically, our Multi Label Soft Margin creates a criterion that optimizes a multilabel one-vs-all loss based on cross entropy between inputs $X$ and the ground truth $Y$

$$\mathrm{Loss}(x, y) = -\sum_i \left[ y_i \log\left(\left(1 + e^{-F(x_i)}\right)^{-1}\right) + (1 - y_i) \log\left(\frac{e^{-F(x_i)}}{1 + e^{-F(x_i)}}\right) \right] \tag{6}$$

where $F$ denotes the mapping from image $x$ to label $y$. Ideally, $F$ should satisfy $F(x_i) = y_i$ for $i = 1, \ldots, N$. Since the Multi Label Soft Margin loss is based on cross entropy, it cannot capture the label dependency in the multilabel task.

TABLE I
ARCHITECTURE OF THE IMAGE FEATURE PROJECTION NETWORK IN OUR IMPLEMENTATION

B. Constrained Ranking Loss

To exploit feature/label correlations, we design a constrained ranking loss to capture the label dependency.

The ranking loss has been studied in the pre-deep-learning multilabel classification setting, e.g., with SVMs [37]. The ranking loss for mining multilabel data is computed in [7], where it averages over the samples the number of label pairs that are incorrectly ordered, i.e., true labels that have a lower score than false labels. The lowest achievable ranking loss is 0. The ranking loss used in this method indicates the number of irrelevant labels that are ranked higher than the relevant labels. However, not all the labels are considered simultaneously; instead, only the incorrectly ranked labels are considered. In fact, the label correlation is naturally local, where subsets of images share the correlation rather than all image instances. Huang and Zhou [59] measured the similarity between image instances in the label space rather than the feature space because the image instances with the same label share the same correlation.
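
To make the two classical losses discussed so far in this section concrete, here is a minimal plain-Python sketch. `multilabel_soft_margin` follows (6) term by term, `predict_labels` applies the threshold of 0 mentioned for that loss, and `ranking_loss` counts incorrectly ordered (relevant, irrelevant) label pairs. Normalizing by the number of pairs and counting ties as errors are assumptions of this sketch, and all function names are illustrative rather than taken from the authors' code.

```python
import math


def multilabel_soft_margin(scores, targets):
    """Eq. (6) for one image: scores are raw outputs F(x_i), targets are 0 or 1."""
    total = 0.0
    for s, y in zip(scores, targets):
        log_sig = -math.log1p(math.exp(-s))   # log(1 / (1 + e^{-s}))
        log_one_minus = -s + log_sig          # log(e^{-s} / (1 + e^{-s}))
        total += y * log_sig + (1 - y) * log_one_minus
    return -total


def predict_labels(scores):
    """Predict a label whenever its raw score exceeds the fixed threshold 0."""
    return [1 if s > 0 else 0 for s in scores]


def ranking_loss(score_rows, target_rows):
    """Mean fraction of (relevant, irrelevant) pairs ordered incorrectly.

    Assumptions: ties count as errors, and samples without any such
    pair are skipped.
    """
    per_sample = []
    for scores, targets in zip(score_rows, target_rows):
        pos = [s for s, y in zip(scores, targets) if y == 1]
        neg = [s for s, y in zip(scores, targets) if y == 0]
        if not pos or not neg:
            continue
        wrong = sum(1 for p in pos for n in neg if p <= n)
        per_sample.append(wrong / (len(pos) * len(neg)))
    return sum(per_sample) / len(per_sample) if per_sample else 0.0


if __name__ == "__main__":
    scores = [2.0, -1.0, 0.5]
    targets = [1, 0, 0]
    print(multilabel_soft_margin(scores, targets))
    print(predict_labels(scores))
    print(ranking_loss([scores], [targets]))
```

Note that the ranking loss above is already 0 for this example even though the third label is wrongly predicted as present, which illustrates why a ranking criterion alone does not fix the decision threshold.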
Meanwhile, researchers map the labels into a low- or high-dimensional latent space [60], [61] to solve multilabel classification. All these methods can be viewed as label embedding. In the latent space, the correlation between labels can be implicitly exploited. In our proposed solution, we use a deep convolution network $U : \mathbb{R}^{HW} \to \mathbb{R}^{C}$ to map the image feature maps to a latent space and a fully connected network $V : \mathbb{R}^{C} \to \mathbb{R}^{C}$ to map the corresponding labels to the same latent space. $H$ and $W$ denote the height and width of the corresponding image. Let $U(f)$ denote the embedded image features and $V(y)$ the embedded labels. Furthermore, we design a constrained ranking loss to measure the similarity between embedded images and labels. We consider all positive and negative labels simultaneously. In the embedding space, let $d(f_i^+, y_j^+)$ denote the distance between embedded positive features and embedded positive labels, and let $d(f_i^+, y_k^-)$ denote the distance between embedded positive features and embedded negative labels. Here, $y_i^+$ and $y_k^-$ denote the embedded positive and negative labels, respectively. We expect the distance $d(f_i^+, y_j^+)$ to be smaller than the distance $d(f_i^+, y_k^-)$ by a large margin $\delta$, which is set to 0.5 here. This leads to the following formulation:

$$d(f_i^+, y_j^+) + \delta \le d(f_i^+, y_k^-) \quad \forall y_j^+ \in Y^+, \ \forall y_k^- \in Y^-. \tag{7}$$

In our solution, $F_e$ and $L_e$, introduced in Section III-B, correspond to $f_i$ and $y_i$, respectively. Here, $d(f, y)$ denotes the Euclidean distance between image features and label features. Intuitively, in the same latent space, the positive features and the corresponding labels have similar embeddings and are separated from the negative labels by a large margin.

We also define the constraints for the label side

$$d(y_i^+, y_j^+) + \delta \le d(y_i^+, y_k^-) \quad \forall y_j^+ \in Y^+, \ \forall y_k^- \in Y^-. \tag{8}$$

These constraints ensure that the embedded positive labels are as close as possible to each other, and as far away as possible from the embedded negative labels. We then add the constraint terms to our baseline ranking loss function

$$\mathrm{Loss}_r = \lambda_1 \sum_{i,j,k} \left[ \delta + d(f_i^+, y_j^+) - d(f_i^+, y_k^-) \right]_+ + \lambda_2 \sum_{i,j,k} \left[ \delta + d(y_i^+, y_j^+) - d(y_i^+, y_k^-) \right]_+ \tag{9}$$

where $[z]_+ = \max(0, z)$, and $\lambda_1$ and $\lambda_2$ are the hyperparameters to balance the ranking loss. We set both to 0.5.

Our constrained ranking loss can measure the similarity between the embedded labels and features. We simultaneously consider all the ranked labels, because minimizing the above loss function is equivalent to maximizing the predicted value of all positive label attribute pairs while minimizing the predicted value of all negative label attribute pairs, which implicitly forces the label co-occurrence information to be retained. Moreover, the positive and negative labels will be gathered together, respectively, in the latent space. Therefore, the local label dependency can be implicitly exploited. If other losses, such as the common ranking loss, cross-entropy loss, or the mean square error loss, are considered, the local label correlation cannot be modeled and exploited.

The loss function is the sum of the classification loss and the constrained ranking loss. It is shown as follows:

$$\mathrm{Loss} = \alpha \, \mathrm{Loss}_{cls} + \beta \, \mathrm{Loss}_r \tag{10}$$

where $\alpha$ and $\beta$ are hyperparameters; we simply set both of them to 1.

V. EXPERIMENTS

We have implemented the proposed solution and evaluated it on two popular benchmark datasets: PASCAL VOC 2007 [33], which contains 20 different object labels, and MS COCO 2014 [32], which contains 80 different object labels. We also compare our results with those reported in previous research papers. In the following, we present the implementation of the proposed solution and the model training process, evaluation metrics, evaluation results, and result visualizations.

A. Model Implementation and Training

The proposed solution is implemented using PyTorch (available at pytorch.org). As shown in Fig. 2, the feature extractor of our model is implemented using ResNet-101 [9], pretrained using the ImageNet dataset [62]. Specifically, we removed the last two layers (i.e., global average pooling and the 1000-way fully connected classification layer, respectively), and added instead: 1) a new global max-pooling layer and 2) C-way fully connected layers, where C denotes the number of object categories, which is 20 and 80 in the PASCAL VOC and MS COCO datasets, respectively.

We set the size of each input image to 448 × 448. Then, after the ResNet-101 pipeline, the extracted feature maps (i.e., before the pooling layer) have size 14 × 14 × 2048. These features are fed to the feature/label co-projector branch, which uses a small ConvNet to embed these features to a latent vector space. Table I lists the detailed layers of this neural net for image feature projection.

Regarding label projection, we use two fully connected layers to embed one-hot-encoded label vectors to the same latent vector space as the image features, as shown in Table II. Then, the proposed ranking loss is used to model the correlation between embedded labels and image features. Finally, we obtain the final prediction results by aggregating the outputs


of the direct label recognition (i.e., ResNet-101) and feature/label co-projector branches, as shown in Fig. 2. The specific aggregation in our implementation is a simple element-wise sum.

TABLE II
Architecture of the Label Projection Network in Our Implementation

At test time, the feature/label co-projection module no longer applies, since the labels of a test image are unknown. Hence, we simply remove the network layers that project image features and labels to a common latent space. Note that at test time, compared with our base model, i.e., ResNet-101, the proposed solution only introduces 2048 × C additional weights. Hence, the proposed model is highly efficient; yet it achieves state-of-the-art performance, as shown in later sections.

Model Training: The proposed deep neural network is trained end-to-end with the training set of the data and no additional information. To demonstrate the robustness of the proposed solution, we used simple training techniques without much hyperparameter tuning. Specifically, during training, we simply resize each raw input image from the dataset to 448 × 448, with no other data augmentation. The training steps are performed by an SGD optimizer, with momentum 0.9 and weight decay 1e-4. We used different learning rates for different network layers. In particular, we set the learning rate of the feature extraction layers (i.e., ResNet-101) to 0.001, and the learning rate of the other layers to 0.01. The reason is that the ResNet-101 layers have already been pretrained on ImageNet data, and using a small learning rate is necessary for transfer learning.

B. Evaluation Metrics

Following a recent paper [2], we evaluate the proposed solution using seven metrics for multilabel classification performance: mean average precision (mAP), macro/micro precision (P-C/P-O), macro/micro recall (R-C/R-O), and macro/micro F measure (F1-C/F1-O). Specifically, mAP is the mean value of average precision [63] for each class, where average precision is calculated by the average fraction of relevant labels ranked higher than one other relevant label. Macro precision (denoted as P-C) is evaluated by averaging per-class precision measurements. Micro precision (P-O) is an overall measure that counts true predictions for all images over all classes. Formally, they are defined as follows:

$$P_O = \frac{\sum_{i}^{C} TP_i}{\sum_{i}^{C} (TP_i + FP_i)}, \qquad P_C = \frac{1}{C} \sum_{i}^{C} \frac{TP_i^{C}}{TP_i^{C} + FP_i^{C}} \tag{11}$$

where TP is the number of true positives and FP the number of false positives for each class, respectively. The recall and F1-score metrics are defined as

$$R_O = \frac{\sum_{i}^{C} TP_i}{\sum_{i}^{C} (TP_i + FN_i)}, \qquad R_C = \frac{1}{C} \sum_{i}^{C} \frac{TP_i^{C}}{TP_i^{C} + FN_i^{C}} \tag{12}$$

$$F1_O = \frac{2 \cdot (P_O \cdot R_O)}{P_O + R_O}, \qquad F1_C = \frac{2 \cdot (P_C \cdot R_C)}{P_C + R_C} \tag{13}$$

where FN denotes the number of false negatives for each class. The F score can be viewed as a weighted harmonic mean of the precision and recall. For F1, the precision and recall have the same weight. All seven evaluation metrics used in the experiments range between 0 and 1, with higher values indicating better performance.

C. Evaluation Results

We compare the proposed solution against previous multilabel image classification methods on the MS COCO 2014 [32] and PASCAL VOC 2007 [33] benchmark datasets. The results are shown in Tables III (for COCO) and IV (for VOC), respectively.

Specifically, on the MS COCO dataset, we compare our solution against reported results (directly from their respective papers) for WARP [64], CNN-RNN [4], RDAR [31], RARL [3], and SRN [2]. Our results in Table III also include the performance of the base model of our solution, i.e., ResNet-101. Note that some methods require the knowledge of the number of labels associated with an image; consequently, they cannot predict the set of all labels for a given image. Therefore, we also include the results for top-3 labels.

Clearly, the proposed solution outperforms its base model ResNet-101 on all evaluation metrics. We observe that the base model is, in fact, a strong baseline, which, by itself, outperforms several earlier approaches. More importantly, with two exceptions (P-C for all labels and R-O for top-3 labels), the proposed solution achieves the best performance on all evaluation metrics, usually with significant performance gaps. Notably, our mAP is 81.1%, compared to the previous best 77.1% obtained by a recent work, SRN [2]; similarly, our F1 scores are also several percentage points higher than the best previous results. Hence, these evaluation results firmly establish the proposed model as the new state-of-the-art for multilabel classification on MS COCO.

Another evaluation dataset used in the experiments, i.e., PASCAL VOC 2007, contains 9963 images of 20 different object categories, split into a trainval set of 5011 images and a test set of 4952 images. On this dataset, we compare the result of our solution against the reported results (again directly from their respective papers) of the following methods: CNN-SVM [65], CNN-RNN [4], VeryDeep [66], RLSD [67], HCP [68], RDAR [31], and RARL [3]. The results, as shown in Table IV, list the average precision for each of the 20 classes, as well as the mAP score.

In terms of overall mAP, our method significantly outperforms the previous best result obtained by RARL [3]. Note that RARL relies on a complicated network architecture involving a ConvNet, an RNN, and attention, whereas the proposed method has a much simpler architecture,


TABLE III
Comparison Results of Average Precision and mAP of Other Methods and Our Method on the MS COCO Dataset. The Red Font Is Used to Mark the Best Results

TABLE IV
Comparison of Average Precision and mAP of Other Methods and Our Method on the VOC Dataset. The Best Evaluation Value Is Highlighted in Red Font

Fig. 3. CAM visualization results. Images and their activation attention maps from Fx with classification layer and fconv are shown in the first and second
rows, respectively. A label in blue font indicates that it is not associated with the corresponding image. The features of fconv are learned using the proposed
constrained ranking loss, which captures correlations between image features and labels. From the results, clearly the attention map from fconv has a greater
response to people and skis than to the skateboard in the skiing scene. Meanwhile, in the baseball scene, features that are not affected by our constrained ranking
loss have a greater response in the human area. On the other hand, the features affected by the constrained ranking loss have clear responses to the bat, glove
and the ball in this scene, even though the response to the human is smaller.

and far fewer weights at test time. Finally, for specific classes, our method achieves the highest average precision for the majority of the classes; for the remaining classes, the performance of our model is also highly competitive. Therefore, we achieve a new state-of-the-art on PASCAL VOC 2007.
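
The micro- and macro-averaged precision, recall, and F1 of (11)-(13) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' evaluation code: the function names `safe_div` and `multilabel_prf` are hypothetical, the superscripted per-class counts in (11) and (12) are assumed to coincide with the plain per-class counts TP_i, FP_i, and FN_i, undefined ratios are reported as 0 by assumption, and the counts in the example are made up.

```python
from typing import Dict, Sequence


def safe_div(num: float, den: float) -> float:
    # Assumption (not stated in the paper): an undefined ratio is reported as 0.
    return num / den if den > 0 else 0.0


def multilabel_prf(tp: Sequence[int], fp: Sequence[int], fn: Sequence[int]) -> Dict[str, float]:
    """Micro (O) and macro (C) precision, recall, and F1 from per-class counts."""
    num_classes = len(tp)
    # Micro averages: pool the counts over all classes, then take the ratio.
    p_o = safe_div(sum(tp), sum(t + f for t, f in zip(tp, fp)))
    r_o = safe_div(sum(tp), sum(t + f for t, f in zip(tp, fn)))
    # Macro averages: take the ratio per class, then average over the C classes.
    p_c = sum(safe_div(t, t + f) for t, f in zip(tp, fp)) / num_classes
    r_c = sum(safe_div(t, t + f) for t, f in zip(tp, fn)) / num_classes
    # F1 combines the matching precision/recall pair as in (13).
    f1_o = safe_div(2 * p_o * r_o, p_o + r_o)
    f1_c = safe_div(2 * p_c * r_c, p_c + r_c)
    return {"P-O": p_o, "R-O": r_o, "F1-O": f1_o, "P-C": p_c, "R-C": r_c, "F1-C": f1_c}


# Toy counts for two classes (made-up numbers, not results from the paper).
metrics = multilabel_prf(tp=[8, 1], fp=[2, 1], fn=[2, 3])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the frequent class dominates the micro averages, while the macro averages weight both classes equally, which is why the two families of metrics are reported side by side.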

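
For the mAP metric, one common way to compute per-class average precision is the non-interpolated ranking form sketched below: images are sorted by the predicted score for a class, and the precision is averaged over the ranks at which positive images occur. This is an assumption about the exact variant; benchmark toolkits may use a different one (for example, PASCAL VOC 2007 traditionally uses an 11-point interpolated AP). The function names and the toy scores are hypothetical.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP for one class: mean precision at the rank of each positive."""
    # Rank the images by decreasing score for this class.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class without any positive image contributes an AP of 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_rows: Sequence[Sequence[float]],
                           label_rows: Sequence[Sequence[int]]) -> float:
    """mAP over classes; rows are images and columns are classes."""
    num_classes = len(score_rows[0])
    aps = [
        average_precision([row[c] for row in score_rows], [row[c] for row in label_rows])
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


# Four images, two classes (made-up scores and ground-truth labels).
toy_scores = [[0.9, 0.1], [0.8, 0.9], [0.7, 0.2], [0.6, 0.8]]
toy_labels = [[1, 0], [0, 1], [1, 0], [0, 1]]
print(mean_average_precision(toy_scores, toy_labels))
```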
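
The training setup described under Model Training (SGD with momentum 0.9, weight decay 1e-4, learning rate 0.001 for the pretrained ResNet-101 layers and 0.01 for the remaining layers) amounts to an optimizer with two parameter groups. The dependency-free sketch below illustrates the idea on scalar parameters. The update rule shown (weight decay added to the gradient, then a momentum buffer) is one common convention and is an assumption here, as are the names `ParamGroup` and `sgd_step`; a deep learning framework would normally provide this optimizer.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ParamGroup:
    name: str
    lr: float
    params: List[float]
    velocity: List[float] = field(default_factory=list)


def sgd_step(groups: List[ParamGroup], grads: Dict[str, List[float]],
             momentum: float = 0.9, weight_decay: float = 1e-4) -> None:
    """One SGD step with momentum and L2 weight decay, using a separate lr per group."""
    for group in groups:
        if not group.velocity:
            group.velocity = [0.0] * len(group.params)
        for i, grad in enumerate(grads[group.name]):
            # Weight decay is folded into the gradient (assumed convention).
            grad = grad + weight_decay * group.params[i]
            group.velocity[i] = momentum * group.velocity[i] + grad
            group.params[i] -= group.lr * group.velocity[i]


# Pretrained backbone: small learning rate; newly added layers: 10x larger.
backbone = ParamGroup(name="backbone", lr=0.001, params=[1.0])
head = ParamGroup(name="head", lr=0.01, params=[1.0])
sgd_step([backbone, head], grads={"backbone": [0.5], "head": [0.5]})
print(backbone.params, head.params)
```

With identical gradients, the newly added layers move ten times further per step than the pretrained backbone, which is the intended effect of the two learning rates.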

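
The attention maps in Fig. 3 are produced with CAM [69], which, for a given class, forms a weighted sum of the final convolutional feature maps using that class's classifier weights. The sketch below shows this computation on tiny nested lists. It is an illustration under stated assumptions, not the authors' visualization code: the function names are hypothetical, the min-max normalization is a common display choice rather than something the paper specifies, and real feature maps would be 14 × 14 × 2048 rather than 2 × 2 × 2.

```python
from typing import List

Matrix = List[List[float]]


def class_activation_map(feature_maps: List[Matrix], class_weights: List[float]) -> Matrix:
    """CAM for one class: sum over channels k of w_k * f_k(x, y)."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def normalize(cam: Matrix) -> Matrix:
    """Min-max scale to [0, 1] for display (assumed post-processing step)."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Two 2 x 2 feature maps and the weights of one class (made-up values).
toy_maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
toy_weights = [0.5, 1.0]
cam = class_activation_map(toy_maps, toy_weights)
print(normalize(cam))
```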

TABLE V
Experimental Results About the Effect of the Extra Branch for Our Model. TBA and TBC Denote That the Results of the Additional Branch Are Directly Added to the Classification Results of the Main Classification Branch, and That the Features of the Two Branches Are Cascaded to Each Other, Respectively. The ResNet Baseline Comes From SRN

TABLE VI
Experimental Results About the Effect of the Constrained Ranking Loss. MSE Denotes Mean Square Error Loss. CRL Is Our Proposed Constrained Ranking Loss

D. Ablation Experiments

To evaluate our model, we decompose our deep neural network and validate the effect of the image/label co-projector on the COCO dataset. Ablation for backbone: In our experiments, we use ResNet-101 as the backbone of our model, following SRN. Tables III and IV show that we achieve competitive performance compared with other methods.

Further, in order to rule out the impact of the additional branch on our model, we discard the feature map part of the model and leave the rest to complete the experiment. We consider the case where the results of the additional branch are directly added to the classification results of the main classification branch, or the features of the two branches are cascaded to each other. The corresponding results are shown in Table V. The ablation experiments were done on the COCO dataset.

In addition, we also compared the proposed constrained ranking loss with the mean square error loss. The experimental results, shown in Table VI, indicate that the constrained ranking loss works well in our method.

E. Visualizations

CAM Visualizations: To provide further insights into the proposed solution, we visualize the features extracted by the feature extractor of our method using CAM [69], which shows the attention map of these features. Meanwhile, to demonstrate that our feature/label co-projection module correctly learns the correlations between image features and different labels, we also visualize the features from the fconv (Fx) layer in the feature/label co-projector. Fig. 3 shows the visualization results on the MS COCO dataset. From these results, we observe that the label correlation is well represented in the features fconv (Fx) with the help of label embedding and our constrained ranking loss. The results on PASCAL VOC lead to similar conclusions, and are omitted for brevity.

VI. CONCLUSION

We presented a simple and intuitive solution to the fundamental problem of multilabel image recognition, which combines direct label recognition using a base model (ResNet-101 in our implementation) and a feature/label co-projection module that explicitly models the context of the image. Our implementation of the proposed method achieves state-of-the-art performance on two popular benchmark datasets: 1) MS COCO and 2) PASCAL VOC, while being highly efficient at test time, with only 2048 × C additional weights compared to the base model, where C is the number of possible classes. As future work, we plan to further improve the proposed solution using effective techniques such as visual attention. Meanwhile, we intend to investigate multilabel classification in other contexts, e.g., with abstract attribute labels, and for other types of challenging data such as video.

REFERENCES

[1] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain, “Sparse local embeddings for extreme multi-label classification,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 730–738.
[2] F. Zhu, H. Li, W. Ouyang, N. Yu, and X. Wang, “Learning spatial regularization with image-level supervisions for multi-label image classification,” in Proc. CVPR, 2017, pp. 2027–2036.
[3] T. Chen, Z. Wang, G. Li, and L. Lin, “Recurrent attentional reinforcement learning for multi-label image recognition,” 2017. [Online]. Available: arXiv:1712.07465.
[4] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, “CNN-RNN: A unified framework for multi-label image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2285–2294.
[5] Y. Li, C. Huang, C. C. Loy, and X. Tang, “Human attribute recognition by deep hierarchical contexts,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 684–700.
[6] Z. Yan, W. Liu, S. Wen, and Y. Yang, “Multi-label image classification by feature attention network,” IEEE Access, vol. 7, pp. 98005–98013, 2019.
[7] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,” in Data Mining and Knowledge Discovery Handbook. Boston, MA, USA: Springer, 2009, pp. 667–685.
[8] Y. Wei et al., “HCP: A flexible CNN framework for multi-label image classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1901–1907, Sep. 2016.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[10] B. Hariharan, L. Zelnik-Manor, M. Varma, and S. V. N. Vishwanathan, “Large scale max-margin multi-label classification with priors,” in Proc. 27th Int. Conf. Mach. Learn. (ICML), 2010, pp. 423–430.
[11] G. Tsoumakas, A. Dimou, E. Spyromitros, V. Mezaris, I. Kompatsiaris, and I. Vlahavas, “Correlation-based pruning of stacked binary relevance models for multi-label learning,” in Proc. 1st Int. Workshop Learn. Multi Label Data, 2009, pp. 101–116.
[12] X. Shu, J. Tang, G.-J. Qi, Z. Li, Y.-G. Jiang, and S. Yan, “Image classification with tailored fine-grained dictionaries,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 2, pp. 454–467, Feb. 2018.
[13] J. Tang et al., “Tri-clustered tensor completion for social-aware image tag refinement,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1662–1674, Aug. 2017.
[14] X. Zeng, S. Wen, Z. Zeng, and T. Huang, “Design of memristor-based image convolution calculation in convolutional neural network,” Neural Comput. Appl., vol. 30, no. 2, pp. 502–508, 2018.
[15] S. Wen, R. Hu, Y. Yang, T. Huang, Z. Zeng, and Y. Song, “Memristor-based echo state network with online least mean square,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 49, no. 9, pp. 1787–1796, Sep. 2019.
[16] S. Wen, X. Xie, Z. Yan, T. Huang, and Z. Zeng, “General memristor with applications in multilayer neural networks,” Neural Netw., vol. 103, pp. 142–148, Jul. 2018.
[17] S. Wen et al., “Memristive LSTM networks for sentiment analysis,” IEEE Trans. Syst., Man, Cybern., Syst., to be published.
[18] Y. Cao, S. Wang, Z. Guo, T. Huang, and S. Wen, “Synchronization of memristive neural networks with leakage delay and parameters mismatch via event-triggered control,” Neural Netw., vol. 119, pp. 178–189, Nov. 2019.


[19] S. Wang, Y. Cao, T. Huang, Y. Chen, P. Li, and S. Wen, “Sliding mode control of neural networks via continuous or periodic sampling event-triggering algorithm,” Neural Netw., vol. 121, pp. 1–11, Jan. 2019.
[20] S. Wen et al., “Memristor-based design of sparse compact convolutional neural networks,” IEEE Trans. Netw. Sci. Eng., to be published.
[21] S. Wen, M. Z. Q. Chen, X. Yu, Z. Zeng, and T. Huang, “Fuzzy control for uncertain vehicle active suspension systems via dynamic sliding-mode approach,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 47, no. 1, pp. 24–32, Jan. 2017.
[22] S. Wen, T. Huang, X. Yu, M. Z. Chen, and Z. Zeng, “Aperiodic sampled-data sliding-mode control of fuzzy systems with communication delays via the event-triggered method,” IEEE Trans. Fuzzy Syst., vol. 24, no. 5, pp. 1048–1057, Oct. 2016.
[23] M. Dong, S. Wen, Z. Zeng, Z. Yan, and T. Huang, “Sparse fully convolutional network for face labeling,” Neurocomputing, vol. 331, pp. 465–472, Feb. 2019.
[24] S. Wang, Y. Cao, T. Huang, and S. Wen, “Passivity and passification of memristive neural networks with leakage term and time-varying delays,” Appl. Math. Comput., vol. 361, pp. 294–310, Nov. 2019.
[25] G. Ren, Y. Cao, S. Wen, Z. Zeng, and T. Huang, “A modified Elman neural network with a new learning rate,” Neurocomputing, vol. 286, pp. 11–18, Apr. 2018.
[26] S. Wen, W. Liu, Y. Yang, Z. Zeng, and T. Huang, “Generating realistic videos from keyframes with concatenated GANs,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 8, pp. 2337–2348, Aug. 2019.
[27] Y. Cao, Y. Cao, S. Wen, Z. Zeng, and T. Huang, “Passivity analysis of reaction–diffusion memristor-based neural networks with and without time-varying delays,” Neural Netw., vol. 109, pp. 159–167, Jan. 2019.
[28] X. Xie, S. Wen, Z. Zeng, and T. Huang, “Memristor-based circuit implementation of pulse-coupled neural network with dynamical threshold generator,” Neurocomputing, vol. 284, pp. 10–16, Apr. 2018.
[29] Z. Li, M. Dong, S. Wen, X. Hu, P. Zhou, and Z. Zeng, “CLU-CNNs: Object detection for medical images,” Neurocomputing, vol. 350, pp. 53–59, Mar. 2019.
[30] C.-K. Yeh, W.-C. Wu, W.-J. Ko, and Y.-C. F. Wang, “Learning deep latent space for multi-label classification,” in Proc. AAAI, 2017, pp. 2838–2844.
[31] Z. Wang, T. Chen, G. Li, R. Xu, and L. Lin, “Multi-label image recognition by recurrently discovering attentional regions,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 464–472.
[32] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[33] M. Everingham, L. Gool, C. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.
[34] J. Read, L. Martino, and D. Luengo, “Efficient Monte Carlo methods for multi-dimensional learning with classifier chains,” Pattern Recognit., vol. 47, no. 3, pp. 1535–1546, 2014.
[35] J. Read, L. Martino, P. M. Olmos, and D. Luengo, “Scalable multi-output label prediction: From classifier chains to classifier trellises,” Pattern Recognit., vol. 48, no. 6, pp. 2096–2109, 2015.
[36] J. Fürnkranz, E. Hüllermeier, E. L. Mencía, and K. Brinker, “Multilabel classification via calibrated label ranking,” Mach. Learn., vol. 73, no. 2, pp. 133–153, 2008.
[37] A. Elisseeff and J. Weston, “A kernel method for multi-labelled classification,” in Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 681–687.
[38] N. Ghamrawi and A. McCallum, “Collective multi-label classification,” in Proc. 14th ACM Int. Conf. Inf. Knowl. Manag., 2005, pp. 195–200.
[39] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang, “Correlative multi-label video annotation,” in Proc. 15th ACM Int. Conf. Multimedia, 2007, pp. 17–26.
[40] R. Yan, J. Tesic, and J. R. Smith, “Model-shared subspace boosting for multi-label classification,” in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., 2007, pp. 834–843.
[41] S. Ji, L. Tang, S. Yu, and J. Ye, “Extracting shared subspace for multi-label classification,” in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., 2008, pp. 381–389.
[42] S. Godbole and S. Sarawagi, “Discriminative methods for multi-labeled classification,” in Proc. Pac.–Asia Conf. Knowl. Disc. Data Min., 2004, pp. 22–30.
[43] W. Cheng and E. Hüllermeier, “Combining instance-based learning and logistic regression for multilabel classification,” Mach. Learn., vol. 76, nos. 2–3, pp. 211–225, 2009.
[44] Y.-L. Chen, C.-L. Hsu, and S.-C. Chou, “Constructing a multi-valued and multi-labeled decision tree,” Expert Syst. Appl., vol. 25, no. 2, pp. 199–209, 2003.
[45] R. E. Schapire and Y. Singer, “BoosTexter: A boosting-based system for text categorization,” Mach. Learn., vol. 39, nos. 2–3, pp. 135–168, 2000.
[46] M.-L. Zhang and Z.-H. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern Recognit., vol. 40, no. 7, pp. 2038–2048, 2007.
[47] M.-L. Zhang and Z.-H. Zhou, “Multilabel neural networks with applications to functional genomics and text categorization,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1338–1351, Aug. 2006.
[48] M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, and M. Cord, “Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings,” 2018. [Online]. Available: arXiv:1804.11146.
[49] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “VSE++: Improved visual-semantic embeddings,” 2017. [Online]. Available: arXiv:1707.05612.
[50] A. Salvador et al., “Learning cross-modal embeddings for cooking recipes and food images,” Training, vol. 720, p. 2, Jul. 2017.
[51] F. Tai and H.-T. Lin, “Multilabel classification with principal label space transformation,” Neural Comput., vol. 24, no. 9, pp. 2508–2542, Sep. 2012.
[52] D. J. Hsu, S. M. Kakade, J. Langford, and T. Zhang, “Multi-label prediction via compressed sensing,” in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 772–780.
[53] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun, “Semantic regularization for recurrent image annotation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2872–2880.
[54] Y. Liu, L. Sheng, J. Shao, J. Yan, S. Xiang, and C. Pan, “Multi-label image classification via knowledge distillation from weakly-supervised detection,” in Proc. ACM Multimedia Conf. Multimedia Conf., 2018, pp. 700–708.
[55] P. Xie, R. Salakhutdinov, L. Mou, and E. P. Xing, “Deep determinantal point process for large-scale multi-label classification,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 473–482.
[56] Q. Li, B. Xie, J. You, W. Bian, and D. Tao, “Correlated logistic model with elastic net regularization for multilabel image classification,” IEEE Trans. Image Process., vol. 25, no. 8, pp. 3801–3813, Aug. 2016.
[57] Q. Li, M. Qiao, W. Bian, and D. Tao, “Conditional graphical lasso for multi-label image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2977–2986.
[58] Y. Li, Y. Song, and J. Luo, “Improving pairwise ranking for multi-label image classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 1837–1845.
[59] S.-J. Huang and Z.-H. Zhou, “Multi-label learning by exploiting label correlations locally,” in Proc. AAAI, 2012, pp. 949–955.
[60] X. Li and Y. Guo, “Multi-label classification with feature-aware non-linear label space transformation,” in Proc. IJCAI, 2015, pp. 3635–3642.
[61] C.-S. Ferng and H.-T. Lin, “Multilabel classification using error-correcting codes of hard or soft bits,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 11, pp. 1888–1900, Nov. 2013.
[62] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2009, pp. 248–255.
[63] X.-Z. Wu and Z.-H. Zhou, “A unified view of multi-label performance measures,” 2016. [Online]. Available: arXiv:1609.00288.
[64] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, “Deep convolutional ranking for multilabel image annotation,” 2013. [Online]. Available: arXiv:1312.4894.
[65] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2014, pp. 806–813.
[66] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014. [Online]. Available: arXiv:1409.1556.
[67] J. Zhang, Q. Wu, C. Shen, J. Zhang, and J. Lu, “Multi-label image classification with regional latent semantic dependencies,” IEEE Trans. Multimedia, vol. 20, no. 10, pp. 2801–2813, Oct. 2018.
[68] H. Yang, J. T. Zhou, Y. Zhang, B.-B. Gao, J. Wu, and J. Cai, “Exploit bounding box annotations for multi-label object recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 280–288.
[69] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” Proc. CVPR, 2016, pp. 2921–2929.


Shiping Wen received the M.Eng. degree in control science and engineering from the School of Automation, Wuhan University of Technology, Wuhan, China, in 2010, and the Ph.D. degree in control science and engineering from the School of Automation, Huazhong University of Science and Technology, Wuhan, in 2013.
He is currently a Professor with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. His current research interests include memristor-based circuits and systems, neural networks, and deep learning.

Weiwei Liu received the B.Eng. degree from the Information Engineering School, Henan University of Science and Technology, Wuhan, China, in 2017. He is currently pursuing the M.Eng. degree in control engineering with the Huazhong University of Science and Technology, Wuhan.
His research interests include computer vision, multilabel classification, generative adversarial networks, and deep learning.

Yin Yang received the B.Eng. degree in computer science from the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China, in 2004, and the Ph.D. degree in computer science from the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, in 2009.
He is currently an Assistant Professor with the College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar. He has published extensively in top venues on differentially private data publication and analysis, and on query authentication in outsourced databases. He is currently working actively on cloud-based big-data analytics, with a focus on fast streaming data. His main research interests include cloud computing, database security and privacy, and query optimization.

Pan Zhou (Member, IEEE) received the B.S. degree (Advanced Class) and the M.S. degree in electronic engineering from the School of EIC, Huazhong University of Science and Technology (HUST), Wuhan, China, in 2006, and the Ph.D. degree from the School of Electrical and Computer Engineering, Georgia Institute of Technology (Georgia Tech), Atlanta, GA, USA, in 2011.
He is currently an Associate Professor with the School of Cyber Science and Engineering, HUST. He was a Senior Technical Member with Oracle Inc., Boston, MA, USA, from 2011 to 2013. He was an Associate Professor with the School of Electronic Information and Communications, HUST, from 2013 to 2019. His current research interests include security and privacy, machine learning and big data analytics, and information networks.

Zhenyuan Guo received the B.S. degree in mathematics and applied mathematics and the Ph.D. degree in applied mathematics from the College of Mathematics and Econometrics, Hunan University, Changsha, China, in 2004 and 2009, respectively.
He was a joint Ph.D. student visiting the Department of Applied Mathematics, University of Western Ontario, London, ON, Canada, from 2008 to 2009. From 2013 to 2015, he was a Post-Doctoral Research Fellow with the Department of Mechanical and Automation Engineering, Chinese University of Hong Kong, Hong Kong. He is currently a Professor with the College of Mathematics and Econometrics, Hunan University. His current research interests include the theory of functional differential equations and differential equations with discontinuous right-hand sides, and their applications to dynamics of neural networks, memristive systems, and control systems.

Zheng Yan received the B.Eng. degree in automation and computer-aided engineering and the Ph.D. degree in mechanical and automation engineering from the Chinese University of Hong Kong, Hong Kong, in 2010 and 2014, respectively.
Dr. Yan was a recipient of the Graduate Research Grant from the IEEE Computational Intelligence Society in 2014.

Yiran Chen received the B.S. and M.S. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1998 and 2001, respectively, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA, in 2005.
After five years in industry, he joined the University of Pittsburgh, Pittsburgh, PA, USA, in 2010 as an Assistant Professor and was then promoted to Associate Professor (with tenure) in 2014, holding the title of Bicentennial Alumni Faculty Fellow. He is currently a Tenured Professor with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA, where he is serving as the Co-Director of the Duke Center for Evolutionary Intelligence, focusing on the research of new memory and storage systems, machine learning and neuromorphic computing, and mobile computing systems. He has published one book and more than 300 technical publications and has been granted 93 U.S. patents.
Dr. Chen received 6 Best Paper Awards and 14 Best Paper Nominations from international conferences. He is a recipient of the NSF CAREER Award and the ACM SIGDA Outstanding New Faculty Award. He is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, IEEE DESIGN & TEST, IEEE EMBEDDED SYSTEMS LETTERS, ACM Journal on Emerging Technologies in Computing Systems, and ACM Transactions on Cyber-Physical Systems, and served on the technical and organization committees of more than 40 international conferences.

Tingwen Huang (Fellow, IEEE) received the B.S. degree from Southwest Normal University (currently, Southwest University), Chongqing, China, in 1990, the M.S. degree from Sichuan University, Chengdu, China, in 1993, and the Ph.D. degree from Texas A&M University, College Station, TX, USA, in 2002.
After graduating from Texas A&M University, he worked there as a Visiting Assistant Professor. He then joined Texas A&M University at Qatar (TAMUQ), Doha, Qatar, as an Assistant Professor in August 2003, and was promoted to Professor in 2013. He is a Professor with TAMUQ. He has published more than 300 peer-reviewed journal papers, including more than 100 papers in IEEE Transactions. His research interests include neural-network-based computational intelligence, distributed control and optimization, nonlinear dynamics, and applications in smart grids.
Prof. Huang currently serves as an Associate Editor for four journals, including the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, IEEE TRANSACTIONS ON CYBERNETICS, and Cognitive Computation.
