Centralized Feature Pyramid For Object Detection
Abstract— The visual feature pyramid has shown its superiority in both effectiveness and efficiency in a variety of applications. However, current methods overly focus on inter-layer feature interactions while disregarding the importance of intra-layer feature regulation. Despite some attempts to learn a compact intra-layer feature representation with the use of attention mechanisms or vision transformers, they overlook the crucial corner regions that are essential for dense prediction tasks. To address this problem, we propose a Centralized Feature Pyramid (CFP) network for object detection, which is based on a globally explicit centralized feature regulation. Specifically, we first propose a spatial explicit visual center scheme, where a lightweight MLP is used to capture the globally long-range dependencies, and a parallel learnable visual center mechanism is used to capture the local corner regions of the input images. Based on this, we then propose a globally centralized regulation for the commonly-used feature pyramid in a top-down fashion, where the explicit visual center information obtained from the deepest intra-layer feature is used to regulate frontal shallow features. Compared to the existing feature pyramids, CFP not only has the ability to capture the global long-range dependencies but also efficiently obtains an all-round yet discriminative feature representation. Experimental results on the challenging MS-COCO validate that our proposed CFP can achieve consistent performance gains on the state-of-the-art YOLOv5 and YOLOX object detection baselines.

Index Terms— Feature pyramid, visual center, object detection, attention learning mechanism, long-range dependencies.

I. INTRODUCTION

…backbone followed by a two-stage (e.g., Fast/Faster R-CNN [4], [5]) or single-stage (e.g., SSD [6] and YOLO [7]) framework. However, due to the uncertainty of object sizes, a single feature scale cannot meet the requirements of high-accuracy recognition performance [8], [9]. To this end, methods based on the in-network feature pyramid (e.g., SSD [6] and FFP [10]) are proposed and achieve satisfactory results effectively and efficiently. The unified principle behind these methods is to assign a region of interest to each object of a different size with the appropriate contextual information, and to enable these objects to be recognized in different feature layers.

The interaction between pixels or objects is crucial in image recognition tasks [11]. Effective feature interactions allow image features to obtain a wider and richer representation, enabling object detection models to learn the implicit relationship (such as favorable co-occurrence features [12], [13]) between pixels or objects. This has been shown to improve visual recognition performance [14], [15], [16], [17], [18], [19], [20]. For example, FPN [19] proposes a top-down inter-layer feature interaction mechanism, which enables shallow features to obtain the global contextual information and semantic representations of deep features. NAS-FPN [15] tries to learn the network structure of the feature pyramid part via a network architecture search strategy and obtains…
…global feature representations. To address this, single-stage detectors [6], [7], [44], [45] directly perform prediction and region classification by generating bounding boxes. These single-stage methods have a global perspective in the design of feature extraction and use the backbone network to extract feature maps from the entire image to predict each bounding box. In this work, we also adopt single-stage object detectors (i.e., YOLOv5 [37] and YOLOX [38]) as our baseline models. Our goal is to enhance the representation of the feature pyramid used in these detectors.

…a more lightweight MLP. The experimental results validate its effectiveness and efficiency.

III. OUR APPROACH

In this section, we introduce the implementation details of the proposed centralized feature pyramid (CFP). We first give an overview of the CFP architecture in Section III-A. Then, we present the implementation details of the explicit visual center in Section III-B. Finally, we show how to implement the explicit visual center on an image feature pyramid and propose our global centralized regulation in Section III-C.
Fig. 2. An illustration of the overall architecture, which mainly consists of four components: the input image, a backbone network for feature extraction, the centralized feature pyramid, which is based on a commonly-used vision feature pyramid following [38], and the object detection head network, which includes a classification (i.e., Cls.) loss and a regression (i.e., Reg.) loss. C denotes the class size of the used dataset. Our contribution lies in that we propose an intra-layer feature regulation method in a feature pyramid and a top-down global centralized regulation.
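The top-down regulation summarized in this caption can be illustrated with a short, hedged prototype. The PyTorch sketch below is not the paper's Section III-C implementation: it only assumes that the EVC-regulated deepest feature is upsampled and fused into each shallower level, and the fusion operator (concatenation followed by a 1 × 1 convolution), the channel layout, and the channel-preserving `evc` placeholder are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalCentralizedRegulation(nn.Module):
    """Hedged sketch of a top-down centralized regulation.

    Assumptions (not taken from the paper): the regulated deepest feature is
    upsampled to every shallower resolution and fused by concat + 1x1 conv;
    `evc` stands in for the explicit visual center block of Section III-B and
    is assumed to preserve the channel count of its input.
    """

    def __init__(self, channels, evc):
        super().__init__()
        self.evc = evc  # placeholder for the EVC block applied to the deepest level
        self.fuse = nn.ModuleList(
            [nn.Conv2d(c + channels[-1], c, kernel_size=1) for c in channels[:-1]]
        )

    def forward(self, feats):
        # feats: [X2, X3, X4] ordered from shallow to deep, as in a standard pyramid
        top = self.evc(feats[-1])  # regulate the deepest intra-layer feature
        outs = []
        for f, fuse in zip(feats[:-1], self.fuse):
            up = F.interpolate(top, size=f.shape[-2:], mode="nearest")
            outs.append(fuse(torch.cat([f, up], dim=1)))  # regulate shallower levels
        outs.append(top)
        return outs
```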
B. Explicit Visual Center (EVC)

As illustrated in Figure 3, our proposed EVC mainly consists of two blocks connected in parallel, where a lightweight MLP is used to capture the global long-range dependencies (i.e., the global information) of the top-level features X4. At the same time, to preserve the local corner regions (i.e., the local information), we propose a learnable vision center mechanism implemented on X4 to aggregate the intra-layer local region features. The resulting feature maps of these two blocks are concatenated together along the channel dimension as the output of EVC for the downstream recognition. In addition, instead of mapping the original features directly as in [37], we add a Stem block between X4 and EVC for feature smoothing. The Stem block consists of a 7 × 7 convolution with an output channel size of 256, followed by a batch normalization layer and an activation function layer. The above processes can be formulated as:

X = Cat(MLP(Xin); LVC(Xin)), (1)

where X is the output of EVC, and Cat(·) denotes feature map concatenation along the channel dimension. MLP(Xin) and LVC(Xin) denote the output features of the used lightweight MLP and the learnable visual center mechanism, respectively. Xin is the output of the Stem block, which is obtained by:

Xin = σ(BN(Conv7×7(X4))), (2)

where Conv7×7(·) denotes a 7 × 7 convolution with stride 1, and the channel size is set to 256 in our work following [19]. BN(·) denotes a batch normalization layer, and σ(·) denotes the ReLU activation function.

1) Lightweight MLP: The proposed lightweight MLP mainly consists of two modules: a depthwise convolution-based module [58] and a channel MLP-based block, where the input of the MLP-based module is the output of the depthwise convolution-based [51] module. These two blocks are both followed by a channel scaling operation [53] and a DropPath operation [59] to improve the feature generalization and robustness ability. Specifically, for the depthwise convolution-based module, the features output from the Stem module, Xin, are first processed by a group normalization (i.e., the feature maps are grouped along the channel dimension) and then fed into a depthwise convolution layer. Compared to traditional spatial convolution, depthwise convolution can increase the feature representation ability while reducing the computational costs. Then, channel scaling and DropPath are applied. After that, a residual connection of Xin is added. The above processes can be formulated as:

X̃in = DConv(GN(Xin)) + Xin, (3)

where X̃in is the output of the depthwise convolution-based module, GN(·) is the group normalization, and DConv(·) is a depthwise convolution [58] with a kernel size of 1 × 1.

For the channel MLP-based module, the features output from the depthwise convolution-based module, X̃in, are first fed to a group normalization, and then the channel MLP [51] is implemented on these features. Compared to a spatial MLP, the channel MLP can not only effectively reduce the computational complexity but also meet the requirements of general vision tasks [38], [45]. After that, channel scaling, DropPath, and a residual connection of X̃in are implemented in sequence.
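For concreteness, Eqs. (1)–(3) (together with the channel-MLP step formalized in Eq. (4) below) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions rather than the released implementation: the number of normalization groups, the channel-MLP expansion ratio, and the `lvc` placeholder are ours, and channel scaling and DropPath are omitted, mirroring the simplification of Eqs. (3)–(4).

```python
import torch
import torch.nn as nn

class LightMLP(nn.Module):
    """Sketch of the lightweight MLP: depthwise-conv module + channel-MLP module.

    Assumptions: 32 normalization groups and a 4x channel-MLP expansion;
    channel scaling and DropPath are omitted, mirroring Eqs. (3)-(4).
    """

    def __init__(self, dim, mlp_ratio=4, groups=32):
        super().__init__()
        self.gn1 = nn.GroupNorm(groups, dim)
        self.dconv = nn.Conv2d(dim, dim, kernel_size=1, groups=dim)  # depthwise 1x1 conv
        self.gn2 = nn.GroupNorm(groups, dim)
        self.cmlp = nn.Sequential(                                   # channel MLP as 1x1 convs
            nn.Conv2d(dim, dim * mlp_ratio, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim * mlp_ratio, dim, kernel_size=1),
        )

    def forward(self, x_in):
        x = self.dconv(self.gn1(x_in)) + x_in   # Eq. (3)
        return self.cmlp(self.gn2(x)) + x       # channel-MLP module, cf. Eq. (4)


class EVC(nn.Module):
    """Sketch of the explicit visual center: Stem + parallel MLP / LVC branches."""

    def __init__(self, in_dim, dim=256, lvc=None):
        super().__init__()
        self.stem = nn.Sequential(               # 7x7 conv + BN + ReLU, Eq. (2)
            nn.Conv2d(in_dim, dim, kernel_size=7, stride=1, padding=3),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.mlp = LightMLP(dim)
        # Placeholder for the learnable visual center (Section III-B.2); it is
        # expected to return a map with the same spatial shape as its input.
        self.lvc = lvc if lvc is not None else nn.Identity()

    def forward(self, x4):
        x_in = self.stem(x4)
        return torch.cat([self.mlp(x_in), self.lvc(x_in)], dim=1)   # Eq. (1)
```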
Fig. 3. An illustration of the proposed explicit visual center, where a lightweight MLP architecture is used to capture the long-range dependencies and a parallel learnable visual center mechanism is used to aggregate the local corner regions of the input image. The integrated features contain the advantages of these two blocks, so that the detection model can learn an all-round yet discriminative feature representation.
These processes of the channel MLP-based module are expressed as:

MLP(Xin) = CMLP(GN(X̃in)) + X̃in, (4)

where CMLP(·) is the channel MLP [51]. In our paper, for presentation convenience, we omit the channel scaling and DropPath operations in Eq. 3 and Eq. 4.

2) Learnable Visual Center (LVC): LVC maps the input feature Xin of shape C × H × W to a set of C-dimensional features X̌in = {x̌1, x̌2, . . . , x̌N}, where N = H × W is the total number of the input features. Next, X̌in will learn an inherent codebook B = {b1, b2, . . . , bK} containing two important components: 1) K codewords (i.e., visual centers); and 2) a set of smoothing factors S = {s1, s2, . . . , sK} for the visual centers. Specifically, the features from the Stem block Xin are first encoded by a set of convolution layers (which consist of a 1 × 1 convolution, a 3 × 3 convolution, and a 1 × 1 convolution). It is worth noting that the first 1 × 1 convolution is used to reduce the channel of the input features, and the last 1 × 1 convolution is used to increase the channel of the output features, while the middle 3 × 3 layer has a smaller channel size for its input and output features. Thus, this set of convolution layers not only reduces the computational volume but also allows the model to converge better. Next, the encoded features are processed by a CBR block, consisting of a 3 × 3 convolution with a BN layer and a ReLU activation function. Through the above steps, the encoded features X̌in are entered into the codebook. The learned smoothing factors sk (k = 1, 2, . . . , K) can be used in the codebook to match x̌i (i = 1, 2, . . . , N) with the corresponding bk (k = 1, 2, . . . , K). Then, the learnable weights are generated based on the difference between the two, the weights are multiplied over the difference and summed up, and the final output is a C-dimensional vector e (as shown in Eq. 5 and Eq. 6). We can thus selectively focus on category-related feature map information. The information of the whole image with respect to the k-th codeword can be calculated by:

ek = Σ_{i=1}^{N} [ e^{−sk ∥x̌i − bk∥²} / Σ_{j=1}^{K} e^{−sj ∥x̌i − bj∥²} ] (x̌i − bk), (5)

where x̌i − bk is the difference between the i-th C-dimensional feature vector and the k-th codeword vector. Then, the learnable weight e^{−sk ∥x̌i − bk∥²} / Σ_{j=1}^{K} e^{−sj ∥x̌i − bj∥²} is generated based on x̌i − bk, where ∥·∥ denotes the L2 norm, sk denotes the smoothing factor of the k-th codeword, and sk ∥x̌i − bk∥² is the output value of the k-th codeword. Next, the weights are multiplied by x̌i − bk to obtain information about the…
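The soft assignment in Eq. (5) resembles the encoding layer of [12] and is straightforward to prototype. The sketch below is a hedged PyTorch illustration, not the authors' code: the 1 × 1 / 3 × 3 / 1 × 1 encoder and the CBR block are simplified, the bottleneck width and parameter initialization are assumptions, and the fusion that follows Eq. (5) (Eq. (6) and beyond, which falls outside the text shown here) is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LVC(nn.Module):
    """Sketch of the learnable visual center (codebook soft assignment of Eq. (5)).

    Assumptions: a bottleneck width of dim // 4 in the 1x1-3x3-1x1 encoder and
    randomly initialized codewords / smoothing factors; the aggregation that
    follows Eq. (5) is not shown in the excerpt and is left out here.
    """

    def __init__(self, dim, num_codewords=64):
        super().__init__()
        hidden = dim // 4
        self.encoder = nn.Sequential(                 # 1x1 -> 3x3 -> 1x1 encoding convs
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.Conv2d(hidden, dim, kernel_size=1),
        )
        self.cbr = nn.Sequential(                     # CBR block: 3x3 conv + BN + ReLU
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.codewords = nn.Parameter(torch.randn(num_codewords, dim))   # b_k
        self.scale = nn.Parameter(torch.ones(num_codewords))             # s_k

    def forward(self, x_in):
        x = self.cbr(self.encoder(x_in))                  # encoded features
        x = x.flatten(2).transpose(1, 2)                  # (B, N, C), N = H * W
        resid = x.unsqueeze(2) - self.codewords           # (B, N, K, C): x_i - b_k
        dist2 = resid.pow(2).sum(dim=-1)                  # (B, N, K): ||x_i - b_k||^2
        weight = F.softmax(-self.scale * dist2, dim=2)    # soft assignment of Eq. (5)
        e = (weight.unsqueeze(-1) * resid).sum(dim=1)     # (B, K, C): e_k per codeword
        return e
```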
Fig. 4. Structure diagram of the MLP and attention-based variants. (a) is the PoolFormer structure in [53]. (c) and (e) imitate the PoolFormer structure and replace the pooling layer with a CSPLayer [56] and a depthwise convolution as token mixers, respectively. Moreover, structures (b), (d), and (f) replace the channel MLP module with an attention-based module from the transformer. Norm denotes the normalization. ⊕ represents the channel-wise addition operation and ⊗ represents the channel-wise multiplication operation. P.E. represents positional encoding.
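All variants in Fig. 4 share the same two-residual skeleton: a normalization followed by a token mixer with an additive skip, then a normalization followed by a channel MLP (or an attention module) with another additive skip. The PyTorch skeleton below is a generic, hedged illustration of that structure only; the concrete token mixer (pooling, CSPLayer, depthwise convolution, or attention), the normalization choice, and the expansion ratio are interchangeable placeholders.

```python
import torch.nn as nn

class VariantBlock(nn.Module):
    """Generic skeleton shared by the Fig. 4 variants:
    Norm -> token mixer -> add, then Norm -> channel MLP -> add."""

    def __init__(self, dim, token_mixer, mlp_ratio=4, groups=1):
        super().__init__()
        self.norm1 = nn.GroupNorm(groups, dim)   # normalization choice is an assumption
        self.mixer = token_mixer                 # e.g., pooling, CSPLayer, depthwise conv, attention
        self.norm2 = nn.GroupNorm(groups, dim)
        self.cmlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim * mlp_ratio, dim, kernel_size=1),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))        # token-mixing sub-block
        x = x + self.cmlp(self.norm2(x))         # channel-MLP sub-block
        return x

# Example: an (e)-style variant with a depthwise convolution as the token mixer.
block = VariantBlock(dim=256, token_mixer=nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256))
```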
TABLE II
RESULT COMPARISONS OF THE INTRA-INTER LAYER FEATURE METHODS. "†" DENOTES THAT THIS IS OUR RE-IMPLEMENTED RESULT. "-" DENOTES THAT THERE IS NO SUCH A RESULT IN ITS PAPER
TABLE III
RESULT COMPARISONS OF OUR LIGHTWEIGHT MLP (OURS) YOLOX-L WITH THE MLP VARIANTS AND THE SELF-ATTENTION VARIANTS ON THE MS-COCO val SET [36]. "†" IS OUR RE-IMPLEMENTATION RESULT

TABLE IV
RESULT COMPARISONS OF OUR LIGHTWEIGHT MLP (OURS) YOLOX-L WITH THE STANDARD MLP-MIXER AND THE ATTENTION ON THE MS-COCO val SET [36]. "†" DENOTES THAT THIS IS OUR RE-IMPLEMENTED RESULT

TABLE V
EFFECT OF THE NUMBER OF VISUAL CENTERS K ON LVC YOLOX-L. "-" DENOTES THAT THERE IS NO SUCH A SETTING. "†" IS OUR RE-IMPLEMENTATION RESULT
…Mixer [51] and attention-based methods shows that the mAP of the lightweight MLP increases by 0.6% to 1.0% with less computational cost. We can conclude that our proposed lightweight MLP for capturing the long-range dependencies of intra-layer features performs well.

4) Effect of K: As shown in Table V, we analyze the effect of the number of visual centers K on the performance of LVC. We choose YOLOX-L as the baseline, and with increasing K, we can observe that its performance shows an increasing trend. At the same time, the parameter number, computation volume, and latency of the model also tend to increase gradually. Notably, when K = 64, the mAP of the model reaches 49.10%, and when K = 128, the mAP of the model reaches 49.20%. Although the performance of the model improves by 0.1% as K increases, its extra computational cost increases by 10.01G, and the corresponding inference time increases by 3.21 ms. This may be because too many visual centers bring more redundant semantic information: the performance is not significantly improved, while the computational effort is increased. So we choose K = 64.

5) Effect of R: From Table VI, we analyze the effect of the number of repetitions R of CFP on the performance. We still choose the YOLOX-L baseline, and as R increases, we can observe that the performance first increases, then decreases, and then stabilizes compared to the YOLOX-L model. Meanwhile, the number of parameters, computation volume, and latency gradually increase. In particular, when R = 1, CFPYOLOX-L achieves the best mAP of 49.40%. When R = 2, the performance is instead reduced by 0.2% compared to R = 1. The reason may be that redundant feature information is not helpful for the task but increases the computational cost. Therefore, based on the above observations, we choose R = 1.

TABLE VI
RESULT COMPARISONS OF THE NUMBER OF REPETITIONS OF THE PROPOSED CFP IN THE YOLOX-L BASELINE. "-" DENOTES THAT THERE IS NO SUCH A SETTING. "†" IS OUR RE-IMPLEMENTATION RESULT. R DENOTES THE NUMBER OF REPETITIONS

D. Efficiency Analysis

We analyze the performance of the MLP variants and attention-based variants from a multi-metric perspective. In Figure 6, all models take YOLOX-L [38] as the baseline and are trained on the MS-COCO [36] val set with the same data augmentation settings. Meanwhile, to demonstrate the effectiveness of the MLP structure, as shown in Table VII, we compare it with the state-of-the-art transformer methods and the MLP methods at this stage. As can be observed from Figure 6, we can intuitively see that the MLP (Ours) structure is significantly better than the other structures in terms of mAP, and it is lower than the other structures in terms of the number of parameters, computation volume, and inference time. It can be shown that the MLP structure can guarantee…
TABLE VII
RESULT COMPARISONS OF OUR LIGHTWEIGHT MLP WITH TRANSFORMER VARIANTS AND MLP VARIANT METHODS

TABLE VIII
RESULT COMPARISONS WITH YOLOv5 AND YOLOX. "†" IS OUR RE-IMPLEMENTATION RESULT. "-" DENOTES THAT THERE IS NO SUCH A SETTING
TABLE IX
COMPARISON OF THE SPEED AND ACCURACY OF DIFFERENT OBJECT DETECTORS ON MS-COCO val SET. WE SELECT ALL THE MODELS TRAINED ON 150 EPOCHS FOR FAIR COMPARISON. "†" IS OUR RE-IMPLEMENTATION RESULT. "-" DENOTES THAT THERE IS NO SUCH A SETTING
1) Comparisons With the YOLOv5 and YOLOX Baselines: As shown in Table VIII, when YOLOv5 is chosen as the baseline, the mAP of our CFP method is enhanced by 0.5%, 0.5%, and 1.4% on the Small, Medium, and Large size models, respectively. When YOLOX [38] is used as the baseline, the mAP improves by 7.0%, 0.8%, and 1.6% on the backbone networks of different sizes. It is worth noting that the main reason why we choose YOLOv5 (an anchor-based mechanism) and YOLOX (an anchor-free mechanism) as the baselines is that the reciprocity of these two models in terms of network structure can fully demonstrate the effectiveness of our CFP approach. Most importantly, compared with YOLOX, although the network structure of YOLOv5 has no advantage, the mAP of our CFP method can still reach 46.60%. Meanwhile, our mAP reaches 49.40% on the YOLOX baseline. Moreover, CFPYOLOX on the small backbone network is improved by 7.0% over YOLOX [38]. The main reason for this is that the LVC in our CFP can enhance the feature representations of the local corner regions through visual centers at the pixel level.

2) Comparisons on Speed and Accuracy: We perform a series of comparisons on the MS-COCO val set with single-stage and two-stage detectors, and the results are shown in Table IX.
Fig. 7. Qualitative results on the test set of MS-COCO 2017 [36]. We show the object detection results from the baseline and our approaches for comparison.
We first examine the two-stage object detection models, including the Faster R-CNN series with different backbone networks, Mask R-CNN, and D2Det. Our CFPYOLOX-L model has significant advantages in precision, inference speed, and time. Next, we divide the single-stage detection methods into three parts in chronological order and then analyze them. There is no doubt that the proposed CFPYOLOX-L method improves the mAP by up to 27.80% compared to YOLOv3-ultralytics [44] and its previous detectors. With nearly the same average precision, CFPYOLOv5-M infers 1.5 times faster than the EfficientDet-D2 detector. Comparing CFPYOLOX-L with EfficientDet-D3, the average accuracy is improved by 1.9%, and the inference speed is 1.8 times higher. In addition, in comparison with the YOLOv4 [45] series, it can be found that the mAP of CFPYOLOv5-L is improved by 2.7% compared to YOLOv4-CSP [80]. Besides, we consider all scaled YOLOv5 [37] models, including YOLOv5-S [37], YOLOv5-M [37], and YOLOv5-L [37]. The average precision of the best YOLOv5-L [37] model is 1.4% lower than that of CFPYOLOv5-L. In the same way, our CFP method obtains a maximum average accuracy of 49.40%, which is 1.6% higher than YOLOX-L.

3) Qualitative Results on the MS-COCO [36] 2017 test Set: In addition, we also show in Figure 7 some visualization results of the baseline (YOLOX-L [38]), EVCYOLOX-L, and CFPYOLOX-L on the MS-COCO [36] test set. It is worth noting that we use white, red, and orange boxes to mark where the detection task fails, respectively. White boxes indicate misses due to occlusion, light influence, or small object size. Red boxes indicate detection errors due to insufficient contextual semantic relationships, e.g., causing one object to be detected as two objects. The yellow boxes indicate an error in the object classification. In the first line, the detection result of YOLOX-L in the part marked by the white box is not ideal due to the distance factor of the "zebra", while EVCYOLOX-L can partially detect the "zebra" at a distance. Therefore, it is intuitively proved that EVC is very effective for small object detection in some intensive detection tasks. In the second line of the figure, YOLOX-L does not fully detect the "Cups" in the cabinet due to factors such as occlusion and illumination. The EVCYOLOX-L model alleviates this problem by using MLP structures to capture the long-range dependencies of the features in the object. Finally, the CFPYOLOX-L model uses the GCR-assisted EVC scheme and gets better results. In the third line of the figure, the CFPYOLOX-L model performs better in complex scenarios. Based on the EVC scheme, GCR is used to adjust the intra-layer features in a top-down manner, and CFPYOLOX-L can better solve the classification problem.

V. CONCLUSION AND FUTURE WORK

In this work, we proposed a CFP for object detection, which was based on a globally explicit centralized feature regulation. We first proposed a spatial explicit visual center scheme, where a lightweight MLP was used to capture the globally long-range dependencies and a parallel learnable visual center was used to capture the local corner regions of the input images. Based on the proposed EVC, we then proposed a GCR for a feature pyramid in a top-down manner, where the explicit visual center information obtained from the deepest intra-layer feature was used to regulate all frontal shallow features. Compared to the existing methods, CFP not only has the ability to capture the global long-range dependencies, but also efficiently obtains an all-round yet discriminative feature representation. Experimental results on the MS-COCO dataset verified that our CFP can achieve consistent performance gains on the state-of-the-art object detection baselines.
CFP is a generalized approach that can not only extract global long-range dependencies of the intra-layer features but also preserve the local corner region information as much as possible, which is very important for dense prediction tasks. Therefore, in the future, we will start to develop some advanced intra-layer feature regulation methods to further improve the feature representation ability. Besides, we will try to apply EVC and GCR to other feature pyramid-based computer vision tasks, e.g., semantic segmentation, object localization, instance segmentation, and person re-identification.

ACKNOWLEDGMENT

The code and weights have been released at: CFPNet.

REFERENCES

[1] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019.
[2] M. Treml et al., "Speeding up semantic segmentation for autonomous driving," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2016, pp. 1–7.
[3] M. Havaei et al., "Brain tumor segmentation with deep neural networks," Med. Image Anal., vol. 35, pp. 18–31, Jan. 2017.
[4] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[5] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[6] W. Liu et al., "SSD: Single shot MultiBox detector," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 1–17.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[8] D. Zhang, L. Zhang, and J. Tang, "Augmented FCN: Rethinking context modeling for semantic segmentation," Sci. China Inf. Sci., vol. 66, no. 4, Apr. 2023, Art. no. 142105.
[9] D. Zhang, J. Tang, and K.-T. Cheng, "Graph reasoning transformer for image parsing," in Proc. 30th ACM Int. Conf. Multimedia, Oct. 2022, pp. 2380–2389.
[10] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, Aug. 2014.
[11] S. Vashishth, S. Sanyal, V. Nitin, N. Agrawal, and P. Talukdar, "InteractE: Improving convolution-based knowledge graph embeddings by increasing feature interactions," in Proc. AAAI Conf. Artif. Intell. (AAAI), 2020, pp. 1–8.
[12] H. Zhang et al., "Context encoding for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7151–7160.
[13] H. Zhang, H. Zhang, C. Wang, and J. Xie, "Co-occurrent features in semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 548–557.
[14] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10778–10787.
[15] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "NAS-FPN: Learning scalable feature pyramid architecture for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7029–7038.
[16] K. Chen, Y. Cao, C. C. Loy, D. Lin, and C. Feichtenhofer, "Feature pyramid grids," 2020, arXiv:2004.03580.
[17] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, "Feature pyramid transformer," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 323–329.
[18] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8759–8768.
[19] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 936–944.
[20] M. Yin et al., "Disentangled non-local neural networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 191–207.
[21] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7794–7803.
[22] A. Vaswani et al., "Attention is all you need," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 1–11.
[23] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, and P.-M. Jodoin, "Non-local deep features for salient object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6593–6601.
[24] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1971–1980.
[25] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 213–229.
[26] A. Dosovitskiy et al., "An image is worth 16 × 16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[27] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[28] W. Wang et al., "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 548–558.
[29] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, "Designing network design spaces," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10425–10433.
[30] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
[31] L. Ru, Y. Zhan, B. Yu, and B. Du, "Learning affinity from attention: End-to-end weakly-supervised semantic segmentation with transformers," 2022, arXiv:2203.02664.
[32] R. Li, Z. Mai, C. Trabelsi, Z. Zhang, J. Jang, and S. Sanner, "TransCAM: Transformer attention-based CAM refinement for weakly supervised semantic segmentation," 2022, arXiv:2203.07239.
[33] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, "Segmenter: Transformer for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 7242–7252.
[34] F. Zhu, Y. Zhu, L. Zhang, C. Wu, Y. Fu, and M. Li, "A unified efficient pyramid transformer for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 2667–2677.
[35] D. Zhang, H. Zhang, J. Tang, X.-S. Hua, and Q. Sun, "Self-regulation for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 6933–6943.
[36] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 213–229.
[37] G. Jocher et al., "Ultralytics/YOLOv5: V4.0-nn.SiLU() activations weights & biases logging PyTorch hub integration (version v4.0)," Jan. 2021. [Online]. Available: https://zenodo.org/record/4418161#.YFCo2nHvfC0
[38] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," 2021, arXiv:2107.08430.
[39] Q. Zhao et al., "M2Det: A single-shot object detector based on multi-level feature pyramid network," in Proc. Conf. Artif. Intell. (AAAI), 2019, pp. 1–8.
[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Inf. Process. Syst. (NeurIPS), vol. 25, no. 6, 2017, pp. 80–90.
[41] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[42] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2016, pp. 1–9.
[43] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[44] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[45] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[46] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 10347–10357.
[47] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," 2020, arXiv:2010.04159.
[48] Y. Chen, N. Ashizawa, C. K. Yeo, N. Yanai, and S. Yean, "Multi-scale self-organizing map assisted deep autoencoding Gaussian mixture model for unsupervised intrusion detection," Knowl.-Based Syst., vol. 224, Jul. 2021, Art. no. 107086.
[49] J. Beal, E. Kim, E. Tzeng, D. Huk Park, A. Zhai, and D. Kislyuk, "Toward transformer-based object detection," 2020, arXiv:2012.09958.
[50] S. Zheng et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6877–6886.
[51] I. Tolstikhin et al., "MLP-mixer: An all-MLP architecture for vision," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2021, pp. 1–12.
[52] H. Liu, Z. Dai, D. R. So, and Q. V. Le, "Pay attention to MLPs," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2021, pp. 1–12.
[53] W. Yu et al., "MetaFormer is actually what you need for vision," 2021, arXiv:2111.11418.
[54] Q. Hou, Z. Jiang, L. Yuan, M.-M. Cheng, S. Yan, and J. Feng, "Vision permutator: A permutable MLP-like architecture for visual recognition," 2021, arXiv:2106.12368.
[55] Z. Peng et al., "Conformer: Local features coupling global representations for visual recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 357–366.
[56] C.-Y. Wang, H.-Y. Mark Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 1571–1580.
[57] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[58] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[59] G. Larsson, M. Maire, and G. Shakhnarovich, "FractalNet: Ultra-deep neural networks without residuals," 2016, arXiv:1605.07648.
[60] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, "A simple pooling-based design for real-time salient object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3912–3921.
[61] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–13.
[62] P. Ramachandran, B. Zoph, and Q. V. Le, "Swish: A self-gated activation function," 2017, arXiv:1710.05941.
[63] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[64] P. Goyal et al., "Accurate, large minibatch SGD: Training ImageNet in 1 hour," 2017, arXiv:1706.02677.
[65] L. He and S. Todorovic, "DESTR: Object detection with split transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 9367–9376.
[66] Y. Li et al., "MViTv2: Improved multiscale vision transformers for classification and detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 4794–4804.
[67] Z. Tu et al., "MaxViT: Multi-axis vision transformer," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 1–31.
[68] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, and J. Sun, "You only look one-level feature," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Aug. 2021, pp. 13034–13043.
[69] Z. Jin, D. Yu, L. Song, Z. Yuan, and L. Yu, "You should look at all objects," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 332–349.
[70] Y.-H. Wu, Y. Liu, X. Zhan, and M.-M. Cheng, "P2T: Pyramid pooling transformer for scene understanding," IEEE Trans. Pattern Anal. Mach. Intell., early access, Aug. 30, 2022, doi: 10.1109/TPAMI.2022.3202765.
[71] D. Lian, Z. Yu, X. Sun, and S. Gao, "AS-MLP: An axial shifted MLP architecture for vision," 2021, arXiv:2107.08391.
[72] Z. Chen, J. Zhang, and D. Tao, "Recurrent glimpse-based decoder for detection with transformer," 2021, arXiv:2112.04632.
[73] Y. Fang et al., "You only look at one sequence: Rethinking transformer in vision through object detection," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2021, pp. 26183–26197.
[74] H. Song et al., "ViDT: An efficient and effective fully transformer-based object detector," 2021, arXiv:2110.03921.
[75] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5987–5995.
[76] J. Cao, H. Cholakkal, R. M. Anwer, F. S. Khan, Y. Pang, and L. Shao, "D2Det: Towards high quality object detection and instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11485–11494.
[77] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2999–3007.
[78] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6517–6525.
[79] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," 2017, arXiv:1701.06659.
[80] C.-Y. Wang, A. Bochkovskiy, and H. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13024–13033.
[81] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn. (ICML), 2019, pp. 6105–6114.

Yu Quan received the B.S. and M.S. degrees in computer science and technology from Guangxi Normal University, China. She is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. Her research interests include deep learning, object detection, and their applications. In these areas, she has published several journal articles and conference papers, including NCA, IEEE ACCESS, PRICAI, and ICPR.

Dong Zhang (Member, IEEE) received the Ph.D. degree in computer science and technology from the Nanjing University of Science and Technology in 2021. He is currently a Postdoctoral Research Scientist with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology. His research interests include machine learning, image processing, computer vision, and medical image analysis, especially in image classification, semantic image segmentation, object detection, and their potential applications. In these areas, he has published several journal articles and conference papers, including IEEE TRANSACTIONS ON CYBERNETICS, PR, AAAI, ACM MM, IJCAI, ECCV, ICCV, and NeurIPS.

Liyan Zhang received the Ph.D. degree in computer science from the University of California at Irvine, Irvine, CA, USA, in 2014. She is currently a Professor with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China. Her research interests include multimedia analysis, computer vision, and deep learning. She received the Best Paper Award from the International Conference on Multimedia Retrieval (ICMR) 2013 and the Best Student Paper Award from the International Conference on Multimedia Modeling (MMM) 2016.

Jinhui Tang (Senior Member, IEEE) received the B.E. and Ph.D. degrees from the University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively. He is currently a Professor with the Nanjing University of Science and Technology, Nanjing, China. He has authored more than 200 papers in top tier journals and conferences. His research interests include multimedia analysis and computer vision. He is a fellow of IAPR. He was a recipient of the Best Paper Award from ACM MM 2007 and ACM MM Asia 2020 and the Best Paper Runner-Up from ACM MM 2015. He has served as an Associate Editor for the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, IEEE TRANSACTIONS ON MULTIMEDIA, and IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.