
Centralized Feature Pyramid for Object Detection


Yu Quan, Dong Zhang, Member, IEEE, Liyan Zhang, Member, IEEE, and Jinhui Tang, Senior Member, IEEE

IEEE Transactions on Image Processing, vol. 32, 2023.
Manuscript received 26 July 2022; revised 10 February 2023 and 18 May 2023; accepted 12 July 2023. Date of publication 25 July 2023; date of current version 2 August 2023. This work was supported by the National Natural Science Foundation of China under Grant 61925204 and Grant 62172212. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Zhenzhong Chen. (Corresponding author: Liyan Zhang.)
Yu Quan and Jinhui Tang are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]; [email protected]).
Dong Zhang was with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China. He is now with the Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]).
Liyan Zhang is with the College of Computer Science and Technology, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TIP.2023.3297408

Abstract— The visual feature pyramid has shown its superiority in both effectiveness and efficiency in a variety of applications. However, current methods overly focus on inter-layer feature interactions while disregarding the importance of intra-layer feature regulation. Despite some attempts to learn a compact intra-layer feature representation with the use of attention mechanisms or vision transformers, they overlook the crucial corner regions that are essential for dense prediction tasks. To address this problem, we propose a Centralized Feature Pyramid (CFP) network for object detection, which is based on a globally explicit centralized feature regulation. Specifically, we first propose a spatial explicit visual center scheme, where a lightweight MLP is used to capture the globally long-range dependencies, and a parallel learnable visual center mechanism is used to capture the local corner regions of the input images. Based on this, we then propose a globally centralized regulation for the commonly-used feature pyramid in a top-down fashion, where the explicit visual center information obtained from the deepest intra-layer feature is used to regulate the frontal shallow features. Compared to the existing feature pyramids, CFP not only has the ability to capture the global long-range dependencies but also efficiently obtains an all-round yet discriminative feature representation. Experimental results on the challenging MS-COCO dataset validate that our proposed CFP can achieve consistent performance gains on the state-of-the-art YOLOv5 and YOLOX object detection baselines.

Index Terms— Feature pyramid, visual center, object detection, attention learning mechanism, long-range dependencies.

I. INTRODUCTION

Object detection is one of the most fundamental yet challenging research tasks in the community of computer vision. It aims to predict a unique bounding box for each object of the input image that contains not only the location but also the category information [1]. In the past few years, this task has been extensively developed and applied to a wide range of potential applications, e.g., autonomous driving [2] and computer-aided diagnosis [3].

The successful object detection methods are mainly based on Convolutional Neural Networks (CNNs) as the backbone followed by a two-stage (e.g., Fast/Faster R-CNN [4], [5]) or single-stage (e.g., SSD [6] and YOLO [7]) framework. However, due to the uncertainty of object sizes, a single feature scale cannot meet the requirements of high-accuracy recognition performance [8], [9]. To this end, methods (e.g., SSD [6] and FFP [10]) based on the in-network feature pyramid have been proposed and achieve satisfactory results effectively and efficiently. The unified principle behind these methods is to assign a region of interest with the appropriate contextual information for each object of a different size, enabling these objects to be recognized in different feature layers.

The interaction between pixels or objects is crucial in image recognition tasks [11]. Effective feature interactions allow image features to obtain a wider and richer representation, enabling object detection models to learn the implicit relationship (such as favorable co-occurrence features [12], [13]) between pixels or objects. This has been shown to improve visual recognition performance [14], [15], [16], [17], [18], [19], [20]. For example, FPN [19] proposes a top-down inter-layer feature interaction mechanism, which enables shallow features to obtain the global contextual information and semantic representations of deep features. NAS-FPN [15] tries to learn the network structure of the feature pyramid part via a network architecture search strategy and obtains a scalable feature representation. In addition to inter-layer interactions inspired by the non-local or self-attention mechanism [21], [22], finer intra-layer interactions for spatial feature regulation are also applied to object detection tasks, such as non-local features [23] and GCNet [24]. FPT [17] further integrates inter-layer cross-layer and intra-layer cross-space feature regulation methods and achieves exceptional performance.

Despite their success in object detection, the above methods are limited by the inherently small receptive fields of CNN backbones [9]. As shown in Figure 1 (a), standard CNN backbone features can only locate the most discriminative object regions (e.g., the “body of an airplane” and the “motorcycle pedals”). To solve this problem, vision transformer-based object detection methods [25], [26], [27], [28] have recently been proposed and have flourished. These methods first divide the input image into different image patches and then use multi-head attention-based feature interaction among patches to obtain the global long-range dependencies. As expected, the feature pyramid is also employed in vision transformers, e.g., PVT [28] and Swin Transformer [27]. Although these methods can address the limited receptive fields and the local contextual information in CNNs, an obvious drawback is their large computational complexity.

For example, Swin-B [27] has almost 3× the model FLOPs (i.e., 47.0 G vs. 16.0 G) of a performance-comparable CNN model, RegNetY [29], at an input size of 224 × 224. Besides, as shown in Figure 1 (b), since vision transformer-based methods are implemented in an omnidirectional and unbiased learning pattern, they tend to ignore some corner regions (e.g., the “airplane engine,” the “motorcycle wheel” and the “bat”) that are important for dense prediction tasks. These drawbacks are more obvious on large-scale input images. Our research therefore raises the question: is it necessary to use transformers on all layers? By analyzing shallow features, we argue that this may not be necessary. Research on advanced methods [30], [31], [32] shows that shallow features mainly contain general object feature patterns, e.g., texture, color, and orientation, which are often not global. In contrast, deep features reflect object-specific information, which usually requires global context [33], [34]. Therefore, we argue that a transformer may not be necessary for all layers.

In this work, we propose a Centralized Feature Pyramid (CFP) network for object detection, which is based on a globally explicit centralized regulation scheme. Specifically, based on a visual feature pyramid extracted from the CNN backbone, we first propose an explicit visual center scheme, where an improved lightweight MLP architecture is used to capture the long-range dependencies and a parallel learnable visual center mechanism is used to aggregate the local key regions of the input images. Considering the fact that the deepest features usually contain the most abstract feature representations, which are scarce in the shallow features [35], based on the proposed regulation scheme we then propose a globally centralized regulation for the extracted feature pyramid in a top-down manner, where the spatial explicit visual center obtained from the deepest features is used to regulate all the frontal shallow features simultaneously. Compared to the existing feature pyramids, as shown in Figure 1 (c), CFP not only has the ability to capture the global long-range dependencies but also efficiently obtains an all-around yet discriminative feature representation. To demonstrate this superiority, we conduct extensive experiments on the challenging MS-COCO dataset [36]. Results validate that our proposed CFP can achieve consistent performance gains on the state-of-the-art YOLOv5 [37] and YOLOX [38] object detection baselines.

Our contributions can be summarized as follows:
• We introduce a spatial explicit visual center scheme, incorporating an improved lightweight MLP for capturing global dependencies and a learnable visual center for aggregating local key regions.
• We propose a globally centralized regulation for the commonly used feature pyramid, utilizing the deepest features to regulate the shallow features.
• Our centralized feature pyramid network demonstrates consistent improvement on state-of-the-art object detection baselines.

Fig. 1. Visualizations of image feature evolution for vision recognition tasks. For the input images in (a), a CNN model in (b) only locates the most discriminative regions; although the progressive model in (c) can see wider with the help of the attention mechanism [22] or transformer [25], it usually ignores the corner cues that are important for dense prediction tasks; our model in (d) can not only see wider but is also more well-rounded by attaching the centralized constraints on features with the advanced long-range dependencies, which is more suitable for dense prediction tasks.

II. RELATED WORK

A. Feature Pyramid in Computer Vision

Feature pyramids are essential building blocks in modern object recognition systems, capable of effectively and efficiently detecting objects of different scales. The concept was first introduced in SSD [6], where a hierarchical representation of multi-scale features was used to capture multi-scale information and improve recognition accuracy. FPN [19] further developed the concept by building a top-down hierarchy of features from bottom-up in-network feature maps. PANet [18] added a bottom-up pathway to share feature information between layers, allowing high-level features to obtain sufficient details from low-level features. NAS-FPN [15] used a spatial search strategy to connect layers in the feature pyramid, resulting in extensible feature information. M2Det [39] constructed a multi-stage feature pyramid to extract multi-scale and cross-level features. In general, feature pyramids allow for dealing with multi-scale changes in object recognition without incurring high computational overhead and provide multi-scale feature representations, including high-resolution features. Our proposed method improves upon current methods by introducing an intra-layer feature regulation approach to the feature pyramid, addressing their shortcomings in this aspect.

B. Object Detection

Object detection is a critical computer vision task that aims to identify objects of interest within an image and provide a complete scene description, including object category and location. With the rapid progress in Convolutional Neural Networks (CNNs) [40], many object detection models have achieved remarkable advances. The existing methods can be broadly classified into two categories: two-stage and single-stage detectors. Two-stage object detectors [4], [5], [41], [42], [43] typically use a Region Proposal Network (RPN) to generate a set of region proposals, and then extract features of these proposals and perform the classification and regression process using a separate learning module. However, the requirement to store and repeatedly extract features of each proposal leads to high computational costs and hinders the capture of


global feature representations. To address this, single-stage detectors [6], [7], [44], [45] directly perform prediction and region classification by generating bounding boxes. These single-stage methods have a global perspective in the design of feature extraction and use the backbone network to extract feature maps from the entire image to predict each bounding box. In this work, we also adopt single-stage object detectors (i.e., YOLOv5 [37] and YOLOX [38]) as our baseline models. Our goal is to enhance the representation of the feature pyramid used in these detectors.

C. Attention Learning and Long-Range Dependency

CNNs [40] focus more on the representative learning of local regions. However, this local representation does not satisfy the requirement for global context and long-range dependencies in modern recognition systems. To this end, the attention learning mechanism [22] has been proposed, which focuses on deciding where to project more attention in an image. For example, the non-local operation [21] uses a non-local neural network to directly capture long-range dependencies, demonstrating the significance of non-local modeling for tasks of video classification, object detection, and segmentation. However, the inherently local representation of CNNs is not resolved, i.e., CNN features can only capture limited contextual information. To address this problem, the Transformer [22], which mainly benefits from the multi-head attention mechanism, has caused a great sensation recently and achieved great success in the field of computer vision, such as image recognition [25], [26], [27], [46], [47], [48]. For example, the representative ViT divides the image into a sequence with position encoding and then uses cascaded transformer blocks to extract a parameterized vector as the visual representation. On this basis, many excellent models [46], [49], [50] have been proposed through further improvement and have achieved good performance in various computer vision tasks. Nevertheless, transformer-based image recognition models still have the disadvantage of being computationally intensive and complex.

To mitigate these limitations, recent studies [51], [52], [53], [54] have shown that replacing attention-based modules in transformer models with MLPs can still achieve good performance. This is because MLPs and attention mechanisms are both global information processing modules. The introduction of MLP-Mixer [51] into computer vision reduces changes to the data layout and can better capture long-range dependencies and spatial relationships through the interaction between spatial and channel features. Although MLP-style models have limitations in capturing fine-grained feature representations, they have the advantage of a simpler network structure than transformers. In our work, we also use an MLP to capture global contextual information and long-range dependencies in images. Our contribution lies in using the proposed spatial explicit visual center scheme to centralize the captured information. Besides, our other main contribution is the improvement of the existing MLP architecture, proposing a more lightweight MLP. The experimental results validate its effectiveness and efficiency.

III. OUR APPROACH

In this section, we introduce the implementation details of the proposed centralized feature pyramid (CFP). We first give an overview of the CFP architecture in Section III-A. Then, we show the implementation details of the explicit visual center in Section III-B. Finally, we show how to implement the explicit visual center on an image feature pyramid and propose our global centralized regulation in Section III-C.

A. Centralized Feature Pyramid (CFP)

Although the existing methods have largely concentrated on inter-layer feature interactions, they ignore intra-layer feature regulations, which have been empirically proven beneficial to vision recognition tasks. In our work, inspired by previous works on dense prediction tasks [51], [53], [55], we propose a CFP for object detection, which is based on a globally explicit centralized intra-layer feature regulation. Compared to the existing feature pyramids, our proposed CFP can capture the global long-range dependencies and enable comprehensive and discriminative feature representations. As illustrated in Figure 2, CFP mainly consists of the following parts: the input image, a CNN backbone used to extract the vision feature pyramid, the proposed Explicit Visual Center (EVC), the proposed Global Centralized Regulation (GCR), and a decoupled head network (which consists of a classification loss, a regression loss, and a segmentation loss) for object detection. In Figure 2, EVC and GCR are implemented on the extracted feature pyramid.

Concretely, we first feed an arbitrary RGB image into the backbone network (i.e., the Modified CSP v5 [56] or ResNet [57]) to extract a five-level feature pyramid X, where the spatial size of each layer of features Xi (i = 0, 1, 2, 3, 4) is 1/2, 1/4, 1/8, 1/16, and 1/32 of the input image, respectively. Our CFP is then implemented on this extracted feature pyramid. A lightweight MLP architecture is proposed to capture the global long-range feature dependencies based on X4, where an MLP layer replaces the multi-head self-attention module of a standard transformer encoder. Compared to the transformer encoder based on the multi-head attention mechanism, our lightweight MLP architecture is simple in structure, lighter in volume, and higher in computational efficiency (cf. Section III-B). Besides, a learnable visual center mechanism, in parallel with the lightweight MLP, is used to aggregate the local corner regions of the input image. We name this parallel-structure network the spatial EVC, which is implemented on the top layer (i.e., X4) of the feature pyramid. Based on the proposed EVC, to enable the shallow layers of the feature pyramid to benefit from the centralized visual information of the deepest feature at the same time, in an efficient pattern, we then propose a GCR in a top-down fashion, where the explicit visual center information obtained from the deepest intra-layer feature is used to regulate all the frontal shallow features (i.e., X3 to X1) simultaneously. Finally, we aggregate these features into a decoupled head network for instance classification and bounding-box regression.
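To make this data flow concrete, the following minimal PyTorch-style sketch wires the pieces together. The EVC and GCR blocks detailed in Sections III-B and III-C are treated as black-box callables, and all names are ours for illustration rather than those of the released code.

```python
import torch

def cfp_forward(image, backbone, evc, gcr, head):
    """Sketch of the CFP pipeline: backbone pyramid -> EVC on the deepest
    level -> top-down global centralized regulation -> decoupled head."""
    # Five-level pyramid X0..X4 at strides 2, 4, 8, 16, 32 of the input image.
    x0, x1, x2, x3, x4 = backbone(image)
    # Explicit visual center (lightweight MLP + LVC) on the top level only.
    x4_evc = evc(x4)
    # GCR: the centralized X4 features regulate the shallower levels X1..X3.
    p1, p2, p3 = gcr([x1, x2, x3], x4_evc)
    # Decoupled head: per-level classification and box-regression outputs.
    return [head(p) for p in (p1, p2, p3, x4_evc)]

# Shape-only smoke test with dummy stand-ins for the real modules.
if __name__ == "__main__":
    feats = [torch.randn(1, 256, 640 // s, 640 // s) for s in (2, 4, 8, 16, 32)]
    outs = cfp_forward(
        torch.randn(1, 3, 640, 640),
        backbone=lambda img: feats,
        evc=lambda x: x,
        gcr=lambda shallow, top: shallow,
        head=lambda x: (x, x),  # (cls_logits, box_regression) placeholders
    )
```

The point of this arrangement, as described above, is that the relatively expensive EVC is applied only to X4, while the cheaper GCR propagates its centralized information down to the shallower levels.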


Fig. 2. An illustration of the overall architecture, which mainly consists of four components: the input image, a backbone network for feature extraction, the centralized feature pyramid, which is built on a commonly-used vision feature pyramid following [38], and the object detection head network, which includes a classification (i.e., Cls.) loss and a regression (i.e., Reg.) loss. C denotes the class size of the used dataset. Our contribution lies in that we propose an intra-layer feature regulation method in a feature pyramid and a top-down global centralized regulation.

B. Explicit Visual Center (EVC)

As illustrated in Figure 3, our proposed EVC mainly consists of two blocks connected in parallel, where a lightweight MLP is used to capture the global long-range dependencies (i.e., the global information) of the top-level features X4. At the same time, to preserve the local corner regions (i.e., the local information), we propose a learnable visual center mechanism implemented on X4 to aggregate the intra-layer local region features. The resulting feature maps of these two blocks are concatenated together along the channel dimension as the output of EVC for the downstream recognition. Next, instead of mapping the original features directly as in [37], we add a Stem block between X4 and EVC for feature smoothing. The Stem block consists of a 7 × 7 convolution with an output channel size of 256, followed by a batch normalization layer and an activation function layer. The above processes can be formulated as:

X = Cat(MLP(Xin); LVC(Xin)),   (1)

where X is the output of EVC and Cat(·) denotes the feature map concatenation along the channel dimension. MLP(Xin) and LVC(Xin) denote the output features of the lightweight MLP (capturing the long-range dependencies) and the learnable visual center mechanism (capturing the local corner regions), respectively. Xin is the output of the Stem block, which is obtained by:

Xin = σ(BN(Conv7×7(X4))),   (2)

where Conv7×7(·) denotes a 7 × 7 convolution with stride 1 whose channel size is set to 256 in our work following [19], BN(·) denotes a batch normalization layer, and σ(·) denotes the ReLU activation function.
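As a concrete reading of Eq. (1) and Eq. (2), a hedged sketch of the Stem-plus-parallel-branch wiring is given below. The two branch modules are assumed to be the lightweight MLP and LVC described in the following subsections, and the class and argument names are illustrative only.

```python
import torch
import torch.nn as nn

class EVC(nn.Module):
    """Sketch of Eq. (1)-(2): X = Cat(MLP(Xin); LVC(Xin)) with Xin = Stem(X4)."""
    def __init__(self, in_channels, channels=256, mlp_branch=None, lvc_branch=None):
        super().__init__()
        # Stem block for feature smoothing: 7x7 conv (stride 1, 256 channels) -> BN -> ReLU.
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, channels, kernel_size=7, stride=1, padding=3),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.mlp_branch = mlp_branch or nn.Identity()  # lightweight MLP: global long-range cues
        self.lvc_branch = lvc_branch or nn.Identity()  # learnable visual center: local corner cues

    def forward(self, x4):
        x_in = self.stem(x4)                                   # Eq. (2)
        return torch.cat([self.mlp_branch(x_in),
                          self.lvc_branch(x_in)], dim=1)       # Eq. (1)
```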
1) Lightweight MLP: The proposed lightweight MLP mainly consists of two modules: a depthwise convolution-based module [58] and a channel MLP-based module, where the input of the channel MLP-based module is the output of the depthwise convolution-based [51] module. These two modules are both followed by a channel scaling operation [53] and a DropPath operation [59] to improve the feature generalization and robustness. Specifically, for the depthwise convolution-based module, the features Xin output from the Stem module are first processed by a group normalization (i.e., feature maps are grouped along the channel dimension) and then fed into a depthwise convolution layer. Compared to traditional spatial convolution, depthwise convolution can increase the feature representation ability while reducing computational costs. Then, channel scaling and DropPath are implemented. After that, a residual connection of Xin is added. The above processes can be formulated as:

X̃in = DConv(GN(Xin)) + Xin,   (3)

where X̃in is the output of the depthwise convolution-based module, GN(·) is the group normalization, and DConv(·) is a depthwise convolution [58] with a kernel size of 1 × 1.

For the channel MLP-based module, the features X̃in output from the depthwise convolution-based module are first fed to a group normalization, and then the channel MLP [51] is implemented on these features. Compared to the spatial MLP, the channel MLP can not only effectively reduce the computational complexity but also meet the requirements of general vision tasks [38], [45]. After that, channel scaling, DropPath, and a residual connection of X̃in are implemented in sequence.


Fig. 3. An illustration of the proposed explicit visual center, where a lightweight MLP architecture is used to capture the long-range dependencies and a parallel learnable visual center mechanism is used to aggregate the local corner regions of the input image. The integrated features combine the advantages of these two blocks, so that the detection model can learn an all-round yet discriminative feature representation.

The above processes are expressed as:

MLP(Xin) = CMLP(GN(X̃in)) + X̃in,   (4)

where CMLP(·) is the channel MLP [51]. In our paper, for presentation convenience, we omit the channel scaling and DropPath operations in Eq. 3 and Eq. 4.
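The following sketch instantiates Eq. (3) and Eq. (4) with standard PyTorch layers. The GELU activation and the expansion ratio inside the channel MLP are our assumptions, and channel scaling and DropPath are omitted exactly as in the equations.

```python
import torch
import torch.nn as nn

class LightweightMLP(nn.Module):
    """Sketch of Eq. (3)-(4): a depthwise-convolution module followed by a
    channel MLP module, each with group normalization and a residual
    connection (channel scaling and DropPath omitted, as in the text)."""
    def __init__(self, channels=256, mlp_ratio=4):
        super().__init__()
        self.gn1 = nn.GroupNorm(num_groups=32, num_channels=channels)
        # Depthwise convolution: groups == channels; the paper states a 1x1 kernel.
        self.dconv = nn.Conv2d(channels, channels, kernel_size=1, groups=channels)
        self.gn2 = nn.GroupNorm(num_groups=32, num_channels=channels)
        # Channel MLP realized with 1x1 convolutions (acts on the channel dim only).
        self.cmlp = nn.Sequential(
            nn.Conv2d(channels, channels * mlp_ratio, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels * mlp_ratio, channels, kernel_size=1),
        )

    def forward(self, x_in):
        x = self.dconv(self.gn1(x_in)) + x_in   # Eq. (3)
        return self.cmlp(self.gn2(x)) + x       # Eq. (4)
```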
2) Learnable Visual Center (LVC): The LVC maps the input feature Xin of shape C × H × W to a set of C-dimensional features X̌in = {x̌1, x̌2, ..., x̌N}, where N = H × W is the total number of input features. Next, X̌in learns an inherent codebook B = {b1, b2, ..., bK} containing two important components: 1) K codewords (i.e., visual centers); and 2) a set of smoothing factors S = {s1, s2, ..., sK} for the visual centers. Specifically, the features Xin from the Stem block are first encoded by a combination of convolution layers (a 1 × 1 convolution, a 3 × 3 convolution, and a 1 × 1 convolution). It is worth noting that the first 1 × 1 convolution is used to reduce the channel size of the input features and the last 1 × 1 convolution is used to increase the channel size of the output features, while the middle 3 × 3 layer has a smaller channel size for its input and output features. Thus, this set of convolution layers not only reduces the computational volume but also allows the model to converge better. Next, the encoded features are processed by a CBR block, consisting of a 3 × 3 convolution with a BN layer and a ReLU activation function. Through the above steps, the encoded features X̌in are entered into the codebook. The learned smoothing factors sk (k = 1, 2, ..., K) are used in the codebook to match each x̌i (i = 1, 2, ..., N) with the corresponding codeword bk. Then, learnable weights are generated based on the difference between the two, the weights are multiplied over the differences and summed up, and the final output is a C-dimensional vector e (as shown in Eq. 5 and Eq. 6). In this way, we can selectively focus on category-related feature map information. The information of the whole image with respect to the k-th codeword can be calculated by:

e_k = \sum_{i=1}^{N} \frac{e^{-s_k \|\check{x}_i - b_k\|^2}}{\sum_{j=1}^{K} e^{-s_j \|\check{x}_i - b_j\|^2}} (\check{x}_i - b_k),   (5)

where x̌i − bk is the difference between the i-th C-dimensional feature vector and the k-th codeword vector. The learnable soft-assignment weight (the exponential term normalized over the K codewords in Eq. (5)) is generated based on x̌i − bk, where ‖·‖ denotes the L2 norm, sk denotes the smoothing factor of the k-th codeword, and sk‖x̌i − bk‖² is the smoothed squared distance to the k-th codeword. Next, the weights are multiplied by (x̌i − bk) to obtain information about the

position of a pixel relative to a codeword. Finally, we sum these N results to obtain ek, which is the information of the whole image relative to the k-th codeword. After that, we use φ to fuse all ek, where φ contains a BN layer with ReLU and a mean layer. Finally, we sum over the K results to obtain e, which is the full information of the whole image with respect to the K codewords. Both ek and e are C-dimensional vectors:

e = \sum_{k=1}^{K} \phi(e_k).   (6)

A set of impact factors is then predicted to highlight the categories that need to be emphasized. First, we use a fully connected layer (FC) to map the feature e to a C × 1 × 1 impact factor. After that, we apply a channel-wise multiplication between the input features Xin from the Stem block and the impact factor coefficient δ(·). The highlighted feature representations are obtained as follows:

Z = Xin ⊗ (δ(FC(e))),   (7)

where FC denotes a fully connected layer, δ(·) is the sigmoid function, and ⊗ is the channel-wise multiplication. Finally, we perform a channel-wise addition between the features Xin output from the Stem block and the local corner region features Z, which is formulated as:

LVC(Xin) = Xin ⊕ Z,   (8)

where ⊕ is the channel-wise addition.
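Putting Eq. (5)–(8) together, a compact sketch of the LVC computation is shown below. The convolutional encoder and CBR block in front of the codebook are omitted for brevity, and the specific normalization and aggregation choices (BatchNorm1d over the K codewords, summation in place of the mean layer) are assumptions consistent with the text rather than the released implementation.

```python
import torch
import torch.nn as nn

class LVC(nn.Module):
    """Sketch of the learnable visual center, Eq. (5)-(8)."""
    def __init__(self, channels=256, num_codewords=64):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(num_codewords, channels))   # b_k
        self.smoothing = nn.Parameter(torch.ones(num_codewords))              # s_k
        self.bn = nn.BatchNorm1d(num_codewords)                               # part of phi
        self.fc = nn.Linear(channels, channels)                               # impact factors

    def forward(self, x_in):                         # x_in: (B, C, H, W) from the Stem block
        b, c, h, w = x_in.shape
        x = x_in.flatten(2).transpose(1, 2)          # (B, N, C), N = H * W
        diff = x.unsqueeze(2) - self.codewords       # (B, N, K, C): x_i - b_k
        dist = diff.pow(2).sum(-1)                   # (B, N, K): ||x_i - b_k||^2
        weight = torch.softmax(-self.smoothing * dist, dim=2)    # Eq. (5) soft assignment
        e_k = (weight.unsqueeze(-1) * diff).sum(1)                # (B, K, C)
        e = torch.relu(self.bn(e_k)).sum(1)                       # Eq. (6): phi, then sum over K
        z = x_in * torch.sigmoid(self.fc(e)).view(b, c, 1, 1)     # Eq. (7): channel-wise scaling
        return x_in + z                                           # Eq. (8): channel-wise addition
```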
C. Global Centralized Regulation (GCR)

EVC is a generalized intra-layer feature regulation method that can extract global long-range dependencies and preserve the local corner regional information of the input image as much as possible, which is very important for dense prediction tasks. However, using EVC at every level of the feature pyramid would result in a large computational overhead. To improve the computational efficiency of intra-layer feature regulation, we further propose a GCR for the feature pyramid in a top-down manner. Specifically, as illustrated in Figure 2, because the deepest features usually contain the most abstract feature representations, which are scarce in the shallow features [35], [60], our spatial EVC is first implemented on the top layer (i.e., X4) of the feature pyramid. Then, the obtained features X, which include the spatial explicit visual centers, are used to regulate all the frontal shallow features (i.e., X3 to X1) simultaneously. In our implementation, for each corresponding low-level feature, the features obtained in the deep layer are upsampled to the same spatial scale as the low-level features and then concatenated along the channel dimension. Based on this, the concatenated features are downsampled by a 1 × 1 convolution to a channel size of 256 as in [19]. In this way, we can explicitly increase the spatial weight of the global representations at each layer of the feature pyramid in the top-down path, such that our CFP can effectively achieve an all-around yet discriminative feature representation.
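A minimal sketch of this top-down regulation is given below, assuming nearest-neighbor upsampling and one 1 × 1 convolution per shallow level; both are our assumptions where the text only specifies upsampling, concatenation, and a 1 × 1 channel reduction to 256.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCR(nn.Module):
    """Sketch of the top-down global centralized regulation: the EVC output of
    the deepest level is upsampled to each shallow level, concatenated along
    the channel dimension, and squeezed back to 256 channels by a 1x1 conv."""
    def __init__(self, channels=256, num_shallow_levels=3):
        super().__init__()
        self.squeeze = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, kernel_size=1) for _ in range(num_shallow_levels)]
        )

    def forward(self, shallow_feats, top_evc):
        # shallow_feats: [X1, X2, X3]; top_evc: EVC-regulated X4 (all with 256 channels).
        outs = []
        for conv, feat in zip(self.squeeze, shallow_feats):
            up = F.interpolate(top_evc, size=feat.shape[-2:], mode="nearest")
            outs.append(conv(torch.cat([feat, up], dim=1)))  # explicit global regulation
        return outs
```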
IV. EXPERIMENTS

A. Dataset and Evaluation Metrics

1) Dataset: In this work, Microsoft Common Objects in Context (MS-COCO) [36] is used to validate the superiority of our proposed CFP. MS-COCO contains 80 classes of common scene objects, where the training set, val set, and test set contain 118k, 5k, and 20k images, respectively. In our experiments, for a fair comparison, all the training images are resized to a fixed size of 640 × 640 as in [19]. For data augmentation, we adopt the commonly used Mosaic [45] and MixUp [61] in our experiments. Mosaic can not only enrich the image data but also indirectly increase our batch size. MixUp helps increase the model generalization ability. In particular, following [38], our model turns the data augmentation strategy off for the last 15 epochs of training.

2) Evaluation Metrics: We mainly follow the commonly used object detection evaluation metric – Average Precision (AP) – in our experiments, which includes AP50, AP75, APS, APM, and APL. Besides, to quantify the model efficiency, GFLOPs, Frames Per Second (FPS), Latency, and the number of parameters (Params.) are also used. In particular, following [38], Latency and FPS are measured without post-processing for a fair comparison.

B. Implementation Details

1) Baselines: To validate the generality of CFP, we use two state-of-the-art baseline models in our experiments: YOLOv5 [37] and YOLOX [38]. In our experiments, we use the end-to-end training strategy and employ their default training and inference settings unless otherwise stated.
• YOLOv5 [37]. The backbone is a modified cross-stage partial network v5 [56] or DarkNet53 [44], where the modified cross-stage partial network v5 is used in the ablation study and DarkNet53 is used in the result comparisons with the state-of-the-art. The neck network is FPN [19]. The object detection head is the coupled head network, which contains a classification branch and a regression branch. In YOLOv5, according to the scaling of network depth and width, three networks of different scales are generated: YOLOv5-Small (YOLOv5-S), YOLOv5-Medium (YOLOv5-M), and YOLOv5-Large (YOLOv5-L).
• YOLOX [38]. Compared to YOLOv5, the whole network structure of YOLOX remains unchanged except for the coupled head network. In YOLOX, the object detection head is the decoupled head network.

2) Backbones: In our experiments, two backbones are used.
• DarkNet53 [44]. DarkNet53 mainly consists of 53 convolutional layers (basically 1 × 1 and 3 × 3 convolutions), and is mainly used for the performance comparisons with state-of-the-art methods in Table IX.
• Modified CSPNet v5 [37]. For a fair comparison, we choose the YOLOv5 backbone (i.e., the Modified CSPNet v5) as our backbone network. The output feature maps are the ones from stage 5, which consists of three convolution (Conv, BN, and SiLU [62]) operations and a spatial pyramid pooling [63] layer (5 × 5, 9 × 9, and 13 × 13).


Fig. 4. Structure diagrams of the MLP and attention-based variants. (a) is the PoolFormer structure in [53]. (c) and (e) imitate the PoolFormer structure and replace the pooling layer with a CSPLayer [56] and a depthwise convolution as token mixers, respectively. Moreover, the structures in (b), (d), and (f) replace the channel MLP module with an attention-based module from the transformer. Norm denotes normalization. ⊕ represents the channel-wise addition operation and ⊗ represents the channel-wise multiplication operation. P.E. represents positional encoding.

TABLE I. Ablation study results on YOLOv5 [37] and YOLOX-L [38]. “†” denotes that this is our re-implemented result.

3) Comparison Methods: We consider using MLP-based modules instead of attention-based ones, which perform well and are computationally less expensive. Therefore, we design a series of MLP and attention-based variants. Through the ablation study, we choose an optimal variant for our LVC mechanism, as well as for the CFP approach, called the lightweight MLP. Figure 4 (a) shows the PoolFormer structure [53], which consists of a pooling-operation sub-block and a two-layer MLP sub-block. Considering that the pooling operation corrupts detailed features, we choose convolutions that are structurally lightweight while still guaranteeing accuracy. Therefore, we designate the CSPLayer [56] and depthwise convolution as token mixers; the corresponding variants are called CSPM [53], [56] and MLP (Ours) in (c) and (e) of Figure 4, respectively. Compared with the MLP variants, the structures in (b), (d), and (f) are the corresponding attention-based variants. It is worth noting that we choose the channel MLP in the MLP variants. Then, we use convolutional position encoding to avoid the loss of translational invariance caused by absolute position encoding.

4) Training Settings: We first train our CFP on MS-COCO using pre-trained weights from the YOLOX [38] or YOLOv5 [37] backbone, where all other training parameters are similar across all models. Considering the local hardware condition, our model is trained for 150 epochs, including five epochs of learning rate warmup, as in [57]. We use 2 GeForce RTX 3090 GPUs with a batch size of 16. Our training settings remain largely consistent from the baseline to the final model. The input image training size is 640 × 640. The learning rate is set to lr × BatchSize / 64 (i.e., the linear scaling strategy [64]), where the initial learning rate is set to lr = 0.01 and the cosine lr schedule is used. The weight decay is set to 0.0005. The optimizer for the model training process is stochastic gradient descent, where the momentum is set to 0.9. Besides, following [19], we evaluate the AP every ten training epochs and report the best one on the MS-COCO [36] val set.
original image is scaled to the object size (640 × 640) and of repetitions R at the YOLOX-L baseline.
the rest of the image is filled with gray. Then, we feed the 1) Effectiveness on Different Baselines: In Table I, we per-
image into the trained model for detection. In the inference form ablation studies on the MS-COCO [36] val set using
that FPS and Latency are all measured with FP16-precision YOLOv5-L [37] and YOLOX-L [38] as baselines for the


TABLE II. Result comparisons of the intra-/inter-layer feature methods. “†” denotes that this is our re-implemented result. “-” denotes that there is no such a result in its paper.

Fig. 5. Visualization of image feature evolution based on the YOLOX-L [38] model. Compared with the input image in (a), the visualization features generated by the lightweight MLP module in (b) are very effective in capturing edge information; (c) shows the visual features generated after the LVC module, which captures the local corner regions well; (d) shows how edge information and corner information are better integrated through the EVC module.

C. Ablation Study

Our ablation study aims to investigate the effectiveness of LVC, MLP, EVC, and CFP in object detection. To this end, we perform a series of experiments on the MS-COCO val set [36]. In Table I, we analyze the effects of LVC, MLP, and EVC on the average precision, the number of parameters, the computation volume, and the Latency, using YOLOv5-L [37] and YOLOX-L [38] as the baselines, respectively. We also present the feature visualization results of the lightweight MLP, LVC, and EVC generated on the YOLOX-L [38] model and carry out a detailed analysis of Figure 5. A detailed analysis of our MLP variants and attention-based variants in terms of precision and Latency is presented in Table III, using YOLOX-L as the baseline. Table V shows the effect of the number of visual centers K on the LVC with the YOLOX-L baseline. From Table VI, we can intuitively see the impact of our CFP method on the model with the number of repetitions R at the YOLOX-L baseline.

1) Effectiveness on Different Baselines: In Table I, we perform ablation studies on the MS-COCO [36] val set using YOLOv5-L [37] and YOLOX-L [38] as baselines for the proposed MLP, LVC, and EVC, respectively. As shown in Table I, when we only use the LVC mechanism to aggregate local corner region features, while the parameters, computation volume, and Latency all remain within an acceptable growth range, the mAP of our YOLOv5-L and YOLOX-L models is improved by 1.0% and 1.3%, respectively. Furthermore, when we capture the global long-range dependencies using only the lightweight MLP structure, the mAP of the YOLOv5-L [37] and YOLOX-L [38] models improves by 0.6% and 1.3%, respectively. Most importantly, when we use both LVC and MLP (the EVC scheme) on the YOLOv5-L and YOLOX-L baselines, the mAP of both models is improved by 1.4%. Further analysis shows that when the EVC scheme is applied to the YOLOv5-L and YOLOX-L baselines, respectively, the mAP of the YOLOX-L model can be improved to 49.2%, while its parameter number and computation volume are lower than those of the YOLOv5-L model. The results show that the EVC scheme is more effective on the YOLOX-L baseline, and its overhead is slightly smaller than on the YOLOv5-L baseline. YOLOX-L is therefore used as the baseline in the subsequent ablation experiments.

In Figure 5, we show two examples of feature visualization. Firstly, the lightweight MLP can better capture long-range dependencies. For example, the edge information of the “person” and “zebras” can be comprehensively captured. After the “person” and “zebras” pass through the LVC module, the capture of local corner regions is very effective. In addition, it can be seen intuitively from the last column that the “person” and “zebras” realize the fusion of edge information and corner feature information through the EVC module.

2) Superiority of EVC: EVC provides an intra-layer feature regulation that complements the inter-layer feature interactions of feature pyramids, making up for the shortcomings of current methods in this regard. We conducted ablation experiments with inter-layer feature interactions represented by the feature pyramid approaches and intra-layer feature representations represented by the attention-based methods, respectively. The experimental results are given in Table II. To ensure the fairness and consistency of the experiments, we uniformly used ResNet-50 as the backbone and the MS-COCO val set as the dataset. In addition, we set the number of epochs to 50. The mAP, AP50, AP75, and Params are used as evaluation metrics. As shown in Table II, our CFPYOLOX-L performs significantly well compared to Non-Local [21] and DESTR [65] in the intra-layer feature representation ablation experiments. In comparison with MViTv2 [66] and MaxViT [67], although CFPYOLOX-L has no advantage in accuracy, our model consumes fewer memory resources. On inter-layer feature interactions, our CFPYOLOX-L performs significantly better: while reducing memory resource consumption, the mAP improves by 2.5% to 12.4%. The above results indicate that CFP provides a significant improvement in object detection tasks through intra-layer feature representations and inter-layer feature interactions.

3) Comparisons With MLP Variants: Table III shows the detection performance of the MLP and attention-based variants based on the YOLOX-L baseline on the MS-COCO val set. We first analyze the comparison results of the MLP variants. We can observe that the PoolFormer structure obtains the same mAP (i.e., 47.80%) as the YOLOX-L [38] model. The performance of CSPM [53], [56] is even worse than that of the YOLOX-L model, which not only reduces the average precision by 0.1% but also increases the Latency by 0.74 ms. In contrast, our proposed lightweight MLP structure obtains the highest mAP (i.e., 49.10%) among the MLP variants, which is 1.3% better than the mAP of YOLOX-L. The result demonstrates that our choice of depthwise convolution as the token mixer in the MLP variant performs better. Among the attention-based variants, the performance of PoolA, CSPA [22], [56], and DWA [22], [58] is improved compared to YOLOX-L, and the mAP of DWA can reach 49.20%. But in fact, comparing the two best-performing structures (MLP and DWA), we find that the Latency of DWA increases by 2.84 ms over MLP (Ours) in the same hardware environment. From Table III, it can be found that our lightweight MLP is not only better but also faster in capturing long-range dependencies.

Besides, we also use YOLOX-L as the baseline, and an MLP-Mixer [51] module or an attention [22] module is built behind the backbone of YOLOX-L [38], respectively. Results are shown in Table IV.


TABLE III. Result comparisons of our lightweight MLP (Ours)YOLOX-L with the MLP variants and the self-attention variants on the MS-COCO val set [36]. “†” is our re-implementation result.

TABLE IV. Result comparisons of our lightweight MLP (Ours)YOLOX-L with the standard MLP-Mixer and the attention on the MS-COCO val set [36]. “†” denotes that this is our re-implemented result.

TABLE V. Effect of the number of visual centers K on LVCYOLOX-L. “-” denotes that there is no such a setting. “†” is our re-implementation result.

Comparison with the standard MLP-Mixer [51] and the attention-based methods shows that the mAP of the lightweight MLP increases by 0.6% to 1.0% with lower computational costs. We can conclude that our proposed lightweight MLP performs remarkably well in capturing the long-range dependencies of intra-layer features.

TABLE VI. Result comparisons of the number of repetitions of the proposed CFP on the YOLOX-L baseline. “-” denotes that there is no such a setting. “†” is our re-implementation result. R denotes the number of repetitions.

4) Effect of K: As shown in Table V, we analyze the effect of the number of visual centers K on the performance of LVC. We choose YOLOX-L as the baseline, and with increasing K, we can observe that its performance shows an increasing trend. At the same time, the parameter number, computation volume, and Latency of the model also tend to increase gradually. Notably, when K = 64, the mAP of the model reaches 49.10%, and when K = 128, the mAP reaches 49.20%. Although the performance of the model improves by 0.1% as K increases, the extra computational cost increases by 10.01 G, and the corresponding inference time increases by 3.21 ms. This may be because too many visual centers bring more redundant semantic information: not only is the performance not significantly improved, but the computational effort is increased. So we choose K = 64.

5) Effect of R: In Table VI, we analyze the effect of the number of repetitions R of CFP on the performance. We still choose the YOLOX-L baseline, and as R increases, we observe that the performance first increases, then decreases, and then stabilizes compared to the YOLOX-L model. Meanwhile, the number of parameters, computation volume, and Latency gradually increase. In particular, when R = 1, CFPYOLOX-L achieves the best mAP of 49.40%. When R = 2, the performance is instead reduced by 0.2% compared to R = 1. The reason may be that redundant feature information is not helpful for the task but increases the computational cost. Therefore, based on the above observations, we choose R = 1.

D. Efficiency Analysis

We analyze the performance of the MLP variants and attention-based variants from a multi-metric perspective. In Figure 6, all models take YOLOX-L [38] as the baseline and are trained on the MS-COCO [36] val set with the same data augmentation settings. Meanwhile, to demonstrate the effectiveness of the MLP structure, as shown in Table VII, we compare it with the state-of-the-art transformer methods and the MLP methods at this stage. As can be observed from Figure 6, the MLP (Ours) structure is significantly better than the other structures in terms of mAP, and it is lower than the other structures in terms of the number of parameters, computation volume, and inference time.


TABLE VII. Result comparisons of our lightweight MLP with transformer variants and MLP variant methods.

TABLE VIII. Result comparisons with YOLOv5 and YOLOX. “†” is our re-implementation result. “-” denotes that there is no such a setting.

This shows that the MLP structure can guarantee a lower number of parameters and a lower computation volume while obtaining better precision.

In Table VII, we give the comparative results of the MLP and transformer methods that are outstanding performers in object detection tasks at this stage. In the first half of Table VII, our MLPYOLOX-L method occupies less memory and has an average precision 1.3% higher than Mask R-CNN (with AS-MLP-S [71] as the backbone). In the middle part of Table VII, we find that our MLPYOLOX-L can improve the mAP by up to 7.1% compared to the transformer method (DETR [25]) without extra computational cost. With the same mAP, the number of parameters of MLP is reduced by 62.43 M compared to REGO-Deformable DETR [72]. Therefore, we find that MLP not only has high precision but also takes up less memory compared to the transformer methods. All in all, our MLP has outstanding performance in capturing long-range feature dependencies.

Fig. 6. Multi-metric comparison results between the MLP variants and attention-based variants based on the MS-COCO [36] val set.

E. Comparisons With State-of-the-Art Methods

As shown in Table VIII, we validate the proposed CFP method on the MS-COCO val set with YOLOv5 (Small, Medium, and Large) and YOLOX (Small, Medium, and Large) as baselines. In addition, the data in Table IX show the comparison results of our CFP method against advanced single-stage and two-stage detectors. Finally, we offer some visual comparison plots in Figure 7.


TABLE IX. Comparison of the speed and accuracy of different object detectors on the MS-COCO val set. We select all the models trained for 150 epochs for a fair comparison. “†” is our re-implementation result. “-” denotes that there is no such a setting.

1) Comparisons With the YOLOv5 and YOLOX Baselines: As shown in Table VIII, when YOLOv5 is chosen as the baseline, the mAP of our CFP method is enhanced by 0.5%, 0.5%, and 1.4% on the Small, Medium, and Large size models, respectively. When YOLOX [38] is used as the baseline, the mAP improves by 7.0%, 0.8%, and 1.6% on the backbone networks of different sizes. It is worth noting that the main reason why we choose YOLOv5 (an anchor-based mechanism) and YOLOX (an anchor-free mechanism) as the baselines is that the reciprocity of these two models in terms of network structure can fully demonstrate the effectiveness of our CFP approach. Most importantly, compared with YOLOX, although the network structure of YOLOv5 has no advantage, the mAP of our CFP method can still reach 46.60%. Meanwhile, our mAP reaches 49.40% on the YOLOX baseline. Moreover, CFPYOLOX on the small backbone network improves over YOLOX [38] by 7.0%. The main reason for this is that the LVC in our CFP can enhance the feature representations of the local corner regions through visual centers at the pixel level.

2) Comparisons on Speed and Accuracy: We perform a series of comparisons on the MS-COCO val set with single-stage and two-stage detectors, and the results are shown in Table IX.


Fig. 7. Qualitative results on the test set of MS-COCO 2017 [36]. We show the object detection results from the baseline and our approaches for comparison.

We first consider the two-stage object detection models, including the Faster R-CNN series with different backbone networks, Mask R-CNN, and D2Det. Our CFPYOLOX-L model has significant advantages in precision, inference speed, and time. Next, we divide the single-stage detection methods into three parts in chronological order and analyze them. There is no doubt that the proposed CFPYOLOX-L method improves the mAP by up to 27.80% compared to YOLOv3-ultralytics [44] and its previous detectors. With nearly the same average precision, CFPYOLOv5-M infers 1.5 times faster than the EfficientDet-D2 detector. Comparing CFPYOLOX-L with EfficientDet-D3, the average accuracy is improved by 1.9%, and the inference speed is 1.8 times higher. In addition, in comparison with the YOLOv4 [45] series, it can be found that the mAP of CFPYOLOv5-L is improved by 2.7% compared to YOLOv4-CSP [80]. Besides, considering all scaled YOLOv5 [37] models, including YOLOv5-S [37], YOLOv5-M [37], and YOLOv5-L [37], the average precision of the best YOLOv5-L [37] model is 1.4% lower than that of CFPYOLOv5-L. In the same way, our CFP method obtains a maximum average accuracy of 49.40%, which is 1.6% higher than YOLOX-L.

3) Qualitative Results on the MS-COCO [36] 2017 test Set: In addition, we show in Figure 7 some visualization results of the baseline (YOLOX-L [38]), EVCYOLOX-L, and CFPYOLOX-L on the MS-COCO [36] test set. It is worth noting that we use white, red, and orange boxes to mark where the detection task fails. White boxes indicate misses due to occlusion, light influence, or small object size. Red boxes indicate detection errors due to insufficient contextual semantic relationships, e.g., causing one object to be detected as two objects. The yellow boxes indicate an error in the object classification. In the first line, the detection result of YOLOX-L in the part marked by the white box is not ideal due to the distance of the “zebra”, while EVCYOLOX-L can partially detect the “zebra” at a distance. Therefore, it is intuitively proved that EVC is very effective for small object detection in some dense detection tasks. In the second line of the figure, YOLOX-L does not fully detect the “cups” in the cabinet due to factors such as occlusion and illumination. The EVCYOLOX-L model alleviates this problem by using MLP structures to capture the long-range dependencies of the features in the object. Finally, the CFPYOLOX-L model uses the GCR-assisted EVC scheme and gets better results. In the third line of the figure, the CFPYOLOX-L model performs better in complex scenarios: based on the EVC scheme, GCR is used to adjust intra-layer features top-down, and CFPYOLOX-L can better solve the classification problem.

V. CONCLUSION AND FUTURE WORK

In this work, we proposed a CFP for object detection, which was based on a globally explicit centralized feature regulation. We first proposed a spatial explicit visual center scheme, where a lightweight MLP was used to capture the globally long-range dependencies and a parallel learnable visual center was used to capture the local corner regions of the input images. Based on the proposed EVC, we then proposed a GCR for a feature pyramid in a top-down manner, where the explicit visual center information obtained from the deepest intra-layer feature was used to regulate all frontal shallow features. Compared to the existing methods, CFP not only has the ability to capture the global long-range dependencies, but also efficiently obtains an all-round yet discriminative feature representation. Experimental results on the MS-COCO dataset verified that our CFP can achieve consistent performance gains on the state-of-the-art object detection baselines. CFP is a generalized approach that can not only extract global long-range dependencies of the intra-layer features but also preserve the local corner


ACKNOWLEDGMENT

The code and weights have been released at: CFPNet.

REFERENCES

[1] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019.
[2] M. Treml et al., "Speeding up semantic segmentation for autonomous driving," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2016, pp. 1–7.
[3] M. Havaei et al., "Brain tumor segmentation with deep neural networks," Med. Image Anal., vol. 35, pp. 18–31, Jan. 2017.
[4] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[5] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[6] W. Liu et al., "SSD: Single shot MultiBox detector," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 1–17.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[8] D. Zhang, L. Zhang, and J. Tang, "Augmented FCN: Rethinking context modeling for semantic segmentation," Sci. China Inf. Sci., vol. 66, no. 4, Apr. 2023, Art. no. 142105.
[9] D. Zhang, J. Tang, and K.-T. Cheng, "Graph reasoning transformer for image parsing," in Proc. 30th ACM Int. Conf. Multimedia, Oct. 2022, pp. 2380–2389.
[10] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, Aug. 2014.
[11] S. Vashishth, S. Sanyal, V. Nitin, N. Agrawal, and P. Talukdar, "InteractE: Improving convolution-based knowledge graph embeddings by increasing feature interactions," in Proc. AAAI Conf. Artif. Intell. (AAAI), 2020, pp. 1–8.
[12] H. Zhang et al., "Context encoding for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7151–7160.
[13] H. Zhang, H. Zhang, C. Wang, and J. Xie, "Co-occurrent features in semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 548–557.
[14] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10778–10787.
[15] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "NAS-FPN: Learning scalable feature pyramid architecture for object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7029–7038.
[16] K. Chen, Y. Cao, C. C. Loy, D. Lin, and C. Feichtenhofer, "Feature pyramid grids," 2020, arXiv:2004.03580.
[17] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, "Feature pyramid transformer," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 323–329.
[18] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8759–8768.
[19] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 936–944.
[20] M. Yin et al., "Disentangled non-local neural networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 191–207.
[21] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7794–7803.
[22] A. Vaswani et al., "Attention is all you need," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 1–11.
[23] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, and P.-M. Jodoin, "Non-local deep features for salient object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6593–6601.
[24] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1971–1980.
[25] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 213–229.
[26] A. Dosovitskiy et al., "An image is worth 16 × 16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[27] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[28] W. Wang et al., "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 548–558.
[29] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár, "Designing network design spaces," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10425–10433.
[30] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929.
[31] L. Ru, Y. Zhan, B. Yu, and B. Du, "Learning affinity from attention: End-to-end weakly-supervised semantic segmentation with transformers," 2022, arXiv:2203.02664.
[32] R. Li, Z. Mai, C. Trabelsi, Z. Zhang, J. Jang, and S. Sanner, "TransCAM: Transformer attention-based CAM refinement for weakly supervised semantic segmentation," 2022, arXiv:2203.07239.
[33] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, "Segmenter: Transformer for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 7242–7252.
[34] F. Zhu, Y. Zhu, L. Zhang, C. Wu, Y. Fu, and M. Li, "A unified efficient pyramid transformer for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 2667–2677.
[35] D. Zhang, H. Zhang, J. Tang, X.-S. Hua, and Q. Sun, "Self-regulation for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 6933–6943.
[36] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 213–229.
[37] G. Jocher et al., "Ultralytics/YOLOv5: V4.0-nn.SiLU() activations, weights & biases logging, PyTorch hub integration (version v4.0)," Jan. 2021. [Online]. Available: https://zenodo.org/record/4418161#.YFCo2nHvfC0
[38] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," 2021, arXiv:2107.08430.
[39] Q. Zhao et al., "M2Det: A single-shot object detector based on multi-level feature pyramid network," in Proc. Conf. Artif. Intell. (AAAI), 2019, pp. 1–8.
[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Inf. Process. Syst. (NeurIPS), vol. 25, no. 6, 2017, pp. 80–90.
[41] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587.
[42] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2016, pp. 1–9.
[43] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[44] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[45] A. Bochkovskiy, C.-Y. Wang, and H.-Y. Mark Liao, "YOLOv4: Optimal speed and accuracy of object detection," 2020, arXiv:2004.10934.
[46] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 10347–10357.
[47] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," 2020, arXiv:2010.04159.


[48] Y. Chen, N. Ashizawa, C. K. Yeo, N. Yanai, and S. Yean, "Multi-scale self-organizing map assisted deep autoencoding Gaussian mixture model for unsupervised intrusion detection," Knowl.-Based Syst., vol. 224, Jul. 2021, Art. no. 107086.
[49] J. Beal, E. Kim, E. Tzeng, D. Huk Park, A. Zhai, and D. Kislyuk, "Toward transformer-based object detection," 2020, arXiv:2012.09958.
[50] S. Zheng et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6877–6886.
[51] I. Tolstikhin et al., "MLP-mixer: An all-MLP architecture for vision," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2021, pp. 1–12.
[52] H. Liu, Z. Dai, D. R. So, and Q. V. Le, "Pay attention to MLPs," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2021, pp. 1–12.
[53] W. Yu et al., "MetaFormer is actually what you need for vision," 2021, arXiv:2111.11418.
[54] Q. Hou, Z. Jiang, L. Yuan, M.-M. Cheng, S. Yan, and J. Feng, "Vision permutator: A permutable MLP-like architecture for visual recognition," 2021, arXiv:2106.12368.
[55] Z. Peng et al., "Conformer: Local features coupling global representations for visual recognition," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 357–366.
[56] C.-Y. Wang, H.-Y. Mark Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 1571–1580.
[57] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[58] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[59] G. Larsson, M. Maire, and G. Shakhnarovich, "FractalNet: Ultra-deep neural networks without residuals," 2016, arXiv:1605.07648.
[60] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, "A simple pooling-based design for real-time salient object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3912–3921.
[61] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2018, pp. 1–13.
[62] P. Ramachandran, B. Zoph, and Q. V. Le, "Swish: A self-gated activation function," 2017, arXiv:1710.05941.
[63] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[64] P. Goyal et al., "Accurate, large minibatch SGD: Training ImageNet in 1 hour," 2017, arXiv:1706.02677.
[65] L. He and S. Todorovic, "DESTR: Object detection with split transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 9367–9376.
[66] Y. Li et al., "MViTv2: Improved multiscale vision transformers for classification and detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 4794–4804.
[67] Z. Tu et al., "MaxViT: Multi-axis vision transformer," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 1–31.
[68] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, and J. Sun, "You only look one-level feature," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Aug. 2021, pp. 13034–13043.
[69] Z. Jin, D. Yu, L. Song, Z. Yuan, and L. Yu, "You should look at all objects," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 332–349.
[70] Y.-H. Wu, Y. Liu, X. Zhan, and M.-M. Cheng, "P2T: Pyramid pooling transformer for scene understanding," IEEE Trans. Pattern Anal. Mach. Intell., early access, Aug. 30, 2022, doi: 10.1109/TPAMI.2022.3202765.
[71] D. Lian, Z. Yu, X. Sun, and S. Gao, "AS-MLP: An axial shifted MLP architecture for vision," 2021, arXiv:2107.08391.
[72] Z. Chen, J. Zhang, and D. Tao, "Recurrent glimpse-based decoder for detection with transformer," 2021, arXiv:2112.04632.
[73] Y. Fang et al., "You only look at one sequence: Rethinking transformer in vision through object detection," in Proc. Neural Inf. Process. Syst. (NeurIPS), 2021, pp. 26183–26197.
[74] H. Song et al., "ViDT: An efficient and effective fully transformer-based object detector," 2021, arXiv:2110.03921.
[75] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5987–5995.
[76] J. Cao, H. Cholakkal, R. M. Anwer, F. S. Khan, Y. Pang, and L. Shao, "D2Det: Towards high quality object detection and instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11485–11494.
[77] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2999–3007.
[78] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6517–6525.
[79] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," 2017, arXiv:1701.06659.
[80] C.-Y. Wang, A. Bochkovskiy, and H. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13024–13033.
[81] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn. (ICML), 2019, pp. 6105–6114.

Yu Quan received the B.S. and M.S. degrees in computer science and technology from Guangxi Normal University, China. She is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. Her research interests include deep learning, object detection, and their applications. In these areas, she has published several journal articles and conference papers, including NCA, IEEE Access, PRICAI, and ICPR.

Dong Zhang (Member, IEEE) received the Ph.D. degree in computer science and technology from the Nanjing University of Science and Technology in 2021. He is currently a Postdoctoral Research Scientist with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology. His research interests include machine learning, image processing, computer vision, and medical image analysis, especially in image classification, semantic image segmentation, object detection, and their potential applications. In these areas, he has published several journal articles and conference papers, including IEEE Transactions on Cybernetics, PR, AAAI, ACM MM, IJCAI, ECCV, ICCV, and NeurIPS.

Liyan Zhang received the Ph.D. degree in computer science from the University of California at Irvine, Irvine, CA, USA, in 2014. She is currently a Professor with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China. Her research interests include multimedia analysis, computer vision, and deep learning. She received the Best Paper Award from the International Conference on Multimedia Retrieval (ICMR) 2013 and the Best Student Paper Award from the International Conference on Multimedia Modeling (MMM) 2016.

Jinhui Tang (Senior Member, IEEE) received the B.E. and Ph.D. degrees from the University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively. He is currently a Professor with the Nanjing University of Science and Technology, Nanjing, China. He has authored more than 200 papers in top tier journals and conferences. His research interests include multimedia analysis and computer vision. He is a fellow of IAPR. He was a recipient of the Best Paper Award from ACM MM 2007 and ACM MM Asia 2020 and the Best Paper Runner-Up from ACM MM 2015. He has served as an Associate Editor for the IEEE Transactions on Neural Networks and Learning Systems, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Multimedia, and IEEE Transactions on Circuits and Systems for Video Technology.
