

Transformer-Based Visual Segmentation: A Survey


Xiangtai Li, Henghui Ding, Haobo Yuan, Wenwei Zhang, Jiangmiao Pang, Guangliang Cheng,
Kai Chen, Ziwei Liu, and Chen Change Loy, Senior Member, IEEE

(Survey Paper)

Abstract—Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups. This technique has numerous real-world applications, such as autonomous driving, image editing, robot sensing, and medical analysis. Over the past decade, deep learning-based methods have made remarkable strides in this area. Recently, transformers, a type of neural network based on self-attention originally designed for natural language processing, have considerably surpassed previous convolutional or recurrent approaches in various vision processing tasks. Specifically, vision transformers offer robust, unified, and even simpler solutions for various segmentation tasks. This survey provides a thorough overview of transformer-based visual segmentation, summarizing recent advancements. We first review the background, encompassing problem definitions, datasets, and prior convolutional methods. Next, we summarize a meta-architecture that unifies all recent transformer-based approaches. Based on this meta-architecture, we examine various method designs, including modifications to the meta-architecture and associated applications. We also present several specific subfields, including 3D point cloud segmentation, foundation model tuning, domain-aware segmentation, efficient segmentation, and medical segmentation. Additionally, we compile and re-evaluate the reviewed methods on several well-established datasets. Finally, we identify open challenges in this field and propose directions for future research.

Index Terms—Vision transformer review, dense prediction, image segmentation, video segmentation, scene understanding.

Manuscript received 2 June 2023; revised 17 April 2024; accepted 22 July 2024. Date of publication 29 July 2024; date of current version 5 November 2024. This work was supported in part by The Alan Turing Institute (U.K.) through the project 'Turing-DSO Labs Singapore Collaboration' under Grant SDCfP2\100009. This study was also supported under the RIE2020 Industry Alignment Fund Industry Collaboration Projects (IAF-ICP) Funding Initiative and Singapore MOE AcRF Tier 1 under Grant RG16/21, as well as cash and in-kind contributions from the industry partner(s). Recommended for acceptance by N. Sebe. (Corresponding author: Guangliang Cheng.)

Xiangtai Li, Haobo Yuan, Wenwei Zhang, Ziwei Liu, and Chen Change Loy are with S-Lab, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]; [email protected]). Henghui Ding is with the Institute of Big Data, Fudan University, Shanghai 200437, China (e-mail: [email protected]). Jiangmiao Pang and Kai Chen are with the Shanghai AI Laboratory, Shanghai 200240, China (e-mail: [email protected]; [email protected]). Guangliang Cheng is with the University of Liverpool, L69 7ZX Liverpool, U.K. (e-mail: [email protected]).

The project page can be found at https://github.com/lxtGH/Awesome-Segmentation-With-Transformer. This article has supplementary downloadable material available at https://doi.org/10.1109/TPAMI.2024.3434373, provided by the authors. Digital Object Identifier 10.1109/TPAMI.2024.3434373.

I. INTRODUCTION

VISUAL segmentation aims to group the pixels of a given image or video into a set of semantic regions. It is a fundamental problem in computer vision and has numerous real-world applications, such as robotics, automated surveillance, image/video editing, social media, and autonomous driving. Starting from hand-crafted features [1], [2] and classical machine learning models [3], [4], [5], segmentation has attracted a great deal of research effort. Over the last ten years, deep neural networks, in particular Convolutional Neural Networks (CNNs) [6], [7], [8] such as Fully Convolutional Networks (FCNs) [9], [10], [11], [12], have achieved remarkable success on different segmentation tasks and led to much better results. Compared with traditional segmentation approaches, CNN-based approaches generalize better, and because of their exceptional performance, CNN and FCN architectures have become the basic components of segmentation research.

Recently, following its success in natural language processing (NLP), the transformer [13] was introduced as a replacement for recurrent neural networks [14]. The transformer builds on a novel self-attention design and can process tokens in parallel. Based on this design, BERT [15] and GPT-3 [16] scale up the model parameters and pre-train on huge amounts of unlabeled text, achieving strong performance on many NLP tasks and accelerating the adoption of transformers in the vision community. Researchers then applied transformers to computer vision (CV) tasks. Early methods [17], [18] combine self-attention layers with CNNs, while several works [19], [20] use pure self-attention layers to replace convolution layers. After that, two remarkable methods boosted CV tasks. One is the vision transformer (ViT) [21], a pure transformer that directly takes sequences of image patches to classify the full image and achieves state-of-the-art performance on multiple image recognition datasets. The other is the detection transformer (DETR) [22], which introduces the concept of the object query, where each object query represents one instance. The object query replaces the complex anchor design of previous detection frameworks, which simplifies the pipeline of detection and segmentation. Follow-up works then adopt improved designs for various vision tasks, including representation learning [23], [24], object detection [25], segmentation [26], low-level image processing [27], video understanding [28], 3D scene understanding [29], and image/video generation [30].

As for visual segmentation, recent state-of-the-art methods are all based on transformer architectures.


Fig. 1. A diagram that summarizes this survey. Different colors represent specific sections. Best viewed in color.

Compared with CNN-based approaches, most transformer-based approaches have simpler pipelines but stronger performance. Because of the rapid upsurge of transformer-based vision models, there are several surveys on vision transformers [31], [32], [33]. However, most of them mainly focus on general transformer design and its application to a few specific vision tasks [34], [35], [36]. Meanwhile, there are earlier surveys on deep-learning-based segmentation [37], [38], [39]. However, to the best of our knowledge, there is no survey focusing on vision transformers for visual segmentation or on query-based object detection. We believe it would be beneficial for the community to summarize these works and keep tracking this evolving field.

• Contribution: In this survey, we systematically introduce recent advances in transformer-based visual segmentation methods. We start by defining the tasks, datasets, and CNN-based approaches, and then move on to transformer-based approaches, covering existing methods and future research directions. Our survey groups existing representative works from a technical perspective, i.e., by their method details. In particular, for the main review part, we first summarize the core framework of existing approaches into a meta-architecture in Section III-A, which is an extension of DETR [22]. By changing the components of this meta-architecture, we divide existing approaches into six categories in Section III-B, including Representation Learning, Interaction Design in Decoder, Optimizing Object Query, Using Query For Association, and Conditional Query Generation. Moreover, we also survey closely related subfields, including point cloud segmentation, tuning foundation models, domain-aware segmentation, data/model-efficient segmentation, class-agnostic segmentation and tracking, and medical segmentation. We also evaluate the performance of influential works published in top-tier conferences and journals on several widely used segmentation benchmarks. Additionally, we provide an overview of previous CNN-based models and relevant literature in other areas, such as object detection, object tracking, and referring segmentation, in the background section.

• Scope: This survey covers several mainstream segmentation tasks, including semantic segmentation, instance segmentation, panoptic segmentation, and their variants, such as video and point cloud segmentation. Additionally, we cover related subfields in Section IV. We focus on transformer-based approaches and only review a few closely related CNN-based approaches for reference. Although there are many preprints and published works, we only include the most representative ones.

• Organization: The rest of the survey is organized as follows. Overall, Fig. 1 shows the pipeline of our survey. We first introduce background knowledge on problem definitions, datasets, and CNN-based approaches in Section II. Then, we review representative papers on transformer-based segmentation methods in Sections III and IV. We compare experimental results in Section V. Finally, we raise future directions in Section VI and conclude the survey in Section VII. We provide more benchmarks and details in the appendix, available online.

II. BACKGROUND

In this section, we first present a unified problem definition of the different segmentation tasks. Then, we detail the common datasets and evaluation metrics. Next, we summarize previous approaches before the transformer. Finally, we review basic concepts of transformers. To facilitate understanding of this survey, we list the brief notations in Table I for reference.

TABLE I: NOTATION AND ABBREVIATIONS USED IN THIS SURVEY

A. Problem Definition

• Image Segmentation: Given an input image I ∈ R^{H×W×3}, the goal of image segmentation is to output a group of masks {y_i}_{i=1}^{G} = {(m_i, c_i)}_{i=1}^{G}, where c_i denotes the ground-truth class label of the binary mask m_i, G is the number of masks, and H × W is the spatial size. According to the scope of class labels and masks, image segmentation can be divided into three different tasks: semantic segmentation (SS), instance segmentation (IS), and panoptic segmentation (PS), as shown in Fig. 2(a). For SS, the classes may be foreground objects (thing) or background (stuff), and each class has only one binary mask that indicates the pixels belonging to this class; SS masks do not overlap with one another. For IS, each class may have more than one binary mask, all classes are foreground objects, and some IS masks may overlap with others. For PS, depending on the class definition, each class may have a different number of masks: each countable thing class may have multiple masks for different instances, while each uncountable stuff class has only one mask. PS masks do not overlap with one another.

One can also understand image segmentation from the pixel view. Given an input I ∈ R^{H×W×3}, the output of image segmentation is a two-channel dense segmentation map S = {(k_j, c_j)}_{j=1}^{H×W}, where k_j indicates the identity of pixel j and c_j is the class label of pixel j. For SS, the identities of all pixels are zero. For IS, each instance has a unique identity. For PS, the pixels belonging to thing classes have a unique identity, while the pixel identities of stuff classes are zero. From both perspectives, PS unifies SS and IS. We present visual examples in Fig. 2.

Fig. 2. Illustration of different segmentation tasks. The examples are sampled from the VIP-Seg dataset [40]. For (V)SS, the same color indicates the same class. For (V)IS and (V)PS, different instances are represented by different colors.
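
To make the two-channel (identity, class) formulation concrete, the following minimal NumPy sketch derives semantic and instance outputs from a panoptic per-pixel map. The toy arrays and variable names are ours, purely for illustration, and do not come from any dataset or method in this survey.

```python
import numpy as np

# Hypothetical panoptic annotation for a 4x4 image: per-pixel class label c_j and
# per-pixel identity k_j (0 for stuff pixels, unique positive ids for thing instances).
seg_class = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [2, 2, 1, 1],
                      [2, 2, 2, 2]])          # c_j: class label of pixel j
seg_id = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 2, 2],
                   [0, 0, 0, 0]])             # k_j: identity of pixel j (two instances of class 1)

# Semantic segmentation keeps only the class channel (all identities treated as zero).
semantic_map = seg_class

# Instance segmentation keeps one binary mask per unique thing identity.
instance_masks = {i: (seg_id == i) for i in np.unique(seg_id) if i != 0}

# Panoptic segmentation is exactly the pair (k_j, c_j), so it unifies both views.
print(semantic_map.shape, len(instance_masks))
```
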
• Video Segmentation: Given a video clip V ∈ R^{T×H×W×3}, where T is the number of frames, the goal of video segmentation is to obtain a set of mask tubes {y_i}_{i=1}^{N} = {(m_i, c_i)}_{i=1}^{N}, where N is the number of tube masks, m_i ∈ {0, 1}^{T×H×W}, and c_i denotes the class label of tube m_i. Video panoptic segmentation (VPS) requires temporally consistent segmentation and tracking results for each pixel. Each tube mask is classified into a countable thing class or an uncountable stuff class, and each thing tube mask also has a unique ID for evaluating tracking performance; for stuff masks, the tracking ID is zero by default. When N = C, the task contains only stuff classes, and thing classes have no IDs, VPS turns into video semantic segmentation (VSS). If the tubes {y_i}_{i=1}^{N} may overlap and C contains only thing classes while all stuff classes are ignored, VPS turns into video instance segmentation (VIS). We present visual examples that summarize the differences among VPS, VIS, and VSS with T = 2 in Fig. 2(b).

• Related Problems: Object detection and instance-wise segmentation (IS/VIS/VPS) are closely related tasks. Object detection involves predicting object bounding boxes, which can be considered a coarse form of IS. After the introduction of the DETR model, many works have treated object detection and IS as the same task, since IS can be achieved by adding a simple mask prediction head to object detection. Similarly, video object detection (VOD) aims to detect objects in every video frame. In our survey, we also examine query-based object detectors for both object detection and VOD. Point cloud segmentation is another segmentation task, where the goal is to assign each point in a point cloud to a pre-defined category. We can apply the same definitions of semantic, instance, and panoptic segmentation to this task, resulting in point cloud semantic segmentation (PCSS), point cloud instance segmentation (PCIS), and point cloud panoptic segmentation (PCPS). Referring segmentation aims to segment objects described by a natural language text input. It has two subtasks: referring image segmentation (RIS), which performs language-driven segmentation, and referring video object segmentation (RVOS), which segments and tracks a specific object in a video based on the given text input. Finally, video object segmentation (VOS) involves tracking an object in a video by predicting pixel-wise masks in every frame, given a mask of the object in the first frame.

B. Datasets and Metrics

• Commonly Used Datasets: For image segmentation, the most commonly used datasets are COCO [43], ADE20K [44], and Cityscapes [45]. For video segmentation, the most used datasets are VSPW [49] and YouTube-VIS [50]. We compare results on several of these datasets in Section V. More datasets are listed in Table II.

TABLE II: COMMONLY USED DATASETS AND METRICS FOR TRANSFORMER-BASED SEGMENTATION

• Common Metric: For SS and VSS, the commonly used metric is mean intersection over union (mIoU), which calculates the pixel-wise intersection over union between the output image or video masks and the ground-truth masks. For IS, the metric is mask mean average precision (mAP), which extends the object detection metric by replacing box IoU with mask IoU. For VIS, the metric is 3D mAP, which extends mask mAP in a spatial-temporal manner. For PS, the metric is panoptic quality (PQ), which unifies thing and stuff prediction by setting a fixed IoU threshold of 0.5. For VPS, the commonly used metrics are video panoptic quality (VPQ) and segmentation and tracking quality (STQ). The former extends PQ to a temporal window, while the latter decouples segmentation and tracking in a per-pixel manner. Note that there are other metrics, including pixel accuracy and temporal consistency. For simplicity, we only report the primary metrics used in the literature. We present the detailed formulation of these metrics in the supplementary material.
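
As a concrete reference for the most widely used metric, the snippet below is a minimal NumPy sketch of mIoU; it follows the common per-class intersection-over-union definition rather than any specific benchmark toolkit, and the toy label maps are illustrative only.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between predicted and ground-truth label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example: two 2x3 label maps with three classes.
pred = np.array([[0, 1, 1], [2, 2, 0]])
gt   = np.array([[0, 1, 2], [2, 2, 0]])
print(mean_iou(pred, gt, num_classes=3))
```
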
C. Segmentation Approaches Before Transformer

• Semantic Segmentation: Prior to the emergence of ViT and DETR, SS was typically approached as a dense pixel classification problem, as initially proposed by FCN. Subsequent works are all based on the FCN framework and can be divided into the following aspects: better encoder-decoder frameworks [58], [59], larger kernels [60], [61], multiscale pooling [11], [62], multiscale feature fusion [12], [63], [64], [65], non-local modeling [18], [66], [67], efficient modeling [68], [69], [70], and better boundary delineation [71], [72], [73], [74]. After the transformer was proposed, with the goal of global context modeling, several works designed variants of self-attention operators to replace CNN prediction heads [66], [75].

• Instance Segmentation: IS aims to detect and segment each object, which goes beyond object detection. Most IS approaches focus on how to represent instance masks beyond object detection and can be divided into two categories: top-down approaches [76], [77] and bottom-up approaches [78], [79]. The former extend an object detector with an extra mask head, with various mask head designs including FCN heads [76], [80], diverse mask encodings [81], and dynamic kernels [77], [82]. The latter perform instance clustering on semantic segmentation maps to form instance masks. The performance of top-down approaches is closely related to the choice of detector [83], while bottom-up approaches depend on both the semantic segmentation results and the clustering methods [84]. Besides, several approaches [85], [86] use grid representations to learn instance masks directly. The ideas of using kernels and different mask encodings have also been extended to several transformer-based approaches, which are detailed in Section III.

• Panoptic Segmentation: Previous works for PS mainly focus on how to fuse the results of SS and IS, treating PS as two independent tasks. Based on the IS subtask, previous works can also be divided into two categories, top-down approaches [87], [88] and bottom-up approaches [84], [89], according to the way instance masks are generated. Several works use a shared backbone with multitask heads to jointly learn IS and SS, focusing on mutual task association. Meanwhile, several bottom-up approaches [84], [89] use a sequential pipeline that performs instance clustering on semantic segmentation results and then fuses both. In summary, most PS methods involve complex pipelines and are highly engineered.

• Video Segmentation: Research on VSS mainly focuses on better spatial-temporal fusion [90] or on acceleration using extra cues [91], [92] in the video. VIS requires segmenting and tracking each instance. Most VIS approaches [52], [93], [94], [95] focus on learning instance-wise spatial and temporal relations and feature fusion, and several works learn 3D temporal embeddings. Like PS, VPS [52] approaches can also be top-down [52] or bottom-up [96]. The top-down approaches learn to link temporal features and then perform instance association online. In contrast, the bottom-up approaches predict the center map of the neighboring frame and perform instance association in a separate stage. Most of these approaches are highly engineered. For example, MaskProp [93] combines a state-of-the-art IS model [80], deformable CNNs [97], and offline mask propagation in one system. There are also several other video segmentation tasks, including video object segmentation (VOS) [56], [98], referring video segmentation [57], and multi-object tracking and segmentation (MOTS) [99].

• Point Cloud Segmentation: This task aims to group point clouds into semantic or instance categories, similar to image and video segmentation. Depending on the input scene, it is typically categorized into indoor and outdoor settings. Indoor scene segmentation mainly includes point cloud semantic segmentation (PSS) and point cloud instance segmentation (PIS). PSS is commonly achieved using PointNet [100], [101], while PIS can be achieved through two types of approaches: top-down [102], [103] and bottom-up [104], [105]. The former extract 3D bounding boxes and use a mask learning branch to predict masks, while the latter predict semantic labels and utilize point embeddings to group points into different instances. For outdoor scenes, point cloud segmentation can be divided into point-based [100], [106] and voxel-based [107], [108] approaches. Point-based methods process individual points, while voxel-based methods divide the point cloud into 3D grids and apply 3D convolution. As in panoptic segmentation, most 3D panoptic segmentation methods [109], [110], [111], [112], [113] first predict semantic segmentation results, separate instances based on these predictions, and fuse the two results to obtain the final output.

D. Transformer Basics

• Vanilla Transformer: The vanilla transformer [13] is a seminal model in the transformer-based research field. It is an encoder-decoder structure that takes tokenized inputs and consists of stacked transformer blocks. Each block has two sub-layers: a multi-head self-attention (MHSA) layer and a position-wise fully-connected feed-forward network (FFN). The MHSA layer allows the model to attend to different parts of the input sequence, while the FFN processes the output of the MHSA layer. Both sub-layers use residual connections and layer normalization for better optimization.

In the vanilla transformer, the encoder and decoder use the same block architecture. However, the decoder additionally includes a mask that prevents it from attending to future tokens during training. The model also uses sine and cosine functions to produce positional embeddings, which allow it to capture the order of the input sequence. Subsequent models such as BERT and GPT-2 have built upon this architecture and achieved state-of-the-art results on a wide range of natural language processing tasks.

• Self-Attention: The core operator of the vanilla transformer is the self-attention (SA) operation. Suppose the input is a set of tokens X = [x_1, x_2, ..., x_N] ∈ R^{N×c}, where N is the number of tokens and c is the token dimension. A positional encoding P may be added to form I = X + P. The input embedding I goes through three linear projection layers (W^q ∈ R^{c×d}, W^k ∈ R^{c×d}, W^v ∈ R^{c×d}) to generate the Query (Q), Key (K), and Value (V):

Q = I W^q,  K = I W^k,  V = I W^v,   (1)

where d is the hidden dimension. The Query and Key are used to generate the attention map, and SA is then performed as follows:

O = SA(Q, K, V) = Softmax(Q K^T) V.   (2)

According to (2), given an input X, self-attention allows each token x_i to attend to all the other tokens. Thus, it provides global perception, in contrast to the local CNN operator. Motivated by this, several works [18], [114] treat it as a fully-connected graph or a non-local module for visual recognition tasks.

• Multi-Head Self-Attention: In practice, multi-head self-attention (MHSA) is more commonly used. The idea of MHSA is to stack multiple SA sub-layers in parallel; the concatenated outputs are fused by a projection matrix W^{fuse} ∈ R^{d×c}:

O = MHSA(Q, K, V) = concat([SA_1, ..., SA_H]) W^{fuse},   (3)

where SA_i = SA(Q_i, K_i, V_i) and H is the number of heads. Different heads have individual parameters, so MHSA can be viewed as an ensemble of SA.

• Feed-Forward Network: The goal of the feed-forward network (FFN) is to enhance the non-linearity of the attention layer outputs. It is also called a multi-layer perceptron (MLP), since it consists of two successive linear layers with non-linear activation layers.
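
A minimal PyTorch sketch of (1)-(3) is given below. It mirrors the notation above (W^q, W^k, W^v, and W^fuse), assumes each head has width d so the concatenated output has width H·d, and omits the attention scaling factor, dropout, and masking for brevity; it is an illustration, not a reference implementation.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention following (1) and (2)."""
    def __init__(self, c, d):
        super().__init__()
        self.w_q = nn.Linear(c, d, bias=False)   # W^q
        self.w_k = nn.Linear(c, d, bias=False)   # W^k
        self.w_v = nn.Linear(c, d, bias=False)   # W^v

    def forward(self, i):                        # i: (N, c) token embeddings I = X + P
        q, k, v = self.w_q(i), self.w_k(i), self.w_v(i)
        attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)   # Softmax(QK^T), no scaling as in (2)
        return attn @ v                          # O = SA(Q, K, V), shape (N, d)

class MultiHeadSelfAttention(nn.Module):
    """MHSA as in (3): H parallel SA sub-layers whose concatenation is fused by W^fuse."""
    def __init__(self, c, d, num_heads):
        super().__init__()
        self.heads = nn.ModuleList(SelfAttention(c, d) for _ in range(num_heads))
        self.w_fuse = nn.Linear(num_heads * d, c, bias=False)   # W^fuse

    def forward(self, i):
        return self.w_fuse(torch.cat([h(i) for h in self.heads], dim=-1))

tokens = torch.randn(16, 64)                     # N = 16 tokens, c = 64
print(MultiHeadSelfAttention(c=64, d=32, num_heads=4)(tokens).shape)  # torch.Size([16, 64])
```
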
III. METHODS: A SURVEY

In this section, based on the DETR-like meta-architecture, we review the key techniques of transformer-based segmentation. As shown in Fig. 3, the meta-architecture contains a feature extractor, object queries, and a transformer decoder. According to this meta-architecture, we survey existing methods by considering the modifications or improvements to each of its components in Sections III-B1, III-B2, and III-B3. Finally, based on this meta-architecture, we present several detailed applications in Sections III-B4 and III-B5.

Fig. 3. Illustration of (a) meta-architecture and (b) common operations in the decoder.

A. Meta-Architecture

• Backbone: Before ViTs, CNNs were the standard approach for feature extraction in computer vision tasks. To ensure fair comparison, many research works [22], [76], [115] used the same CNN models, such as ResNet50 [7]. Some researchers [18], [89] also explored combining CNNs with self-attention layers to model long-range dependencies. ViT, on the other hand, utilizes a standard transformer encoder for feature extraction. It has a specific input pipeline for images: the input image is split into fixed-size patches, such as 16 × 16 patches, which are processed by a linear embedding layer; positional embeddings are then added to each patch, and a standard transformer encoder, containing multiple multi-head self-attention and feed-forward layers, encodes all patches. Concretely, given an image I ∈ R^{H×W×3}, ViT first reshapes it into a sequence of flattened 2D patches I_p ∈ R^{N×(P^2·3)}, where N is the number of patches and P is the patch size. With the patch embedding operation, the final input is I_in ∈ R^{N×C}, where C is the embedding channel dimension. To perform classification, an extra learnable embedding, the "classification token" (CLS), is added to the sequence of embedded patches. After the standard transformer processes all patches, I_out ∈ R^{N×C} is obtained. For segmentation tasks, ViT is used as a feature extractor, meaning that I_out is resized back to a dense map F ∈ R^{H×W×C}.
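
The patchification and embedding step described above can be written in a few lines. The following is a schematic sketch rather than the exact ViT [21] implementation, and it assumes H and W are divisible by the patch size P; the zero-initialized positional and class embeddings stand in for learnable parameters.

```python
import torch
import torch.nn as nn

def patchify(img, p):
    """Split an image (3, H, W) into N = (H/p)*(W/p) flattened patches of length p*p*3."""
    c, h, w = img.shape
    patches = img.unfold(1, p, p).unfold(2, p, p)        # (3, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)
    return patches                                        # (N, p^2 * 3)

img = torch.randn(3, 224, 224)
tokens = patchify(img, p=16)                              # (196, 768) flattened patches I_p
embed = nn.Linear(16 * 16 * 3, 768)                       # linear patch embedding
cls_token = torch.zeros(1, 768)                           # extra "classification token"
pos = torch.zeros(1 + tokens.shape[0], 768)               # positional embeddings (learnable in practice)
x = torch.cat([cls_token, embed(tokens)], dim=0) + pos    # input sequence for the transformer encoder
print(x.shape)                                            # torch.Size([197, 768])
```
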

• Neck: The feature pyramid network (FPN) has been shown to be effective in object detection and instance segmentation [116], [117], [118] for modeling scale variation. FPN maps the features from different stages into the same channel dimension C for the decoder. Several works [83], [119] design stronger FPNs via cross-scale modeling using dilated or deformable convolution. For example, Deformable DETR [25] proposes a deformable FPN to model cross-scale fusion using deformable attention. Lite-DETR [120] further refines the deformable cross-scale attention design by efficiently sampling high-level and low-level features in an interleaved manner. The output features are used for decoding the boxes and masks. The role of the FPN is the same as in previous detection-based or FCN-based segmentation methods: it generates multi-scale features to handle and balance both small and large objects in the scene. For transformer-based methods, the FPN architecture is often used to refine object queries at different scales, which can lead to stronger results than single-scale refinement.

• Object Query: The object query is first introduced in DETR [22]. It plays the role of the dynamic anchors used in detectors [76], [115]. In practice, it is a learnable embedding Q_obj ∈ R^{N_ins×d}, where N_ins represents the maximum number of instances and the query dimension d is usually the same as the feature channel c. Object queries are refined by the cross-attention layers, and each object query represents one instance of the image. During training, each ground truth is assigned to one corresponding query for learning. During inference, the queries with high scores are selected as output. Thus, the object query simplifies the design of detection and segmentation models by eliminating the need for hand-crafted components such as non-maximum suppression (NMS). The flexible design of the object query has led to many research works exploring its usage in different contexts, which will be discussed in more detail in Section III-B.

• Transformer Decoder: The transformer decoder is a crucial architecture component in transformer-based segmentation and detection models. Its main operation is cross-attention, which takes in the object query Q_obj and the image/video feature F and outputs a refined object query, denoted as Q_out. The cross-attention operation is derived from the vanilla transformer architecture, where Q_obj serves as the query and F is used as the key and value in the attention mechanism. After obtaining the refined object query Q_out, it is passed through a prediction FFN, which typically consists of a 3-layer perceptron with ReLU activations and a final linear projection layer. The FFN outputs the final prediction, which depends on the specific task. For classification, the refined query is mapped directly to a class prediction via a linear layer. For detection, the FFN predicts the normalized center coordinates, height, and width of the object bounding box. For segmentation, the output embedding is used to perform a dot product with the feature F, which results in the binary mask logits. The transformer decoder iteratively repeats the cross-attention and FFN operations to refine the object queries and obtain the final prediction. The intermediate predictions are used for auxiliary losses during training and discarded during inference. The outputs from the last stage of the decoder are taken as the final detection or segmentation results. We show the detailed process in Fig. 3(b).
and balance both small and large objects in the scene. For the boxes, and masks. By minimizing the cost with the Hungarian
transformer-based method, FPN architecture is often used to algorithm [121], each object query is assigned by its corre-
refine object queries from different scales, which can lead to sponding ground truth. For object detection, each object query
stronger results than single-scale refinement. is trained with classification and box regression loss [115]. For
• Object Query: Object query is first introduced in DETR [22]. instance-aware segmentation, each object query is trained via
It plays as the dynamic anchors that are used in detectors [76], both mask classification loss and segmentation loss. The output
[115]. In practice, it is a learnable embedding Qobj ∈ RNins ×d . masks are obtained via the inner product between object query
Nins represents the maximum instance number. The query and decoder features. The segmentation loss usually contains
dimension d is usually the same as feature channel c. Object binary cross-entropy loss and dice loss [122].
query is refined by the cross-attention layers. Each object query • Discussion on Scope of Meta-Architecture: We admit our
represents one instance of the image. During the training, each meta-architecture may not cover all transformer-based segmen-
ground truth is assigned with one corresponding query for tation methods. In semantic segmentation, methods such as
learning. During the inference, the queries with high scores are Segformer [123] and SETR [124] employ a fully connected
selected as output. Thus, object query simplifies the design of layer and predict each pixel as previous FCN-based methods [9],
detection and segmentation models by eliminating the need for [62], [125]. These methods concentrate on enhanced feature
hand-crafted components such as non-maximum suppression representation. We argue that this represents a basic form of our
(NMS). The flexible design of object query has led to many meta-architecture, wherein each query corresponds to a class
research works exploring its usage in different contexts, which category. The cascaded cross-attention layers are omitted, and
will be discussed in more detail in Section III-B. bipartite matching is removed. Thus, the object query plays
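
A minimal sketch of the bipartite matching step with the Hungarian algorithm [121] is shown below. It uses only a negative classification probability and an L1 box distance as the cost; real systems also include mask or generalized-IoU terms, and the tensors here are random placeholders.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """One-to-one assignment between N predictions and M ground truths (N >= M)."""
    prob = pred_logits.softmax(-1)                       # (N, K+1)
    cost_cls = -prob[:, gt_labels]                       # (N, M): negative class probability
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)    # (N, M): L1 distance between boxes
    cost = cost_cls + cost_box
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx                              # matched prediction/ground-truth pairs

pred_logits = torch.randn(100, 81)                       # 100 object queries, 80 classes + "no object"
pred_boxes = torch.rand(100, 4)
gt_labels = torch.tensor([3, 17])                        # two ground-truth objects
gt_boxes = torch.rand(2, 4)
print(hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes))
```
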
• Discussion on the Scope of the Meta-Architecture: We admit our meta-architecture may not cover all transformer-based segmentation methods. In semantic segmentation, methods such as Segformer [123] and SETR [124] employ a fully connected layer and predict each pixel as in previous FCN-based methods [9], [62], [125]; these methods concentrate on enhanced feature representation. We argue that this represents a basic form of our meta-architecture, wherein each query corresponds to a class category, the cascaded cross-attention layers are omitted, and bipartite matching is removed. Thus, the object query plays the same role as a fully connected layer. In addition, the meta-architecture represents the latest design philosophy: nearly all recent state-of-the-art methods [126], [127], [128], [129] adopt it, with different methods adding components to suit their tasks and requirements. Thus, we review recent works according to how they modify each component of this meta-architecture.

B. Method Categorization

In this section, we review five aspects of transformer-based segmentation methods. Rather than classifying the literature by task settings, our goal is to extract the essential and common techniques used in the literature. We summarize the methods, techniques, related tasks, and corresponding references in Table III. Most approaches are based on the meta-architecture described in Section III-A. We list a comparison of representative works in Table IV.

TABLE III: TRANSFORMER-BASED SEGMENTATION METHOD CATEGORIZATION

TABLE IV: REPRESENTATIVE WORKS SUMMARIZATION AND COMPARISON IN SECTION III

1) Strong Representations: Learning a strong feature representation always leads to better segmentation results. Taking the SS task as an example, SETR [124] is the first to replace the CNN backbone with a ViT backbone, and it achieves state-of-the-art results on the ADE20K dataset without bells and whistles. After ViT, researchers started to design better vision transformers. We categorize the related works into three aspects: better vision transformer design, hybrid CNNs/transformers/MLPs, and self-supervised learning.

• Better ViTs Design: Rather than introducing local bias, these works follow the original ViT design and process features using the original MHSA for token mixing. DeiT [130] proposes knowledge distillation and provides strong data augmentation to train ViT efficiently; starting from DeiT, nearly all ViTs adopt this stronger training procedure. MViT-V1 [131] introduces multiscale feature representations and pooling strategies to reduce the computation cost in MHSA. MViT-V2 [132] further incorporates decomposed relative positional embeddings and a residual pooling design into MViT-V1, which leads to better representations. Motivated by MViT, at the architecture level, MPViT [133] introduces multiscale patch embedding and a multi-path structure to explore tokens of different scales jointly. Meanwhile, at the operator level, XCiT [134] operates across feature channels rather than token inputs and proposes cross-covariance attention, which has linear complexity in the number of tokens; this design makes it easy to adapt to segmentation tasks, which always have high-resolution inputs. Pyramid ViT [135] is the first work to build multiscale features for detection and segmentation tasks. There are also several works [136], [137], [138] exploring cross-scale modeling via MHSA, which exchange long-range information across different feature pyramid levels.

• Hybrid CNNs/Transformers/MLPs: Rather than modifying ViTs, many works focus on introducing local bias into ViT or on using CNNs with large kernels directly. To build a multi-stage pipeline, Swin [23], [210] adopts shifted-window attention in a CNN style; it also scales the models up to large sizes and achieves significant improvements on many vision tasks. From an efficiency perspective, Segformer [123] designs a light-weight transformer encoder with sequence reduction during MHSA and a light-weight MLP decoder, achieving a better speed and accuracy trade-off for SS. Meanwhile, several works [139], [140], [141], [142] directly add CNN layers to a transformer to exploit local context, and several works [211], [212] explore pure MLP designs to replace the transformer; with specific designs such as shifting and fusion [211], MLP models can also achieve results comparable to ViTs. Later, several works [143], [144] point out that CNNs can achieve stronger results than ViTs when using the same data augmentation pipeline. In particular, DWNet [144] revisits the training pipeline of ViTs and proposes dynamic depth-wise convolution. Then, ConvNeXt [143] uses larger-kernel depth-wise convolution and a stronger training pipeline, achieving stronger results than Swin [23]. Motivated by ConvNeXt, SegNext [145] designs a CNN-like backbone with linear self-attention and performs strongly on multiple SS benchmarks. Meanwhile, MetaFormer [146] shows that the meta-architecture of ViT, consisting of a token mixer, an MLP, and residual connections, is the key to achieving stronger results; since the token mixer is a simple MHSA layer, MetaFormer argues that the token mixer is not as important as the meta-architecture, and even simple pooling as the token mixer can achieve strong results. Following MetaFormer, recent work [147] re-benchmarks several previous works using a unified architecture to eliminate unfair engineering techniques; however, under stronger settings, the authors find that the spatial token mixer design still matters. Meanwhile, several works [214] explore MLP-like architectures for dense prediction.

• Self-Supervised Learning (SSL): SSL has achieved huge progress in recent years [148], [149], [215]. Compared with supervised learning, SSL exploits unlabeled data via specially designed pretext tasks and can be easily scaled up. MoCo-v3 [150] is the first study to train ViTs with SSL; it freezes the patch projection layer to stabilize the training process. Motivated by BERT, BEiT [151] proposes BERT-like pre-training (masked image modeling, MIM) of vision transformers. After BEiT, MAE [24] shows that ViTs can be trained in the simplest MIM style: by masking a portion of the input tokens and reconstructing the RGB images, MAE achieves better results than supervised training. As a concurrent work, MaskFeat [152] mainly studies the reconstruction targets of the MIM framework, such as histogram of oriented gradients (HOG) features. Follow-up works focus on improving the MIM framework [153], [154] or replacing the ViT backbone with CNN architectures [155], [216]. The DINO series [216] finds that the self-supervised feature itself has grouping effects, which is often exploited in unsupervised segmentation (Section IV-D). Recently, several works [156], [217] on vision-language models (VLMs) also adopt SSL by utilizing easily obtained text-image pairs, and recent work [157] demonstrates the effectiveness of VLMs in downstream tasks, including IS and SS. Moreover, several recent works [218] adopt multi-modal SSL pre-training and design a unified model for many vision tasks. For video representation learning, most current works [219], [220], [221] verify such representation learning on action or motion learning, such as action recognition. Several works [202], [222] adopt a video backbone for video segmentation. However, for video segmentation, from the method design perspective, most works focus on matching and association of entities or pixels, which is discussed in Sections III-B2 and III-B4.
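
The masking step at the heart of MIM can be sketched in a few lines. The snippet below follows the general recipe described above (drop a random subset of patch tokens and reconstruct the missing content) rather than the exact MAE [24] code; the mask ratio and tensor sizes are illustrative.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; the rest become reconstruction targets."""
    n, d = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    perm = torch.randperm(n)
    keep_idx, masked_idx = perm[:num_keep], perm[num_keep:]
    return tokens[keep_idx], keep_idx, masked_idx

patch_tokens = torch.randn(196, 768)             # 14x14 patches from a 224x224 image
visible, keep_idx, masked_idx = random_masking(patch_tokens)
print(visible.shape, masked_idx.shape)           # torch.Size([49, 768]) torch.Size([147])
# The encoder sees only `visible`; a light decoder predicts the masked patches
# (raw pixels in MAE, HOG features in MaskFeat) and the loss is computed on them.
```
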
2) Cross-Attention Design in Decoder: In this section, we review new transformer decoder designs. We categorize the decoder designs into two groups: improved cross-attention designs for image segmentation and spatial-temporal cross-attention designs for video segmentation. The former focuses on designing a better decoder to refine the original decoder in DETR. The latter extends the query-based object detector and segmenter into the video domain for VOD, VIS, and VPS, focusing on modeling temporal consistency and association.

• Improved Cross-Attention Design: Cross-attention is the core operation of the meta-architecture for segmentation and detection. Current solutions for improving cross-attention mainly focus on designing new or enhanced cross-attention operators and improved decoder architectures. Following DETR, Deformable DETR [25] proposes deformable attention to efficiently sample point features and perform cross-attention with the object queries jointly. Meanwhile, several works bring object queries into previous R-CNN frameworks. Sparse R-CNN [158] uses RoI-pooled features to refine the object queries for object detection; it also proposes a new dynamic convolution and self-attention to enhance the object queries without extra cross-attention. In particular, the pooled query features reweight the object queries, and self-attention is then applied to the object queries to obtain a global view. After that, several works [159], [160] add extra mask heads for IS. QueryInst [159] adds mask heads and refines the mask queries with dynamic convolution. Meanwhile, several works [161], [223] extend Deformable DETR by directly applying an MLP on the shared query. Inspired by MEInst [81], SOLQ [161] applies mask encodings on the object query via an MLP; by adopting the strong Deformable DETR detector and a Swin transformer [23] backbone, it achieves remarkable results on IS. However, these works still need extra box supervision, which makes the system complex. Moreover, most RoI-based approaches for IS suffer from low mask quality, since the mask resolution is limited to the boxes [71].

To fix the issues of extra box heads, several works remove box prediction and adopt pure mask-based approaches. In earlier work, OCRNet [66] characterizes a pixel by exploiting the representation of the corresponding object class, which forms a category query. Then, Segmenter [224] adopts a strong ViT backbone with class queries to directly decode class-wise masks. Pure mask-based approaches directly generate segmentation masks from high-resolution features and naturally have better mask quality. Max-Deeplab [26] is the first to remove the box head and design a pure mask-based segmenter for PS, achieving stronger performance than the box-based PS method [83]. It combines a CNN-transformer hybrid encoder [89] with a transformer decoder as an extra path. Max-Deeplab still needs extra auxiliary loss functions, such as a semantic segmentation loss and an instance discriminative loss. K-Net [163] uses mask pooling to group the mask features and designs a gated dynamic convolution to update the corresponding queries. By viewing the segmentation tasks as convolution with different kernels, K-Net is the first to unify all three image segmentation tasks, i.e., SS, IS, and PS. Meanwhile, MaskFormer [164] extends the original DETR by removing the box head and transferring the object queries into mask queries via MLPs, proving that simple mask classification works well enough for all three segmentation tasks. Compared to MaskFormer, K-Net has better training data efficiency, because it adopts mask pooling to localize object features and then updates the object queries accordingly. Motivated by this, Mask2Former [225] proposes masked cross-attention to replace the cross-attention in MaskFormer. Masked cross-attention makes each object query attend only to the object area, guided by the mask outputs from previous stages (a minimal sketch of this operation is given below). Mask2Former also adopts a stronger deformable FPN backbone [25], stronger data augmentation [226], and multiscale mask decoding. The above works only consider updating the object queries. To handle this, CMT-Deeplab [227] proposes an alternating procedure for object queries and decoder features, jointly updating object queries and pixel features. After that, inspired by the k-means clustering algorithm, kMaX-DeepLab [228] proposes k-means cross-attention by introducing a cluster-wise argmax operation into the cross-attention operation. Meanwhile, PanopticSegformer [165] proposes a decoupled query strategy and a deeply supervised mask decoder to speed up the training process. For the real-time segmentation setting, SparseInst [229] proposes a sparse set of instance activation maps highlighting informative regions for each foreground object.
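
The masked cross-attention mentioned above can be sketched as follows: attention logits at pixels predicted as background by the previous stage are suppressed, so each query only attends to its current foreground estimate. This is a schematic of the idea, not the exact Mask2Former [225] implementation; the 0.5 threshold and the shapes are illustrative.

```python
import torch

def masked_cross_attention(queries, feat_tokens, prev_mask_logits):
    """queries: (N, d); feat_tokens: (HW, d); prev_mask_logits: (N, HW) from the previous stage."""
    logits = queries @ feat_tokens.t()                       # (N, HW) raw attention logits
    attend = prev_mask_logits.sigmoid() > 0.5                # foreground region per query
    # If a query currently has an empty mask, let it attend everywhere instead of nowhere.
    attend[attend.sum(dim=1) == 0] = True
    logits = logits.masked_fill(~attend, float('-inf'))
    attn = torch.softmax(logits, dim=-1)
    return attn @ feat_tokens                                # (N, d) refined queries

queries = torch.randn(100, 256)
feat_tokens = torch.randn(64 * 64, 256)
prev_mask_logits = torch.randn(100, 64 * 64)
print(masked_cross_attention(queries, feat_tokens, prev_mask_logits).shape)  # torch.Size([100, 256])
```
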
Besides segmentation tasks, several works speed up the convergence of DETR by introducing new decoder designs, and most of these approaches can be extended to IS. Several works bring semantic priors into the DETR decoder. SAM-DETR [230] projects object queries into a semantic space and searches for salient points with the most discriminative features. SMCA [231] conducts location-aware co-attention by sampling features near the estimated bounding box locations. Several works adopt dynamic feature re-weighting. From the multiscale feature perspective, AdaMixer [232] samples features over space and scales using estimated offsets and dynamically decodes the sampled features with an MLP, which builds a fast-converging query-based detector. ACT-DETR [233] clusters the query features adaptively using locality-sensitive hashing and replaces the query-key interaction with a prototype-key interaction to reduce the cross-attention cost. From the feature re-weighting view, Dynamic DETR [234] introduces dynamic attention into both the encoder and decoder parts of DETR using RoI-wise dynamic convolution. Motivated by the sparsity of the decoder features, Sparse DETR [235] selectively updates the referenced tokens in the decoder and proposes an auxiliary detection loss on the selected tokens in the encoder to preserve the sparsity. In summary, dynamically assigning features to query learning speeds up the convergence of DETR.

Fig. 4. Illustration of object query in video segmentation.

• Spatial-Temporal Cross-Attention Design: After extending the object query to the video domain, each object query represents a tracked object across different frames, as shown in Fig. 4. The simplest extension is proposed by VisTR [166] for VIS.

VisTR extends the cross-attention in DETR to multiple frames by stacking all clip features into flattened spatial-temporal features, to which temporal embeddings are added. During inference, one object query can directly output spatial-temporal masks without extra tracking. Meanwhile, TransVOD [167] proposes to link object queries and their corresponding features across the temporal domain; it splits the clip into sub-clips and performs clip-wise object detection, utilizing local temporal information and achieving a better speed and accuracy trade-off. IFC [170] adopts message tokens to exchange temporal context among different frames; the message tokens are similar to learnable queries, performing cross-attention with the features in each frame and self-attention among themselves. After that, TeViT [168] proposes a novel messenger shift mechanism for temporal fusion and a shared spatial-temporal query interaction mechanism to utilize both frame-level and instance-level temporal context information. SeqFormer [171] combines Deformable DETR and VisTR in one framework and also proposes to use image datasets to augment video segmentation training. Mask2Former-VIS [169] extends the masked cross-attention in Mask2Former [225] to temporal masked cross-attention; following VisTR, it also directly outputs spatial-temporal masks.
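
A minimal sketch of this clip-level extension: frame features are flattened into one spatial-temporal token sequence (with an added temporal embedding), and each clip-level query cross-attends to all of them at once, so a single query can decode a mask tube without explicit tracking. Shapes and names are illustrative assumptions, not taken from VisTR [166].

```python
import torch
import torch.nn as nn

T, H, W, d, num_queries = 4, 32, 32, 256, 100
frame_feats = torch.randn(T, d, H, W)                         # per-frame features of one clip
temporal_embed = nn.Parameter(torch.zeros(T, 1, d))           # learned per-frame embedding

# Flatten the whole clip into one (T*H*W, d) token sequence.
tokens = frame_feats.flatten(2).permute(0, 2, 1) + temporal_embed   # (T, H*W, d)
tokens = tokens.reshape(T * H * W, d)

clip_queries = torch.randn(num_queries, d)                    # one query per tracked instance
attn = torch.softmax(clip_queries @ tokens.t() / d ** 0.5, dim=-1)
refined = attn @ tokens                                       # (num_queries, d)

# Mask tube: dot product of each refined query with every frame's feature map.
mask_tube = torch.einsum('nd,tdhw->nthw', refined, frame_feats)
print(mask_tube.shape)                                        # torch.Size([100, 4, 32, 32])
```
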
In addition to VIS, several works [163], [165], [225] have shown that query-based methods can naturally unify different segmentation tasks. Following this pipeline, several works [172], [173] solve multiple video segmentation tasks in one framework. In particular, based on K-Net [163], Video K-Net [172] proposes to unify VPS/VIS/VSS by tracking and linking kernels, and it works in an online manner. Meanwhile, TubeFormer [173] extends Max-Deeplab [26] into the temporal domain by producing mask tubes; cross-attention is carried out in a clip-wise manner, and during inference the instance association is performed by mask-based matching. Moreover, several works [236] propose a local temporal window to refine the global spatial-temporal cross-attention. For example, VITA [236] aggregates local temporal queries on top of an off-the-shelf transformer-based image instance segmentation model [225]. Recently, several works [239], [240] have explored cross-clip association for video segmentation. In particular, Tube-Link [240] designs a universal video segmentation framework by learning cross-tube relations and performs better than task-specific methods on VSS, VIS, and VPS.

3) Optimizing Object Query: Compared with Faster R-CNN [115], DETR [22] needs a much longer training schedule to converge. Due to the critical role of the object query, several approaches have studied speeding up training and improving performance. According to how they handle the object query, we divide the following literature into two aspects: adding position information and adopting extra supervision. The position information provides cues to sample the query features for faster training. The extra supervision focuses on designing specific loss functions in addition to the default ones in DETR.

• Adding Position Information into Query: Conditional DETR [174] finds that cross-attention in DETR relies heavily on the content embeddings for localizing the four extremities and introduces a conditional spatial query to explore the extremity regions explicitly. Conditional DETR V2 [175] introduces box queries derived from the image content to improve detection results; the box queries are learned directly from the image content and are therefore dynamic with respect to the input image. The image-dependent box query helps locate the object and improves performance. Motivated by the anchor designs of previous object detectors, several works bring anchor priors into DETR. Efficient DETR [241] adopts a hybrid design that includes query-based and dense anchor-based predictions in one framework. Anchor DETR [176] proposes to use anchor points to replace the learnable query and also designs an efficient self-attention head for faster training; each object query predicts multiple objects at one position. DAB-DETR [177] identifies the localization issues of the learnable query and proposes dynamic anchor boxes to replace it; dynamic anchor boxes make query learning more explainable and explicitly decouple the localization and content parts, further improving detection performance.

• Adding Extra Supervision into Query: DN-DETR [178] finds that the instability of bipartite graph matching causes the slow convergence of DETR and proposes a denoising loss to stabilize query learning. In particular, the authors feed ground-truth bounding boxes with added noise into the transformer decoder and train the model to reconstruct the original boxes (a minimal sketch of this noising step is given below). Motivated by DN-DETR and building on Mask2Former, MP-Former [242] finds inconsistent predictions between consecutive layers and further adds class embeddings of the ground-truth labels and masks to reconstruct the masks and labels. Meanwhile, DINO [179] improves DN-DETR via contrastive denoising training and a mixed query selection for better query initialization. Mask DINO [180] extends DINO by adding an extra query decoding head for mask prediction; it proposes a unified architecture and joint training process for both object detection and instance segmentation, and by sharing the training data, Mask DINO can scale up and fully utilize detection annotations to improve IS results. Meanwhile, motivated by contrastive learning, IUQ [181] introduces two extra forms of supervision, a cross-image contrastive query loss via extra memory blocks and an equivalent loss against geometric transformations; both losses can be naturally adapted to query-based detectors. There are also several works [182], [183], [184], [243] exploring query supervision from the target assignment perspective. In particular, since DETR lacks the capability of exploiting multiple positive object queries, DE-DETR [243] first introduces one-to-many label assignment into the query-based instance perception framework to provide richer supervision for model training. Group DETR [183] proposes group-wise one-to-many assignments during training. H-DETR [182] adds auxiliary queries that use a one-to-many matching loss during training. Rather than adding more queries, Co-DETR [184] proposes a collaborative hybrid training scheme using parallel auxiliary heads supervised by one-to-many label assignments. All these approaches drop the extra supervision heads during inference. These extra supervision designs can be easily extended to query-based segmentation methods [163], [225].
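
The box-noising step described above for DN-DETR [178] can be sketched as follows: ground-truth boxes are jittered and fed to the decoder as extra noised queries whose targets are the original boxes and labels, bypassing bipartite matching for that group. The noise scale, group count, and function names are illustrative assumptions, not the published recipe.

```python
import torch

def make_denoising_queries(gt_boxes, gt_labels, box_noise=0.1, num_groups=3):
    """gt_boxes: (M, 4) in normalized cx, cy, w, h; returns noised copies and their clean targets."""
    noised, targets = [], []
    for _ in range(num_groups):
        noise = (torch.rand_like(gt_boxes) * 2 - 1) * box_noise       # uniform in [-0.1, 0.1]
        jitter = noise * gt_boxes[:, 2:].repeat(1, 2)                 # scale noise by box size
        noised.append((gt_boxes + jitter).clamp(0, 1))
        targets.append((gt_boxes, gt_labels))                         # reconstruct the clean boxes/labels
    return torch.cat(noised, dim=0), targets

gt_boxes = torch.tensor([[0.5, 0.5, 0.2, 0.3], [0.3, 0.7, 0.1, 0.1]])
gt_labels = torch.tensor([3, 17])
dn_queries, dn_targets = make_denoising_queries(gt_boxes, gt_labels)
print(dn_queries.shape)    # (6, 4): extra decoder queries trained with a reconstruction loss,
                           # attention-masked so they cannot leak into the matching-based queries
```
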
4) Using Query for Association: Benefiting from the simplicity of the query representation, several recent works have adopted it as an association tool to solve downstream tasks.

There are mainly two usages: one for instance-level association and the other for task-level association. The former adopts the idea of instance discrimination for instance-wise matching problems in video, such as joint segmentation and tracking. The latter adopts queries to link features for multitask learning.

• Using Query for Instance Association: The research in this area can be divided into two aspects: one designs extra tracking queries, and the other uses object queries directly. TrackFormer [185] is the first to treat multi-object tracking as a set prediction problem by performing joint detection and tracking-by-attention. TransTrack [186] uses the object query from the last frame as a new track query and outputs tracking boxes from the shared decoder. MOTR [187] introduces an extra track query to model the tracked instances of the entire video. In particular, MOTR proposes a new tracklet-aware label assignment to train track queries and a temporal aggregation module to fuse temporal features. There are also several works [57], [172], [188], [189], [240] adopting the object query alone for tracking. In particular, MinVIS [188] directly uses the object query for matching without extra tracking-head modeling for VIS, where it adopts image-level instance segmentation training. Both Video K-Net [172] and IDOL [189] learn association embeddings directly from the object query using a temporal contrastive loss. During inference, the learned association embeddings are used to match instances across frames. These methods are usually verified on VIS and VPS tasks. All methods pre-train their image baselines on image datasets, including COCO and Cityscapes, and fine-tune their video architectures on the video datasets.
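To make the query-as-association idea concrete, the sketch below matches per-frame object query embeddings across frames with cosine similarity and Hungarian matching, in the spirit of methods that learn association embeddings with a temporal contrastive loss such as Video K-Net [172] and IDOL [189]. The function name, the similarity threshold, and the greedy handling of unmatched queries are illustrative assumptions, not the exact implementation of any cited method.

# Hedged sketch: cross-frame association of object query embeddings.
import torch
from scipy.optimize import linear_sum_assignment


def associate_queries(prev_emb: torch.Tensor, curr_emb: torch.Tensor):
    """prev_emb: [N, C], curr_emb: [M, C] per-frame query embeddings.

    Returns (prev_idx, curr_idx) pairs giving instance matches across frames.
    """
    prev = torch.nn.functional.normalize(prev_emb, dim=-1)
    curr = torch.nn.functional.normalize(curr_emb, dim=-1)
    sim = prev @ curr.t()                    # [N, M] cosine similarities
    cost = (1.0 - sim).cpu().numpy()         # Hungarian matching expects costs
    rows, cols = linear_sum_assignment(cost)
    # Keep only confident matches; unmatched queries would start new tracklets.
    return [(int(r), int(c)) for r, c in zip(rows, cols) if sim[r, c] > 0.5]


# Toy usage: 10 queries in frame t-1, 12 queries in frame t, 256-d embeddings.
matches = associate_queries(torch.randn(10, 256), torch.randn(12, 256))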
• Using Query for Linking Multi-Tasks: Several works [128], [190], [244], [245] use the object query to link features across different tasks to achieve mutual benefits. Rather than directly fusing multitask features, object query fusion not only selects the most discriminative parts to fuse but is also more efficient than dense feature fusion. In particular, Panoptic-PartFormer [190] links part and panoptic features via different object queries in an end-to-end framework, where joint learning leads to better part segmentation results. Several works [128], [191] combine segmentation features and depth features using an MHSA layer on the corresponding depth query and segmentation query, which unifies depth prediction and panoptic segmentation prediction via shared masks. Both works find a mutual benefit for segmentation and depth prediction. Recently, several works [193], [194] have adopted vision transformers with multiple task-aware queries for multi-task dense prediction. In particular, they treat object queries as task-specific hidden features for fusion and perform cross-task reasoning using MHSA on the task queries. Moreover, in addition to dense prediction tasks, FashionFormer [192] unifies fashion attribute prediction and instance part segmentation in one framework. It also finds a mutual benefit between instance segmentation and attribute prediction via query sharing. Recently, X-Decoder [237] uses two different types of queries for segmentation and language generation tasks. The authors jointly pre-train the two types of queries on large-scale vision-language datasets, where they find that both queries benefit the corresponding tasks, including visual segmentation and caption generation.

5) Conditional Query Fusion: In addition to using the object query for multitask prediction, several works adopt a conditional query design for cross-modal and cross-image tasks. The query is conditioned on the task inputs, and the decoder head uses such a conditional query to obtain the corresponding segmentation masks. Based on the source of the different inputs, we split these works into two aspects: language features and image features.

• Conditional Query Fusion From Language Feature: Several works [48], [57], [195], [196], [197], [199], [200], [201], [222], [245], [246], [247] adopt conditional query fusion according to the input language feature for both referring image segmentation (RIS) [48] and referring video object segmentation (RVOS) [57]. In particular, VLT [195], [222] first adopts the vision transformer for the RIS task and proposes a query generation module to produce multiple sets of language-conditional queries, which enhances diversified comprehension of the language. Then, it adaptively selects the output features of these queries via the proposed query balance module. Following the same idea, LAVT [196] designs a new gated cross-attention fusion where the image features are the query inputs of a self-attention layer in the encoder part. Compared with previous CNN approaches [248], [249], using a vision transformer significantly improves language-driven segmentation quality. With the help of CLIP's knowledge, CRIS [198] proposes vision-language decoding and contrastive learning to achieve text-to-pixel alignment. Meanwhile, several works [57], [199], [202], [250] adopt the video detection transformers of Section III-B2 for the RVOS task. MTTR [199] models RVOS as a sequence prediction problem and processes both language and video features jointly. Recently, several works [57], [245] explore referring VOS under fast-motion settings. Each object query in each frame combines the language features before being sent into the decoder. To speed up query learning, ReferFormer [202] designs a small set of object queries conditioned on the language as the input to the transformer. The conditional queries are transformed into dynamic kernels to generate tracked object masks in the decoder. With the same design as VisTR, ReferFormer can segment and track object masks given language inputs. In this way, each object tracklet is controlled by a given language input. In addition to referring segmentation tasks, MDETR [251] presents an end-to-end modulated detector that detects objects in an image conditioned on a raw text query. In particular, it fuses the text embedding directly into the visual features and jointly trains the fused feature and the object query. X-DETR [238] proposes an effective architecture for instance-wise vision-language tasks that uses a dot-product to align vision and language. In summary, these works fully utilize the interaction between language features and query features.
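As a rough illustration of conditional query fusion from language features, the snippet below conditions a set of learnable object queries on token-level text features before they enter a mask decoder, loosely in the spirit of VLT [195] and ReferFormer [202]. The module name, tensor shapes, and the single cross-attention fusion step are simplified assumptions on our part rather than the exact architecture of any cited paper.

# Hedged sketch: language-conditional object queries for referring segmentation.
import torch
import torch.nn as nn


class LanguageConditionedQueries(nn.Module):
    """Fuse sentence/token language features into learnable object queries."""

    def __init__(self, num_queries: int = 20, dim: int = 256, text_dim: int = 768):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)   # task-agnostic queries
        self.text_proj = nn.Linear(text_dim, dim)       # project language tokens
        self.fuse = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: [B, T, text_dim] token-level features from a text encoder.
        text = self.text_proj(text_feat)                             # [B, T, dim]
        q = self.queries.weight.unsqueeze(0).expand(text.size(0), -1, -1)
        # Each query attends to the sentence, yielding language-conditional
        # queries that a mask decoder can turn into referred-object masks.
        cond_q, _ = self.fuse(query=q, key=text, value=text)
        return cond_q                                                # [B, N, dim]


cond = LanguageConditionedQueries()(torch.randn(2, 12, 768))  # -> [2, 20, 256]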
• Conditional Query Fusion From Image Feature: Several tasks take multiple images as references and refine the corresponding object masks of the main image. The multiple images can be support images in few-shot segmentation [203], [209], [252] or the same input image in matting [205], [253] and semantic segmentation [206], [207]. These works aim to model the correspondences between the main image and the other images via conditional query fusion. For SS, StructToken [207] presents a new framework that performs interactions between a set of learnable structure tokens and the image features, where the image features serve as spatial priors. In the video domain, BATMAN [208] fuses optical flow features and previous-frame features into mixed features and uses such features as a query to decode the current-frame outputs. For few-shot segmentation, CyCTR [203] aggregates pixel-wise support features into query features. In particular, CyCTR performs cross-attention between features from different images in a cyclic manner, where the support image features and the query image features are jointly the query inputs of the transformer. Meanwhile, MM-Former [209] adopts a class-agnostic method [225] to decompose the query image into multiple segment proposals. Then, the support and query image features are used to select the correct masks via a transformer module. For few-shot instance segmentation, RefTwice [254] proposes an object-query-enhanced framework to weight query image features via object queries from the support queries. In image matting, MatteFormer [205] designs a new attention layer called prior-attentive window self-attention based on Swin [23]. The prior token represents the global context feature of each trimap region and is the query input of the window self-attention. The prior token introduces spatial cues and achieves thinner matting results. In summary, depending on the task, the image features play the role of the decoder features in the previous Section III-B2, which enhance the features of the main image.
IV. SPECIFIC SUBFIELDS

In this section, we revisit several related subfields that adopt vision transformers for segmentation tasks. The subfields include point cloud segmentation, foundation model tuning, domain-aware segmentation, label and model efficient segmentation, class agnostic segmentation and tracking, and medical segmentation.

A. Point Cloud Segmentation

• Semantic Level Point Cloud Segmentation: Like image segmentation and video semantic segmentation, adopting transformers for semantic-level point cloud processing mainly focuses on learning a strong representation (Section III-B1). The works [29], [255] focus on transferring the success of image/video representation learning to the point cloud. Early works [29] directly use modified self-attention as backbone networks and design U-Net-like architectures for segmentation. In particular, Point Transformer [29] proposes vector self-attention and a subtraction relation to aggregate local features progressively. The concurrent work PCT [255] also adopts a self-attention operation and enhances the input embedding with the support of farthest point sampling and nearest neighbor searching. However, the ability to model long-range context and cross-scale interaction is still limited. Stratified Transformer [256] extends the idea of the Swin Transformer [23] to the point cloud and divides the 3D input into cubes. It proposes a mixed key sampling method for the attention input and enlarges the effective receptive field by merging different cube outputs. Meanwhile, several works also focus on better pre-training or on distilling the knowledge of 2D pre-trained models. Point-BERT [257] designs the first Masked Point Modeling (MPM) task to pre-train point cloud transformers. It divides a point cloud into several local point patches as the input of a standard transformer. Moreover, it also pre-trains a point cloud tokenizer with a discrete variational autoencoder to encode the semantic contents and trains an extra decoder using a reconstruction loss. Following MAE [24], several works [258], [259] simplify the masked pre-training process. Point-MAE [258] divides the input point cloud into irregular point patches and randomly masks them at a high ratio. Then, it uses a standard transformer-based autoencoder to reconstruct the masked points. Point-M2AE [259] designs a multiscale MIM pre-training by making the encoder and decoder pyramid architectures to model spatial geometries and multilevel semantics progressively. Meanwhile, benefiting from the same transformer architecture for point clouds and images, several works adopt an image pre-trained standard transformer and distill the knowledge from models pre-trained on large-scale image datasets.

• Instance Level Point Cloud Segmentation: As shown in Section II, previous PCIS/PCPS approaches are based on manually-tuned components, including a voting mechanism that predicts hand-selected geometric features for top-down approaches and heuristics for clustering the votes for bottom-up approaches. Both involve many hand-crafted components and post-processing. The usage of transformers in instance-level point cloud segmentation is similar to the image or video domain, and most works use bipartite matching for instance-level masks in indoor and outdoor scenes. For example, Mask3D [260] proposes the first transformer-based approach for 3D semantic instance segmentation. It models each object instance as an instance query and uses the transformer decoder to refine each instance query by attending to point cloud features at different scales. Meanwhile, SPFormer [261] learns to group the potential features from point clouds into superpoints [262] and directly predicts instances through instance queries with a mask-based transformer decoder. The superpoints utilize geometric regularities to represent homogeneous neighboring points, which is more efficient than using all point features. The transformer decoder works similarly to Mask2Former, where the cross-attention between instance queries and superpoint features is guided by the attention mask from the previous stage. PUPS [263] proposes a unified PPS system for outdoor scenes. It presents two types of learnable queries, named semantic score and grouping score. The former predicts the class label for each point, while the latter indicates the probability of the grouping ID for each point. Then, both queries are refined via grouped point features, which shares the same idea as the earlier Sparse R-CNN [158] and K-Net [163]. Moreover, PUPS also presents a context-aware mixing to balance the training instance samples, which achieves new state-of-the-art results [264].
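To ground this instance-query formulation for point clouds, here is a minimal sketch of one masked cross-attention refinement step between instance queries and point (or superpoint) features, in the style popularized by Mask2Former and reused by Mask3D [260] and SPFormer [261]. The class name, dimensions, and the 0.5 masking threshold are illustrative assumptions rather than the exact design of those methods.

# Hedged sketch: one refinement step of a mask-based 3D instance query decoder.
import torch
import torch.nn as nn


class PointInstanceDecoderLayer(nn.Module):
    """Instance queries attend to point/superpoint features, guided by the
    attention mask predicted at the previous stage."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_head = nn.Linear(dim, dim)   # query -> mask embedding

    def forward(self, queries, point_feats, prev_mask_logits):
        # queries: [B, N, C]; point_feats: [B, P, C]; prev_mask_logits: [B, N, P]
        # Points the previous stage deemed background for a query are blocked
        # from that query's cross-attention (True entries are masked out).
        # In practice one keeps at least some positions unmasked per query.
        attn_mask = (prev_mask_logits.sigmoid() < 0.5)
        attn_mask = attn_mask.repeat_interleave(self.cross_attn.num_heads, dim=0)
        queries, _ = self.cross_attn(queries, point_feats, point_feats,
                                     attn_mask=attn_mask)
        # New per-point mask logits: dot product of mask embeddings and points.
        mask_logits = torch.einsum("bnc,bpc->bnp",
                                   self.mask_head(queries), point_feats)
        return queries, mask_logits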
B. Tuning Foundation Models

We divide this section into two aspects: vision adapter design and open vocabulary learning. The former introduces new ways to adapt pre-trained large-scale foundation models to downstream tasks. The latter tries to detect and segment unknown objects with the help of pre-trained vision-language models and zero-shot knowledge transfer on unseen segmentation datasets. The core idea of vision adapter design is to extract the knowledge of foundation models and design better ways to fit the downstream settings. For open vocabulary learning, the core idea is to align pre-trained VLM features with current detectors to achieve novel class classification.

• Vision Adapter and Prompting Modeling: Following the idea of prompt tuning in NLP, early works [265], [266] adopt learnable parameters with frozen foundation models to better transfer to the downstream datasets. These works use small image classification datasets for verification and achieve better results than the original zero-shot results [217]. Meanwhile, several works [267] design adapters with frozen foundation models for video recognition tasks. In particular, the pre-trained parameters are frozen, and only a few learnable parameters or layers are tuned. Following the idea of learnable tuning, recent works [268], [269] extend the vision adapter to dense prediction tasks, including segmentation and detection. In particular, ViT-Adapter [268] proposes a spatial prior module to compensate for the weak location priors in ViTs. The authors design a two-stream adaptation framework using deformable attention and achieve comparable results on downstream tasks. From the CLIP knowledge usage view, DenseCLIP [269] converts the original image-text matching in CLIP into a pixel-text matching problem and uses the pixel-text score maps to guide the learning of dense prediction models. From the task prompt view, CLIPSeg [270] builds a system to generate image segmentations based on arbitrary prompts at test time. A prompt can be a text or an image, and the CLIP visual model is frozen during training. In this way, the segmentation model can be turned to a different task driven by the task prompt. Previous works only focus on a single task. OneFormer [271] extends Mask2Former with a multi-target training setting and performs segmentation driven by the task prompt. Moreover, using a vision adapter and text prompts can easily reduce the taxonomy problems of each dataset and learn a more general representation for different segmentation tasks. Recently, SAM [272] has proposed more generalized prompting methods, including mask, point, box, and text prompts. The authors build a larger dataset with 1 billion masks. SAM achieves good zero-shot performance on various segmentation datasets.
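The pattern shared by these adapter and prompt-tuning methods is to freeze the foundation model and train only a small number of new parameters. The sketch below shows a generic residual bottleneck adapter wrapped around a frozen backbone; it is a minimal illustration of that pattern under our own naming and dimension assumptions, not the specific design of ViT-Adapter [268] or any other cited work.

# Hedged sketch: parameter-efficient tuning with a frozen foundation model.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Residual adapter: the only trainable module on top of a frozen backbone."""

    def __init__(self, dim: int = 768, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):                      # x: [B, tokens, dim]
        return x + self.up(torch.relu(self.down(x)))


def build_tunable_model(frozen_backbone: nn.Module, dim: int = 768):
    # Freeze every pre-trained parameter; only the adapter (and a task head,
    # omitted here) receives gradients during downstream fine-tuning.
    for p in frozen_backbone.parameters():
        p.requires_grad = False
    return nn.Sequential(frozen_backbone, BottleneckAdapter(dim))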
• Open Vocabulary Learning: Recent studies [246], [273], [274], [275], [276], [277], [278] focus on the open vocabulary and open world settings, where the goal is to detect and segment novel classes that are not seen during training. Different from zero-shot learning, the open vocabulary setting assumes that large-vocabulary data or knowledge can provide cues for the final classification. Most models are trained by leveraging pre-trained language-text pairs, including captions and text prompts, or with the help of a VLM. The trained models can then detect and segment the novel classes with the help of weakly annotated captions or existing publicly available VLMs. In particular, ViLD [274] distills the knowledge from a trained open vocabulary image classification model, CLIP, into a two-stage detector. However, ViLD still needs an extra visual CLIP encoder for visual distillation. To handle this, Frozen-VLM [279] adopts the frozen visual CLIP model and combines the scores of both the learned visual embedding and the CLIP embedding for novel class detection. From the data augmentation view, MViT [280] combines Deformable DETR and the CLIP text encoder for open world class-agnostic detection, where the authors build a large dataset by mixing existing detection datasets. Motivated by the more balanced samples in image classification datasets, Detic [275] improves the performance on novel classes with existing image classification datasets by supervising the max-size proposal with all image labels. OV-DETR [276] designs the first query-based open vocabulary framework by learning conditional matching between class text embeddings and query features. Besides these open vocabulary detection settings, recent works [281], [282] perform open vocabulary segmentation. In particular, LSeg [282] presents a new setting for language-driven semantic image segmentation and proposes a transformer-based image encoder that computes dense per-pixel embeddings according to the language inputs. OpenSeg [281] learns to generate segmentation masks for possible candidates using a DETR-like transformer. Then it performs visual-semantic alignment by aligning each word in a caption to one or a few predicted masks. BetrayedCaption [283] presents a unified transformer framework for joint segmentation and caption learning, where the caption part contains both caption generation and caption grounding. The novel class information is encoded into the network during training. With the goal of unifying different segmentation tasks with text prompts, FreeSeg [284] adopts a similar pipeline to OpenSeg and crops frozen CLIP features for novel class classification. Meanwhile, open set segmentation [285] requires the model to output class-agnostic masks and enhances the generality of segmentation models. Recently, ODISE [286] uses a frozen diffusion model as the feature extractor, a Mask2Former head, and joint training with caption data to perform open vocabulary panoptic segmentation. There are also several works [287] focusing on open-world object detection, where the task is to detect a known set of object categories while simultaneously identifying unknown objects. In particular, OW-DETR [287] adopts DETR as the base detector and proposes several improvements, including attention-driven pseudo-labeling, novelty classification, and objectness scoring. In summary, most approaches [284], [288] adopt the idea of the region proposal network [115] to generate class-agnostic mask proposals via different approaches, including the anchor-based and query-based decoders in Section III-A. The open vocabulary problem then turns into a region-level matching problem that matches the visual region features with pre-trained VLM language embeddings.
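The region-level matching step shared by many of these approaches can be written compactly: class-agnostic mask (or region) embeddings are scored against frozen text embeddings of arbitrary category prompts. The snippet below is a hedged sketch of that classification step only; it assumes the mask embeddings and CLIP-style text embeddings are already computed, and it is not the exact pipeline of any single cited method.

# Hedged sketch: open-vocabulary classification of class-agnostic mask proposals.
import torch


def open_vocab_classify(mask_embeds: torch.Tensor,
                        text_embeds: torch.Tensor,
                        temperature: float = 0.01) -> torch.Tensor:
    """mask_embeds: [N, C] one embedding per predicted mask proposal.
    text_embeds: [K, C] frozen VLM text embeddings, one per class prompt
                 (e.g., "a photo of a {class}", possibly novel classes).
    Returns [N, K] classification logits over the open vocabulary."""
    mask_embeds = torch.nn.functional.normalize(mask_embeds, dim=-1)
    text_embeds = torch.nn.functional.normalize(text_embeds, dim=-1)
    return mask_embeds @ text_embeds.t() / temperature


logits = open_vocab_classify(torch.randn(100, 512), torch.randn(20, 512))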
C. Domain-Aware Segmentation

• Domain Adaptation: Unsupervised Domain Adaptation (UDA) aims at adapting a network trained on the source (synthetic) domain to the target (real) domain [45], [289] without access to target labels. UDA has two different settings, including semantic segmentation and object detection. Before ViTs, previous works [290] mainly design domain-invariant representation learning strategies. DAFormer [291] replaces the outdated backbone with the advanced transformer backbone [123] and
proposes three training strategies, including rare class sampling, a thing-class ImageNet feature loss, and a learning rate warm-up method. It achieves new state-of-the-art results and is a strong baseline for UDA segmentation. Then, HRDA [292] improves DAFormer via a multi-resolution training approach and uses various crops to preserve fine segmentation details and long-range contexts. Motivated by MIM [24], MIC [293] proposes a masked image consistency to learn spatial context relations of the target domain as additional clues. MIC enforces the consistency between predictions of masked target images and pseudo-labels via a teacher-student framework. It is a plug-in module that has been verified in various UDA settings. For detection transformers on UDA, SFA [294] finds that feature distribution alignment on CNNs brings limited improvements. Instead, it proposes a domain query-based feature alignment and a token-wise feature alignment module. In particular, the alignment is achieved by introducing a domain query and performing domain classification on the decoder. DA-DETR [295] proposes a hybrid attention module (HAM), which contains a coordinate attention module and a level attention module along with the transformer encoder. A single domain-aware discriminator supervises the output of HAM. MTTrans [296] presents a teacher-student framework and a shared object query strategy. Meanwhile, SePiCo [297] introduces a new framework that extracts the semantic meaning of individual pixels to learn class-discriminative and class-balanced pixel representations. It supports both CNN and transformer architectures. The image and object features between the source and target domains are aligned at the local, global, and instance levels.
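A minimal sketch of the self-training loop behind these UDA methods is given below: an EMA teacher produces pseudo-labels on unlabeled target images, and the student must reproduce them from masked target inputs, in the spirit of MIC [293]. The masking ratio, patch size, loss choice, and update schedule are our own illustrative assumptions; the sketch also assumes the models output logits of shape [B, K, H, W] and that H and W are divisible by the patch size.

# Hedged sketch: teacher-student masked consistency for UDA segmentation.
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.999):
    # Teacher weights follow the student via an exponential moving average.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)


def masked_consistency_step(student, teacher, target_img, patch: int = 32):
    # 1) Teacher pseudo-labels on the clean (unmasked) target image.
    with torch.no_grad():
        pseudo = teacher(target_img).argmax(dim=1)          # [B, H, W]
    # 2) Randomly mask patches so context must be inferred by the student.
    b, _, h, w = target_img.shape
    keep = (torch.rand(b, 1, h // patch, w // patch,
                       device=target_img.device) > 0.5).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    masked_img = target_img * keep
    # 3) Student is trained to match the pseudo-labels from the masked input.
    logits = student(masked_img)                            # [B, K, H, W]
    return F.cross_entropy(logits, pseudo)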
• Multi-Dataset Segmentation: The goal of multi-dataset segmentation is to learn a universal segmentation model across various domains. MSeg [298] re-defines the taxonomies and aligns the pixel-level annotations by relabeling several existing semantic segmentation benchmarks. The following works then try to avoid taxonomy conflicts via various approaches. For example, Sentence-Seg [299] replaces each class label with a vector-valued embedding. The embedding is generated by a language model [15]. To further handle the inflexible one-hot common taxonomy, LMSeg [300] extends such embeddings with learnable tokens [265] and proposes a dataset-specific augmentation for each dataset. It dynamically aligns the segment queries in MaskFormer [164] with the category embeddings for both SS and PS tasks. Meanwhile, there are several works on multi-dataset object detection [301], [302]. In particular, Detection-Hub [302] proposes to adapt object queries to the language embedding of the categories of each dataset. Rather than a previously shared embedding for all datasets, it learns a semantic bias for each dataset on top of the common language embedding to avoid the domain gap. Meanwhile, several works [303], [304] focus on segmentation domain generalization, which directly transfers learned knowledge from one domain to the remaining domains. TarVIS [127] jointly pre-trains one video segmentation model for different tasks spanning multiple benchmarks, where it extends Mask2Former into the video domain and adopts unified image dataset pre-training and video fine-tuning. Recently, OMG-Seg [126] has unified multi-dataset segmentation, image/video segmentation, and open-vocabulary segmentation in one shared model, achieving one model that segments all entities.

D. Label and Model Efficient Segmentation

• Weakly Supervised Segmentation: Weakly supervised segmentation methods learn segmentation with weaker annotations, such as image labels and object boxes. For weakly supervised semantic segmentation, previous works [305], [306] improve the typical CNN pipeline with class activation maps (CAM) and use the refined CAM as training labels, which requires an extra model for training. ViT-PCM [307] shows that self-supervised transformers [150] with global max pooling can leverage patch features to negotiate pixel-label probabilities and achieve end-to-end training and testing with one model. MCTformer [305] adopts the idea that the attended regions of the one-class token in the vision transformer can be leveraged to form a class-agnostic localization map. It extends this to multiple classes by using multiple class tokens to learn interactions between the class tokens and the patch tokens to generate the segmentation labels. For weakly supervised instance segmentation, previous works [308], [309], [310] mainly leverage box priors to supervise mask heads. Recently, MAL [308] has shown that vision transformers are good mask auto-labelers. It takes box-cropped images as inputs and adopts a teacher-student framework, where the two vision transformers are trained with a multiple instance loss [308]. MAL demonstrates zero-shot segmentation ability and achieves nearly mask-supervised performance on various baselines. Meanwhile, several works [311], [312] explore text-only supervision for semantic segmentation. One representative work, GroupViT [311], adopts ViT to group image regions into progressively larger shaped segments.

• Unsupervised Segmentation: Unsupervised segmentation performs segmentation without any labels [313]. Before ViTs, progress [314] mainly leveraged ideas from self-supervised learning. DINO [216] finds that self-supervised ViT features naturally contain explicit information about the segmentation of the input image. It observes that the attention maps between the CLS token and the features describe the segmentation of objects. Instead of using the CLS token, LOST [315] solves unsupervised object discovery by using the key component of the last attention layer to compute the similarities between different patches. Several works aim at finding the semantic correspondence of multiple images. Then, by utilizing the correspondence maps as guidance, they achieve better performance than DINO. Given a pair of images, STEGO [316] finds that the self-supervised learned features of DINO have semantically consistent correlations. It proposes to distill the unsupervised features into high-quality discrete semantic labels. Motivated by the success of VLMs, ReCo [317] adopts the language-image pre-trained model, CLIP, to retrieve a large set of unlabeled images by leveraging the correspondences in the deep representation. Then, it performs co-segmentation among both the input and retrieved images. There are also several works adopting sequential pipelines. MaskDistill [318] first identifies groups of pixels that likely belong to the same object with a bottom-up model. Then, it clusters the
object masks and uses the result as pseudo ground truths to train an extra model. Finally, the output masks are selected from the offline model according to the object score. FreeSOLO [319] first adopts an extra self-supervised trained model to obtain coarse masks. Then, it trains a SOLO-based instance segmentation model via weak supervision. CutLER [320] proposes a new framework for multiple object mask generation. It first designs MaskCut to discover multiple coarse masks based on the self-supervised features (DINO). Then, it adopts a detector to recall the missing masks via a loss-dropping strategy. Finally, it further refines mask quality via self-training.

• Mobile Segmentation: Most transformer-based segmentation methods have huge computational costs and memory requirements, which make them unsuitable for mobile devices. Different from previous real-time segmentation methods [68], [70], [321], mobile segmentation methods need to be deployed on mobile devices while considering both power cost and latency. Several earlier works [322], [323], [324], [325], [326], [327] focus on a more efficient transformer backbone. In particular, MobileViT [323] introduces the first transformer backbone for mobile devices. It reduces image patches via MLPs before performing MHSA and shows better task-level generalization properties. There have also been several works on designing mobile semantic segmentation with transformers. TopFormer [328] proposes a token pyramid module that takes the tokens from various scales as input to produce scale-aware semantic features. SeaFormer [329] proposes a squeeze-enhanced axial transformer that contains a generic attention block. The block mainly contains two branches: a squeeze axial attention layer to model efficient global context and a detail enhancement module to preserve the details. RAP-SAM [327] proposes a new unified setting that puts real-time interactive segmentation, panoptic segmentation, and video segmentation into one framework.

E. Class Agnostic Segmentation and Tracking

• Fine-grained Object Segmentation: Several applications, such as image and video editing, often need fine-grained details of object mask boundaries. Earlier CNN-based works focus on refining the object masks with extra convolution modules [71] or extra networks [330]. Most transformer-based approaches [331], [332], [333], [334], [335] adopt vision transformers due to their fine-grained multiscale features and long-range context modeling. Transfiner [331] refines the region of the coarse mask via a quad-tree transformer. By considering multiscale point features, it produces more natural boundaries while revealing details of the objects. Then, Video-Transfiner [332] refines the spatial-temporal mask boundaries by applying Transfiner [331] to the video segmentation method [166]. It can refine the existing video instance segmentation datasets [50]. PatchDCT [336] adopts the idea of ViT by dividing object masks into patches. Then, each mask is encoded into a DCT vector [337], and PatchDCT designs a classifier and a regressor to refine each encoded patch. Entity segmentation [338] aims to segment all visual entities without predicting their semantic labels. Its goal is to obtain high-quality and generalized segmentation results.

• Video Object Segmentation: Recent approaches for VOS mainly focus on designing better memory-based matching methods [339]. Inspired by the non-local network [18] in image recognition tasks, the representative work STM [339] is the first to adopt cross-frame attention, where previous features are treated as memory. The following works [204] then design a better memory-matching process. Associating Objects with Transformers (AOT) [204] matches and decodes multiple objects jointly. The authors propose a novel hierarchical matching and propagation scheme, named the long short-term transformer, where they jointly preserve an identity bank and long-short term attention. XMem [340] proposes a mixed memory design to handle long video inputs. The mixed memory design is also based on the self-attention architecture. Meanwhile, Clip-VOS [341] introduces per-clip memory matching for inference efficiency. Recently, to enhance instance-level context, Wang et al. [342] add an extra query from Mask2Former into memory matching for VOS.
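The memory reading underlying these VOS methods is essentially cross-attention from current-frame (query) features to stored past-frame (memory) features. The sketch below shows that single read step with space-time flattened keys and values; it is an assumption-laden simplification of STM-style matching, with our own function name, shapes, and temperature choice, rather than any specific cited implementation.

# Hedged sketch: a single memory-read step used by matching-based VOS.
import torch


def read_memory(query_feat, memory_key, memory_value, temperature=None):
    """query_feat:   [B, C, H, W]   features of the current frame.
    memory_key:   [B, C, T*H*W]  keys of past frames (space-time flattened).
    memory_value: [B, Cv, T*H*W] values (e.g., mask-augmented features).
    Returns a readout of shape [B, Cv, H, W] to be fused with query features."""
    b, c, h, w = query_feat.shape
    q = query_feat.flatten(2)                                # [B, C, HW]
    temperature = temperature or c ** 0.5
    affinity = torch.einsum("bck,bcq->bkq", memory_key, q) / temperature
    affinity = affinity.softmax(dim=1)                       # over memory slots
    readout = torch.bmm(memory_value, affinity)              # [B, Cv, HW]
    return readout.view(b, -1, h, w)


out = read_memory(torch.randn(1, 64, 30, 54),
                  torch.randn(1, 64, 4 * 30 * 54),
                  torch.randn(1, 512, 4 * 30 * 54))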
style, where global dependency and low-level spatial details
can be efficiently captured jointly. UNETR [348] focuses on
E. Class Agnostic Segmentation and Tracking 3D input medical images and designs a similar U-Net-like
• Fine-grained Object Segmentation: Several applications, architecture. The encoded representations of different layers in
such as image and video editing, often need fine-grained details the transformer are extracted and merged with a decoder via skip
of object mask boundaries. Earlier CNN-based works focus on connections to get the final 3D mask outputs.
refining the object masks with extra convolution modules [71], or
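As a rough sketch of these hybrid designs, the block below upsamples transformer-encoded features and concatenates them with a high-resolution CNN skip feature before convolutional refinement, loosely in the spirit of TransUNet [345]. The module name, channel sizes, and normalization choice are illustrative assumptions.

# Hedged sketch: one decoder stage fusing transformer features with a CNN skip.
import torch
import torch.nn as nn


class UpFuseBlock(nn.Module):
    """Upsample low-resolution transformer features and fuse a CNN skip map."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        # x: features reshaped from transformer tokens (low resolution),
        # skip: high-resolution CNN feature map from the encoder.
        x = self.up(x)
        return self.conv(torch.cat([x, skip], dim=1))


# Toy usage: 1/16-resolution transformer features fused with a 1/8 CNN skip.
y = UpFuseBlock(256, 128, 128)(torch.randn(1, 256, 14, 14),
                               torch.randn(1, 128, 28, 28))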
V. BENCHMARK RESULTS

In this section, we report recent transformer-based visual segmentation results and tabulate the performance of the previously discussed algorithms. For each reviewed field, the most widely used datasets are selected for the performance benchmark in Sections V-A and V-C. We further re-benchmark several representative works in Section V-B using the same data augmentations and feature extractor. Note that we only list published works for reference. For simplicity, we have excluded several works on representation learning and only present specific segmentation methods. For a comprehensive method comparison, please refer to the supplementary material, which provides a more detailed analysis. In addition, several works [349], [350], [351] achieve better results. However, due to the extra datasets [352] they use, we do not list them here.
TABLE V: BENCHMARK RESULTS ON SEMANTIC SEGMENTATION VALIDATION DATASETS.
TABLE VI: BENCHMARK RESULTS ON INSTANCE SEGMENTATION OF COCO VALIDATION DATASETS.
TABLE VII: BENCHMARK RESULTS ON PANOPTIC SEGMENTATION VALIDATION DATASETS.
TABLE VIII: BENCHMARK RESULTS ON VIDEO SEMANTIC SEGMENTATION OF VSPW VALIDATION DATASETS.
TABLE IX: EXPERIMENT RESULTS ON SEMANTIC SEGMENTATION DATASETS.
TABLE X: EXPERIMENT RESULTS ON INSTANCE SEGMENTATION DATASETS.
TABLE XI: EXPERIMENT RESULTS ON PANOPTIC SEGMENTATION DATASETS.

A. Main Results on Image Segmentation Datasets

• Results on Semantic Segmentation Datasets: In Table V, Mask2Former [225] and OneFormer [271] perform the best on the Cityscapes and ADE20K datasets, while SegNeXt [145] achieves the best results on the COCO-Stuff and Pascal-Context datasets.

• Results on COCO Instance Segmentation: In Table VI, Mask DINO [180] achieves the best results on COCO instance segmentation with both ResNet and Swin-L backbones.

• Results on Panoptic Segmentation: In Table VII, for panoptic segmentation, Mask DINO [180] and K-Max DeepLab [228] achieve the best results on the COCO dataset. K-Max DeepLab also achieves the best results on Cityscapes. OneFormer [271] performs the best on ADE20K.

B. Re-Benchmarking for Image Segmentation

• Motivation: We perform re-benchmarking on two segmentation tasks, semantic segmentation and panoptic segmentation, on four public datasets: ADE20K, COCO, Cityscapes, and COCO-Stuff. In particular, we want to explore the effect of the transformer decoder. Thus, we use the same encoder [7] and neck architecture [25] for a fair comparison.

• Results on Semantic Segmentation: As shown in Table IX, we carry out re-benchmark experiments for SS. In particular, using the same neck architecture, SegFormer+ [123] achieves the best results on COCO-Stuff and Cityscapes. Mask2Former achieves the best result on the ADE20K dataset.

• Results on Instance Segmentation: In Table X, we also explore the instance segmentation methods on the COCO dataset. Under the same neck architecture, we observe gains for both K-Net and MaskFormer compared with the original results in Table VI. Mask2Former achieves the best results.

• Results on Panoptic Segmentation: In Table XI, we present the re-benchmark results for PS. In particular, Mask2Former achieves the best results on all three datasets. Compared with K-Net and MaskFormer, both K-Net+ and MaskFormer+ achieve over 3-4% improvements due to the usage of a stronger neck [25], which closes the gaps between their original results and Mask2Former.
TABLE XII: BENCHMARK RESULTS ON VIDEO INSTANCE SEGMENTATION VALIDATION DATASET.
TABLE XIII: BENCHMARK RESULTS ON VIDEO PANOPTIC SEGMENTATION VALIDATION DATASETS.

C. Main Results for Video Segmentation Datasets

• Results on Video Semantic Segmentation: In Table VIII, we report VSS results on VSPW. Among the methods, TubeFormer [173] achieves the best results.

• Results on Video Instance Segmentation: In Table XII, for VIS, CTVIS [357] achieves the best results on YT-VIS-2019 and YT-VIS-2021 using the ResNet50 backbone. GenVIS [358] achieves better results on OVIS using the ResNet50 backbone. When adopting the Swin-L backbone, CTVIS [357] achieves the best results.

• Results on Video Panoptic Segmentation: In Table XIII, for VPS, SLOT-VPS [360] achieves the best results on Cityscapes-VPS. TubeLink [241] achieves the best results on the VIP-Seg dataset. Video K-Net [172] achieves the best results on the KITTI-STEP dataset.

VI. FUTURE DIRECTIONS

• General and Unified Image/Video Segmentation: The trend of using transformers to unify diverse segmentation tasks is gaining traction. Recent studies [26], [126], [129], [163], [172], [173], [271] have employed query-based transformers for various segmentation tasks within a unified architecture. A promising research avenue is the integration of image and video segmentation tasks in a universal model across different datasets. Such models may achieve general, robust segmentation capabilities in multiple scenarios, like detecting rare classes for improved robotic decision-making. This approach holds significant practical value, particularly in applications like robot navigation and autonomous vehicles.

• Joint Learning with Multi-Modality: Transformers' inherent flexibility in handling various modalities positions them as ideal for unifying vision and language tasks. Segmentation tasks, which offer pixel-level information, can enhance associated vision-language tasks such as text-image retrieval and caption generation [360]. Recent studies [237], [361], [362], [363] demonstrate the potential of a universal transformer architecture that concurrently learns segmentation alongside visual language tasks, paving the way for integrated multi-modal segmentation learning.

• Life-Long Learning for Segmentation: Existing segmentation methods are usually benchmarked on closed-world datasets with a set of predefined categories, i.e., assuming that the training and testing samples have the same categories and feature spaces that are known beforehand. However, realistic scenarios are usually open-world and non-stationary, where novel classes may occur continuously [364]. For example, unseen situations can occur unexpectedly in self-driving vehicles and medical diagnoses. There is a distinct gap between the performance and capabilities of existing methods in realistic and open-world settings. Thus, it is desirable to gradually and continuously incorporate novel concepts into the existing knowledge base of segmentation models, making the model capable of lifelong learning.

• Long Video Segmentation in Dynamic Scenes: Long videos introduce several challenges [56], [57], [365]. First, existing video segmentation methods are designed to work with short video inputs and may struggle to associate instances over longer periods. Thus, new methods must incorporate long-term memory design and consider the association of instances over a more extended period. Second, maintaining segmentation mask consistency over long periods can be difficult, especially when instances move in and out of the scene. This requires new methods to incorporate temporal consistency constraints and to update the segmentation masks over time. Third, heavy occlusion can occur in long videos, making it challenging to segment all instances accurately. New methods should incorporate occlusion reasoning and detection to improve segmentation accuracy. Finally, long video inputs often involve diverse scenes, which can bring domain robustness challenges for video segmentation models. New methods must incorporate domain adaptation techniques to ensure the model can handle diverse scene inputs. In short, addressing these challenges requires the development of new long video segmentation models that incorporate advanced memory design, temporal consistency constraints, occlusion reasoning, and detection techniques.

• Generative Segmentation: With the rise of stronger generative models, recent works [366], [367], [368] solve image segmentation problems via generative modeling, inspired by the stronger transformer decoder and high-resolution representation in diffusion models [369]. Adopting a generative design avoids the transformer decoder and object query design, which makes the entire framework simpler. However, these generative models typically introduce a complicated training pipeline. A simpler training pipeline is needed for further research.
• Segmentation with Visual Reasoning: Visual reasoning [370], [371], [372], [373], [374] requires the robot to understand the connections between objects in the scene, and this understanding plays a crucial role in motion planning. Previous research has explored using segmentation results as input to visual reasoning models for various applications, such as object tracking and scene understanding. Joint segmentation and visual reasoning can be a promising direction, with the potential for mutual benefits for both segmentation and relation classification. By incorporating visual reasoning into the segmentation process, researchers can leverage the power of reasoning to improve the segmentation accuracy, while segmentation can provide better input for visual reasoning.
VII. CONCLUSION

This survey provides a comprehensive review of recent advancements in transformer-based visual segmentation, which, to our knowledge, is the first of its kind. The paper covers essential background knowledge and an overview of previous works before transformers and summarizes more than 120 deep-learning models for various segmentation tasks. The recent works are grouped into six categories based on the meta-architecture of the segmenter. Additionally, the paper reviews five specific subfields and reports the results of several representative segmentation methods on widely-used datasets. To ensure fair comparisons, we also re-benchmark several representative works under the same settings. Finally, we conclude by pointing out future research directions for transformer-based visual segmentation.
REFERENCES

[1] J. Malik, S. Belongie, T. Leung, and J. Shi, "Contour and texture analysis for image segmentation," Int. J. Comput. Vis., vol. 43, pp. 7–27, 2001.
[2] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[3] X. Y. Stella and J. Shi, "Multiclass spectral clustering," in Proc. IEEE Int. Conf. Comput. Vis., 2003, pp. 313–319.
[4] F. Schroff, A. Criminisi, and A. Zisserman, "Object class segmentation using random forests," in Proc. Brit. Mach. Vis. Conf., 2008, pp. 54.1–54.10.
[5] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," Int. J. Comput. Vis., vol. 1, pp. 321–331, 1988.
[6] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, pp. 211–252, 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[8] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015, pp. 1–16.
[9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3431–3440.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[11] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6230–6239.
[12] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, and G. Wang, "Context contrasted feature and gated multi-scale aggregation for scene segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2393–2402.
[13] A. Vaswani et al., "Attention is all you need," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6000–6010.
[14] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, pp. 1735–1780, 1997.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Annu. Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2019, pp. 4171–4186.
[16] T. Brown et al., "Language models are few-shot learners," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 159.
[17] H. Zhao et al., "PSANet: Point-wise spatial attention network for scene parsing," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 270–286.
[18] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7794–7803.
[19] H. Zhao, J. Jia, and V. Koltun, "Exploring self-attention for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10073–10082.
[20] H. Hu, Z. Zhang, Z. Xie, and S. Lin, "Local relation networks for image recognition," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 3463–3472.
[21] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–21.
[22] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 213–229.
[23] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 9992–10002.
[24] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 15979–15988.
[25] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," in Proc. Int. Conf. Learn. Representations, 2021, pp. 1–16.
[26] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, "MaX-DeepLab: End-to-end panoptic segmentation with mask transformers," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 5463–5474.
[27] H. Chen et al., "Pre-trained image processing transformer," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12299–12310.
[28] G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?," in Proc. Int. Conf. Mach. Learn., 2021, pp. 813–824.
[29] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, "Point transformer," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 16239–16248.
[30] X. Pan, Z. Xia, S. Song, L. E. Li, and G. Huang, "3D object detection with pointformer," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 7463–7472.
[31] K. Han et al., "A survey on vision transformer," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87–110, Jan. 2023.
[32] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, "Transformers in vision: A survey," ACM Comput. Surv., vol. 54, 2022, Art. no. 200.
[33] T. Lin, Y. Wang, X. Liu, and X. Qiu, "A survey of transformers," AI Open, vol. 3, pp. 111–132, 2022.
[34] J. Selva, A. S. Johansen, S. Escalera, K. Nasrollahi, T. B. Moeslund, and A. Clapés, "Video transformers: A survey," 2022, arXiv:2201.05991.
[35] J. Lahoud et al., "3D vision with transformers: A survey," 2022, arXiv:2208.04309.
[36] P. Xu, X. Zhu, and D. A. Clifton, "Multimodal learning with transformers: A survey," 2022, arXiv:2206.06488.
[37] S. Minaee, Y. Y. Boykov, F. Porikli, A. J. Plaza, N. Kehtarnavaz, and D. Terzopoulos, "Image segmentation using deep learning: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3523–3542, Jul. 2022.
[38] S. Hao, Y. Zhou, and Y. Guo, "A brief survey on semantic segmentation with deep learning," Neurocomputing, vol. 406, pp. 302–321, 2020.
[39] T. Zhou, F. Porikli, D. J. Crandall, L. Van Gool, and W. Wang, "A survey on deep learning technique for video segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 7099–7122, Jun. 2023.
[40] J. Miao et al., "Large-scale video panoptic segmentation in the wild: A benchmark," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 21001–21011.
[41] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, pp. 303–338, 2010.
[42] R. Mottaghi et al., "The role of context for object detection and semantic segmentation in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 891–898.
[43] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[44] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Semantic understanding of scenes through the ADE20K dataset," Int. J. Comput. Vis., vol. 127, pp. 302–321, 2019.
[45] M. Cordts et al., "The CityScapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3213–3223.
[46] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, "The mapillary vistas dataset for semantic understanding of street scenes," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5000–5009.
[47] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, "Modeling context in referring expressions," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 69–85.
[48] C. Liu, H. Ding, and X. Jiang, "GRES: Generalized referring expression segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 23592–23601.
[49] J. Miao, Y. Wei, Y. Wu, C. Liang, G. Li, and Y. Yang, "VSPW: A large-scale dataset for video scene parsing in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 4133–4143.
[50] L. Yang, Y. Fan, and N. Xu, "Video instance segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 5187–5196.
[51] J. Qi et al., "Occluded video instance segmentation: A benchmark," Int. J. Comput. Vis., vol. 130, no. 8, pp. 2022–2039, 2022.
[52] D. Kim, S. Woo, J.-Y. Lee, and I. S. Kweon, "Video panoptic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9856–9865.
[53] M. Weber et al., "STEP: Segmenting and tracking every pixel," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021.
[54] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, "The 2017 Davis challenge on video object segmentation," 2017, arXiv:1704.00675.
[55] N. Xu et al., "YouTube-VOS: A large-scale video object segmentation benchmark," 2018, arXiv:1809.03327. [Online]. Available: http://arxiv.org/abs/1809.03327
[56] H. Ding, C. Liu, S. He, X. Jiang, P. H. Torr, and S. Bai, "MOSE: A new dataset for video object segmentation in complex scenes," in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 20167–20177.
[57] H. Ding, C. Liu, S. He, X. Jiang, and C. C. Loy, "MeViS: A large-scale benchmark for video segmentation with motion expressions," in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 2694–2703.
[58] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "Learning a discriminative feature network for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1857–1866.
[59] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, and G. Wang, "Semantic segmentation with context encoding and multi-path decoding," IEEE Trans. Image Process., vol. 29, pp. 3520–3533, 2020.
[60] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, "Large kernel matters–improve semantic segmentation by global convolutional network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1743–1751.
[61] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, and G. Wang, "Semantic correlation promoted shape-variant context for segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8885–8894.
[62] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," 2017, arXiv:1706.05587.
[63] B. Shuai, H. Ding, T. Liu, G. Wang, and X. Jiang, "Toward achieving robust low-level and high-level scene parsing," IEEE Trans. Image Process., vol. 28, no. 3, pp. 1378–1390, Mar. 2019.
[64] X. Li, H. Zhao, L. Han, Y. Tong, S. Tan, and K. Yang, "Gated fully fusion for semantic segmentation," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 11418–11425.
[65] X. Li et al., "Global aggregation then local distribution for scene parsing," IEEE Trans. Image Process., vol. 30, pp. 6829–6842, 2021.
[66] Y. Yuan, X. Chen, and J. Wang, "Object-contextual representations for semantic segmentation," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 173–190.
[67] L. Zhang, X. Li, A. Arnab, K. Yang, Y. Tong, and P. H. Torr, "Dual graph convolutional network for semantic segmentation," in Proc. Brit. Mach. Vis. Conf., 2019, pp. 254–264.
[68] X. Li et al., "Semantic flow for fast and accurate scene parsing," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 775–793.
[69] X. Li et al., "SFNet: Faster, accurate, and domain agnostic semantic segmentation via semantic flow," 2022, arXiv:2207.04415.
[70] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, "BiSeNet: Bilateral segmentation network for real-time semantic segmentation," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 334–349.
[71] A. Kirillov, Y. Wu, K. He, and R. Girshick, "PointRend: Image segmentation as rendering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9796–9805.
[72] H. Ding, X. Jiang, A. Q. Liu, N. M. Thalmann, and G. Wang, "Boundary-aware feature propagation for scene segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 6818–6828.
[73] X. Li et al., "Improving semantic segmentation via decoupled body and edge supervision," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 435–452.
[74] H. He et al., "Enhanced boundary learning for glass-like object segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 15839–15848.
[75] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, "Dual attention network for scene segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3146–3154.
[76] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
[77] Z. Tian, C. Shen, and H. Chen, "Conditional convolutions for instance segmentation," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 282–298.
[78] D. Neven, B. D. Brabandere, M. Proesmans, and L. V. Gool, "Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8837–8845.
[79] B. De Brabandere, D. Neven, and L. Van Gool, "Semantic instance segmentation with a discriminative loss function," 2017, arXiv:1708.02551.
[80] K. Chen et al., "Hybrid task cascade for instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 4974–4983.
[81] R. Zhang, Z. Tian, C. Shen, M. You, and Y. Yan, "Mask encoding for single shot instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10223–10232.
[82] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 9156–9165.
[83] S. Qiao, L.-C. Chen, and A. Yuille, "DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10213–10224.
[84] B. Cheng et al., "Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12472–12482.
[85] X. Chen, R. Girshick, K. He, and P. Dollár, "TensorMask: A foundation for dense object segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 2061–2069.
[86] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, "SOLOv2: Dynamic and fast instance segmentation," in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1487.
[87] Y. Xiong et al., "UPSNet: A unified panoptic segmentation network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8818–8826.
[88] Y. Li et al., "Fully convolutional networks for panoptic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 214–223.
[89] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen, "Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 108–126.
[90] R. Gadde, V. Jampani, and P. V. Gehler, "Semantic video CNNs through representation warping," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 4463–4472.
[91] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell, "Clockwork convnets for video semantic segmentation," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 852–868.
[92] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei, "Deep feature flow for video recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4141–4150.
[93] G. Bertasius and L. Torresani, "Classifying, segmenting, and tracking object instances in video with mask propagation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9736–9745.
[94] Y. Fu, L. Yang, D. Liu, T. S. Huang, and H. Shi, "CompFeat: Comprehensive feature aggregation for video instance segmentation," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 1361–1369.
[95] X. Li et al., "Improving video instance segmentation via temporal pyramid routing," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 5, pp. 6594–6601, May 2023.
[96] S. Qiao, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, "VIP-DeepLab: Learning visual perception with depth-aware video panoptic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3997–4008.
[97] X. Zhu, H. Hu, S. Lin, and J. Dai, "Deformable ConvNets v2: More deformable, better results," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9308–9316.
[98] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, "A benchmark dataset and evaluation methodology for video object segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 724–732.
[99] P. Voigtlaender et al., "MOTS: Multi-object tracking and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7942–7951.
[100] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 652–660.
[101] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 5099–5108.
[102] B. Yang et al., "Learning object bounding boxes for 3D instance segmentation on point clouds," in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 6737–6746.
[103] L. Yi, W. Zhao, H. Wang, M. Sung, and L. J. Guibas, "GSPN: Generative shape proposal network for 3D instance segmentation in point cloud," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3947–3956.
[104] W. Wang, R. Yu, Q. Huang, and U. Neumann, "SGPN: Similarity group proposal network for 3D point cloud instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2569–2578.
[105] L. Jiang, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, "PointGroup: Dual-set point grouping for 3D instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 4866–4875.
[106] J. Mao, X. Wang, and H. Li, "Interpolated convolutional networks for 3D point cloud understanding," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1578–1587.
[107] Q. Hu et al., "RandLA-Net: Efficient semantic segmentation of large-scale point clouds," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11105–11114.
[118] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: A simple and strong anchor-free object detector," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 4, pp. 1922–1933, Apr. 2022.
[119] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "NAS-FPN: Learning scalable feature pyramid architecture for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7036–7045.
[120] F. Li et al., "Lite DETR: An interleaved multi-scale encoder for efficient DETR," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 18558–18567.
[121] H. W. Kuhn, "The Hungarian method for the assignment problem," Nav. Res. Logistics Quart., vol. 2, pp. 83–97, 1955.
[122] F. Milletari, N. Navab, and S. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in Proc. Int. Conf. 3D Vis., 2016, pp. 565–571.
[123] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 12077–12090.
[124] S. Zheng et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 6881–6890.
[125] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 833–851.
[126] X. Li et al., "OMG-seg: Is one model good enough for all segmentation?," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 27948–27959.
[127] A. Athar, A. Hermans, J. Luiten, D. Ramanan, and B. Leibe, "TarViS: A unified approach for target-based video segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 18738–18748.
[128] H. Yuan et al., "PolyphonicFormer: Unified query learning for depth-aware video panoptic segmentation," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 582–599.
[129] J. Wu, Y. Jiang, B. Yan, H. Lu, Z. Yuan, and P. Luo, "UniRef: Segment every reference object in spatial and temporal spaces," 2023, arXiv:2312.15715.
[130] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–10357.
[131] H. Fan et al., "Multiscale vision transformers," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 6804–6815.
[132] Y. Li et al., "MViTv2: Improved multiscale vision transformers for classification and detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 4794–4804.
[133] Y. Lee, J. Kim, J. Willette, and S. J. Hwang, "MPViT: Multi-path vision transformer for dense prediction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022.
[108] R. Cheng, R. Razani, E. Taghavi, E. Li, and B. Liu, “(AF)2-S3Net: transformer for dense prediction,” in Proc. IEEE Conf. Comput. Vis.
Attentive feature fusion with adaptive feature selection for sparse se- Pattern Recognit., 2022, pp. 7277–7286.
mantic segmentation network,” in Proc. IEEE Conf. Comput. Vis. Pattern [134] A. Ali et al., “XCiT: Cross-covariance image transformers,” in Proc. Int.
Recognit., 2021, pp. 12547–12556. Conf. Neural Inf. Process. Syst., 2021, pp. 20014–20027.
[109] Z. Zhou, Y. Zhang, and H. Foroosh, “Panoptic-PolarNet: Proposal-free [135] W. Wang et al., “Pyramid vision transformer: A versatile backbone for
LiDAR point cloud panoptic segmentation,” in Proc. IEEE Conf. Comput. dense prediction without convolutions,” in Proc. IEEE Int. Conf. Comput.
Vis. Pattern Recognit., 2021, pp. 13194–13203. Vis., 2021, pp. 548–558.
[110] S. Xu, R. Wan, M. Ye, X. Zou, and T. Cao, “Sparse cross-scale attention [136] C.-F. R. Chen, Q. Fan, and R. Panda, “CrossViT: Cross-attention multi-
network for efficient LiDAR panoptic segmentation,” in Proc. AAAI Conf. scale vision transformer for image classification,” in Proc. IEEE Int. Conf.
Artif. Intell., 2022, pp. 2920–2928. Comput. Vis., 2021, pp. 347–356.
[111] F. Hong, H. Zhou, X. Zhu, H. Li, and Z. Liu, “LiDAR-based panoptic seg- [137] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, “Fea-
mentation via dynamic shifting network,” in Proc. IEEE Conf. Comput. ture pyramid transformer,” in Proc. Eur. Conf. Comput. Vis., 2020,
Vis. Pattern Recognit., 2021, pp. 13090–13099. pp. 323–339.
[112] M. Aygun et al., “4D panoptic LiDAR segmentation,” in Proc. IEEE Conf. [138] W. Xu, Y. Xu, T. Chang, and Z. Tu, “Co-scale conv-attentional
Comput. Vis. Pattern Recognit., 2021, pp. 5527–5537. image transformers,” in Proc. IEEE Int. Conf. Comput. Vis., 2021,
[113] X. Zhu et al., “Cylindrical and asymmetrical 3D convolution networks pp. 9961–9970.
for LiDAR segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern [139] J. Guo et al., “CMT: Convolutional neural networks meet vision trans-
Recognit., 2021, pp. 9939–9948. formers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
[114] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, “A2-Nets: Double pp. 12165–12175.
attention networks,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, [140] X. Chu et al., “Twins: Revisiting the design of spatial attention in
pp. 350–359. vision transformers,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021,
[115] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real- pp. 9355–9366.
time object detection with region proposal networks,” in Proc. Int. Conf. [141] H. Wu et al., “CvT: Introducing convolutions to vision transformers,” in
Neural Inf. Process. Syst., 2015, pp. 91–99. Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 22–31.
[116] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. [142] Y. Xu, Q. Zhang, J. Zhang, and D. Tao, “ViTAE: Vision transformer
Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE advanced by exploring intrinsic inductive Bias,” in Proc. Int. Conf. Neural
Conf. Comput. Vis. Pattern Recognit., 2017, pp. 936–944. Inf. Process. Syst., 2021, pp. 28522–28535.
[117] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for [143] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie,
dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, “A ConvNet for the 2020s,” in Proc. IEEE Conf. Comput. Vis. Pattern
pp. 2999–3007. Recognit., 2022, pp. 11966–11976.
10158 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

[144] Q. Han et al., “On the connection between local attention and dynamic [171] J. Wu, Y. Jiang, S. Bai, W. Zhang, and X. Bai, “SeqFormer: Sequential
depth-wise convolution,” in Proc. Int. Conf. Learn. Representations, transformer for video instance segmentation,” in Proc. Eur. Conf. Comput.
2022, pp. 1–25. Vis., 2022, pp. 553–569.
[145] M.-H. Guo, C.-Z. Lu, Q. Hou, Z. Liu, M.-M. Cheng, and S.-M. [172] X. Li et al., “Video K-Net: A simple, strong, and unified baseline for video
Hu, “SegNeXt: Rethinking convolutional attention design for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
segmentation,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 18825–18835.
pp. 1140–1156. [173] D. Kim et al., “TubeFormer-DeepLab: Video mask transformer,” in Proc.
[146] W. Yu et al., “MetaFormer is actually what you need for vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 13904–13914.
IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10809–10819. [174] D. Meng et al., “Conditional DETR for fast training conver-
[147] J. Dai et al., “Demystify transformers & convolutions in modern image gence,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021,
deep networks,” 2022, arXiv:2211.05781. pp. 3631–3640.
[148] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework [175] X. Chen, F. Wei, G. Zeng, and J. Wang, “Conditional DETR v2: Efficient
for contrastive learning of visual representations,” in Proc. Int. Conf. detection transformer with box queries,” 2022, arXiv:2207.08914.
Mach. Learn., 2020, pp. 1597–1607. [176] Y. Wang, X. Zhang, T. Yang, and J. Sun, “Anchor DETR: Query design
[149] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for transformer-based detector,” in Proc. AAAI Conf. Artif. Intell., 2022,
for unsupervised visual representation learning,” in Proc. IEEE Conf. pp. 2567–2575.
Comput. Vis. Pattern Recognit., 2020, pp. 9726–9735. [177] S. Liu et al., “DAB-DETR: Dynamic anchor boxes are better queries for
[150] X. Chen, S. Xie, and K. He, “An empirical study of training self- DETR,” in Proc. Int. Conf. Learn. Representations, 2022, pp. 1–20.
supervised vision transformers,” in Proc. IEEE Int. Conf. Comput. Vis., [178] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “DN-
2021, pp. 9620–9629. DETR: Accelerate DETR training by introducing query denois-
[151] H. Bao, L. Dong, and F. Wei, “BEiT: BERT pre-training of image trans- ing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
formers,” in Proc. Int. Conf. Learn. Representations, 2022, pp. 14668– pp. 13609–13617.
14678. [179] H. Zhang et al., “DINO: DETR with improved denoising anchor boxes for
[152] C. Wei, H. Fan, S. Xie, C.-Y. Wu, A. Yuille, and C. Feicht- end-to-end object detection,” in Proc. Int. Conf. Learn. Representations,
enhofer, “Masked feature prediction for self-supervised visual pre- 2023, pp. 1–19.
training,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, [180] F. Li et al., “Mask DINO: Towards a unified transformer-based framework
pp. 14648–14658. for object detection and segmentation,” in Proc. IEEE Conf. Comput. Vis.
[153] Y. Gandelsman, Y. Sun, X. Chen, and A. A. Efros, “Test-time training Pattern Recognit., 2023, pp. 3041–3050.
with masked autoencoders,” in Proc. Int. Conf. Neural Inf. Process. Syst., [181] W. Wang, J. Liang, and D. Liu, “Learning equivariant segmentation with
2022, Art. no. 2130. instance-unique querying,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
[154] R. Hu, S. Debnath, S. Xie, and X. Chen, “Exploring long-sequence 2022, Art. no. 932.
masked autoencoders,” 2022, arXiv:2210.07224. [182] D. Jia et al., “DETRs with hybrid matching,” in Proc. IEEE Conf. Comput.
[155] P. Gao, T. Ma, H. Li, J. Dai, and Y. Qiao, “ConvMAE: Masked convolution Vis. Pattern Recognit., 2023, pp. 19702–19712.
meets masked autoencoders,” in Proc. Int. Conf. Neural Inf. Process. [183] Q. Chen et al., “Group DETR: Fast DETR training with group-wise one-
Syst., 2022, pp. 35632–35644. to-many assignment,” 2022, arXiv:2207.13085.
[156] A. Radford et al., “Learning transferable visual models from natu- [184] Z. Zong, G. Song, and Y. Liu, “DETRs with collaborative hybrid assign-
ral language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, ments training,” 2022, arXiv:2211.12860.
pp. 8748–8763. [185] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer, “Track-
[157] Y. Li, H. Fan, R. Hu, C. Feichtenhofer, and K. He, “Scaling language- Former: Multi-object tracking with transformers,” in Proc. IEEE Conf.
image pre-training via masking,” 2022, arXiv:2212.00794. Comput. Vis. Pattern Recognit., 2022, pp. 8834–8844.
[158] P. Sun et al., “SparseR-CNN: End-to-end object detection with learnable [186] P. Sun et al., “TransTrack: Multiple-object tracking with transformer,”
proposals,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, 2020, arXiv: 2012.15460.
pp. 14449–14458. [187] F. Zeng, B. Dong, T. Wang, C. Chen, X. Zhang, and Y. Wei, “MOTR: End-
[159] Y. Fang et al., “Instances as queries,” in Proc. IEEE Int. Conf. Comput. to-end multiple-object tracking with transformer,” in Proc. Eur. Conf.
Vis., 2021, pp. 6890–6899. Comput. Vis., 2022, pp. 659–675.
[160] J. Hu et al., “ISTR: End-to-end instance segmentation via transformers,” [188] D.-A. Huang, Z. Yu, and A. Anandkumar, “MinVIS: A minimal video
2021, arXiv:2105.00637. instance segmentation framework without video-based training,” in Proc.
[161] B. Dong, F. Zeng, T. Wang, X. Zhang, and Y. Wei, “SOLQ: Segmenting Int. Conf. Neural Inf. Process. Syst., 2022, pp. 31265–31277.
objects by learning queries,” in Proc. Int. Conf. Neural Inf. Process. Syst., [189] J. Wu, Q. Liu, Y. Jiang, S. Bai, A. Yuille, and X. Bai, “In defense of online
2021, pp. 21898–21909. models for video instance segmentation,” in Proc. Eur. Conf. Comput.
[162] H. He et al., “BoundarySqueeze: Image segmentation as boundary Vis., 2022, pp. 588–605.
squeezing,” 2021, arXiv:2105.11668. [190] X. Li, S. Xu, Y. Yang, G. Cheng, Y. Tong, and D. Tao, “Panoptic-
[163] W. Zhang, J. Pang, K. Chen, and C. C. Loy, “K-Net: Towards unified PartFormer: Learning a unified model for panoptic part segmentation,”
image segmentation,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, in Proc. Eur. Conf. Comput. Vis., 2022, pp. 729–747.
pp. 10326–10338. [191] N. Gao et al., “PanopticDepth: A unified framework for depth-aware
[164] B. Cheng, A. G. Schwing, and A. Kirillov, “Per-pixel classification is not panoptic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
all you need for semantic segmentation,” in Proc. Int. Conf. Neural Inf. nit., 2022, pp. 1622–1632.
Process. Syst., 2021, pp. 17864–17875. [192] S. Xu, X. Li, J. Wang, G. Cheng, Y. Tong, and D. Tao, “Fashion-
[165] Z. Li et al., “Panoptic SegFormer: Delving deeper into panoptic seg- former: A simple, effective and unified baseline for human fashion
mentation with transformers,” in Proc. IEEE Conf. Comput. Vis. Pattern segmentation and recognition,” in Proc. Eur. Conf. Comput. Vis., 2022,
Recognit., 2022, pp. 1270–1279. pp. 545–563.
[166] Y. Wang et al., “End-to-end video instance segmentation with trans- [193] Y. Xu et al., “Multi-task learning with multi-query transformer for dense
formers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, prediction,” 2022, arXiv:2205.14354.
pp. 8741–8750. [194] H. Ye and D. Xu, “Inverted pyramid multi-task transformer for
[167] Q. Zhou et al., “TransVOD: End-to-end video object detection with dense scene understanding,” in Proc. Eur. Conf. Comput. Vis., 2022,
spatial-temporal transformers,” IEEE Trans. Pattern Anal. Mach. Intell., pp. 514–530.
vol. 45, no. 6, pp. 7853–7869, Jun. 2023. [195] H. Ding, C. Liu, S. Wang, and X. Jiang, “Vision-language transformer
[168] S. Yang et al., “Temporally efficient vision transformer for video instance and query generation for referring segmentation,” in Proc. IEEE Int. Conf.
segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, Comput. Vis., 2021, pp. 16301–16310.
pp. 2875–2885. [196] Z. Yang, J. Wang, Y. Tang, K. Chen, H. Zhao, and P. H. Torr,
[169] B. Cheng, A. Choudhuri, I. Misra, A. Kirillov, R. Girdhar, and “LAVT: Language-aware vision transformer for referring image seg-
A. G. Schwing, “Mask2Former for video instance segmentation,” mentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
2021, arXiv:2112.10764. pp. 18134–18144.
[170] S. Hwang, M. Heo, S. W. Oh, and S. J. Kim, “Video instance segmenta- [197] N. Kim, D. Kim, C. Lan, W. Zeng, and S. Kwak, “ReSTR: Convolution-
tion using inter-frame communication transformers,” in Proc. Int. Conf. free referring image segmentation using transformers,” in Proc. IEEE
Neural Inf. Process. Syst., 2021, pp. 13352–13363. Conf. Comput. Vis. Pattern Recognit., 2022, pp. 18124–18133.
LI et al.: TRANSFORMER-BASED VISUAL SEGMENTATION: A SURVEY 10159

[198] Z. Wang et al., “CRIS: CLIP-driven referring image segmenta- [224] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer
tion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, for semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2021,
pp. 11676–11685. pp. 7242–7252.
[199] A. Botach, E. Zheltonozhskii, and C. Baskin, “End-to-end referring video [225] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-
object segmentation with multimodal transformers,” in Proc. IEEE Conf. attention mask transformer for universal image segmentation,” in Proc.
Comput. Vis. Pattern Recognit., 2022, pp. 4975–4985. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 1280–1289.
[200] Z. Ding, T. Hui, J. Huang, X. Wei, J. Han, and S. Liu, “Language- [226] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detec-
bridged spatial-temporal interaction for referring video object segmen- tron2,” 2019. [Online]. Available: https://github.com/facebookresearch/
tation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, detectron2
pp. 4954–4963. [227] Q. Yu et al., “CMT-DeepLab: Clustering mask transformers for panoptic
[201] J. Wu, X. Li, X. Li, H. Ding, Y. Tong, and D. Tao, “Towards robust segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
referring image segmentation,” 2022, arXiv:2209.09554. pp. 2550–2560.
[202] J. Wu, Y. Jiang, P. Sun, Z. Yuan, and P. Luo, “Language as queries for [228] Q. Yu et al., “k-means mask transformer,” in Proc. Eur. Conf. Comput.
referring video object segmentation,” in Proc. IEEE Conf. Comput. Vis. Vis., 2022, pp. 288–307.
Pattern Recognit., 2022, pp. 4964–4974. [229] T. Cheng et al., “Sparse instance activation for real-time instance seg-
[203] G. Zhang, G. Kang, Y. Yang, and Y. Wei, “Few-shot segmentation via mentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
cycle-consistent transformer,” in Proc. Int. Conf. Neural Inf. Process. pp. 4423–4432.
Syst., 2021, pp. 21984–21996. [230] G. Zhang, Z. Luo, Y. Yu, K. Cui, and S. Lu, “Accelerating DETR con-
[204] Z. Yang, Y. Wei, and Y. Yang, “Associating objects with transformers for vergence via semantic-aligned matching,” in Proc. IEEE Conf. Comput.
video object segmentation,” in Proc. Int. Conf. Neural Inf. Process. Syst., Vis. Pattern Recognit., 2022, pp. 939–948.
2021, pp. 2491–2502. [231] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li, “Fast convergence of
[205] G. Park, S. Son, J. Yoo, S. Kim, and N. Kwak, “MatteFormer: DETR with spatially modulated co-attention,” in Proc. IEEE Int. Conf.
Transformer-based image matting via prior-tokens,” in Proc. IEEE Conf. Comput. Vis., 2021, pp. 3601–3610.
Comput. Vis. Pattern Recognit., 2022, pp. 11686–11696. [232] Z. Gao, L. Wang, B. Han, and S. Guo, “AdaMixer: A fast-converging
[206] B. Shi et al., “A transformer-based decoder for semantic segmentation query-based object detector,” in Proc. IEEE Conf. Comput. Vis. Pattern
with multi-level context mining,” in Proc. Eur. Conf. Comput. Vis., 2022, Recognit., 2022, pp. 5354–5363.
pp. 624–639. [233] M. Zheng et al., “End-to-end object detection with adaptive clustering
[207] F. Lin, Z. Liang, J. He, M. Zheng, S. Tian, and K. Chen, “Struct- transformer,” in Proc. Brit. Mach. Vis. Conf., 2021, pp. 226–236.
Token: Rethinking semantic segmentation with structural prior,” [234] X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, “Dynamic
2022, arXiv:2203.12612. DETR: End-to-end object detection with dynamic attention,” in Proc.
[208] Y. Yu, J. Yuan, G. Mittal, L. Fuxin, and M. Chen, “BATMAN: Bilat- IEEE Int. Conf. Comput. Vis., 2021, pp. 2968–2977.
eral attention transformer in motion-appearance neighboring space for [235] B. Roh, J. Shin, W. Shin, and S. Kim, “Sparse DETR: Efficient end-to-
video object segmentation,” in Proc. Eur. Conf. Comput. Vis., 2022, end object detection with learnable sparsity,” in Proc. Int. Conf. Learn.
pp. 612–629. Representations, 2022, pp. 1–23.
[209] S. Jiao et al., “Mask matching transformer for few-shot segmentation,” [236] M. Heo, S. Hwang, S. W. Oh, J.-Y. Lee, and S. J. Kim, “VITA: Video
2022, arXiv:2301.01208. instance segmentation via object token association,” in Proc. Int. Conf.
[210] Z. Liu et al., “Swin transformer V2: Scaling up capacity and reso- Neural Inf. Process. Syst., 2022, pp. 23109–23120.
lution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, [237] X. Zou et al., “Generalized decoding for pixel, image and lan-
pp. 11999–12009. guage,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023,
[211] S. Chen, E. Xie, C. Ge, R. Chen, D. Liang, and P. Luo, “CycleMLP: A pp. 15116–15127.
MLP-like architecture for dense prediction,” in Proc. Int. Conf. Learn. [238] Z. Cai et al., “X-DETR: A versatile architecture for instance-wise
Representations, 2022, pp. 1–21. vision-language tasks,” in Proc. Eur. Conf. Comput. Vis., 2022,
[212] I. O. Tolstikhin et al., “MLP-mixer: An all-MLP architecture for vision,” pp. 290–308.
in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 24261–24272. [239] I. Shin et al., “Video-kMaX: A simple unified approach for online and
[213] J. Guo et al., “Hire-MLP: Vision MLP via hierarchical rearrangement,” near-online video panoptic segmentation,” 2023, arXiv:2304.04694.
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 816–826. [240] X. Li, H. Yuan, W. Zhang, J. Pang, G. Cheng, and C. C. Loy, “Tube-link:
[214] X. Chen et al., “Context autoencoder for self-supervised repre- A flexible cross tube baseline for universal video segmentation,” in Proc.
sentation learning,” Int. J. Comput. Vis., vol. 132, pp. 208–223, IEEE Int. Conf. Comput. Vis., 2023, pp. 13877–13887.
2024. [241] Z. Yao, J. Ai, B. Li, and C. Zhang, “Efficient DETR: Improving end-to-
[215] K. Tian, Y. Jiang, Q. Diao, C. Lin, L. Wang, and Z. Yuan, “Designing end object detector with dense prior,” 2021, arXiv:2104.01318.
BERT for convolutional networks: Sparse and hierarchical masked mod- [242] H. Zhang et al., “MP-Former: Mask-piloted transformer for image seg-
eling,” in Proc. Int. Conf. Learn. Representations, 2023, pp. 1–16. mentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023,
[216] M. Caron et al., “Emerging properties in self-supervised vision trans- pp. 18074–18083.
formers,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 9630–9640. [243] W. Wang, J. Zhang, Y. Cao, Y. Shen, and D. Tao, “Towards data-
[217] C. Jia et al., “Scaling up visual and vision-language representation efficient detection transformers,” in Proc. Eur. Conf. Comput. Vis., 2022,
learning with noisy text supervision,” in Proc. Int. Conf. Mach. Learn., pp. 88–105.
2021, pp. 4904–4916. [244] X. Li et al., “PanopticPartFormer: A unified and decoupled view for
[218] H. Li et al., “Uni-perceiver v2: A generalist model for large-scale vision panoptic part segmentation,” 2023, arXiv:2301.00954.
and vision-language tasks,” 2022, arXiv:2211.09808. [245] S. He and H. Ding, “Decoupling static and hierarchical motion perception
[219] Z. Tong, Y. Song, J. Wang, and L. Wang, “VideoMAE: Masked autoen- for referring video segmentation,” in Proc. IEEE Conf. Comput. Vis.
coders are data-efficient learners for self-supervised video pre-training,” Pattern Recognit., 2024, pp. 13332–13341.
2022, arXiv:2203.12602. [246] S. He, H. Ding, and W. Jiang, “Semantic-promoted debiasing and back-
[220] C. Feichtenhofer, H. Fan, Y. Li, and K. He, “Masked autoencoders as ground disambiguation for zero-shot instance segmentation,” in Proc.
spatiotemporal learners,” 2022, arXiv:2205.09113. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 19498–19507.
[221] Z. Liu et al., “Video swin transformer,” in Proc. IEEE Conf. Comput. Vis. [247] C. Liu, X. Li, and H. Ding, “Referring image editing: Object-level image
Pattern Recognit., 2022, pp. 3192–3201. editing via referring expressions,” in Proc. IEEE Conf. Comput. Vis.
[222] H. Ding, C. Liu, S. Wang, and X. Jiang, “VLT: Vision-language Pattern Recognit., 2024, pp. 13128–13138.
transformer and query generation for referring segmentation,” IEEE [248] C. Liu, X. Jiang, and H. Ding, “Instance-specific feature propaga-
Trans. Pattern Anal. Mach. Intell., vol. 45, no. 6, pp. 7900–7916, tion for referring segmentation,” IEEE Trans. Multimedia, vol. 25,
Jun. 2023. pp. 3657–3667, 2022.
[223] X. Yu, D. Shi, X. Wei, Y. Ren, T. Ye, and W. Tan, “SOIT: Segmenting [249] G. Luo et al., “Multi-task collaborative network for joint referring expres-
objects with instance-aware transformers,” in Proc. AAAI Conf. Artif. sion comprehension and segmentation,” in Proc. IEEE Conf. Comput. Vis.
Intell., 2022, pp. 3188–3196. Pattern Recognit., 2020, pp. 10031–10040.
10160 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

[250] D. Wu, X. Dong, L. Shao, and J. Shen, “Multi-level representation [277] S. He, H. Ding, and W. Jiang, “Primitive generation and semantic-related
learning with semantic alignment for referring video object segmen- alignment for universal zero-shot segmentation,” in Proc. IEEE Conf.
tation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, Comput. Vis. Pattern Recognit., 2023, pp. 11238–11247.
pp. 4986–4995. [278] H. Zhou et al., “Rethinking evaluation metrics of open-vocabulary seg-
[251] A. Kamath, M. Singh, Y. LeCun, I. Misra, G. Synnaeve, and N. Carion, mentaion,” 2023, arXiv:2311.03352.
“MDETR–modulated detection for end-to-end multi-modal understand- [279] W. Kuo, Y. Cui, X. Gu, A. Piergiovanni, and A. Angelova, “F-
ing,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 1760–1770. VLM: Open-vocabulary object detection upon frozen vision and
[252] L. Cao, Y. Guo, Y. Yuan, and Q. Jin, “Prototype as query for few shot language models,” in Proc. Int. Conf. Learn. Representations,
semantic segmentation,” 2022, arXiv:2211.14764. 2023, pp. 1–20.
[253] H. Ding, H. Zhang, C. Liu, and X. Jiang, “Deep interactive image [280] M. Maaz, H. Rasheed, S. Khan, F. S. Khan, R. M. Anwer, and M.-H.
matting with feature propagation,” IEEE Trans. Image Process., vol. 31, Yang, “Class-agnostic object detection with multi-modal transformer,”
pp. 2421–2432, 2022. in Proc. Eur. Conf. Comput. Vis., 2022, pp. 512–531.
[254] Y. Han et al., “Reference twice: A simple and unified baseline for few-shot [281] G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin, “Scaling open-vocabulary image
instance segmentation,” 2023, arXiv:2301.01156. segmentation with image-level labels,” in Proc. Eur. Conf. Comput. Vis.,
[255] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, 2022, pp. 540–557.
“PCT: Point cloud transformer,” Comput. Vis. Media, vol. 7, pp. 187–199, [282] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl,
2021. “Language-driven semantic segmentation,” in Proc. Int. Conf. Learn.
[256] X. Lai et al., “Stratified transformer for 3D point cloud segmentation,” in Representations, 2022, pp. 1–13.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8490–8499. [283] J. Wu et al., “Betrayed by captions: Joint caption grounding and gen-
[257] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-BERT: eration for open vocabulary instance segmentation,” in Proc. IEEE Int.
Pre-training 3D point cloud transformers with masked point mod- Conf. Comput. Vis., 2023, pp. 21881–21891.
eling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, [284] J. Qin et al., “FreeSeg: Unified, universal and open-vocabulary image
pp. 19291–19300. segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023,
[258] Y. Pang, W. Wang, F. E. H. Tay, W. Liu, Y. Tian, and L. Yuan, “Masked pp. 19446–19455.
autoencoders for point cloud self-supervised learning,” in Proc. Eur. Conf. [285] W. Wang, M. Feiszli, H. Wang, and D. Tran, “Unidentified video objects:
Comput. Vis., 2022, pp. 604–621. A benchmark for dense, open-world segmentation,” in Proc. IEEE Int.
[259] R. Zhang et al., “Point-M2AE: Multi-scale masked autoencoders for Conf. Comput. Vis., 2021, pp. 10756–10765.
hierarchical point cloud pre-training,” 2022, arXiv:2205.14401. [286] J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. De Mello,
[260] J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe, “Open-vocabulary panoptic segmentation with text-to-image diffusion
“Mask3D for 3D semantic instance segmentation,” in Proc. IEEE Int. models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023,
Conf. Robot. Autom., 2023, pp. 8216–8223. pp. 2955–2966.
[261] J. Sun, C. Qing, J. Tan, and X. Xu, “Superpoint transformer for 3D [287] A. Gupta, S. Narayan, K. Joseph, S. Khan, F. S. Khan, and M. Shah,
scene instance segmentation,” in Proc. AAAI Conf. Artif. Intell., 2023, “OW-DETR: Open-world detection transformer,” in Proc. IEEE Conf.
pp. 2393–2401. Comput. Vis. Pattern Recognit., 2022, pp. 9225–9234.
[262] L. Landrieu and M. Simonovsky, “Large-scale point cloud semantic [288] M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai, “Side adapter network for
segmentation with superpoint graphs,” in Proc. IEEE Conf. Comput. Vis. open-vocabulary semantic segmentation,” in Proc. IEEE Conf. Comput.
Pattern Recognit., 2018, pp. 4558–4567. Vis. Pattern Recognit., 2023, pp. 2945–2954.
[263] S. Su et al., “PUPS: Point cloud unified panoptic segmentation,” in Proc. [289] Z. Liu et al., “Open compound domain adaptation,” in Proc. IEEE Conf.
AAAI Conf. Artif. Intell., 2023, pp. 2339–2347. Comput. Vis. Pattern Recognit., 2020, pp. 12403–12412.
[264] J. Behley et al., “A dataset for semantic segmentation of point cloud [290] Y. Yang and S. Soatto, “FDA: Fourier domain adaptation for semantic
sequences,” 2019, arXiv: 1904.01416. segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020,
[265] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning pp. 4084–4094.
for vision-language models,” in Proc. IEEE Conf. Comput. Vis. Pattern [291] L. Hoyer, D. Dai, and L. Van Gool, “DAFormer: Improving network
Recognit., 2022, pp. 16795–16804. architectures and training strategies for domain-adaptive semantic seg-
[266] R. Zhang et al., “Tip-Adapter: Training-free CLIP-adapter for better mentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
vision-language modeling,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 9914–9925.
pp. 493–510. [292] L. Hoyer, D. Dai, and L. Van Gool, “HRDA: Context-aware high-
[267] Z. Lin et al., “Frozen CLIP models are efficient video learners,” in Proc. resolution domain-adaptive semantic segmentation,” in Proc. Eur. Conf.
Eur. Conf. Comput. Vis., 2022, pp. 388–404. Comput. Vis., 2022, pp. 372–391.
[268] Z. Chen et al., “Vision transformer adapter for dense predictions,” in [293] L. Hoyer, D. Dai, H. Wang, and L. Van Gool, “MIC: Masked image
Proc. Int. Conf. Learn. Representations, 2023, pp. 1–20. consistency for context-enhanced domain adaptation,” in Proc. IEEE
[269] Y. Rao et al., “DenseCLIP: Language-guided dense prediction with Conf. Comput. Vis. Pattern Recognit., 2023, pp. 11721–11732.
context-aware prompting,” in Proc. IEEE Conf. Comput. Vis. Pattern [294] W. Wang et al., “Exploring sequence feature alignment for domain adap-
Recognit., 2022, pp. 18061–18070. tive detection transformers,” in Proc. 29th ACM Int. Conf. Multimedia,
[270] T. Lüddecke and A. Ecker, “Image segmentation using text and image 2021, pp. 1730–1738.
prompts,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, [295] J. Zhang, J. Huang, Z. Luo, G. Zhang, and S. Lu, “DA-
pp. 7076–7086. DETR: Domain adaptive detection transformer by hybrid attention,”
[271] J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, and H. Shi, “OneFormer: 2021, arXiv:2103.17084.
One transformer to rule universal image segmentation,” in Proc. IEEE [296] J. Yu et al., “MTTrans: Cross-domain object detection with mean teacher
Conf. Comput. Vis. Pattern Recognit., 2023, pp. 2989–2998. transformer,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 629–645.
[272] A. Kirillov et al., “Segment anything,” in Proc. IEEE Int. Conf. Comput. [297] B. Xie, S. Li, M. Li, C. H. Liu, G. Huang, and G. Wang, “SePiCo:
Vis., 2023, pp. 3992–4003. Semantic-guided pixel contrast for domain adaptive semantic seg-
[273] A. Zareian, K. D. Rosa, D. H. Hu, and S.-F. Chang, “Open-vocabulary mentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 7,
object detection using captions,” in Proc. IEEE Conf. Comput. Vis. Pattern pp. 9004–9021, Jul. 2023.
Recognit., 2021, pp. 14393–14402. [298] J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun, “MSeg: A composite
[274] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection dataset for multi-domain semantic segmentation,” in Proc. IEEE Conf.
via vision and language knowledge distillation,” in Proc. Int. Conf. Learn. Comput. Vis. Pattern Recognit., 2020, pp. 2876–2885.
Representations, 2022, pp. 1–20. [299] W. Yin, Y. Liu, C. Shen, A. v. d. Hengel, and B. Sun, “The devil is in the
[275] X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting labels: Semantic segmentation from sentences,” 2022, arXiv:2202.02002.
twenty-thousand classes using image-level supervision,” in Proc. Eur. [300] Z. Qiang et al., “LMSeg: Language-guided multi-dataset segmentation,”
Conf. Comput. Vis., 2022, pp. 350–368. in Proc. Int. Conf. Learn. Representations, 2023, pp. 1–12.
[276] Y. Zang, W. Li, K. Zhou, C. Huang, and C. C. Loy, “Open-vocabulary [301] X. Zhou, V. Koltun, and P. Krähenbühl, “Simple multi-dataset de-
DETR with conditional matching,” in Proc. Eur. Conf. Comput. Vis., tection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
2022, pp. 106–122. pp. 7561–7570.
LI et al.: TRANSFORMER-BASED VISUAL SEGMENTATION: A SURVEY 10161

[302] L. Meng et al., “Detection hub: Unifying object detection datasets via [326] C. Zhou, X. Li, C. C. Loy, and B. Dai, “EdgeSAM: Prompt-in-the-loop
query adaptation on language embedding,” in Proc. IEEE Conf. Comput. distillation for on-device deployment of SAM,” 2023, arXiv:2312.06660.
Vis. Pattern Recognit., 2023, pp. 11402–11411. [327] S. Xu et al., “RAP-SAM: Towards real-time all-purpose segment any-
[303] L. Hoyer, D. Dai, and L. Van Gool, “Domain adaptive and generalizable thing,” 2024, arXiv:2401.10228.
network architectures and training strategies for semantic image segmen- [328] W. Zhang et al., “TopFormer: Token pyramid transformer for mobile
tation,” 2023, arXiv:2304.13615. semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
[304] Y. Zhao, Z. Zhong, N. Zhao, N. Sebe, and G. H. Lee, “Style- nit., 2022, pp. 12073–12083.
hallucinated dual consistency learning: A unified framework for [329] Q. Wan, Z. Huang, J. Lu, G. Yu, and L. Zhang, “SeaFormer: Squeeze-
visual domain generalization,” Int. J. Comput. Vis., vol. 132, enhanced axial transformer for mobile semantic segmentation,” in Proc.
pp. 837–853, 2024. Int. Conf. Learn. Representations, 2023, pp. 1–19.
[305] L. Xu, W. Ouyang, M. Bennamoun, F. Boussaid, and D. Xu, “Multi- [330] H. K. Cheng, J. Chung, Y.-W. Tai, and C.-K. Tang, “CascadePSP: Toward
class token transformer for weakly supervised semantic segmentation,” in class-agnostic and very high-resolution segmentation via global and local
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 4300–4309. refinement,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020,
[306] Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, “Self-supervised pp. 8887–8896.
equivariant attention mechanism for weakly supervised semantic seg- [331] L. Ke, M. Danelljan, X. Li, Y.-W. Tai, C.-K. Tang, and F. Yu, “Mask
mentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, transfiner for high-quality instance segmentation,” in Proc. IEEE Conf.
pp. 12272–12281. Comput. Vis. Pattern Recognit., 2022, pp. 4402–4411.
[307] S. Rossetti, D. Zappia, M. Sanzari, M. Schaerf, and F. Pirri, “Max pooling [332] L. Ke, H. Ding, M. Danelljan, Y.-W. Tai, C.-K. Tang, and F. Yu, “Video
with vision transformers reconciles class and shape in weakly super- mask transfiner for high-quality video instance segmentation,” in Proc.
vised semantic segmentation,” in Proc. Eur. Conf. Comput. Vis., 2022, Eur. Conf. Comput. Vis., 2022, pp. 731–747.
pp. 446–463. [333] Q. Liu, Z. Xu, G. Bertasius, and M. Niethammer, “SimpleClick:
[308] C.-C. Hsu, K.-J. Hsu, C.-C. Tsai, Y.-Y. Lin, and Y.-Y. Chuang, Interactive image segmentation with simple vision transformers,”
“Weakly supervised instance segmentation using the bounding box 2022, arXiv:2210.11006.
tightness prior,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, [334] M. Wang, H. Ding, J. H. Liew, J. Liu, Y. Zhao, and Y. Wei, “SegRefiner:
pp. 6582–6593. Towards model-agnostic segmentation refinement with discrete diffusion
[309] S. Lan et al., “DiscoBox: Weakly supervised instance segmentation and process,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2023, Art. no. 3492.
semantic correspondence from box supervision,” in Proc. IEEE Conf. [335] Y. Song, Q. Zhou, X. Li, D.-P. Fan, X. Lu, and L. Ma, “BA-
Comput. Vis. Pattern Recognit., 2021, pp. 3386–3396. SAM: Scalable bias-mode attention mask for segment anything model,”
[310] Z. Tian, C. Shen, X. Wang, and H. Chen, “BoxInst: High-performance in- 2024, arXiv:2401.02317.
stance segmentation with box annotations,” in Proc. IEEE Conf. Comput. [336] Q. Wen, J. Yang, X. Yang, and K. Liang, “PatchDCT: Patch refinement
Vis. Pattern Recognit., 2021, pp. 5443–5452. for high quality instance segmentation,” in Proc. Int. Conf. Learn. Rep-
[311] J. Xu et al., “GroupViT: Semantic segmentation emerges from text resentations, 2023, pp. 1–15.
supervision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, [337] X. Shen et al., “DCT-mask: Discrete cosine transform mask representa-
pp. 18113–18123. tion for instance segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern
[312] M. Yi, Q. Cui, H. Wu, C. Yang, O. Yoshie, and H. Lu, “A simple Recognit., 2021, pp. 8720–8729.
framework for text-supervised semantic segmentation,” in Proc. IEEE [338] L. Qi et al., “Open world entity segmentation,” IEEE Trans. Pattern Anal.
Conf. Comput. Vis. Pattern Recognit., 2023, pp. 7071–7080. Mach. Intell., vol. 45, no. 7, pp. 8743–8756, Jul. 2023.
[313] L. Ke, M. Danelljan, H. Ding, Y.-W. Tai, C.-K. Tang, and F. Yu, “Mask- [339] S. W. Oh, J.-Y. Lee, N. Xu, and S. J. Kim, “Video object segmentation
free video instance segmentation,” in Proc. IEEE Conf. Comput. Vis. using space-time memory networks,” in Proc. IEEE Int. Conf. Comput.
Pattern Recognit., 2023, pp. 22857–22866. Vis., 2019, pp. 9225–9234.
[314] W. Van Gansbeke, S. Vandenhende, S. Georgoulis, and L. Van Gool, “Un- [340] H. K. Cheng and A. G. Schwing, “XMem: Long-term video object
supervised semantic segmentation by contrasting object mask proposals,” segmentation with an Atkinson-Shiffrin memory model,” in Proc. Eur.
in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 10032–10042. Conf. Comput. Vis., 2022, pp. 640–658.
[315] O. Siméoni et al., “Localizing objects with self-supervised transformers [341] K. Park, S. Woo, S. W. Oh, I. S. Kweon, and J.-Y. Lee, “Per-clip
and no labels,” in Proc. Brit. Mach. Vis. Conf., 2021, pp. 310–320. video object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern
[316] M. Hamilton, Z. Zhang, B. Hariharan, N. Snavely, and W. T. Freeman, Recognit., 2022, pp. 1342–1351.
“Unsupervised semantic segmentation by distilling feature correspon- [342] J. Wang et al., “Look before you match: Instance understanding matters
dences,” in Proc. Int. Conf. Learn. Representations, 2022, pp. 1–26. in video object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern
[317] G. Shin, W. Xie, and S. Albanie, “ReCo: Retrieve and co-segment for Recognit., 2023, pp. 2268–2278.
zero-shot transfer,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, [343] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks
pp. 33754–33767. for biomedical image segmentation,” in Proc. Int. Conf. Med. Image
[318] W. Van Gansbeke, S. Vandenhende, and L. Van Gool, “Discovering Comput. Comput. Assist. Intervention, 2015, pp. 234–241.
object masks with transformers for unsupervised semantic segmentation,” [344] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein,
2022, arXiv:2206.06363. “nnU-Net: A self-configuring method for deep learning-based biomedical
[319] X. Wang et al., “FreeSOLO: Learning to segment objects without an- image segmentation,” Nature Methods, vol. 18, pp. 203–211, 2021.
notations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, [345] J. Chen et al., “TransUNet: Transformers make strong encoders for
pp. 14156–14166. medical image segmentation,” 2021, arXiv:2102.04306.
[320] X. Wang, R. Girdhar, S. X. Yu, and I. Misra, “Cut and learn for unsuper- [346] H. Cao et al., “Swin-Unet: Unet-like pure transformer for medical im-
vised object detection and instance segmentation,” in Proc. IEEE Conf. age segmentation,” in Proc. Eur. Conf. Comput. Vis. Workshops, 2022,
Comput. Vis. Pattern Recognit., 2023, pp. 3124–3134. pp. 205–218.
[321] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “ICNet for real-time semantic [347] Y. Zhang, H. Liu, and Q. Hu, “TransFuse: Fusing transformers and CNNs
segmentation on high-resolution images,” in Proc. Eur. Conf. Comput. for medical image segmentation,” in Proc. Int. Conf. Med. Image Comput.
Vis., 2018, pp. 418–434. Comput. Assist. Intervention, Springer, 2021, pp. 14–24.
[322] M. Maaz et al., “EdgeNeXt: Efficiently amalgamated CNN-transformer [348] A. Hatamizadeh et al., “UNETR: Transformers for 3D medical image
architecture for mobile vision applications,” in Proc. Eur. Conf. Comput. segmentation,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis.,
Vis. Workshops, 2022, pp. 3–20. 2022, pp. 1748–1758.
[323] S. Mehta and M. Rastegari, “MobileViT: Light-weight, general-purpose, [349] H. Zhang et al., “A simple framework for open-vocabulary segmen-
and mobile-friendly vision transformer,” in Proc. Int. Conf. Learn. Rep- tation and detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2023,
resentations, 2022, pp. 1–26. pp. 1020–1031.
[324] J. Zhang et al., “Rethinking mobile block for efficient neural models,” [350] X. Zou et al., “Segment everything everywhere all at once,”
2023, arXiv:2301.01146. 2023, arXiv:2304.06718.
[325] W. Liang et al., “Expediting large-scale vision transformer for dense [351] X. Wang, S. Li, K. Kallidromitis, Y. Kato, K. Kozuka, and T. Darrell,
prediction without fine-tuning,” in Proc. Int. Conf. Neural Inf. Process. “Hierarchical open-vocabulary universal image segmentation,” in Proc.
Syst., 2022, pp. 35462–35477. Int. Conf. Neural Inf. Process. Syst., 2023, Art. no. 936.
10162 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

Xiangtai Li received the PhD degree from Peking University, in 2022. He currently works as a research fellow with S-Lab and is a member of the Multimedia Laboratory of NTU (MMLab@NTU) at Nanyang Technological University. His research interests include computer vision and machine learning, with a focus on scene understanding, segmentation, and video understanding. Several of his works have been published in top-tier conferences and journals. He serves as a reviewer for top-tier conferences and journals, including CVPR, ICML, ECCV, ICCV, ICLR, NeurIPS, IEEE Transactions on Pattern Analysis and Machine Intelligence, and International Journal of Computer Vision.

Henghui Ding received the BE degree from Xi'an Jiaotong University, in 2016, and the PhD degree from Nanyang Technological University (NTU), Singapore, in 2020. He was a research scientist with ByteDance and a postdoctoral researcher with ETH and NTU. He is currently a tenure-track professor with Fudan University. He serves as an associate editor for IET Computer Vision and Visual Intelligence, and serves/served as an area chair for CVPR'24, NeurIPS'24, ACM MM'24, and BMVC'24 and as a senior program committee member for AAAI'(22-25) and IJCAI'(23-24). His research interests include computer vision and machine learning.

Haobo Yuan received the master's degree from Wuhan University, in 2023. He is a research associate with S-Lab of Nanyang Technological University. His research interests include computer vision and machine learning. He has published several works in top-tier conferences and journals.

Wenwei Zhang received the BEng degree from the Computer Science School, Wuhan University, in 2019. He is currently working toward the final-year PhD degree with Nanyang Technological University. He is a member of the Multimedia Laboratory of NTU (MMLab@NTU), affiliated with the NTU S-Lab, supervised by Prof. Chen Change Loy. His research interests focus on 3D/2D object detection and segmentation. He also devotes himself to the OpenMMLab projects, an open-source project suite for academic research and industrial applications that covers a wide range of computer vision tasks. He is a core maintainer of MMDetection, MMDetection3D, and MMCV in the OpenMMLab projects.

Jiangmiao Pang is a research scientist at Shanghai AI Laboratory. He received his PhD degree from Zhejiang University in 2021. His research interests cover robotics and multimodal learning. He has published over 30 papers with 1000+ citations in top-tier conferences and journals, including CVPR, ICCV, ECCV, NeurIPS, TPAMI, etc. He leads the development of OpenRobotLab in Shanghai AI Laboratory.

Guangliang Cheng received the PhD degree from the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing. He is currently a reader with the Department of Computer Science, University of Liverpool. Previously, he was a vice research director with SenseTime. Before that, he was a postdoctoral researcher with the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, China. His research interests include computer vision, autonomous driving and robotic perception, scene understanding, domain adaptation, and remote sensing image processing.
Kai Chen received the BEng degree from Tsinghua University, in 2015, and the PhD degree from the Chinese University of Hong Kong, in 2019. He is a research scientist with Shanghai AI Laboratory. His research interests cover computer vision and large language models. He has published more than 30 papers with more than 9000 citations in top-tier conferences and journals, including CVPR, ICCV, ECCV, NeurIPS, IEEE Transactions on Pattern Analysis and Machine Intelligence, etc. He also leads the development of OpenMMLab, an open-source computer vision algorithm platform.

Ziwei Liu is currently a Nanyang assistant professor with Nanyang Technological University, Singapore. His research revolves around computer vision, machine learning, and computer graphics. He has published extensively in top-tier conferences and journals, including CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, IEEE Transactions on Pattern Analysis and Machine Intelligence, ACM Transactions on Graphics, and Nature Machine Intelligence. He is the recipient of Microsoft Young Fellowship, Hong Kong PhD Fellowship, ICCV Young Researcher Award, CVPR Best Paper Award Candidate, and MIT Technology Review Innovators under 35 Asia Pacific. He serves as an area chair of CVPR, ICCV, NeurIPS, and ICLR, as well as an associate editor of International Journal of Computer Vision.

Chen Change Loy (Senior Member, IEEE) received the PhD degree in computer science from the Queen Mary University of London, in 2010. He is a professor of computer vision with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He is also an adjunct associate professor with The Chinese University of Hong Kong. He is the Lab director of MMLab@NTU and co-associate director of S-Lab. Prior to joining NTU, he served as a research assistant professor with the MMLab of The Chinese University of Hong Kong, from 2013 to 2018. He serves as an associate editor of the International Journal of Computer Vision (IJCV), IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) and Computer Vision and Image Understanding (CVIU). He also serves/served as an area chair of top conferences, such as ICCV, CVPR, ECCV, ICLR, and NeurIPS.
