
International Journal of Digital Earth

ISSN: 1753-8947 (Print) 1753-8955 (Online) Journal homepage: www.tandfonline.com/journals/tjde20

A transformer-based Siamese network and an open optical dataset for semantic change detection of remote sensing images

Panli Yuan, Qingzhan Zhao, Xingbiao Zhao, Xuewen Wang, Xuefeng Long &
Yuchen Zheng

To cite this article: Panli Yuan, Qingzhan Zhao, Xingbiao Zhao, Xuewen Wang, Xuefeng Long &
Yuchen Zheng (2022) A transformer-based Siamese network and an open optical dataset for
semantic change detection of remote sensing images, International Journal of Digital Earth,
15:1, 1506-1525, DOI: 10.1080/17538947.2022.2111470

To link to this article: https://doi.org/10.1080/17538947.2022.2111470

© 2022 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group

Published online: 12 Sep 2022.

INTERNATIONAL JOURNAL OF DIGITAL EARTH
2022, VOL. 15, NO. 1, 1506–1525
https://doi.org/10.1080/17538947.2022.2111470

A transformer-based Siamese network and an open optical dataset for semantic change detection of remote sensing images
Panli Yuan (a,b), Qingzhan Zhao (a,b), Xingbiao Zhao (a), Xuewen Wang (c), Xuefeng Long (a,b) and Yuchen Zheng (a)

(a) College of Information Science and Technology, Shihezi University, Shihezi, People’s Republic of China; (b) Geospatial Information Engineering Research Center, Xinjiang Production and Construction Corps, Shihezi, People’s Republic of China; (c) Institute of Geophysics and Geomatics, China University of Geosciences, Wuhan,
People’s Republic of China

ABSTRACT
Recent change detection (CD) methods focus on the extraction of deep change semantic features. However, existing methods overlook fine-grained features and have a poor ability to capture long-range space–time information, which leads to missed micro-changes and smoothed edges of change types. In this paper, a transformer-based semantic change detection (SCD) model, Pyramid-SCDFormer, is proposed, which precisely recognizes small changes and the fine edge details of changes. The SCD model selectively merges different semantic tokens in the multi-head self-attention block to obtain multiscale features, which is crucial for extracting information from remote sensing images (RSIs) with multiple changes at different scales. Moreover, we create a well-annotated SCD dataset, Landsat-SCD, with unprecedented time series and change types in complex scenarios. Compared with three Convolutional Neural Network-based, one attention-based, and two transformer-based networks, experimental results demonstrate that Pyramid-SCDFormer stably outperforms the existing state-of-the-art CD models, obtaining improvements in MIoU/F1 of 1.11/0.76%, 0.57/0.50%, and 8.75/8.59% on the LEVIR-CD, WHU_CD, and Landsat-SCD datasets, respectively. For change classes whose proportion is less than 1%, the proposed model improves the MIoU by 7.17–19.53% on the Landsat-SCD dataset. The recognition performance for small-scale changes and fine edges of change types is thereby greatly improved.

ARTICLE HISTORY
Received 6 June 2022; Accepted 5 August 2022

KEYWORDS
Semantic change detection (SCD); change detection dataset; transformer Siamese network; self-attention mechanism; bitemporal remote sensing

1. Introduction
Change detection (CD) of remote sensing images (RSIs), a process of extracting land cover change
information by analysing a pair of co-registered remote sensing images of the same area in distinct
periods, is a hot topic in the intelligent interpretation of remote sensing images (Shafique et al. 2022). The definition of change detection varies considerably across applications. Detecting changes manually is a time-consuming and labor-intensive task
(Singh 1989). Therefore, automated CD is one of the key technologies for earth observation
applications, and plays an important role in urban expansion (Chen and Shi 2020), deforestation

CONTACT Qingzhan Zhao [email protected] College of Information Science and Technology, Shihezi University,
Shihezi 832003, People’s Republic of China; Geospatial Information Engineering Research Center, Xinjiang Production and
Construction Corps, Shihezi 832003, People’s Republic of China; Yuchen Zheng [email protected] College of Infor-
mation Science and Technology, Shihezi University, Shihezi 832003, People’s Republic of China
© 2022 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

(De Bem et al. 2020), disaster assessment (Fujita et al. 2017), as well as other practical application
(Naegeli, Huss, and Hoelzle 2019). Especially for areas with fragile ecological environments, regular, long-term monitoring of land cover is increasingly vital. Fortunately, with the improvement of earth observation technologies, massive high-quality, multitemporal, wide-coverage RSIs provide key data support for CD tasks.
According to the type of semantic label information desired in the output change map, CD divides into two categories: binary change detection (BCD), which answers only 'where changes happen', and semantic change detection (SCD), which answers both 'where changes happen' and 'how changes happen' in parallel. Hence, SCD provides a 'from-to' change map indicating the change direction and contains more comprehensive land-cover change information; the acquisition of detailed change-type conversion information is crucial for specific applications (Peng et al. 2021).
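The 'from-to' encoding can be made concrete with a small sketch: each ordered (before, after) land-cover pair maps to a unique change-class index, with index 0 reserved for no change. The class list and the pairing scheme below are illustrative only, not the actual Landsat-SCD label table.

```python
# Illustrative 'from-to' semantic change encoding; the land-cover classes
# and the index scheme are hypothetical, not the Landsat-SCD taxonomy.
LAND_COVERS = ["farmland", "desert", "building", "water"]

def encode_from_to(before: str, after: str) -> int:
    """Map an ordered (before, after) pair to a unique change-class index.
    Index 0 is reserved for 'no change'."""
    if before == after:
        return 0
    i = LAND_COVERS.index(before)
    j = LAND_COVERS.index(after)
    n = len(LAND_COVERS)
    # Enumerate all ordered pairs with i != j, offset by 1 for 'no change'.
    idx = i * (n - 1) + (j if j < i else j - 1)
    return idx + 1

def decode_from_to(code: int):
    """Invert encode_from_to back to the (before, after) pair."""
    if code == 0:
        return ("no change", "no change")
    n = len(LAND_COVERS)
    i, r = divmod(code - 1, n - 1)
    j = r if r < i else r + 1
    return (LAND_COVERS[i], LAND_COVERS[j])
```

With four classes this yields 12 distinct transition labels plus the no-change label, mirroring how an SCD map carries direction information that a binary map discards.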
Current CD networks are data-driven, deep learning-based methods, so well-annotated CD datasets play a crucial role in exploring novel CD methods. In the context of CD tasks, there are many BCD datasets (Chen and Shi 2020; Wang et al. 2018; Daudt et al. 2018), while few well-annotated 'from-to' change datasets are available as open source. Additionally, existing datasets suffer from several bottlenecks: (1) a lack of long-range multi-temporal RSIs; (2) a lack of multiple change types with detailed annotation; (3) a lack of SCD information, since most datasets reflect whether there is a change but do not report the direction of land cover transformation, for example, from farmland to building. To a certain extent, the exploration of a 'from-to' CD dataset accelerates future research on SCD methods. In this paper, we create a large-scale SCD dataset with more varied change types and 'from-to' information, richer prior land cover information, and longer time series of RSIs.
At present, the majority of CD methods resort to various convolutional neural networks (CNNs) to realize BCD. One line of CD methods fuses the bands of a pair of images and feeds them into an end-to-end network to get a pixel-level change map (Alcantarilla et al. 2018; Peng, Zhang, and Guan 2019). Others are based on deep Siamese networks: the input image pair is fed into two weight-sharing feature-extraction branches and embedded into a feature space where the distance between changed pairs is large and that between unchanged pairs is small (Daudt, Le Saux, and Boulch 2018; Chen et al. 2020). However, the intrinsic locality of the convolution operation degrades performance in representing image features and imposes a further constraint on modeling explicit long-range relations.
To enhance the performance of CD, the latest studies focus on increasing the receptive field and
improving feature extraction and refinement. Peng, Zhang, and Guan (2019) and Zhang et al.
(2018) utilize multiscale atrous convolution to extract multiscale features for improving the per-
formance of CD. Others strive to explore the performances via the nonlocal operations of attention
mechanisms to enhance the global context of features (Chen and Shi 2020; Chen et al. 2020). Nevertheless, existing CNN-based CD methods generally still struggle to relate long-range concepts in space–time, and such long-range context information is essential for the semantic changes of bitemporal RSIs.
Inspired by the encouraging performance of the transformer in the Computer Vision (CV) area,
transformer-based approaches are proposed in the downstream task of CD (Bandara and Patel
2022a; Chen, Qi, and Shi 2021). Thanks to global self-attention, vision transformers (ViTs) have
stronger long-range spatial and temporal relations shaping ability than CNN-based networks
(Dosovitskiy et al. 2020). However, model performance in capturing multiscale features degrades when visual tasks involve objects at different scales, so such models fail to capture small objects and refine object edges. Although ViTs have a wider receptive field and more robust long-range context modeling ability, transformer-based SCD work has not yet been carried out in depth, and most recent related research focuses on BCD.
Based on the above research on CD, we observe that extracting objects at various scales, especially small objects and fine edges in the change map, requires capturing long-range and multiscale context features in SCD tasks. To meet these challenges, we introduce the Pyramid-SCDFormer

network. At the encoder stage, the shunted self-attention (SSA) module (Ren et al. 2021) is inte-
grated to better model multiscale features among different attention heads within one self-attention
layer, and then multi-level features from the Siamese network are concatenated to obtain the dis-
tance feature maps of bitemporal RSIs. The Semantic Change Map (SCM) is finally obtained after
processing by Multi-Layer Perception (MLP) and upsampling in the decoder architecture.
For feature extraction, the Pyramid-SCDFormer differs from previous transformer-based CD methods: thanks to the SSA module, it learns to extract pyramid features of changed objects at different scales at different attention heads within one attention layer, in an efficient and effective manner. Hence, it retains more fine-grained features and clearly identifies small objects and fine edges of changes that other models easily ignore. For the distance maps of the bitemporal images, we concatenate the features of different levels from the encoder stage to obtain the final distance map, whereas previous CD models mainly calculate the absolute distance.
In sum, the main contributions are as follows:

(1) For the clear recognition of small-scale objects and fine boundaries of changed objects, we propose a novel end-to-end SCD network based on a transformer module, Pyramid-SCDFormer, which utilizes the SSA module to capture features at different scales simultaneously with favorable efficiency, then integrates different hierarchical features of the changed land covers in a Siamese network to obtain fine-grained features of changes.
(2) A new, open-source optical satellite SCD dataset with unprecedented time series and semantic change types, Landsat-SCD, is presented, which comprises 8468 pairs of multispectral Landsat images with 10 change-type classes. Landsat-SCD is a readily available dataset for advancing state-of-the-art models in BCD and SCD tasks.
(3) Extensive experiments confirm the validity of the proposed Pyramid-SCDFormer. The proposed model effectively mitigates misdetections of small-scale changes and fine edges, and achieves state-of-the-art performance on the LEVIR-CD, WHU_CD, and the proposed Landsat-SCD benchmarks.

The remainder of this paper is structured as follows. Section 2 presents the related work of the
proposed network and dataset. Section 3 describes the proposed dataset in detail. Section 4 presents
the architecture of the proposed Pyramid-SCDFormer network and each network module will be
introduced in detail. Section 5 reports the experimental results and discusses the performance of
the proposed network. Section 6 concludes this paper with remarks and future work.

2. Related work
2.1. CNN-based CD methods
Deep learning-based CD methods for RSIs are rapidly evolving and yielding good results, such as
supervised (Zhang et al. 2018; Chen and Shi 2020; Li et al. 2021), unsupervised (Gong et al. 2019),
and semi-supervised (Bandara and Patel 2022b) for different CD datasets. Here, we focus on super-
vised CNN-based methods for CD tasks. The main prior CD methods benefit from the semantic
representation capability of CNNs. To improve the recognition performance of CNN-based CD
methods, scholars mainly optimize the network structure, introduce attention mechanisms, and apply other training tricks.
There are roughly three ways to improve the feature extraction of CNN-based architectures: multiscale features, spatial–temporal features, and residual connections. Zhang et al. (2018) introduce
the atrous convolution in ResNet101 to increase the receptive field for capturing multiscale context
information and make the best of Atrous Spatial Pyramid Pooling (ASPP) to extract features to keep
various scale characteristics. Yang et al. (2020) propose an end-to-end deep learning CD framework
based on the D-LinkNet to overcome the boundary error of traditional block. Li et al. (2021)

propose MFCN network by using multiscale convolution filters to extract detailed information.
Gedara Chaminda Bandara, Gopalakrishnan Nair, and Patel (2022) first use Denoising Diffusion
Probabilistic Models (DDPM) to leverage more multiscale information from RSIs, then train a
CD classifier for the precise CD task. For exploring temporal features, the BiDateNet network
(Papadomanolaki et al. 2019) imports Long Short-Term Memory Networks (LSTMs) to improve
the CD accuracy. Song et al. (2018) propose the convLSTM network, combining 3D Fully Convolu-
tional Networks (FCNs) and LSTMs for hyperspectral images CD to preserve spectral–spatial fea-
tures. Chen and Shi (2020) utilize classic residual connections for coarse-grained and fine-grained features of changes. UNet++ (Peng, Zhang, and Guan 2019) employs dense skip connections to improve spatial accuracy and reduce pseudo-changes by enhancing scale variance. FC-Siam-Co and FC-Siam-Di (Daudt, Le Saux, and Boulch 2018) make the best of skip connections for achieving multiscale feature extraction in CD results. CNN-based CD methods are good at extracting high-level semantic features that reveal the change of interest, but they focus on local modeling; the above methods thus fail to alleviate the pseudo-change effect.
Consequently, the strategy of leveraging attention mechanism for CD has also been applied to
obtain discriminative information. A deep supervised image fusion network (IFN) (Zhang et al.
2020) fuses multi-level deep features in a channel attention-wise manner to improve boundary
completeness and internal compactness. To capture more discriminative features, DSAMNet
(Shi et al. 2021) imports a Convolutional Block Attention Module (CBAM) into the network
to obtain the spatial and channel information simultaneously. DASNet (Chen et al. 2020) and
DTCDSCN (Liu et al. 2020) introduce a dual attention module (DAM) to get more discriminative
information of semantic changes in RSIs. However, the above approaches only reweight feature information along the spatial and channel dimensions, which limits the capture of the long-range spatiotemporal information required for accurate CD. As a supplement, the self-attention mechanism can capture more contextual information
to solve long-range dependency problems and relieve the influence of pseudo-changes (Chen
and Shi 2020).

2.2. Transformer-based vision methods


The transformer (Vaswani et al. 2017) boomed in 2017 and succeeded in Natural Language Processing (NLP). Building on this success, ViT models have been proposed one after another and achieve promising performance across classic CV tasks, such as classification (Deng et al. 2009), object detection (Everingham et al. 2010), and semantic segmentation (Zhou et al. 2017). Typical ViT models achieve profitable results or perform better than CNN-based models in many tasks (Wang, Zhang, et al. 2022). The transformer is an architecture based on the self-attention mechanism, whose cost is quadratic in the number of pixels. Hence, recent studies generally exploit down-sampling
and token merging strategies to reduce the amount of computation. Wang, Li, et al. (2022) propose the UVACD network, combining a transformer and a CNN with the help of spatial and temporal information to extract more distinguishable change information. Dosovitskiy et al. (2020) apply down-sampling projection to reduce computation cost, but the output carries only single-scale, coarse-grained information. Wang et al. (2021) merge the tokens through linear projection and adopt a Spatial-Reduction Attention (SRA) layer to reduce the computational cost. However, the above ViTs largely retain static receptive fields for each token feature within one self-attention layer, leading to insufficiency in the boundary and shape of the change of interest. Yet realizing fine SCD is of great significance for practical applications, such as urban expansion, land desertification, agricultural land occupation, etc.
To deal with the above problems, we propose the Pyramid-SCDFormer method to tackle the fine
SCD task, especially for the extraction of small changes and detailed edges. In particular, the model
learns to extract pyramid features of different scales changed objects at different attention heads
within one attention layer. The Pyramid-SCDFormer method, a transformer Siamese network

model embedded with the SSA module, is effective in representing multiscale semantic features and capturing long-range spatiotemporal information of the change of interest.

2.3. The existing CD datasets


There are currently many CD datasets for remote sensing from drones and satellites (Shafique et al. 2022). The majority of CD datasets contain only binary labels, indicating change and no-change information. Table 1 shows that the publicly available CD datasets have some limitations.

(1) Lack of long-range multi-temporal RSIs. Most of the existing public CD datasets contain
bitemporal RSIs of the same area, and a few contain remote sensing images of three phases.
The SZTAKI AirChange dataset (Benedek and Szirányi 2009) is the earliest CD dataset, comprising 13 pairs of bitemporal images with a size of 952 × 640 and a resolution of 1.5 m. To make the best of the rich change information in high-resolution RSIs, the Learning, Vision and Remote Sensing Laboratory released the LEVIR-CD dataset (Chen and Shi 2020) to monitor change in buildings; it contains 637 pairs of bitemporal aerial images with a size of 1024 × 1024 and a resolution of 0.5 m. In addition, the Wuhan University (WHU) Building CD dataset (Ji, Wei, and Lu 2018) is also a bitemporal building change detection dataset, with an even higher spatial resolution. The Onera Satellite Change Detection (OSCD) dataset (Daudt et al. 2018) comprises 24
pairs of multispectral Sentinel-2 satellite images from five areas around the world, each with
a size of 600 × 600 and a resolution of 10 m. Moreover, the bitemporal hyperspectral image CD dataset 'River' (Wang et al. 2018) was also proposed for the objective evaluation of CD methods.
(2) Lack of multiple change types with detailed annotation. LEVIR-CD and WHU_CD contain binary building CD labels, and SZTAKI AirChange and OSCD label only binary land cover information. The high resolution semantic change detection (HRSCD) dataset (Daudt et al. 2019) is an SCD dataset, but it does not explicitly represent the transformation relationship between features, and its label accuracy of 80–85% leads to inaccurate borders in some cases. Meanwhile, the Hi-UCD dataset (Tian et al. 2020) includes only 9-class land cover maps and binary change maps. Although HRSCD, Hi-UCD, and HCCD (López-Fandiño et al. 2018) contain more than two change types, they are still lacking in accuracy and semantic information richness.
(3) Lack of SCD information. The above-mentioned public datasets mostly reflect whether there is a change but do not report the direction of land cover transformation. Therefore, refined SCD is hindered by the lack of SCD datasets.

Existing CD datasets do not meet the needs of SCD methods, and there is still room for improvement in scale and richness of change information. Firstly, high-resolution RSIs are unable to provide large-scale, ultra-long time series Land Use and Land Cover (LULC) monitoring; satellite image data such as Landsat imagery offer rich historical data, wider spatial

Table 1. Information of different public CD datasets.


Dataset Resolution (m) Images Image size (Pixel) Phases Change type Changes object Classes
SZTAKI AirChange 1.5 13 952 × 640 2 Binary change Land cover 2
LEVIR-CD 0.5 637 1024 × 1024 2 Binary change Building 2
WHU-CD 0.2 1 32207 × 15354 2 Binary change Building 2
OSCD 10 24 600 × 600 2 Binary change Land cover 2
HRSCD 0.5 291 10000 × 10000 2 Semantic change Land cover 5
Hi-UCD 0.1 1293 1024 × 1024 3 Semantic change Land cover 9
HCCD 30 3 390 × 200 2 Semantic change Land cover 5
Landsat-SCD 30 8468 416 × 416 28 Semantic change Land cover 10

coverage, and higher temporal resolution, which can serve as a good supplement. Secondly, the definition of change types in existing CD datasets is too broad to meet practical applications. The proposed Landsat-SCD dataset largely complements existing CD datasets in spatial scale, time span, and diversity of change types.

3. Landsat-SCD: a new well-annotated dataset for semantic change detection


Few SCD datasets exist, yet mainstream CD methods place high demands on data quality and quantity. In this context, Landsat-SCD focuses on refined land cover semantic changes and provides a benchmark with a longer time series and more varied semantic change types for evaluating refined SCD models of RSIs.
Multispectral RSIs have become a vital data source in the field of CD due to their relatively high spatiotemporal and spectral resolution and good data accessibility. The source data of the Landsat-SCD benchmark are Landsat-like images taken between 1990 and 2020 in Tumushuke (39°39′N–40°4′N, 78°53′E–79°19′E), Xinjiang, adjacent to the Taklimakan Desert, an area with a fragile ecological environment located on the Belt and Road Economic Belt. Detailed information about the Landsat-SCD dataset is shown in Table 2.
The Landsat-SCD dataset provides 10 change types with much finer change information than was previously available in the context of CD datasets, where each 'from-to' change type is a separate class representing a land-cover transition. The change type codes and corresponding color maps of the dataset are shown in Table 3.
Some examples of image pairs and labels are depicted in Figure 1. Despite the dataset's unprecedented size and quality, its challenges need to be discussed. First, the dataset contains many complicated detection scenes with unprecedented multiple change types. The study area is adjacent to the edge of the desert, and the buildings therein are small and scattered. In addition, the data source is Landsat series imagery with a resolution of 30 m. These two points are challenges for accurate manual annotation and for a new robust CD model. Second, the label imbalance

Table 2. A summary of the Landsat-SCD dataset.


Type Item Value
Image info. Total image pairs 8468
Image size 416 × 416
Image resolution 30 m
Image phases 28
Total time span 1990–2020
Bitemporal time span 3–23 years
Image Source Landsat-like
Modality RGB image
Change info. Change types 10
Total unchanged proportion 81.11%
Total changed proportion 18.89%

Table 3. Details of each change type in the Landsat-SCD dataset.


Change type code Change type name RGB code Change type proportion
0 No change 255, 255, 255 81.11%
1 Farmland to desert 255, 165, 0 1.81%
2 Farmland to building 230, 30, 100 0.95%
3 Desert to farmland 70, 140, 0 11.61%
4 Desert to building 218, 112, 214 0.77%
5 Desert to water 0, 170, 240 2.12%
6 Building to farmland 127, 235, 170 0.33%
7 Building to desert 230, 80, 0 0.11%
8 Water to farmland 205, 220, 57 0.09%
9 Water to desert 218, 165, 32 1.10%
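Such an imbalanced class distribution is commonly countered by weighting the loss per class; the sketch below derives inverse-log-frequency weights from the Table 3 proportions. This is a generic heuristic for imbalanced segmentation, not the training recipe used in the paper.

```python
import math

# Change-type proportions from Table 3 (percent of labeled pixels per class).
proportions = {
    0: 81.11, 1: 1.81, 2: 0.95, 3: 11.61, 4: 0.77,
    5: 2.12, 6: 0.33, 7: 0.11, 8: 0.09, 9: 1.10,
}

def inverse_log_weights(props, smooth=1.02):
    """Per-class weight 1 / log(smooth + frequency): a common heuristic
    that boosts rare classes; the smoothing constant is a free choice."""
    return {c: 1.0 / math.log(smooth + p / 100.0) for c, p in props.items()}

weights = inverse_log_weights(proportions)
# Rare transitions (e.g. class 7, 'building to desert' at 0.11%) receive
# far larger weights than the dominant 'no change' class.
```

Weighting of this kind is one way a model can still learn the sub-1% change classes that the paper singles out in its MIoU analysis.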

Figure 1. Some samples from the Landsat-SCD.


Note: Each column is one sample, including the image pair (rows 1 and 2), and the label (row 3). The results of the unchanged and changed classes
are viewed according to the legend for each period of an image.

also arises in the proposed dataset. In line with the characteristics of the real world, changed pixels are far less frequent than unchanged land covers. Thus, we provide a realistic evaluation benchmark for SCD methods.

4. The Pyramid-SCDFormer for the precise and fine SCD task


This section describes a robust SCD network (Pyramid-SCDFormer) for monitoring LULC from bitemporal optical satellite RSIs. An overview of the network is first provided, after which each network module is introduced in detail.

4.1. Overall model architecture


The Pyramid-SCDFormer takes bitemporal RSIs as input and outputs a pixel-level 'from-to' change map, each pixel of which belongs to a unique encoded 'from-to' change label. We focus on semantic CD, which reflects not only change versus no change, but also changes from one land cover to another.
The proposed Pyramid-SCDFormer architecture (Figure 2) consists of three parts: the Siamese pyramid transformer encoder, the fusion module of multi-level distance maps from bitemporal feature pairs, and the prediction head of the decoder. Concretely, the inputs I1, I2 of size H0 × W0 × 3 are fed into the Siamese pyramid transformer encoder. The patch embedding mixes convolutions of different kernel sizes to achieve image-to-token conversion and produce semantic tokens of size (H0/4) × (W0/4) × C, where C is the dimension of the token. In the i-th stage, the output feature map F^i_{Tj} has size (H0/2^(i+1)) × (W0/2^(i+1)) × (C × 2^(i+1)), where C is the number of channels of the token sequence, i = {1, 2, 3, 4} is the number of stages, and Tj = {T1, T2} is the

Figure 2. Architecture of the proposed Pyramid-SCDFormer for SCD task.

number of periods, which are sent into the fusion module to obtain the distance maps followed by
the prediction head of decoder to acquire the Semantic Change Map (SCM).

4.2. The Siamese pyramid transformer encoder


In the pyramid transformer encoder phase, the bitemporal images I1, I2 of size H0 × W0 × 3 are first fed into the patch embedding, which expresses the input images as a few high-level semantic tokens, yielding an input sequence of size (H0/4) × (W0/4) × C. Four stages of feature extraction follow. Each stage contains a linear embedding and several SSA transformer blocks. After each stage, the height and width of the feature maps are halved and the channel numbers are doubled. We obtain four output feature maps F^1_{Tj}, F^2_{Tj}, F^3_{Tj}, F^4_{Tj} from the four stages, where F^i_{Tj} has size (H0/2^(i+1)) × (W0/2^(i+1)) × (C × 2^(i+1)), i is the stage index, and Tj is the period index. The branches of the two Siamese networks share weights.
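As a quick sanity check of the stated sizes (not code from the paper), the stage resolutions for the 416 × 416 Landsat-SCD tiles work out as:

```python
# Spatial sizes of the four encoder stages per the stated formula:
# stage i outputs (H0 / 2**(i+1)) x (W0 / 2**(i+1)), for i = 1..4.
H0 = W0 = 416  # Landsat-SCD tile size

def stage_shape(i, h0=H0, w0=W0):
    return h0 // 2 ** (i + 1), w0 // 2 ** (i + 1)

shapes = [stage_shape(i) for i in range(1, 5)]
# shapes == [(104, 104), (52, 52), (26, 26), (13, 13)]
```

Each stage thus halves the spatial resolution, from 104 × 104 tokens after stage 1 down to 13 × 13 after stage 4.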
Shunted Self-Attention (SSA): Different from ViT (Dosovitskiy et al. 2020), Swin transformer
(Liu et al. 2021), and PVT (Wang et al. 2021), the SSA is designed to enable the self-attention to
simultaneously extract multiscale features at different attention heads within one attention layer.
The extracted features are more discriminative and contain more fine-grained feature information, which benefits distinguishing changes of interest at different scales.
The input tokens are projected into Query (Q), Key (K), and Value (V) vectors. Multi-Head Self-Attention (MSA) uses different attention heads to compute attention scores simultaneously. Multiscale Token Aggregation (MTA) down-samples the K, V of different heads to different sizes. The SSA is calculated by:

Q_i = X W_i^Q,  (1)

K_i, V_i = MTA(X, r_i) W_i^K, MTA(X, r_i) W_i^V,  (2)

V_i = V_i + LE(V_i),  (3)

h_i = Softmax(Q_i K_i^T / √d_h) V_i,  (4)

MSA(X) = Concat(h_1, h_2, ..., h_H) W^O,  (5)



where X is the input feature map, W_i^Q, W_i^K, and W_i^V are linear projection parameters, and r_i is the down-sampling rate of the i-th head. When r_i is large, more K, V tokens are merged and the computation cost is low, capturing large-scale objects; when r_i is small, the computation cost is high but more detailed information is preserved. Hence, we subtly mix multiple r_i to extract multi-granularity features within one self-attention layer. d_h is the dimension of Q and K, h_i is the output of the i-th head, LE(·) is the local enhancement applied to V_i, Concat(·) is the concatenation operation, and MSA(X) is the output feature map of the MSA module. Therefore, by integrating variant r_i in different attention heads, the key and value vectors capture different scales within a single self-attention layer.
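The data flow of Eqs. (1)–(5) can be sketched in NumPy as follows. The learned token merging of MTA is simplified to average pooling, the LE(·) term of Eq. (3) and the final output projection are omitted, and all projection weights are random placeholders rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mta(x, r):
    """Multiscale Token Aggregation stand-in: merge r x r neighbourhoods of
    the token grid by averaging (the paper uses a learned merging)."""
    n, c = x.shape
    s = int(np.sqrt(n))                       # tokens form an s x s grid
    grid = x.reshape(s, s, c)
    pooled = grid.reshape(s // r, r, s // r, r, c).mean(axis=(1, 3))
    return pooled.reshape(-1, c)

def shunted_self_attention(x, rates, d_head):
    """One SSA layer: each head merges K, V at its own rate r_i, so heads
    attend at different granularities within the same layer."""
    heads = []
    for r in rates:
        wq, wk, wv = (rng.normal(0, 0.02, (x.shape[1], d_head)) for _ in range(3))
        q = x @ wq                             # Eq. (1)
        merged = mta(x, r)
        k, v = merged @ wk, merged @ wv        # Eq. (2), fewer tokens as r grows
        attn = softmax(q @ k.T / np.sqrt(d_head))  # Eq. (4)
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1)      # Eq. (5), W^O omitted

tokens = rng.normal(size=(64, 32))             # an 8 x 8 token grid, dim 32
out = shunted_self_attention(tokens, rates=[1, 2, 4], d_head=16)
```

With rates [1, 2, 4], the three heads attend over 64, 16, and 4 merged key/value tokens respectively, which is exactly the mixed-granularity behavior the text describes.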
Distance Map: The distance map leverages the multi-level feature maps from each stage of the
encoder and computes an optimal distance metric at each pyramid level, whereas traditional CD
methods (Daudt, Le Saux, and Boulch 2018; Chen, Qi, and Shi 2021) directly calculate an absolute
distance. The distance map is calculated as:

F_dist^i = BN(ReLU(Conv2D(Concat(F_T1^i, F_T2^i)))),  (6)

where F_T1^i and F_T2^i represent the output features of the i-th stage for the T1 and T2 periods.
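A minimal sketch of Eq. (6), with a 1×1 convolution standing in for the Conv2D and batch-normalization statistics computed from the input itself; the names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def distance_map(f_t1, f_t2, w, gamma=1.0, beta=0.0, eps=1e-5):
    # Eq. (6): concatenate the bitemporal features along the channel axis,
    # project (1x1 conv here), rectify, and batch-normalize per channel.
    # f_t1, f_t2: (H, W, C); w: (2C, C_out).
    x = np.concatenate([f_t1, f_t2], axis=-1)   # Concat(F_T1^i, F_T2^i)
    x = np.maximum(x @ w, 0.0)                  # Conv2D (1x1) + ReLU
    mu = x.mean(axis=(0, 1), keepdims=True)     # BN statistics per channel
    var = x.var(axis=(0, 1), keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```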
Detail-specific Feed-forward Layer: Based on the traditional feed-forward layer, we insert a
detail-specific layer between the two fully connected layers to complement local details:

x′ = FC(x; u_1),  (7)

x = FC(σ(x′ + DS(x′; u)); u_2),  (8)

where DS(·; u) is the detail-specific layer, implemented as a depth-wise convolution with
parameters u, and σ(·) is the activation function.
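The detail-specific feed-forward layer of Eqs. (7)-(8) can be sketched as follows; the 3×3 depth-wise kernel size and the use of ReLU for the activation σ are illustrative assumptions for this sketch.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    # DS(.; u): 3x3 depth-wise convolution with zero padding, one kernel
    # per channel. x: (H, W, C); kernels: (3, 3, C).
    h, w, c = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += pad[i:i + h, j:j + w, :] * kernels[i, j, :]
    return out

def detail_specific_ffn(x, w1, w2, kernels):
    # Eqs. (7)-(8); sigma is approximated here by ReLU.
    x1 = x @ w1                                              # Eq. (7): FC(x; u1)
    x1 = np.maximum(x1 + depthwise_conv3x3(x1, kernels), 0)  # local-detail branch
    return x1 @ w2                                           # Eq. (8): FC(.; u2)
```

The depth-wise branch adds a cheap local receptive field between the two fully connected layers, which is what supplies the "local details" the plain feed-forward layer lacks.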

4.3. The prediction head of decoder


Multi-level distance maps are produced by the Siamese pyramid transformer encoder, and these
multiscale features are then aggregated to predict the semantic change map in the prediction head
of the decoder. Concretely, we use an MLP layer to unify the channel dimension of each distance
map F_dist^i, followed by upsampling each one to the size of H0/4 × W0/4 × C_conc. The upsampled
distance maps are then concatenated and fused by an MLP layer, followed by upsampling with a
factor of 4, which yields a feature map of size H0 × W0 × C_conc. Finally, the upsampled feature
map is processed by MLP and softmax layers to obtain the SCM of size H0 × W0 × N_cls, where
N_cls = 10 is the number of change types:

SCM = Softmax(MLP(Upsample(MLP(Upsample(MLP(F_dist^1, F_dist^2, F_dist^3, F_dist^4))))))  (9)
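The decoder of Eq. (9) can be sketched in NumPy as follows, with nearest-neighbour upsampling standing in for the paper's upsampling operator; all weight shapes and function names are chosen for illustration.

```python
import numpy as np

def upsample_nearest(x, factor):
    # Nearest-neighbour upsampling of an (H, W, C) map by an integer factor.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def prediction_head(dist_maps, w_unify, w_fuse, w_cls):
    # Eq. (9): per-level MLP to C_conc channels, upsample every level to
    # H0/4 x W0/4, concatenate, fuse with an MLP, upsample x4, classify.
    h_quarter = dist_maps[0].shape[0]            # finest level is H0/4 x W0/4
    feats = []
    for x, w in zip(dist_maps, w_unify):
        x = x @ w                                # unify channel dimension
        feats.append(upsample_nearest(x, h_quarter // x.shape[0]))
    fused = np.concatenate(feats, axis=-1) @ w_fuse
    fused = upsample_nearest(fused, 4)           # back to H0 x W0
    logits = fused @ w_cls                       # N_cls change-type channels
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # per-pixel softmax -> SCM
```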

5. Experimental results and discussion


In this section, we introduce the existing state-of-the-art CD methods. Then, the model
evaluation metrics and implementation details are presented. Lastly, we compare the SCD
performance of the proposed Pyramid-SCDFormer with the existing state-of-the-art methods on
the LEVIR-CD, WHU_CD, and the proposed Landsat-SCD datasets.

5.1. The existing state-of-the-art methods


In summary, three CNN-based methods, one attention-based method, and two transformer-based
methods are compared in the experiment to evaluate the effectiveness of the proposed methods.

(1) FC-EF (Daudt, Le Saux, and Boulch 2018): a CNN-based network. The concatenated bitemporal
images are fed into a fully convolutional network, and skip connections transport multiscale features.
INTERNATIONAL JOURNAL OF DIGITAL EARTH 1515

(2) FC-Siam-Di (Daudt, Le Saux, and Boulch 2018): a CNN-based network. Bitemporal images are
fed into a Siamese network to capture multi-level features, and differences are transported to
the decoder.
(3) FC-Siam-Co (Daudt, Le Saux, and Boulch 2018): a CNN-based network. Bitemporal images are
fed into a Siamese network to extract multi-level features, and different level concatenations
from the encoder are used to detect changes.
(4) DTCDSCN (Liu et al. 2021): an attention-based CNN method. Bitemporal images are fed into a
Siamese-based network which employs DAM to explore the correlation of channels and spatial
dimensions, capturing more discriminative features.
(5) BIT (Chen, Qi, and Shi 2021): a transformer-based method. The semantic tokens are fed into
the encoder-decoder transformer architecture to enhance context information.
(6) ChangeFormer (Bandara and Patel 2022a): a transformer-based method. A transformer encoder
in a Siamese network extracts detailed and semantic features of the bitemporal images, and a
lightweight decoder fuses the multi-level features to produce the change map.

5.2. Metrics and implementation details


In this work, we conduct experiments on the LEVIR-CD, WHU_CD, and proposed Landsat-SCD
datasets. For the LEVIR-CD dataset, the numbers of pairs in the training/validation/test sets are
445/128/64. Limited by computing power, these image pairs are resized from 1024 × 1024 to
416 × 416. For WHU_CD, we filter out the image pairs without change information, which leaves
725/233/111 image pairs of size 416 × 416 in the training/validation/test sets. For the
Landsat-SCD dataset, we randomly divide the samples into training/validation/test sets of
6053/1729/686. During training, we use flip, re-scale, crop, Gaussian blur,
and color jittering for data augmentation on all three datasets.
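A minimal sketch of such paired augmentation (the flip probability and crop fraction are illustrative, not values from the paper); the key point is that both temporal images and the label must receive identical random transforms so the bitemporal pair stays pixel-aligned.

```python
import numpy as np

def paired_augment(img1, img2, label, rng, crop_frac=0.9):
    # Apply the *same* random flip and crop to both temporal images and the
    # label; crop_frac is an illustrative parameter.
    if rng.random() < 0.5:                       # random horizontal flip
        img1, img2, label = img1[:, ::-1], img2[:, ::-1], label[:, ::-1]
    h, w = label.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    y = rng.integers(0, h - ch + 1)              # random crop offset
    x = rng.integers(0, w - cw + 1)
    return (img1[y:y + ch, x:x + cw],
            img2[y:y + ch, x:x + cw],
            label[y:y + ch, x:x + cw])
```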
To evaluate the effectiveness of the proposed Pyramid-SCDFormer network, we present four
encoder variants with different configurations, as shown in Table 4. Then, we use five evaluation
metrics to evaluate SCD results, which are Overall Accuracy (OA), Precision (P), Recall (R), F1-
score (F1), and Mean Intersection over Union (MIoU).
OA = (TP + TN) / (TP + TN + FP + FN)  (10)

precision = TP / (TP + FP)  (11)

recall = TP / (TP + FN)  (12)

F1 = (2 × precision × recall) / (precision + recall)  (13)

MIoU = TP / (FP + FN + TP)  (14)

where TP, FP, TN, and FN are the numbers of true positive, false positive, true negative, and false
negative pixels, respectively.
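These metrics can be computed directly from binary pixel masks, as in the following sketch; note that Eq. (14) as printed is the intersection-over-union of the 'change' class, which the paper reports as MIoU.

```python
import numpy as np

def cd_metrics(pred, gt):
    # Pixel-wise binary CD metrics of Eqs. (10)-(14); pred and gt are 0/1 masks.
    pred, gt = np.asarray(pred), np.asarray(gt)
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    oa = (tp + tn) / (tp + tn + fp + fn)          # Eq. (10)
    precision = tp / (tp + fp)                    # Eq. (11)
    recall = tp / (tp + fn)                       # Eq. (12)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (13)
    iou = tp / (tp + fp + fn)                     # Eq. (14)
    return {'OA': oa, 'precision': precision, 'recall': recall,
            'F1': f1, 'MIoU': iou}
```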
We implement all experiments with the PyTorch library on a GeForce RTX 3090 GPU platform.
All networks are randomly initialized by default. We train all models with the Cross-Entropy
loss function and the AdamW optimizer (Loshchilov and Hutter 2018) with a weight decay of 0.01.
The learning rate is initially set to 0.0001, except for the WHU_CD dataset, which uses an initial
learning rate of 0.00001. The batch size is 6.
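For reference, one AdamW update with decoupled weight decay (Loshchilov and Hutter 2018) under the paper's hyperparameters can be sketched as follows; this is a didactic NumPy version, not the PyTorch optimizer used in the experiments.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # One AdamW step: the weight-decay term is applied directly to the
    # weights rather than folded into the gradient (the "decoupled" part).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```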
Table 4. The different configurations of four Pyramid-SCDFormer encoder variants.
Stage 1 (output 104 × 104): r_i = 4 for the first half of the heads, else r_i = 8; C1 = 64, head = 2; depth N1: T = 3, S = 2, B = 3, same = 3.
Stage 2 (output 52 × 52): r_i = 2 for the first half of the heads, else r_i = 4; C2 = 128, head = 4; depth N2: T = 4, S = 4, B = 4, same = 3.
Stage 3 (output 26 × 26): r_i = 1 for the first half of the heads, else r_i = 2; C3 = 256, head = 8; depth N3: T = 6, S = 12, B = 24, same = 4.
Stage 4 (output 13 × 13): r_i = 1 for all heads; C4 = 512, head = 16; depth N4: T = 3, S = 1, B = 2, same = 3.
Note: T, S, B, and same denote Pyramid-SCDFormer-T, -S, -B, and -same, respectively.

Table 5. The overall quantitative results of different CD methods on the LEVIR-CD and WHU_CD dataset.
LEVIR-CD WHU_CD
Method Param. (M) Pre. Rec. OA MIoU F1 Time (min) Pre. Rec. OA MIoU F1 Time (min)
FC-EF 1.35 55.63 50.00 94.91 47.45 48.69 0.90 82.87 72.10 92.86 65.76 76.12 1.02
FC-Siam-Di 1.35 88.69 78.48 97.09 73.66 82.71 0.89 78.14 62.72 91.28 57.28 66.69 1.04
FC-Siam-Co 1.55 88.85 80.67 97.27 75.43 84.22 0.92 50.87 50.84 82.82 44.29 50.85 0.94
DTCDSCN 41.09 91.48 88.43 98.12 82.89 89.89 1.01 92.39 86.08 96.34 81.24 88.91 1.14
BIT 11.95 92.57 88.92 98.27 84.00 90.65 0.88 78.31 83.15 92.44 70.10 80.46 1.14
ChangeFormer 29.16 90.23 86.22 97.81 80.41 88.11 1.029 89.58 83.46 95.46 77.49 86.19 1.53
Pyramid-SCDFormer-T 21.39 91.97 88.11 98.14 82.96 89.94 1.11 91.07 85.35 96.00 79.87 87.94 1.92
Pyramid-SCDFormer-S 21.48 92.41 89.38 98.29 84.27 90.84 1.20 91.14 86.72 96.21 81.03 88.77 1.58
Pyramid-SCDFormer-B 38.64 92.72 90.18 98.39 85.11 91.41 1.55 92.22 86.86 96.43 81.81 89.31 2.24
Note: All values are reported in percentage (%). Pre. denotes precision, Rec. denotes recall, Param. denotes the number of parameters of the model in millions, and Time represents the average number of minutes used to train one epoch. Bolded values mark the best and second-best results.

5.3. Comparisons on the public CD datasets


To demonstrate the robustness of the Pyramid-SCDFormer model, we selected two publicly
available CD datasets, LEVIR-CD and WHU_CD, for our experiments. Table 5 presents the
segmentation performance, model parameters, and training time cost of different CD methods
on the LEVIR-CD and WHU_CD test sets.
From Table 5, we can see that the proposed method outperforms most of the baselines on the LEVIR-CD
and WHU_CD datasets in precision, recall, OA, MIoU, and F1, and performs consistently well on the
WHU_CD dataset with a slightly higher training time cost and fewer model parameters. Notably,
Pyramid-SCDFormer-B improves on the previous state-of-the-art network's OA/MIoU/F1 by
0.12/1.11/0.76% and 0.90/0.57/0.50% on the LEVIR-CD and WHU_CD datasets, respectively. The
second-ranked Pyramid-SCDFormer-S obtains OA/MIoU/F1 of 98.29/84.27/90.84%. Pyramid-SCDFormer-T
and DTCDSCN obtain similar recognition results on the LEVIR-CD dataset. FC-EF has the lowest
OA/MIoU/F1 of 94.91/47.45/48.69% among the nine contrasting networks on the LEVIR-CD dataset.
FC-EF has only four max-pooling and four upsampling layers and is shallower than U-Net
(Ronneberger, Fischer, and Brox 2015); therefore, its ability to extract change features is
insufficient. FC-Siam-Co is 0.18/1.77/1.51% higher than the FC-Siam-Di network in OA/MIoU/F1 on
the LEVIR-CD dataset, which indicates that the concatenation operation preserves more change
information than the difference operation. The segmentation performance of the DTCDSCN network is
only slightly lower than that of Pyramid-SCDFormer-B on the WHU_CD dataset, thanks to its spatial
pyramid module and attention module that extract multi-scale contextual features in the decoder
phase, which follows the same idea as the Siamese pyramid transformer encoder of this paper.
The radar plot (Figure 3) shows the accuracy of the 'change' type for different CD models on the
LEVIR-CD dataset. Because the 'change' type accounts for a small proportion of pixels, its Recall
(R_1), Precision (P_1), F1 (F1_1), and MIoU (MIoU_1) better highlight the advantages of the
proposed model on small-scale classes of changed interest. For the 'change' type, all
transformer-based models yield precision > 81%, recall > 73%, F1 > 77%, and MIoU > 63%. The
Pyramid-SCDFormer achieves the highest precision/recall/F1/MIoU of 71.91/83.66/86.44/81.05% for
the 'change' type. Hence, transformer-based models perform stably on the CD problem.
Moreover, visual comparison results on the LEVIR-CD and WHU_CD datasets are displayed in
Figure 4. For better visualization, white, black, red, and green represent TP, TN, FP, and FN,
respectively. Overall, the proposed model achieves better results than the other models.
The Pyramid-SCDFormer shows stronger robustness under 'non-semantic change' conditions,
such as changes caused by lighting (Figure 4(a)) and seasonal differences (Figure 4(c)),
which indicates that the proposed model effectively learns global context information under
long-range spatiotemporal conditions and excludes irrelevant changes. In scenes with more complex
backgrounds, it recognizes finer detail than the other networks (Figure 4(b,c)). Compared with
the other models, the proposed network also better handles multiscale changes, as can be
seen in Figure 4(b,d,f,j). For large-scale changes, the proposed model has a larger receptive
field for complete feature extraction and obtains a more intact building shape (Figure 4(d,f));
for small-scale changes, it achieves finer detail recognition than the other networks (Figure 4(b,j)).
This outstanding recognition of multiscale change types is thanks to the SSA module,
which captures multiscale features within one self-attention layer via multiscale token
aggregation.

5.4. Comparisons on the Landsat-SCD dataset


As shown in Figure 5, we compare the OA curves of the existing CD models with the Pyramid-
SCDFormer model on the Landsat-SCD dataset during the training and validation phases. It can be
observed that the Pyramid-SCDFormer shows a good fit between the training and validation
curves and achieves high and stable learning performance.

Figure 3. The accuracy of ‘change’ type using different CD models on the LEVIR-CD dataset.
Note: P_1, R_1, F1_1, and MIoU_1 represent precision, recall, F1 score, and Mean Intersection over Union, respectively. The value at the inner circle of the
radar plot is 40% and the value at the outer boundary is 90%; a higher value means higher accuracy.

Figure 4. Visualizing comparison results of LEVIR-CD and WHU_CD datasets.


Note: Four colors are used for better visualization. White for true positive, black for true negative, dark gray denotes that ‘no change’ type is wrongly
classified into ‘change’ type for false positive, light gray indicates that ‘change’ type is missed for false negative.

Figure 5. The OA curve of different CD models in the training and validating phases.

As shown in Table 6, the three variants of the Pyramid-SCDFormer with different configurations
all achieve the best results compared with other state-of-the-art methods on the Landsat-SCD dataset.
Pyramid-SCDFormer-B achieves the highest OA/MIoU/F1 of 96.08/59.91/72.50%. Notably, compared
with the existing best-performing network (BIT), the OA/MIoU/F1 of Pyramid-SCDFormer-B increase
by 0.99/8.75/8.59% on the Landsat-SCD dataset. The third-ranked Pyramid-SCDFormer-T obtains
OA/MIoU/F1 of 95.75/56.13/68.52%, which are 0.66/4.97/4.61% higher than the current state of the
art, respectively. This large improvement of the Pyramid-SCDFormer not only further verifies the
effectiveness of the SSA module and the distance-map fusion, but also demonstrates the gain from
their combination.

Table 6. The overall quantitative results of different CD methods on the Landsat-SCD dataset.
Method Param. (M) Time (min) Pre. Rec. OA MIoU F1
FC-EF 13.54 5.92 54.54 29.89 91.98 26.61 33.01
FC-Siam-Di 13.63 3.68 56.34 30.30 92.35 27.22 34.08
FC-Siam-Co 15.50 3.73 55.48 31.80 92.51 28.16 34.87
DTCDSCN 41.10 6.25 60.35 45.46 93.20 38.35 49.48
BIT 11.96 7.58 67.42 61.03 95.09 51.16 63.91
ChangeFormer 29.78 11.98 63.23 57.48 94.95 47.70 59.67
Pyramid-SCDFormer-T 21.40 11.55 68.57 69.17 95.75 56.13 68.52
Pyramid-SCDFormer-S 21.49 11.69 74.42 70.49 96.02 59.55 72.37
Pyramid-SCDFormer-B 38.66 16.25 74.61 70.83 96.08 59.91 72.50
Note: Pre. denotes precision, Rec. denotes recall, Param. denotes the number of parameters of the model in millions, and Time
represents the average number of minutes used to train an epoch of the model. The bolded represents the best and second
best experimental results.
Table 7. The CD models performance about each change type on the Landsat-SCD dataset.
Change type   Proportion (%)   FC-EF   FC-Siam-Di   FC-Siam-Co   DTCDSCN   BIT   ChangeFormer   Pyramid-SCDFormer-T   Pyramid-SCDFormer-S   Pyramid-SCDFormer-B
No change 81.11 91.66 92.04 92.19 93.03 94.91 94.80 95.62 95.89 95.93
Farmland to desert 1.81 0.00 3.10 0.49 30.78 41.85 39.86 49.86 51.20 51.13
Farmland to building 0.95 15.36 19.00 18.93 40.83 59.44 54.95 67.97 69.36 70.69
Desert to farmland 11.61 51.32 51.81 56.04 60.32 68.18 68.21 71.83 73.65 74.21
Desert to building 0.77 0.01 0.02 2.59 32.24 52.11 49.44 62.50 62.83 66.23
Desert to water 2.12 51.11 50.94 53.15 51.91 69.24 66.61 71.50 72.54 72.73
Building to farmland 0.33 0.00 0.00 0.00 3.56 20.48 14.72 26.56 36.29 40.01
Building to desert 0.11 0.00 0.00 0.00 6.95 20.23 15.81 23.80 29.43 32.05
Water to farmland 0.09 0.00 0.00 0.00 2.73 15.71 6.00 18.46 29.93 22.88
Water to desert 1.10 56.64 55.28 58.25 61.10 69.43 66.61 73.17 74.37 73.21
Note: The bolded represents the best and second best experimental results.

Figure 6. Visualizing comparison results on the Landsat-SCD dataset.


Note: Ten colors are used for better visualization of the ten change types. Blue (RGB 0, 47, 167) denotes false positives and false negatives.

Table 8. Ablation study on model efficiency and cost-effectiveness on LEVIR_CD, WHU_CD, and Landsat-SCD datasets.
Dataset name Network Param. (M) Time (min) Pre. Rec. OA MIoU F1
LEVIR-CD ChangeFormer 29.16 1.02 72.49 85.94 88.11 77.36 80.41
Pyramid-SCDFormer-same 18.59 0.97 75.66 88.20 89.79 80.59 82.75
WHU_CD ChangeFormer 29.16 1.35 89.58 83.46 95.46 77.49 86.19
Pyramid-SCDFormer-same 18.59 1.02 90.88 86.14 96.08 80.39 88.32
Landsat-SCD ChangeFormer 29.78 11.98 63.23 57.48 94.95 47.70 59.67
Pyramid-SCDFormer-same 18.61 9.42 71.95 68.99 95.79 57.56 70.32
Note: Pre. denotes precision, Rec. denotes recall, Param. denotes the number of parameters of the model in millions, and time
represents the average number of minutes used to train an epoch of the model. The bolded represents the best experimental
results.

Table 7 shows the MIoU performance for the 10 change types among different CD models on the
Landsat-SCD dataset. Notably, the proposed model yields an obvious improvement for change types
with small proportions. 'Water to farmland' is the smallest change type (0.09%); compared with the
best existing result (the BIT model), the proposed model improves it by 7.17%. For change types
with proportions below 1%, such as 'water to farmland', 'building to desert', 'building to
farmland', 'desert to building', and 'farmland to building', MIoU increases by 7.17-19.53%
compared with the best existing models. For change types with proportions between 1% and 20%, such
as 'farmland to desert', 'desert to farmland', 'desert to water', and 'water to desert', MIoU
increases by 3.49-9.53%. Hence, the proposed model is more effective at boosting change types with
small proportions.
Figure 6 shows the performance of different CD methods on the Landsat-SCD dataset. Blue pixels
represent recognition errors, so fewer blue pixels mean fewer misclassifications.
In general, the semantic CD results of the proposed model are closest to the ground truth. First,
the Pyramid-SCDFormer model keeps more precise boundaries of multiscale change objects
than all the baselines in Figure 6(a-c), which demonstrates that more useful fine-grained features
are preserved to improve accuracy; compared with other state-of-the-art models, missed
detections and false alarms are significantly reduced in the semantic change maps. Second, the
proposed model accurately identifies small change objects that other existing models tend to miss
in the relatively complex scenarios of the Landsat-SCD dataset, such as Figure 6(a,c). For example,
the proposed model not only accurately identifies almost all small-scale 'desert to farmland'
changes, but also maintains fine boundary information in Figure 6(c). Therefore, the Pyramid-
SCDFormer can effectively recognize scale-variant change types and keep finer boundaries;
in particular, the improvement in recognizing small-scale changes is most obvious in complex
scenarios.

5.5. Ablation experiments


To verify the effectiveness of the SSA module, we perform ablation experiments on the LEVIR-CD,
WHU_CD, and Landsat-SCD datasets. Pyramid-SCDFormer-same uses the same depth as
ChangeFormer, where the depth is the number of SSA transformer blocks at each
stage.
Table 8 shows that our Pyramid-SCDFormer model outperforms the ChangeFormer model in
precision/recall/OA/MIoU/F1 at the same depth, with much smaller computational complexity and
fewer model parameters. Comparing the segmentation results of our model across the three datasets,
it is worth noting that the performance gain is smallest on the WHU_CD dataset, which has a higher
spatial resolution, and largest on the Landsat-SCD dataset, which has a lower spatial resolution:
our Pyramid-SCDFormer model improves more on datasets containing more small objects and fine
details. This further indicates the efficiency and cost-effectiveness of our Pyramid-SCDFormer
model in recognizing small-scale changes and the fine edges of change types.

6. Conclusion
In this work, a new SCD benchmark dataset, Landsat-SCD, is created, which largely complements
existing SCD datasets. We benchmark the Landsat-SCD dataset using classical approaches from
BCD and SCD tasks. Extensive experimental results show that the proposed dataset is challenging
and useful and will facilitate future research on effective methods for refined SCD tasks.
We then present a novel transformer-based Siamese network, Pyramid-SCDFormer, trained
end-to-end from scratch, that surpasses the state of the art for bitemporal remote sensing SCD.
Compared with three prior CNN-based, one attention-based, and two transformer-based networks, the
Pyramid-SCDFormer achieves the best performance on the LEVIR-CD, WHU_CD, and Landsat-
SCD datasets. Most notably, the SSA is introduced into the pyramid Siamese architecture
to effectively capture multiscale context features, achieving precise recognition of multiscale
changes and the fine edges of objects in complicated detection scenes.
In future work, we will continue to expand the multi-region dataset to further verify the
generalization of CD models, hoping to promote the development of SCD. Moreover, we will
further study how to reduce computational consumption while ensuring that the SCD model
can extract refined and multiscale changes in complex scenes.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Data availability statement


The Landsat-SCD dataset is available at https://doi.org/10.6084/m9.figshare.19946135.v1

Funding
This work was supported by National Key Research and Development Program of China [Grant
number 2017YFB0504203]; Xinjiang Production and Construction Corps Science and Technology Project: [Grant
number 2017DB005].

References
Alcantarilla, P. F., S. Stent, G. Ros, R. Arroyo, and R. Gherardi. 2018. “Street-View Change Detection with
Deconvolutional Networks.” Autonomous Robots 42 (7): 1301–1322.
Bandara, W. G. C., and V. M. Patel. 2022a. “A Transformer-Based Siamese Network for Change Detection.” arXiv
Preprint ArXiv:2201.01293.
Bandara, W. G. C., and V. M. Patel. 2022b. “Revisiting Consistency Regularization for Semi-Supervised Change
Detection in Remote Sensing Images.” arXiv Preprint ArXiv:2204.08454.
Benedek, C., and T. Szirányi. 2009. “Change Detection in Optical Aerial Images by a Multilayer Conditional Mixed
Markov Model.” IEEE Transactions on Geoscience and Remote Sensing 47 (10): 3416–3430.
Chen, H., Z. Qi, and Z. Shi. 2021. “Remote Sensing Image Change Detection with Transformers.” IEEE Transactions
on Geoscience and Remote Sensing 60: 1–14.
Chen, H., and Z. Shi. 2020. “A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing
Image Change Detection.” Remote Sensing 12 (10): 1662.
Chen, J., Z. Yuan, J. Peng, L. Chen, H. Huang, J. Zhu, Yu Liu, and H. Li. 2020. “DASNet: Dual Attentive Fully
Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images.” IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing 14: 1194–1206.
Daudt, R. C., B. Le Saux, and A. Boulch. 2018. “Fully Convolutional Siamese Networks for Change Detection.” In
International Conference on Image Processing (ICIP), 4063–4067.
Daudt, R. C., B. Le Saux, A. Boulch, and Y. Gousseau. 2018. “Urban Change Detection for Multispectral Earth
Observation Using Convolutional Neural Networks.” In International Geoscience and Remote Sensing
Symposium (IGARSS), 2115–2118.

Daudt, R. C., B. Le Saux, A. Boulch, and Y. Gousseau. 2019. “Multitask Learning for Large-Scale Semantic Change
Detection.” Computer Vision and Image Understanding 187: 102783.
De Bem, P. P., O. A. de Carvalho Junior, R. Fontes Guimarães, and R. A. Trancoso Gomes. 2020. “Change Detection
of Deforestation in the Brazilian Amazon Using Landsat Data and Convolutional Neural Networks.” Remote
Sensing 12 (6): 901.
Deng, J., W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. 2009. “Imagenet: A Large-Scale Hierarchical Image
Database.” In International Conference on Computer Vision and Pattern Recognition (CVPR), 248–255.
Dosovitskiy, A., L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, et al. 2020. "An
Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv Preprint ArXiv:2010.11929.
Everingham, M., L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. 2010. “The Pascal Visual Object Classes
(VOC) Challenge.” International Journal of Computer Vision 88 (2): 303–338.
Fujita, A., K. Sakurada, T. Imaizumi, R. Ito, S. Hikosaka, and R. Nakamura. 2017. “Damage Detection from Aerial
Images Via Convolutional Neural Networks.” In International Conference on Machine Vision Applications
(MVA), 5–8.
Gedara Chaminda Bandara, W., N. Gopalakrishnan Nair, and V. M. Patel. 2022. “Remote Sensing Change Detection
(Segmentation) Using Denoising Diffusion Probabilistic Models.” arXiv e-Prints: ArXiv-2206.
Gong, M., Y. Yang, T. Zhan, X. Niu, and S. Li. 2019. “A Generative Discriminatory Classified Network for Change
Detection in Multispectral Imagery.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing 12 (1): 321–333.
Ji, S., S. Wei, and M. Lu. 2018. “Fully Convolutional Networks for Multisource Building Extraction from an
Open Aerial and Satellite Imagery Data Set.” IEEE Transactions on Geoscience and Remote Sensing 57 (1):
574–586.
Li, X., M. He, H. Li, and H. Shen. 2021. “A Combined Loss-Based Multiscale Fully Convolutional Network
for High-Resolution Remote Sensing Image Change Detection.” IEEE Geoscience and Remote Sensing Letters
19: 1–5.
Liu, Z., Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. 2021. “Swin Transformer: Hierarchical Vision
Transformer Using Shifted Windows.” In International Conference on Computer Vision (ICCV), 10012–10022.
Liu, Y., C. Pang, Z. Zhan, X. Zhang, and X. Yang. 2020. “Building Change Detection for Remote Sensing Images
Using a Dual-Task Constrained Deep Siamese Convolutional Network Model.” IEEE Geoscience and Remote
Sensing Letters 18 (5): 811–815.
López-Fandiño, J., A. S. Garea, D. B. Heras, and F. Argüello. 2018. “Stacked Autoencoders for Multiclass Change
Detection in Hyperspectral Images.” In International Geoscience and Remote Sensing Symposium (IGARSS),
1906–1909.
Loshchilov, I., and F. Hutter. 2018. "Fixing Weight Decay Regularization in Adam." arXiv Preprint
ArXiv:1711.05101.
Naegeli, K., M. Huss, and M. Hoelzle. 2019. “Change Detection of Bare-Ice Albedo in the Swiss Alps.” The Cryosphere
13 (1): 397–412.
Papadomanolaki, M., S. Verma, M. Vakalopoulou, S. Gupta, and K. Karantzalos. 2019. “Detecting Urban Changes
with Recurrent Neural Networks from Multitemporal Sentinel-2 Data.” In International Geoscience and Remote
Sensing Symposium (IGARSS), 214–217.
Peng, D., L. Bruzzone, Y. Zhang, H. Guan, and P. He. 2021. “SCDNET: A Novel Convolutional Network for Semantic
Change Detection in High Resolution Optical Remote Sensing Imagery.” International Journal of Applied Earth
Observation and Geoinformation 103: 102465.
Peng, D., Y. Zhang, and H. Guan. 2019. “End-to-End Change Detection for High Resolution Satellite Images Using
Improved UNet++.” Remote Sensing 11 (11): 1382.
Ren, S., D. Zhou, S. He, J. Feng, and X. Wang. 2021. “Shunted Self-Attention via Multi-Scale Token Aggregation.”
arXiv Preprint ArXiv:2111.15193.
Ronneberger, O., P. Fischer, and T. Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image
Segmentation.” In International Conference on Medical Image Computing and Computer-Assisted Intervention
(MICCAI), 234–241.
Shafique, A., G. Cao, Z. Khan, M. Asad, and M. Aslam. 2022. “Deep Learning-Based Change Detection in Remote
Sensing Images: A Review.” Remote Sensing 14 (4): 871.
Shi, Q., M. Liu, S. Li, X. Liu, F. Wang, and L. Zhang. 2021. “A Deeply Supervised Attention Metric-Based Network
and an Open Aerial Image Dataset for Remote Sensing Change Detection.” IEEE Transactions on Geoscience and
Remote Sensing 60: 1–16.
Singh, A. 1989. “Review Article Digital Change Detection Techniques Using Remotely-Sensed Data.” International
Journal of Remote Sensing 10 (6): 989–1003.
Song, A., J. Choi, Y. Han, and Y. Kim. 2018. “Change Detection in Hyperspectral Images Using Recurrent 3D Fully
Convolutional Networks.” Remote Sensing 10 (11): 1827.
Tian, S., A. Ma, Z. Zheng, and Y. Zhong. 2020. “Hi-UCD: A Large-Scale Dataset for Urban Semantic Change
Detection in Remote Sensing Imagery.” arXiv Preprint ArXiv:2011.03247.
INTERNATIONAL JOURNAL OF DIGITAL EARTH 1525

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł Kaiser, and I. Polosukhin. 2017. “Attention
Is All You Need.” In Conference and Workshop on Neural Information Processing Systems (NIPS), 30.
Wang, G., B. Li, T. Zhang, and S. Zhang. 2022. “A Network Combining a Transformer and a Convolutional Neural
Network for Remote Sensing Image Change Detection.” Remote Sensing 14 (9): 2228.
Wang, W., E. Xie, X. Li, D. P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao. 2021. “Pyramid Vision Transformer:
A Versatile Backbone for Dense Prediction Without Convolutions.” In International Conference on Computer
Vision (ICCV), 568–578.
Wang, Q., Z. Yuan, Q. Du, and X. Li. 2018. “GETNET: A General End-to-End 2-D CNN Framework for
Hyperspectral Image Change Detection.” IEEE Transactions on Geoscience and Remote Sensing 57 (1): 3–13.
Wang, D., J. Zhang, B. Du, G. S. Xia, and D. Tao. 2022. “An Empirical Study of Remote Sensing Pretraining.” IEEE
Transactions on Geoscience and Remote Sensing, 1.
Yang, Y., H. Gu, Y. Han, and H. Li. 2020. "An End-to-End Deep Learning Change Detection Framework for Remote
Sensing Images." In 2020 IEEE International Geoscience and Remote Sensing Symposium (IGARSS).
Zhang, M., G. Xu, K. Chen, M. Yan, and X. Sun. 2018. “Triplet-Based Semantic Relation Learning for Aerial Remote
Sensing Image Change Detection.” IEEE Geoscience and Remote Sensing Letters 16 (2): 266–270.
Zhang, C., P. Yue, D. Tapete, L. Jiang, B. Shangguan, L. Huang, and G. Liu. 2020. “A Deeply Supervised Image Fusion
Network for Change Detection in High Resolution Bi-Temporal Remote Sensing Images.” ISPRS Journal of
Photogrammetry and Remote Sensing 166: 183–200.
Zhou, B., H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. 2017. "Scene Parsing Through ADE20K Dataset." In
Conference on Computer Vision and Pattern Recognition (CVPR), 633–641.
