A transformer-based Siamese network and an open optical dataset for semantic change detection of remote sensing images
Panli Yuan, Qingzhan Zhao, Xingbiao Zhao, Xuewen Wang, Xuefeng Long &
Yuchen Zheng
To cite this article: Panli Yuan, Qingzhan Zhao, Xingbiao Zhao, Xuewen Wang, Xuefeng Long &
Yuchen Zheng (2022) A transformer-based Siamese network and an open optical dataset for
semantic change detection of remote sensing images, International Journal of Digital Earth,
15:1, 1506-1525, DOI: 10.1080/17538947.2022.2111470
1. Introduction
Change detection (CD) of remote sensing images (RSIs), the process of extracting land-cover change information by analysing a pair of co-registered remote sensing images of the same area at different times, is a hot topic in the intelligent interpretation of remote sensing images (Shafique et al. 2022). The definition of change detection varies considerably across applications. Detecting changes manually is a time-consuming and labor-intensive task (Singh 1989). Automated CD is therefore one of the key technologies for earth observation applications, and plays an important role in urban expansion (Chen and Shi 2020), deforestation
(De Bem et al. 2020), disaster assessment (Fujita et al. 2017), and other practical applications (Naegeli, Huss, and Hoelzle 2019). Especially for areas with fragile ecological environments, regular and long-term land-cover monitoring is increasingly vital. Fortunately, with the improvement of earth observation technologies, massive high-quality, multitemporal, wide-coverage RSIs provide key data support for CD tasks.
According to the type of semantic label desired in the output change map, CD falls into two categories: binary change detection (BCD), which answers only 'where changes happen', and semantic change detection (SCD), which answers both 'where changes happen' and 'how changes happen' in parallel. Hence, SCD provides a 'from-to' change map that indicates the change direction and contains more comprehensive land-cover change information; such detailed class-transition information is crucial for specific applications (Peng et al. 2021).
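For illustration, a from-to SCM can be derived from two co-registered land-cover maps by pairing each pixel's class at the two dates. The following minimal sketch uses hypothetical class codes, not the actual Landsat-SCD label specification:

```python
import numpy as np

# Hypothetical land-cover codes: 0 = farmland, 1 = desert, 2 = building, 3 = water.
lc_t1 = np.array([[0, 0],
                  [1, 3]])           # land-cover map at time T1
lc_t2 = np.array([[0, 2],
                  [0, 3]])           # land-cover map at time T2

n_classes = 4
# Encode every changed (from, to) pair as a unique positive integer; 0 = no change.
scm = np.where(lc_t1 != lc_t2, lc_t1 * n_classes + lc_t2 + 1, 0)
print(scm)   # [[0 3] [5 0]]: 'farmland to building' and 'desert to farmland'
```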
Current CD networks are data-driven, deep learning-based methods, so well-annotated CD datasets play a crucial role in exploring novel CD methods. Many BCD datasets exist (Chen and Shi 2020; Wang et al. 2018; Daudt et al. 2018), but few well-annotated from-to datasets are openly available. Additionally, existing datasets suffer from several bottlenecks: (1) lack of long-range multi-temporal RSIs; (2) lack of multiple change types with detailed annotation; (3) lack of SCD information, since most datasets reflect only whether a change occurred and do not report the direction of land cover transformation, for example, from farmland to building. To a certain extent, the exploration of from-to CD datasets accelerates future research on SCD methods. In this paper, we create a large-scale SCD dataset with more varied change types and from-to information, richer prior land cover information, and longer time series RSIs.
At present, the majority of CD methods resort to various convolutional neural networks (CNNs) to realize BCD. One line of work fuses the bands of an image pair and feeds them into an end-to-end network to obtain a pixel-level change map (Alcantarilla et al. 2018; Peng, Zhang, and Guan 2019). Another is based on deep Siamese networks: the input image pair is fed into two weight-sharing feature-extraction branches and embedded into a feature space where the distance between changed pairs is large and that between unchanged pairs is small (Daudt, Le Saux, and Boulch 2018; Chen et al. 2020). However, the intrinsic locality of the convolution operation degrades performance in representing image features and imposes a further constraint on modeling explicit long-range relations.
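As a rough sketch of the Siamese idea (not any specific published architecture; the tiny encoder and the mean-based threshold below are placeholders), a weight-shared encoder embeds both images and the per-pixel feature distance is thresholded into a change map:

```python
import torch
import torch.nn as nn

class TinySiameseCD(nn.Module):
    """Minimal Siamese BCD sketch: shared encoder, per-pixel feature distance."""
    def __init__(self, in_ch=3, feat_ch=16):
        super().__init__()
        self.encoder = nn.Sequential(      # the same weights process both dates
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x_t1, x_t2):
        f1, f2 = self.encoder(x_t1), self.encoder(x_t2)
        return torch.norm(f1 - f2, dim=1)  # large where changed, small elsewhere

model = TinySiameseCD()
dist = model(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
change_map = (dist > dist.mean()).float()  # toy threshold for illustration
```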
To enhance the performance of CD, the latest studies focus on increasing the receptive field and improving feature extraction and refinement. Peng, Zhang, and Guan (2019) and Zhang et al. (2018) utilize multiscale atrous convolution to extract multiscale features. Others explore the nonlocal operations of attention mechanisms to enhance the global context of features (Chen and Shi 2020; Chen et al. 2020). Nevertheless, existing CNN-based CD methods still struggle to relate long-range concepts in space-time, yet long-range context information is essential for detecting the semantic changes of bitemporal RSIs.
Inspired by the encouraging performance of the transformer in the computer vision (CV) area, transformer-based approaches have been proposed for the downstream CD task (Bandara and Patel 2022a; Chen, Qi, and Shi 2021). Thanks to global self-attention, vision transformers (ViTs) model long-range spatial and temporal relations more strongly than CNN-based networks (Dosovitskiy et al. 2020). However, their ability to capture multiscale features degrades on visual tasks with objects at different scales, so they tend to miss small objects and fine object edges. Moreover, despite the wider receptive field and more robust long-range context modeling of ViTs, transformer-based SCD has not yet been explored in depth, and most recent related research focuses on BCD.
Based on the above research on CD, we argue that extracting objects at various scales, especially small objects and fine edges in the change map, requires capturing long-range and multiscale context features in SCD tasks. To meet these challenges, we introduce the Pyramid-SCDFormer
network. At the encoder stage, the shunted self-attention (SSA) module (Ren et al. 2021) is integrated to better model multiscale features among different attention heads within one self-attention layer, and the multi-level features from the Siamese network are then concatenated to obtain the distance feature maps of the bitemporal RSIs. The semantic change map (SCM) is finally obtained after processing by a multi-layer perceptron (MLP) and upsampling in the decoder.
For feature extraction, the Pyramid-SCDFormer differs from previous transformer-based CD methods: thanks to the SSA module, it learns to extract pyramid features of changed objects at different scales in different attention heads within one attention layer, in an efficient and effective manner. Hence, it retains more fine-grained features and clearly identifies the small objects and fine change edges that other models easily ignore. For the distance maps of the bitemporal images, we concatenate the features of different levels from the encoder stage to obtain the final distance map, whereas previous CD models mainly calculate the absolute distance.
In sum, the main contributions are as follows:
(1) For clear recognition of small-scale objects and fine boundaries of changed objects, we propose a novel end-to-end transformer-based SCD network, Pyramid-SCDFormer, which utilizes the SSA module to capture features at different scales simultaneously with favorable efficiency, then integrates hierarchical features of the changed land covers in a Siamese network to obtain fine-grained features of changes.
(2) A new, open-source optical satellite SCD dataset with an unprecedented time series and semantic change types, Landsat-SCD, is presented; it comprises 8468 pairs of multispectral Landsat images with 10 change-type classes. Landsat-SCD is available to advance state-of-the-art models in BCD and SCD tasks.
(3) Extensive experiments confirm the validity of the proposed Pyramid-SCDFormer. The proposed model mitigates misdetections of small-scale changes and fine edges, and achieves state-of-the-art performance on the LEVIR-CD, WHU_CD, and the proposed Landsat-SCD benchmarks.
The remainder of this paper is structured as follows. Section 2 presents the related work of the
proposed network and dataset. Section 3 describes the proposed dataset in detail. Section 4 presents
the architecture of the proposed Pyramid-SCDFormer network and each network module will be
introduced in detail. Section 5 reports the experimental results and discusses the performance of
the proposed network. Section 6 concludes this paper with remarks and future work.
2. Related work
2.1. CNN-based CD methods
Deep learning-based CD methods for RSIs are rapidly evolving and yielding good results, such as
supervised (Zhang et al. 2018; Chen and Shi 2020; Li et al. 2021), unsupervised (Gong et al. 2019),
and semi-supervised (Bandara and Patel 2022b) for different CD datasets. Here, we focus on super-
vised CNN-based methods for CD tasks. Most prior CD methods benefit from the semantic representation capability of CNNs. To improve the recognition performance of CNN-based CD methods, scholars mainly optimize the network structure, introduce attention mechanisms, and apply other tricks.
There are roughly three ways to improve the feature extraction of CNN-based architectures: multiscale features, spatial-temporal features, and residual connections. Zhang et al. (2018) introduce atrous convolution in ResNet101 to enlarge the receptive field for capturing multiscale context information, and exploit Atrous Spatial Pyramid Pooling (ASPP) to extract features at various scales. Yang et al. (2020) propose an end-to-end deep learning CD framework based on D-LinkNet to overcome the boundary errors of traditional block-wise processing. Li et al. (2021)
propose the MFCN network, which uses multiscale convolution filters to extract detailed information. Gedara Chaminda Bandara, Gopalakrishnan Nair, and Patel (2022) first use Denoising Diffusion Probabilistic Models (DDPM) to leverage multiscale information from RSIs, then train a CD classifier for the precise CD task. To explore temporal features, the BiDateNet network (Papadomanolaki et al. 2019) imports Long Short-Term Memory networks (LSTMs) to improve CD accuracy. Song et al. (2018) propose the convLSTM network, combining 3D Fully Convolutional Networks (FCNs) and LSTMs for hyperspectral image CD to preserve spectral-spatial features. Chen and Shi (2020) utilize classic residual connections for coarse-grained and fine-grained features of changes. UNet++ (Peng, Zhang, and Guan 2019) employs dense skip connections to improve spatial accuracy and reduce pseudo-changes by enhancing scale variance. FC-Siam-Co and FC-Siam-Di (Daudt, Le Saux, and Boulch 2018) make the best of skip connections to achieve multiscale feature extraction for CD. CNN-based CD methods are good at extracting high-level semantic features that reveal the change of interest, but they focus on local modeling and fail to alleviate the pseudo-change effect.
Consequently, the strategy of leveraging attention mechanism for CD has also been applied to
obtain discriminative information. A deep supervised image fusion network (IFN) (Zhang et al.
2020) fuses multi-level deep features in a channel attention-wise manner to improve boundary
completeness and internal compactness. To capture more discriminative features, DSAMNet
(Shi et al. 2021) imports a Convolutional Block Attention Module (CBAM) into the network
to obtain the spatial and channel information simultaneously. DASNet (Chen et al. 2020) and
DTCDSCN (Liu et al. 2020) introduce a dual attention module (DAM) to get more discriminative
information about semantic changes in RSIs. However, the above approaches only reweight feature information along the spatial and channel dimensions, which limits the capture of the long-range spatiotemporal information required for accurate CD. As a supplement, the self-attention mechanism can capture more contextual information to solve long-range dependency problems and relieve the influence of pseudo-changes (Chen and Shi 2020).
The proposed Pyramid-SCDFormer, a transformer model embedded with the SSA module, is effective in representing multiscale semantic features and capturing long-range spatiotemporal information of the change of interest.
(1) Lack of long-range multi-temporal RSIs. Most existing public CD datasets contain bitemporal RSIs of the same area, and a few contain remote sensing images of three phases. The SZTAKI Air dataset (Benedek and Szirányi 2009) is the earliest CD dataset, comprising 1000 pairs of bitemporal images with a size of 800 × 600 and a resolution of 0.5 m. To make the best of the rich change information in high-resolution RSIs, the Learning, Vision and Remote Sensing Laboratory dataset LEVIR-CD (Chen and Shi 2020) was released for monitoring building changes; it contains 637 pairs of bitemporal aerial images with a size of 1024 × 1024 and a resolution of 0.3 m. The Wuhan University (WHU) Building CD dataset (Ji, Wei, and Lu 2018) is also a bitemporal building change detection dataset, with a higher spatial resolution. The Onera Satellite Change Detection (OSCD) dataset (Daudt et al. 2018) comprises 24 pairs of multispectral Sentinel-2 satellite images from five areas around the world, each with a size of 600 × 600 and a resolution of 10 m. Moreover, the bitemporal hyperspectral image CD dataset 'River' (Wang et al. 2018) was proposed for the objective evaluation of CD methods.
(2) Lack of multiple change types with detailed annotation. LEVIR-CD and WHU_CD contain only binary building CD labels, and SZTAKI AirChange and OSCD label only binary land cover information. The high resolution semantic change detection (HRSCD) dataset (Daudt et al. 2019) is an SCD dataset, but it does not explicitly represent the transformation relationship between classes, and its label accuracy of 80–85% leads to inaccurate borders in some cases. Meanwhile, the Hi-UCD dataset (Tian et al. 2020) includes only 9-class land cover maps and binary change maps. Although HRSCD, Hi-UCD, and HCCD (López-Fandiño et al. 2018) contain more than two change types, they still lack accuracy and richness of semantic information.
(3) Lack of SCD information. The above-mentioned public datasets mostly reflect whether a change occurred but do not report the direction of land cover transformation. Refined SCD is therefore hindered by the lack of SCD datasets.
Existing CD datasets thus do not meet the needs of SCD methods, and there is still room for improvement in scale and richness of change information. Firstly, high-resolution RSIs cannot provide large-scale, ultra-long time series land use and land cover (LULC) monitoring; satellite data such as Landsat imagery, with rich historical archives, wider spatial coverage, and higher temporal resolution, is a good supplement. Secondly, the definition of change types in existing CD datasets is too broad for practical applications. The proposed Landsat-SCD dataset largely complements existing CD datasets in spatial scale, time span, and diversity of change types.
A class imbalance between changed and unchanged pixels also arises in the proposed dataset: in line with the characteristics of the real world, changed pixels account for a much smaller proportion than unchanged land covers. Thus, we provide a realistic evaluation benchmark for SCD methods.
number of periods, which are sent into the fusion module to obtain the distance maps, followed by the prediction head of the decoder to acquire the semantic change map (SCM).
$$Q_i = X W_i^{Q}, \tag{1}$$

$$V_i = V_i + \mathrm{LE}(V_i), \tag{3}$$

$$h_i = \mathrm{Softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_h}}\right) V_i, \tag{4}$$
where X is the input feature map; W_i^Q, W_i^K, and W_i^V are linear projection parameters; and r_i is the downsampling rate of the i-th head. When r_i is large, more K, V tokens are merged and the computation cost is low, capturing large-scale objects; when r_i is small, the computation cost is higher but more detailed information is preserved. Hence, we subtly mix multiple values of r_i to extract multi-granularity features within one self-attention layer. d_h is the dimension of Q and K, h_i is the output of the i-th head, Concat(·) is the concatenation operation, and MSA(X) is the output feature map of the MSA module. By integrating different r_i across attention heads, the key and value vectors capture objects at different scales within a single self-attention layer.
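The following PyTorch sketch is a simplified reading of Eqs. (1), (3), and (4), with one head group per downsampling rate; the token-merging convolution and the depthwise-convolution form of LE(·) follow Ren et al. (2021) loosely and are our assumptions rather than the authors' released code:

```python
import torch
import torch.nn as nn

class ShuntedSelfAttentionSketch(nn.Module):
    """Sketch of shunted self-attention: each head group downsamples K and V
    by its own rate r_i, so one layer attends at several scales at once."""
    def __init__(self, dim=64, rates=(4, 8)):
        super().__init__()
        self.rates = rates
        self.dim_h = dim // len(rates)              # channels per head group
        self.q = nn.Linear(dim, dim)                # W^Q, Eq. (1)
        self.kv = nn.ModuleList([nn.Linear(dim, 2 * self.dim_h) for _ in rates])
        self.down = nn.ModuleList(                  # token merging at rate r_i
            [nn.Conv2d(dim, dim, r, stride=r) for r in rates])
        self.le = nn.ModuleList(                    # LE(.): depthwise local enhancement
            [nn.Conv2d(self.dim_h, self.dim_h, 3, padding=1, groups=self.dim_h)
             for _ in rates])

    def forward(self, x, H, W):                     # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x).view(B, N, len(self.rates), self.dim_h)
        heads = []
        for i, r in enumerate(self.rates):
            xm = x.transpose(1, 2).reshape(B, C, H, W)
            xm = self.down[i](xm).flatten(2).transpose(1, 2)      # merged tokens
            k, v = self.kv[i](xm).chunk(2, dim=-1)                # (B, N/r^2, dim_h)
            v_img = v.transpose(1, 2).reshape(B, self.dim_h, H // r, W // r)
            v = v + self.le[i](v_img).flatten(2).transpose(1, 2)  # Eq. (3)
            attn = torch.softmax(
                q[:, :, i] @ k.transpose(1, 2) / self.dim_h ** 0.5, dim=-1)
            heads.append(attn @ v)                                # Eq. (4)
        return torch.cat(heads, dim=-1)                           # concat over heads

ssa = ShuntedSelfAttentionSketch(dim=64, rates=(4, 8))
out = ssa(torch.rand(2, 32 * 32, 64), H=32, W=32)                 # -> (2, 1024, 64)
```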
Distance Map: The distance map leverages the multi-level feature maps from each stage of the encoder and computes an optimal distance metric at each pyramid level, whereas traditional CD methods (Daudt, Le Saux, and Boulch 2018; Chen, Qi, and Shi 2021) directly calculate the absolute distance. The distance map is calculated as:
$$F_{\mathrm{dist}}^{i} = \mathrm{BN}\left(\mathrm{ReLU}\left(\mathrm{Conv2D}\left(\mathrm{Concat}(F_{T_1}^{i},\, F_{T_2}^{i})\right)\right)\right), \tag{6}$$

where $F_{T_1}^{i}$ and $F_{T_2}^{i}$ represent the output features of the i-th stage in the T1 and T2 periods.
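A direct PyTorch reading of Eq. (6) might look as follows; the 3 × 3 kernel and the output width are our assumptions, since the paper does not specify them here, while the stage channels (64, 128, 256, 512) follow Table 4:

```python
import torch
import torch.nn as nn

class DistanceMap(nn.Module):
    """Sketch of Eq. (6): a learned distance metric per pyramid level instead of
    the absolute difference of the bitemporal features."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)  # assumed 3x3
        self.bn = nn.BatchNorm2d(ch)

    def forward(self, f_t1, f_t2):
        fused = torch.cat([f_t1, f_t2], dim=1)        # Concat(F_T1^i, F_T2^i)
        return self.bn(torch.relu(self.conv(fused)))  # BN(ReLU(Conv2D(.)))

# One module per encoder stage, with channels following Table 4:
dist_heads = nn.ModuleList(DistanceMap(c) for c in (64, 128, 256, 512))
```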
Detail-specific Feed-forward Layer: Based on the traditional feed-forward layer, we insert a detail-specific layer between the two fully connected layers as a local-detail complement:

$$x' = \mathrm{FC}(x;\, u_1), \tag{7}$$
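A plausible sketch of this layer is shown below; taking the detail-specific complement to be a depthwise convolution with a residual connection follows Ren et al. (2021) and is our assumption, since only Eq. (7) is given here:

```python
import torch
import torch.nn as nn

class DetailSpecificFFN(nn.Module):
    """Sketch of the detail-specific feed-forward layer: a depthwise convolution
    between the two FC layers supplies the local-detail complement."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)        # x' = FC(x; u1), Eq. (7)
        self.detail = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                  # x: (B, N, C) with N = H * W
        x = torch.relu(self.fc1(x))
        img = x.transpose(1, 2).reshape(x.size(0), -1, H, W)
        x = x + self.detail(img).flatten(2).transpose(1, 2)  # local complement
        return self.fc2(x)
```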
(1) FC-EF (Daudt, Le Saux, and Boulch 2018): a CNN-based network. The concatenated bitemporal images are fed into a fully convolutional network, with skip connections transporting multiscale features.
(2) FC-Siam-Di (Daudt, Le Saux, and Boulch 2018): a CNN-based network. Bitemporal images are
fed into a Siamese network to capture multi-level features, and differences are transported to
the decoder.
(3) FC-Siam-Co (Daudt, Le Saux, and Boulch 2018): a CNN-based network. Bitemporal images are
fed into a Siamese network to extract multi-level features, and different level concatenations
from the encoder are used to detect changes.
(4) DTCDSCN (Liu et al. 2020): an attention-based CNN method. Bitemporal images are fed into a Siamese-based network that employs DAM to explore the correlations of the channel and spatial dimensions, capturing more discriminative features.
(5) BIT (Chen, Qi, and Shi 2021): a transformer-based method. The semantic tokens are fed into
the encoder-decoder transformer architecture to enhance context information.
(6) ChangeFormer (Bandara and Patel 2022a): a transformer-based method. A transformer encoder in a Siamese network extracts detailed and semantic features of the bitemporal images, and a lightweight decoder fuses the multi-level features to produce the change map.
where TP, FP, TN, and FN are the numbers of true positive, false positive, true negative, and false
negative pixels, respectively.
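The metric equations themselves are elided above; the sketch below uses the standard binary-CD definitions from the confusion counts (for SCD, the IoU is averaged over all change types instead of the two classes shown here):

```python
def cd_metrics(tp, fp, tn, fn):
    """Standard binary-CD metrics computed from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    oa = (tp + tn) / (tp + fp + tn + fn)       # overall accuracy
    f1 = 2 * precision * recall / (precision + recall)
    iou_change = tp / (tp + fp + fn)           # IoU of the change class
    iou_nochange = tn / (tn + fn + fp)         # IoU of the no-change class
    miou = (iou_change + iou_nochange) / 2     # mean IoU over the two classes
    return precision, recall, oa, f1, miou
```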
We implement all experiments with the PyTorch library on a GeForce RTX 3090 GPU. All networks are randomly initialized by default. We train all models with the cross-entropy loss function and the AdamW optimizer (Loshchilov and Hutter 2018) with a weight decay of 0.01. The learning rate is initially set to 0.0001, except for the WHU_CD dataset, where it is 0.00001. The batch size is 6.
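In PyTorch, the reported configuration amounts to the following setup (`model` stands for any of the compared networks, which are not shown here):

```python
import torch
from torch.optim import AdamW

def make_training_setup(model, dataset="LEVIR-CD"):
    """Optimizer and loss as reported in the text."""
    lr = 1e-5 if dataset == "WHU_CD" else 1e-4   # lower rate only for WHU_CD
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    criterion = torch.nn.CrossEntropyLoss()      # loss used for all models
    return optimizer, criterion
```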
Table 4. The different configurations of the four Pyramid-SCDFormer encoder variants. C denotes the channel width, head the number of attention heads, N the number of blocks per stage, and r_i the downsampling rate of the i-th head (the smaller rate applies to the first half of the heads, i ≤ head/2).
Stage 1 (output 104 × 104): r_i = 4 if i ≤ head/2, else 8; C1 = 64, head = 2; N = 3 (T), 2 (S), 3 (B), 3 (same).
Stage 2 (output 52 × 52): r_i = 2 if i ≤ head/2, else 4; C2 = 128, head = 4; N = 4 (T), 4 (S), 4 (B), 3 (same).
Stage 3 (output 26 × 26): r_i = 1 if i ≤ head/2, else 2; C3 = 256, head = 8; N = 6 (T), 12 (S), 24 (B), 4 (same).
Stage 4 (output 13 × 13): r_i = 1 for all heads; C4 = 512, head = 16; N = 3 (T), 1 (S), 2 (B), 3 (same).
Note: T, S, B, and same denote Pyramid-SCDFormer-T, -S, -B, and -same, respectively.
Table 5. The overall quantitative results of different CD methods on the LEVIR-CD and WHU_CD dataset.
LEVIR-CD WHU_CD
Method Param. (M) Pre. Rec. OA MIoU F1 Time (min) Pre. Rec. OA MIoU F1 Time (min)
FC-EF 1.35 55.63 50.00 94.91 47.45 48.69 0.90 82.87 72.10 92.86 65.76 76.12 1.02
FC-Siam-Di 1.35 88.69 78.48 97.09 73.66 82.71 0.89 78.14 62.72 91.28 57.28 66.69 1.04
FC-Siam-Co 1.55 88.85 80.67 97.27 75.43 84.22 0.92 50.87 50.84 82.82 44.29 50.85 0.94
DTCDSCN 41.09 91.48 88.43 98.12 82.89 89.89 1.01 92.39 86.08 96.34 81.24 88.91 1.14
BIT 11.95 92.57 88.92 98.27 84.00 90.65 0.88 78.31 83.15 92.44 70.10 80.46 1.14
ChangeFormer 29.16 90.23 86.22 97.81 80.41 88.11 1.029 89.58 83.46 95.46 77.49 86.19 1.53
Pyramid-SCDFormer-T 21.39 91.97 88.11 98.14 82.96 89.94 1.11 91.07 85.35 96.00 79.87 87.94 1.92
Pyramid-SCDFormer-S 21.48 92.41 89.38 98.29 84.27 90.84 1.20 91.14 86.72 96.21 81.03 88.77 1.58
Pyramid-SCDFormer-B 38.64 92.72 90.18 98.39 85.11 91.41 1.55 92.22 86.86 96.43 81.81 89.31 2.24
Note: All values are reported in percentage (%). Pre. denotes precision, Rec. denotes recall, Param. denotes the number of parameters of the model in millions, and Time represents the average number of minutes used to train one epoch of the model. Bold marks the best and second-best results.
Figure 3. The accuracy of the 'change' type using different CD models on the LEVIR-CD dataset.
Note: P_1, R_1, F1_1, and MIoU_1 represent precision, recall, F1 score, and mean intersection over union of the 'change' type, respectively. The value at the centre circle of the radar plot is 40% and the outer boundary is 90%; a higher value means higher accuracy.
Figure 5. The OA curve of different CD models in the training and validating phases.
As shown in Table 6, the three variants of the Pyramid-SCDFormer with different configurations all achieve the best results on the Landsat-SCD dataset compared with the other state-of-the-art methods. Pyramid-SCDFormer-B achieves the highest OA/MIoU/F1 of 96.08/59.91/72.50%. Notably, compared with the best-performing existing network (BIT), the OA/MIoU/F1 of Pyramid-SCDFormer-B increase by 0.99/8.75/8.59% on the Landsat-SCD dataset. The third-ranked Pyramid-SCDFormer-T obtains an OA/MIoU/F1 of 95.75/56.13/68.52%, which is 0.66/4.97/4.61% higher than the previous state of the art. This large improvement not only further confirms the effectiveness of the SSA module and the distance-map fusion, but also demonstrates the gain from their combination.
Table 6. The overall quantitative results of different CD methods on the Landsat-SCD dataset.
Method Param. (M) Time (min) Pre. Rec. OA MIoU F1
FC-EF 13.54 5.92 54.54 29.89 91.98 26.61 33.01
FC-Siam-Di 13.63 3.68 56.34 30.30 92.35 27.22 34.08
FC-Siam-Co 15.50 3.73 55.48 31.80 92.51 28.16 34.87
DTCDSCN 41.10 6.25 60.35 45.46 93.20 38.35 49.48
BIT 11.96 7.58 67.42 61.03 95.09 51.16 63.91
ChangeFormer 29.78 11.98 63.23 57.48 94.95 47.70 59.67
Pyramid-SCDFormer-T 21.40 11.55 68.57 69.17 95.75 56.13 68.52
Pyramid-SCDFormer-S 21.49 11.69 74.42 70.49 96.02 59.55 72.37
Pyramid-SCDFormer-B 38.66 16.25 74.61 70.83 96.08 59.91 72.50
Note: Pre. denotes precision, Rec. denotes recall, Param. denotes the number of parameters of the model in millions, and Time
represents the average number of minutes used to train an epoch of the model. The bolded represents the best and second
best experimental results.
Table 7. The MIoU of each change type for different CD models on the Landsat-SCD dataset.
Change type  Change type proportion  FC-EF  FC-Siam-Di  FC-Siam-Co  DTCDSCN  BIT  ChangeFormer  Pyramid-SCDFormer-T  Pyramid-SCDFormer-S  Pyramid-SCDFormer-B
No change 81.11 91.66 92.04 92.19 93.03 94.91 94.80 95.62 95.89 95.93
Farmland to desert 1.81 0.00 3.10 0.49 30.78 41.85 39.86 49.86 51.20 51.13
Farmland to building 0.95 15.36 19.00 18.93 40.83 59.44 54.95 67.97 69.36 70.69
Desert to farmland 11.61 51.32 51.81 56.04 60.32 68.18 68.21 71.83 73.65 74.21
Desert to building 0.77 0.01 0.02 2.59 32.24 52.11 49.44 62.50 62.83 66.23
Desert to water 2.12 51.11 50.94 53.15 51.91 69.24 66.61 71.50 72.54 72.73
Building to farmland 0.33 0.00 0.00 0.00 3.56 20.48 14.72 26.56 36.29 40.01
Building to desert 0.11 0.00 0.00 0.00 6.95 20.23 15.81 23.80 29.43 32.05
Water to farmland 0.09 0.00 0.00 0.00 2.73 15.71 6.00 18.46 29.93 22.88
Water to desert 1.10 56.64 55.28 58.25 61.10 69.43 66.61 73.17 74.37 73.21
Note: The bolded represents the best and second best experimental results.
Table 8. Ablation study on model efficiency and cost-effectiveness on LEVIR_CD, WHU_CD, and Landsat-SCD datasets.
Dataset name Network Param. (M) Time (min) Pre. Rec. OA MIoU F1
LEVIR-CD ChangeFormer 29.16 1.02 72.49 85.94 88.11 77.36 80.41
Pyramid-SCDFormer-same 18.59 0.97 75.66 88.20 89.79 80.59 82.75
WHU_CD ChangeFormer 29.16 1.35 89.58 83.46 95.46 77.49 86.19
Pyramid-SCDFormer-same 18.59 1.02 90.88 86.14 96.08 80.39 88.32
Landsat-SCD ChangeFormer 29.78 11.98 63.23 57.48 94.95 47.70 59.67
Pyramid-SCDFormer-same 18.61 9.42 71.95 68.99 95.79 57.56 70.32
Note: Pre. denotes precision, Rec. denotes recall, Param. denotes the number of parameters of the model in millions, and time
represents the average number of minutes used to train an epoch of the model. The bolded represents the best experimental
results.
Table 7 shows the MIoU of the 10 change types for different CD models on the Landsat-SCD dataset. Notably, the proposed model brings an obvious improvement for change types with small proportions. 'Water to farmland' is the smallest change type (0.09%); compared with the best baseline (BIT), the proposed model improves it by 7.17%. For change types with proportions below 1%, such as 'water to farmland', 'building to desert', 'building to farmland', 'desert to building', and 'farmland to building', MIoU increases by 7.17–19.53% over the best existing models. For change types with proportions between 1% and 20%, such as 'farmland to desert', 'desert to farmland', 'desert to water', and 'water to desert', MIoU increases by 3.49–9.53%. Hence, the proposed model is particularly effective at boosting change types with small proportions.
Figure 6 shows the performance of different CD methods on the Landsat-SCD dataset. Blue pixels represent recognition errors, so fewer blue pixels mean fewer misclassifications. In general, the semantic CD results of the proposed model are closest to the ground truth. First, the Pyramid-SCDFormer keeps more precise boundaries of multiscale changed objects than all baselines in Figure 6(a–c), demonstrating that more useful fine-grained features are preserved to improve accuracy; compared with the other state-of-the-art models, missed detections and false alarms are significantly reduced in the semantic change maps. Second, the proposed model accurately identifies small changed objects that existing models are prone to miss in the relatively complex scenarios of the Landsat-SCD dataset, such as Figure 6(a,c); for example, it identifies almost all small-scale 'desert to farmland' changes while maintaining fine boundary information in Figure 6(c). Therefore, the Pyramid-SCDFormer effectively recognizes scale-variant change types and keeps finer boundaries, and its improvement is most obvious for small-scale changes in complex scenarios.
6. Conclusion
In this work, a new SCD benchmark dataset, Landsat-SCD, is created, which largely complements existing SCD datasets. We benchmark the Landsat-SCD dataset using classical approaches for BCD and SCD tasks. Extensive experimental results show that the proposed dataset is challenging and useful, and that it will facilitate future research on effective methods for refined SCD tasks.
We then present a novel transformer-based Siamese network, Pyramid-SCDFormer, trained end-to-end from scratch, which surpasses the state of the art for bitemporal remote sensing SCD. Compared with three prior CNN-based, one attention-based, and two transformer-based networks, the Pyramid-SCDFormer achieves the best performance on the LEVIR-CD, WHU_CD, and Landsat-SCD datasets. Most notably, SSA is introduced into the pyramid Siamese architecture to effectively capture multiscale context features, achieving precise recognition of multiscale changes and fine object edges in complicated detection scenes.
In future work, we will expand the dataset to more regions to further verify the generalization of CD models and to promote the development of SCD. Moreover, we will study how to reduce computational consumption while ensuring that the SCD model can still extract refined, multiscale changes in complex scenes.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Funding
This work was supported by the National Key Research and Development Program of China [Grant number 2017YFB0504203] and the Xinjiang Production and Construction Corps Science and Technology Project [Grant number 2017DB005].
References
Alcantarilla, P. F., S. Stent, G. Ros, R. Arroyo, and R. Gherardi. 2018. “Street-View Change Detection with
Deconvolutional Networks.” Autonomous Robots 42 (7): 1301–1322.
Bandara, W. G. C., and V. M. Patel. 2022a. “A Transformer-Based Siamese Network for Change Detection.” arXiv
Preprint ArXiv:2201.01293.
Bandara, W. G. C., and V. M. Patel. 2022b. “Revisiting Consistency Regularization for Semi-Supervised Change
Detection in Remote Sensing Images.” arXiv Preprint ArXiv:2204.08454.
Benedek, C., and T. Szirányi. 2009. “Change Detection in Optical Aerial Images by a Multilayer Conditional Mixed
Markov Model.” IEEE Transactions on Geoscience and Remote Sensing 47 (10): 3416–3430.
Chen, H., Z. Qi, and Z. Shi. 2021. “Remote Sensing Image Change Detection with Transformers.” IEEE Transactions
on Geoscience and Remote Sensing 60: 1–14.
Chen, H., and Z. Shi. 2020. “A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing
Image Change Detection.” Remote Sensing 12 (10): 1662.
Chen, J., Z. Yuan, J. Peng, L. Chen, H. Huang, J. Zhu, Yu Liu, and H. Li. 2020. “DASNet: Dual Attentive Fully
Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images.” IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing 14: 1194–1206.
Daudt, R. C., B. Le Saux, and A. Boulch. 2018. “Fully Convolutional Siamese Networks for Change Detection.” In
International Conference on Image Processing (ICIP), 4063–4067.
Daudt, R. C., B. Le Saux, A. Boulch, and Y. Gousseau. 2018. “Urban Change Detection for Multispectral Earth
Observation Using Convolutional Neural Networks.” In International Geoscience and Remote Sensing
Symposium (IGARSS), 2115–2118.
Daudt, R. C., B. Le Saux, A. Boulch, and Y. Gousseau. 2019. “Multitask Learning for Large-Scale Semantic Change
Detection.” Computer Vision and Image Understanding 187: 102783.
De Bem, P. P., O. A. de Carvalho Junior, R. Fontes Guimarães, and R. A. Trancoso Gomes. 2020. “Change Detection
of Deforestation in the Brazilian Amazon Using Landsat Data and Convolutional Neural Networks.” Remote
Sensing 12 (6): 901.
Deng, J., W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. 2009. “Imagenet: A Large-Scale Hierarchical Image
Database.” In International Conference on Computer Vision and Pattern Recognition (CVPR), 248–255.
Dosovitskiy, A., L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, et al. 2020. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” arXiv Preprint ArXiv:2010.11929.
Everingham, M., L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. 2010. “The Pascal Visual Object Classes
(VOC) Challenge.” International Journal of Computer Vision 88 (2): 303–338.
Fujita, A., K. Sakurada, T. Imaizumi, R. Ito, S. Hikosaka, and R. Nakamura. 2017. “Damage Detection from Aerial
Images Via Convolutional Neural Networks.” In International Conference on Machine Vision Applications
(MVA), 5–8.
Gedara Chaminda Bandara, W., N. Gopalakrishnan Nair, and V. M. Patel. 2022. “Remote Sensing Change Detection
(Segmentation) Using Denoising Diffusion Probabilistic Models.” arXiv e-Prints: ArXiv-2206.
Gong, M., Y. Yang, T. Zhan, X. Niu, and S. Li. 2019. “A Generative Discriminatory Classified Network for Change
Detection in Multispectral Imagery.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing 12 (1): 321–333.
Ji, S., S. Wei, and M. Lu. 2018. “Fully Convolutional Networks for Multisource Building Extraction from an
Open Aerial and Satellite Imagery Data Set.” IEEE Transactions on Geoscience and Remote Sensing 57 (1):
574–586.
Li, X., M. He, H. Li, and H. Shen. 2021. “A Combined Loss-Based Multiscale Fully Convolutional Network
for High-Resolution Remote Sensing Image Change Detection.” IEEE Geoscience and Remote Sensing Letters
19: 1–5.
Liu, Z., Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. 2021. “Swin Transformer: Hierarchical Vision
Transformer Using Shifted Windows.” In International Conference on Computer Vision (ICCV), 10012–10022.
Liu, Y., C. Pang, Z. Zhan, X. Zhang, and X. Yang. 2020. “Building Change Detection for Remote Sensing Images
Using a Dual-Task Constrained Deep Siamese Convolutional Network Model.” IEEE Geoscience and Remote
Sensing Letters 18 (5): 811–815.
López-Fandiño, J., A. S. Garea, D. B. Heras, and F. Argüello. 2018. “Stacked Autoencoders for Multiclass Change
Detection in Hyperspectral Images.” In International Geoscience and Remote Sensing Symposium (IGARSS),
1906–1909.
Loshchilov, I., and F. Hutter. 2018. “Fixing Weight Decay Regularization in Adam.” arXiv Preprint ArXiv:1711.05101.
Naegeli, K., M. Huss, and M. Hoelzle. 2019. “Change Detection of Bare-Ice Albedo in the Swiss Alps.” The Cryosphere
13 (1): 397–412.
Papadomanolaki, M., S. Verma, M. Vakalopoulou, S. Gupta, and K. Karantzalos. 2019. “Detecting Urban Changes
with Recurrent Neural Networks from Multitemporal Sentinel-2 Data.” In International Geoscience and Remote
Sensing Symposium (IGARSS), 214–217.
Peng, D., L. Bruzzone, Y. Zhang, H. Guan, and P. He. 2021. “SCDNET: A Novel Convolutional Network for Semantic
Change Detection in High Resolution Optical Remote Sensing Imagery.” International Journal of Applied Earth
Observation and Geoinformation 103: 102465.
Peng, D., Y. Zhang, and H. Guan. 2019. “End-to-End Change Detection for High Resolution Satellite Images Using
Improved UNet++.” Remote Sensing 11 (11): 1382.
Ren, S., D. Zhou, S. He, J. Feng, and X. Wang. 2021. “Shunted Self-Attention via Multi-Scale Token Aggregation.”
arXiv Preprint ArXiv:2111.15193.
Ronneberger, O., P. Fischer, and T. Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image
Segmentation.” In International Conference on Medical Image Computing and Computer-Assisted Intervention
(MICCAI), 234–241.
Shafique, A., G. Cao, Z. Khan, M. Asad, and M. Aslam. 2022. “Deep Learning-Based Change Detection in Remote
Sensing Images: A Review.” Remote Sensing 14 (4): 871.
Shi, Q., M. Liu, S. Li, X. Liu, F. Wang, and L. Zhang. 2021. “A Deeply Supervised Attention Metric-Based Network
and an Open Aerial Image Dataset for Remote Sensing Change Detection.” IEEE Transactions on Geoscience and
Remote Sensing 60: 1–16.
Singh, A. 1989. “Review Article Digital Change Detection Techniques Using Remotely-Sensed Data.” International
Journal of Remote Sensing 10 (6): 989–1003.
Song, A., J. Choi, Y. Han, and Y. Kim. 2018. “Change Detection in Hyperspectral Images Using Recurrent 3D Fully
Convolutional Networks.” Remote Sensing 10 (11): 1827.
Tian, S., A. Ma, Z. Zheng, and Y. Zhong. 2020. “Hi-UCD: A Large-Scale Dataset for Urban Semantic Change
Detection in Remote Sensing Imagery.” arXiv Preprint ArXiv:2011.03247.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł Kaiser, and I. Polosukhin. 2017. “Attention
Is All You Need.” In Conference and Workshop on Neural Information Processing Systems (NIPS), 30.
Wang, G., B. Li, T. Zhang, and S. Zhang. 2022. “A Network Combining a Transformer and a Convolutional Neural
Network for Remote Sensing Image Change Detection.” Remote Sensing 14 (9): 2228.
Wang, W., E. Xie, X. Li, D. P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao. 2021. “Pyramid Vision Transformer:
A Versatile Backbone for Dense Prediction Without Convolutions.” In International Conference on Computer
Vision (ICCV), 568–578.
Wang, Q., Z. Yuan, Q. Du, and X. Li. 2018. “GETNET: A General End-to-End 2-D CNN Framework for
Hyperspectral Image Change Detection.” IEEE Transactions on Geoscience and Remote Sensing 57 (1): 3–13.
Wang, D., J. Zhang, B. Du, G. S. Xia, and D. Tao. 2022. “An Empirical Study of Remote Sensing Pretraining.” IEEE
Transactions on Geoscience and Remote Sensing, 1.
Yang, Y., H. Gu, Y. Han, and H. Li. 2020. “An End-to-End Deep Learning Change Detection Framework for Remote Sensing Images.” In International Geoscience and Remote Sensing Symposium (IGARSS).
Zhang, M., G. Xu, K. Chen, M. Yan, and X. Sun. 2018. “Triplet-Based Semantic Relation Learning for Aerial Remote
Sensing Image Change Detection.” IEEE Geoscience and Remote Sensing Letters 16 (2): 266–270.
Zhang, C., P. Yue, D. Tapete, L. Jiang, B. Shangguan, L. Huang, and G. Liu. 2020. “A Deeply Supervised Image Fusion
Network for Change Detection in High Resolution Bi-Temporal Remote Sensing Images.” ISPRS Journal of
Photogrammetry and Remote Sensing 166: 183–200.
Zhou, B., H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. 2017. “Scene Parsing Through ade20k Dataset.” In
Conference on Computer Vision and Pattern Recognition (CVPR), 633–641.