Local Feature Matching Using Deep Learning - A Survey
Shibiao Xu (a), Shunpeng Chen (a), Rongtao Xu (b), Changwei Wang (b), Peng Lu (a), Li Guo (a)
(a) School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
(b) The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, China
Abstract
Local feature matching enjoys wide-ranging applications in the realm of computer vision, encompassing domains such as image retrieval, 3D reconstruction, and object recognition. However, challenges persist in improving the accuracy and robustness of matching due to factors like viewpoint and lighting variations. In recent years, the introduction of deep learning models has sparked widespread exploration into local feature matching techniques. The objective of this endeavor is to furnish a comprehensive overview of local feature matching methods. These methods are categorized into two key segments based on the presence of detectors. The Detector-based category encompasses models inclusive of Detect-then-Describe, Joint Detection and Description, Describe-then-Detect, as well as Graph Based techniques. In contrast, the Detector-free category comprises CNN Based, Transformer Based, and Patch Based methods. Our study extends beyond methodological analysis, incorporating evaluations of prevalent datasets and metrics to facilitate a quantitative comparison of state-of-the-art techniques. The paper also explores the practical application of local feature matching in diverse domains such as Structure from Motion, Remote Sensing Image Registration, and Medical Image Registration, underscoring its versatility and significance across various fields. Ultimately, we endeavor to outline the current challenges faced in this domain and furnish future research directions, thereby serving as a reference for researchers involved in local feature matching and its interconnected domains. A comprehensive list of studies in this survey is available at https://github.com/vignywang/Awesome-Local-Feature-Matching.
Keywords: Local Feature Matching, Image Matching, Deep Learning, Survey.
Figure 1: Matching results for outdoor images. It can be observed that for images with significant variations in viewpoint and lighting conditions, the matching task
encounters considerable challenges.
The extant methods of image matching can be sorted into two major categories: Detector-based and Detector-free methods. Detector-based methods hinge on the detection and description of sparsely distributed keypoints in order to establish matches between images. The efficacy of these methods is largely dependent on the performance of keypoint detectors and feature descriptors, given their significant role in the process. Contrastingly, Detector-free methods sidestep the necessity for separate keypoint detection and feature description stages by tapping into the rich contextual information prevalent within the images. These methods enable end-to-end image matching, thereby offering a distinct mechanism to tackle the task.

Image matching plays a pivotal role in the domain of image registration, where it contributes significantly by enabling the precise fitting of transformation functions through a reliable set of feature matches. This functionality positions image matching as a crucial area of study within the broader context of image fusion [43]. To coherently encapsulate the evolution of the local feature matching domain and stimulate innovative research avenues, this paper presents an exhaustive review and thorough analysis of the latest progress in local feature matching, particularly emphasizing the use of deep learning algorithms. In addition, we re-examine pertinent datasets and evaluation criteria, and conduct detailed comparative analyses of key methodologies. Our investigation addresses both the gap and potential bridging between traditional manual methods and modern deep learning technologies. We emphasize the ongoing relevance and collaboration between these two approaches by analyzing the latest developments in traditional manual methods alongside deep learning techniques. Further, we address the emerging focus on multi-modal images. This includes a detailed overview of methods specifically tailored for multi-modal image analysis. Our survey also identifies and discusses the gaps and future needs in existing datasets for evaluating local feature matching methods, highlighting the importance of adapting to diverse and dynamic scenarios. In keeping with current trends, we examine the role of large foundation models in feature matching. These models represent a significant shift from traditional semantic segmentation models [44, 45, 46, 47, 48], offering superior generalization capabilities for a wide array of scenes and objects.

In summary, some of the key contributions of this survey can be summarized as follows:

• This survey extensively covers the literature on contemporary local feature matching problems and provides a detailed overview of various local feature matching algorithms proposed since 2018. Following the prevalent image matching pipeline, we primarily categorize these methods into two major classes: Detector-based and Detector-free, and provide a comprehensive review of matching algorithms employing deep learning.

• We scrutinize the deployment of these methodologies in a myriad of real-world scenarios, encompassing SfM, Remote Sensing Image Registration, and Medical Image Registration. This investigation highlights the versatility and extensive applicability inherent in local feature matching techniques.

• We start from relevant computer vision tasks, review the major datasets involved in local feature matching, and classify them according to different tasks to delve into specific research requirements within each domain.

• We analyze various metrics used for performance evaluation and conduct a quantitative comparison of key local feature matching methods.

• We present a series of challenges and future research directions, offering valuable guidance for further advancements in this field.

It is important to note that the initial surveys [49, 50, 51] primarily focused on manual methods, hence they do not provide sufficient reference points for research centered around deep learning. Although recent surveys [52, 53, 54] have incorporated trainable methods, they have failed to timely summarize the plethora of literature that has emerged in the past five years. Furthermore, many are limited to specific aspects of image matching within the field, such as some articles introducing only the feature detection and description methods of local features, but not including matching [52], some particularly focusing on matching of cultural heritage images [55], and others solely concentrating on medical image registration [56, 57, 58], remote sensing image registration [59, 60], and so on. In this survey, our goal is to provide the most recent and comprehensive overview by assessing the existing methods of image matching, particularly the state-of-the-art learning-based approaches. Importantly, we not only discuss the existing methods that serve natural image applications, but also the wide application of feature matching in SfM, remote sensing images, and medical images. We illustrate the close connection of this research with the field of information fusion through a detailed discussion on the matching of multimodal images. Additionally, we have conducted a thorough examination and analysis of the recent mainstream methods, discussions that are evidently missing in the existing literature. Figure 2 showcases a representative timeline of local feature matching methodologies, which provides insights on the evolution of these methods and their pivotal contributions towards spearheading advancements in the field.

Figure 2: Representative local feature matching methods. Blue and gray represent Detector-based Models, where gray represents the Graph Based method. The yellow and green blocks represent the CNN Based and Transformer Based methods in Detector-free Models, respectively. In 2018, SuperPoint [61] pioneered the computation of keypoints and descriptors within a single network. Subsequently, numerous works such as D2Net [62], R2D2 [63], and others attempted to integrate keypoint detection and description for matching purposes. Concurrently, the NCNet [64] method introduced four-dimensional cost volumes into local feature matching, initiating a trend in utilizing correlation-based or cost volume-based convolutional neural networks for Detector-free matching research. Building upon this trend, methods like Sparse-NCNet [65], DRC-Net [66], GLU-Net [67], and PDC-Net [68] emerged. In 2020, SuperGlue [69] framed the task as a graph matching problem involving two sets of features. Following this, SGMNet [70] and ClusterGNN [71] focused on improving the graph matching process by addressing the complexity of matching. In 2021, approaches such as LoFTR [72] and Aspanformer [73] successfully incorporated Transformer or Attention mechanisms into the Detector-free matching process. They achieved this by employing interleaved self and cross-attention modules, significantly expanding the receptive field and further advancing deep learning-based matching techniques.

2. Detector-based Models

Detector-based methodologies have been the prevailing approach for local feature matching for a considerable duration. Numerous well-established handcrafted works, including SIFT [33] and ORB [35], have been broadly adopted for varied tasks within the field of 3D computer vision [74, 75]. These traditional Detector-based methodologies typically comprise three primary stages: feature detection, feature description, and feature matching. Initially, a set of sparse key points is extracted from the images. Subsequently, in the feature description stage, these key points are characterized using high-dimensional vectors, often designed to encapsulate the specific structure and information of the region surrounding these points. Lastly, during the feature matching stage, correspondences at a pixel level are established through mechanisms like nearest-neighbor searches or more complex matching algorithms, typically by comparing the high-dimensional vectors of keypoints between different images and identifying matches based on the level of similarity, often defined by a distance function in the vector space. Notable among the matching algorithms are GMS (Grid-based Motion Statistics) by Bian et al. [76] and OANET (Order-Aware Network) by Zhang et al. [77]. GMS enhances feature correspondence quality using grid-based motion statistics, simplifying and accelerating matching, while OANET innovatively optimizes two-view matching by integrating spatial contexts for precise correspondence and geometry estimation.
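The nearest-neighbor matching step described above can be made concrete with a few lines of NumPy. The sketch below pairs two sets of L2-normalized descriptors by mutual nearest-neighbor search with a Lowe-style ratio test; the function name and the 0.8 ratio are illustrative choices, not part of any specific method surveyed here.

    import numpy as np

    def mutual_nn_match(desc_a, desc_b, ratio=0.8):
        """Match two sets of L2-normalized descriptors of shape (N, D) and (M, D)."""
        # Pairwise Euclidean distances between all descriptor pairs.
        dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
        nn_ab = dists.argmin(axis=1)      # best neighbour in B for each descriptor in A
        nn_ba = dists.argmin(axis=0)      # best neighbour in A for each descriptor in B
        matches = []
        for i, j in enumerate(nn_ab):
            if nn_ba[j] != i:             # keep only mutual nearest neighbours
                continue
            row = np.sort(dists[i])
            # Ratio test: the best distance must be clearly smaller than the second best.
            if row[0] < ratio * row[1]:
                matches.append((i, int(j)))
        return matches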
However, in the era of deep learning, the rise of data-driven methods has made approaches like LIFT [78] popular. These methods leverage CNNs to extract more robust and discriminative keypoint descriptors, resulting in significant progress in handling large viewpoint changes and local feature illumination variations. Currently, Detector-based methods can be categorized into four main classes: 1. Detect-then-Describe methods; 2. Joint Detection and Description methods; 3. Describe-then-Detect methods; 4. Graph Based methods. Additionally, we further subdivide Detect-then-Describe methods based on the type of supervised learning into Fully-Supervised methods, Weakly Supervised methods, and Other forms of Supervision methods. This classification is visually depicted in Figure 3.

Figure 3: Overview of the Local Feature Matching Models and taxonomy of the most relevant approaches.

2.1. Detect-then-Describe

In feature matching methodologies, the adoption of sparse-to-sparse feature matching is rather commonplace. These methods adhere to a 'detect-then-describe' paradigm, where the primary step involves the detection of keypoint locations. The detector subsequently extracts feature descriptors from patches centered on each detected keypoint. These descriptors are then relayed to the feature description stage. This procedure is typically trained utilizing metric learning methods, which aim to learn a distance function where similar points are close and dissimilar points are distant in the feature space. To enhance efficiency, feature detectors often focus on small image regions [78], generally emphasizing low-level structures such as corners [26] or blobs [33]. The descriptors, on the other hand, aim to capture more nuanced, higher-level information within larger patches encompassing the keypoints. Providing rich and distinctive details, these descriptors serve as the defining features for matching purposes. Figure 4(a) illustrates the common structure of the Detect-then-Describe pipeline.

Figure 4: The comparison of various prominent Detector-based pipelines for trainable local feature matching is presented. Here, the categorization is based on the relationship between the detection and description steps: (a) Detect-then-Describe framework, (b) Joint Detection and Description framework, and (c) Describe-then-Detect framework.
2.1.1. Fully-Supervised

The field of local feature matching has undergone a remarkable transformation, primarily driven by the advent of annotated patch datasets [79] and the integration of deep learning technologies. This transformation marks a departure from traditional handcrafted methods to more data-driven methodologies, reshaping the landscape of feature matching. This section aims to trace the historical development of these changes, emphasizing the sequential progression and the interconnected nature of various fully supervised methods. At the forefront of this evolution are CNNs, which have been pivotal in revolutionizing the process of descriptor learning. By enabling end-to-end learning directly from raw local patches, CNNs have facilitated the construction of a hierarchy of local features. This capability has allowed CNNs to capture complex patterns in data, leading to the creation of more specialized and distinct descriptors that significantly enhance the matching process. This revolutionary shift was largely influenced by innovative models like L2Net [80], which pioneered a progressive sampling strategy. L2Net's approach accentuated the relative distances between descriptors while applying additional supervision to intermediate feature maps. This strategy significantly contributed to the development of robust descriptors, setting a new standard in descriptor learning.

The shift towards these data-driven methodologies, underpinned by CNNs, has not only improved the accuracy and efficiency of local feature matching but has also opened new avenues for research and innovation in this area. As we explore the chronological advancements in this field, we observe a clear trajectory of growth and refinement, moving from the conventional to the contemporary, each method building upon the successes of its predecessors while introducing novel concepts and techniques. OriNet [81] presents a method using CNNs to assign a canonical orientation to feature points in an image, enhancing feature point matching. The authors introduce a Siamese network [82] training approach that eliminates the need for predefined orientations and propose a novel GHH activation function, showing significant performance improvements in feature descriptors across multiple datasets. Building on the architectural principles of L2Net, HardNet [83] streamlined the learning process by focusing on metric learning and eliminating the need for auxiliary loss terms, setting a precedent for subsequent models to simplify learning objectives. DOAP [84] shifted the focus to a learning-to-rank formulation, optimizing local feature descriptors for nearest-neighbor matching, a methodology that found success in specific matching scenarios and influenced later models to consider ranking-based approaches. The KSP [85] method is notable for its introduction of a subspace pooling methodology, leveraging CNNs to learn invariant and discriminative descriptors. DeepBit [86] offers an unsupervised deep learning framework to learn compact binary descriptors. It encodes crucial properties like rotation, translation, and scale invariance of local descriptors into binary representations. Bingan [87] proposes a method to learn compact binary image descriptors using regularized Generative Adversarial Networks (GANs). GLAD [88] addresses the Person Re-Identification task by considering both local and global cues from human bodies. A four-stream CNN framework is implemented to generate discriminative and robust descriptors. Geodesc [89] advances descriptor computation by integrating geometric constraints from SfM algorithms. This approach emphasizes two aspects: first, the construction of training data using geometric information to measure sample hardness, where hardness is defined by the variability between pixel blocks of the same 3D point and uniformity for different points. Second, a geometric similarity loss function is devised, promoting closeness among pixel blocks corresponding to the same 3D point. These innovations enable Geodesc to significantly enhance descriptor effectiveness in 3D reconstruction tasks. For GIFT [90] and COLD [91], the former underscores the importance of incorporating underlying structural information from group features to construct potent descriptors. Through the utilization of group convolutions, GIFT generates dense descriptors that exhibit both distinctiveness and invariance to transformation groups. In contrast, COLD introduces a novel approach through a multi-level feature distillation network architecture. This architecture leverages intermediate layers of ImageNet pre-trained convolutional neural networks to encapsulate hierarchical features, ultimately extracting highly compact and robust local descriptors.

Advancing the narrative, our exploration extends to recent strides in fully-supervised methodologies, constituting a noteworthy augmentation of the repertoire of local feature matching capabilities. These pioneering approaches, building upon the foundational frameworks expounded earlier, synergistically elevate and finesse the methodologies that underpin the field. Continuing the trend of enhancing descriptor robustness, SOSNet [92] extends HardNet by introducing a second-order similarity regularization term for descriptor learning. This enhancement involves integrating second-order similarity constraints into the training process, thereby augmenting the performance of learning robust descriptors. The term "second-order similarity" denotes a metric that evaluates the consistency of relative distances among descriptor pairs in a training batch. It measures the similarity between a descriptor pair not only directly but also by considering their relative distances to other descriptor pairs within the same batch.
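As a rough illustration of this idea, using our own notation rather than the exact SOSNet formulation: for a batch of matching descriptor pairs (x_i, x_i^+), one may penalize inconsistencies between the intra-batch distance structures of the two views,

    d_{ij} = \lVert x_i - x_j \rVert_2, \qquad
    d^{+}_{ij} = \lVert x^{+}_i - x^{+}_j \rVert_2, \qquad
    \mathcal{R}_{\mathrm{SOS}} = \frac{1}{n} \sum_{i=1}^{n}
        \sqrt{\sum_{j \neq i} \bigl( d_{ij} - d^{+}_{ij} \bigr)^{2}},

and add this regularizer to a first-order (e.g., triplet margin) matching loss, so that descriptors keep consistent relative distances across the two views.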
Ebel et al. [93] propose a local feature descriptor based on a log-polar sampling scheme to achieve scale invariance. This unique approach allows for keypoint matching across different scales and exhibits less sensitivity to occlusion and background motion. Thus, it effectively utilizes a larger image region to improve performance. To design a better loss function, HyNet [94] introduces a mixed similarity measure for triplet margin loss and implements a regularization term to constrain descriptor norms, thus establishing a balanced and effective learning framework. CNDesc [95] also investigates L2 normalization, presenting an innovative dense local descriptor learning approach. It uses a special cross-normalization technique instead of L2 normalization, introducing a new way of normalizing the feature vectors. Key.Net [96] proposes a keypoint detector that combines handcrafted and learned CNN features, and uses scale space representation in the network to extract keypoints at different levels. To address the non-differentiability issue in keypoint detection methods, ALIKE [97] offers a differentiable keypoint detection (DKD) module based on the score map. In contrast to methods relying on non-maximum suppression (NMS), DKD can backpropagate gradients and produce keypoints at subpixel levels. This enables the direct optimization of keypoint locations. S-TREK [98] introduces an advanced local feature extractor that combines a translation and rotation equivariant keypoint detector with a lightweight descriptor extractor. Trained through a reinforcement learning-inspired framework to optimize keypoint repeatability, S-TREK achieves remarkable performance in repeatability and pose recovery across multiple benchmarks, especially excelling in scenarios with in-plane rotations. ZippyPoint [99], designed based on KP2D [100], introduces an entire set of accelerated extraction and matching techniques. This method suggests the use of a binary descriptor normalization layer, thereby enabling the generation of unique, length-invariant binary descriptors.

Implementing contextual information into feature descriptors has been a rising trend in the advancement of local feature matching methods. ContextDesc [101] introduces context awareness to improve off-the-shelf local feature descriptors. It encodes both geometric and visual contexts by using keypoint locations, raw local features, and high-level regional features as inputs. A novel aspect of its training process is the use of an N-pair loss, which is self-adaptive and requires no parameter tuning. This dynamic loss function can allow for a more efficient learning process. MTLDesc [102] offers a strategy to address the inherent locality issue confronted in the domain of convolutional neural networks. This is attained by introducing an adaptive global context enhancement module and multiple local context enhancement modules to inject non-local contextual information. By adding these non-local connections, it can efficiently learn high-level dependencies between distant features. Building upon MTLDesc, AWDesc [103] seeks to transfer knowledge from a larger, more complex model (teacher) to a smaller and simpler one (student). This approach leverages the knowledge learned by the teacher, while enabling significantly faster computations with the student, allowing the model to achieve an optimal balance between accuracy and speed. The focus on context-awareness in these methods emphasizes the importance of considering more global information when describing local features. Each method leverages this information in a slightly different way, leading to diverse but potentially complementary approaches for tackling the challenge of feature matching.

In light of the limitations inherent in traditional image feature descriptors (like gradients, grayscale, etc.), which struggle to handle the geometric and radiometric disparities across different modal image types [104], there is an emerging focus on frequency-domain-based feature descriptors. These descriptors exhibit improved proficiency in matching cross-modal images. For instance, RIFT [105] utilizes FAST [106] for extracting repeatable feature points on the phase congruency (PC) map, subsequently constructing robust descriptors using frequency domain information to tackle the challenges in multimodal image feature matching. Building on RIFT, SRIFT [107] further refines this approach by establishing a nonlinear diffusion scale (NDS) space, thus constructing a multiscale space that not only achieves scale and rotation invariance but also addresses the issue of slow inference speeds associated with RIFT. With the evolution of deep learning technologies, deep-learning-based methods have demonstrated significant prowess in feature extraction. SemLA [108] uses semantic guidance in its registration and fusion processes. The feature matching is limited to the semantic sensing area, so as to provide the most accurate registration effect for image fusion tasks.

2.1.2. Weakly Supervised and Others

Weakly supervised learning presents opportunities for models to learn robust features without requiring densely annotated labels, offering a solution to one of the largest challenges in training deep learning models. Several weakly supervised local feature learning methods have emerged, leveraging easily obtainable geometric information from camera poses. AffNet [109] represents a key advancement in weakly supervised local feature learning, focusing on the learning of the affine shape of local features. This method challenges the traditional emphasis on geometric repeatability, showing that it is insufficient for reliable feature matching and stressing the importance of descriptor-based learning. AffNet introduces a hard negative-constant loss function to improve the matchability and geometric accuracy of affine regions. This has proven effective in enhancing the performance of affine-covariant detectors, especially in wide baseline matching and image retrieval. The approach underscores the need to consider both descriptor matchability and repeatability for developing more effective local feature detectors. GLAMpoints [110] presents a semi-supervised keypoint detection method, creatively drawing insights from reinforcement learning loss formulations. Here, rewards are used to calculate the significance of detecting keypoints based on the quality of the final alignment. This method has been noted to significantly impact the matching and registration quality of the final images. CAPS [111] introduces a weakly supervised learning framework that utilizes the relative camera poses between image pairs to learn feature descriptors. By employing epipolar geometric constraints as supervision signals, they designed differentiable matching layers and a coarse-to-fine architecture, resulting in the generation of dense descriptors.
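To make this kind of pose-only supervision concrete: for a calibrated pair with relative pose (R, t), the essential matrix E = [t]_x R assigns an epipolar line to every point, and the distance of a putative correspondence to that line can be penalized without any pixel-level ground truth. The snippet below computes this distance; it is a generic illustration of the epipolar constraint, not the exact loss used by CAPS.

    import numpy as np

    def essential_matrix(R, t):
        """E = [t]_x R for a calibrated camera pair (R: 3x3 rotation, t: 3-vector)."""
        tx = np.array([[0, -t[2], t[1]],
                       [t[2], 0, -t[0]],
                       [-t[1], t[0], 0]])
        return tx @ R

    def epipolar_distances(E, x1, x2):
        """Point-to-epipolar-line distances for correspondences in normalized coordinates.

        x1, x2: (N, 2) matched points in image 1 and image 2.
        Returns, for each pair, the distance of x2 to the epipolar line of x1 in image 2.
        """
        x1_h = np.hstack([x1, np.ones((len(x1), 1))])   # homogeneous coordinates
        x2_h = np.hstack([x2, np.ones((len(x2), 1))])
        lines = x1_h @ E.T                              # epipolar lines l' = E x1 in image 2
        num = np.abs(np.sum(x2_h * lines, axis=1))      # |x2^T E x1|
        den = np.linalg.norm(lines[:, :2], axis=1)      # line normalization
        return num / den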
DISK [112] maximizes the potential of reinforcement learning to integrate weakly supervised learning into an end-to-end Detector-based pipeline using policy gradients. This integrative approach of weak supervision with reinforcement learning can provide more robust learning signals and achieve effective optimization. [113] proposes a group alignment approach that leverages the power of group-equivariant CNNs. These CNNs are efficient in extracting discriminative rotation-invariant local descriptors. The authors use a self-supervised loss for better orientation estimation and efficient local descriptor extraction. Weakly and semi-supervised methods using camera pose supervision and other techniques provide useful strategies to tackle the challenges of training robust local feature methods and may pave the way for more efficient and scalable learning methods in this domain.

2.2. Joint Detection and Description

Sparse local feature matching has indeed proved very effective under a variety of imaging conditions. Yet, under extreme variations like day-night changes [114], different seasons [115], or weak-textured scenes [116], the performance of these features can deteriorate significantly. The limitations may stem from the nature of keypoint detectors and local descriptors. Detecting keypoints often involves focusing on small regions of the image and might rely heavily on low-level information, such as pixel intensities. This procedure makes keypoint detectors more susceptible to variations in low-level image statistics, which are often affected by changes in lighting, weather, and other environmental factors. Moreover, when trying to individually learn or train keypoint detectors or feature descriptors, even after carefully optimizing the individual components, integrating them into a feature matching pipeline could still lead to information loss or inconsistencies. This is due to the fact that the optimization of individual components might not fully consider the dependencies and information sharing between the components. To tackle these issues, the approach of Joint Detection and Description has been proposed. In this approach, the tasks of keypoint detection and description are integrated and learned simultaneously within a single model. This can enable the model to fuse information from both tasks during optimization, better adapting to specific tasks and data, and allowing deeper feature mappings through CNNs. Such a unified approach can benefit the task by allowing the detection and description process to be influenced by higher-level information, such as structural or shape-related features of the image. Additionally, dense descriptors involve richer image context, which generally leads to better performance. Figure 4(b) illustrates the common structure of the Joint Detection and Description pipeline.

Image-based descriptor methods, which take the entire image as input and utilize fully convolutional neural networks [117] to generate dense descriptors, have seen substantial progress in recent years. These methods often amalgamate the processes of detection and description, leading to improved performance in both tasks. SuperPoint [61] employs a self-supervised approach to simultaneously determine keypoint locations at the pixel level and their descriptors. Initially, the model undergoes training on synthetic shapes and images through the application of random homographies. A crucial aspect of the method lies in its self-annotation process with real images. This process involves adapting homographies to enhance the model's relevance to real-world images, and the MS-COCO dataset is employed for additional training. Ground truth key points for these images are generated through various homographic transformations, and key-point extraction is performed using the MagicPoint model. This strategy, which involves aggregating multiple key-point heatmaps, ensures precise determination of key-point locations on real images.
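A minimal sketch of this homographic-adaptation style of self-labelling is given below: detections from many randomly warped copies of an image are warped back and averaged into a single heatmap whose local maxima serve as pseudo ground-truth keypoints. The base detector detect_heatmap, the number of homographies, and the corner-jitter magnitude are placeholders; SuperPoint's actual procedure differs in its details.

    import numpy as np
    import cv2

    def homographic_adaptation(image, detect_heatmap, num_homographies=32):
        """Aggregate keypoint heatmaps over random homographies (self-labelling sketch).

        detect_heatmap: any base detector mapping an HxW image to an HxW keypoint
        probability map (e.g. a MagicPoint-like network); it is a placeholder here.
        """
        h, w = image.shape[:2]
        accum = detect_heatmap(image).astype(np.float32)
        count = np.ones((h, w), np.float32)
        corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
        for _ in range(num_homographies):
            # A random perturbation of the image corners defines a homography H.
            jitter = (np.random.uniform(-0.15, 0.15, (4, 2)) * [w, h]).astype(np.float32)
            H = cv2.getPerspectiveTransform(corners, corners + jitter)
            warped = cv2.warpPerspective(image, H, (w, h))
            heat = detect_heatmap(warped).astype(np.float32)
            # Warp detections back into the original frame and accumulate them.
            accum += cv2.warpPerspective(heat, np.linalg.inv(H), (w, h))
            count += cv2.warpPerspective(np.ones_like(heat), np.linalg.inv(H), (w, h))
        return accum / count   # averaged heatmap; keypoints = local maxima above a threshold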
Inspired by Q-learning, LF-Net [118] predicts geometric relationships, such as relative depth and camera poses, between matched image pairs using an existing SfM model. It employs asymmetric gradient backpropagation to train a network for detecting image pairs without needing manual annotation. Building upon LF-Net, RF-Net [119] introduces a receptive field-based keypoint detector and designs a general loss function term, referred to as the 'neighbor mask', which facilitates training of patch selection. Reinforced SP [120] employs principles of reinforcement learning to handle the discreteness in keypoint selection and descriptor matching. It integrates a feature detector into a complete visual pipeline and trains learnable parameters in an end-to-end manner. R2D2 [63] combines grid peak detection with reliability prediction for descriptors using a dense version of the L2-Net architecture, aiming to produce sparse, repeatable, and reliable keypoints. D2Net [62] adopts a joint detect-and-describe approach for sparse feature extraction. Unlike SuperPoint, it shares all parameters between the detection and description process and uses a joint formulation that optimizes both tasks simultaneously. Keypoints in their method are defined as local maxima within and across channels of the deep feature maps. These techniques elegantly illustrate how the integration of detection and description tasks in a unified model leads to more efficient learning and superior performance for local feature extraction under different imaging conditions.

A dual-headed D2Net model with a correspondence ensemble is presented by RoRD [121] to address extreme viewpoint changes, combining vanilla and rotation-robust feature correspondences. HDD-Net [122] designs an interactively learnable detector and descriptor fusion network, handling detector and descriptor components independently and focusing on their interactions during the learning process. MLIFeat [123] devises two lightweight modules used for keypoint detection and descriptor generation, with multi-level information fusion utilized to jointly detect keypoints and extract descriptors. LLF [124] proposes utilizing low-level features to supervise keypoint detection. It extends a single CNN layer from the descriptor backbone as a detector and co-learns it with the descriptor to maximize descriptor matching. FeatureBooster [125] introduces a descriptor enhancement stage into traditional feature matching pipelines. It establishes a generic lightweight descriptor enhancement framework that takes original descriptors and geometric attributes of keypoints as inputs. The framework employs self-enhancement based on MLP and cross-enhancement based on transformers [126] to enhance descriptors. ASLFeat [127] improves on D2Net using channel and spatial peaks on multi-level feature maps. It introduces a precise detector and invariant descriptor as well as multi-level connections and deformable convolution networks. The dense prediction framework employs deformable convolution networks (DCN) to alleviate limitations caused by keypoint extraction from low-resolution feature maps. SeLF [128] builds on the ASLFeat architecture to leverage semantic information from pre-trained semantic segmentation networks used to learn semantically aware feature mappings. It combines learned correspondence-aware feature descriptors with semantic features, therefore enhancing the robustness of local feature matching for long-term localization. Lastly, SFD2 [129] proposes the extraction of reliable features from global regions (e.g., buildings, traffic lanes) with the suppression of unreliable areas (e.g., sky, cars) by implicitly embedding high-level semantics into the detection and description processes. This enables the model to extract globally reliable features end-to-end from a single network.

2.3. Describe-then-Detect

One common approach to local feature extraction is the Describe-then-Detect pipeline, entailing the description of local image regions first using feature descriptors followed by the detection of keypoints based on these descriptors. Figure 4(c) serves as an illustration of the standard structure of the Describe-then-Detect pipeline.

D2D [130] presents a novel framework for keypoint detection called Describe-to-Detect (D2D), highlighting the wealth of information inherent in the feature description phase. This framework involves the generation of a voluminous collection of dense feature descriptors followed by the selection of keypoints from this dataset. D2D introduces relative and absolute saliency measurements of local deep feature maps to define keypoints. Due to the challenges arising from weak supervision's inability to differentiate losses between the detection and description stages, PoSFeat [131] presents a decoupled training approach in the describe-then-detect pipeline specifically designed for weakly supervised local feature learning. This pipeline separates the description network from the detection network, leveraging camera pose information for descriptor learning that enhances performance. Through a novel search strategy, the descriptor learning process more adeptly utilizes camera pose information. ReDFeat [132] uses a mutual weighting strategy to combine multimodal feature learning's detection and description aspects. SCFeat [133] proposes a shared coupling bridge strategy for weakly supervised local feature learning. Through shared coupling bridges and cross-normalization layers, the framework ensures the individual, optimal training of description networks and detection networks. This segregation enhances the robustness and overall performance of descriptors.

2.4. Graph Based

In conventional feature matching pipelines, correspondence relationships are established via nearest neighbor (NN) search of feature descriptors, and outliers are eliminated based on matching scores or mutual NN verification. In recent times, attention-based graph neural networks (GNNs) [134] have emerged as effective means to obtain local feature matching. These approaches create GNNs with keypoints as nodes and utilize self-attention layers and cross-attention layers from Transformers to exchange global visual and geometric information among nodes. This exchange overcomes the challenges posed by localized feature descriptors alone. The ultimate outcome is the generation of matches based on the soft assignment matrix. Figure 5 provides a comprehensive depiction of the fundamental architecture of Graph-Based matching.

Figure 5: General GNN Matching Model Architecture. Firstly, keypoint positions p_i along with their visual descriptors d_i are mapped into individual vectors. Subsequently, self-attention and cross-attention layers are applied alternately, L times, within a graph neural network to create enhanced matching descriptors. Finally, the Sinkhorn Algorithm is utilized to determine the optimal partial assignment.
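The optimal partial assignment named in the caption can be sketched as follows: descriptor-similarity scores are augmented with a "dustbin" for unmatched keypoints and then alternately row- and column-normalized in log space (Sinkhorn iterations). This is an illustrative re-implementation of the general idea rather than SuperGlue's reference code; the constant dustbin score and the marginals are simplifying assumptions.

    import numpy as np
    from scipy.special import logsumexp

    def sinkhorn_assignment(scores, dustbin=0.5, iters=50):
        """Soft partial assignment from an (N, M) matching-score matrix."""
        n, m = scores.shape
        S = np.full((n + 1, m + 1), dustbin, dtype=np.float64)
        S[:n, :m] = scores
        # Each real keypoint carries unit mass; the dustbins absorb the remainder.
        log_mu = np.log(np.concatenate([np.ones(n), [m]]))
        log_nu = np.log(np.concatenate([np.ones(m), [n]]))
        u = np.zeros(n + 1)
        v = np.zeros(m + 1)
        for _ in range(iters):
            # Alternate row and column normalization in log space for stability.
            u = log_mu - logsumexp(S + v[None, :], axis=1)
            v = log_nu - logsumexp(S + u[:, None], axis=0)
        P = np.exp(S + u[:, None] + v[None, :])
        return P[:n, :m]   # soft assignment between the two real keypoint sets

Mutual row/column maxima of the returned matrix that exceed a confidence threshold are then kept as the final matches.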
SuperGlue [69] adopts attention graph neural networks and optimal transport methods to address partial assignment problems. It processes two sets of interest points and their descriptors as inputs and leverages self and cross-attention to exchange messages between the two sets of descriptors. The complexity of this method grows quadratically with the number of keypoints, which prompted further exploration in subsequent works. SGMNet [70] builds on SuperGlue and adds a Seeding Module that processes only a subset of matching points as seeds. The fully connected graph is relinquished for a sparse connection graph. A seed graph neural network is then designed with an attention mechanism to aggregate information. Keypoints usually exhibit strong correlations with just a few points, resulting in a sparsely connected adjacency matrix for most keypoints. Therefore, ClusterGNN [71] makes use of graph node clustering algorithms to partition nodes in a graph into multiple clusters. This strategy applies attention GNN layers with clustering to learn feature matching between two sets of keypoints and their related descriptors, thus training the subgraphs to reduce redundant information propagation. MaKeGNN [135] introduces bilateral context-aware sampling and keypoint-assisted context aggregation in a sparse attention GNN architecture.

Inspired by SuperGlue, GlueStick [136] incorporates point and line descriptors into a joint framework for joint matching, leveraging point-to-point relationships to link lines from matched images. LightGlue [137], in an effort to make SuperGlue adaptive in computational complexity, proposes the dynamic alteration of the network's depth and width based on the matching difficulty between each image pair. It devises a lightweight confidence classifier to forecast and hone state assignments. DenseGAP [138] devises a graph structure utilizing anchor points as sparse, yet reliable priors for inter-image and intra-image contexts. It propagates this information to all image points through directed edges. HTMatch [139] and Paraformer [140] study the application of attention for interactive mixing and explore architectures that strike a balance between efficiency and effectiveness. ResMatch [141] presents the idea of residual attention learning for feature matching, re-articulating self-attention and cross-attention as learned residual functions of relative positional reference and descriptor similarity. It aims to bridge the divide between interpretable matching and filtering pipelines and attention-based feature matching networks that inherently possess uncertainty via empirical means.

3. Detector-free Models

While the feature detection stage enables a reduction in the search space for matching, handling extreme circumstances, such as image pairs involving substantial viewpoint changes and textureless regions, proves to be difficult when using detection-based approaches, notwithstanding perfect descriptors and matching methodologies [142]. Detector-free methods, on the other hand, eliminate feature detectors and directly extract visual descriptors on a dense grid spread across the images to produce dense matches. Thus, compared to Detector-based methods, these techniques can capture keypoints that are repeatable across image pairs.

3.1. CNN Based

In the early stages, detection-free matching methodologies often relied on CNNs that used correlation or cost volumes to identify potential neighborhood consistencies [142]. Figure 6 illustrates the fundamental architecture of the 4D correspondence volume.

Figure 6: Overview of the 4D correspondence volume. Dense feature maps, denoted as f^A and f^B, are extracted from images I^A and I^B using convolutional neural networks. Each individual feature match, between f^A_ij and f^B_kl, corresponds to the matching (i, j, k, l) coordinates. The 4D correlation tensor c is ultimately formed, which contains scores for all points between a pair of images that could potentially be corresponding points. Subsequently, matching pairs are obtained by analyzing the properties of corresponding points in the four-dimensional space.
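In code, the volume of Figure 6 is simply an inner product between every pair of feature vectors. The sketch below builds it with a single einsum and extracts mutual-argmax candidates; it is an illustrative NumPy version without the learned neighborhood-consensus filtering that methods such as NCNet apply on top.

    import numpy as np

    def correlation_4d(feat_a, feat_b):
        """4D correlation volume c[i, j, k, l] = <f_a[:, i, j], f_b[:, k, l]>.

        feat_a, feat_b: dense feature maps of shape (C, H, W), e.g. L2-normalized
        outputs of a shared CNN backbone applied to images I_A and I_B.
        """
        return np.einsum('cij,ckl->ijkl', feat_a, feat_b)   # shape (H, W, H, W)

    def mutual_argmax_matches(c):
        """Candidate correspondences as mutual maxima of the flattened volume."""
        H, W = c.shape[:2]
        flat = c.reshape(H * W, H * W)
        best_b = flat.argmax(axis=1)   # best location in B for each location in A
        best_a = flat.argmax(axis=0)   # best location in A for each location in B
        matches = [(a, b) for a, b in enumerate(best_b) if best_a[b] == a]
        # each flat index decodes back to pixel coordinates via divmod(index, W)
        return matches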
NCNet [64] analyzes the neighborhood consistency in the four-dimensional space of all possible corresponding points between a pair of images, obtaining matches without requiring a global geometric model. Sparse-NCNet [65] utilizes a 4D convolutional neural network on sparse correlation tensors and utilizes submanifold sparse convolutions to significantly reduce memory consumption and execution time. DualRC-Net [66] introduces an innovative methodology for establishing dense pixel-wise correspondences between image pairs in a coarse-to-fine fashion. Utilizing a dual-resolution strategy with a Feature Pyramid Network (FPN)-like backbone, the approach generates a 4D correlation tensor from coarse-resolution feature maps and refines it through a learnable neighborhood consensus module, thereby augmenting matching reliability and localization accuracy. GLU-Net [67] introduces a global-local universal network applicable to estimating dense correspondences for geometric matching, semantic matching, and optical flow. It trains the network in a self-supervised manner. GOCor [143] presents a fully differentiable dense matching module that predicts globally optimized matching confidences between two deep feature maps and can be integrated into state-of-the-art networks to replace feature correlation layers directly. DCFlow [144] enhances optical flow estimation by utilizing the four-dimensional cost volume, drawing on methods used in stereo matching. By applying a learned feature embedding and adapting semi-global matching to four dimensions, DCFlow addresses the computational hurdles traditionally linked to this extensive approach. Its efficiency in constructing and processing the cost volume, combined with maintaining accuracy, marks an improvement in integrating optical flow and stereo estimation techniques. Building on these conceptual advancements, RAFT [145] further refines the approach to dense correspondence estimation. By extracting per-pixel features and constructing multi-scale 4D correlation volumes for all pixel pairs, RAFT introduces a recurrent processing unit that iteratively refines the flow field. This innovative strategy effectively addresses several limitations of previous methods, such as error propagation at coarse resolutions and the neglect of small, fast-moving objects, thereby enhancing the precision and reliability of flow estimation. Following in the footsteps of these foundational methods, PDC-Net [68] proposes a probabilistic deep network that estimates dense image-to-image correspondences and their associated confidence estimates. It introduces an architecture and an improved self-supervised training strategy to achieve robust uncertainty prediction that is generalizable. PDC-Net+ [146] introduces a probabilistic deep network designed to estimate dense image-to-image correspondences and their associated confidence estimates. They employ a constrained mixture model to parameterize the predictive distribution, enhancing the modeling capacity for handling outliers. PUMP [147] combines unsupervised losses with standard self-supervised losses to augment synthetic images. By utilizing a 4D correlation volume, it leverages the non-parametric pyramid structure of DeepMatching [148] to learn unsupervised descriptors. DFM [149] utilizes a pre-trained VGG architecture as a feature extractor, capturing matches without requiring additional training strategies, thus demonstrating the robust power of features extracted from the deepest layers of the VGG network.

3.2. Transformer Based

The CNN's dense feature receptive field may have limitations in handling regions with low texture or discerning between keypoints with similar feature representations. In contrast, humans tend to consider both local and global information when matching in such regions. Given Transformers' success in computer vision tasks such as image classification [150], object detection [151], and semantic segmentation [152, 153, 154, 155, 156], researchers have explored incorporating Transformers' global receptive field and long-range dependencies into local feature matching. Various approaches that integrate Transformers into feature extraction networks for local feature matching have emerged.

Given that the only difference between sparse matching and dense matching is the quantity of points to query, COTR [157] combines the advantages of both approaches. It learns two matching images jointly with self-attention, using some keypoints as queries and recursively refining matches in the other image through a corresponding neural network. This integration combines both matches into one parameter-optimized problem. ECO-TR [158] strives to develop an end-to-end model to accelerate COTR by intelligently connecting multiple transformer blocks and progressively refining predicted coordinates in a coarse-to-fine manner on a shared multi-scale feature extraction network. LoFTR [72] is groundbreaking because it creates a GNN with keypoints as nodes, utilizing self-attention layers and mutual attention layers to obtain feature descriptors for two images and generating dense matches in regions with low texture. To overcome the absence of local attention interaction in LoFTR, Aspanformer [73] proposes an uncertainty-driven scheme based on flow prediction probabilistic modeling that adaptively varies the local attention span to allocate different context sizes for different positions. Contrary to the detect-then-describe strategy of S-TREK [98], which leverages a translation and rotation equivariant keypoint detector paired with a lightweight descriptor extractor, SE2-LoFTR [159] adopts a detector-free paradigm, seamlessly extracting pixel-level correspondences between pairs of images without necessitating the preliminary step of keypoint detection. This model enhances the original LoFTR framework by incorporating a steerable CNN, thereby achieving inherent equivariance to translations and rotations. This modification significantly boosts the model's resilience to rotational variances, showcasing the model's unique contribution to the domain of feature matching through direct image correspondence. SE2-LoFTR's approach exemplifies the versatility and efficiency of detector-free models in handling complex image matching scenarios, particularly those involving significant rotational movements.

To address the challenges posed by the presence of numerous similar points in dense matching approaches and the limitations on the performance of linear transformers themselves, several recent works have proposed novel methodologies. Quadtree [160] introduces quadtree attention to quickly skip calculations in irrelevant regions at finer levels, reducing the computational complexity of visual transformers from quadratic to linear. OETR [161] introduces the Overlap Regression method, which uses a Transformer decoder to estimate the degree of overlap between bounding boxes in an image. It incorporates a symmetric center consistency loss to ensure spatial consistency in the overlapping regions. OETR can be inserted as a preprocessing module into any local feature matching pipeline. MatchFormer [162] devises a hierarchical transformer encoder and a lightweight decoder. In each stage of the hierarchical structure, cross-attention modules and self-attention modules are interleaved to provide an optimal combination path, enhancing multi-scale features. CAT [163] proposes a context-aware network based on the self-attention mechanism, where attention layers can be applied along the spatial dimension for higher efficiency or along the channel dimension for higher accuracy and a reduced storage burden. TopicFM [164] encodes high-level context in images, utilizing a topic modeling approach. This improves matching robustness by focusing on semantically similar regions in images. ASTR [165] introduces an Adaptive Spot-guided Transformer, which includes a point-guided aggregation module to allow most pixels to avoid the influence of irrelevant regions, while using computed depth information to adaptively adjust the size of the grid at the refinement stage. DeepMatcher [142] introduces the Feature Transformation Module to ensure a smooth transition of locally aggregated features extracted from CNNs to features with a global receptive field, extracted from Transformers. It also presents SlimFormer, which builds deep networks, employing a hierarchical strategy that enables the network to adaptively absorb information exchange within residual blocks, simulating human-like behavior. OAMatcher [166] proposes the Overlapping Areas Prediction Module to capture keypoints in co-visible regions and conduct feature enhancement among them, simulating how humans shift focus from entire images to overlapping regions. They also propose a Matching Label Weight Strategy to generate coefficients for evaluating the reliability of true matching labels, using probabilities to determine whether the matching labels are correct. CasMTR [167] proposes to enhance the transformer-based matching pipeline by incorporating new stages of cascade matching and NMS detection.

PMatch [168] enhances geometric matching performance by pretraining with transformer modules using a paired masked image modeling pretext task, utilizing the LoFTR module. To effectively leverage geometric priors, SEM [169] introduces a structured feature extractor that models relative positional relationships between pixels and highly confident anchor points. It also incorporates epipolar attention and matching techniques to filter out irrelevant regions based on epipolar constraints. DKM [170] addresses the two-view geometric estimation problem by devising a dense feature matching method. DKM presents a robust global matcher with a kernel regressor and embedded decoder, involving warp refinement through large depth-wise kernels applied to stacked feature maps. Building on this, RoMa [171] represents a significant advancement in dense feature matching by applying a Markov chain framework to analyze and improve the matching process. It introduces a two-stage approach: a coarse stage for globally consistent matching and a refinement stage for precise localization. This method, which separates the initial matching from the refinement process and employs robust regression losses for greater accuracy, has led to notable improvements in matching performance, outperforming the current state of the art.
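To illustrate the interleaved self- and cross-attention scheme shared by LoFTR-style matchers, the sketch below applies one round of both over two flattened dense feature maps. It uses standard softmax attention from PyTorch for brevity, whereas the published methods typically rely on linear attention and positional encodings; all dimensions here are arbitrary choices.

    import torch
    import torch.nn as nn

    class InterleavedAttentionBlock(nn.Module):
        """One self-attention + cross-attention round over two feature sets."""

        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, feats_a, feats_b):
            # feats_*: (B, N, dim) flattened dense features of each image.
            feats_a = feats_a + self.self_attn(feats_a, feats_a, feats_a)[0]
            feats_b = feats_b + self.self_attn(feats_b, feats_b, feats_b)[0]
            # Cross-attention lets each image attend to the other, enlarging the receptive field.
            feats_a = feats_a + self.cross_attn(feats_a, feats_b, feats_b)[0]
            feats_b = feats_b + self.cross_attn(feats_b, feats_a, feats_a)[0]
            return feats_a, feats_b

Stacking a few such blocks and correlating the resulting features yields the coarse matching scores that are then refined at finer resolution.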
3.3. Patch Based

The Patch-Based matching approach enhances point correspondences by matching local image regions. It involves dividing images into patches, extracting descriptor vectors for each, and then matching these vectors to establish correspondences. This technique accommodates large displacements and is valuable in various computer vision applications. Figure 7 illustrates the general architecture of the Patch-Based matching approach.

[...] such as visual localization, multi-view stereo, and the synthesis of innovative perspectives. The developmental trajectory of SfM, underscored by extensive scholarly investigations, has engendered firmly established methodologies, sophisticated open-source frameworks exemplified by Bundler [176] and COLMAP [12], and advanced proprietary software solutions. These frameworks are meticulously tailored to ensure precision and scalability when handling expansive scenes. Conventional SfM methodologies rely upon the identification and correlation of sparse characteristic points dispersed across manifold perspectives. Nonetheless, this methodology encounters formidable challenges in regions with scant texture [...]

AEPE quantifies the Euclidean disparity between predicted and actual flow fields, computed as the average over valid pixels within the target image. Fl assesses the average percentage of outliers across all pixels, where outliers are defined as flow errors exceeding both 3 pixels and 5% of the ground truth flow. PCK elucidates the percentage of appropriately matched estimated points x̂_i situated within a specified threshold (in pixels) from the corresponding ground truth points x_i.
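For reference, these three metrics reduce to a few lines of NumPy. The sketch below assumes the common conventions of dense flow fields stored as (H, W, 2) arrays with a boolean validity mask and keypoints stored as (N, 2) arrays.

    import numpy as np

    def aepe(flow_pred, flow_gt, valid):
        """Average End-Point Error over valid pixels."""
        epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
        return epe[valid].mean()

    def fl_outlier_ratio(flow_pred, flow_gt, valid):
        """KITTI-style Fl: a pixel is an outlier if its EPE exceeds 3 px and 5% of the GT magnitude."""
        epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
        mag = np.linalg.norm(flow_gt, axis=-1)
        outlier = (epe > 3.0) & (epe > 0.05 * mag)
        return outlier[valid].mean()

    def pck(pts_pred, pts_gt, threshold_px):
        """Percentage of Correct Keypoints within threshold_px of the ground truth."""
        err = np.linalg.norm(pts_pred - pts_gt, axis=-1)
        return (err <= threshold_px).mean()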
Specifically, it quantifies the Euclidean disparity between pre- corner error distances up to 3, 5, and 10 pixels. Table 2 focuses
dicted and actual flow fields, computed as the average over valid on the ScanNet [225] test set, following the SuperGlue [69] test-
pixels within the target image. Fl assesses the average percent- ing protocol. The reported metric is the pose AUC error. Ta-
age of outliers across all pixels, where outliers are defined as ble 3 centers on the YFCC100M [226] test set, with a protocol
flow errors exceeding either 3 pixels or 5% of the ground truth based on RANSAC-Flow [244]. Additionally, the pose mAP
flows. PCK elucidates the percentage of appropriately matched (mean Average Precision) value is reported. A pose estimation
estimated points x̂i situated within a specified threshold (in pix- is considered an outlier if its maximum degree error in trans-
els) from the corresponding ground truth points xi . lation or rotation exceeds the threshold. Table 4 highlights the
6.1.6. Structure from Motion

As delineated in the evaluation framework prescribed by ETH [237], a suite of pivotal metrics is employed to rigorously evaluate the fidelity of the reconstruction process. These encompass the number of registered images, which serves as an indicator of the reconstruction's comprehensiveness, along with the number of sparse points, which provides insight into the depth and intricacy of the scene's depiction. The total number of observations in images, i.e., the verified image projections of sparse points, is pivotal for camera calibration and triangulation. The mean feature track length, indicative of the average count of verified image observations per sparse point, plays a vital role in ensuring precise calibration and robust triangulation. Lastly, the mean reprojection error is a critical measure for gauging the accuracy of the reconstruction, encapsulating the cumulative reprojection error observed in bundle adjustment and influenced by the thoroughness of the input data as well as the precision of keypoint detection.
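To make the last two metrics concrete, the sketch below computes per-observation reprojection residuals for a single pinhole camera. It is a minimal illustration with assumed variable names, not the ETH or COLMAP evaluation code itself:

```python
import numpy as np

def reprojection_errors(K, R, t, points3d, keypoints):
    """Reprojection residuals (in pixels) of triangulated points in one image.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation;
    points3d: (N, 3) sparse points; keypoints: (N, 2) verified detections.
    """
    cam = points3d @ R.T + t            # world -> camera coordinates
    proj = cam @ K.T                    # pinhole projection
    proj = proj[:, :2] / proj[:, 2:3]   # perspective division
    return np.linalg.norm(proj - keypoints, axis=1)

# The mean reprojection error averages these residuals over all verified
# observations of all registered images; the mean track length is simply
# the total number of observations divided by the number of sparse points.
```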
The key metrics in ETH3D [239] are crucial for evaluating the effectiveness of various SfM methods. The AUC of pose error at different thresholds is used to assess the accuracy of multi-view camera pose estimation; this metric reflects the precision of the estimated camera poses in relation to the ground truth. Accuracy and Completeness percentages at different distance thresholds evaluate the 3D triangulation task: Accuracy indicates the proportion of reconstructed points lying within a certain distance of the ground truth, and Completeness measures the percentage of ground-truth points that are adequately represented within the reconstructed point cloud.
6.2. Quantitative Performance

In this section, we analyze the performance of several key methods in terms of the evaluation scores introduced in Section 6.1, covering the algorithms discussed earlier as well as additional methods. We compile their performance on popular benchmarks into tables, where the data is sourced either from the original authors or from the best results reported by other authors under the same evaluation conditions. Some publications report performance only on non-standard benchmarks or on subsets of the popular benchmark test sets; we do not include the performance of these methods.

The following tables summarize the performance of several major deep learning based matching models on different datasets. Table 1 highlights the HPatches [219] test set, adopting the evaluation protocol of LoFTR [72]; the reported metrics are the AUC of corner error distances up to 3, 5, and 10 pixels. Table 2 focuses on the ScanNet [225] test set, following the SuperGlue [69] testing protocol, and reports the pose AUC error. Table 3 centers on the YFCC100M [226] test set, with a protocol based on RANSAC-Flow [244]; in addition, the pose mAP (mean Average Precision) is reported, where a pose estimate is considered an outlier if its maximum angular error in translation or rotation exceeds the threshold. Table 4 highlights the MegaDepth [227] test set; the pose estimation AUC is reported, following the SuperGlue [69] evaluation methodology. Tables 5 and 6 report results on the Aachen Day-Night v1.0 [230] and v1.1 [9] benchmarks, for the full visual localization track and the local feature evaluation track, respectively. Table 7 focuses on the InLoc [116] test set; the reported metric is the percentage of correctly localized queries under specific error thresholds, following the HLoc [245] pipeline. Table 8 covers the KITTI [233] dataset, reporting the AEPE and the flow outlier ratio Fl for both the 2012 and 2015 versions. Table 9 focuses on ETH3D [239], presenting a detailed evaluation of various SfM methods as reported in DetectorFreeSfM [182]; this evaluation examines effectiveness across three crucial metrics: AUC, Accuracy, and Completeness.
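The AUC figures in these tables are areas under the cumulative error curve up to a given threshold (pose error in degrees for ScanNet and MegaDepth, corner error in pixels for HPatches). A minimal sketch of this computation is given below; it follows the thresholded-integration scheme popularized by the SuperGlue evaluation code, and the function and variable names are illustrative rather than taken from any specific release:

```python
import numpy as np

def error_auc(errors, thresholds):
    """AUC of the cumulative error curve at each threshold.

    errors: per-image-pair error, e.g., the max of rotation/translation angular
    error for pose, or the mean reprojected-corner error for homography.
    """
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)   # fraction of pairs below each error
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        idx = np.searchsorted(errors, t)
        e = np.concatenate((errors[:idx], [t]))
        r = np.concatenate((recall[:idx], [recall[idx - 1]]))
        # trapezoidal area under the recall-vs-error curve, normalized by t
        aucs.append(np.sum(np.diff(e) * (r[:-1] + r[1:]) / 2.0) / t)
    return aucs

# Example: error_auc(pose_errors_deg, [5, 10, 20]) yields the AUC@5/10/20 values.
```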
Table 1: Evaluation on HPatches [219] for homography estimation. We compare two groups of methods, Detector-based and Detector-free. Columns: AUC of corner error at @3px, @5px, and @10px, and the number of matches.

Detector-based:
  D2Net [62]+NN         23.2  35.9  53.6  0.2K
  R2D2 [63]+NN          50.6  63.9  76.8  0.5K
  DISK [112]+NN         52.3  64.9  78.9  1.1K
  SP+GFM [246]          51.9  65.8  79.1  2.0K
  SP+SuperGlue [69]     53.9  68.3  81.7  0.6K
Detector-free:
  COTR [157]            41.9  57.7  74.0  1.0K
  Sparse-NCNet [65]     48.9  54.2  67.1  1.0K
  DualRC-Net [66]       50.6  56.2  68.3  1.0K
  Patch2Pix [172]       59.3  70.6  81.2  0.7K
  3DG-STFM [247]        64.7  73.1  81.0  1.0K
  LoFTR [72]            65.9  75.6  84.6  1.0K
  SE2-LoFTR [159]       66.2  76.6  86.0  —
  QuadTree [160]        66.3  76.2  84.9  2.7K
  PDC-Net+ [146]        66.7  76.8  85.8  1.0K
  TopicFM [164]         67.3  77.0  85.7  1.0K
  ASpanFormer [73]      67.4  76.9  85.6  —
  SEM [169]             69.6  79.0  87.1  1.0K
  CasMTR-2c [167]       71.4  80.2  87.9  0.5K
  DKM [170]             71.3  80.6  88.5  5.0K
  ASTR [165]            71.7  80.3  88.0  1.0K
  PMatch [168]          71.9  80.7  88.5  —
  RoMa [171]            72.2  81.2  89.1  —
Table 2: ScanNet [225] two-view camera pose estimation. We compare two groups of methods, Detector-based and Detector-free. Columns: pose estimation AUC↑ at @5°, @10°, and @20°.

Detector-based:
  ORB+GMS [76]                 5.2  13.7  25.4
  D2Net [62]+NN                5.3  14.5  28.0
  ContextDesc+RT [101]         6.6  15.0  25.8
  ContextDesc+NN [101]         9.4  21.5  36.4
  SP+NN [61]                   9.4  21.5  36.4
  SP+PointCN [248]            11.4  25.5  41.4
  SP+HTMatch [139]            15.1  31.4  48.2
  SP+SGMNet [70]              15.4  32.1  48.3
  ContextDesc+SGMNet [70]     15.4  32.3  48.8
  SP+SuperGlue [69]           16.2  33.8  51.8
  SP+DenseGAP [138]           17.0  36.1  55.7
Detector-free:
  DualRC-Net [66]              7.7  17.9  30.5
  SEM [169]                   18.7  36.6  52.9
  PDC-Net(H) [68]             18.7  37.0  54.0
  PDC-Net+(H) [146]           20.3  39.4  57.1
  LoFTR-DT [72]               22.1  40.8  57.6
  3DG-STFM [247]              23.6  43.6  61.2
  LoFTR [72]+QuadTree [160]   23.9  43.2  60.3
  MatchFormer [162]           24.3  43.9  61.4
  QuadTree [160]              24.9  44.7  61.8
  ASpanFormer [73]            25.6  46.0  63.3
  OAMatcher [166]             26.1  45.3  62.1
  PATS [174]                  26.0  46.9  64.3
  CasMTR-4c [167]             27.1  47.0  64.4
  DeepMatcher-L [142]         27.3  46.3  62.5
  PMatch [168]                29.4  50.1  67.4
  DKM [170]                   29.4  50.7  68.3
  RoMa [171]                  31.8  53.4  70.9

7. Challenges and Opportunities

Deep learning has brought significant advantages to image-based local feature matching. However, several challenges remain to be addressed. In the following sections, we explore potential research directions that we believe will provide valuable momentum for further advancements in image-based local feature matching algorithms.

7.1. Efficient Attention and Transformer

For local feature matching, integrating transformers into GNN models can be considered, framing the task as a graph matching problem between two sets of features. To construct better matchers, the self-attention and cross-attention layers within transformers are commonly employed to exchange global visual and geometric information between nodes. In addition to matching the sparse descriptors produced by detectors, directly applying self-attention and cross-attention to CNN feature maps and generating matches in a coarse-to-fine manner has also been explored [69, 72]. However, the computational cost of the underlying matrix multiplications remains high when dealing with a large number of keypoints. In recent years, efforts have been made to improve the efficiency of transformers, and attempts to compute the two types of attention in parallel have continuously reduced complexity while maintaining performance [70, 71, 141, 140]. Future research will further optimize the structure of attention mechanisms and transformers, aiming to maintain high matching performance while reducing computational complexity. This will make local feature matching methods more efficient and applicable in real-time and resource-constrained environments.
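One widely used way to tame the quadratic cost is to replace softmax attention with a kernelized, linear-complexity variant, as done in several detector-free matchers (e.g., the linear attention used by LoFTR [72]). The PyTorch sketch below is illustrative; the elu+1 feature map and the tensor layout are assumptions of this example, not a reference implementation:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear-complexity attention over (B, N, H, D) query/key/value tensors.

    Cost scales as O(N * D^2) instead of the O(N^2 * D) of vanilla attention,
    which matters when N is the number of dense coarse-grid features.
    """
    q = F.elu(q) + 1                                   # positive feature map phi(q)
    k = F.elu(k) + 1                                   # positive feature map phi(k)
    kv = torch.einsum("bnhd,bnhe->bhde", k, v)         # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bnhd,bhd->bnh", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnhd,bhde,bnh->bnhe", q, kv, z)

# Self-attention feeds one image's features as q, k, v; cross-attention takes q
# from one image and k, v from the other, so the two views exchange global
# context before coarse matches are extracted.
```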
7.2. Adaptation Strategy

In recent years, significant progress has been made in research on adaptability in local feature matching [137, 165, 173, 174, 73]. For latency-sensitive applications, adaptive mechanisms can be incorporated into the matching process. This allows the network depth and width to be modulated based on factors such as visual overlap and appearance variation, enabling fine-grained control over the difficulty of the matching task. Furthermore, researchers have proposed various innovative approaches to address issues such as scale variation. One key challenge is how to adaptively adjust the size of the cropping grid based on image scale variations to avoid matching failures. Geometric inconsistencies in patch-level feature matching can also be alleviated through adaptive allocation strategies, combined with adaptive patch subdivision strategies that progressively improve correspondence quality from coarse to fine during matching. Attention spans can likewise be adjusted according to the difficulty of matching, yielding variable-sized adaptive attention regions at different positions; this allows the network to better adapt to features at different locations while capturing contextual information, thereby improving matching performance.
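As a concrete illustration of depth modulation, the sketch below shows an early-exit matching loop in which attention layers are skipped once a per-layer confidence head deems the assignment stable. This is a hedged sketch in the spirit of adaptive-depth matchers such as LightGlue [137]; the layer and classifier interfaces are assumed for illustration, not an actual API:

```python
import torch

def adaptive_depth_matching(layers, confidences, desc0, desc1, exit_conf=0.95):
    """Run self/cross-attention blocks until the matcher is confident enough.

    layers: attention blocks that update both descriptor sets (assumed interface);
    confidences: per-layer heads scoring how settled each point's assignment is.
    Easy image pairs exit after a few layers; hard pairs use the full depth.
    """
    for layer, head in zip(layers, confidences):
        desc0, desc1 = layer(desc0, desc1)                       # exchange context
        conf = torch.sigmoid(torch.cat([head(desc0), head(desc1)], dim=1))
        if conf.mean().item() > exit_conf:                       # confident: stop early
            break
    return desc0, desc1
```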
In summary, research on adaptability in local feature matching offers vast prospects and opportunities for future development, while also enhancing efficiency in terms of memory and computation. With the emergence of new demands and challenges across various fields, it is anticipated that adaptive mechanisms will play an increasingly important role in local feature matching. Future research could further explore finer-grained adaptive strategies to achieve more efficient and accurate matching results.

7.3. Weakly Supervised Learning

The field of local feature matching has not only achieved significant progress under fully supervised settings but has also shown potential in the realm of weakly supervised learning. Traditional fully supervised methods rely on dense ground-truth correspondence labels. In recent years, researchers have turned to self-supervised and weakly supervised learning to reduce the dependency on precise annotations. Self-supervised methods like SuperPoint [61] train on pairs of images generated through virtual homography transformations, yielding promising results; however, such simple geometric transformations might not work effectively in complex scenarios. Weakly supervised learning has become a research focus in the domain of local feature learning [111, 112, 172, 131, 133]. These methods often combine weakly supervised learning with the describe-then-detect pipeline, but direct use of a weakly supervised loss leads to noticeable performance drops. Some approaches rely solely on relative camera poses, learning descriptors via an epipolar loss. The limitations of weakly supervised methods lie in their difficulty in disentangling errors introduced by descriptors and keypoints, as well as in accurately distinguishing different descriptors. To overcome these challenges, carefully designed decoupled training pipelines have emerged, where the description network and detection network are trained separately until high-quality descriptors are obtained.
Chen et al. [256] propose innovative methods using convolutional neural networks for feature shape estimation, orientation assignment, and descriptor learning. Their approach establishes a standard shape and orientation for each feature, enabling a transition from supervised to self-supervised learning by eliminating the need for known feature matching relationships. They also introduce a 'weak match finder' in descriptor learning, enhancing feature appearance variability and improving descriptor invariance. These advancements signify significant progress in weakly supervised learning for feature matching, especially in cases involving substantial viewpoint and viewing direction changes.

These weakly supervised methods open up new prospects and opportunities for local feature learning, allowing models to be trained on larger and more diverse datasets and thereby obtaining more generalized descriptors. However, they still face challenges, such as effectively leveraging weak supervision signals and addressing the uncertainty of descriptors and keypoints. In the future, developments in weakly supervised learning for local feature matching might focus on finer loss function designs, better utilization of weak supervision signals, and broader application domains. Exploring mechanisms that combine weakly supervised learning with traditional fully supervised methods holds promise for enhancing the performance and generalization capabilities of local feature matching in complex scenes.
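The pose-only supervision mentioned above usually penalizes how far putative correspondences stray from the epipolar geometry implied by the relative camera pose. The PyTorch sketch below computes a Sampson-style epipolar residual that could serve as such a weak loss; it is a generic illustration with assumed inputs, not the loss of any particular paper:

```python
import torch

def skew(t):
    """3x3 skew-symmetric matrix of a length-3 translation tensor."""
    tx, ty, tz = t.unbind()
    zero = torch.zeros_like(tx)
    return torch.stack([
        torch.stack([zero, -tz, ty]),
        torch.stack([tz, zero, -tx]),
        torch.stack([-ty, tx, zero]),
    ])

def epipolar_loss(pts0, pts1, K0, K1, R, t, eps=1e-8):
    """Mean Sampson epipolar distance of matches (pts0, pts1) under pose (R, t).

    pts0, pts1: (N, 2) matched keypoints; K0, K1: 3x3 intrinsics;
    (R, t): relative pose taking points from camera 0 to camera 1.
    """
    E = skew(t) @ R                                      # essential matrix
    Fm = torch.inverse(K1).T @ E @ torch.inverse(K0)     # fundamental matrix
    x0 = torch.cat([pts0, pts0.new_ones(len(pts0), 1)], dim=1)
    x1 = torch.cat([pts1, pts1.new_ones(len(pts1), 1)], dim=1)
    Fx0 = x0 @ Fm.T                                      # epipolar lines in image 1
    Ftx1 = x1 @ Fm                                       # epipolar lines in image 0
    num = (x1 * Fx0).sum(dim=1) ** 2
    den = Fx0[:, 0]**2 + Fx0[:, 1]**2 + Ftx1[:, 0]**2 + Ftx1[:, 1]**2 + eps
    return (num / den).mean()
```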
Table 3: Evaluation on YFCC100M [226] for outdoor pose estimation. We compare two groups of methods, Detector-based and Detector-free.

Table 4: MegaDepth [227] two-view camera pose estimation. We compare two groups of methods, Detector-based and Detector-free. Columns: pose estimation AUC↑ at @5°, @10°, and @20°.

Detector-based:
  SP+SGMNet [70]        40.5  59.0  72.6
  SP+DenseGAP [138]     41.2  56.9  70.2
  SP+SuperGlue [69]     42.2  61.2  75.9
  SP+ClusterGNN [71]    44.2  58.5  70.3
Detector-free:
  Patch2Pix [172]       41.4  56.3  68.3
  ECO-TR [158]          48.3  65.8  78.5
  PDC-Net+ [146]        51.5  67.2  78.5
  3DG-STFM [247]        52.6  68.5  80.0
  SE2-LoFTR [159]       52.6  69.2  81.4
  LoFTR [72]            52.8  69.2  81.2
  MatchFormer [162]     52.9  69.7  82.0
  TopicFM [164]         54.1  70.1  81.6
  QuadTree [160]        54.6  70.5  82.2
  ASpanFormer [73]      55.3  71.5  83.1
  OAMatcher [166]       56.6  72.3  83.6
  DeepMatcher-L [142]   57.0  73.1  84.2
  SEM [169]             58.0  72.9  83.7
  ASTR [165]            58.4  73.1  83.8
  CasMTR-2c [167]       59.1  74.3  84.8
  DKM [170]             60.4  74.9  85.1
  PMatch [168]          61.4  75.7  85.7
  RoMa [171]            62.6  76.7  86.3

7.4. Foundation Segmentation Models

Typically, semantic segmentation models trained on datasets such as Cityscapes [257] and MIT ADE20k [258] offer fundamental semantic information and play a crucial role in enhancing the detection and description processes in specific environments [128, 129].

However, the advent of large foundation models such as SAM [259], DINO [260], and DINOv2 [261] marks a new era in artificial intelligence. While traditional segmentation models excel in their specific domains, these foundation models introduce a broader, more versatile approach. Their extensive pre-training on massive, diverse datasets equips them with remarkable zero-shot generalization, enabling them to adapt to a wide range of scenarios. For instance, SAMFeat [42] demonstrates how SAM, a model adept at segmenting "anything" in "any scene", can guide local feature learning with its rich, category-agnostic semantic knowledge. By distilling fine-grained semantic relations and focusing on edge detection, SAMFeat showcases a significant enhancement in local feature description and accuracy. Similarly, SelaVPR [262] shows how to effectively adapt the DINOv2 model using lightweight adapters to address the challenges of Visual Place Recognition (VPR), proficiently matching local features without the need for extensive spatial verification and thus streamlining the retrieval process.
Table 5: Visual localization evaluation on the Aachen Day-Night benchmark v1.0 [230] and v1.1 [9]. Results on the full visual localization track are reported. We compare two groups of methods, Detector-based and Detector-free. Columns: Day (0.25m,2°)/(0.5m,5°)/(1.0m,10°), then Night (0.25m,2°)/(0.5m,5°)/(1.0m,10°).

v1.0, Detector-based:
  SP+NN [61]                          85.4  93.3  97.2  |  75.5  86.7  92.9
  SP+CAPS [111]+NN                    86.3  93.0  95.9  |  83.7  90.8  96.9
  SP+SuperGlue [69]                   89.6  95.4  98.8  |  86.7  93.9  100.0
  SP+SGMNet [70]                      86.8  94.2  97.7  |  83.7  91.8  99.0
  SP+ClusterGNN [71]                  89.4  95.5  98.5  |  81.6  93.9  100.0
  SP+LightGlue [137]                  89.2  95.4  98.5  |  87.8  93.9  100.0
  ASLFeat [127]+NN                    82.3  89.2  92.7  |  67.3  79.6  85.7
  ASLFeat [127]+SGMNet [70]           86.8  93.4  97.1  |  86.7  94.9  98.0
  ASLFeat [127]+SuperGlue [69]        87.9  95.4  98.3  |  81.6  91.8  99.0
  ASLFeat [127]+ClusterGNN [71]       88.6  95.5  98.4  |  85.7  93.9  99.0
v1.0, Detector-free:
  Patch2Pix [172]                     84.6  92.1  96.5  |  82.7  92.9  99.0
v1.1, Detector-based:
  ISRF [251]                          87.1  94.7  98.3  |  74.3  86.9  97.4
  Rlocs [252]                         88.8  95.4  99.0  |  74.3  90.6  98.4
  KAPTURE+R2D2+APGeM [253]            90.0  96.2  99.5  |  72.3  86.4  97.9
  SP+SuperGlue [69]                   89.8  96.1  99.4  |  77.0  90.6  100.0
  SP+SuperGlue [69]+Patch2Pix [172]   89.3  95.8  99.2  |  78.0  90.6  99.0
  SP+GFM [246]                        90.2  96.4  99.5  |  74.0  91.5  99.5
  SP+LightGlue [137]                  90.2  96.0  99.4  |  77.0  91.1  100.0
v1.1, Detector-free:
  Patch2Pix [172]                     86.4  93.0  97.5  |  72.3  88.5  97.9
  LoFTR-DS [72]                       —     —     —     |  72.8  88.5  99.0
  LoFTR-OT [72]                       88.7  95.6  99.0  |  78.5  90.6  99.0
  ASpanFormer [73]                    89.4  95.6  99.0  |  77.5  91.6  99.5
  AdaMatcher-LoFTR [173]              89.2  96.0  99.3  |  79.1  90.6  99.5
  AdaMatcher-Quad [173]               89.2  95.9  99.2  |  79.1  92.1  99.5
  ASTR [165]                          89.9  95.6  99.2  |  76.4  92.1  99.5
  TopicFM [164]                       90.2  95.9  98.9  |  77.5  91.1  99.5
  CasMTR [167]                        90.4  96.2  99.3  |  78.5  91.6  99.5

Table 6: Visual localization evaluation on the Aachen Day-Night benchmark v1.0 [230] and v1.1 [9]. Results on the local feature evaluation track are reported. We compare two groups of methods, Detector-based and Detector-free.
Looking towards an open-world scenario, the versatility and robust generalization offered by these large foundation models present exciting prospects. Their ability to understand and interpret a vast array of scenes and objects far exceeds the scope of traditional segmentation networks, paving the way for advancements in feature matching across diverse and dynamic environments. In summary, while the contributions of traditional semantic segmentation networks remain invaluable, the integration of large foundation models offers a complementary and expansive approach, essential for pushing the boundaries of what is achievable in feature matching, particularly in open-world applications.

7.5. Mismatch Removal

Image matching, which involves establishing reliable connections between two images portraying a shared object or scene, poses intricate challenges due to the combinatorial nature of the process and the presence of outliers. Direct matching methodologies, such as point set registration and graph matching, often grapple with formidable computational demands and erratic performance.
Table 7: Visual localization on the InLoc benchmark [116]. We compare two groups of methods, Detector-based and Detector-free. Columns: DUC1 (0.25m,10°)/(0.5m,10°)/(1m,10°), then DUC2 (0.25m,10°)/(0.5m,10°)/(1m,10°).

Detector-based:
  SIFT+CAPS [111]+NN              38.4  56.6  70.7  |  35.1  48.9  58.8
  ISRF [251]                      39.4  58.1  70.2  |  41.2  61.1  69.5
  D2Net [62]+NN                   38.4  56.1  71.2  |  37.4  55.0  64.9
  R2D2 [63]+NN                    36.4  57.6  74.2  |  45.0  60.3  67.9
  KAPTURE [253]+R2D2 [63]         41.4  60.1  73.7  |  47.3  67.2  73.3
  SeLF [128]+NN                   41.4  61.6  73.2  |  44.3  61.1  68.7
  AWDesc [103]+NN                 41.9  61.6  72.2  |  45.0  61.1  70.2
  ASLFeat [127]+NN                39.9  59.1  71.7  |  43.5  58.8  64.9
  ASLFeat [127]+SGMNet [70]       43.9  62.1  68.2  |  45.0  63.4  73.3
  ASLFeat [127]+SuperGlue [69]    51.5  66.7  75.8  |  53.4  76.3  84.0
  ASLFeat [127]+ClusterGNN [71]   52.5  68.7  76.8  |  55.0  76.0  82.4
  SP+NN [61]                      40.4  58.1  69.7  |  42.0  58.8  69.5
  SP+ClusterGNN [71]              47.5  69.7  79.8  |  53.4  77.1  84.7
  SP+SuperGlue [69]               49.0  68.7  80.8  |  53.4  77.1  82.4
  SP+CAPS [111]+NN                40.9  60.6  72.7  |  43.5  58.8  68.7
  SP+LightGlue [137]              49.0  68.2  79.3  |  55.0  74.8  79.4
Detector-free:
  Sparse-NCNet [65]               41.9  62.1  72.7  |  35.1  48.1  55.0
  MTLDesc [102]                   41.9  61.6  72.2  |  45.0  61.1  70.2
  COTR [157]                      41.9  61.1  73.2  |  42.7  67.9  75.6
  Patch2Pix [172]                 44.4  66.7  78.3  |  49.6  64.9  72.5
  LoFTR-OT [72]                   47.5  72.2  84.8  |  54.2  74.8  85.5
  MatchFormer [162]               46.5  73.2  85.9  |  55.7  71.8  81.7
  ASpanFormer [73]                51.5  73.7  86.4  |  55.0  74.0  81.7
  TopicFM [164]                   52.0  74.7  87.4  |  53.4  74.8  83.2
  GlueStick [136]                 49.0  70.2  84.3  |  55.0  83.2  87.0
  SEM [169]                       52.0  74.2  87.4  |  50.4  76.3  83.2
  ASTR [165]                      53.0  73.7  87.4  |  52.7  76.3  84.0
  CasMTR [167]                    53.5  76.8  85.4  |  51.9  70.2  83.2
  PATS [174]                      55.6  71.2  81.0  |  58.8  80.9  85.5
Table 8: Optical flow results on the training splits of KITTI [233]. Following [157, 158], "+Intp." indicates interpolating the model output to obtain a per-pixel correspondence; † means evaluated with the DenseMatching tools provided by the authors of GLU-Net. This table contains generic matching networks. Columns: KITTI-2012 AEPE↓ and Fl-all[%]↓, then KITTI-2015 AEPE↓ and Fl-all[%]↓.

  DGC-Net [243]           8.50  32.28  |  14.97  50.98
  DGC-Net† [243]          7.96  34.35  |  14.33  50.35
  GLU-Net [67]            3.14  19.76  |   7.49  33.83
  GLU-Net+GOCor [143]     2.68  15.43  |   6.68  27.57
  RANSAC-Flow [244]       —      —     |  12.48   —
  COTR [157]              1.28   7.36  |   2.62   9.92
  COTR+Intp. [157]        2.26  10.50  |   6.12  16.90
  PDC-Net(D) [68]         2.08   7.98  |   5.22  15.13
  PDC-Net+(D) [146]       1.76   6.60  |   4.53  12.62
  COTR† [157]             1.15   6.98  |   2.06   9.14
  COTR†+Intp. [157]       1.47   8.79  |   3.65  13.65
  ECO-TR [158]            0.96   3.77  |   1.40   6.39
  ECO-TR+Intp. [158]      1.46   6.64  |   3.16  12.10
  PATS+Intp. [174]        1.17   4.04  |   3.39   9.68

Consequently, a bifurcated approach, commencing with preliminary match construction using feature descriptors such as SIFT, ORB, and SURF [33, 35, 34], followed by the application of local and global geometric constraints, has become a prevalent strategy. Nevertheless, these methodologies encounter limitations, notably when confronted with multi-modal images or substantial variations in viewpoint and lighting [263].

The evolution of methodologies for outlier rejection has been crucial in overcoming the challenges of mismatch elimination, as highlighted by Ma et al. [264]. Traditional methods, epitomized by RANSAC [39] and its variants such as USAC [265] and MAGSAC++ [266], have significantly improved the efficiency and accuracy of outlier rejection. Nonetheless, these approaches are limited by computational time constraints and their suitability for non-rigid contexts. Techniques specific to non-rigid scenarios, like ICF [267], have shown efficacy in addressing geometric distortions.
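In practice, such robust estimators are typically applied to the putative matches produced by a descriptor or a learned matcher. The sketch below shows one way to do this with OpenCV's USAC implementation of MAGSAC++ (assuming OpenCV ≥ 4.5; the wrapper function and its parameters are illustrative):

```python
import cv2
import numpy as np

def reject_outliers(pts0, pts1, thresh_px=1.0, confidence=0.9999):
    """Filter putative matches with MAGSAC++ [266] via OpenCV's USAC framework.

    pts0, pts1: (N, 2) float32 arrays of corresponding keypoints.
    Returns the estimated fundamental matrix and a boolean inlier mask.
    """
    F, mask = cv2.findFundamentalMat(
        pts0, pts1, cv2.USAC_MAGSAC, thresh_px, confidence)
    if mask is None:                      # estimation can fail on degenerate input
        return None, np.zeros(len(pts0), dtype=bool)
    return F, mask.ravel().astype(bool)
```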
The integration of deep learning into mismatch elimination has opened new pathways for enhancing feature matching. Yi et al. [248] introduced Context Normalization (CNe), a groundbreaking concept that has transformed wide-baseline stereo correspondence by effectively distinguishing inliers from outliers. Expanding upon this, Sun et al. [268] developed Attentive Context Networks (ACNe), which improved the handling of permutation-equivariant data through Attentive Context Normalization, yielding significant advances in camera pose estimation and point cloud classification. Zhang et al. [77] proposed OANet, a methodology that precisely determines two-view correspondences and bolsters geometric estimation using a hierarchical clustering approach. Zhao et al. [269] introduced NM-Net, a layered network focusing on the selection of feature correspondences with compatibility-specific mining, demonstrating outstanding performance in various settings. Shape-Former [270] addresses multimodal and multiview image matching challenges, focusing on robust mismatch removal through a hybrid neural network.
Leveraging CNNs and Transformers, Shape-Former introduces ShapeConv for learning over sparse matches, excelling in outlier estimation and consensus representation while showcasing superior performance. Given that RANSAC is an integral part of the matching pipeline, recent innovations have also significantly enhanced its integration with deep learning. DSAC [271] introduces a paradigm shift by making RANSAC differentiable, employing a probabilistic selection mechanism that facilitates its integration into end-to-end trainable deep learning pipelines; this approach not only retains the robustness of traditional RANSAC but also leverages deep learning to directly minimize expected losses. CA-RANSAC [272], on the other hand, evolves the RANSAC framework by incorporating an adaptive consensus mechanism through a novel attention layer, which dynamically refines per-point estimation states based on accumulated residuals, improving model refinement and sample selection. Recent developments such as LSVANet [273], LGSC [263], and HCA-Net [185] have shown promise in more effectively discerning inliers from outliers; these approaches leverage deep learning modules for geometric estimation and feature correspondence categorization, marking an advance over conventional methods.

Several needs stand out. First, the development of more generalized and robust learning-based methodologies adept at handling a diverse array of scenarios, encompassing non-rigid transformations and multi-modal images, is imperative. Second, methodologies that amalgamate the virtues of traditional geometric approaches and contemporary learning-based techniques are required; such hybrid approaches hold the potential to deliver superior performance by capitalizing on the strengths of both paradigms. Lastly, the exploration of innovative learning architectures and loss functions tailored for mismatch elimination can unveil new prospects in feature matching, elevating the overall resilience and precision of computer vision systems. In conclusion, the elimination of mismatches persists as a pivotal yet formidable facet of local feature matching, and the ongoing evolution of both traditional and learning-based methodologies opens promising avenues to address existing limitations and unlock new potential in computer vision applications.

Table 9: Evaluation of SfM methods on ETH3D [239] for multi-view camera pose estimation and 3D triangulation. The table segregates methods into Detector-based and Detector-free categories; the results are derived from DetectorFreeSfM [182].

7.6. Deep Learning and Handcrafted Analogies

The field of image matching is witnessing a unique blend of deep learning and traditional handcrafted techniques. This convergence is evident in the adoption of foundational elements from classic methods, such as the "SIFT" pipeline, in recent semi-dense, detector-free approaches. Notable examples of this trend include the Hybrid Pipeline (HP) by Bellavia et al. [274], HarrisZ+ [275], and Slime [276], all demonstrating competitive capabilities alongside state-of-the-art deep methods. The HP method integrates handcrafted and learning-based components while maintaining the rotational invariance that is crucial for photogrammetric surveys, and features a novel Keypoint Filtering by Coverage (KFC) module that enhances the accuracy of the overall pipeline. HarrisZ+ represents an evolution of the classic Harris corner detector, optimized to synergize with modern image matching components; it yields more discriminative and more accurately placed keypoints, aligning closely with results from contemporary deep learning models. Slime employs a novel strategy of modeling scenes with local overlapping planes, merging local affine approximation principles with global matching constraints; this hybrid approach echoes traditional image matching processes and challenges the performance of deep learning methods. Meanwhile, it is important to highlight a key distinction between deep learning methods and traditional handcrafted detectors: rotation equivariance. Despite their remarkable matching performance, modern methods often fall short in handling in-plane rotations, a property inherently built into handcrafted detectors. This shortcoming reveals a performance gap under rotational transformations and underscores the importance of designing or training deep learning models to explicitly address it. By focusing on the development of rotation-equivariant approaches, as exemplified by SE2-LoFTR [159] and S-TREK [98], the field moves closer to bridging this gap, combining the precision of deep learning with the robustness of handcrafted detectors to orientation changes.

These advancements signify that, despite the significant strides made by deep learning methods like LoFTR and SuperGlue, the foundational principles of handcrafted techniques remain vital. The integration of classical concepts with modern computational power, as seen in HP, HarrisZ+, Slime, SE2-LoFTR, and S-TREK, leads to robust image matching solutions. These methods offer potential avenues for future research that blends diverse methodologies, bridging the gap between traditional and modern approaches in image matching.
7.7. Utilizing Geometric Information

When facing challenges such as texturelessness, occlusion, and repetitive patterns, traditional local feature matching methods may perform poorly. In recent years, researchers have therefore focused on better exploiting geometric information to enhance the effectiveness of local feature matching under these conditions. Several studies [169, 170, 146, 168] have indicated that leveraging geometric information holds significant potential for local feature matching. By capturing geometric relationships between pixels more accurately and combining geometric priors with image appearance information, these methods can enhance the robustness and accuracy of matching in complex scenes. However, this direction presents numerous opportunities and challenges for future development. Firstly, how to model geometric information more profoundly to better address scenarios involving large displacements, occlusions, and textureless regions remains a critical question. Secondly, improving the performance of confidence estimation to yield more reliable matching results is also a direction worthy of investigation.

The introduction of geometric priors expands feature matching beyond mere appearance similarity to consider an object's behavior from different viewpoints. This trend suggests that dense matching methods hold promise for tackling challenges posed by large displacements and appearance variations. It also implies that future development in geometric matching may increasingly focus on dense feature matching, leveraging geometric information and prior knowledge to enhance matching performance.
7.8. Advancing Cultural Heritage Preservation

The integration of deep learning into historical image matching heralds a new era for cultural heritage preservation, offering unparalleled opportunities while presenting unique challenges. Insights from recent research underscore the potential of advanced deep learning methods to overcome the limitations of traditional techniques [277]. These approaches have shown remarkable resilience to the inherent radiometric and geometric discrepancies that plague the matching of historical and contemporary imagery, thereby facilitating the accurate co-registration of multi-temporal scenes. Techniques such as SuperGlue [69], LoFTR [72], and DISK [112] have been identified as particularly effective, surpassing classical methods by achieving higher robustness against severe illumination and viewpoint shifts. This advancement enables more accurate reconstructions of historical sites and artifacts [278], thus enhancing knowledge transfer and cultural heritage promotion through immersive technologies such as virtual and augmented reality.

However, the journey towards fully harnessing the capabilities of deep learning in this domain is not without obstacles. Challenges persist in managing extensive image rotations and variations in scale, which are common in historical datasets [55]. Additionally, the computational intensity required to process high-resolution images presents a significant hurdle, particularly when aiming for real-time application in web and VR/AR platforms [279]. These technical challenges underscore the necessity for ongoing research and development to refine deep learning algorithms and optimize them for the specific demands of cultural heritage applications.

Moreover, the dynamic nature of deep learning presents an opportunity to continually improve the accuracy and efficiency of historical image matching. As algorithms evolve and new approaches emerge, there is potential for even more sophisticated methods of feature detection, extraction, and matching that could further transform the field. The exploration of these novel strategies, alongside the adaptation of current methodologies to the unique characteristics of historical imagery, is essential for advancing our ability to digitally preserve and explore our cultural heritage [55]. In summary, the confluence of deep learning with historical image matching offers a promising pathway towards bridging the gap between past and present. Future research can unlock new possibilities for the preservation, understanding, and dissemination of cultural heritage, making it more accessible and engaging for generations to come.

8. Conclusions

We have investigated various algorithms related to local feature matching based on deep learning models over the past five years. These algorithms have demonstrated impressive performance in various local feature matching tasks and benchmark tests. They can be broadly categorized into Detector-based and Detector-free models. The use of feature detectors reduces the scope of matching and relies on the processes of keypoint detection and feature description. Detector-free methods, on the other hand, directly capture richer context from the raw images to generate dense matches. We further discuss the strengths and weaknesses of existing local feature matching algorithms, introduce popular datasets and evaluation standards, and summarize the quantitative performance of these models on common benchmarks such as HPatches, ScanNet, YFCC100M, MegaDepth, and Aachen Day-Night. Lastly, we explore the open challenges and potential research avenues that the field of local feature matching may encounter in the forthcoming years. Our aim is not only to enhance researchers' understanding of local feature matching but also to inspire and guide future research endeavors in this domain.

References

[1] L. Tang, J. Yuan, J. Ma, Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network, Information Fusion 82 (2022) 28–42.
[2] S.-Y. Cao, B. Yu, L. Luo, R. Zhang, S.-J. Chen, C. Li, H.-L. Shen, Pcnet: A structure similarity enhancement method for multispectral and multimodal image registration, Information Fusion 94 (2023) 200–214.
[3] M. Hu, B. Sun, X. Kang, S. Li, Multiscale structural feature transform for multi-modal image matching, Information Fusion 95 (2023) 341–354.
[4] K. Sun, J. Yu, W. Tao, X. Li, C. Tang, Y. Qian, A unified feature-spatial object detection via frequency domain disentanglement, in: Proceedings
cycle consistency fusion framework for robust image matching, Infor- of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
mation Fusion 97 (2023) 101810. tion, 2023, pp. 1064–1073.
[5] Z. Hou, Y. Liu, L. Zhang, Pos-gift: A geometric and intensity-invariant [25] C. Cao, X. Fu, H. Liu, Y. Huang, K. Wang, J. Luo, Z.-J. Zha, Event-
feature transformation for multimodal images, Information Fusion 102 guided person re-identification via sparse-dense complementary learn-
(2024) 102027. ing, in: Proceedings of the IEEE/CVF Conference on Computer Vision
[6] T. Sattler, B. Leibe, L. Kobbelt, Improving image-based localization by and Pattern Recognition, 2023, pp. 17990–17999.
active correspondence search, in: Computer Vision–ECCV 2012: 12th [26] C. Harris, M. Stephens, et al., A combined corner and edge detector, in:
European Conference on Computer Vision, Florence, Italy, October 7- Alvey vision conference, Vol. 15, Citeseer, 1988, pp. 10–5244.
13, 2012, Proceedings, Part I 12, Springer, 2012, pp. 752–765. [27] S. M. Smith, J. M. Brady, Susan—a new approach to low level image
[7] T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, T. Pa- processing, International journal of computer vision 23 (1) (1997) 45–
jdla, Are large-scale 3d models really necessary for accurate visual local- 78.
ization?, in: Proceedings of the IEEE Conference on Computer Vision [28] E. Rosten, T. Drummond, Fusing points and lines for high performance
and Pattern Recognition, 2017, pp. 1637–1646. tracking, in: Tenth IEEE International Conference on Computer Vision
[8] S. Cai, Y. Guo, S. Khan, J. Hu, G. Wen, Ground-to-aerial image geo- (ICCV’05) Volume 1, Vol. 2, Ieee, 2005, pp. 1508–1515.
localization with a hard exemplar reweighting triplet loss, in: Proceed- [29] J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide-baseline stereo
ings of the IEEE/CVF International Conference on Computer Vision, from maximally stable extremal regions, Image and vision computing
2019, pp. 8391–8400. 22 (10) (2004) 761–767.
[9] Z. Zhang, T. Sattler, D. Scaramuzza, Reference pose generation for long- [30] N. Dalal, B. Triggs, Histograms of oriented gradients for human detec-
term visual localization via learned features and view synthesis, Interna- tion, in: 2005 IEEE computer society conference on computer vision
tional Journal of Computer Vision 129 (2021) 821–844. and pattern recognition (CVPR’05), Vol. 1, Ieee, 2005, pp. 886–893.
[10] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. [31] K. Mikolajczyk, C. Schmid, A performance evaluation of local descrip-
Seitz, R. Szeliski, Building rome in a day, Communications of the ACM tors, IEEE transactions on pattern analysis and machine intelligence
54 (10) (2011) 105–112. 27 (10) (2005) 1615–1630.
[11] J. Heinly, J. L. Schonberger, E. Dunn, J.-M. Frahm, Reconstructing the [32] M. Calonder, V. Lepetit, C. Strecha, P. Fua, Brief: Binary robust inde-
world* in six days*(as captured by the yahoo 100 million image dataset), pendent elementary features, in: Computer Vision–ECCV 2010: 11th
in: Proceedings of the IEEE conference on computer vision and pattern European Conference on Computer Vision, Heraklion, Crete, Greece,
recognition, 2015, pp. 3287–3295. September 5-11, 2010, Proceedings, Part IV 11, Springer, 2010, pp.
[12] J. L. Schonberger, J.-M. Frahm, Structure-from-motion revisited, in: 778–792.
Proceedings of the IEEE conference on computer vision and pattern [33] D. G. Lowe, Distinctive image features from scale-invariant keypoints,
recognition, 2016, pp. 4104–4113. International journal of computer vision 60 (2004) 91–110.
[13] J. Wang, Y. Zhong, Y. Dai, S. Birchfield, K. Zhang, N. Smolyanskiy, [34] H. Bay, T. Tuytelaars, L. Van Gool, Surf: Speeded up robust features,
H. Li, Deep two-view structure-from-motion revisited, in: Proceedings Lecture notes in computer science 3951 (2006) 404–417.
of the IEEE/CVF conference on Computer Vision and Pattern Recogni- [35] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, Orb: An efficient al-
tion, 2021, pp. 8953–8962. ternative to sift or surf, in: 2011 International conference on computer
[14] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, vision, Ieee, 2011, pp. 2564–2571.
I. Reid, J. J. Leonard, Past, present, and future of simultaneous localiza- [36] S. Leutenegger, M. Chli, R. Y. Siegwart, Brisk: Binary robust invari-
tion and mapping: Toward the robust-perception age, IEEE Transactions ant scalable keypoints, in: 2011 International conference on computer
on robotics 32 (6) (2016) 1309–1332. vision, Ieee, 2011, pp. 2548–2555.
[15] R. Mur-Artal, J. D. Tardós, Orb-slam2: An open-source slam system [37] P. F. Alcantarilla, A. Bartoli, A. J. Davison, Kaze features, in: Computer
for monocular, stereo, and rgb-d cameras, IEEE transactions on robotics Vision–ECCV 2012: 12th European Conference on Computer Vision,
33 (5) (2017) 1255–1262. Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12, Springer,
[16] Y. Zhao, S. Xu, S. Bu, H. Jiang, P. Han, Gslam: A general slam frame- 2012, pp. 214–227.
work and benchmark, in: Proceedings of the IEEE/CVF International [38] P. F. Alcantarilla, T. Solutions, Fast explicit diffusion for accelerated
Conference on Computer Vision, 2019, pp. 1110–1120. features in nonlinear scale spaces, IEEE Trans. Patt. Anal. Mach. Intell
[17] C. Liu, J. Yuen, A. Torralba, Sift flow: Dense correspondence across 34 (7) (2011) 1281–1298.
scenes and its applications, IEEE transactions on pattern analysis and [39] M. A. Fischler, R. C. Bolles, Random sample consensus: a paradigm for
machine intelligence 33 (5) (2010) 978–994. model fitting with applications to image analysis and automated cartog-
[18] P. Weinzaepfel, J. Revaud, Z. Harchaoui, C. Schmid, Deepflow: Large raphy, Communications of the ACM 24 (6) (1981) 381–395.
displacement optical flow with deep matching, in: Proceedings of the [40] R. Xu, C. Wang, B. Fan, Y. Zhang, S. Xu, W. Meng, X. Zhang, Domain-
IEEE international conference on computer vision, 2013, pp. 1385– desc: Learning local descriptors with domain adaptation, in: ICASSP
1392. 2022-2022 IEEE International Conference on Acoustics, Speech and
[19] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, Signal Processing (ICASSP), IEEE, 2022, pp. 2505–2509.
P. Van Der Smagt, D. Cremers, T. Brox, Flownet: Learning optical flow [41] R. Xu, C. Wang, S. Xu, W. Meng, Y. Zhang, B. Fan, X. Zhang, Domain-
with convolutional networks, in: Proceedings of the IEEE international feat: Learning local features with domain adaptation, IEEE Transactions
conference on computer vision, 2015, pp. 2758–2766. on Circuits and Systems for Video Technology (2023).
[20] F. Radenović, G. Tolias, O. Chum, Fine-tuning cnn image retrieval with [42] J. Wu, R. Xu, Z. Wood-Doughty, C. Wang, Segment anything model is a
no human annotation, IEEE transactions on pattern analysis and machine good teacher for local feature learning, arXiv preprint arXiv:2309.16992
intelligence 41 (7) (2018) 1655–1668. (2023).
[21] B. Cao, A. Araujo, J. Sim, Unifying deep local and global features for [43] X. Jiang, J. Ma, G. Xiao, Z. Shao, X. Guo, A review of multimodal im-
image search, in: Computer Vision–ECCV 2020: 16th European Con- age matching: Methods and applications, Information Fusion 73 (2021)
ference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, 22–71.
Springer, 2020, pp. 726–743. [44] R. Xu, Y. Li, C. Wang, S. Xu, W. Meng, X. Zhang, Instance segmenta-
[22] P. Chhabra, N. K. Garg, M. Kumar, Content-based image retrieval sys- tion of biological images using graph convolutional network, Engineer-
tem using orb and sift features, Neural Computing and Applications 32 ing Applications of Artificial Intelligence 110 (2022) 104739.
(2020) 2725–2733. [45] R. Xu, C. Wang, S. Xu, W. Meng, X. Zhang, Dc-net: Dual context net-
[23] D. Zhang, H. Li, W. Cong, R. Xu, J. Dong, X. Chen, Task relation distil- work for 2d medical image segmentation, in: Medical Image Computing
lation and prototypical pseudo label for incremental named entity recog- and Computer Assisted Intervention–MICCAI 2021: 24th International
nition, in: Proceedings of the 32nd ACM International Conference on Conference, Strasbourg, France, September 27–October 1, 2021, Pro-
Information and Knowledge Management, 2023, pp. 3319–3329. ceedings, Part I 24, Springer, 2021, pp. 503–513.
[24] K. Wang, X. Fu, Y. Huang, C. Cao, G. Shi, Z.-J. Zha, Generalized uav [46] R. Xu, C. Wang, S. Xu, W. Meng, X. Zhang, Wave-like class activation
map with representation fusion for weakly-supervised semantic segmen- IEEE/CVF Conference on Computer Vision and Pattern Recognition,
tation, IEEE Transactions on Multimedia (2023). 2021, pp. 5714–5724.
[47] W. Xu, R. Xu, C. Wang, S. Xu, L. Guo, M. Zhang, X. Zhang, Spectral [69] P.-E. Sarlin, D. DeTone, T. Malisiewicz, A. Rabinovich, Superglue:
prompt tuning: Unveiling unseen classes for zero-shot semantic segmen- Learning feature matching with graph neural networks, in: Proceedings
tation, arXiv preprint arXiv:2312.12754 (2023). of the IEEE/CVF conference on computer vision and pattern recogni-
[48] C. Wang, R. Xu, S. Xu, W. Meng, X. Zhang, Treating pseudo-labels tion, 2020, pp. 4938–4947.
generation as image matting for weakly supervised semantic segmen- [70] H. Chen, Z. Luo, J. Zhang, L. Zhou, X. Bai, Z. Hu, C.-L. Tai, L. Quan,
tation, in: Proceedings of the IEEE/CVF International Conference on Learning to match features with seeded graph matching network, in:
Computer Vision, 2023, pp. 755–765. Proceedings of the IEEE/CVF International Conference on Computer
[49] M. Awrangjeb, G. Lu, C. S. Fraser, Performance comparisons of Vision, 2021, pp. 6301–6310.
contour-based corner detectors, IEEE Transactions on Image Process- [71] Y. Shi, J.-X. Cai, Y. Shavit, T.-J. Mu, W. Feng, K. Zhang, Clustergnn:
ing 21 (9) (2012) 4167–4179. Cluster-based coarse-to-fine graph neural network for efficient feature
[50] Y. Li, S. Wang, Q. Tian, X. Ding, A survey of recent advances in visual matching, in: Proceedings of the IEEE/CVF Conference on Computer
feature detection, Neurocomputing 149 (2015) 736–751. Vision and Pattern Recognition, 2022, pp. 12517–12526.
[51] S. Krig, S. Krig, Interest point detector and feature descriptor survey, [72] J. Sun, Z. Shen, Y. Wang, H. Bao, X. Zhou, Loftr: Detector-free local
Computer Vision Metrics: Textbook Edition (2016) 187–246. feature matching with transformers, in: Proceedings of the IEEE/CVF
[52] K. Joshi, M. I. Patel, Recent advances in local feature detector and de- conference on computer vision and pattern recognition, 2021, pp. 8922–
scriptor: a literature survey, International Journal of Multimedia Infor- 8931.
mation Retrieval 9 (4) (2020) 231–247. [73] H. Chen, Z. Luo, L. Zhou, Y. Tian, M. Zhen, T. Fang, D. McKinnon,
[53] J. Ma, X. Jiang, A. Fan, J. Jiang, J. Yan, Image matching from hand- Y. Tsin, L. Quan, Aspanformer: Detector-free image matching with
crafted to deep features: A survey, International Journal of Computer adaptive span transformer, in: Computer Vision–ECCV 2022: 17th Eu-
Vision 129 (2021) 23–79. ropean Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings,
[54] J. Jing, T. Gao, W. Zhang, Y. Gao, C. Sun, Image feature information Part XXXII, Springer, 2022, pp. 20–36.
extraction for interest point detection: A comprehensive review, IEEE [74] J. Zhang, L. Dai, F. Meng, Q. Fan, X. Chen, K. Xu, H. Wang, 3d-aware
Transactions on Pattern Analysis and Machine Intelligence (2022). object goal navigation via simultaneous exploration and identification,
[55] F. Bellavia, C. Colombo, L. Morelli, F. Remondino, Challenges in image in: Proceedings of the IEEE/CVF Conference on Computer Vision and
matching for cultural heritage: an overview and perspective, in: Inter- Pattern Recognition, 2023, pp. 6672–6682.
national Conference on Image Analysis and Processing, Springer, 2022, [75] J. Zhang, Y. Tang, H. Wang, K. Xu, Asro-dio: Active subspace ran-
pp. 210–222. dom optimization based depth inertial odometry, IEEE Transactions on
[56] G. Haskins, U. Kruger, P. Yan, Deep learning in medical image registra- Robotics 39 (2) (2022) 1496–1508.
tion: a survey, Machine Vision and Applications 31 (2020) 1–18. [76] J. Bian, W.-Y. Lin, Y. Matsushita, S.-K. Yeung, T.-D. Nguyen, M.-M.
[57] S. Bharati, M. Mondal, P. Podder, V. Prasath, Deep learning for Cheng, Gms: Grid-based motion statistics for fast, ultra-robust feature
medical image registration: A comprehensive review, arXiv preprint correspondence, in: Proceedings of the IEEE conference on computer
arXiv:2204.11341 (2022). vision and pattern recognition, 2017, pp. 4181–4190.
[58] J. Chen, Y. Liu, S. Wei, Z. Bian, S. Subramanian, A. Carass, J. L. [77] J. Zhang, D. Sun, Z. Luo, A. Yao, L. Zhou, T. Shen, Y. Chen, L. Quan,
Prince, Y. Du, A survey on deep learning in medical image registration: H. Liao, Learning two-view correspondences and geometry using order-
New technologies, uncertainty, evaluation metrics, and beyond, arXiv aware network, in: Proceedings of the IEEE/CVF international confer-
preprint arXiv:2307.15615 (2023). ence on computer vision, 2019, pp. 5845–5854.
[59] S. Paul, U. C. Pati, A comprehensive review on remote sensing im- [78] K. M. Yi, E. Trulls, V. Lepetit, P. Fua, Lift: Learned invariant feature
age registration, International Journal of Remote Sensing 42 (14) (2021) transform, in: Computer Vision–ECCV 2016: 14th European Confer-
5396–5432. ence, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings,
[60] B. Zhu, L. Zhou, S. Pu, J. Fan, Y. Ye, Advances and challenges in multi- Part VI 14, Springer, 2016, pp. 467–483.
modal remote sensing image registration, IEEE Journal on Miniaturiza- [79] M. Brown, G. Hua, S. Winder, Discriminative learning of local image
tion for Air and Space Systems (2023). descriptors, IEEE transactions on pattern analysis and machine intelli-
[61] D. DeTone, T. Malisiewicz, A. Rabinovich, Superpoint: Self-supervised gence 33 (1) (2010) 43–57.
interest point detection and description, in: Proceedings of the IEEE [80] Y. Tian, B. Fan, F. Wu, L2-net: Deep learning of discriminative patch
conference on computer vision and pattern recognition workshops, descriptor in euclidean space, in: Proceedings of the IEEE conference
2018, pp. 224–236. on computer vision and pattern recognition, 2017, pp. 661–669.
[62] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, T. Sat- [81] K. M. Yi, Y. Verdie, P. Fua, V. Lepetit, Learning to assign orientations
tler, D2-net: A trainable cnn for joint description and detection of local to feature points, in: Proceedings of the IEEE conference on computer
features, in: Proceedings of the ieee/cvf conference on computer vision vision and pattern recognition, 2016, pp. 107–116.
and pattern recognition, 2019, pp. 8092–8101. [82] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verifi-
[63] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, cation using a” siamese” time delay neural network, Advances in neural
M. Humenberger, R2d2: repeatable and reliable detector and descriptor, information processing systems 6 (1993).
arXiv preprint arXiv:1906.06195 (2019). [83] A. Mishchuk, D. Mishkin, F. Radenovic, J. Matas, Working hard to
[64] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, J. Sivic, know your neighbor’s margins: Local descriptor learning loss, Advances
Neighbourhood consensus networks, Advances in neural information in neural information processing systems 30 (2017).
processing systems 31 (2018). [84] K. He, Y. Lu, S. Sclaroff, Local descriptors optimized for average pre-
[65] I. Rocco, R. Arandjelović, J. Sivic, Efficient neighbourhood consensus cision, in: Proceedings of the IEEE conference on computer vision and
networks via submanifold sparse convolutions, in: Computer Vision– pattern recognition, 2018, pp. 596–605.
ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, [85] X. Wei, Y. Zhang, Y. Gong, N. Zheng, Kernelized subspace pooling
2020, Proceedings, Part IX 16, Springer, 2020, pp. 605–621. for deep local descriptors, in: Proceedings of the IEEE conference on
[66] X. Li, K. Han, S. Li, V. Prisacariu, Dual-resolution correspondence net- computer vision and pattern recognition, 2018, pp. 1867–1875.
works, Advances in Neural Information Processing Systems 33 (2020) [86] K. Lin, J. Lu, C.-S. Chen, J. Zhou, M.-T. Sun, Unsupervised deep learn-
17346–17357. ing of compact binary descriptors, IEEE transactions on pattern analysis
[67] P. Truong, M. Danelljan, R. Timofte, Glu-net: Global-local univer- and machine intelligence 41 (6) (2018) 1501–1514.
sal network for dense flow and correspondences, in: Proceedings of [87] M. Zieba, P. Semberecki, T. El-Gaaly, T. Trzcinski, Bingan: Learning
the IEEE/CVF conference on computer vision and pattern recognition, compact binary descriptors with a regularized gan, Advances in neural
2020, pp. 6258–6268. information processing systems 31 (2018).
[68] P. Truong, M. Danelljan, L. Van Gool, R. Timofte, Learning accurate [88] L. Wei, S. Zhang, H. Yao, W. Gao, Q. Tian, Glad: Global–local-
dense correspondences and when to trust them, in: Proceedings of the alignment descriptor for scalable person re-identification, IEEE Trans-
actions on Multimedia 21 (4) (2018) 986–999. J. Zhong, Semantics lead all: Towards unified image registration and fu-
[89] Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, L. Quan, sion from a semantic perspective, Information Fusion 98 (2023) 101835.
Geodesc: Learning local descriptors by integrating geometry con- [109] D. Mishkin, F. Radenovic, J. Matas, Repeatability is not enough: Learn-
straints, in: Proceedings of the European conference on computer vision ing affine regions via discriminability, in: Proceedings of the European
(ECCV), 2018, pp. 168–183. conference on computer vision (ECCV), 2018, pp. 284–300.
[90] Y. Liu, Z. Shen, Z. Lin, S. Peng, H. Bao, X. Zhou, Gift: Learning [110] P. Truong, S. Apostolopoulos, A. Mosinska, S. Stucky, C. Ciller, S. D.
transformation-invariant dense visual descriptors via group cnns, Ad- Zanet, Glampoints: Greedily learned accurate match points, in: Pro-
vances in Neural Information Processing Systems 32 (2019). ceedings of the IEEE/CVF International Conference on Computer Vi-
[91] J. Lee, Y. Jeong, S. Kim, J. Min, M. Cho, Learning to distill con- sion, 2019, pp. 10732–10741.
volutional features into compact local descriptors, in: Proceedings of [111] Q. Wang, X. Zhou, B. Hariharan, N. Snavely, Learning feature descrip-
the IEEE/CVF Winter Conference on Applications of Computer Vision, tors using camera pose supervision, in: Computer Vision–ECCV 2020:
2021, pp. 898–908. 16th European Conference, Glasgow, UK, August 23–28, 2020, Pro-
[92] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, V. Balntas, Sosnet: Sec- ceedings, Part I 16, Springer, 2020, pp. 757–774.
ond order similarity regularization for local descriptor learning, in: Pro- [112] M. Tyszkiewicz, P. Fua, E. Trulls, Disk: Learning local features with
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern policy gradient, Advances in Neural Information Processing Systems 33
Recognition, 2019, pp. 11016–11025. (2020) 14254–14265.
[93] P. Ebel, A. Mishchuk, K. M. Yi, P. Fua, E. Trulls, Beyond cartesian [113] J. Lee, B. Kim, S. Kim, M. Cho, Learning rotation-equivariant features
representations for local descriptors, in: Proceedings of the IEEE/CVF for visual correspondence, in: Proceedings of the IEEE/CVF Conference
international conference on computer vision, 2019, pp. 253–262. on Computer Vision and Pattern Recognition, 2023, pp. 21887–21897.
[94] Y. Tian, A. Barroso Laguna, T. Ng, V. Balntas, K. Mikolajczyk, Hynet: [114] H. Zhou, T. Sattler, D. W. Jacobs, Evaluating local features for day-night
Learning local descriptor with hybrid similarity measure and triplet loss, matching, in: Computer Vision–ECCV 2016 Workshops: Amsterdam,
Advances in neural information processing systems 33 (2020) 7401– The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III
7412. 14, Springer, 2016, pp. 724–736.
[95] C. Wang, R. Xu, S. Xu, W. Meng, X. Zhang, Cndesc: Cross normal- [115] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg,
ization for local descriptors learning, IEEE Transactions on Multimedia D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, et al., Benchmarking 6dof
(2022). outdoor visual localization in changing conditions, in: Proceedings of
[96] A. Barroso-Laguna, E. Riba, D. Ponsa, K. Mikolajczyk, Key. net: Key- the IEEE conference on computer vision and pattern recognition, 2018,
point detection by handcrafted and learned cnn filters, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 5836–5844.
[97] X. Zhao, X. Wu, J. Miao, W. Chen, P. C. Chen, Z. Li, Alike: Accurate and lightweight keypoint detection and descriptor extraction, IEEE Transactions on Multimedia (2022).
[98] E. Santellani, C. Sormann, M. Rossi, A. Kuhn, F. Fraundorfer, S-trek: Sequential translation and rotation equivariant keypoints for local feature extraction, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9728–9737.
[99] M. Kanakis, S. Maurer, M. Spallanzani, A. Chhatkuli, L. Van Gool, Zippypoint: Fast interest point detection, description, and matching through mixed precision discretization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6113–6122.
[100] J. Tang, H. Kim, V. Guizilini, S. Pillai, A. Rares, Neural outlier rejection for self-supervised keypoint learning, in: 8th International Conference on Learning Representations, ICLR 2020, International Conference on Learning Representations, ICLR, 2020.
[101] Z. Luo, T. Shen, L. Zhou, J. Zhang, Y. Yao, S. Li, T. Fang, L. Quan, Contextdesc: Local descriptor augmentation with cross-modality context, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2527–2536.
[102] C. Wang, R. Xu, Y. Zhang, S. Xu, W. Meng, B. Fan, X. Zhang, Mtldesc: Looking wider to describe better, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 2388–2396.
[103] C. Wang, R. Xu, K. Lv, S. Xu, W. Meng, Y. Zhang, B. Fan, X. Zhang, Attention weighted local descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[104] J. Chen, S. Chen, Y. Liu, X. Chen, X. Fan, Y. Rao, C. Zhou, Y. Yang, Igs-net: Seeking good correspondences via interactive generative structure learning, IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–13.
[105] J. Li, Q. Hu, M. Ai, Rift: Multi-modal image matching based on radiation-variation insensitive feature transform, IEEE Transactions on Image Processing 29 (2019) 3296–3310.
[106] E. Rosten, T. Drummond, Machine learning for high-speed corner detection, in: Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9, Springer, 2006, pp. 430–443.
[107] S. Cui, M. Xu, A. Ma, Y. Zhong, Modality-free feature detector and descriptor for multimodal remote sensing image registration, Remote Sensing 12 (18) (2020) 2937.
[108] H. Xie, Y. Zhang, J. Qiu, X. Zhai, X. Liu, Y. Yang, S. Zhao, Y. Luo,
pp. 8601–8610.
[116] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, A. Torii, Inloc: Indoor visual localization with dense matching and view synthesis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7199–7209.
[117] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[118] Y. Ono, E. Trulls, P. Fua, K. M. Yi, Lf-net: Learning local features from images, Advances in neural information processing systems 31 (2018).
[119] X. Shen, C. Wang, X. Li, Z. Yu, J. Li, C. Wen, M. Cheng, Z. He, Rf-net: An end-to-end image matching network based on receptive field, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8132–8140.
[120] A. Bhowmik, S. Gumhold, C. Rother, E. Brachmann, Reinforced feature points: Optimizing feature detection and description for a high-level task, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4948–4957.
[121] U. S. Parihar, A. Gujarathi, K. Mehta, S. Tourani, S. Garg, M. Milford, K. M. Krishna, Rord: Rotation-robust descriptors and orthographic views for local feature matching, in: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2021, pp. 1593–1600.
[122] A. Barroso-Laguna, Y. Verdie, B. Busam, K. Mikolajczyk, Hdd-net: Hybrid detector descriptor with mutual interactive learning, in: Proceedings of the Asian Conference on Computer Vision, 2020.
[123] Y. Zhang, J. Wang, S. Xu, X. Liu, X. Zhang, Mlifeat: Multi-level information fusion based deep local features, in: Proceedings of the Asian Conference on Computer Vision, 2020.
[124] S. Suwanwimolkul, S. Komorita, K. Tasaka, Learning of low-level feature keypoints for accurate and robust detection, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2262–2271.
[125] X. Wang, Z. Liu, Y. Hu, W. Xi, W. Yu, D. Zou, Featurebooster: Boosting feature descriptors with a lightweight neural network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7630–7639.
[126] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[127] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, L. Quan, Aslfeat: Learning local features of accurate shape and localization, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6589–6598.
[128] B. Fan, J. Zhou, W. Feng, H. Pu, Y. Yang, Q. Kong, F. Wu, H. Liu, Learning semantic-aware local features for long term visual localization, IEEE Transactions on Image Processing 31 (2022) 4842–4855.
[129] F. Xue, I. Budvytis, R. Cipolla, Sfd2: Semantic-guided feature detection and description, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5206–5216.
[130] Y. Tian, V. Balntas, T. Ng, A. Barroso-Laguna, Y. Demiris, K. Mikolajczyk, D2d: Keypoint extraction with describe to detect approach, in: Proceedings of the Asian conference on computer vision, 2020.
[131] K. Li, L. Wang, L. Liu, Q. Ran, K. Xu, Y. Guo, Decoupling makes weakly supervised local feature better, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15838–15848.
[132] Y. Deng, J. Ma, Redfeat: Recoupling detection and description for multimodal feature learning, IEEE Transactions on Image Processing 32 (2022) 591–602.
[133] J. Sun, J. Zhu, L. Ji, Shared coupling-bridge for weakly supervised local feature learning, arXiv preprint arXiv:2212.07047 (2022).
[134] D. Zhang, F. Chen, X. Chen, Dualgats: Dual graph attention networks for emotion recognition in conversations, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 7395–7408.
[135] Z. Li, J. Ma, Learning feature matching via matchable keypoint-assisted graph neural network, arXiv preprint arXiv:2307.01447 (2023).
[136] R. Pautrat, I. Suárez, Y. Yu, M. Pollefeys, V. Larsson, Gluestick: Robust image matching by sticking points and lines together, arXiv preprint arXiv:2304.02008 (2023).
[137] P. Lindenberger, P.-E. Sarlin, M. Pollefeys, Lightglue: Local feature matching at light speed, arXiv preprint arXiv:2306.13643 (2023).
[138] Z. Kuang, J. Li, M. He, T. Wang, Y. Zhao, Densegap: graph-structured dense correspondence learning with anchor points, in: 2022 26th International Conference on Pattern Recognition (ICPR), IEEE, 2022, pp. 542–549.
[139] Y. Cai, L. Li, D. Wang, X. Li, X. Liu, Htmatch: An efficient hybrid transformer based graph neural network for local feature matching, Signal Processing 204 (2023) 108859.
[140] X. Lu, Y. Yan, B. Kang, S. Du, Paraformer: Parallel attention transformer for efficient feature matching, arXiv preprint arXiv:2303.00941 (2023).
[141] Y. Deng, J. Ma, Resmatch: Residual attention learning for local feature matching, arXiv preprint arXiv:2307.05180 (2023).
[142] T. Xie, K. Dai, K. Wang, R. Li, L. Zhao, Deepmatcher: a deep transformer-based network for robust and accurate local feature matching, arXiv preprint arXiv:2301.02993 (2023).
[143] P. Truong, M. Danelljan, L. V. Gool, R. Timofte, Gocor: Bringing globally optimized correspondence volumes into your neural network, Advances in Neural Information Processing Systems 33 (2020) 14278–14290.
[144] J. Xu, R. Ranftl, V. Koltun, Accurate optical flow via direct cost volume processing, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1289–1297.
[145] Z. Teed, J. Deng, Raft: Recurrent all-pairs field transforms for optical flow, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, Springer, 2020, pp. 402–419.
[146] P. Truong, M. Danelljan, R. Timofte, L. Van Gool, Pdc-net+: Enhanced probabilistic dense correspondence network, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[147] J. Revaud, V. Leroy, P. Weinzaepfel, B. Chidlovskii, Pump: Pyramidal and uniqueness matching priors for unsupervised learning of local descriptors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3926–3936.
[148] J. Revaud, P. Weinzaepfel, Z. Harchaoui, C. Schmid, Deepmatching: Hierarchical deformable dense matching, International Journal of Computer Vision 120 (2016) 300–323.
[149] U. Efe, K. G. Ince, A. Alatan, Dfm: A performance baseline for deep feature matching, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4284–4293.
[150] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[151] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in: European conference on computer vision, Springer, 2020, pp. 213–229.
[152] R. Xu, C. Wang, J. Zhang, S. Xu, W. Meng, X. Zhang, Rssformer: Foreground saliency enhancement for remote sensing land-cover segmentation, IEEE Transactions on Image Processing 32 (2023) 1052–1064.
[153] R. Xu, C. Wang, J. Sun, S. Xu, W. Meng, X. Zhang, Self correspondence distillation for end-to-end weakly-supervised semantic segmentation, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
[154] R. Xu, C. Wang, S. Xu, W. Meng, X. Zhang, Dual-stream representation fusion learning for accurate medical image segmentation, Engineering Applications of Artificial Intelligence 123 (2023) 106402.
[155] W. Cong, Y. Cong, J. Dong, G. Sun, H. Ding, Gradient-semantic compensation for incremental semantic segmentation, IEEE Transactions on Multimedia (2023) 1–14, doi:10.1109/TMM.2023.3336243.
[156] W. Cong, Y. Cong, G. Sun, Y. Liu, J. Dong, Self-paced weight consolidation for continual learning, IEEE Transactions on Circuits and Systems for Video Technology (2023) 1–1, doi:10.1109/TCSVT.2023.3304567.
[157] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, K. M. Yi, Cotr: Correspondence transformer for matching across images, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6207–6217.
[158] D. Tan, J.-J. Liu, X. Chen, C. Chen, R. Zhang, Y. Shen, S. Ding, R. Ji, Eco-tr: Efficient correspondences finding via coarse-to-fine refinement, in: European Conference on Computer Vision, Springer, 2022, pp. 317–334.
[159] G. Bökman, F. Kahl, A case for using rotation invariant features in state of the art feature matchers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5110–5119.
[160] S. Tang, J. Zhang, S. Zhu, P. Tan, Quadtree attention for vision transformers, arXiv preprint arXiv:2201.02767 (2022).
[161] Y. Chen, D. Huang, S. Xu, J. Liu, Y. Liu, Guide local feature matching by overlap estimation, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 365–373.
[162] Q. Wang, J. Zhang, K. Yang, K. Peng, R. Stiefelhagen, Matchformer: Interleaving attention in transformers for feature matching, in: Proceedings of the Asian Conference on Computer Vision, 2022, pp. 2746–2762.
[163] J. Ma, Y. Wang, A. Fan, G. Xiao, R. Chen, Correspondence attention transformer: A context-sensitive network for two-view correspondence learning, IEEE Transactions on Multimedia (2022).
[164] K. T. Giang, S. Song, S. Jo, Topicfm: robust and interpretable feature matching with topic-assisted, arXiv preprint arXiv:2207.00328 (2022).
[165] J. Yu, J. Chang, J. He, T. Zhang, J. Yu, F. Wu, Adaptive spot-guided transformer for consistent local feature matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21898–21908.
[166] K. Dai, T. Xie, K. Wang, Z. Jiang, R. Li, L. Zhao, Oamatcher: An overlapping areas-based network for accurate local feature matching, arXiv preprint arXiv:2302.05846 (2023).
[167] C. Cao, Y. Fu, Improving transformer-based image matching by cascaded capturing spatially informative keypoints, arXiv preprint arXiv:2303.02885 (2023).
[168] S. Zhu, X. Liu, Pmatch: Paired masked image modeling for dense geometric matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21909–21918.
[169] J. Chang, J. Yu, T. Zhang, Structured epipolar matcher for local feature matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6176–6185.
[170] J. Edstedt, I. Athanasiadis, M. Wadenbäck, M. Felsberg, Dkm: Dense kernelized feature matching for geometry estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17765–17775.
[171] J. Edstedt, Q. Sun, G. Bökman, M. Wadenbäck, M. Felsberg, Roma: Revisiting robust losses for dense feature matching, arXiv preprint arXiv:2305.15404 (2023).
[172] Q. Zhou, T. Sattler, L. Leal-Taixe, Patch2pix: Epipolar-guided pixel-level correspondences, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4669–4678.
[173] D. Huang, Y. Chen, Y. Liu, J. Liu, S. Xu, W. Wu, Y. Ding, F. Tang, C. Wang, Adaptive assignment for geometry aware local feature matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5425–5434.
[174] J. Ni, Y. Li, Z. Huang, H. Li, H. Bao, Z. Cui, G. Zhang, Pats: Patch area transportation with subdivision for local feature matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17776–17786.
[175] Y. Zhang, X. Zhao, D. Qian, Searching from area to point: A hierarchical framework for semantic-geometric combined feature matching, arXiv preprint arXiv:2305.00194 (2023).
[176] N. Snavely, S. M. Seitz, R. Szeliski, Photo tourism: exploring photo collections in 3d, in: ACM siggraph 2006 papers, 2006, pp. 835–846.
[177] P. Lindenberger, P.-E. Sarlin, V. Larsson, M. Pollefeys, Pixel-perfect structure-from-motion with featuremetric refinement, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 5987–5997.
[178] C. M. Parameshwara, G. Hari, C. Fermüller, N. J. Sanket, Y. Aloimonos, Diffposenet: direct differentiable camera pose estimation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6845–6854.
[179] J. Y. Zhang, D. Ramanan, S. Tulsiani, Relpose: Predicting probabilistic relative rotation for single objects in the wild, in: European Conference on Computer Vision, Springer, 2022, pp. 592–611.
[180] C. Tang, P. Tan, Ba-net: Dense bundle adjustment network, arXiv preprint arXiv:1806.04807 (2018).
[181] X. Gu, W. Yuan, Z. Dai, C. Tang, S. Zhu, P. Tan, Dro: Deep recurrent optimizer for structure-from-motion, arXiv preprint arXiv:2103.13201 (2021).
[182] X. He, J. Sun, Y. Wang, S. Peng, Q. Huang, H. Bao, X. Zhou, Detector-free structure from motion, arXiv preprint arXiv:2306.15669 (2023).
[183] L. H. Hughes, D. Marcos, S. Lobry, D. Tuia, M. Schmitt, A deep learning framework for matching of sar and optical imagery, ISPRS Journal of Photogrammetry and Remote Sensing 169 (2020) 166–179.
[184] Y. Ye, T. Tang, B. Zhu, C. Yang, B. Li, S. Hao, A multiscale framework with unsupervised learning for remote sensing image registration, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15.
[185] S. Chen, J. Chen, Y. Rao, X. Chen, X. Fan, H. Bai, L. Xing, C. Zhou, Y. Yang, A hierarchical consensus attention network for feature matching of remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–11.
[186] Y. Liu, B. N. Zhao, S. Zhao, L. Zhang, Progressive motion coherence for remote sensing image matching, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–13.
[187] Y. Ye, B. Zhu, T. Tang, C. Yang, Q. Xu, G. Zhang, A robust multimodal remote sensing image registration method and system using steerable filters with first- and second-order gradients, ISPRS Journal of Photogrammetry and Remote Sensing 188 (2022) 331–350.
[188] L. H. Hughes, M. Schmitt, L. Mou, Y. Wang, X. X. Zhu, Identifying corresponding patches in sar and optical images with a pseudo-siamese cnn, IEEE Geoscience and Remote Sensing Letters 15 (5) (2018) 784–788.
[189] D. Quan, S. Wang, X. Liang, R. Wang, S. Fang, B. Hou, L. Jiao, Deep generative matching network for optical and sar image registration, in: IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2018, pp. 6215–6218.
[190] N. Merkle, S. Auer, R. Mueller, P. Reinartz, Exploring the potential of conditional adversarial networks for optical and sar image matching, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (6) (2018) 1811–1820.
[191] W. Shi, F. Su, R. Wang, J. Fan, A visual circle based image registration algorithm for optical and sar imagery, in: 2012 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2012, pp. 2109–2112.
[192] A. Zampieri, G. Charpiat, N. Girard, Y. Tarabalka, Multimodal image alignment through a multiscale chain of neural networks with application to remote sensing, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 657–673.
[193] F. Ye, Y. Su, H. Xiao, X. Zhao, W. Min, Remote sensing image registration using convolutional neural network features, IEEE Geoscience and Remote Sensing Letters 15 (2) (2018) 232–236.
[194] T. Wang, G. Zhang, L. Yu, R. Zhao, M. Deng, K. Xu, Multi-mode gf-3 satellite image geometric accuracy verification using the rpc model, Sensors 17 (9) (2017) 2005.
[195] W. Ma, J. Zhang, Y. Wu, L. Jiao, H. Zhu, W. Zhao, A novel two-step registration method for remote sensing images based on deep and local features, IEEE Transactions on Geoscience and Remote Sensing 57 (7) (2019) 4834–4843.
[196] L. Zhou, Y. Ye, T. Tang, K. Nan, Y. Qin, Robust matching for sar and optical images using multiscale convolutional gradient features, IEEE Geoscience and Remote Sensing Letters 19 (2021) 1–5.
[197] S. Cui, A. Ma, L. Zhang, M. Xu, Y. Zhong, Map-net: Sar and optical image matching via image-based convolutional network with attention mechanism and spatial pyramid aggregated pooling, IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–13.
[198] Z. Wang, Y. Ma, Y. Zhang, Review of pixel-level remote sensing image fusion based on deep learning, Information Fusion 90 (2023) 36–58.
[199] Z. Bian, A. Jabri, A. A. Efros, A. Owens, Learning pixel trajectories with multiscale contrastive random walks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6508–6519.
[200] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, M. J. Black, Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12240–12249.
[201] A. W. Harley, Z. Fang, K. Fragkiadaki, Particle video revisited: Tracking through occlusions using point trajectories, in: European Conference on Computer Vision, Springer, 2022, pp. 59–75.
[202] C. Qin, S. Wang, C. Chen, W. Bai, D. Rueckert, Generative myocardial motion tracking via latent space exploration with biomechanics-informed prior, Medical Image Analysis 83 (2023) 102682.
[203] M. Ye, M. Kanski, D. Yang, Q. Chang, Z. Yan, Q. Huang, L. Axel, D. Metaxas, Deeptag: An unsupervised deep learning method for motion tracking on cardiac tagging magnetic resonance images, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7261–7271.
[204] Z. Bian, F. Xing, J. Yu, M. Shao, Y. Liu, A. Carass, J. Woo, J. L. Prince, Drimet: Deep registration-based 3d incompressible motion estimation in tagged-mri with application to the tongue, in: Medical Imaging with Deep Learning, 2023.
[205] T. Fechter, D. Baltas, One-shot learning for deformable medical image registration and periodic motion tracking, IEEE transactions on medical imaging 39 (7) (2020) 2506–2517.
[206] Y. Zhang, X. Wu, H. M. Gach, H. Li, D. Yang, Groupregnet: a groupwise one-shot deep learning-based 4d image registration method, Physics in Medicine & Biology 66 (4) (2021) 045030.
[207] Y. Ji, Z. Zhu, Y. Wei, A one-shot lung 4d-ct image registration method with temporal-spatial features, in: 2022 IEEE Biomedical Circuits and Systems Conference (BioCAS), IEEE, 2022, pp. 203–207.
[208] M. Z. Iqbal, I. Razzak, A. Qayyum, T. T. Nguyen, M. Tanveer, A. Sowmya, Hybrid unsupervised paradigm based deformable image fusion for 4d ct lung image modality, Information Fusion 102 (2024) 102061.
[209] M. Pfandler, P. Stefan, C. Mehren, M. Lazarovici, M. Weigl, Technical and nontechnical skills in surgery: a simulated operating room environment study, Spine 44 (23) (2019) E1396–E1400.
[210] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, P. Suetens, Multimodality image registration by maximization of mutual information, IEEE transactions on Medical Imaging 16 (2) (1997) 187–198.
[211] M. Unberath, C. Gao, Y. Hu, M. Judish, R. H. Taylor, M. Armand, R. Grupp, The impact of machine learning on 2d/3d registration for image-guided interventions: A systematic review and perspective, Frontiers in Robotics and AI 8 (2021) 716007.
[212] S. Jaganathan, M. Kukla, J. Wang, K. Shetty, A. Maier, Self-supervised 2d/3d registration for x-ray to ct image fusion, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2788–2798.
[213] D.-X. Huang, X.-H. Zhou, X.-L. Xie, S.-Q. Liu, Z.-Q. Feng, J.-L. Hao, Z.-G. Hou, N. Ma, L. Yan, A novel two-stage framework for 2d/3d registration in neurological interventions, in: 2022 IEEE International Conference on Robotics and Biomimetics (ROBIO), IEEE, 2022, pp. 266–271.
[214] Y. Pei, Y. Zhang, H. Qin, G. Ma, Y. Guo, T. Xu, H. Zha, Non-rigid craniofacial 2d-3d registration using cnn-based regression, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3, Springer, 2017, pp. 117–125.
[215] M. D. Foote, B. E. Zimmerman, A. Sawant, S. C. Joshi, Real-time 2d-3d deformable registration with deep learning and application to lung radiotherapy targeting, in: Information Processing in Medical Imaging: 26th International Conference, IPMI 2019, Hong Kong, China, June 2–7, 2019, Proceedings 26, Springer, 2019, pp. 265–276.
[216] W. Yu, M. Tannast, G. Zheng, Non-rigid free-form 2d–3d registration using a b-spline-based statistical deformation model, Pattern recognition 63 (2017) 689–699.
[217] P. Li, Y. Pei, Y. Guo, G. Ma, T. Xu, H. Zha, Non-rigid 2d-3d registration using convolutional autoencoders, in: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), IEEE, 2020, pp. 700–704.
[218] G. Dong, J. Dai, N. Li, C. Zhang, W. He, L. Liu, Y. Chan, Y. Li, Y. Xie, X. Liang, 2d/3d non-rigid image registration via two orthogonal x-ray projection images for lung tumor tracking, Bioengineering 10 (2) (2023) 144.
[219] V. Balntas, K. Lenc, A. Vedaldi, K. Mikolajczyk, Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5173–5182.
[220] X. Shen, Z. Cai, W. Yin, M. Müller, Z. Li, K. Wang, X. Chen, C. Wang, Gim: Learning generalizable image matcher from internet videos, arXiv preprint arXiv:2402.11095 (2024).
[221] Cvpr 2020 image matching challenge, accessed March 1, 2024. URL https://www.cs.ubc.ca/research/image-matching-challenge/2020/
[222] Cvpr 2021 image matching challenge, accessed March 1, 2024. URL https://www.cs.ubc.ca/research/image-matching-challenge/current/
[223] Cvpr 2022 image matching challenge, accessed March 1, 2024. URL https://www.kaggle.com/competitions/image-matching-challenge-2022/overview
[224] Cvpr 2023 image matching challenge, accessed March 1, 2024. URL https://www.kaggle.com/competitions/image-matching-challenge-2023/overview
[225] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, M. Nießner, Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839.
[226] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, L.-J. Li, Yfcc100m: The new data in multimedia research, Communications of the ACM 59 (2) (2016) 64–73.
[227] Z. Li, N. Snavely, Megadepth: Learning single-view depth prediction from internet photos, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2041–2050.
[228] D. Mishkin, J. Matas, M. Perdoch, Mods: Fast and robust method for two-view matching, Computer vision and image understanding 141 (2015) 81–93.
[229] D. Mishkin, J. Matas, M. Perdoch, K. Lenc, Wxbs: Wide baseline stereo generalizations, arXiv preprint arXiv:1504.06603 (2015).
[230] T. Sattler, T. Weyand, B. Leibe, L. Kobbelt, Image retrieval for image-based localization revisited, in: BMVC, Vol. 1, 2012, p. 4.
[231] W. Maddern, G. Pascoe, C. Linegar, P. Newman, 1 year, 1000 km: The oxford robotcar dataset, The International Journal of Robotics Research 36 (1) (2017) 3–15.
[232] P.-E. Sarlin, M. Dusmanu, J. L. Schönberger, P. Speciale, L. Gruber, V. Larsson, O. Miksik, M. Pollefeys, Lamar: Benchmarking localization and mapping for augmented reality, in: European Conference on Computer Vision, Springer, 2022, pp. 686–704.
[233] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: The kitti dataset, The International Journal of Robotics Research 32 (11) (2013) 1231–1237.
[234] J. Xiao, A. Owens, A. Torralba, Sun3d: A database of big spaces reconstructed using sfm and object labels, in: Proceedings of the IEEE international conference on computer vision, 2013, pp. 1625–1632.
[235] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, A. B. Dahl, Large-scale data for multiple-view stereopsis, International Journal of Computer Vision 120 (2016) 153–168.
[236] A. Knapitsch, J. Park, Q.-Y. Zhou, V. Koltun, Tanks and temples: Benchmarking large-scale scene reconstruction, ACM Transactions on Graphics (ToG) 36 (4) (2017) 1–13.
[237] J. L. Schonberger, H. Hardmeier, T. Sattler, M. Pollefeys, Comparative evaluation of hand-crafted and learned local features, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1482–1491.
[238] K. Wilson, N. Snavely, Robust global translations with 1dsfm, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13, Springer, 2014, pp. 61–75.
[239] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, A. Geiger, A multi-view stereo benchmark with high-resolution images and multi-camera videos, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3260–3269.
[240] D. Marelli, L. Morelli, E. M. Farella, S. Bianco, G. Ciocca, F. Remondino, Enrich: Multi-purpose dataset for benchmarking in computer vision and photogrammetry, ISPRS Journal of Photogrammetry and Remote Sensing 198 (2023) 84–98.
[241] C. Schmid, R. Mohr, C. Bauckhage, Evaluation of interest point detectors, International Journal of computer vision 37 (2) (2000) 151–172.
[242] C. B. Choy, J. Gwak, S. Savarese, M. Chandraker, Universal correspondence network, Advances in neural information processing systems 29 (2016).
[243] I. Melekhov, A. Tiulpin, T. Sattler, M. Pollefeys, E. Rahtu, J. Kannala, Dgc-net: Dense geometric correspondence network, in: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, pp. 1034–1042.
[244] X. Shen, F. Darmon, A. A. Efros, M. Aubry, Ransac-flow: generic two-stage image alignment, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, Springer, 2020, pp. 618–637.
[245] P.-E. Sarlin, C. Cadena, R. Siegwart, M. Dymczyk, From coarse to fine: Robust hierarchical localization at large scale, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12716–12725.
[246] X. Nan, L. Ding, Learning geometric feature embedding with transformers for image matching, Sensors 22 (24) (2022) 9882.
[247] R. Mao, C. Bai, Y. An, F. Zhu, C. Lu, 3dg-stfm: 3d geometric guided student-teacher feature matching, in: European Conference on Computer Vision, Springer, 2022, pp. 125–142.
[248] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, P. Fua, Learning to find good correspondences, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2666–2674.
[249] O. Wiles, S. Ehrhardt, A. Zisserman, Co-attention for conditioned image matching, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15920–15929.
[250] R. Arandjelović, A. Zisserman, Three things everyone should know to improve object retrieval, in: 2012 IEEE conference on computer vision and pattern recognition, IEEE, 2012, pp. 2911–2918.
[251] I. Melekhov, G. J. Brostow, J. Kannala, D. Turmukhambetov, Image stylization for robust features, arXiv preprint arXiv:2008.06959 (2020).
[252] Y. Zhou, H. Fan, S. Gao, Y. Yang, X. Zhang, J. Li, Y. Guo, Retrieval and localization with observation constraints, in: 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2021, pp. 5237–5244.
[253] M. Humenberger, Y. Cabon, N. Guerin, J. Morat, V. Leroy, J. Revaud, P. Rerole, N. Pion, C. de Souza, G. Csurka, Robust image retrieval-based visual localization using kapture, arXiv preprint arXiv:2007.13867 (2020).
[254] H. Germain, G. Bourmaud, V. Lepetit, S2dnet: Learning image features for accurate sparse-to-dense matching, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, Springer, 2020, pp. 626–643.
[255] Y. Zhao, H. Zhang, P. Lu, P. Li, E. Wu, B. Sheng, Dsd-matchingnet: Deformable sparse-to-dense feature matching for learning accurate correspondences, Virtual Reality & Intelligent Hardware 4 (5) (2022) 432–443.
[256] L. Chen, C. Heipke, Deep learning feature representation for image matching under large viewpoint and viewing direction change, ISPRS Journal of Photogrammetry and Remote Sensing 190 (2022) 94–112.
[257] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
[258] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, A. Torralba, Scene parsing through ade20k dataset, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633–641.
[259] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything, arXiv preprint arXiv:2304.02643 (2023).
[260] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging properties in self-supervised vision transformers, in: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
[261] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., Dinov2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023).
[262] Anonymous, Towards seamless adaptation of pre-trained models for visual place recognition, in: Submitted to The Twelfth International Conference on Learning Representations, 2023, under review. URL https://openreview.net/forum?id=TVg6hlfsKa
[263] X. Jiang, Y. Xia, X.-P. Zhang, J. Ma, Robust image matching via local graph structure consensus, Pattern Recognition 126 (2022) 108588.
[264] J. Ma, J. Zhao, J. Jiang, H. Zhou, X. Guo, Locality preserving matching, International Journal of Computer Vision 127 (2019) 512–531.
[265] R. Raguram, O. Chum, M. Pollefeys, J. Matas, J.-M. Frahm, Usac: A universal framework for random sample consensus, IEEE transactions on pattern analysis and machine intelligence 35 (8) (2012) 2022–2038.
[266] D. Barath, J. Noskova, M. Ivashechkin, J. Matas, Magsac++, a fast, reliable and accurate robust estimator, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 1304–1312.
[267] X. Li, Z. Hu, Rejecting mismatches by correspondence function, International Journal of Computer Vision 89 (2010) 1–17.
[268] W. Sun, W. Jiang, E. Trulls, A. Tagliasacchi, K. M. Yi, Acne: Attentive context normalization for robust permutation-equivariant learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11286–11295.
[269] C. Zhao, Z. Cao, C. Li, X. Li, J. Yang, Nm-net: Mining reliable neighbors for robust feature correspondences, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 215–224.
[270] J. Chen, X. Chen, S. Chen, Y. Liu, Y. Rao, Y. Yang, H. Wang, D. Wu, Shape-former: Bridging cnn and transformer via shapeconv for multimodal image matching, Information Fusion 91 (2023) 445–457.
[271] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, C. Rother, Dsac-differentiable ransac for camera localization, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6684–6692.
[272] L. Cavalli, D. Barath, M. Pollefeys, V. Larsson, Consensus-adaptive ransac, arXiv preprint arXiv:2307.14030 (2023).
[273] J. Chen, S. Chen, X. Chen, Y. Yang, L. Xing, X. Fan, Y. Rao, Lsv-anet: Deep learning on local structure visualization for feature matching, IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–18.
[274] F. Bellavia, L. Morelli, F. Menna, F. Remondino, Image orientation with a hybrid pipeline robust to rotations and wide-baselines, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 46 (2022) 73–80.
[275] F. Bellavia, D. Mishkin, Harrisz+: Harris corner selection for next-gen image matching pipelines, Pattern Recognition Letters 158 (2022) 141–147.
[276] F. Bellavia, Image matching by bare homography, IEEE Transactions on Image Processing (2024).
[277] F. Maiwald, C. Lehmann, T. Lazariv, Fully automated pose estimation of historical images in the context of 4d geographic information systems utilizing machine learning methods, ISPRS International Journal of Geo-Information 10 (11) (2021) 748.
[278] L. Morelli, F. Bellavia, F. Menna, F. Remondino, Photogrammetry now and then–from hand-crafted to deep-learning tie points–, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 48 (2022) 163–170.
[279] F. Maiwald, H.-G. Maas, An automatic workflow for orientation of historical images with large radiometric and geometric differences, The Photogrammetric Record 36 (174) (2021) 77–103.