0% found this document useful (0 votes)

28 views30 pages

Local Feature Matching Using Deep Learning - A Survey

This survey provides a comprehensive overview of local feature matching methods in computer vision, categorizing them into Detector-based and Detector-free approaches. It highlights the advancements brought by deep learning models, evaluates prevalent datasets and metrics, and discusses practical applications across various domains such as 3D reconstruction and medical imaging. The paper also identifies current challenges and suggests future research directions to enhance the accuracy and robustness of feature matching techniques.

Uploaded by

october2363

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

28 views30 pages

Local Feature Matching Using Deep Learning - A Survey

Uploaded by

october2363

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 30

Local Feature Matching Using Deep Learning: A Survey

Shibiao Xua , Shunpeng Chena , Rongtao Xub , Changwei Wangb , Peng Lua , Li Guoa
a School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
b The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, China

Abstract
Local feature matching enjoys wide-ranging applications in the realm of computer vision, encompassing domains such as im-
age retrieval, 3D reconstruction, and object recognition. However, challenges persist in improving the accuracy and robustness
of matching due to factors like viewpoint and lighting variations. In recent years, the introduction of deep learning models has
arXiv:2401.17592v2 [cs.CV] 11 Mar 2024

sparked widespread exploration into local feature matching techniques. The objective of this endeavor is to furnish a comprehen-
sive overview of local feature matching methods. These methods are categorized into two key segments based on the presence of
detectors. The Detector-based category encompasses models inclusive of Detect-then-Describe, Joint Detection and Description,
Describe-then-Detect, as well as Graph Based techniques. In contrast, the Detector-free category comprises CNN Based, Trans-
former Based, and Patch Based methods. Our study extends beyond methodological analysis, incorporating evaluations of prevalent
datasets and metrics to facilitate a quantitative comparison of state-of-the-art techniques. The paper also explores the practical ap-
plication of local feature matching in diverse domains such as Structure from Motion, Remote Sensing Image Registration, and
Medical Image Registration, underscoring its versatility and significance across various fields. Ultimately, we endeavor to outline
the current challenges faced in this domain and furnish future research directions, thereby serving as a reference for researchers
involved in local feature matching and its interconnected domains. A comprehensive list of studies in this survey is available at
https://github.com/vignywang/Awesome-Local-Feature-Matching.
Keywords: Local Feature Matching, Image Matching, Deep Learning, Survey.

1. Introduction In conventional image matching pipelines, the process can

be decomposed into four fundamental steps: feature detec-
In the field of image processing, the core objective of lo- tion, feature description, feature matching, and geometric trans-
cal feature matching tasks is to establish precise feature corre- formation estimation. Prior to the advent of deep learn-
spondences between different images. This encompasses vari- ing [23, 24, 25], many celebrated algorithms were tailored to
ous types of image features, such as keypoints, feature regions, focus primarily on one or several stages within this pipeline.
straight lines, and curves, among others. Establishing corre- Various techniques were committed to the process of feature de-
spondences between similar features in different images serves tection [26, 27, 28, 29], while others were honed on locally exe-
as the foundation for many computer vision tasks, including cuting the task of feature description [30, 31, 32]. Additionally,
image fusion [1, 2, 3, 4, 5], visual localization [6, 7, 8, 9], certain algorithms have been devised to cater to both feature de-
Structure from Motion (SfM) [10, 11, 12, 13], Simultaneous tection and description [33, 34, 35, 36, 37, 38]. For the purpose
Localization and Mapping (SLAM) [14, 15, 16], optical flow of feature matching, conventional approaches typically relied
estimation [17, 18, 19], image retrieval [20, 21, 22], and more. on either minimizing or maximizing specific well-established
Owing to influences such as scale transformations, view- metrics, such as the Sum of Squared Differences or correlation.
point diversity, shifts in illumination, pattern recurrences, and During the stage of geometric transformation estimation, algo-
texture variations, the depiction of an identical physical space rithms were generally employed on the basis of techniques akin
within distinct images may exhibit substantial divergence. For to RANSAC [39] to estimate the underlying epipolar geome-
instance, Figure 1 provides a visual representation of the perfor- try or homographies. Both traditional handcrafted methods and
mance of several popular deep learning models engaged in local learning-centric approaches that were constructed upon low-
image matching tasks. Nevertheless, guaranteeing the estab- level image features like gradients and sequences of grayscale.
lishment of precise correspondences between distinct images Despite being theoretically resilient to certain forms of trans-
mandates surmounting manifold perplexities and challenges, formations, these techniques were inherently restricted by the
engendered by the aforementioned factors. Consequently, the inherent prior knowledge imposed by researchers on their tasks.
quest for accuracy and dependability in local feature matching In recent years, substantial advancements have been real-
continues to be a formidable problem beset with intricacies. ized in addressing the challenges associated with local fea-
ture matching [40, 41, 42], especially those posed by scale
1 Rongtao
variations, shifts in viewpoint, and other forms of diversities.
Xu is the corresponding author.

Preprint submitted to Information Fusion

(a) SuperPoint+LightGlue (b) ALIKED+LightGlue (c) SIFT+LightGlue (d) DISK+LightGlue

(e) SuperPoint+SGMNet (f) Aspanformer (g) DISK+NN (h) LoFTR

Figure 1: Matching results for outdoor images. It can be observed that for images with significant variations in viewpoint and lighting conditions, the matching task
encounters considerable challenges.

The extant methods of image matching can be sorted into two arate keypoint detection and feature description stages by tap-
major categories: Detector-based and Detector-free methods. ping into the rich contextual information prevalent within the
Detector-based methods hinge on the detection and description images. These methods enable end-to-end image matching,
of sparsely distributed keypoints in order to establish matches thereby offering a distinct mechanism to tackle the task.
between images. The efficacy of these methods is largely de- Image matching plays a pivotal role in the domain of im-
pendent on the performance of keypoint detectors and feature age registration, where it contributes significantly by enabling
descriptors, given their significant role in the process. Con- the precise fitting of transformation functions through a reli-
trastingly, Detector-free methods sidestep the necessity for sep- able set of feature matches. This functionality positions im-

2
age matching as a crucial area of study within the broader con- deep learning. Although recent surveys [52, 53, 54] have in-
text of image fusion [43]. To coherently encapsulate the evo- corporated trainable methods, they have failed to timely sum-
lution of the local feature matching domain and stimulate in- marize the plethora of literature that has emerged in the past
novative research avenues, this paper presents an exhaustive re- five years. Furthermore, many are limited to specific aspects of
view and thorough analysis of the latest progress in local fea- image matching within the field, such as some articles introduc-
ture matching, particularly emphasizing the use of deep learning only the feature detection and description methods of local
ing algorithms. In addition, we re-examine pertinent datasets features, but not including matching [52], some particularly fo-
and evaluation criteria, and conduct detailed comparative analy- cusing on matching of cultural heritage images [55], and others
ses of key methodologies. Our investigation addresses both the solely concentrating on medical image registration [56, 57, 58],
gap and potential bridging between traditional manual meth- remote sensing image registration [59, 60], and so on. In this
ods and modern deep learning technologies. We emphasize survey, our goal is to provide the most recent and compre-
the ongoing relevance and collaboration between these two ap- hensive overview by assessing the existing methods of image
proaches by analyzing the latest developments in traditional matching, particularly the state-of-the-art learning-based ap-
manual methods alongside deep learning techniques. Further, proaches. Importantly, we not only discuss the existing meth-
we address the emerging focus on multi-modal images. This ods that serve natural image applications, but also the wide ap-
includes a detailed overview of methods specifically tailored plication of feature matching in SfM, remote sensing images,
for multi-modal image analysis. Our survey also identifies and and medical images. We illustrate the close connection of this
discusses the gaps and future needs in existing datasets for eval- research with the field of information fusion through a detailed
uating local feature matching methods, highlighting the impor- discussion on the matching of multimodal images. Addition-
tance of adapting to diverse and dynamic scenarios. In keep- ally, we have conducted a thorough examination and analysis of
ing with current trends, we examine the role of large foun- the recent mainstream methods, discussions that are evidently
dation models in feature matching. These models represent a missing in the existing literature. Figure 2 showcases a rep-
significant shift from traditional semantic segmentation mod- resentative timeline of local feature matching methodologies,
els [44, 45, 46, 47, 48], offering superior generalization capa- which provides insights on the evolution of these methods and
bilities for a wide array of scenes and objects. their pivotal contributions towards spearheading advancements
In summary, some of the key contributions of this survey in the field.
can be summarized as follows:
• This survey extensively covers the literature on contem- 2. Detector-based Models
porary local feature matching problems and provides a
detailed overview of various local feature matching al- Detector-based methodologies have been the prevailing ap-
gorithms proposed since 2018. Following the preva- proach for local feature matching for a considerable dura-
lent image matching pipeline, we primarily categorize tion. Numerous well-established handcrafted works, including
these methods into two major classes: Detector-based SIFT [33] and ORB [35], have been broadly adopted for varied
and Detector-free, and provide a comprehensive review tasks within the field of 3D computer vision [74, 75]. These tra-
of matching algorithms employing deep learning. ditional Detector-based methodologies typically comprise three
primary stages: feature detection, feature description, and fea-
• We scrutinize the deployment of these methodologies in a ture matching. Initially, a set of sparse key points is extracted
myriad of real-world scenarios, encompassing SfM, Re- from the images. Subsequently, in the feature description stage,
mote Sensing Image Registration, and Medical Image these key points are characterized using high-dimensional vec-
Registration. This investigation highlights the versatil- tors, often designed to encapsulate the specific structure and in-
ity and extensive applicability inherent in local feature formation of the region surrounding these points. Lastly, during
matching techniques. the feature matching stage, correspondences at a pixel level are
• We start from relevant computer vision tasks, review the established through mechanisms like nearest-neighbor searches
major datasets involved in local feature matching, and or more complex matching algorithms. Notable among these
classify them according to different tasks to delve into are GMS (Grid-based Motion Statistics) by Bian et al.[76] and
specific research requirements within each domain. OANET (Order-Aware Network) by Zhang et al.[77]. GMS
enhances feature correspondence quality using grid-based mo-
• We analyze various metrics used for performance evalu- tion statistics, simplifying and accelerating matching, while
ation and conduct a quantitative comparison of key local OANET innovatively optimizes two-view matching by inte-
feature matching methods. grating spatial contexts for precise correspondence and geom-
• We present a series of challenges and future research di- etry estimation. This is typically done by comparing the high-
rections, offering valuable guidance for further advance- dimensional vectors of keypoints between different images and
ments in this field. identifying matches based on the level of similarity – often de-
fined by a distance function in the vector space.
It is important to note that the initial surveys [49, 50, 51] However, in the era of deep learning, the rise of data-driven
primarily focused on manual methods, hence they do not pro- methods has made approaches like LIFT [78] popular. These
vide sufficient reference points for research centered around
3
NCNet D2Net GLU-Net COTR OETR ASTR
DOAP R2D2 GOCor LoFTR TopicFM CasMTR

GLAD SOSNet DRC-Net PDC-Net Aspanformer PMatch

KSP Key.Net Sparse-NCNet Patch2Pix PUMP DKM
Bingan GIFT SuperGlue SGMNet ClusterGNN LightGlue
Superpoint ContextDesc ASLFeat COLD MTLDesc AWDesc
LF-Net Glampoints DISK LLF ALIKE FeatureBooster
Geodesc RF-Net CAPS RoRD SeLF SFD2

2018 2019 2020 2021 2022 2023

Figure 2: Representative local feature matching methods. Blue and gray represent Detector-based Models, where gray represents the Graph Based method. The
yellow and green blocks represent the CNN Based and Transformer Based methods in Detector-free Models, respectively.In 2018, Superpoint [61] pioneered the
computation of keypoints and descriptors within a single network. Subsequently, numerous works such as D2Net [62], R2D2 [63], and others attempted to integrate
keypoint detection and description for matching purposes. Concurrently, the NCNet [64] method introduced four-dimensional cost volumes into local feature
matching, initiating a trend in utilizing correlation-based or cost volume-based convolutional neural networks for Detector-free matching research. Building upon
this trend, methods like Sparse-NCNet [65], DRC-Net [66], GLU-Net [67], and PDC-Net [68] emerged. In 2020, SuperGlue [69] framed the task as a graph matching
problem involving two sets of features. Following this, SGMNet [70] and ClusterGNN [71] focused on improving the graph matching process by addressing the
complexity of matching. In 2021, approaches such as LoFTR [72] and Aspanformer [73] successfully incorporated Transformer or Attention mechanisms into
the Detector-free matching process. They achieved this by employing interleaved self and cross-attention modules, significantly expanding the receptive field and
further advancing deep learning-based matching techniques.

methods leverage CNNs to extract more robust and discrimi- 2.1. Detect-then-Describe
native keypoint descriptors, resulting in significant progress in In feature matching methodologies, the adoption of sparse-
handling large viewpoint changes and local feature illumina- to-sparse feature matching is rather commonplace. These meth-
tion variations. Currently, Detector-based methods can be cate- ods adhere to a ’detect-then-describe’ paradigm, where the pri-
gorized into four main classes: 1. Detect-then-Describe meth- mary step involves the detection of keypoint locations. The
ods; 2. Joint Detection and Description methods; 3. Describe- detector subsequently extracts feature descriptors from patches
then-Detect methods; 4. Graph Based methods. Additionally, centered on each detected keypoint. These descriptors are then
we further subdivide Detect-then-Describe methods based on relayed to the feature description stage. This procedure is typ-
the type of supervised learning into Fully-Supervised methods, ically trained utilizing metric learning methods, which aim to
Weakly Supervised methods, and Other forms of Supervision learn a distance function where similar points are close and
methods. This classification is visually depicted in Figure 3. dissimilar points are distant in the feature space. To enhance
efficiency, feature detectors often focus on small image re-
Local Feature Matching Models Using Deep Learning
gions [78], generally emphasizing low-level structures such as
corners [26] or blobs [33]. The descriptors, on the other hand,
aim to capture more nuanced, higher-level information within
Detector-based Models Detector-free Models larger patches encompassing the keypoints. Providing verbosity
and distinctive details, these descriptors serve as the defining
Detect-then-Describe CNN Based features for matching purposes. Figure 4(a) illustrates the com-
• Fully-Supervised mon structure of the Detect-then-Describe pipeline.
• Weakly Supervised and Others Transformer Based
2.1.1. Fully-Supervised
Joint Detection and Description Patch Based The field of local feature matching has undergone a remark-
able transformation, primarily driven by the advent of annotated
Describe-then-Detect patch datasets [79] and the integration of deep learning tech-
nologies. This transformation marks a departure from tradi-
Graph Based tional handcrafted methods to more data-driven methodologies,
reshaping the landscape of feature matching. This section aims
Figure 3: Overview of the Local Feature Matching Models and taxonomy of to trace the historical development of these changes, emphasiz-
the most relevant approaches. ing the sequential progression and the interconnected nature of
various fully supervised methods. At the forefront of this evo-
lution are CNNs, which have been pivotal in revolutionizing the
4
tion of a subspace pooling methodology, leveraging CNNs to
learn invariant and discriminative descriptors. DeepBit [86] of-
fers an unsupervised deep learning framework to learn compact
binary descriptors. It encodes crucial properties like rotation,
translation, and scale invariance of local descriptors into bi-
nary representations. Bingan [87] proposes a method to learn
Describe Detect
compact binary image descriptors using regularized Generative
Adversarial Networks (GANs). GLAD [88] addresses the Per-
son Re-Identification task by considering both local and global
Detect&Describe
cues from human bodies. A four-stream CNN framework is
Detect Describe
implemented to generate discriminative and robust descriptors.
Geodesc [89] advances descriptor computation by integrating
(a) Detect-then-Describe (b) Joint Detection and Description (c) Describe-then-Detect geometric constraints from SfM algorithms. This approach em-
phasizes two aspects: first, the construction of training data us-
Figure 4: The comparison of various prominent Detector-based pipelines for ing geometric information to measure sample hardness, where
trainable local feature matching is presented. Here, the categorization is based
hardness is defined by the variability between pixel blocks of
on the relationship between the detection and description steps: (a) Detect-
then-Describe framework, (b) Joint Detection and Description framework, and the same 3D point and uniformity for different points. Sec-
(c) Describe-then-Detect framework. ond, a geometric similarity loss function is devised, promoting
closeness among pixel blocks corresponding to the same 3D
point. These innovations enable Geodesc to significantly en-
process of descriptor learning. By enabling end-to-end learning hance descriptor effectiveness in 3D reconstruction tasks. For
directly from raw local patches, CNNs have facilitated the con- GIFT [90] and COLD [91], the former underscores the impor-
struction of a hierarchy of local features. This capability has al- tance of incorporating underlying structural information from
lowed CNNs to capture complex patterns in data, leading to the group features to construct potent descriptors. Through the uti-
creation of more specialized and distinct descriptors that signif- lization of group convolutions, GIFT generates dense descrip-
icantly enhance the matching process. This revolutionary shift tors that exhibit both distinctiveness and invariance to trans-
was largely influenced by innovative models like L2Net [80], formation groups. In contrast, COLD introduces a novel ap-
which pioneered a progressive sampling strategy. L2Net’s approach through a multi-level feature distillation network archi-
proach accentuated the relative distances between descriptors tecture. This architecture leverages intermediate layers of Ima-
while applying additional supervision to intermediate feature geNet pre-trained convolutional neural networks to encapsulate
maps. This strategy significantly contributed to the develop- hierarchical features, ultimately extracting highly compact and
ment of robust descriptors, setting a new standard in descriptor robust local descriptors.
learning. Advancing the narrative, our exploration extends to recent
The shift towards these data-driven methodologies, under- strides in fully-supervised methodologies, constituting a note-
pinned by CNNs, has not only improved the accuracy and ef- worthy augmentation of the repertoire of local feature match-
ficiency of local feature matching but has also opened new av- ing capabilities. These pioneering approaches, building upon
enues for research and innovation in this area. As we explore the foundational frameworks expounded earlier, synergistically
the chronological advancements in this field, we observe a clear elevate and finesse the methodologies that underpin the field.
trajectory of growth and refinement, moving from the conven- Continuing the trend of enhancing descriptor robustness, SOS-
tional to the contemporary, each method building upon the suc- Net [92] extends HardNet by introducing a second-order simi-
cesses of its predecessors while introducing novel concepts and larity regularization term for descriptor learning. This enhance-
techniques. OriNet [81] present a method using CNNs to as- ment involves integrating second-order similarity constraints
sign a canonical orientation to feature points in an image, en- into the training process, thereby augmenting the performance
hancing feature point matching. They introduce a Siamese net- of learning robust descriptors. The term ”second-order simi-
work [82] training approach that eliminates the need for pre- larity” denotes a metric that evaluates the consistency of rel-
defined orientations and propose a novel GHH activation func- ative distances among descriptor pairs in a training batch. It
tion, showing significant performance improvements in feature measures the similarity between a descriptor pair not only di-
descriptors across multiple datasets. Building on the architec- rectly but also by considering their relative distances to other
tural principles of L2Net, HardNet [83] streamlined the learn- descriptor pairs within the same batch. Ebel et al. [93] pro-
ing process by focusing on metric learning and eliminating poses a local feature descriptor based on a log-polar sampling
the need for auxiliary loss terms, setting a precedent for sub- scheme to achieve scale invariance. This unique approach al-
sequent models to simplify learning objectives. DOAP [84] lows for keypoint matching across different scales and exhibits
shifted the focus to a learning-to-rank formulation, optimiz- less sensitivity to occlusion and background motion. Thus, it
ing local feature descriptors for nearest-neighbor matching, a effectively utilizes a larger image region to improve perfor-
methodology that found success in specific matching scenar- mance. To design a better loss function, HyNet [94] intro-
ios and influenced later models to consider ranking-based ap- duces a mixed similarity measure for triplet margin loss and
proaches. The KSP [85] method is notable for its introduc- implements a regularization term to constrain descriptor norms,
5
thus establishing a balanced and effective learning framework. ferent modal image types [104], there is an emerging focus on
CNDesc [95] also investigates L2 normalization, presenting an frequency-domain-based feature descriptors. These descriptors
innovative dense local descriptor learning approach. It uses a exhibit improved proficiency in matching cross-modal images.
special cross-normalization technique instead of L2 normaliza- For instance, RIFT [105] utilizes FAST [106] for extracting
tion, introducing a new way of normalizing the feature vectors. repeatable feature points on the phase congruency (PC) map,
Key.Net [96] proposes a keypoint detector that combines hand- subsequently constructing robust descriptors using frequency
crafted and learned CNN features, and uses scale space repre- domain information to tackle the challenges in multimodal im-
sentation in the network to extract keypoints at different levels. age feature matching. Building on RIFT, SRIFT [107] fur-
To address the non-differentiability issue in keypoint detection ther refines this approach by establishing a nonlinear diffusion
methods, ALIKE [97] offers a differentiable keypoint detection scale (NDS) space, thus constructing a multiscale space that not
(DKD) module based on the score map. In contrast to methods only achieves scale and rotation invariance but also addresses
relying on non-maximum suppression (NMS), DKD can back- the issue of slow inference speeds associated with RIFT. With
propagate gradients and produce keypoints at subpixel levels. the evolution of deep learning technologies, depth-based meth-
This enables the direct optimization of keypoint locations. S- ods have demonstrated significant prowess in feature extraction.
TREK [98] introduces an advanced local feature extractor that SemLA [108] uses semantic guidance in its registration and fu-
combines a translation and rotation equivariant keypoint de- sion processes. The feature matching is limited to the semantic
tector with a lightweight descriptor extractor. Trained through sensing area, so as to provide the most accurate registration ef-
a reinforcement learning-inspired framework to optimize key- fect for image fusion tasks.
point repeatability, S-TREK achieves remarkable performance
in repeatability and pose recovery across multiple benchmarks, 2.1.2. Weakly Supervised and Others
especially excelling in scenarios with in-plane rotations. Zip- Weakly supervised learning presents opportunities for mod-
pyPoint [99] is designed based on KP2D [100], introduces an els to learn robust features without requiring densely anno-
entire set of accelerated extraction and matching techniques. tated labels, offering a solution to one of the largest chal-
This method suggests the use of a binary descriptor normaliza- lenges in training deep learning models. Several weakly su-
tion layer, thereby enabling the generation of unique, length- pervised local feature learning methods have emerged, leverag-
invariant binary descriptors. ing easily obtainable geometric information from camera poses.
Implementing contextual information into feature descrip- AffNet [109] represents a key advancement in weakly super-
tors has been a rising trend in the advancement of local fea- vised local feature learning, focusing on the learning of affine
ture matching methods. ContextDesc [101] introduces context shape of local features. This method challenges the traditional
awareness to improve off-the-shelf local feature descriptors. It emphasis on geometric repeatability, showing that it is insuf-
encodes both geometric and visual contexts by using keypoint ficient for reliable feature matching and stressing the impor-
locations, raw local features, and high-level regional features tance of descriptor-based learning. AffNet introduces a hard
as inputs. A novel aspect of its training process is the use of negative-constant loss function to improve the matchability and
an N-pair loss, which is self-adaptive and requires no param- geometric accuracy of affine regions. This has proven effective
eter tuning. This dynamic loss function can allow for a more in enhancing the performance of affine-covariant detectors, es-
efficient learning process. MTLDesc [102] offers a strategy to pecially in wide baseline matching and image retrieval. The ap-
address the inherent locality issue confronted in the domains of proach underscores the need to consider both descriptor match-
convolutional neural networks. This is attained by introducing ability and repeatability for developing more effective local fea-
an adaptive global context enhancement module and multiple ture detectors. GLAMpoints [110] presents a semi-supervised
local context enhancement modules to inject non-local contex- keypoint detection method, creatively drawing insights from re-
tual information. By adding these non-local connections, it can inforcement learning loss formulations. Here, rewards are used
efficiently learn high-level dependencies between distant fea- to calculate the significance of detecting keypoints based on the
tures. Building upon MTLDesc, AWDesc [103] seeks to trans- quality of the final alignment. This method has been noted
fer knowledge from a larger, more complex model (teacher) to to significantly impact the matching and registration quality
a smaller and simpler one (student). This approach leverages of the final images. CAPS [111] introduces a weakly super-
the knowledge learned by the teacher, while enabling signifi- vised learning framework that utilizes the relative camera poses
cantly faster computations with the student, allowing the model between image pairs to learn feature descriptors. By employ-
to achieve an optimal balance between accuracy and speed. The ing epipolar geometric constraints as supervision signals, they
focus on context-awareness in these methods emphasizes the designed differentiable matching layers and a coarse-to-fine
importance of considering more global information when de- architecture, resulting in the generation of dense descriptors.
scribing local features. Each method leverages this information DISK [112] maximizes the potential of reinforcement learn-
in a slightly different way, leading to diverse but potentially ing to integrate weakly supervised learning into an end-to-end
complementary approaches for tackling the challenge of fea- Detector-based pipeline using policy gradients. This integra-
ture matching. tive approach of weak supervision with reinforcement learning
In light of the limitations inherent in traditional image fea- can provide more robust learning signals and achieve effective
ture descriptors (like gradients, grayscale, etc.), which struggle optimization. [113] proposes a group alignment approach that
to handle the geometric and radiometric disparities across dif- leverages the power of group-equivariant CNNs. These CNNs
6
are efficient in extracting discriminative rotation-invariant local truth key points for these images are generated through vari-
descriptors. The authors use a self-supervised loss for better ous homographic transformations, and key-point extraction is
orientation estimation and efficient local descriptor extraction. performed using the MagicPoint model. This strategy, which
Weakly and semi-supervised methods using camera pose super- involves aggregating multiple key-point heatmaps, ensures pre-
vision and other techniques provide useful strategies to tackle cise determination of key-point locations on real images. In-
the challenges of training robust local feature methods and may spired by Q-learning, LF-Net [118] predicts geometric rela-
pave the way for more efficient and scalable learning methods tionships, such as relative depth and camera poses, between
in this domain. matched image pairs using an existing SfM model. It employs
asymmetric gradient backpropagation to train a network for de-
2.2. Joint Detection and Description tecting image pairs without needing manual annotation. Build-
Sparse local feature matching has indeed proved very effec- ing upon LF-Net, RF-Net [119] introduces a receptive field-
tive under a variety of imaging conditions. Yet, under extreme based keypoint detector and designs a general loss function
variations like day-night changes [114], different seasons [115], term, referred to as ’neighbor mask’, which facilitates training
or weak-textured scenes [116], the performance of these fea- of patch selection. Reinforced SP [120] employs principles of
tures can deteriorate significantly. The limitations may stem reinforcement learning to handle the discreteness in keypoint
from the nature of keypoint detectors and local descriptors. selection and descriptor matching. It integrates a feature de-
Detecting keypoints often involves focusing on small regions tector into a complete visual pipeline and trains learnable pa-
of the image and might rely heavily on low-level information, rameters in an end-to-end manner. R2D2 [63] combines grid
such as pixel intensities. This procedure makes keypoint detec- peak detection with reliability prediction for descriptors using
tors more susceptible to variations in low-level image statistics, a dense version of the L2-Net architecture, aiming to produce
which are often affected by changes in lighting, weather, and sparse, repeatable, and reliable keypoints. D2Net [62] adopts a
other environmental factors. Moreover, when trying to indi- joint detect-and-describe approach for sparse feature extraction.
vidually learn or train keypoint detectors or feature descriptors, Unlike Superpoint, it shares all parameters between detection
even after carefully optimizing the individual components, in- and description process and uses a joint formulation that opti-
tegrating them into a feature matching pipeline could still lead mizes both tasks simultaneously. Keypoints in their method are
to information loss or inconsistencies. This is due to the fact defined as local maxima within and across channels of the depth
that the optimization of individual components might not fully feature maps. These techniques elegantly illustrate how the in-
consider the dependencies and information sharing between the tegration of detection and description tasks in a unified model
components. To tackle these issues, the approach of Joint De- leads to more efficient learning and superior performance for
tection and Description has been proposed. In this approach, local feature extraction under different imaging conditions.
the tasks of keypoint detection and description are integrated A dual-headed D2Net model with a correspondence en-
and learned simultaneously within a single model. This can semble is presented by RoRD [121] to address extreme view-
enable the model to fuse information from both tasks during point changes combing vanilla and rotation-robust feature cor-
optimization, better adapting to specific tasks and data, and al- respondences. HDD-Net [122] designs an interactively learn-
lowing deeper feature mappings through CNNs. Such a unified able detector and descriptor fusion network, handling detec-
approach can benefit the task by allowing the detection and de- tor and descriptor components independently and focusing on
scription process to be influenced by higher-level information, their interactions during the learning process. MLIFeat [123]
such as structural or shape-related features of the image. Addi- devises two lightweight modules used for keypoint detec-
tionally, dense descriptors involve richer image context, which tion and descriptor generation with multi-level information fu-
generally leads to better performance. Figure 4 (b) illustrates sion utilized to jointly detect keypoints and extract descrip-
the common structure of the Joint Detection and Description tors. LLF [124] proposes utilizing low-level features to su-
pipeline. pervise keypoint detection. It extends a single CNN layer
Image-based descriptor methods, which take the entire from the descriptor backbone as a detector and co-learns it
image as input and utilize fully convolutional neural net- with the descriptor to maximize descriptor matching. Feature-
works [117] to generate dense descriptors, have seen substan- Booster [125] introduces a descriptor enhancement stage into
tial progress in recent years. These methods often amalga- traditional feature matching pipelines. It establishes a generic
mate the processes of detection and description, leading to im- lightweight descriptor enhancement framework that takes orig-
proved performance in both tasks. SuperPoint [61] employs inal descriptors and geometric attributes of keypoints as in-
a self-supervised approach to simultaneously determine key- puts. The framework employs self-enhancement based on MLP
point locations at the pixel level and their descriptors. Initially, and cross-enhancement based on transformers [126] to enhance
the model undergoes training on synthetic shapes and images descriptors. ASLFeat [127] improves the D2Net using chan-
through the application of random homographies. A crucial nel and spatial peaks on multi-level feature maps. It intro-
aspect of the method lies in its self-annotation process with duces a precise detector and invariant descriptor as well as
real images. This process involves adapting homographies to multi-level connections and deformable convolution networks.
enhance the model’s relevance to real-world images, and the The dense prediction framework employs deformable convo-
MS-COCO dataset is employed for additional training. Ground lution networks (DCN) to alleviate limitations caused by key-
point extraction from low-resolution feature maps. SeLF [128]
7
builds on the Aslfeat architecture to leverage semantic infor- posed by localized feature descriptors solely. The ultimate out-
mation from pre-trained semantic segmentation networks used come is the generation of matches based on the soft assignment
to learn semantically aware feature mappings. It combines matrix. Figure 5 provides a comprehensive depiction of the fun-
learned correspondences-aware feature descriptors with seman- damental architecture of Graph-Based matching.
tic features, therefore, enhancing the robustness of local fea-
ture matching for long-term localization. Lastly, SFD2 [129]
proposes the extraction of reliable features from global regions ×L

Sinkhorn Algorithm
(e.g., buildings, traffic lanes) with the suppression of unreli-
able areas (e.g., sky, cars) by implicitly embedding high-level
semantics into the detection and description processes. This en-
ables the model to extract globally reliable features end-to-end
from a single network. Confidence Matrix
Self Cross
2.3. Describe-then-Detect
One common approach to local feature extraction is the
Describe-then-Detect pipeline, entailing the description of lo- Figure 5: General GNN Matching Model Architecture. Firstly, keypoint posi-
cal image regions first using feature descriptors followed by tions pi along with their visual descriptors di are mapped into individual vec-
tors. Subsequently, self-attention layers and cross-attention layers are thereafter
the detection of keypoints based on these descriptors. Fig- applied alternately, L times, within a graph neural network to create enhanced
ure 4 (c) serves as an illustration of the standard structure of matching descriptors. Finally, the Sinkhorn Algorithm is utilized to determine
the Describe-then-Detect pipeline. the optimal partial assignment.
D2D [130] presents a novel framework for keypoint detec-
tion called Describe-to-Detect (D2D), highlighting the wealth SuperGlue [69] adopts attention graph neural networks and
of information inherent in the feature description phase. This optimal transport methods to address partial assignment prob-
framework involves the generation of a voluminous collection lems. It processes two sets of interest points and their de-
of dense feature descriptors followed by the selection of key- scriptors as inputs and leverages self and cross-attention to ex-
points from this dataset. D2D introduces relative and abso- change messages between the two sets of descriptors. The
lute saliency measurements of local depth feature maps to de- complexity of this method grows quadratically with the num-
fine keypoints. Due to the challenges arising from weak su- ber of keypoints, which prompted further exploration in sub-
pervision inability to differentiate losses between detection and sequent works. SGMNet [70] builds on SuperGlue and adds
description stages, PoSFeat [131] presents a decoupled train- a Seeding Module that processes only a subset of matching
ing approach in the describe-then-detect pipeline specifically points as seeds. The fully connected graph is relinquished for a
designed for weakly supervised local feature learning. This sparse connection graph. A seed graph neural network is then
pipeline separates the description network from the detection designed with an attention mechanism to aggregate informa-
network, leveraging camera pose information for descriptor tion. Keypoints usually exhibit strong correlations with just
learning that enhances performance. Through a novel search a few points, resulting in a sparsely connected adjacency ma-
strategy, the descriptor learning process more adeptly utilizes trix for most keypoints. Therefore, ClusterGNN [71] makes
camera pose information. ReDFeat [132] uses a mutual weight- use of graph node clustering algorithms to partition nodes in
ing strategy to combine multimodal feature learning’s detection a graph into multiple clusters. This strategy applies attention
and description aspects. SCFeat [133] proposes a shared cou- GNN layers with clustering to learn feature matching between
pling bridge strategy for weakly supervised local feature learn- two sets of keypoints and their related descriptors, thus train-
ing. Through shared coupling bridges and cross-normalization ing the subgraphs to reduce redundant information propagation.
layers, the framework ensures the individual, optimal training MaKeGNN [135] introduces bilateral context-aware sampling
of description networks and detection networks. This segre- and keypoint-assisted context aggregation in a sparse attention
gation enhances the robustness and overall performance of de- GNN architecture.
scriptors. Inspired by SuperGlue, GlueStick [136] incorporates point
and line descriptors into a joint framework for joint matching
2.4. Graph Based and leveraging point-to-point relationships to link lines from
In the conventional feature matching pipelines, correspon- matched images. LightGlue [137], in an effort to make Super-
dence relationships are established via nearest neighbor (NN) Glue adaptive in computational complexity, proposes the dy-
search of feature descriptors, and outliers are eliminated based namic alteration of the network’s depth and width based on
on matching scores or mutual NN verification. In recent the matching difficulty between each image pair. It devises a
times, attention-based graph neural networks (GNNs) [134] lightweight confidence classifier to forecast and hone state as-
have emerged as effective means to obtain local feature match- signments. DenseGAP [138] devises a graph structure utiliz-
ing. These approaches create GNNs with keypoints as nodes ing anchor points as sparse, yet reliable priors for inter-image
and utilize self-attention layers and cross-attention layers from and intra-image contexts. It propagates this information to
Transformers to exchange global visual and geometric infor- all image points through directed edges. HTMatch [139] and
mation among nodes. This exchange overcomes the challenges Paraformer [140] study the application of attention for interac-
8
tive mixing and explore architectures that strike a balance be- pixel-wise correspondences between image pairs in a coarse-
tween efficiency and effectiveness. ResMatch [141] presents to-fine fashion. Utilizing a dual-resolution strategy with a Fea-
the idea of residual attention learning for feature matching, re- ture Pyramid Network (FPN)-like backbone, the approach gen-
articulating self-attention and cross-attention as learned resid- erates a 4D correlation tensor from coarse-resolution feature
ual functions of relative positional reference and descriptor sim- maps and refines it through a learnable neighborhood consen-
ilarity. It aims to bridge the divide between interpretable match- sus module, thereby augmenting matching reliability and lo-
ing and filtering pipelines and attention-based feature match- calization accuracy. GLU-Net [67] introduces a global-local
ing networks that inherently possess uncertainty via empirical universal network applicable to estimating dense correspon-
means. dences for geometric matching, semantic matching, and optical
flow. It trains the network in a self-supervised manner. GO-
Cor [143] presents a fully differentiable dense matching mod-
3. Detector-free Models
ule that predicts the global optimization matching confidence
While the feature detection stage enables a reduction in the between two depth feature maps and can be integrated into
search space for matching, handling extreme circumstances, state-of-the-art networks to replace feature correlation layers
such as image pairs involving substantial viewpoint changes directly. DCFlow [144] enhances optical flow estimation by
and textureless regions, prove to be difficult when using utilizing the four-dimensional cost volume, drawing on meth-
detection-based approaches, notwithstanding perfect descrip- ods used in stereo matching. By applying a learned feature
tors and matching methodologies [142]. Detector-free meth- embedding and adapting semi-global matching to four dimen-
ods, on the other hand, eliminate feature detectors and directly sions, DCFlow addresses the computational hurdles tradition-
extract visual descriptors on a dense grid spread across the im- ally linked to this extensive approach. Its efficiency in con-
ages to produce dense matches. Thus, compared to Detector- structing and processing the cost volume, combined with main-
based methods, these techniques can capture keypoints that are taining accuracy, marks an improvement in integrating optical
repeatable across image pairs. flow and stereo estimation techniques. Building on these con-
ceptual advancements, RAFT [145] further refines the approach
3.1. CNN Based to dense correspondence estimation. By extracting per-pixel
features and constructing multi-scale 4D correlation volumes
In the early stages, detection-free matching methodologies
for all pixel pairs, RAFT introduces a recurrent processing unit
often relied on CNNs that used correlation or cost volume to
that iteratively refines the flow field. This innovative strategy
identify potential neighborhood consistencies [142]. Figure 6
effectively addresses several limitations of previous methods,
illustrates the fundamental architecture of the 4D correspon-
such as error propagation at coarse resolutions and the neglect
dence volume.
of small, fast-moving objects, thereby enhancing the precision
and reliability of flow estimation. Following in the footsteps of
these foundational methods, PDC-Net [68] proposes a proba-
bilistic depth network that estimates dense image-to-image cor-
respondences and their associated confidence estimates. It in-
CNN troduces an architecture and an improved self-supervised train-
ing strategy to achieve robust uncertainty prediction that is gen-
eralizable. PDC-Net+ [146] introduces a probabilistic deep
network designed to estimate dense image-to-image correspon-
dences and their associated confidence estimates. They employ
Figure 6: Overview of the 4D correspondence volume. Dense feature maps, a constrained mixture model to parameterize the predictive dis-
denoted as f A and f B , are extracted from images I A and I A using convo- tribution, enhancing the modeling capacity for handling out-
lutional neural networks. Each individual feature match, fijA and f B , corre-
kl liers. PUMP [147] combines unsupervised losses with standard
sponds to the matching (i, j, k, l) coordinates. The 4D correlation tensor c is self-supervised losses to augment synthetic images. By utiliz-
ultimately formed, which contains scores for all points between a pair of im-
ages that could potentially be corresponding points. Subsequently, matching ing a 4D correlation volume, it leverages the non-parametric
pairs are obtained by analyzing the properties of corresponding points in the pyramid structure of DeepMatching [148] to learn unsupervised
four-dimensional space. descriptors. DFM [149] utilizes a pre-trained VGG architecture
as a feature extractor, capturing matches without requiring ad-
NCNet [64] analyzes the neighborhood consistency in four- ditional training strategies, thus demonstrating the robust power
dimensional space of all possible corresponding points between of features extracted from the deepest layers of the VGG net-
a pair of images, obtaining matches without requiring a global work.
geometric model. Sparse-NCNet [65] utilizes a 4D convolu-
tional neural network on sparse correlation tensors and uti- 3.2. Transformer Based
lizes submanifold sparse convolutions to significantly reduce The CNN’s dense feature receptive field may have limita-
memory consumption and execution time. DualRC-Net [66] tions in handling regions with low texture or discerning be-
introduces an innovative methodology for establishing dense tween keypoints with similar feature representations. In con-
trast, humans tend to consider both local and global infor-
9
mation when matching in such regions. Given Transform- pipeline. MatchFormer [162] devises a hierarchical transformer
ers’ success in computer vision tasks such as image classifi- encoder and a lightweight decoder. In each stage of the hi-
cation [150], object detection [151], and semantic segmenta- erarchical structure, cross-attention modules and self-attention
tion [152, 153, 154, 155, 156], researchers have explored in- modules are interleaved to provide an optimal combination
corporating Transformers’ global receptive field and long-range path, enhancing multi-scale features. CAT [163] proposes a
dependencies into local feature matching. Various approaches context-aware network based on the self-attention mechanism,
that integrate Transformers into feature extraction networks for where attention layers can be applied along the spatial dimen-
local feature matching have emerged. sion for higher efficiency or along the channel dimension for
Given that the only difference between sparse matching and higher accuracy and a reduced storage burden. TopicFM [164]
dense matching is the quantity of points to query, COTR [157] encodes high-level context in images, utilizing a topic modeling
combines the advantages of both approaches. It learns two approach. This improves matching robustness by focusing on
matching images jointly with self-attention, using some key- semantically similar regions in images. ASTR [165] introduces
points as queries and recursively refining matches in the other an Adaptive Spot-guided Transformer, which includes a point-
image through a corresponding neural network. This inte- guided aggregation module to allow most pixels to avoid the in-
gration combines both matches into one parameter-optimized fluence of irrelevant regions, while using computed depth infor-
problem. ECO-TR [158] strives to develop an end-to-end mation to adaptively adjust the size of the grid at the refinement
model to accelerate COTR by intelligently connecting multi- stage. DeepMatcher [142] introduces the Feature Transforma-
ple transformer blocks and progressively refining predicted co- tion Module to ensure a smooth transition of locally aggregated
ordinates in a coarse-to-fine manner on a shared multi-scale features extracted from CNNs to features with a global recep-
feature extraction network. LoFTR [72] is groundbreaking tive field, extracted from Transformers. It also presents Slim-
because it creates a GNN with keypoints as nodes, utilizing Former, which builds deep networks, employing a hierarchical
self-attention layers and mutual attention layers to obtain fea- strategy that enables the network to adaptively absorb informa-
ture descriptors for two images and generating dense matches tion exchange within residual blocks, simulating human-like
in regions with low texture. To overcome the absence of behavior. OAMatcher [166] proposes the Overlapping Areas
local attention interaction in LoFTR, Aspanformer [73] pro- Prediction Module to capture keypoints in co-visible regions
poses an uncertainty-driven scheme based on flow prediction and conduct feature enhancement among them, simulating how
probabilistic modeling that adaptively varies the local attention humans shift focus from entire images to overlapping regions.
span to allocate different context sizes for different positions. They also propose a Matching Label Weight Strategy to gen-
Contrary to the detect-then-describe strategy of S-TREK [98], erate coefficients for evaluating the reliability of true match-
which leverages a translation and rotation equivariant keypoint ing labels, using probabilities to determine whether the match-
detector paired with a lightweight descriptor extractor, SE2- ing labels are correct. CasMTR [167] proposes to enhance
LoFTR [159] adopts a detector-free paradigm, seamlessly ex- the transformer-based matching pipeline by incorporating new
tracting pixel-level correspondences between pairs of images stages of cascade matching and NMS detection.
without necessitating the preliminary step of keypoint detec- PMatch [168] enhances geometric matching performance
tion. This model enhances the original LoFTR framework by pretraining with transformer modules using a paired masked
by incorporating a steerable CNN, thereby achieving inher- image modeling pretext task, utilizing the LoFTR module. To
ent equivariance to translations and rotations. This modifi- effectively leverage geometric priors, SEM [169] introduces a
cation significantly boosts the model’s resilience to rotational structured feature extractor that models relative positional re-
variances, showcasing the model’s unique contribution to the lationships between pixels and highly confident anchor points.
domain of feature matching through direct image correspon- It also incorporates epipolar attention and matching techniques
dence. SE2-LoFTR’s approach exemplifies the versatility and to filter out irrelevant regions based on epipolar constraints.
efficiency of detector-free models in handling complex image DKM [170] addresses the two-view geometric estimation prob-
matching scenarios, particularly those involving significant ro- lem by devising a dense feature matching method. DKM
tational movements. presents a robust global matcher with a kernel regressor and
To address the challenges posed by the presence of nu- embedded decoder, involving warp refinement through large
merous similar points in dense matching approaches and the depth-wise kernels applied to stacked feature maps. Building
limitations on the performance of linear transformers them- on this, RoMa [171] represents a significant advancement in
selves, several recent works have proposed novel methodolo- dense feature matching by applying a Markov chain framework
gies. Quadtree [160] introduces quadtree attention to quickly to analyze and improve the matching process. It introduces
skip calculations in irrelevant regions at finer levels, reducing a two-stage approach: a coarse stage for globally consistent
the computational complexity of visual transformations from matching and a refinement stage for precise localization. This
quadratic to linear. OETR [161] introduces the Overlap Regres- method, which separates the initial matching from the refine-
sion method, which uses a Transformer decoder to estimate the ment process and employs robust regression losses for greater
degree of overlap between bounding boxes in an image. It in- accuracy, has led to notable improvements in matching perfor-
corporates a symmetric center consistency loss to ensure spatial mance, outperforming current SotA.
consistency in the overlapping regions. OETR can be inserted
as a preprocessing module into any local feature matching
10
3.3. Patch Based such as visual localization, multi-view stereo, and the syn-
The Patch-Based matching approach enhances point corre- thesis of innovative perspectives. The developmental trajec-
spondences by matching local image regions. It involves di- tory of SfM, underscored by extensive scholarly investigations,
viding images into patches, extracting descriptor vectors for has engendered firmly established methodologies, sophisticated
each, and then matching these vectors to establish correspon- open-source frameworks exemplified by Bundler [176] and
dences. This technique accommodates large displacements and COLMAP [12], and advanced proprietary software solutions.
is valuable in various computer vision applications. Figure 7 These frameworks are meticulously tailored to ensure precision
illustrates the general architecture of the Patch-Based matching and scalability when handling expansive scenes.
approach. Conventional SfM methodologies rely upon the identifica-
tion and correlation of sparse characteristic points dispersed
across manifold perspectives. Nonetheless, this methodology
encounters formidable challenges in regions with scant tex-
Patch-level Matching

√ tural features, where the identification of keypoints becomes

Match Refinement
Feature Extraction

a vexing undertaking. Lindenberger et al. [177] ameliorate

√ this predicament by meticulously refining the initial keypoints
and effecting subsequent adjustments to both points and cam-
√
√ era orientations during post-processing. The proposed ap-
proach strategically balances an initial rudimentary estima-
tion with sparse localized features and subsequent fine-tuning
Figure 7: Illustration of the Patch-Based Pipeline. Following the extraction through locally-precise dense features, thereby elevating pre-
of image features, the Match Refinement process is performed on the matched cision under challenging conditions. Recent strides in SfM
areas obtained from patch-level matches, resulting in refined point matching.
have transitioned towards holistic approaches that either di-
rectly regress poses [178, 179] or employ differential bundle ad-
Patch2Pix [172] proposes a weakly supervised method justment [180, 181]. These approaches, circumventing explicit
to learn correspondences consistent with extreme geometric feature correlation, sidestep the challenges associated with sub-
transformations between image pairs. It adopts a two-stage optimal feature matching. He et al. [182] introduce an innova-
detection-refinement strategy for correspondence prediction, tive SfM paradigm devoid of detectors, harnessing detector-free
where the first stage captures semantic information and the sec- matchers to defer the determination of keypoints. This strat-
ond stage handles local details. It introduces a novel refinement egy adeptly addresses the prevalent multi-view inconsistency in
network that utilizes weak supervision from extreme geomet- detector-free matchers, showcasing superior efficacy in texture-
ric transformations and outputs confidence in match positions impoverished scenes relative to conventional detector-centric
and outlier rejection, allowing for geometrically consistent cor- systems.
respondence prediction. AdaMatcher [173] addresses the geo- The evolutionary trajectory of SfM methodologies is dis-
metric inconsistency issue caused by applying the one-to-one cernible in the pivot from traditional sparse feature identifi-
assignment criterion in patch-level matching. It adaptively as- cation towards sophisticated, occasionally end-to-end, dense
signs patch-level matches while estimating the scale between matching paradigms. The assimilation of these pioneering
images to improve the performance of dense feature matching methodologies into extant SfM workflows is enhancing preci-
methods in extreme cases. PATS [174] proposes Patch Area sion and resilience, particularly in arduous scenes. However,
Transportation with Subdivision (PATS) to learn scale differ- the seamless integration of these methodologies into contem-
ences in a self-supervised manner. It can handle multiple-to- porary SfM systems remains an intricate challenge.
multiple relationships unlike bipartite graph matching, which
only handles one-to-one matches. SGAM [175] presents a hier- 4.2. Remote Sensing Image registration
archical feature matching framework that first performs region In the domain of remote sensing, the emergence of deep
matching based on semantic clues, narrowing down the feature learning has heralded a revolutionary epoch in Multimodal Re-
matching search space to region matches with significant se- mote Sensing Image Registration (MRSIR) [43, 183, 184, 185],
mantic distributions between images. It then refines the region augmenting conventional area- and feature-based techniques
matches through geometric consistency to obtain accurate point with a Learning-Based Pipeline (LBP) [186, 187]. This LBP di-
matches. verges into several pioneering approaches: amalgamating deep
learning with traditional registration methods, bridging the mul-
4. Local Feature Matching Applications timodal chasm through modality transformation, and directly
regressing transformation parameters for a comprehensive MR-
4.1. Structure from Motion SIR framework [60]. Techniques such as (pseudo-) Siamese
SfM represents a foundational computational vision pro- networks and Generative Adversarial Networks (GANs) have
cess, indispensable for deducing camera orientations, intrin- played a pivotal role in this evolution, facilitating the man-
sic parameters, and volumetric point clouds from an array of agement of geometric distortions and nonlinear radiometric
diverse scene images. This process substantiates endeavors disparities [188, 189]. For instance, the utilization of condi-
tional GANs has enabled the creation of pseudo-images [190],
11
thereby enhancing the precision of established methods like tient anatomy and the need for maintaining anatomical integrity
NCC [191] and SIFT [17]. through diffeomorphism and incompressibility. Deep learning-
In the LBP, a multitude of innovative methods and architec- based methods have shown efficacy in motion estimation across
tures have been formulated. MUNet [184], a multiscale strat- different organs, from the heart to the lungs. For example, the
egy for learning transformation parameters, and fully convo- application of variational autoencoder-based models, as demon-
lutional networks for scale-specific feature extraction stand as strated by Qin et al. [202], by traversing the biomechanically
quintessential examples of this innovation, addressing the chal- reasonable deformation manifold to search for the best transfor-
lenges of nonrigid MRSIR [192]. Enriching the LBP further, mation of a given heart sequence, better motion tracking accu-
various research endeavors have concentrated on integrating racy and more reasonable estimation of myocardial motion and
middle- or high-level features extracted by CNNs with classical strain are obtained, enhancing the realism and clinical reliabil-
descriptors, surmounting limitations of traditional methodolo- ity of motion estimation. DeepTag [203] and DRIMET [204]
gies. For example, Ye et al. [193, 184] devised a novel mul- showcase sophisticated methodologies for tracking internal tis-
tispectral image registration technique employing a CNN and sue motion, particularly in the context of Tagged-MRI. These
SIFT amalgamation, substantially enhancing registration effi- approaches exemplify the ability to estimate dense 3D motion
cacy. Similarly, Wang et al. [194] developed an end-to-end fields through advanced unsupervised learning techniques in
deep learning architecture that discerns the mapping function medical imaging. The recent advancement in one-shot learn-
between image patch-pairs and their matching labels, employing for deformable medical image registration is conspicuously
ing transfer learning for expedited training. Ma et al. [195] in- exemplified in contemporary research, particularly in the ap-
troduced a coarse-to-fine registration method using CNN and plication of one-shot learning to complex 3D and 4D medical
local features, attaining a profound pyramid feature representa- datasets, thereby enhancing accuracy, reducing dependency on
tion via VGG-16. Zhou et al. [196] developed a deep learning- large training datasets, and broadening the scope of applicabil-
based method for matching synthetic aperture radar (SAR) and ity. Fechter et al. [205] introduced a one-shot learning approach
optical images, concentrating on extracting Multiscale Convo- for deep motion tracking in 3D and 4D datasets, addressing the
lutional Gradient Features (MCGFs) using a shallow pseudo- challenge of requiring extensive training data. Their method
Siamese network. This approach effectively captures common- concatenates images from different phases in the channel di-
alities between SAR and optical images, transcending the con- mension, utilizing a U-Net architecture with a coarse-to-fine
fines of handcrafted features and diminishing the necessity for strategy. This approach allows the simultaneous calculation
extensive model parameters. Cui et al. [197] introduced MAP- of forward and inverse transformations in 3D datasets. Zhang
Net, an image-centric convolutional network integrating Spatial et al. [206] introduced GroupRegNet, a one-shot deep learning
Pyramid Aggregated Pooling (SPAP) and an attention mecha- method designed for 4D image registration. It employs an im-
nism, proficient in addressing geometric distortions and radio- plicit template, effectively reducing bias and accumulated error.
metric variations in cross-modal images by embedding original The simplicity of GroupRegNet’s network design and its direct
images to extract high-level semantic information and employ- registration process eliminates the need for image partitioning,
ing PCA for enhanced matching accuracy. resulting in a significant enhancement in both computational
Notwithstanding these advancements, challenges in dataset efficiency and accuracy. Furthering the development in this
construction and method generalization persist, principally ow- domain, Ji et al. [207] proposed a temporal-spatial method for
ing to the diverse and intricate nature of remote sensing im- lung 4D-CT image registration. This method integrates a CNN-
ages [184]. The development of comprehensive and represen- ConvLSTM hybrid architecture, adeptly modeling the temporal
tative training datasets, coupled with innovative methodologies motion of images while incorporating a dual-stream approach
meticulously tailored for remote sensing imagery, remains an to address periodic motion constraints. The Hybrid Paradigm-
imperative objective. Moreover, there is a dearth of valuable based Registration Network (HPRN) [208] introduces an unsu-
research in pixel-level fusion of radar and optical images, re- pervised learning framework for 4D-CT lung image registra-
quiring more attention in future endeavors [198]. tion, effectively handling large deformations without ground
truth data. HPRN achieves superior registration accuracy by
4.3. Medical Image Registration learning multi-scale features, incorporating advanced loss func-
The field of medical image registration has undergone sig- tions, and avoiding preprocessing steps like cropping and scal-
nificant evolution with the integration of sophisticated deep ing.
learning techniques, particularly in motion estimation and 2D- 2D-3D registration is a critical component, particularly in
3D registration. These advancements not only represent a tech- interventional procedures [209]. This process is vital for accu-
nological leap but also open new vistas in various medical ap- rately overlaying 2D images (like X-Ray, ultrasound, or endo-
plications [58]. scopic images) onto 3D pre-operative CT or MR images. The
Motion estimation in medical imaging, a crucial aspect of key challenge here lies in the accurate geometric alignment of
registration, has been substantially refined with deep learn- these differing dimensionalities. Traditional approaches to 2D-
ing. Unsupervised optical flow and point tracking techniques, 3D registration have relied on iterative optimization methods
as expounded by researchers like Bian et al. [199], Ranjan with similarity metrics based on image intensity [210]. How-
et al. [200], and Harley et al. [201], address the complexi- ever, these methods often struggle with the non-convex nature
ties inherent in medical image data, such as variability in pa- of the problem, potentially resulting in convergence to incorrect
12
solutions if the initial estimate is not close to the actual solution. 57 sequences marked by substantial variances in illumination,
This is compounded by the inherent difficulty of representing spanning both natural and artificial luminance conditions. In
3D spatial information on 2D images, leading to registration each test sequence, one reference image is paired with the re-
ambiguity. Recent advancements, however, have seen a shift to- maining five images. It is worth noting that according to the
wards deep learning-based methods. These approaches, unlike evaluation method of D2Net, 56 sequences with significant
their traditional counterparts, do not require explicit functional viewpoint changes and 52 sequences with significant illumina-
mappings, allowing for more robust solutions to the registration tion changes are usually used to evaluate the performance of the
challenge [211]. In the realm of recent developments concern- network. Starting from SuperPoint [61], the HPatches dataset
ing 2D-3D medical image registration, Jaganathan et al. [212] has also been used to evaluate the performance of local descrip-
have introduced a self-supervised paradigm designed for the tors in homography estimation tasks.
fusion of X-ray and CT images. This method leverages sim- Roto-360 [113] is an evaluation dataset consisting of 360
ulated X-ray projections to facilitate the training of deep neural image pairs. These pairs feature rotations within a plane rang-
networks, culminating in a noteworthy enhancement of regis- ing from 0° to 350° in 10° intervals. The dataset was generated
tration accuracy and success ratios. Simultaneously, Huang et by randomly selecting and rotating ten HPatches images, mak-
al. [213] have devised a two-stage framework tailored for neu- ing it valuable for assessing descriptor performance in terms of
rological interventions. This innovative approach amalgamates rotation invariance.
CNN regression with centroid alignment, manifesting superior ZEB [220] unveils an innovative zero-shot evaluation
efficacy in real-time clinical applications. In addition to rigid benchmark, meticulously designed to tackle the challenge of
2D-3D registration, there is growing interest in non-rigid regis- generalization in image matching across varied domains. It har-
tration, which is crucial in applications like cephalometry, lung monizes data from eight real-world and four simulated datasets,
tumor tracking in radiation therapy, and Total Hip Arthroplasty encapsulating a broad spectrum of image resolutions, environ-
(THA) [214, 215, 216]. Deep learning models, such as convo- mental conditions, and perspectives. This amalgamation culmi-
lutional encoders, have been used to address the challenges of nates in a robust benchmark featuring 46,000 evaluation image
non-rigid registration [217, 218]. pairs, curated across five distinct image overlap ratios, spanning
The convergence of motion estimation and 2D-3D registra- from 10% to 50%, ascertained through verifiable ground truth
tion techniques in medical image registration addresses critical poses and depth maps. ZEB’s comprehensive and expansive
challenges in parameter optimization and ambiguity, enhancing dataset stands unparalleled in scope and diversity, markedly sur-
both the speed and accuracy of medical imaging processes. The passing the conventional 1,500 in-domain image pairs utilized
ongoing evolution in this field is poised to revolutionize diag- in prevailing methodologies, representing a monumental ad-
nostic and interventional procedures, making them more effi- vancement in the evaluation of cross-domain generalization ca-
cient, patient-centric, and outcome-focused. Interested readers pabilities of image matching models. With its rigorous assem-
can refer to the comprehensive survey [58, 57] for a detailed bly, ZEB not only elucidates the generalization disparities of
overview of deep learning-based approaches to medical image existing domain-specific models in uncontrolled scenarios but
registration. also establishes a novel benchmark for gauging the resilience
and versatility of image matching algorithms against real-world
5. Local Feature Matching Datasets adversities.
The Image Matching Challenge (IMC) series, initiated in
Local feature matching methods are often evaluated based 2019, has consistently pushed the boundaries of local feature
on their effectiveness in downstream tasks. In this section, matching methodologies through meticulously curated datasets
we will provide summaries of some of the most widely used and a comprehensive benchmark evaluation framework. Each
datasets for evaluating local feature matching. We categorize year, the IMC introduces new challenges and datasets, reflect-
these datasets into five groups: Image Matching Datasets, Rel- ing the evolving complexities and diversities encountered in
ative Pose Estimation Datasets, Visual Localization Datasets, real-world image matching scenarios. From leveraging large
Optical Flow Estimation Datasets, and Structure from Motion collections of images for dense 3D reconstructions to explor-
Datasets. For each dataset, we will provide detailed information ing the nuances of phototourism, urban landscapes, and the in-
about the features it encompasses. tricate details captured by UAVs and DSLRs, the IMC series
provides a robust platform for advancing image matching tech-
5.1. Image Matching Datasets nologies. The IMC 2020 [221] dataset was centered around
HPatches [219] benchmark stands as a prominent yard- generating dense and accurate 3D reconstructions from vast
stick for image matching endeavors. It comprises 116 scene image collections using off-the-shelf SfM techniques, such as
sequences distinguished by fluctuations in viewpoint and lumi- Colmap. This challenge presented the Phototourism dataset,
nosity. Within each scene, there are 5 pairs of images, with derived from photo-tourism image collections of famous land-
the inaugural image serving as the point of reference, and the marks, to evaluate and benchmark different image matching
subsequent images in the sequence progressively intensifying methods against pseudo-ground truth poses and depth maps.
in complexity. This dataset is bifurcated into two distinct do- Training data included images, poses, depth maps, and co-
mains: viewpoint, encompassing 59 sequences marked by sub- visibility estimates, with a validation set available for method
stantial viewpoint alterations, and illumination, encompassing fine-tuning before submission. Building upon the foundational
13
work of IMC 2020, the IMC 2021 [222] expanded its horizons visual alterations. This dataset comprises 37 image pairs, fea-
by introducing three datasets: ”Phototourism”, ”PragueParks”, turing a blend of urban and natural environments, systemati-
and ”GoogleUrban”, each showcasing unique challenges for cally categorized based on the presence of various complicating
image matching techniques. While ”Phototourism” continued factors. The ground truth for WxBS is established through a
from the 2020 challenge, ”PragueParks” introduced imagery collection of manually selected correspondences, capturing the
from video sequences captured with modern smartphones, and segments of the scene that are visible in both images. WxBS
”GoogleUrban” focused on evaluating localization algorithms serves as a pivotal tool for the appraisal of algorithms tailored
with images collected from different cell phones across the for image matching under a spectrum of demanding conditions.
globe. For IMC 2022 [223] and IMC 2023 [224], partici-
pants are encouraged to explore the Kaggle official website 5.3. Visual Localization Datasets
for detailed information on the datasets and benchmarks. The Aachen Day-Night [230] is a dataset consisting of 4328
IMC 2023 notably introduces three new datasets - ”Heritage”, daytime images and 98 nighttime images, which are used for lo-
”Haiper”, and ”Urban” - each presenting unique challenges calization tasks. This benchmark challenges matching between
such as UAV-to-ground imaging, day-night variations, repeated daytime and nighttime images, making it a challenging dataset
patterns, wiry objects, and scale changes. These datasets aim to to work with. Aachen Day-Night v1.1 [9] is an updated version
test the limits of current image matching algorithms and foster of the Aachen Day-Night dataset with 6697 daytime images and
the development of novel solutions capable of addressing the 1015 query images (824 for the day and 191 for the night). The
intricate challenges of real-world image matching scenarios. presence of large illumination and viewpoint changes makes it
a challenging dataset to work with.
5.2. Relative Pose Estimation Datasets InLoc [116] is an indoor dataset that includes 9972 RGBD
ScanNet [225] is a large-scale indoor dataset with well- images; 329 RGB images from it are used as queries to test
defined training, validation, and test splits comprising approxi- the performance of long-term indoor visual localization algo-
mately 230 million well-defined image pairs from 1613 scenes. rithms. This dataset provides a variety of challenges due to its
This dataset includes ground truth and depth images and con- large size (around 10k images covering two buildings), signif-
tains more regions with repetitive and weak textures compared icant differences in viewpoint and/or illumination between the
to the Hpatches dataset, thus posing greater challenges. database and query images, and temporal changes in the scene.
YFCC100M [226] is a vast dataset with a diverse collection In addition to this, the InLoc dataset provides a large collection
of internet images of various tourist landmarks. It comprises of depth maps from 3D scanners.
100 million media objects, of which approximately 99.2 million RobotCar-Seasons (RoCaS) [231] is a challenging dataset
are photos and 0.8 million are videos, with each media object that contains 26121 reference images and 11934 query images.
represented by several metadata pieces, such as the Flickr iden- The dataset presents a variety of environmental conditions, in-
tifier, owner name, camera information, title, tags, geo-location, cluding rain, snow, dusk, winter, and inadequate lighting in sub-
and media source. Typically, a subset of YFCC100M is used for urban areas. These factors make the task of feature matching
evaluation, consisting of four popular landmark image sets of and visual localization difficult.
1000 image pairs each, resulting in a total of 4000 pairs for the LaMAR [232] addresses the foundational technology of lo-
test set and following the conventions used in [77, 69, 70, 73]. calization and mapping in augmented reality (AR), introducing
MegaDepth [227] is a dataset designed to address the chal- a new benchmark for realistic AR scenarios. The dataset is cap-
lenging task of matching under extreme viewpoint changes and tured using AR devices in diverse environments, including in-
repetitive patterns. It comprises 1 million image pairs from door and outdoor scenes with dynamic objects and varied light-
196 different outdoor scenes, each with known poses and depth ing. It features multi-sensor data streams (images, depth, IMU,
information, and can be used to validate pose estimation ef- etc.) from devices like HoloLens 2 and iPhones/iPads, cover-
fectiveness in outdoor scenarios. The authors also provide ing over 45,000 square meters. LaMAR’s ground-truth pipeline
depth maps generated from sparse reconstruction and multi- aligns AR trajectories against laser scans automatically, ro-
view stereo computation using COLMAP [12]. bustly handling data from heterogeneous devices. This bench-
EVD (Extreme Viewpoint Dataset) [228] is a meticu- mark is pivotal in evaluating AR-specific localization and map-
lously curated dataset, specifically developed for the assess- ping methods, highlighting the importance of considering addi-
ment of two-view matching algorithms under scenarios of ex- tional data streams like radio signals in AR devices. LaMAR
treme viewpoint alterations. It amalgamates image pairs from a offers a realistic and comprehensive dataset for AR, guiding fu-
variety of publicly accessible datasets, distinguished by their in- ture research directions in visual localization and mapping.
tricate geometric configurations. EVD’s creation was motivated
by the necessity to evaluate the resilience of matching method- 5.4. Optical Flow Estimation Datasets
ologies in contexts characterized by pronounced variations in
KITTI [233] is an image matching dataset collected in ur-
viewpoints.
ban traffic scenarios with both a 2012 and 2015 version. KITTI-
WxBS (Wide multiple Baseline Stereo) [229] addresses
2012 consists of 194 training image pairs and 195 test im-
a more expansive challenge within the realm of wide baseline
age pairs of resolution 1226×370, while KITTI-2015 contains
stereo matching, encompassing disparities in multiple facets of
200 training image pairs and 200 test image pairs of resolution
image acquisition such as viewpoint, lighting, sensor type, and
14
1242×375. The dataset includes sparse ground truth dispari- of London [238]. ETH dataset includes various cameras and
ties obtained using a laser scanner. The scenes in KITTI-2012 conditions, providing a challenging benchmark to compare dif-
are relatively simple, whereas the KITTI-2015 dataset presents ferent methods’ performance.
challenges due to its dynamic scenes and complex scenarios. ETH3D [239] is a comprehensive benchmark for multi-
view stereo algorithms. The dataset encompasses a wide array
5.5. Structure from Motion Datasets of scenes, both indoors and outdoors, captured through high-
SUN3D [234] offers a comprehensive RGB-D video resolution DSLR cameras and synchronized low-resolution
database, meticulously annotated to capture the full 3D extent stereo videos. What makes this dataset distinctive is its combi-
of a variety of indoor environments. Encompassing 415 se- nation of high spatial and temporal resolution. With scenarios
quences across 254 spaces within 41 buildings, it provides a ranging from natural to man-made environments, it introduces
rich resource for SfM evaluations. Utilizing an ASUS Xtion novel challenges for detailed 3D reconstruction, with a specific
PRO LIVE sensor, SUN3D simulates human visual perspec- focus on the application of hand-held mobile devices in stereo
tives, facilitating an innovative combination of intuitive label- vision scenarios. ETH3D provides diverse evaluation protocols
ing and generalized bundle adjustment to enhance semantic catering to high-resolution multi-view stereo, low-resolution
segmentation and camera pose accuracy. This methodology multi-view on video data, and two-view stereo. Consequently,
not only addresses reconstruction errors but also enriches the it stands as a valuable asset for advancing research in the field
dataset with detailed 3D object models, point clouds, and se- of dense 3D reconstruction.
mantic maps, making it an invaluable asset for advancing re- ENRICH [240] presents a synthetic, versatile dataset tai-
search in local feature matching, scene understanding, and 3D lored for the assessment and comparison of photogrammet-
reconstruction. ric and computer vision techniques, with a focus on SfM
DTU [235] serves as an exhaustive resource dedicated to and MVS methodologies. It encompasses three distinct
advancing the field of multi-view stereo (MVS). Comprising 80 subsets—ENRICH-Aerial, ENRICH-Square, and ENRICH-
meticulously documented scenes from either 49 or 64 precise Statue—each offering a collection of high-resolution images
camera locations, under a spectrum of seven distinct lighting rendered under various lighting conditions, camera orienta-
scenarios, the DTU dataset introduces a diverse array of chal- tions, scales, and viewpoints to closely mimic real-world com-
lenges, encompassing specular reflections, textural disparities, plexity. Created from 3D models of real objects to ensure gen-
and geometric intricacies. It offers a rigorously controlled envi- uine textures and geometries, ENRICH employs a virtual cam-
ronment, capturing images at a resolution of 1200 x 1600 pix- era modeled after the Nikon D750 DSLR to capture distortion-
els, and is supplemented by meticulously accurate structured free images. Beyond serving as a rigorous platform for SfM
light scans for benchmarking and evaluation purposes. This and MVS algorithm evaluation, ENRICH provides an extensive
dataset is crucial for delving into the nuances of balancing pre- array of ground truth data, including GCP coordinates, depth
cision and completeness in 3D reconstructions, navigating the maps, and 3D models. This rich dataset, with its quality and
complexities of mesh generation, and evaluating the influence scale diversity, from aerial views to ground-level scenes, stands
of lighting conditions on MVS efficacy. as a critical asset for pushing the boundaries of research in re-
Tanks and Temples [236] is crafted as a benchmarking mote sensing, photogrammetry, and computer vision, facilitat-
tool for assessing image-based 3D reconstruction pipelines. It ing a broad spectrum of applications including image matching
encompasses a diverse array of outdoor vistas and indoor set- and monocular depth estimation.
tings captured under realistic conditions. The ground-truth data
is acquired via an industrial laser scanner, furnishing high- 5.6. Dataset Gaps and Future Needs
resolution video sequences as input. The dataset is stratified Although the above datasets provide valuable resources for
into intermediate and advanced tiers, incorporating substan- evaluating local feature matching methods, there are significant
tial vehicles, architectural structures, and intricate geometric gaps that need to be addressed.
configurations to challenge prevailing reconstruction pipelines. One major gap is the lack of datasets that simulate extreme
Consisting of 14 scenes spanning from singular objects to ex- environmental conditions. While the presence of datasets like
pansive indoor and outdoor landscapes, Tanks and Temples RoCaS [231] offers some variability in environmental condi-
pushes the boundaries of 3D reconstruction with its meticu- tions, including diverse weather scenarios and lighting condi-
lously curated ground-truth point clouds and camera poses ex- tions, there is a need for datasets that focus specifically on
trapolated from 4K video sequences. This dataset serves as a challenging weather scenarios like heavy rain, fog, or snow.
linchpin for the refinement and evaluation of algorithms adept These conditions pose unique challenges in feature match-
at traversing the intricacies inherent in large-scale scene re- ing and are crucial for applications in climate-sensitive ar-
construction, providing an indispensable asset for enhancing eas. Another gap is the limited representation of highly dy-
the resilience and accuracy of image-based 3D reconstruction namic environments. Current datasets, including the widely
methodologies. used HPatches [219], while comprehensive in examining vari-
ETH [237] is a dataset designed to evaluate descriptors for ations in viewpoint and illumination, do not adequately capture
SfM tasks by building a 3D model from a set of available 2D the complexity of crowded urban areas or fast-moving scenes.
images. Following D2Net, three medium-sized datasets are This limitation is significant for applications requiring real-
evaluated: Madrid Metropolis, Gendarmenmarkt, and Tower time monitoring and surveillance in densely populated areas.
15
Datasets that can mimic the dynamics of such environments they fall within a certain pixel threshold of ground truth match-
are essential for advancing feature matching techniques in these ing.
contexts. Additionally, there is a noticeable lack of datasets tai-
lored for specific application domains, such as underwater or 6.1.2. Homography Estimation
aerial imagery. These domains have unique characteristics and The angular correctness metric is typically used in order
challenges that are not addressed by datasets like ETH [237] or to evaluate the performance of feature matching algorithms.
Aachen Day-Night [230]. Specialized datasets in these areas The metric involves estimating the homography transformation
would be invaluable for research and development in fields like Ĥ between two images and comparing the transformed cor-
marine biology or drone-based monitoring. ners with those computed using the ground-truth homography
In conclusion, while existing datasets have significantly H [61]. To ensure a fair comparison among methods that pro-
contributed to the field of local feature matching, there is a duce different numbers of matches, a correctness identifier is
clear need for more specialized datasets. These datasets should calculated based on the corner error between the images warped
aim to fill the existing gaps and cater to the evolving needs of with Ĥ and H. If the average error of the four corners is less
various application domains, thereby enabling further advance- than a specified pixel threshold ε, typically ranging from 1 to 10
ments in local feature matching techniques. pixels, the estimated homography is considered correct. Once
the correctness of the estimated homography is established, the
angle errors between images are evaluated using the Area Un-
6. Performance Review
der Curve (AUC) metric. This metric calculates the area under
6.1. Metrics For Matching Models the error accumulation curve at various thresholds, quantifying
the accuracy and stability of matching. The AUC value rep-
6.1.1. Image Matching
resents the overall matching performance, with higher values
Repeatability [241, 52]. The repeatability metric for com-
indicating better performance.
paring two images is computed by taking the count of matching
feature areas found between the images and dividing it by the
6.1.3. Relative Pose Estimation
lesser count of feature areas found within either of the images,
When evaluating the estimated camera pose, the typical
subsequently multiplying by 100 to express the result as a per-
method involves measuring the angular deviations in rotations
centage. This quantitative assessment is essential for gauging
and translations [242]. In this method, if the angle deviation is
the consistency of feature detectors when subjected to different
less than a certain threshold, rotation or translation is consid-
geometric alterations:
ered to be correctly estimated, and the average accuracy at that
M threshold is reported. The interval between frames is repre-
Repeatability = × 100 sented by df rame , where larger values indicate more challeng-
min(F1 , F2 )
ing image pairs for matching. For the pose error at different
where, M denotes the number of matching feature areas be- thresholds. The most common metrics include AUC, match-
tween the two images, F1 represents the total number of feature ing accuracy, matching score. Among them, the maximum
areas detected in the first image, and F2 is the total number of values of the translation error and the angular error are usually
feature areas detected in the second image. noted as the pose error.
Matching score (M-score) [61, 78]. The M-Score quan-
tifies the effectiveness of a feature detection and description 6.1.4. Visual Localization
pipeline by calculating the average ratio of correctly matched
The evaluation process typically follows the general evalua-
features to the total features detected in the overlapping ar-
tion protocol outlined in visual localization benchmarks2 . Cus-
eas of two images. Mean Matching Accuracy (MMA) [62]
tom features are used as input to the system, and then the im-
is used to measure how well feature matchings are performed
age registration process is performed using a framework such as
between image pairs considering multiple pixel error thresh-
COLMAP [12]. Finally, the percentage of images successfully
olds. It represents the average percentage of correct matches
positioned within a predefined tolerance range is calculated. In
in image pairs considering several pixel error thresholds. The
order to report the performance of the evaluated methods, the
metric considers only mutually nearest neighbor matches, and
cumulative AUC of pose error at different threshold values is
matches are considered correct if the reprojection error, esti-
often used.
mated using provided homography, is below the given match-
ing threshold. Features and Matches [62] evaluate the per-
6.1.5. Optical Flow Estimation
formance of feature descriptors. Features refers to the average
Evaluation metrics employed for optical flow estimation en-
number of detected features per image, and Matches indicates
compass the Average End-point Error (AEPE), flow outlier
the average number of successful feature matches. Percentage
ratio (Fl), and Percentage of Correct Keypoints (PCK) [243,
of Correct Keypoints (PCK) [111] metric is commonly used
67]. AEPE is characterized as the mean Euclidean separation
to measure the performance of dense matching. It involves ex-
between the estimated and ground truth correspondence map.
tracting key points from the first image on the image grid and
finding their nearest neighbors in the complete second image.
Predicted matchings of query points are considered correct if 2 https://www.visuallocalization.net

16
Specifically, it quantifies the Euclidean disparity between pre- corner error distances up to 3, 5, and 10 pixels. Table 2 focuses
dicted and actual flow fields, computed as the average over valid on the ScanNet [225] test set, following the SuperGlue [69] test-
pixels within the target image. Fl assesses the average percent- ing protocol. The reported metric is the pose AUC error. Ta-
age of outliers across all pixels, where outliers are defined as ble 3 centers on the YFCC100M [226] test set, with a protocol
flow errors exceeding either 3 pixels or 5% of the ground truth based on RANSAC-Flow [244]. Additionally, the pose mAP
flows. PCK elucidates the percentage of appropriately matched (mean Average Precision) value is reported. A pose estimation
estimated points x̂i situated within a specified threshold (in pix- is considered an outlier if its maximum degree error in trans-
els) from the corresponding ground truth points xi . lation or rotation exceeds the threshold. Table 4 highlights the
MegaDepth [227] test set. The pose estimation AUC error is re-
6.1.6. Structure from Motion ported, following the SuperGlue [69] evaluation methodology.
As delineated in the evaluation framework prescribed by Tables 5 and 6 emphasize the Aachen Day-Night v1.0 [230]
ETH [237], a suite of pivotal metrics is employed to rigorously and v1.1 [9] test sets, respectively, in the local feature evalua-
evaluate the fidelity of the reconstruction process. These en- tion track and the full visual localization track. Table 7 focuses
compass the number of registered images, which serves as an on the InLoc [116] test set. The reported metrics include the
indicator of the reconstruction’s comprehensiveness, along with percentage of correctly localized queries under specific error
the sparse points metric, which provides insights into the depth thresholds, following the HLoc [245] pipeline. Table 8 empha-
and intricacy of the scene’s depiction. Furthermore, the total sizes the KITTI [233] test set. The AEPE and the flow outlier
observations in image metric is pivotal for the calibration of ratio Fl are reported for both the 2012 and 2015 versions of the
cameras and the triangulation process, denoting the confirmed KITTI dataset. Table 9 focuses on the ETH3D [239], presenting
image projections of sparse points. The mean feature track a detailed evaluation of various SfM methods as reported in the
length, indicative of the average count of verified image obser- DetectorFreeSfM [182]. This evaluation thoroughly examines
vations per sparse point, plays a vital role in ensuring precise the effectiveness of these methods across three crucial metrics:
calibration and robust triangulation. Lastly, the mean reprojec- AUC, Accuracy, and Completeness.
tion error is a critical measure for gauging the accuracy of the
Table 1: Evaluation on HPatches [219] for homography estimation. We com-
reconstruction, encapsulating the cumulative reprojection error
pare with two groups of the methods, Detector-based and Detector-free meth-
observed in bundle adjustment and influenced by the thorough- ods.
ness of input data as well as the precision of keypoint detection.
The key metrics in ETH3D [239] are crucial for evaluat- Category Methods
Pose Estimation AUC↑
matches
ing the effectiveness of various SfM methods. The AUC of @3px @5px @10px
D2Net [62]+NN 23.2 35.9 53.6 0.2K
pose error at different thresholds is used to assess the accu- R2D2 [63]+NN 50.6 63.9 76.8 0.5K
racy of multi-view camera pose estimation. This metric reflects Detector-based DISK [112]+NN 52.3 64.9 78.9 1.1K
the precision of estimated camera poses in relation to ground SP+GFM [246] 51.9 65.8 79.1 2.0k
truth. Accuracy and Completeness percentages at different dis- SP+SuperGlue [69] 53.9 68.3 81.7 0.6K
COTR [157] 41.9 57.7 74.0 1.0K
tance thresholds evaluate the 3D Triangulation task. Accuracy Sparse-NCNet [65] 48.9 54.2 67.1 1.0K
indicates the proportion of reconstructed points within a cer- DualRC-Net [66] 50.6 56.2 68.3 1.0K
tain distance from the ground truth, and Completeness mea- Patch2Pix [172] 59.3 70.6 81.2 0.7K
sures the percentage of ground truth points that are adequately 3DG-STFM [247] 64.7 73.1 81.0 1.0k
LoFTR [72] 65.9 75.6 84.6 1.0K
represented within the reconstructed point cloud. SE2-LoFTR [159] 66.2 76.6 86.0 —
Detector-free QuadTree [160] 66.3 76.2 84.9 2.7k
6.2. Quantitative Performance PDC-Net+ [146] 66.7 76.8 85.8 1.0k
TopicFM [164] 67.3 77.0 85.7 1.0K
In this section, we analyze the performance of several key ASpanFormer [73] 67.4 76.9 85.6 —
methods in terms of the evaluation scores provided in Section SEM [169] 69.6 79.0 87.1 1.0K
6.1, which includes various algorithms discussed earlier and ad- CasMTR-2c [167] 71.4 80.2 87.9 0.5k
DKM [170] 71.3 80.6 88.5 5.0K
ditional methods. We compile their performances on popular ASTR [165] 71.7 80.3 88.0 1.0K
benchmarks into tables, where the data is sourced either from PMatch [168] 71.9 80.7 88.5 —
the original authors or from the best reported results of other RoMa [171] 72.2 81.2 89.1 —
authors under the same evaluation conditions. Furthermore,
some publications may report performances on non-standard
benchmarks/databases or only involve certain subsets of pop-
ular benchmark test sets. We do not present the performance of 7. Challenges and Opportunities
these methods.
Deep learning has brought significant advantages to image-
The following tables provide summaries of several ma-
based local feature matching. However, there are still sev-
jor DL-based matching models’ performances on different
eral challenges that remain to be addressed. In the following
datasets. Table 1 highlights the HPatches [219] test set,
sections, we will explore potential research directions that we
adopting the evaluation protocol utilized by the LoFTR ap-
believe will provide valuable momentum for further advance-
proach [72]. Performance metrics are based on the AUC of
ments in image-based local feature matching algorithms.

17
Table 2: ScanNet [225] Two-View Camera Pose Estimation. We compare with 174, 73]. For latency-sensitive applications, adaptive mecha-
two groups of the methods, Detector-based and Detector-free methods.
nisms can be incorporated into the matching process. This al-
Pose Estimation AUC↑ lows the modulation of network depth and width based on fac-
Category Methods
@5° @10° @20° tors such as visual overlap and appearance variation, enabling
ORB+GMS [76] 5.2 13.7 25.4 fine-grained control over the difficulty of the matching task.
D2Net [62]+NN 5.3 14.5 28.0
Furthermore, researchers have proposed various innovative ap-
ContextDesc+RT [101] 6.6 15.0 25.8
ContextDesc+NN [101] 9.4 21.5 36.4 proaches to address issues such as scale variations. One key
SP+NN [61] 9.4 21.5 36.4 challenge is how to adaptively adjust the size of the cropping
Detector-based SP+PointCN [248] 11.4 25.5 41.4 grid based on image scale variations to avoid matching failures.
SP+HTMatch [139] 15.1 31.4 48.2
SP+SGMNet [70] 15.4 32.1 48.3
Geometric inconsistencies in patch-level feature matching can
ContextDesc+SGMNet [70] 15.4 32.3 48.8 also be alleviated through adaptive allocation strategies, com-
SP+SuperGlue [69] 16.2 33.8 51.8 bined with adaptive patch subdivision strategies that enhance
SP+DenseGAP [138] 17.0 36.1 55.7 correspondence quality progressively from coarse to fine dur-
DualRC-Net [66] 7.7 17.9 30.5
SEM [169] 18.7 36.6 52.9
ing the matching process. On the other hand, attention spans
PDC-Net(H) [68] 18.7 37.0 54.0 can be adaptively adjusted based on the difficulty of matching,
PDC-Net+(H) [146] 20.3 39.4 57.1 achieving variable-sized adaptive attention regions at different
LoFTR-DT [72] 22.1 40.8 57.6 positions. This allows the network to better adapt to features
3DG-STFM [247] 23.6 43.6 61.2
LoFTR [72]+QuadTree [160] 23.9 43.2 60.3 at different locations while capturing contextual information,
MatchFormer [162] 24.3 43.9 61.4 thereby improving matching performance.
Detector-free
QuadTree [160] 24.9 44.7 61.8 In summary, the research on adaptability in local feature
ASpanFormer [73] 25.6 46.0 63.3 matching offers vast prospects and opportunities for future de-
OAMatcher [166] 26.1 45.3 62.1
PATS [174] 26.0 46.9 64.3 velopments, while enhancing efficiency in terms of memory
CasMTR-4c [167] 27.1 47.0 64.4 and computation. With the emergence of more demands and
DeepMatcher-L [142] 27.3 46.3 62.5 challenges across various fields, it is anticipated that adaptive
PMatch [168] 29.4 50.1 67.4
mechanisms will play an increasingly important role in local
DKM [170] 29.4 50.7 68.3
RoMa [171] 31.8 53.4 70.9 feature matching. Future research could further explore finer-
grained adaptive strategies to achieve more efficient and accu-
rate matching results.
7.1. Efficient Attention and Transformer
7.3. Weakly Supervised Learning
For local feature matching, integrating transformers into
GNN models can be considered, framing the task as a graph The field of local feature matching has not only achieved
matching problem involving two sets of features. To enhance significant progress under fully supervised settings but has also
the construction of better matchers, the self-attention and cross- shown potential in the realm of weakly supervised learning.
attention layers within transformers are commonly employed Traditional fully supervised methods rely on dense ground-
to exchange global visual and geometric information between truth correspondence labels. In recent years, researchers have
nodes. In addition to sparse descriptors generated by matching turned their attention to self-supervised and weakly super-
detectors, direct application of self-attention and cross-attention vised learning to reduce the dependency on precise annotations.
to feature maps extracted by CNNs, and generating matches Self-supervised learning methods like SuperPoint [61] train on
in a coarse-to-fine manner, has also been explored [69, 72]. pairs of images generated through virtual homography trans-
However, the computational cost of basic matrix multiplication formations, yielding promising results. However, these sim-
in transformers remains high when dealing with a large num- ple geometric transformations might not work effectively un-
ber of keypoints. In recent years, efforts have been made to der complex scenarios. Weakly supervised learning has be-
enhance the efficiency of transformers and attempts to com- come a research focus in the domain of local feature learn-
pute the two types of attention in parallel have been made, ing [111, 112, 172, 131, 133]. These methods often combine
continuously reducing complexity while maintaining perfor- weakly supervised learning with the describe-detect pipeline,
mance [70, 71, 141, 140]. Future research will further opti- but direct use of weakly supervised loss leads to noticeable per-
mize the structures of attention mechanisms and transformers, formance drops. Some approaches rely solely on solutions in-
aiming to maintain high matching performance while reduc- volving relative camera poses, learning descriptors via epipo-
ing computational complexity. This will make local feature lar loss. The limitations of weakly supervised methods lie in
matching methods more efficient and applicable in real-time their difficulty to differentiate errors introduced by descriptors
and resource-constrained environments. and keypoints, as well as accurately distinguishing different de-
scriptors. To overcome these challenges, carefully designed de-
7.2. Adaptation strategy coupled training pipelines have emerged, where the description
network and detection network are trained separately until high-
In recent years, significant progress has been made in the re- quality descriptors are obtained. Chen et al. [256] propose in-
search on adaptability in local feature matching [137, 165, 173, novative methods using convolutional neural networks for fea-
ture shape estimation, orientation assignment, and descriptor
18
Table 3: Evaluation on YFCC100M [226] for outdoor pose estimation. We compare with two groups of the methods, Detector-based and Detector-free methods.

Pose estimation AUC↑ Pose estimation mAP↑

Category Methods
@5° @10° @20° @5° @10° @20°
SuperPoint(SP) [61] — — — 30.5 50.8 67.9
SIFT [17]+RT 24.1 40.7 58.1 45.1 55.8 67.2
SP+OANet [77] 26.8 45.0 62.2 50.9 61.4 71.8
SIFT+OANet [77] 29.2 48.1 65.1 55.1 65.0 74.8
CoAM [249] — — — 55.6 66.8 —
Detector-based
SIFT+SuperGlue [69] 30.5 51.3 69.7 59.3 70.4 80.4
Paraformer [140] 31.7 52.3 70.4 60.1 70.7 78.1
RootSIFT [250]+SGMNet [70] 35.5 55.2 71.9 — — —
SP+SuperGlue [69] 38.7 59.1 75.8 67.8 77.4 85.7
DualRC-Net [66] 29.5 50.1 66.8 — — —
RANSAC-Flow [244] — — — 64.9 73.3 81.6
PDC-Net(MS) [68] 35.7 55.8 72.3 63.9 73.0 73.0
PDC-Net+(H) [146] 37.5 58.1 74.5 67.4 76.6 84.6
Detector-free LoFTR [72] 42.4 62.5 77.3 — — —
ASpanFormer [73] 44.5 63.8 78.4 — — —
PMatch [168] 45.7 65.2 79.8 75.9 83.1 89.3
PATS [174] 47.0 65.3 79.2 — — —

Table 4: MegaDepth [227] Two-View Camera Pose Estimation. We compare and keypoints. In the future, developments in weakly super-
with two groups of the methods, Detector-based and Detector-free methods.
vised learning in the domain of local feature matching might
Pose Estimation AUC↑ focus on finer loss function designs, better utilization of weak
Category Methods
@5° @10° @20° supervision signals, and broader application domains. Explor-
SP+SGMNet [70] 40.5 59.0 72.6 ing mechanisms for weakly supervised learning in conjunction
SP+DenseGAP [138] 41.2 56.9 70.2
Detector-based with traditional fully supervised methods holds promise for en-
SP+SuperGlue [69] 42.2 61.2 75.9
SP+ClusterGNN [71] 44.2 58.5 70.3 hancing the performance and generalization capabilities of lo-
Patch2Pix [172] 41.4 56.3 68.3 cal feature matching in complex scenes.
ECO-TR [158] 48.3 65.8 78.5
PDC-Net+ [146] 51.5 67.2 78.5
3DG-STFM [247] 52.6 68.5 80.0
7.4. Foundation Segmentation Models
SE2-LoFTR [159] 52.6 69.2 81.4 Typically, semantic segmentation models, trained on
LoFTR [72] 52.8 69.2 81.2 datasets such as Cityscapes [257] and MIT ADE20k [258], of-
MatchFormer [162] 52.9 69.7 82.0
TopicFM [164] 54.1 70.1 81.6 fer fundamental semantic information and play a crucial role
Detector-free in enhancing the detection and description processes of specific
QuadTree [160] 54.6 70.5 82.2
ASpanFormer [73] 55.3 71.5 83.1 environments [128, 129].
OAMatcher [166] 56.6 72.3 83.6 However, the advent of large foundation models such as
DeepMatcher-L [142] 57.0 73.1 84.2
SEM [169] 58.0 72.9 83.7 SAM [259], DINO [260], and DINOv2 [261] marks a new era
ASTR [165] 58.4 73.1 83.8 in artificial intelligence. While traditional segmentation mod-
CasMTR-2c [167] 59.1 74.3 84.8 els excel in their specific domains, these foundation models
DKM [170] 60.4 74.9 85.1
introduce a broader, more versatile approach. Their extensive
PMatch [168] 61.4 75.7 85.7
RoMa [171] 62.6 76.7 86.3 pre-training on massive, diverse datasets equips them with a
remarkable zero-shot generalization capability, enabling them
to adapt to a wide range of scenarios. For instance, SAM-
learning. Their approach establishes a standard shape and ori- Feat [42] demonstrates how SAM, a model adept at segment-
entation for each feature, enabling a transition from supervised ing ”anything” in ”any scene,” can guide local feature learn-
to self-supervised learning by eliminating the need for known ing with its rich, category-agnostic semantic knowledge. By
feature matching relationships. They also introduce a ’weak distilling fine-grained semantic relations and focusing on edge
match finder’ in descriptor learning, enhancing feature appear- detection, SAMFeat showcases a significant enhancement in lo-
ance variability and improving descriptor invariance. These ad- cal feature description and accuracy. Similarly, SelaVPR [262]
vancements signify a significant progress in weakly supervised demonstrates how to effectively adjust the DINOv2 model us-
learning for feature matching, especially in cases involving sub- ing lightweight adapters to address the challenges in Visual
stantial viewpoint and viewing direction changes. Place Recognition (VPR) by proficiently matching local fea-
These weakly supervised methods open up new prospects tures without the need for extensive spatial verification, thus
and opportunities for local feature learning, allowing models streamlining the retrieval process.
to be trained on larger and more diverse datasets, thereby ob- Looking towards an open-world scenario, the versatility and
taining more generalized descriptors. However, these methods robust generalization offered by these large foundation models
still face challenges, such as effectively leveraging weak su- present exciting prospects. Their ability to understand and in-
pervision signals and addressing the uncertainty of descriptors terpret a vast array of scenes and objects far exceeds the scope
19
Table 5: Visual localization evaluation on the Aachen Day-Night benchmark v1.0 [230] and v1.1 [9]. The evaluation results on the full visual localization track are
reported. We compare with two groups of the methods, Detector-based and Detector-free methods.

Day Night
Dataset Category Methods
(0.25m,2°) (0.5m,5°) (1.0m,10°) (0.25m,2°) (0.5m,5°) (1.0m,10°)
SP+NN [61] 85.4 93.3 97.2 75.5 86.7 92.9
SP+CAPS [111]+NN 86.3 93.0 95.9 83.7 90.8 96.9
SP+SuperGlue [69] 89.6 95.4 98.8 86.7 93.9 100.0
SP+SGMNet [70] 86.8 94.2 97.7 83.7 91.8 99.0
SP+ClusterGNN [71] 89.4 95.5 98.5 81.6 93.9 100.0
Detector-based
v1.0 SP+LightGlue [137] 89.2 95.4 98.5 87.8 93.9 100.0
ASLFeat [127]+NN 82.3 89.2 92.7 67.3 79.6 85.7
ASLFeat [127]+SGMNet [70] 86.8 93.4 97.1 86.7 94.9 98.0
ASLFeat [127]+SuperGlue [69] 87.9 95.4 98.3 81.6 91.8 99.0
ASLFeat [127]+ClusterGNN [71] 88.6 95.5 98.4 85.7 93.9 99.0
Detector-free Patch2Pix [172] 84.6 92.1 96.5 82.7 92.9 99.0
ISRF [251] 87.1 94.7 98.3 74.3 86.9 97.4
Rlocs [252] 88.8 95.4 99.0 74.3 90.6 98.4
KAPTURE+R2D2+APGeM [253] 90.0 96.2 99.5 72.3 86.4 97.9
Detector-based SP+SuperGlue [69] 89.8 96.1 99.4 77.0 90.6 100.0
SP+SuperGlue [69]+Patch2Pix [172] 89.3 95.8 99.2 78.0 90.6 99.0
SP+GFM [246] 90.2 96.4 99.5 74.0 91.5 99.5
SP+LightGlue [137] 90.2 96.0 99.4 77.0 91.1 100.0
Patch2Pix [172] 86.4 93.0 97.5 72.3 88.5 97.9
v1.1
LoFTR-DS [72] — — — 72.8 88.5 99.0
LoFTR-OT [72] 88.7 95.6 99.0 78.5 90.6 99.0
ASpanFormer [73] 89.4 95.6 99.0 77.5 91.6 99.5
Detector-free AdaMatcher-LoFTR [173] 89.2 96.0 99.3 79.1 90.6 99.5
AdaMatcher-Quad [173] 89.2 95.9 99.2 79.1 92.1 99.5
ASTR [165] 89.9 95.6 99.2 76.4 92.1 99.5
TopicFM [164] 90.2 95.9 98.9 77.5 91.1 99.5
CasMTR [167] 90.4 96.2 99.3 78.5 91.6 99.5

Table 6: Visual localization evaluation on the Aachen Day-Night benchmark v1.0 [230] and v1.1 [9]. The evaluation results on the local feature evaluation track are
reported. We compare with two groups of the methods, Detector-based and Detector-free methods.

Aachen Day-Night v1.0 Aachen Day-Night v1.1

Category Method
(0.5m,2°) (1m,5°) (5m, 10°) (0.5m,2°) (1m,5°) (5m, 10°)
SP [61] 74.5 78.6 89.8 — — —
D2Net [62] 74.5 86.7 100.0 — — —
R2D2 [63](K=20k) 76.5 90.8 100.0 71.2 86.9 97.9
ASLFeat [127] 81.6 87.8 100.0 75.4 75.4 97.6
ISRF [251] — — — 69.1 87.4 98.4
PoSFeat [131] 81.6 90.8 100.0 73.8 87.4 98.4
SIFT+CAPS [111] 77.6 86.7 99.0 — — —
Detector-based SP+S2DNet [254] 74.5 84.7 100.0 — — —
SP+CAPS [111] 82.7 87.8 100.0 — — —
SP+OANet [77] 77.6 86.7 98.0 — — —
SP+SGMNet [70] 77.6 88.8 99.0 72.3 85.3 97.9
SP+SuperGlue [69] 79.6 90.8 100.0 73.3 88.0 98.4
DSD-MatchingNet [255] 80.1 90.3 100.0 73.0 88.0 99.3
Patch2Pix [172] 79.6 87.8 100.0 72.3 88.5 97.9
Sparse-NCNet [65] 76.5 84.7 98.0 — — —
DualRC-Net [66] 79.6 88.8 100.0 71.2 86.9 97.9
Detector-free
PDC-Net [68] 76.5 85.7 100.0 — — —
PDC-Net+ [146] 80.6 89.8 100.0 — — —
LoFTR [72] — — — 72.8 88.5 99.0

of traditional segmentation networks, paving the way for ad- 7.5. Mismatch Removal
vancements in feature matching across diverse and dynamic Image matching, involving the establishment of reliable
environments. In summary, while the contributions of tradi- connections between two images portraying a shared object or
tional semantic segmentation networks remain invaluable, the scene, poses intricate challenges due to the combinatorial na-
integration of large foundation models offers a complementary ture of the process and the presence of outliers. Direct matching
and expansive approach, essential for pushing the boundaries methodologies, such as point set registration and graph match-
of what is achievable in feature matching, particularly in open- ing, often grapple with formidable computational demands and
world applications. erratic performance. Consequently, a bifurcated approach,
commencing with preliminary match construction utilizing fea-
20
Table 7: Visual Localization on the InLoc benchmark [116]. We compare with two groups of the methods, Detector-based and Detector-free methods.

DUC1 DUC2
Category Methods
(0.25m, 10°) (0.5m, 10°) (1m, 10°) (0.25m, 10°) (0.5m, 10°) (1m, 10°)
SIFT+CAPS [111]+NN 38.4 56.6 70.7 35.1 48.9 58.8
ISRF [251] 39.4 58.1 70.2 41.2 61.1 69.5
D2Net [62]+NN 38.4 56.1 71.2 37.4 55.0 64.9
R2D2 [63]+NN 36.4 57.6 74.2 45.0 60.3 67.9
KAPTURE [253]+R2D2 [63] 41.4 60.1 73.7 47.3 67.2 73.3
SeLF [128]+NN 41.4 61.6 73.2 44.3 61.1 68.7
AWDesc [103]+NN 41.9 61.6 72.2 45.0 61.1 70.2
Detector-based ASLFeat [127]+NN 39.9 59.1 71.7 43.5 58.8 64.9
ASLFeat [127]+SGMNet [70] 43.9 62.1 68.2 45.0 63.4 73.3
ASLFeat [127]+SuperGlue [69] 51.5 66.7 75.8 53.4 76.3 84.0
ASLFeat [127]+ClusterGNN [71] 52.5 68.7 76.8 55.0 76.0 82.4
SP+NN [61] 40.4 58.1 69.7 42.0 58.8 69.5
SP+ClusterGNN [71] 47.5 69.7 79.8 53.4 77.1 84.7
SP+SuperGlue [69] 49.0 68.7 80.8 53.4 77.1 82.4
SP+CAPS [111]+NN 40.9 60.6 72.7 43.5 58.8 68.7
SP+LightGlue [137] 49.0 68.2 79.3 55.0 74.8 79.4
Sparse-NCNet [65] 41.9 62.1 72.7 35.1 48.1 55.0
MTLDesc [102] 41.9 61.6 72.2 45.0 61.1 70.2
COTR [157] 41.9 61.1 73.2 42.7 67.9 75.6
Patch2Pix [172] 44.4 66.7 78.3 49.6 64.9 72.5
LoFTR-OT [72] 47.5 72.2 84.8 54.2 74.8 85.5
MatchFormer [162] 46.5 73.2 85.9 55.7 71.8 81.7
ASpanFormer [73] 51.5 73.7 86.4 55.0 74.0 81.7
Detector-free
TopicFM [164] 52.0 74.7 87.4 53.4 74.8 83.2
GlueStick [136] 49.0 70.2 84.3 55.0 83.2 87.0
SEM [169] 52.0 74.2 87.4 50.4 76.3 83.2
ASTR [165] 53.0 73.7 87.4 52.7 76.3 84.0
CasMTR [167] 53.5 76.8 85.4 51.9 70.2 83.2
ASTR [165] 53.0 73.7 87.4 52.7 76.3 84.0
PATS [174] 55.6 71.2 81.0 58.8 80.9 85.5

Table 8: Optical flow results on the training splits of KITTI [233]. Following ination, as highlighted by Ma et al. [264]. Traditional meth-
[157, 158], (”+intp”) represents interpolating the output of the model to ob-
tain the corresponding relationship per pixel. † means evaluated it with Dense- ods, epitomized by RANSAC [39] and its variants such as
Matching tools provided by the authors of GLU-Net. This part contains generic USAC [265] and MAGSAC++ [266], have significantly im-
matching networks. proved efficiency and accuracy in outlier rejection. Nonethe-
less, these approaches are limited by computational time con-
KITTI-2012 KITTI-2015
Methods
APAE↓ Fl-all[%]↓ APAE↓ Fl-all[%]↓
straints and their suitability for non-rigid contexts. Techniques
DGC-Net [243] 8.50 32.28 14.97 50.98 specific to non-rigid scenarios, like ICF [267], have shown
DGC-Net† [243] 7.96 34.35 14.33 50.35 efficacy in addressing geometric distortions. The Advent of
GLU-Net [67] 3.14 19.76 7.49 33.83 Learning-Driven Strategies in Mismatch Elimination The inte-
GLU-Net+GOCor [143] 2.68 15.43 6.68 27.57
RANSAC-Flow [244] — — 12.48 —
gration of deep learning into mismatch elimination has opened
COTR [157] 1.28 7.36 2.62 9.92 new pathways for enhancing feature matching. Yi et al. [248]
COTR+Intp. [157] 2.26 10.50 6.12 16.90 introduced Context Normalization (CNe), a groundbreaking
PDC-Net(D) [68] 2.08 7.98 5.22 15.13 concept that has transformed wide-baseline stereo correspon-
PDC-Net+(D) [146] 1.76 6.60 4.53 12.62
COTR† [157] 1.15 6.98 2.06 9.14
dences by effectively distinguishing inliers from outliers. Ex-
COTR†+Intp. [157] 1.47 8.79 3.65 13.65 panding upon this, Sun et al. [268] developed Attentive Con-
ECO-TR [158] 0.96 3.77 1.40 6.39 text Networks (ACNe), which improved the management of
ECO-TR+Intp. [158] 1.46 6.64 3.16 12.10 permutation-equivariant data through Attentive Context Nor-
PATS+Intp. [174] 1.17 4.04 3.39 9.68
malization, yielding significant advancements in camera pose
estimation and point cloud classification. Zhang et al. [77]
ture descriptors like SIFT, ORB, and SURF [33, 35, 34], suc- proposed the OANet, a novel methodology that precisely de-
ceeded by the application of local and global geometrical con- termines two-view correspondences and bolsters geometric es-
straints, has become a prevalent strategy. Nevertheless, these timations using a hierarchical clustering approach. Zhao et
methodologies encounter constraints, notably when confronted al. [269] introduced NM-Net, a layered network focusing on
with multi-modal images or substantial variations in viewpoint the selection of feature correspondences with compatibility-
and lighting [263]. specific mining, demonstrating outstanding performance in var-
The evolution of methodologies for outlier rejection has ious settings. Shape-Former [270] innovatively addresses mul-
been crucial in overcoming the challenges of mismatch elim- timodal and multiview image matching challenges, focusing
on robust mismatch removal through a hybrid neural network.
21
Table 9: Evaluation of SfM Methods on the ETH3D [239] for Multi-View Camera Pose Estimation and 3D Triangulation. The table segregates methods into
Detector-based and Detector-free categories. The results are derived from the DetectorFreeSfM [182].

Multi-View Camera Pose Estimation 3D Triangulation

Category Methods AUC Accuracy (%) Completeness (%)
@1° @3° @5° 1cm 2cm 5cm 1cm 2cm 5cm
SIFT [17]+NN+PixSfM [177] 26.94 39.01 42.19 76.18 85.60 93.16 0.17 0.71 3.29
D2Net [62]+NN+PixSfM [177] 34.50 49.77 53.58 74.75 83.81 91.98 0.83 2.69 8.95
Detector-based
R2D2 [63]+NN+PixSfM [177] 43.58 62.09 66.89 74.12 84.49 91.98 0.43 1.58 6.71
SP+SG [69]+PixSfM [177] 50.82 68.52 72.86 79.01 87.04 93.80 0.75 2.77 11.28
LoFTR [72]+DetectorFreeSfM [182] 59.12 75.59 79.53 80.38 89.01 95.83 3.73 11.07 29.54
Detector-free ASpanFormer [73]+DetectorFreeSfM [182] 57.23 73.71 77.70 77.63 87.40 95.02 3.97 12.18 32.42
MatchFormer [162]+DetectorFreeSfM [182] 56.70 73.00 76.84 79.86 88.51 95.48 3.76 11.06 29.05

Leveraging CNNs and Transformers, Shape-Former introduces 7.6. Deep Learning and Handcrafted Analogies
ShapeConv for sparse matches learning, excelling in outlier es- The field of image matching is witnessing a unique blend of
timation and consensus representation while showcasing supe- deep learning and traditional handcrafted techniques. This con-
rior performance. Given that RANSAC is an integral part of vergence is evident in the adoption of foundational elements
the matching pipeline, recent innovations have significantly en- from classic methods, such as the ”SIFT” pipeline, in recent
hanced its integration with deep learning approaches for im- semi-dense, detector-free approaches. Notable examples of this
proved performance. DSAC [271] introduces a paradigm shift trend include the Hybrid Pipeline (HP) by Bellavia et al.[274],
by making RANSAC differentiable, employing a probabilistic HarrisZ+[275], and Slime [276], all demonstrating competitive
selection mechanism that facilitates its integration into end-to- capabilities alongside state-of-the-art deep methods. The HP
end trainable deep learning pipelines. This innovative approach method integrates handcrafted and learning-based approaches,
not only maintains the robustness of traditional RANSAC but maintaining crucial rotational invariance for photogrammetric
also leverages deep learning to directly minimize expected surveys. It features the novel Keypoint Filtering by Coverage
losses. On the other hand, CA-RANSAC [272] evolves the (KFC) module, enhancing the accuracy of the overall pipeline.
RANSAC framework by incorporating an adaptive consen- HarrisZ+ represents an evolution of the classic Harris corner
sus mechanism through a novel attention layer. This mech- detector, optimized to synergize with modern image match-
anism dynamically refines per-point estimation states based ing components. It yields more discriminative and accurately
on accumulated residuals, enhancing model refinement and placed keypoints, aligning closely with results from contem-
sample selection. Recent developments like LSVANet [273], porary deep learning models. Slime, employs a novel strat-
LGSC [263], and HCA-Net [185], have shown promise in more egy of modeling scenes with local overlapping planes, merg-
effectively discerning inliers from outliers. These approaches ing local affine approximation principles with global match-
leverage deep learning modules for geometric estimation and ing constraints. This hybrid approach echoes traditional im-
feature correspondence categorization, signifying an advance- age matching processes, challenging the performance of deep
ment over conventional methods. learning methods. Meanwhile, it is important to highlight a
Primarily, the development of more generalized and robust key distinction between deep learning methods and traditional
learning-based methodologies adept at handling a diverse ar- handcrafted detectors in recent advancements in image match-
ray of scenarios, encompassing non-rigid transformations and ing: rotation equivariance. Despite the remarkable matching
multi-modal images, is imperative. Secondly, there exists a ne- performances of modern methods, they often fall short in han-
cessity for methodologies that amalgamate the virtues of both dling in-plane rotations—a fundamental feature inherently in-
traditional geometric approaches and contemporary learning- tegrated into handcrafted detectors. This oversight reveals a
based techniques. Such hybrid approaches hold the potential to performance gap under rotational transformations, underscor-
deliver superior performance by capitalizing on the strengths of ing the importance of designing or training deep learning mod-
both paradigms. Lastly, the exploration of innovative learning els to explicitly address this challenge. By focusing on the de-
architectures and loss functions tailored for mismatch elimina- velopment of rotation-equivariant approaches, as exemplified
tion can unveil novel prospects in feature matching, elevating by SE2-LoFTR [159] and S-TREK [98], the field moves closer
the overall resilience and precision of computer vision systems. to bridging this gap, combining the precision of deep learning
In conclusion, the elimination of mismatches persists as a piv- with the robustness of handcrafted detectors against orientation
otal yet formidable facet of local feature matching. The on- changes.
going evolution of both traditional and learning-based method- These advancements signify that despite the significant
ologies unfolds promising trajectories to address extant limita- strides made by deep learning methods like LoFTR and Su-
tions and unlock newfound potentials in computer vision appli- perGlue, the foundational principles of handcrafted techniques
cations. remain vital. The integration of classical concepts with mod-
ern computational power, as seen in HP, HarrisZ+, Slime, SE2-
LoFTR, and S-TREK, leads to robust image matching solu-
tions. These methods offer potential avenues for future research

22
that blends diverse methodologies, bridging the gap between cal datasets [55]. Additionally, the computational intensity re-
traditional and modern approaches in image matching. quired to process high-resolution images presents a significant
hurdle, particularly when aiming for real-time application in
7.7. Utilizing geometric information web and VR/AR platforms [279]. These technical challenges
When facing challenges such as texturelessness, occlusion, underscore the necessity for ongoing research and development
and repetitive patterns, traditional local feature matching meth- to refine deep learning algorithms and optimize them for the
ods may perform poorly. In recent years, researchers have specific demands of cultural heritage applications.
started to focus on better utilizing geometric information to en- Moreover, the dynamic nature of deep learning presents
hance the effectiveness of local feature matching in the presence an opportunity to continually improve the accuracy and effi-
of these challenges. Several studies [169, 170, 146, 168] have ciency of historical image matching processes. As algorithms
indicated that leveraging geometric information holds signifi- evolve and new approaches emerge, there is the potential for
cant potential for local feature matching. By capturing geomet- even more sophisticated methods of feature detection, extrac-
ric relationships between pixels more accurately and combin- tion, and matching that could further revolutionize the field.
ing geometric priors with image appearance information, these The exploration of these novel strategies, alongside the adap-
methods can enhance the robustness and accuracy of matching tation of current methodologies to the unique characteristics
in complex scenes. However, this direction presents numerous of historical imagery, is essential for advancing our ability to
opportunities and challenges for future development. Firstly, digitally preserve and explore our cultural heritage [55]. In
how to model geometric information more profoundly to bet- summary, the confluence of deep learning with historical im-
ter address scenarios involving large displacements, occlusions, age matching offers a promising pathway towards bridging the
and textureless regions remains a critical question. Secondly, gap between the past and present. Future research can unlock
improving the performance of confidence estimation to yield new possibilities for the preservation, understanding, and dis-
more reliable matching results is also a direction worthy of in- semination of cultural heritage, making it more accessible and
vestigation. engaging for generations to come.
The introduction of geometric priors expands feature
matching beyond mere appearance similarity to consider an 8. Conclusions
object’s behavior from different viewpoints. This trend sug-
gests that dense matching methods hold promise for tackling We have investigated various algorithms related to local fea-
challenges posed by large displacements and appearance vari- ture matching based on deep learning models over the past five
ations. It also implies that the future development in the field years. These algorithms have demonstrated impressive perfor-
of geometric matching may increasingly focus on dense feature mance in various local feature matching tasks and benchmark
matching, leveraging geometric information and prior knowl- tests. They can be broadly categorized into Detector-based
edge to enhance matching performance. Models and Detector-free Models. The application of feature
detectors reduces the scope of matching and relies on the pro-
7.8. Advancing Cultural Heritage Preservation cesses of keypoint detection and feature description. Detector-
free methods, on the other hand, directly capture richer con-
The integration of deep learning into the realm of histor-
text from the raw images to generate dense matches. Subse-
ical image matching heralds a new era for cultural heritage
quently, we discuss the strengths and weaknesses of existing
preservation, offering unparalleled opportunities while present-
local feature matching algorithms, introduce popular datasets
ing unique challenges. Insights from recent research under-
and evaluation standards, and summarize the quantitative per-
score the potential of advanced deep learning methods to over-
formance analysis of these models on some common bench-
come the limitations of traditional techniques [277]. These ap-
marks such as HPatches, ScanNet, YFCC100M, MegaDepth,
proaches have shown remarkable resilience to the inherent ra-
and Aachen Day-Night datasets. Lastly, we explore the open
diometric and geometric discrepancies that plague the match-
challenges and potential research avenues that the field of local
ing of historical and contemporary imagery, thereby facilitat-
feature matching may encounter in the forthcoming years. Our
ing the accurate co-registration of multi-temporal scenes. Tech-
aim is not only to enhance researchers’ understanding of local
niques such as SuperGlue [69], LoFTR [72], and DISK [112]
feature matching but also to inspire and guide future research
have been identified as particularly effective, surpassing classi-
endeavors in this domain.
cal methods by achieving higher robustness against severe illu-
mination and viewpoint shifts. This advancement enables more
accurate reconstructions of historical sites and artifacts [278], References
thus enhancing knowledge transfer and cultural heritage promo- [1] L. Tang, J. Yuan, J. Ma, Image fusion in the loop of high-level vision
tion through immersive technologies like virtual and augmented tasks: A semantic-aware real-time infrared and visible image fusion net-
reality. work, Information Fusion 82 (2022) 28–42.
[2] S.-Y. Cao, B. Yu, L. Luo, R. Zhang, S.-J. Chen, C. Li, H.-L. Shen, Pc-
However, the journey towards fully harnessing the capa- net: A structure similarity enhancement method for multispectral and
bilities of deep learning in this domain is not without its ob- multimodal image registration, Information Fusion 94 (2023) 200–214.
stacles. Challenges persist in managing extensive image ro- [3] M. Hu, B. Sun, X. Kang, S. Li, Multiscale structural feature transform
tations and variations in scale, which are common in histori- for multi-modal image matching, Information Fusion 95 (2023) 341–
354.

23
[4] K. Sun, J. Yu, W. Tao, X. Li, C. Tang, Y. Qian, A unified feature-spatial object detection via frequency domain disentanglement, in: Proceedings
cycle consistency fusion framework for robust image matching, Infor- of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
mation Fusion 97 (2023) 101810. tion, 2023, pp. 1064–1073.
[5] Z. Hou, Y. Liu, L. Zhang, Pos-gift: A geometric and intensity-invariant [25] C. Cao, X. Fu, H. Liu, Y. Huang, K. Wang, J. Luo, Z.-J. Zha, Event-
feature transformation for multimodal images, Information Fusion 102 guided person re-identification via sparse-dense complementary learn-
(2024) 102027. ing, in: Proceedings of the IEEE/CVF Conference on Computer Vision
[6] T. Sattler, B. Leibe, L. Kobbelt, Improving image-based localization by and Pattern Recognition, 2023, pp. 17990–17999.
active correspondence search, in: Computer Vision–ECCV 2012: 12th [26] C. Harris, M. Stephens, et al., A combined corner and edge detector, in:
European Conference on Computer Vision, Florence, Italy, October 7- Alvey vision conference, Vol. 15, Citeseer, 1988, pp. 10–5244.
13, 2012, Proceedings, Part I 12, Springer, 2012, pp. 752–765. [27] S. M. Smith, J. M. Brady, Susan—a new approach to low level image
[7] T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, T. Pa- processing, International journal of computer vision 23 (1) (1997) 45–
jdla, Are large-scale 3d models really necessary for accurate visual local- 78.
ization?, in: Proceedings of the IEEE Conference on Computer Vision [28] E. Rosten, T. Drummond, Fusing points and lines for high performance
and Pattern Recognition, 2017, pp. 1637–1646. tracking, in: Tenth IEEE International Conference on Computer Vision
[8] S. Cai, Y. Guo, S. Khan, J. Hu, G. Wen, Ground-to-aerial image geo- (ICCV’05) Volume 1, Vol. 2, Ieee, 2005, pp. 1508–1515.
localization with a hard exemplar reweighting triplet loss, in: Proceed- [29] J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide-baseline stereo
ings of the IEEE/CVF International Conference on Computer Vision, from maximally stable extremal regions, Image and vision computing
2019, pp. 8391–8400. 22 (10) (2004) 761–767.
[9] Z. Zhang, T. Sattler, D. Scaramuzza, Reference pose generation for long- [30] N. Dalal, B. Triggs, Histograms of oriented gradients for human detec-
term visual localization via learned features and view synthesis, Interna- tion, in: 2005 IEEE computer society conference on computer vision
tional Journal of Computer Vision 129 (2021) 821–844. and pattern recognition (CVPR’05), Vol. 1, Ieee, 2005, pp. 886–893.
[10] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. [31] K. Mikolajczyk, C. Schmid, A performance evaluation of local descrip-
Seitz, R. Szeliski, Building rome in a day, Communications of the ACM tors, IEEE transactions on pattern analysis and machine intelligence
54 (10) (2011) 105–112. 27 (10) (2005) 1615–1630.
[11] J. Heinly, J. L. Schonberger, E. Dunn, J.-M. Frahm, Reconstructing the [32] M. Calonder, V. Lepetit, C. Strecha, P. Fua, Brief: Binary robust inde-
world* in six days*(as captured by the yahoo 100 million image dataset), pendent elementary features, in: Computer Vision–ECCV 2010: 11th
in: Proceedings of the IEEE conference on computer vision and pattern European Conference on Computer Vision, Heraklion, Crete, Greece,
recognition, 2015, pp. 3287–3295. September 5-11, 2010, Proceedings, Part IV 11, Springer, 2010, pp.
[12] J. L. Schonberger, J.-M. Frahm, Structure-from-motion revisited, in: 778–792.
Proceedings of the IEEE conference on computer vision and pattern [33] D. G. Lowe, Distinctive image features from scale-invariant keypoints,
recognition, 2016, pp. 4104–4113. International journal of computer vision 60 (2004) 91–110.
[13] J. Wang, Y. Zhong, Y. Dai, S. Birchfield, K. Zhang, N. Smolyanskiy, [34] H. Bay, T. Tuytelaars, L. Van Gool, Surf: Speeded up robust features,
H. Li, Deep two-view structure-from-motion revisited, in: Proceedings Lecture notes in computer science 3951 (2006) 404–417.
of the IEEE/CVF conference on Computer Vision and Pattern Recogni- [35] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, Orb: An efficient al-
tion, 2021, pp. 8953–8962. ternative to sift or surf, in: 2011 International conference on computer
[14] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, vision, Ieee, 2011, pp. 2564–2571.
I. Reid, J. J. Leonard, Past, present, and future of simultaneous localiza- [36] S. Leutenegger, M. Chli, R. Y. Siegwart, Brisk: Binary robust invari-
tion and mapping: Toward the robust-perception age, IEEE Transactions ant scalable keypoints, in: 2011 International conference on computer
on robotics 32 (6) (2016) 1309–1332. vision, Ieee, 2011, pp. 2548–2555.
[15] R. Mur-Artal, J. D. Tardós, Orb-slam2: An open-source slam system [37] P. F. Alcantarilla, A. Bartoli, A. J. Davison, Kaze features, in: Computer
for monocular, stereo, and rgb-d cameras, IEEE transactions on robotics Vision–ECCV 2012: 12th European Conference on Computer Vision,
33 (5) (2017) 1255–1262. Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12, Springer,
[16] Y. Zhao, S. Xu, S. Bu, H. Jiang, P. Han, Gslam: A general slam frame- 2012, pp. 214–227.
work and benchmark, in: Proceedings of the IEEE/CVF International [38] P. F. Alcantarilla, T. Solutions, Fast explicit diffusion for accelerated
Conference on Computer Vision, 2019, pp. 1110–1120. features in nonlinear scale spaces, IEEE Trans. Patt. Anal. Mach. Intell
[17] C. Liu, J. Yuen, A. Torralba, Sift flow: Dense correspondence across 34 (7) (2011) 1281–1298.
scenes and its applications, IEEE transactions on pattern analysis and [39] M. A. Fischler, R. C. Bolles, Random sample consensus: a paradigm for
machine intelligence 33 (5) (2010) 978–994. model fitting with applications to image analysis and automated cartog-
[18] P. Weinzaepfel, J. Revaud, Z. Harchaoui, C. Schmid, Deepflow: Large raphy, Communications of the ACM 24 (6) (1981) 381–395.
displacement optical flow with deep matching, in: Proceedings of the [40] R. Xu, C. Wang, B. Fan, Y. Zhang, S. Xu, W. Meng, X. Zhang, Domain-
IEEE international conference on computer vision, 2013, pp. 1385– desc: Learning local descriptors with domain adaptation, in: ICASSP
1392. 2022-2022 IEEE International Conference on Acoustics, Speech and
[19] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, Signal Processing (ICASSP), IEEE, 2022, pp. 2505–2509.
P. Van Der Smagt, D. Cremers, T. Brox, Flownet: Learning optical flow [41] R. Xu, C. Wang, S. Xu, W. Meng, Y. Zhang, B. Fan, X. Zhang, Domain-
with convolutional networks, in: Proceedings of the IEEE international feat: Learning local features with domain adaptation, IEEE Transactions
conference on computer vision, 2015, pp. 2758–2766. on Circuits and Systems for Video Technology (2023).
[20] F. Radenović, G. Tolias, O. Chum, Fine-tuning cnn image retrieval with [42] J. Wu, R. Xu, Z. Wood-Doughty, C. Wang, Segment anything model is a
no human annotation, IEEE transactions on pattern analysis and machine good teacher for local feature learning, arXiv preprint arXiv:2309.16992
intelligence 41 (7) (2018) 1655–1668. (2023).
[21] B. Cao, A. Araujo, J. Sim, Unifying deep local and global features for [43] X. Jiang, J. Ma, G. Xiao, Z. Shao, X. Guo, A review of multimodal im-
image search, in: Computer Vision–ECCV 2020: 16th European Con- age matching: Methods and applications, Information Fusion 73 (2021)
ference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, 22–71.
Springer, 2020, pp. 726–743. [44] R. Xu, Y. Li, C. Wang, S. Xu, W. Meng, X. Zhang, Instance segmenta-
[22] P. Chhabra, N. K. Garg, M. Kumar, Content-based image retrieval sys- tion of biological images using graph convolutional network, Engineer-
tem using orb and sift features, Neural Computing and Applications 32 ing Applications of Artificial Intelligence 110 (2022) 104739.
(2020) 2725–2733. [45] R. Xu, C. Wang, S. Xu, W. Meng, X. Zhang, Dc-net: Dual context net-
[23] D. Zhang, H. Li, W. Cong, R. Xu, J. Dong, X. Chen, Task relation distil- work for 2d medical image segmentation, in: Medical Image Computing
lation and prototypical pseudo label for incremental named entity recog- and Computer Assisted Intervention–MICCAI 2021: 24th International
nition, in: Proceedings of the 32nd ACM International Conference on Conference, Strasbourg, France, September 27–October 1, 2021, Pro-
Information and Knowledge Management, 2023, pp. 3319–3329. ceedings, Part I 24, Springer, 2021, pp. 503–513.
[24] K. Wang, X. Fu, Y. Huang, C. Cao, G. Shi, Z.-J. Zha, Generalized uav [46] R. Xu, C. Wang, S. Xu, W. Meng, X. Zhang, Wave-like class activation

24
map with representation fusion for weakly-supervised semantic segmen- IEEE/CVF Conference on Computer Vision and Pattern Recognition,
tation, IEEE Transactions on Multimedia (2023). 2021, pp. 5714–5724.
[47] W. Xu, R. Xu, C. Wang, S. Xu, L. Guo, M. Zhang, X. Zhang, Spectral [69] P.-E. Sarlin, D. DeTone, T. Malisiewicz, A. Rabinovich, Superglue:
prompt tuning: Unveiling unseen classes for zero-shot semantic segmen- Learning feature matching with graph neural networks, in: Proceedings
tation, arXiv preprint arXiv:2312.12754 (2023). of the IEEE/CVF conference on computer vision and pattern recogni-
[48] C. Wang, R. Xu, S. Xu, W. Meng, X. Zhang, Treating pseudo-labels tion, 2020, pp. 4938–4947.
generation as image matting for weakly supervised semantic segmen- [70] H. Chen, Z. Luo, J. Zhang, L. Zhou, X. Bai, Z. Hu, C.-L. Tai, L. Quan,
tation, in: Proceedings of the IEEE/CVF International Conference on Learning to match features with seeded graph matching network, in:
Computer Vision, 2023, pp. 755–765. Proceedings of the IEEE/CVF International Conference on Computer
[49] M. Awrangjeb, G. Lu, C. S. Fraser, Performance comparisons of Vision, 2021, pp. 6301–6310.
contour-based corner detectors, IEEE Transactions on Image Process- [71] Y. Shi, J.-X. Cai, Y. Shavit, T.-J. Mu, W. Feng, K. Zhang, Clustergnn:
ing 21 (9) (2012) 4167–4179. Cluster-based coarse-to-fine graph neural network for efficient feature
[50] Y. Li, S. Wang, Q. Tian, X. Ding, A survey of recent advances in visual matching, in: Proceedings of the IEEE/CVF Conference on Computer
feature detection, Neurocomputing 149 (2015) 736–751. Vision and Pattern Recognition, 2022, pp. 12517–12526.
[51] S. Krig, S. Krig, Interest point detector and feature descriptor survey, [72] J. Sun, Z. Shen, Y. Wang, H. Bao, X. Zhou, Loftr: Detector-free local
Computer Vision Metrics: Textbook Edition (2016) 187–246. feature matching with transformers, in: Proceedings of the IEEE/CVF
[52] K. Joshi, M. I. Patel, Recent advances in local feature detector and de- conference on computer vision and pattern recognition, 2021, pp. 8922–
scriptor: a literature survey, International Journal of Multimedia Infor- 8931.
mation Retrieval 9 (4) (2020) 231–247. [73] H. Chen, Z. Luo, L. Zhou, Y. Tian, M. Zhen, T. Fang, D. McKinnon,
[53] J. Ma, X. Jiang, A. Fan, J. Jiang, J. Yan, Image matching from hand- Y. Tsin, L. Quan, Aspanformer: Detector-free image matching with
crafted to deep features: A survey, International Journal of Computer adaptive span transformer, in: Computer Vision–ECCV 2022: 17th Eu-
Vision 129 (2021) 23–79. ropean Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings,
[54] J. Jing, T. Gao, W. Zhang, Y. Gao, C. Sun, Image feature information Part XXXII, Springer, 2022, pp. 20–36.
extraction for interest point detection: A comprehensive review, IEEE [74] J. Zhang, L. Dai, F. Meng, Q. Fan, X. Chen, K. Xu, H. Wang, 3d-aware
Transactions on Pattern Analysis and Machine Intelligence (2022). object goal navigation via simultaneous exploration and identification,
[55] F. Bellavia, C. Colombo, L. Morelli, F. Remondino, Challenges in image in: Proceedings of the IEEE/CVF Conference on Computer Vision and
matching for cultural heritage: an overview and perspective, in: Inter- Pattern Recognition, 2023, pp. 6672–6682.
national Conference on Image Analysis and Processing, Springer, 2022, [75] J. Zhang, Y. Tang, H. Wang, K. Xu, Asro-dio: Active subspace ran-
pp. 210–222. dom optimization based depth inertial odometry, IEEE Transactions on
[56] G. Haskins, U. Kruger, P. Yan, Deep learning in medical image registra- Robotics 39 (2) (2022) 1496–1508.
tion: a survey, Machine Vision and Applications 31 (2020) 1–18. [76] J. Bian, W.-Y. Lin, Y. Matsushita, S.-K. Yeung, T.-D. Nguyen, M.-M.
[57] S. Bharati, M. Mondal, P. Podder, V. Prasath, Deep learning for Cheng, Gms: Grid-based motion statistics for fast, ultra-robust feature
medical image registration: A comprehensive review, arXiv preprint correspondence, in: Proceedings of the IEEE conference on computer
arXiv:2204.11341 (2022). vision and pattern recognition, 2017, pp. 4181–4190.
[58] J. Chen, Y. Liu, S. Wei, Z. Bian, S. Subramanian, A. Carass, J. L. [77] J. Zhang, D. Sun, Z. Luo, A. Yao, L. Zhou, T. Shen, Y. Chen, L. Quan,
Prince, Y. Du, A survey on deep learning in medical image registration: H. Liao, Learning two-view correspondences and geometry using order-
New technologies, uncertainty, evaluation metrics, and beyond, arXiv aware network, in: Proceedings of the IEEE/CVF international confer-
preprint arXiv:2307.15615 (2023). ence on computer vision, 2019, pp. 5845–5854.
[59] S. Paul, U. C. Pati, A comprehensive review on remote sensing im- [78] K. M. Yi, E. Trulls, V. Lepetit, P. Fua, Lift: Learned invariant feature
age registration, International Journal of Remote Sensing 42 (14) (2021) transform, in: Computer Vision–ECCV 2016: 14th European Confer-
5396–5432. ence, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings,
[60] B. Zhu, L. Zhou, S. Pu, J. Fan, Y. Ye, Advances and challenges in multi- Part VI 14, Springer, 2016, pp. 467–483.
modal remote sensing image registration, IEEE Journal on Miniaturiza- [79] M. Brown, G. Hua, S. Winder, Discriminative learning of local image
tion for Air and Space Systems (2023). descriptors, IEEE transactions on pattern analysis and machine intelli-
[61] D. DeTone, T. Malisiewicz, A. Rabinovich, Superpoint: Self-supervised gence 33 (1) (2010) 43–57.
interest point detection and description, in: Proceedings of the IEEE [80] Y. Tian, B. Fan, F. Wu, L2-net: Deep learning of discriminative patch
conference on computer vision and pattern recognition workshops, descriptor in euclidean space, in: Proceedings of the IEEE conference
2018, pp. 224–236. on computer vision and pattern recognition, 2017, pp. 661–669.
[62] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, T. Sat- [81] K. M. Yi, Y. Verdie, P. Fua, V. Lepetit, Learning to assign orientations
tler, D2-net: A trainable cnn for joint description and detection of local to feature points, in: Proceedings of the IEEE conference on computer
features, in: Proceedings of the ieee/cvf conference on computer vision vision and pattern recognition, 2016, pp. 107–116.
and pattern recognition, 2019, pp. 8092–8101. [82] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, R. Shah, Signature verifi-
[63] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, cation using a” siamese” time delay neural network, Advances in neural
M. Humenberger, R2d2: repeatable and reliable detector and descriptor, information processing systems 6 (1993).
arXiv preprint arXiv:1906.06195 (2019). [83] A. Mishchuk, D. Mishkin, F. Radenovic, J. Matas, Working hard to
[64] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, J. Sivic, know your neighbor’s margins: Local descriptor learning loss, Advances
Neighbourhood consensus networks, Advances in neural information in neural information processing systems 30 (2017).
processing systems 31 (2018). [84] K. He, Y. Lu, S. Sclaroff, Local descriptors optimized for average pre-
[65] I. Rocco, R. Arandjelović, J. Sivic, Efficient neighbourhood consensus cision, in: Proceedings of the IEEE conference on computer vision and
networks via submanifold sparse convolutions, in: Computer Vision– pattern recognition, 2018, pp. 596–605.
ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, [85] X. Wei, Y. Zhang, Y. Gong, N. Zheng, Kernelized subspace pooling
2020, Proceedings, Part IX 16, Springer, 2020, pp. 605–621. for deep local descriptors, in: Proceedings of the IEEE conference on
[66] X. Li, K. Han, S. Li, V. Prisacariu, Dual-resolution correspondence net- computer vision and pattern recognition, 2018, pp. 1867–1875.
works, Advances in Neural Information Processing Systems 33 (2020) [86] K. Lin, J. Lu, C.-S. Chen, J. Zhou, M.-T. Sun, Unsupervised deep learn-
17346–17357. ing of compact binary descriptors, IEEE transactions on pattern analysis
[67] P. Truong, M. Danelljan, R. Timofte, Glu-net: Global-local univer- and machine intelligence 41 (6) (2018) 1501–1514.
sal network for dense flow and correspondences, in: Proceedings of [87] M. Zieba, P. Semberecki, T. El-Gaaly, T. Trzcinski, Bingan: Learning
the IEEE/CVF conference on computer vision and pattern recognition, compact binary descriptors with a regularized gan, Advances in neural
2020, pp. 6258–6268. information processing systems 31 (2018).
[68] P. Truong, M. Danelljan, L. Van Gool, R. Timofte, Learning accurate [88] L. Wei, S. Zhang, H. Yao, W. Gao, Q. Tian, Glad: Global–local-
dense correspondences and when to trust them, in: Proceedings of the alignment descriptor for scalable person re-identification, IEEE Trans-

25
actions on Multimedia 21 (4) (2018) 986–999. J. Zhong, Semantics lead all: Towards unified image registration and fu-
[89] Z. Luo, T. Shen, L. Zhou, S. Zhu, R. Zhang, Y. Yao, T. Fang, L. Quan, sion from a semantic perspective, Information Fusion 98 (2023) 101835.
Geodesc: Learning local descriptors by integrating geometry con- [109] D. Mishkin, F. Radenovic, J. Matas, Repeatability is not enough: Learn-
straints, in: Proceedings of the European conference on computer vision ing affine regions via discriminability, in: Proceedings of the European
(ECCV), 2018, pp. 168–183. conference on computer vision (ECCV), 2018, pp. 284–300.
[90] Y. Liu, Z. Shen, Z. Lin, S. Peng, H. Bao, X. Zhou, Gift: Learning [110] P. Truong, S. Apostolopoulos, A. Mosinska, S. Stucky, C. Ciller, S. D.
transformation-invariant dense visual descriptors via group cnns, Ad- Zanet, Glampoints: Greedily learned accurate match points, in: Pro-
vances in Neural Information Processing Systems 32 (2019). ceedings of the IEEE/CVF International Conference on Computer Vi-
[91] J. Lee, Y. Jeong, S. Kim, J. Min, M. Cho, Learning to distill con- sion, 2019, pp. 10732–10741.
volutional features into compact local descriptors, in: Proceedings of [111] Q. Wang, X. Zhou, B. Hariharan, N. Snavely, Learning feature descrip-
the IEEE/CVF Winter Conference on Applications of Computer Vision, tors using camera pose supervision, in: Computer Vision–ECCV 2020:
2021, pp. 898–908. 16th European Conference, Glasgow, UK, August 23–28, 2020, Pro-
[92] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, V. Balntas, Sosnet: Sec- ceedings, Part I 16, Springer, 2020, pp. 757–774.
ond order similarity regularization for local descriptor learning, in: Pro- [112] M. Tyszkiewicz, P. Fua, E. Trulls, Disk: Learning local features with
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern policy gradient, Advances in Neural Information Processing Systems 33
Recognition, 2019, pp. 11016–11025. (2020) 14254–14265.
[93] P. Ebel, A. Mishchuk, K. M. Yi, P. Fua, E. Trulls, Beyond cartesian [113] J. Lee, B. Kim, S. Kim, M. Cho, Learning rotation-equivariant features
representations for local descriptors, in: Proceedings of the IEEE/CVF for visual correspondence, in: Proceedings of the IEEE/CVF Conference
international conference on computer vision, 2019, pp. 253–262. on Computer Vision and Pattern Recognition, 2023, pp. 21887–21897.
[94] Y. Tian, A. Barroso Laguna, T. Ng, V. Balntas, K. Mikolajczyk, Hynet: [114] H. Zhou, T. Sattler, D. W. Jacobs, Evaluating local features for day-night
Learning local descriptor with hybrid similarity measure and triplet loss, matching, in: Computer Vision–ECCV 2016 Workshops: Amsterdam,
Advances in neural information processing systems 33 (2020) 7401– The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III
7412. 14, Springer, 2016, pp. 724–736.
[95] C. Wang, R. Xu, S. Xu, W. Meng, X. Zhang, Cndesc: Cross normal- [115] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg,
ization for local descriptors learning, IEEE Transactions on Multimedia D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, et al., Benchmarking 6dof
(2022). outdoor visual localization in changing conditions, in: Proceedings of
[96] A. Barroso-Laguna, E. Riba, D. Ponsa, K. Mikolajczyk, Key. net: Key- the IEEE conference on computer vision and pattern recognition, 2018,
point detection by handcrafted and learned cnn filters, in: Proceedings pp. 8601–8610.
of the IEEE/CVF international conference on computer vision, 2019, pp. [116] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic,
5836–5844. T. Pajdla, A. Torii, Inloc: Indoor visual localization with dense matching
[97] X. Zhao, X. Wu, J. Miao, W. Chen, P. C. Chen, Z. Li, Alike: Accu- and view synthesis, in: Proceedings of the IEEE Conference on Com-
rate and lightweight keypoint detection and descriptor extraction, IEEE puter Vision and Pattern Recognition, 2018, pp. 7199–7209.
Transactions on Multimedia (2022). [117] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for se-
[98] E. Santellani, C. Sormann, M. Rossi, A. Kuhn, F. Fraundorfer, S-trek: mantic segmentation, in: Proceedings of the IEEE conference on com-
Sequential translation and rotation equivariant keypoints for local fea- puter vision and pattern recognition, 2015, pp. 3431–3440.
ture extraction, in: Proceedings of the IEEE/CVF International Confer- [118] Y. Ono, E. Trulls, P. Fua, K. M. Yi, Lf-net: Learning local features from
ence on Computer Vision, 2023, pp. 9728–9737. images, Advances in neural information processing systems 31 (2018).
[99] M. Kanakis, S. Maurer, M. Spallanzani, A. Chhatkuli, L. Van Gool, Zip- [119] X. Shen, C. Wang, X. Li, Z. Yu, J. Li, C. Wen, M. Cheng, Z. He, Rf-
pypoint: Fast interest point detection, description, and matching through net: An end-to-end image matching network based on receptive field,
mixed precision discretization, in: Proceedings of the IEEE/CVF Con- in: Proceedings of the IEEE/CVF conference on computer vision and
ference on Computer Vision and Pattern Recognition, 2023, pp. 6113– pattern recognition, 2019, pp. 8132–8140.
6122. [120] A. Bhowmik, S. Gumhold, C. Rother, E. Brachmann, Reinforced fea-
[100] J. Tang, H. Kim, V. Guizilini, S. Pillai, A. Rares, Neural outlier rejection ture points: Optimizing feature detection and description for a high-level
for self-supervised keypoint learning, in: 8th International Conference task, in: Proceedings of the IEEE/CVF conference on computer vision
on Learning Representations, ICLR 2020, International Conference on and pattern recognition, 2020, pp. 4948–4957.
Learning Representations, ICLR, 2020. [121] U. S. Parihar, A. Gujarathi, K. Mehta, S. Tourani, S. Garg, M. Mil-
[101] Z. Luo, T. Shen, L. Zhou, J. Zhang, Y. Yao, S. Li, T. Fang, L. Quan, ford, K. M. Krishna, Rord: Rotation-robust descriptors and orthographic
Contextdesc: Local descriptor augmentation with cross-modality con- views for local feature matching, in: 2021 IEEE/RSJ International Con-
text, in: Proceedings of the IEEE/CVF conference on computer vision ference on Intelligent Robots and Systems (IROS), IEEE, 2021, pp.
and pattern recognition, 2019, pp. 2527–2536. 1593–1600.
[102] C. Wang, R. Xu, Y. Zhang, S. Xu, W. Meng, B. Fan, X. Zhang, Mtldesc: [122] A. Barroso-Laguna, Y. Verdie, B. Busam, K. Mikolajczyk, Hdd-net: Hy-
Looking wider to describe better, in: Proceedings of the AAAI Confer- brid detector descriptor with mutual interactive learning, in: Proceedings
ence on Artificial Intelligence, Vol. 36, 2022, pp. 2388–2396. of the Asian Conference on Computer Vision, 2020.
[103] C. Wang, R. Xu, K. Lv, S. Xu, W. Meng, Y. Zhang, B. Fan, X. Zhang, At- [123] Y. Zhang, J. Wang, S. Xu, X. Liu, X. Zhang, Mlifeat: Multi-level infor-
tention weighted local descriptors, IEEE Transactions on Pattern Anal- mation fusion based deep local features, in: Proceedings of the Asian
ysis and Machine Intelligence (2023). Conference on Computer Vision, 2020.
[104] J. Chen, S. Chen, Y. Liu, X. Chen, X. Fan, Y. Rao, C. Zhou, Y. Yang, [124] S. Suwanwimolkul, S. Komorita, K. Tasaka, Learning of low-level fea-
Igs-net: Seeking good correspondences via interactive generative structure keypoints for accurate and robust detection, in: Proceedings of
ture learning, IEEE Transactions on Geoscience and Remote Sensing 60 the IEEE/CVF Winter Conference on Applications of Computer Vision,
(2021) 1–13. 2021, pp. 2262–2271.
[105] J. Li, Q. Hu, M. Ai, Rift: Multi-modal image matching based on [125] X. Wang, Z. Liu, Y. Hu, W. Xi, W. Yu, D. Zou, Featurebooster: Boosting
radiation-variation insensitive feature transform, IEEE Transactions on feature descriptors with a lightweight neural network, in: Proceedings
Image Processing 29 (2019) 3296–3310. of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
[106] E. Rosten, T. Drummond, Machine learning for high-speed corner de- tion, 2023, pp. 7630–7639.
tection, in: Computer Vision–ECCV 2006: 9th European Conference [126] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural
I 9, Springer, 2006, pp. 430–443. information processing systems 30 (2017).
[107] S. Cui, M. Xu, A. Ma, Y. Zhong, Modality-free feature detector and [127] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang,
descriptor for multimodal remote sensing image registration, Remote L. Quan, Aslfeat: Learning local features of accurate shape and localiza-
Sensing 12 (18) (2020) 2937. tion, in: Proceedings of the IEEE/CVF conference on computer vision
[108] H. Xie, Y. Zhang, J. Qiu, X. Zhai, X. Liu, Y. Yang, S. Zhao, Y. Luo, and pattern recognition, 2020, pp. 6589–6598.

26
[128] B. Fan, J. Zhou, W. Feng, H. Pu, Y. Yang, Q. Kong, F. Wu, H. Liu, scale, arXiv preprint arXiv:2010.11929 (2020).
Learning semantic-aware local features for long term visual localization, [151] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov,
IEEE Transactions on Image Processing 31 (2022) 4842–4855. S. Zagoruyko, End-to-end object detection with transformers, in: Eu-
[129] F. Xue, I. Budvytis, R. Cipolla, Sfd2: Semantic-guided feature detec- ropean conference on computer vision, Springer, 2020, pp. 213–229.
tion and description, in: Proceedings of the IEEE/CVF Conference on [152] R. Xu, C. Wang, J. Zhang, S. Xu, W. Meng, X. Zhang, Rssformer: Fore-
Computer Vision and Pattern Recognition, 2023, pp. 5206–5216. ground saliency enhancement for remote sensing land-cover segmenta-
[130] Y. Tian, V. Balntas, T. Ng, A. Barroso-Laguna, Y. Demiris, K. Miko- tion, IEEE Transactions on Image Processing 32 (2023) 1052–1064.
lajczyk, D2d: Keypoint extraction with describe to detect approach, in: [153] R. Xu, C. Wang, J. Sun, S. Xu, W. Meng, X. Zhang, Self correspondence
Proceedings of the Asian conference on computer vision, 2020. distillation for end-to-end weakly-supervised semantic segmentation, in:
[131] K. Li, L. Wang, L. Liu, Q. Ran, K. Xu, Y. Guo, Decoupling makes Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
weakly supervised local feature better, in: Proceedings of the IEEE/CVF [154] R. Xu, C. Wang, S. Xu, W. Meng, X. Zhang, Dual-stream representation
Conference on Computer Vision and Pattern Recognition, 2022, pp. fusion learning for accurate medical image segmentation, Engineering
15838–15848. Applications of Artificial Intelligence 123 (2023) 106402.
[132] Y. Deng, J. Ma, Redfeat: Recoupling detection and description for mul- [155] W. Cong, Y. Cong, J. Dong, G. Sun, H. Ding, Gradient-semantic com-
timodal feature learning, IEEE Transactions on Image Processing 32 pensation for incremental semantic segmentation, IEEE Transactions on
(2022) 591–602. Multimedia (2023) 1–14doi:10.1109/TMM.2023.3336243.
[133] J. Sun, J. Zhu, L. Ji, Shared coupling-bridge for weakly supervised local [156] W. Cong, Y. Cong, G. Sun, Y. Liu, J. Dong, Self-paced weight con-
feature learning, arXiv preprint arXiv:2212.07047 (2022). solidation for continual learning, IEEE Transactions on Circuits and
[134] D. Zhang, F. Chen, X. Chen, Dualgats: Dual graph attention networks Systems for Video Technology (2023) 1–1doi:10.1109/TCSVT.
for emotion recognition in conversations, in: Proceedings of the 61st 2023.3304567.
Annual Meeting of the Association for Computational Linguistics (Vol- [157] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, K. M. Yi, Cotr: Corre-
ume 1: Long Papers), 2023, pp. 7395–7408. spondence transformer for matching across images, in: Proceedings of
[135] Z. Li, J. Ma, Learning feature matching via matchable keypoint-assisted the IEEE/CVF International Conference on Computer Vision, 2021, pp.
graph neural network, arXiv preprint arXiv:2307.01447 (2023). 6207–6217.
[136] R. Pautrat, I. Suárez, Y. Yu, M. Pollefeys, V. Larsson, Gluestick: Ro- [158] D. Tan, J.-J. Liu, X. Chen, C. Chen, R. Zhang, Y. Shen, S. Ding, R. Ji,
bust image matching by sticking points and lines together, arXiv preprint Eco-tr: Efficient correspondences finding via coarse-to-fine refinement,
arXiv:2304.02008 (2023). in: European Conference on Computer Vision, Springer, 2022, pp. 317–
[137] P. Lindenberger, P.-E. Sarlin, M. Pollefeys, Lightglue: Local feature 334.
matching at light speed, arXiv preprint arXiv:2306.13643 (2023). [159] G. Bökman, F. Kahl, A case for using rotation invariant features in state
[138] Z. Kuang, J. Li, M. He, T. Wang, Y. Zhao, Densegap: graph-structured of the art feature matchers, in: Proceedings of the IEEE/CVF Confer-
dense correspondence learning with anchor points, in: 2022 26th In- ence on Computer Vision and Pattern Recognition, 2022, pp. 5110–
ternational Conference on Pattern Recognition (ICPR), IEEE, 2022, pp. 5119.
542–549. [160] S. Tang, J. Zhang, S. Zhu, P. Tan, Quadtree attention for vision trans-
[139] Y. Cai, L. Li, D. Wang, X. Li, X. Liu, Htmatch: An efficient hybrid formers, arXiv preprint arXiv:2201.02767 (2022).
transformer based graph neural network for local feature matching, Sig- [161] Y. Chen, D. Huang, S. Xu, J. Liu, Y. Liu, Guide local feature match-
nal Processing 204 (2023) 108859. ing by overlap estimation, in: Proceedings of the AAAI Conference on
[140] X. Lu, Y. Yan, B. Kang, S. Du, Paraformer: Parallel attention trans- Artificial Intelligence, Vol. 36, 2022, pp. 365–373.
former for efficient feature matching, arXiv preprint arXiv:2303.00941 [162] Q. Wang, J. Zhang, K. Yang, K. Peng, R. Stiefelhagen, Matchformer:
(2023). Interleaving attention in transformers for feature matching, in: Proceed-
[141] Y. Deng, J. Ma, Resmatch: Residual attention learning for local feature ings of the Asian Conference on Computer Vision, 2022, pp. 2746–2762.
matching, arXiv preprint arXiv:2307.05180 (2023). [163] J. Ma, Y. Wang, A. Fan, G. Xiao, R. Chen, Correspondence attention
[142] T. Xie, K. Dai, K. Wang, R. Li, L. Zhao, Deepmatcher: a deep transformer: A context-sensitive network for two-view correspondence
transformer-based network for robust and accurate local feature match- learning, IEEE Transactions on Multimedia (2022).
ing, arXiv preprint arXiv:2301.02993 (2023). [164] K. T. Giang, S. Song, S. Jo, Topicfm: robust and interpretable feature
[143] P. Truong, M. Danelljan, L. V. Gool, R. Timofte, Gocor: Bringing matching with topic-assisted, arXiv preprint arXiv:2207.00328 (2022).
globally optimized correspondence volumes into your neural network, [165] J. Yu, J. Chang, J. He, T. Zhang, J. Yu, F. Wu, Adaptive spot-guided
Advances in Neural Information Processing Systems 33 (2020) 14278– transformer for consistent local feature matching, in: Proceedings of the
14290. IEEE/CVF Conference on Computer Vision and Pattern Recognition,
[144] J. Xu, R. Ranftl, V. Koltun, Accurate optical flow via direct cost vol- 2023, pp. 21898–21908.
ume processing, in: Proceedings of the IEEE Conference on Computer [166] K. Dai, T. Xie, K. Wang, Z. Jiang, R. Li, L. Zhao, Oamatcher: An over-
Vision and Pattern Recognition, 2017, pp. 1289–1297. lapping areas-based network for accurate local feature matching, arXiv
[145] Z. Teed, J. Deng, Raft: Recurrent all-pairs field transforms for optical preprint arXiv:2302.05846 (2023).
flow, in: Computer Vision–ECCV 2020: 16th European Conference, [167] C. Cao, Y. Fu, Improving transformer-based image matching by
Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, Springer, cascaded capturing spatially informative keypoints, arXiv preprint
2020, pp. 402–419. arXiv:2303.02885 (2023).
[146] P. Truong, M. Danelljan, R. Timofte, L. Van Gool, Pdc-net+: Enhanced [168] S. Zhu, X. Liu, Pmatch: Paired masked image modeling for dense geo-
probabilistic dense correspondence network, IEEE Transactions on Pat- metric matching, in: Proceedings of the IEEE/CVF Conference on Com-
tern Analysis and Machine Intelligence (2023). puter Vision and Pattern Recognition, 2023, pp. 21909–21918.
[147] J. Revaud, V. Leroy, P. Weinzaepfel, B. Chidlovskii, Pump: Pyramidal [169] J. Chang, J. Yu, T. Zhang, Structured epipolar matcher for local feature
and uniqueness matching priors for unsupervised learning of local de- matching, in: Proceedings of the IEEE/CVF Conference on Computer
scriptors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6176–6185.
Vision and Pattern Recognition, 2022, pp. 3926–3936. [170] J. Edstedt, I. Athanasiadis, M. Wadenbäck, M. Felsberg, Dkm: Dense
[148] J. Revaud, P. Weinzaepfel, Z. Harchaoui, C. Schmid, Deepmatching: kernelized feature matching for geometry estimation, in: Proceedings
Hierarchical deformable dense matching, International Journal of Com- of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
puter Vision 120 (2016) 300–323. tion, 2023, pp. 17765–17775.
[149] U. Efe, K. G. Ince, A. Alatan, Dfm: A performance baseline for deep [171] J. Edstedt, Q. Sun, G. Bökman, M. Wadenbäck, M. Felsberg, Roma:
feature matching, in: Proceedings of the IEEE/CVF conference on com- Revisiting robust losses for dense feature matching, arXiv preprint
puter vision and pattern recognition, 2021, pp. 4284–4293. arXiv:2305.15404 (2023).
[150] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, [172] Q. Zhou, T. Sattler, L. Leal-Taixe, Patch2pix: Epipolar-guided pixel-
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., level correspondences, in: Proceedings of the IEEE/CVF conference on
An image is worth 16x16 words: Transformers for image recognition at computer vision and pattern recognition, 2021, pp. 4669–4678.

27
[173] D. Huang, Y. Chen, Y. Liu, J. Liu, S. Xu, W. Wu, Y. Ding, F. Tang, 3 satellite image geometric accuracy verification using the rpc model,
C. Wang, Adaptive assignment for geometry aware local feature match- Sensors 17 (9) (2017) 2005.
ing, in: Proceedings of the IEEE/CVF Conference on Computer Vision [195] W. Ma, J. Zhang, Y. Wu, L. Jiao, H. Zhu, W. Zhao, A novel two-step
and Pattern Recognition, 2023, pp. 5425–5434. registration method for remote sensing images based on deep and local
[174] J. Ni, Y. Li, Z. Huang, H. Li, H. Bao, Z. Cui, G. Zhang, Pats: Patch features, IEEE Transactions on Geoscience and Remote Sensing 57 (7)
area transportation with subdivision for local feature matching, in: Pro- (2019) 4834–4843.
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern [196] L. Zhou, Y. Ye, T. Tang, K. Nan, Y. Qin, Robust matching for sar and
Recognition, 2023, pp. 17776–17786. optical images using multiscale convolutional gradient features, IEEE
[175] Y. Zhang, X. Zhao, D. Qian, Searching from area to point: A hierarchical Geoscience and Remote Sensing Letters 19 (2021) 1–5.
framework for semantic-geometric combined feature matching, arXiv [197] S. Cui, A. Ma, L. Zhang, M. Xu, Y. Zhong, Map-net: Sar and optical
preprint arXiv:2305.00194 (2023). image matching via image-based convolutional network with attention
[176] N. Snavely, S. M. Seitz, R. Szeliski, Photo tourism: exploring photo mechanism and spatial pyramid aggregated pooling, IEEE Transactions
collections in 3d, in: ACM siggraph 2006 papers, 2006, pp. 835–846. on Geoscience and Remote Sensing 60 (2021) 1–13.
[177] P. Lindenberger, P.-E. Sarlin, V. Larsson, M. Pollefeys, Pixel-perfect [198] Z. Wang, Y. Ma, Y. Zhang, Review of pixel-level remote sensing image
structure-from-motion with featuremetric refinement, in: Proceedings fusion based on deep learning, Information Fusion 90 (2023) 36–58.
of the IEEE/CVF international conference on computer vision, 2021, [199] Z. Bian, A. Jabri, A. A. Efros, A. Owens, Learning pixel trajectories with
pp. 5987–5997. multiscale contrastive random walks, in: Proceedings of the IEEE/CVF
[178] C. M. Parameshwara, G. Hari, C. Fermüller, N. J. Sanket, Y. Aloi- Conference on Computer Vision and Pattern Recognition, 2022, pp.
monos, Diffposenet: direct differentiable camera pose estimation, in: 6508–6519.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- [200] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, M. J. Black,
tern Recognition, 2022, pp. 6845–6854. Competitive collaboration: Joint unsupervised learning of depth, cam-
[179] J. Y. Zhang, D. Ramanan, S. Tulsiani, Relpose: Predicting probabilistic era motion, optical flow and motion segmentation, in: Proceedings of
relative rotation for single objects in the wild, in: European Conference the IEEE/CVF conference on computer vision and pattern recognition,
on Computer Vision, Springer, 2022, pp. 592–611. 2019, pp. 12240–12249.
[180] C. Tang, P. Tan, Ba-net: Dense bundle adjustment network, arXiv [201] A. W. Harley, Z. Fang, K. Fragkiadaki, Particle video revisited: Tracking
preprint arXiv:1806.04807 (2018). through occlusions using point trajectories, in: European Conference on
[181] X. Gu, W. Yuan, Z. Dai, C. Tang, S. Zhu, P. Tan, Dro: Deep recurrent Computer Vision, Springer, 2022, pp. 59–75.
optimizer for structure-from-motion, arXiv preprint arXiv:2103.13201 [202] C. Qin, S. Wang, C. Chen, W. Bai, D. Rueckert, Generative myocar-
(2021). dial motion tracking via latent space exploration with biomechanics-
[182] X. He, J. Sun, Y. Wang, S. Peng, Q. Huang, H. Bao, X. Zhou, Detector- informed prior, Medical Image Analysis 83 (2023) 102682.
free structure from motion, arXiv preprint arXiv:2306.15669 (2023). [203] M. Ye, M. Kanski, D. Yang, Q. Chang, Z. Yan, Q. Huang, L. Axel,
[183] L. H. Hughes, D. Marcos, S. Lobry, D. Tuia, M. Schmitt, A deep learn- D. Metaxas, Deeptag: An unsupervised deep learning method for mo-
ing framework for matching of sar and optical imagery, ISPRS Journal tion tracking on cardiac tagging magnetic resonance images, in: Pro-
of Photogrammetry and Remote Sensing 169 (2020) 166–179. ceedings of the IEEE/CVF conference on computer vision and pattern
[184] Y. Ye, T. Tang, B. Zhu, C. Yang, B. Li, S. Hao, A multiscale framework recognition, 2021, pp. 7261–7271.
with unsupervised learning for remote sensing image registration, IEEE [204] Z. Bian, F. Xing, J. Yu, M. Shao, Y. Liu, A. Carass, J. Woo, J. L. Prince,
Transactions on Geoscience and Remote Sensing 60 (2022) 1–15. Drimet: Deep registration-based 3d incompressible motion estimation
[185] S. Chen, J. Chen, Y. Rao, X. Chen, X. Fan, H. Bai, L. Xing, C. Zhou, in tagged-mri with application to the tongue, in: Medical Imaging with
Y. Yang, A hierarchical consensus attention network for feature match- Deep Learning, 2023.
ing of remote sensing images, IEEE Transactions on Geoscience and [205] T. Fechter, D. Baltas, One-shot learning for deformable medical image
Remote Sensing 60 (2022) 1–11. registration and periodic motion tracking, IEEE transactions on medical
[186] Y. Liu, B. N. Zhao, S. Zhao, L. Zhang, Progressive motion coherence imaging 39 (7) (2020) 2506–2517.
for remote sensing image matching, IEEE Transactions on Geoscience [206] Y. Zhang, X. Wu, H. M. Gach, H. Li, D. Yang, Groupregnet: a groupwise
and Remote Sensing 60 (2022) 1–13. one-shot deep learning-based 4d image registration method, Physics in
[187] Y. Ye, B. Zhu, T. Tang, C. Yang, Q. Xu, G. Zhang, A robust multimodal Medicine & Biology 66 (4) (2021) 045030.
remote sensing image registration method and system using steerable fil- [207] Y. Ji, Z. Zhu, Y. Wei, A one-shot lung 4d-ct image registration method
ters with first-and second-order gradients, ISPRS Journal of Photogram- with temporal-spatial features, in: 2022 IEEE Biomedical Circuits and
metry and Remote Sensing 188 (2022) 331–350. Systems Conference (BioCAS), IEEE, 2022, pp. 203–207.
[188] L. H. Hughes, M. Schmitt, L. Mou, Y. Wang, X. X. Zhu, Identifying [208] M. Z. Iqbal, I. Razzak, A. Qayyum, T. T. Nguyen, M. Tanveer,
corresponding patches in sar and optical images with a pseudo-siamese A. Sowmya, Hybrid unsupervised paradigm based deformable image
cnn, IEEE Geoscience and Remote Sensing Letters 15 (5) (2018) 784– fusion for 4d ct lung image modality, Information Fusion 102 (2024)
788. 102061.
[189] D. Quan, S. Wang, X. Liang, R. Wang, S. Fang, B. Hou, L. Jiao, Deep [209] M. Pfandler, P. Stefan, C. Mehren, M. Lazarovici, M. Weigl, Technical
generative matching network for optical and sar image registration, in: and nontechnical skills in surgery: a simulated operating room environ-
IGARSS 2018-2018 IEEE International Geoscience and Remote Sens- ment study, Spine 44 (23) (2019) E1396–E1400.
ing Symposium, IEEE, 2018, pp. 6215–6218. [210] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, P. Suetens, Mul-
[190] N. Merkle, S. Auer, R. Mueller, P. Reinartz, Exploring the potential timodality image registration by maximization of mutual information,
of conditional adversarial networks for optical and sar image matching, IEEE transactions on Medical Imaging 16 (2) (1997) 187–198.
IEEE Journal of Selected Topics in Applied Earth Observations and Re- [211] M. Unberath, C. Gao, Y. Hu, M. Judish, R. H. Taylor, M. Armand,
mote Sensing 11 (6) (2018) 1811–1820. R. Grupp, The impact of machine learning on 2d/3d registration for
[191] W. Shi, F. Su, R. Wang, J. Fan, A visual circle based image registration image-guided interventions: A systematic review and perspective, Fron-
algorithm for optical and sar imagery, in: 2012 IEEE International Geo- tiers in Robotics and AI 8 (2021) 716007.
science and Remote Sensing Symposium, IEEE, 2012, pp. 2109–2112. [212] S. Jaganathan, M. Kukla, J. Wang, K. Shetty, A. Maier, Self-supervised
[192] A. Zampieri, G. Charpiat, N. Girard, Y. Tarabalka, Multimodal image 2d/3d registration for x-ray to ct image fusion, in: Proceedings of the
alignment through a multiscale chain of neural networks with applica- IEEE/CVF Winter Conference on Applications of Computer Vision,
tion to remote sensing, in: Proceedings of the European Conference on 2023, pp. 2788–2798.
Computer Vision (ECCV), 2018, pp. 657–673. [213] D.-X. Huang, X.-H. Zhou, X.-L. Xie, S.-Q. Liu, Z.-Q. Feng, J.-L. Hao,
[193] F. Ye, Y. Su, H. Xiao, X. Zhao, W. Min, Remote sensing image registra- Z.-G. Hou, N. Ma, L. Yan, A novel two-stage framework for 2d/3d reg-
tion using convolutional neural network features, IEEE Geoscience and istration in neurological interventions, in: 2022 IEEE International Con-
Remote Sensing Letters 15 (2) (2018) 232–236. ference on Robotics and Biomimetics (ROBIO), IEEE, 2022, pp. 266–
[194] T. Wang, G. Zhang, L. Yu, R. Zhao, M. Deng, K. Xu, Multi-mode gf- 271.

28
[214] Y. Pei, Y. Zhang, H. Qin, G. Ma, Y. Guo, T. Xu, H. Zha, Non-rigid cran- [235] H. Aanæs, R. R. Jensen, G. Vogiatzis, E. Tola, A. B. Dahl, Large-scale
iofacial 2d-3d registration using cnn-based regression, in: Deep Learn- data for multiple-view stereopsis, International Journal of Computer Vi-
ing in Medical Image Analysis and Multimodal Learning for Clinical sion 120 (2016) 153–168.
Decision Support: Third International Workshop, DLMIA 2017, and [236] A. Knapitsch, J. Park, Q.-Y. Zhou, V. Koltun, Tanks and temples: Bench-
7th International Workshop, ML-CDS 2017, Held in Conjunction with marking large-scale scene reconstruction, ACM Transactions on Graph-
MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings ics (ToG) 36 (4) (2017) 1–13.
3, Springer, 2017, pp. 117–125. [237] J. L. Schonberger, H. Hardmeier, T. Sattler, M. Pollefeys, Comparative
[215] M. D. Foote, B. E. Zimmerman, A. Sawant, S. C. Joshi, Real-time 2d- evaluation of hand-crafted and learned local features, in: Proceedings of
3d deformable registration with deep learning and application to lung the IEEE conference on computer vision and pattern recognition, 2017,
radiotherapy targeting, in: Information Processing in Medical Imaging: pp. 1482–1491.
26th International Conference, IPMI 2019, Hong Kong, China, June 2– [238] K. Wilson, N. Snavely, Robust global translations with 1dsfm, in: Com-
7, 2019, Proceedings 26, Springer, 2019, pp. 265–276. puter Vision–ECCV 2014: 13th European Conference, Zurich, Switzer-
[216] W. Yu, M. Tannast, G. Zheng, Non-rigid free-form 2d–3d registration land, September 6-12, 2014, Proceedings, Part III 13, Springer, 2014,
using a b-spline-based statistical deformation model, Pattern recognition pp. 61–75.
63 (2017) 689–699. [239] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler,
[217] P. Li, Y. Pei, Y. Guo, G. Ma, T. Xu, H. Zha, Non-rigid 2d-3d registra- M. Pollefeys, A. Geiger, A multi-view stereo benchmark with high-
tion using convolutional autoencoders, in: 2020 IEEE 17th International resolution images and multi-camera videos, in: Proceedings of the IEEE
Symposium on Biomedical Imaging (ISBI), IEEE, 2020, pp. 700–704. conference on computer vision and pattern recognition, 2017, pp. 3260–
[218] G. Dong, J. Dai, N. Li, C. Zhang, W. He, L. Liu, Y. Chan, Y. Li, Y. Xie, 3269.
X. Liang, 2d/3d non-rigid image registration via two orthogonal x-ray [240] D. Marelli, L. Morelli, E. M. Farella, S. Bianco, G. Ciocca, F. Re-
projection images for lung tumor tracking, Bioengineering 10 (2) (2023) mondino, Enrich: Multi-purpose dataset for benchmarking in computer
144. vision and photogrammetry, ISPRS Journal of Photogrammetry and Re-
[219] V. Balntas, K. Lenc, A. Vedaldi, K. Mikolajczyk, Hpatches: A bench- mote Sensing 198 (2023) 84–98.
mark and evaluation of handcrafted and learned local descriptors, in: [241] C. Schmid, R. Mohr, C. Bauckhage, Evaluation of interest point detec-
Proceedings of the IEEE conference on computer vision and pattern tors, International Journal of computer vision 37 (2) (2000) 151–172.
recognition, 2017, pp. 5173–5182. [242] C. B. Choy, J. Gwak, S. Savarese, M. Chandraker, Universal correspon-
[220] X. Shen, Z. Cai, W. Yin, M. Müller, Z. Li, K. Wang, X. Chen, C. Wang, dence network, Advances in neural information processing systems 29
Gim: Learning generalizable image matcher from internet videos, arXiv (2016).
preprint arXiv:2402.11095 (2024). [243] I. Melekhov, A. Tiulpin, T. Sattler, M. Pollefeys, E. Rahtu, J. Kan-
[221] Cvpr 2020 image matching challenge, accessed March 1, 2024. nala, Dgc-net: Dense geometric correspondence network, in: 2019 IEEE
URL https://www.cs.ubc.ca/research/ Winter Conference on Applications of Computer Vision (WACV), IEEE,
image-matching-challenge/2020/ 2019, pp. 1034–1042.
[222] Cvpr 2021 image matching challenge, accessed March 1, 2024. [244] X. Shen, F. Darmon, A. A. Efros, M. Aubry, Ransac-flow: generic two-
URL https://www.cs.ubc.ca/research/ stage image alignment, in: Computer Vision–ECCV 2020: 16th Euro-
image-matching-challenge/current/ pean Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part
[223] Cvpr 2022 image matching challenge, accessed March 1, 2024. IV 16, Springer, 2020, pp. 618–637.
URL https://www.kaggle.com/competitions/ [245] P.-E. Sarlin, C. Cadena, R. Siegwart, M. Dymczyk, From coarse to fine:
image-matching-challenge-2022/overview Robust hierarchical localization at large scale, in: Proceedings of the
[224] Cvpr 2023 image matching challenge, accessed March 1, 2024. IEEE/CVF Conference on Computer Vision and Pattern Recognition,
URL https://www.kaggle.com/competitions/ 2019, pp. 12716–12725.
image-matching-challenge-2023/overview [246] X. Nan, L. Ding, Learning geometric feature embedding with transform-
[225] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, M. Nießner, ers for image matching, Sensors 22 (24) (2022) 9882.
Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Pro- [247] R. Mao, C. Bai, Y. An, F. Zhu, C. Lu, 3dg-stfm: 3d geometric guided
ceedings of the IEEE conference on computer vision and pattern recog- student-teacher feature matching, in: European Conference on Com-
nition, 2017, pp. 5828–5839. puter Vision, Springer, 2022, pp. 125–142.
[226] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, [248] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, P. Fua, Learning
D. Borth, L.-J. Li, Yfcc100m: The new data in multimedia research, to find good correspondences, in: Proceedings of the IEEE conference
Communications of the ACM 59 (2) (2016) 64–73. on computer vision and pattern recognition, 2018, pp. 2666–2674.
[227] Z. Li, N. Snavely, Megadepth: Learning single-view depth prediction [249] O. Wiles, S. Ehrhardt, A. Zisserman, Co-attention for conditioned image
from internet photos, in: Proceedings of the IEEE conference on com- matching, in: Proceedings of the IEEE/CVF conference on computer
puter vision and pattern recognition, 2018, pp. 2041–2050. vision and pattern recognition, 2021, pp. 15920–15929.
[228] D. Mishkin, J. Matas, M. Perdoch, Mods: Fast and robust method [250] R. Arandjelović, A. Zisserman, Three things everyone should know to
for two-view matching, Computer vision and image understanding 141 improve object retrieval, in: 2012 IEEE conference on computer vision
(2015) 81–93. and pattern recognition, IEEE, 2012, pp. 2911–2918.
[229] D. Mishkin, J. Matas, M. Perdoch, K. Lenc, Wxbs: Wide baseline stereo [251] I. Melekhov, G. J. Brostow, J. Kannala, D. Turmukhambetov, Image styl-
generalizations, arXiv preprint arXiv:1504.06603 (2015). ization for robust features, arXiv preprint arXiv:2008.06959 (2020).
[230] T. Sattler, T. Weyand, B. Leibe, L. Kobbelt, Image retrieval for image- [252] Y. Zhou, H. Fan, S. Gao, Y. Yang, X. Zhang, J. Li, Y. Guo, Retrieval
based localization revisited., in: BMVC, Vol. 1, 2012, p. 4. and localization with observation constraints, in: 2021 IEEE Interna-
[231] W. Maddern, G. Pascoe, C. Linegar, P. Newman, 1 year, 1000 km: The tional Conference on Robotics and Automation (ICRA), IEEE, 2021,
oxford robotcar dataset, The International Journal of Robotics Research pp. 5237–5244.
36 (1) (2017) 3–15. [253] M. Humenberger, Y. Cabon, N. Guerin, J. Morat, V. Leroy, J. Revaud,
[232] P.-E. Sarlin, M. Dusmanu, J. L. Schönberger, P. Speciale, L. Gruber, P. Rerole, N. Pion, C. de Souza, G. Csurka, Robust image retrieval-based
V. Larsson, O. Miksik, M. Pollefeys, Lamar: Benchmarking localiza- visual localization using kapture, arXiv preprint arXiv:2007.13867
tion and mapping for augmented reality, in: European Conference on (2020).
Computer Vision, Springer, 2022, pp. 686–704. [254] H. Germain, G. Bourmaud, V. Lepetit, S2dnet: Learning image features
[233] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: The for accurate sparse-to-dense matching, in: Computer Vision–ECCV
kitti dataset, The International Journal of Robotics Research 32 (11) 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020,
(2013) 1231–1237. Proceedings, Part III 16, Springer, 2020, pp. 626–643.
[234] J. Xiao, A. Owens, A. Torralba, Sun3d: A database of big spaces re- [255] Y. Zhao, H. Zhang, P. Lu, P. Li, E. Wu, B. Sheng, Dsd-matchingnet:
constructed using sfm and object labels, in: Proceedings of the IEEE Deformable sparse-to-dense feature matching for learning accurate cor-
international conference on computer vision, 2013, pp. 1625–1632. respondences, Virtual Reality & Intelligent Hardware 4 (5) (2022) 432–

29
443. of Geo-Information 10 (11) (2021) 748.
[256] L. Chen, C. Heipke, Deep learning feature representation for image [278] L. Morelli, F. Bellavia, F. Menna, F. Remondino, Photogrammetry now
matching under large viewpoint and viewing direction change, ISPRS and then–from hand-crafted to deep-learning tie points–, The Interna-
Journal of Photogrammetry and Remote Sensing 190 (2022) 94–112. tional Archives of the Photogrammetry, Remote Sensing and Spatial In-
[257] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benen- formation Sciences 48 (2022) 163–170.
son, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic [279] F. Maiwald, H.-G. Maas, An automatic workflow for orientation of his-
urban scene understanding, in: Proceedings of the IEEE conference on torical images with large radiometric and geometric differences, The
computer vision and pattern recognition, 2016, pp. 3213–3223. Photogrammetric Record 36 (174) (2021) 77–103.
[258] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, A. Torralba, Scene
parsing through ade20k dataset, in: Proceedings of the IEEE conference
on computer vision and pattern recognition, 2017, pp. 633–641.
[259] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson,
T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., Segment anything,
arXiv preprint arXiv:2304.02643 (2023).
[260] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski,
A. Joulin, Emerging properties in self-supervised vision transformers,
in: Proceedings of the IEEE/CVF international conference on computer
vision, 2021, pp. 9650–9660.
[261] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khali-
dov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., Dinov2:
Learning robust visual features without supervision, arXiv preprint
arXiv:2304.07193 (2023).
[262] Anonymous, Towards seamless adaptation of pre-trained models for vi-
sual place recognition, in: Submitted to The Twelfth International Con-
ference on Learning Representations, 2023, under review.
URL https://openreview.net/forum?id=TVg6hlfsKa
[263] X. Jiang, Y. Xia, X.-P. Zhang, J. Ma, Robust image matching via local
graph structure consensus, Pattern Recognition 126 (2022) 108588.
[264] J. Ma, J. Zhao, J. Jiang, H. Zhou, X. Guo, Locality preserving matching,
International Journal of Computer Vision 127 (2019) 512–531.
[265] R. Raguram, O. Chum, M. Pollefeys, J. Matas, J.-M. Frahm, Usac: A
universal framework for random sample consensus, IEEE transactions
on pattern analysis and machine intelligence 35 (8) (2012) 2022–2038.
[266] D. Barath, J. Noskova, M. Ivashechkin, J. Matas, Magsac++, a fast, re-
liable and accurate robust estimator, in: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, 2020, pp. 1304–
1312.
[267] X. Li, Z. Hu, Rejecting mismatches by correspondence function, Inter-
national Journal of Computer Vision 89 (2010) 1–17.
[268] W. Sun, W. Jiang, E. Trulls, A. Tagliasacchi, K. M. Yi, Acne: Atten-
tive context normalization for robust permutation-equivariant learning,
in: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, 2020, pp. 11286–11295.
[269] C. Zhao, Z. Cao, C. Li, X. Li, J. Yang, Nm-net: Mining reliable
neighbors for robust feature correspondences, in: Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition,
2019, pp. 215–224.
[270] J. Chen, X. Chen, S. Chen, Y. Liu, Y. Rao, Y. Yang, H. Wang, D. Wu,
Shape-former: Bridging cnn and transformer via shapeconv for multi-
modal image matching, Information Fusion 91 (2023) 445–457.
[271] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold,
C. Rother, Dsac-differentiable ransac for camera localization, in: Pro-
ceedings of the IEEE conference on computer vision and pattern recog-
nition, 2017, pp. 6684–6692.
[272] L. Cavalli, D. Barath, M. Pollefeys, V. Larsson, Consensus-adaptive
ransac, arXiv preprint arXiv:2307.14030 (2023).
[273] J. Chen, S. Chen, X. Chen, Y. Yang, L. Xing, X. Fan, Y. Rao, Lsv-
anet: Deep learning on local structure visualization for feature matching,
IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–18.
[274] F. Bellavia, L. Morelli, F. Menna, F. Remondino, Image orientation with
a hybrid pipeline robust to rotations and wide-baselines, The Interna-
tional Archives of the Photogrammetry, Remote Sensing and Spatial In-
formation Sciences 46 (2022) 73–80.
[275] F. Bellavia, D. Mishkin, Harrisz+: Harris corner selection for next-gen
image matching pipelines, Pattern Recognition Letters 158 (2022) 141–
147.
[276] F. Bellavia, Image matching by bare homography, IEEE Transactions on
Image Processing (2024).
[277] F. Maiwald, C. Lehmann, T. Lazariv, Fully automated pose estimation
of historical images in the context of 4d geographic information sys-
tems utilizing machine learning methods, ISPRS International Journal

Introduction To AI - Fundamentals
No ratings yet
Introduction To AI - Fundamentals
57 pages
Feature Detection and Matching
100% (1)
Feature Detection and Matching
50 pages
Digital Image Processing Seminar
80% (5)
Digital Image Processing Seminar
23 pages
Feature Detection and Matching: P. Kuppusamy
No ratings yet
Feature Detection and Matching: P. Kuppusamy
42 pages
Feature Matching Based On Local Windowa
No ratings yet
Feature Matching Based On Local Windowa
18 pages
AI Stray Dog Monitoring System
No ratings yet
AI Stray Dog Monitoring System
14 pages
1 s2.0 S0924271625001376 Main
No ratings yet
1 s2.0 S0924271625001376 Main
25 pages
C-Fundamental Steps in Digital Image Processing Draft
No ratings yet
C-Fundamental Steps in Digital Image Processing Draft
27 pages
Modality-Aware Feature Matching: A Comprehensive Review of Single-And Cross-Modality Techniques
No ratings yet
Modality-Aware Feature Matching: A Comprehensive Review of Single-And Cross-Modality Techniques
35 pages
Remote Sensing: Geometric Correction
100% (1)
Remote Sensing: Geometric Correction
6 pages
Art Center Image Recognition
No ratings yet
Art Center Image Recognition
5 pages
Ailab References
No ratings yet
Ailab References
128 pages
Feature Detection 2
No ratings yet
Feature Detection 2
57 pages
Detector-Free Image Matching with Transformers
No ratings yet
Detector-Free Image Matching with Transformers
10 pages
Chapter 03: IMAGE: Part 1: Key Term Quiz
No ratings yet
Chapter 03: IMAGE: Part 1: Key Term Quiz
6 pages
Anchor-Based Vs Anchor-Free Object
No ratings yet
Anchor-Based Vs Anchor-Free Object
8 pages
Bay, Herbert Et Al. (2008) - Speed-Up Robust Features
No ratings yet
Bay, Herbert Et Al. (2008) - Speed-Up Robust Features
14 pages
Adaptive Spot-Guided Transformer Matching
No ratings yet
Adaptive Spot-Guided Transformer Matching
13 pages
LFM-3D Learnable Feature Matching Across Wide Baselines Using 3D Signals
No ratings yet
LFM-3D Learnable Feature Matching Across Wide Baselines Using 3D Signals
10 pages
Local Self-Similarity in Images & Videos
No ratings yet
Local Self-Similarity in Images & Videos
8 pages
2019-Fast and Robust Matching For Multimodal Remote Sensing Image Registration
No ratings yet
2019-Fast and Robust Matching For Multimodal Remote Sensing Image Registration
12 pages
V-Unit AIIA Complete Material
No ratings yet
V-Unit AIIA Complete Material
162 pages
147599146852.1167 Survey
No ratings yet
147599146852.1167 Survey
6 pages
Key Paper - Robust Feature Matching For Remote Sensing Image Registration Via Linear Adaptive Filtering
No ratings yet
Key Paper - Robust Feature Matching For Remote Sensing Image Registration Via Linear Adaptive Filtering
15 pages
Efficient Object Detection and Matching Using Feature Classification
No ratings yet
Efficient Object Detection and Matching Using Feature Classification
4 pages
Semi-Dense Loftr SFM
No ratings yet
Semi-Dense Loftr SFM
13 pages
3local Features
No ratings yet
3local Features
76 pages
TM-30 ES (Final) - 0
No ratings yet
TM-30 ES (Final) - 0
43 pages
Sensors 23 00123 With Cover
No ratings yet
Sensors 23 00123 With Cover
15 pages
Learning Detect Match
No ratings yet
Learning Detect Match
12 pages
AI Insights for Tech Professionals
No ratings yet
AI Insights for Tech Professionals
23 pages
8803 23991 1 PB
No ratings yet
8803 23991 1 PB
8 pages
A High Performance Fpga Based Image Feature Detector and Matcher Based On The Fast and Brief Algorithms
No ratings yet
A High Performance Fpga Based Image Feature Detector and Matcher Based On The Fast and Brief Algorithms
15 pages
Image Patch-Matching With Graph-Based Learning
No ratings yet
Image Patch-Matching With Graph-Based Learning
18 pages
Grounding Image Matching in 3D With Mast3R: Vincent Leroy Yohann Cabon Jerome Revaud
No ratings yet
Grounding Image Matching in 3D With Mast3R: Vincent Leroy Yohann Cabon Jerome Revaud
21 pages
IT5409 Ch4 Part2 Feature ExtractionMatching
No ratings yet
IT5409 Ch4 Part2 Feature ExtractionMatching
85 pages
Homography Estimation From Image Pairs With Hierarchical Convolutional Networks
No ratings yet
Homography Estimation From Image Pairs With Hierarchical Convolutional Networks
8 pages
SCENES Subpixel Correspondence Estimation With Epipolar Supervision
No ratings yet
SCENES Subpixel Correspondence Estimation With Epipolar Supervision
10 pages
2104 00680 PDF
No ratings yet
2104 00680 PDF
10 pages
CTech Reviewer
No ratings yet
CTech Reviewer
6 pages
Lecture 5 - Camera Calibration
No ratings yet
Lecture 5 - Camera Calibration
14 pages
Patch 2 Pix
No ratings yet
Patch 2 Pix
17 pages
Lecture 05
No ratings yet
Lecture 05
57 pages
He D2Former Jointly Learning Hierarchical Detectors and Contextual Descriptors Via Agent-Based CVPR 2023 Paper
No ratings yet
He D2Former Jointly Learning Hierarchical Detectors and Contextual Descriptors Via Agent-Based CVPR 2023 Paper
11 pages
Sift
No ratings yet
Sift
28 pages
Eismann Book Excerpt
No ratings yet
Eismann Book Excerpt
13 pages
Computer Vision: SIFT Features
No ratings yet
Computer Vision: SIFT Features
20 pages
Zero-Shot Image Feature Consensus With Deep Functional Maps
No ratings yet
Zero-Shot Image Feature Consensus With Deep Functional Maps
19 pages
Pixel Connectivity Basics
No ratings yet
Pixel Connectivity Basics
43 pages
IT5409 - Ch4 - Part2 - Feature ExtractionMatching - 4pages
No ratings yet
IT5409 - Ch4 - Part2 - Feature ExtractionMatching - 4pages
43 pages
Global Feature?: Local Feature Detection and Extraction
No ratings yet
Global Feature?: Local Feature Detection and Extraction
6 pages
Unsupervised 3D Object Recognition and Reconstruction in Unordered Datasets
No ratings yet
Unsupervised 3D Object Recognition and Reconstruction in Unordered Datasets
8 pages
Vision Based Bottle Classification and Automatic Bottle Filling System
No ratings yet
Vision Based Bottle Classification and Automatic Bottle Filling System
3 pages
DL4CV Week03 Part01
No ratings yet
DL4CV Week03 Part01
57 pages
cvpr2004 Keypoint Rahuls PDF
No ratings yet
cvpr2004 Keypoint Rahuls PDF
8 pages
Data Science Interview Preparation
No ratings yet
Data Science Interview Preparation
16 pages
CSE 185 Introduction To Computer Vision: Feature Matching
No ratings yet
CSE 185 Introduction To Computer Vision: Feature Matching
48 pages
Tech Trends
No ratings yet
Tech Trends
12 pages
Matching Local Self-Similarities Across Images and Videos
No ratings yet
Matching Local Self-Similarities Across Images and Videos
8 pages
Detectores de Caracteristicas
No ratings yet
Detectores de Caracteristicas
20 pages
Sift
No ratings yet
Sift
15 pages
Efficient CNNs for Mobile Devices
No ratings yet
Efficient CNNs for Mobile Devices
9 pages
Local Feature Detectors, Descriptors, and Image Representations: A Survey
No ratings yet
Local Feature Detectors, Descriptors, and Image Representations: A Survey
20 pages
Automating Corrosion Detection in Marine Structures With AI 2
No ratings yet
Automating Corrosion Detection in Marine Structures With AI 2
19 pages
11 Ray Tracing
No ratings yet
11 Ray Tracing
79 pages
Neurocomputing: Jing Li, Nigel M. Allinson
No ratings yet
Neurocomputing: Jing Li, Nigel M. Allinson
17 pages
Python IEEE Projects List
No ratings yet
Python IEEE Projects List
23 pages
Evaluation of Distance Measures For Feature Based Image Registration Using Alexnet
No ratings yet
Evaluation of Distance Measures For Feature Based Image Registration Using Alexnet
7 pages
Panoramic Image Stitching 3D Photography Initial Project Proposal
No ratings yet
Panoramic Image Stitching 3D Photography Initial Project Proposal
4 pages
Scale Invariant Feature Transform Plus Hue Feature
No ratings yet
Scale Invariant Feature Transform Plus Hue Feature
6 pages
A Survey On Driver Drowsiness Detection Techniques
No ratings yet
A Survey On Driver Drowsiness Detection Techniques
5 pages
A Multi Support Rotationally Invariant Descriptors Based On Gradient Histogram
No ratings yet
A Multi Support Rotationally Invariant Descriptors Based On Gradient Histogram
4 pages
Point Feature Detection & Matching
No ratings yet
Point Feature Detection & Matching
65 pages
Fügener Et Al 2021 Cognitive Challenges in Human Artificial Intelligence Collaboration Investigating The Path Toward
No ratings yet
Fügener Et Al 2021 Cognitive Challenges in Human Artificial Intelligence Collaboration Investigating The Path Toward
19 pages
DL Project Report
No ratings yet
DL Project Report
10 pages
US20170374299A1
No ratings yet
US20170374299A1
21 pages
Sprada 8
No ratings yet
Sprada 8
10 pages
Ai MCQ
No ratings yet
Ai MCQ
30 pages
Image Reconstruction Using Its Spatial and Geometrical Information
No ratings yet
Image Reconstruction Using Its Spatial and Geometrical Information
5 pages
Improved SIFT Algorithm Image Matching
No ratings yet
Improved SIFT Algorithm Image Matching
7 pages
Apple Fruit Quality Evaluation
No ratings yet
Apple Fruit Quality Evaluation
6 pages
Caldera Test Print Technical V2.1
No ratings yet
Caldera Test Print Technical V2.1
1 page
Emotion Based Music Recommendation System
100% (1)
Emotion Based Music Recommendation System
6 pages
Search Term-1
No ratings yet
Search Term-1
4 pages
Florida 1
No ratings yet
Florida 1
2 pages
SaugataPaul DS AIML
No ratings yet
SaugataPaul DS AIML
2 pages