Adaptive Spot-Guided Transformer for Consistent Local Feature Matching

Jiahuan Yu*, Jiahao Chang*, Jianfeng He, Tianzhu Zhang†, Feng Wu


University of Science and Technology of China
(* Equal Contribution, † Corresponding Author)

arXiv:2303.16624v1 [cs.CV] 29 Mar 2023

[Figure 1: panels (a) Reference, (b) Linear Attention, (c) Vanilla Attention, (d) Ours, (e) Matching Result.]
Figure 1. Visualization of the cross attention heatmaps and matching results. We sample two similar adjacent points in the reference image (a), marked with green and red. (b) shows the two heatmaps of the linear cross attention in LoFTR [50] when the green and red pixels are queries. (c) shows the two heatmaps obtained from vanilla cross attention. (d) shows the two heatmaps generated by our spot-guided attention. (e) compares the final matching results produced by LoFTR [50] (top) and our method (bottom).

Abstract

Local feature matching aims at finding correspondences between a pair of images. Although current detector-free methods leverage the Transformer architecture to obtain impressive performance, few works consider maintaining local consistency. Meanwhile, most methods struggle with large scale variations. To deal with the above issues, we propose the Adaptive Spot-Guided Transformer (ASTR) for local feature matching, which jointly models local consistency and scale variations in a unified coarse-to-fine architecture. The proposed ASTR enjoys several merits. First, we design a spot-guided aggregation module to avoid interference from irrelevant areas during feature aggregation. Second, we design an adaptive scaling module to adjust the size of grids according to the calculated depth information at the fine stage. Extensive experimental results on five standard benchmarks demonstrate that our ASTR performs favorably against state-of-the-art methods. Our code will be released at https://astr2023.github.io.

1. Introduction

Local feature matching (LFM) is a fundamental task in computer vision, which aims to establish correspondences for local features across image pairs. As a basis for many 3D vision tasks, local feature matching can be applied in Structure-from-Motion (SfM) [49], 3D reconstruction [13], visual localization [48, 51], and pose estimation [18, 41]. Because of its broad applications, local feature matching has attracted substantial attention and facilitated the development of much research [14, 27, 42, 44, 50]. However, finding consistent and accurate matches is still difficult due to various challenging factors such as illumination variations, scale changes, poor textures, and repetitive patterns.

To deal with the above challenges, numerous matching methods have been proposed, which can be generally categorized into two major groups: detector-based matching methods [2, 14, 15, 39, 42, 47] and detector-free matching methods [9, 23, 27, 43, 44, 50]. Detector-based matching methods first design a keypoint detector to extract keypoints from the two images, and then establish matches between these extracted keypoints. The quality of the detected keypoints significantly affects the performance of detector-based matching methods. Therefore, many works aim to improve keypoint detection through multi-scale detection [36] or repeatable and reliable verification [42]. Thanks to the high-quality keypoints detected, these methods can achieve satisfactory performance while maintaining high computational and memory efficiency. However, these detector-based matching methods may have difficulty finding reliable matches in textureless areas, where keypoints are challenging to detect. Differently, detector-free matching methods do not need to detect keypoints and instead try to establish pixel-level matches between local features. In this way, it is possible to establish matches in texture-less areas. Due to the power of attention in capturing long-distance dependencies, many Transformer-based methods [9, 50, 52, 57] have emerged in recent years. As a representative work, considering the computation and memory costs, LoFTR [50] applies the Linear Transformer [25] to aggregate global features at the coarse stage and then crops fixed-size grids for further refinement. To alleviate the problem caused by scale changes, COTR [24] calculates the co-visible area iteratively through an attention mechanism. The promising performance of Transformer-based methods proves that the attention mechanism is effective for local feature matching. Nevertheless, some recent works [28, 60] indicate that the Transformer lacks spatial inductive bias for continuous dense prediction tasks, which may cause inconsistent local matching results.

By studying the previous matching methods, we sum up two issues that are imperative for obtaining dense correspondence between images. (1) How to maintain local consistency. A correct matching result usually satisfies local matching consistency, i.e., for two similar adjacent pixels, their matching points are also extremely close to each other. Existing methods [24, 50, 57] utilize global attention in feature aggregation, introducing many irrelevant regions that affect feature updates. Some pixels are disturbed by noisy or similar areas and aggregate information from wrong regions, leading to false matching results. As shown in Figure 1 (b), for two adjacent similar pixels, the highlighted regions of global linear attention are decentralized and inconsistent with each other. The inconsistency is also present in vanilla attention (see Figure 1 (c)). Therefore, it is necessary to utilize local consistency to focus the attention area on the correct place. (2) How to handle scale variation. In a coarse-to-fine architecture, since the attention mechanism at the coarse stage is not sensitive to scale variations, we should focus on the fine stage. Previous methods [9, 27, 50, 57] select fixed-size grids for matching at the fine stage. However, when the scale varies too much across images, the correct match point may be out of the range of the grid, resulting in matching failure. Hence, the scheme for cropping grids should be adaptively adjusted according to the scale variation across views.

To deal with the above issues, we propose a novel Adaptive Spot-guided Transformer (ASTR) for consistent local feature matching, including a spot-guided aggregation module and an adaptive scaling module. In the spot-guided aggregation module, towards the goal of maintaining local consistency, we design a novel attention mechanism called spot-guided attention: each point is guided by similar high-confidence points around it, focusing on a local candidate region at each layer. Here, we also adopt global features to enhance the matching ability of the network in the candidate regions. Specifically, for any point p, we pick the points with high feature similarity and matching confidence in the local area. Their corresponding matching regions are used for the next attention of point p. In addition, global features are applied to help the network make judgments. The coarse feature maps are iteratively updated in the above way. With our spot-guided aggregation module, the red and green pixels are guided to the correct area, avoiding the interference of repetitive patterns (see Figure 1 (d)). In Figure 1 (e), our ASTR produces more accurate matching results, which maintains local matching consistency. In the adaptive scaling module, to fully account for possible scale variations, we attempt to adaptively crop grids of different sizes for alignment. In detail, we compute the corresponding depth map using the coarse matching result and leverage the depth information to crop adaptively sized grids from the high-resolution feature maps for fine matching.

The contributions of our method can be summarized as threefold: (1) We propose a novel Adaptive Spot-guided Transformer (ASTR) for local feature matching, including a spot-guided aggregation module and an adaptive scaling module. (2) We design a spot-guided aggregation module that can maintain local consistency and remain unaffected by irrelevant regions while aggregating features. Our adaptive scaling module is able to leverage depth information to adaptively crop grids of different sizes for refinement. (3) Extensive experimental results on five challenging benchmarks show that our proposed method performs favorably against state-of-the-art image matching methods.

2. Related Work

In this section, we briefly review several research lines that are related to sparse matching methods, dense matching methods, and vision Transformers.

Local Feature Matching. Local feature matching can be categorized into detector-based and detector-free methods. Detector-based methods can be divided into three stages: feature detection, feature description, and feature matching. SIFT [34] and ORB [46] are the most popular hand-crafted local features, while learning-based methods [2, 14, 15, 19, 42, 46, 63] also obtain good performance improvements compared to classical methods. There are also some works focusing on improving the feature matching stage. D2Net [15] fuses the detection and description stages. R2D2 [42] attempts to train a network to find reliable and repeatable local features.

SuperGlue [47] proposes an attention-based GNN to update the extracted local features with alternating self and cross attention. However, detector-based methods rely on local feature extractors, which limits their performance in challenging scenarios such as repetitive textures, weak textures, and illumination changes. Unlike detector-based approaches, detector-free approaches do not require a local feature detector but find dense feature matches between pixels directly. Classical methods [21, 35] exist, but few of them outperform detector-based methods. Learning-based methods change the game; they can be divided into cost-volume-based methods [27, 44, 53, 54] and Transformer-based methods [9, 10, 22, 24, 50, 57]. Good performance has been achieved by cost-volume-based methods, but most of them are limited by the small receptive field of CNNs, which is overcome by Transformer-based methods [50]. Detector-free methods attain better performance in local feature matching, so we adopt this paradigm as the baseline.

Vision Transformer. The Transformer [56] has been proven to be better at capturing long-range correlations than CNNs in vision tasks [7, 37, 38]. Despite the great success, the computational cost of vanilla attention at high resolution is unacceptable, so some approximations [25, 33, 52, 59] have been proposed, which inevitably leads to performance degradation. Linear attention [25] approximates softmax with ELU [11] to reduce the computational complexity to linear but degrades the focusing ability of attention. Swin Transformer [33] limits attention to local windows, which harms the ability to establish long-range associations. Meanwhile, QuadTree [52] calculates attention in a coarse-to-fine manner, and ASpanFormer [9] proposes an adaptive method for selecting attention spans, but few of them consider local consistency. Different from existing attention mechanisms, we explicitly model local consistency in our spot-guided attention without introducing excessive computation and memory costs.

Local Feature Matching with Scale Invariance. Scale variation is one of the main challenges faced by local feature matching, and many works have explored solutions. Hand-crafted local features [5, 31, 45, 46] use a Gaussian pyramid model to alleviate the problem. Following the hand-crafted methods, some learning-based descriptors [2, 4, 32, 36, 42, 63] also use a multi-scale representation. ScaleNet [3] and Scale-Net [17], instead, try to directly estimate the scale ratio. Another popular paradigm is to perform a warp or scaling operation to eliminate the distortion caused by the scale variance. GeoWrap [6] introduces a homography regression and warps images to increase the overlap area. OETR [10] limits keypoint detection to estimated overlap areas. COTR [24] estimates scale by finding co-visible regions and then finds correspondences by recursively zooming. However, most of the above methods require significant modifications to the network architecture and introduce additional computational overhead. Therefore, we design a fully pluggable, lightweight and training-free module for the coarse-to-fine architecture.

3. Our Approach

In this section, we present our proposed Adaptive Spot-guided Transformer (ASTR) for consistent local feature matching. The overall architecture is illustrated in Figure 2.

3.1. Overview

As shown in Figure 2, the proposed ASTR mainly consists of two modules: a spot-guided aggregation module and an adaptive scaling module. Here we give a brief introduction to the entire process. Given an image pair I_Ref and I_Src, to start with, we extract multi-scale feature maps of each image through a shared Feature Pyramid Network (FPN) [30]. We denote the feature maps at 1/i resolution as F^{1/i} = {F_ref^{1/i}, F_src^{1/i}}. Then, F^{1/32} and F^{1/8} are fed into the spot-guided aggregation module to produce the coarse matching and depth maps. Here, the coarse matching result is acquired in three phases. First, we compute the similarity matrix over the flattened features, S(i, j) = τ ⟨F_ref^{1/8}(i), F_src^{1/8}(j)⟩, where τ is the temperature coefficient. Then we apply the dual-softmax operator on S to calculate the matching matrix P_c:

P_c(i, j) = softmax(S(i, :))(i, j) · softmax(S(:, j))(i, j).    (1)

Finally, we use the mutual nearest neighbor strategy and the threshold θ_c to filter out the coarse-matching result M_c. According to the depth information and the coarse-matching result, we can crop grids of different sizes on the high-resolution feature map F^{1/2}. After linear self and cross attention layers, the features of the cropped grids are used to produce the final fine-level matching result.

[Figure 2: overall architecture. A shared CNN backbone extracts multi-scale features; the spot-guided aggregation module (vanilla cross attention at 1/32 resolution, spot-guided attention (linear) at 1/8 resolution, with up-/down-sampling and fusion, repeated ×4) produces the matching matrix and depth maps; the adaptive scaling module then applies linear self and cross attention to adaptively cropped grids.]
Figure 2. The architecture of ASTR. Our ASTR consists of two major components: the spot-guided aggregation module and the adaptive scaling module. "Cross Attention" means vanilla cross attention unless otherwise stated. Please refer to the text for the detailed architecture.

3.2. Spot-Guided Aggregation Module

Correct matching always satisfies local matching consistency, i.e., the matching points of two similar adjacent pixels are also close to each other in the other image. When humans establish dense matches between two images, they first scan through the two images quickly and keep in mind some landmarks that are easier to match correctly. For those trouble points that are similar to surrounding landmarks, it is not easy to obtain correct matches in the beginning. But now, they can focus attention around the matching points of landmarks to revisit the trouble points' matches. In this way, more correctly matched landmarks are obtained. After several iterations of the above process, they eventually get the matching result for the whole image. Inspired by this idea, we design a spot-guided aggregation module. Section 3.2.1 introduces the preliminaries of vanilla attention and linear attention. Section 3.2.2 describes our spot-guided attention mechanism. Section 3.2.3 demonstrates the design of the entire spot-guided aggregation module.

3.2.1 Preliminaries

The calculation of vanilla attention requires three inputs: query Q, key K, and value V. The output of vanilla attention is a weighted sum of the values, where the weight matrix is determined by the query and its corresponding key. The process can be described as

Attention(Q, K, V) = softmax(QK^T)V.    (2)

However, in vision tasks, the size of the weight matrix softmax(QK^T) increases quadratically as the image resolution grows. When the image resolution is large, the memory and computational cost of vanilla attention is unacceptable. To solve this problem, linear attention [25] was proposed to replace the softmax operator with the product of two kernel functions:

Linear_attention(Q, K, V) = φ(Q)(φ(K^T)V),    (3)

where φ(·) = elu(·) + 1. Since the number of feature channels is much smaller than the number of pixels, the computational complexity is reduced from quadratic to linear.

3.2.2 Spot-Guided Attention

[Figure 3: The illustration of our spot-guided attention. Similarity, confidence, and selection scores computed for the l × l neighborhood in the reference image are combined by element-wise multiplication; the top-k points are selected, and their matching positions in the source image define the spot areas used for cross attention together with the matching matrix.]

It is known from local matching consistency that the matching points of similar adjacent pixels are also close to each other. In Figure 3, we illustrate the case where the reference image, as the query, aggregates features from the source image. Given reference and source feature maps F^{1/8} = {F_ref^{1/8}, F_src^{1/8}}, we compute a matching matrix P_s across images. For any pixel p in Figure 3, we first compute the similarity score S_sim(p) ∈ R^{l²−1} between p and the other pixels in the l × l area around p. Specifically, the similarity score can be obtained as

S_sim(p) = softmax_i({⟨F_ref^{1/8}(p), F_src^{1/8}(p_i)⟩}_{p_i ∈ N(p)}),    (4)

where ⟨·, ·⟩ is the inner product and N(p) is the set of pixels in the l × l field around pixel p. In addition, we should also consider the reliability of the points in N(p). For each p_i ∈ N(p), its confidence can be viewed as the highest similarity to all pixels of the source image. Meanwhile, we can also get the matching point position of p_i, denoted as Loc(p_i). Hence, Loc(p_i) and the confidence score S_conf(p) ∈ R^{l²−1} can be computed in the following way:

S_conf(p) = {max(P_s(p_i, :))}_{p_i ∈ N(p)},
Loc(p_i) = argmax(P_s(p_i, :)).    (5)

Combining the two scores, we select p and the top-k points Topk(p), whose matching points are used as seed points Seed(p):

Topk(p) = {p} ∪ topk{S_sim(p) · S_conf(p)},
Seed(p) = {Loc(q)}_{q ∈ Topk(p)}.    (6)

Following that, we extend l × l regions centered on these seed points Seed(p) on I_src, which are the spot areas of p. Finally, cross attention is performed between p and the corresponding spot areas. After exchanging the source image and the reference image, the source feature map is updated in the same way.
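As a rough, self-contained illustration of the seed selection in Equations (4)-(6), the following sketch handles a single reference pixel. It is our own simplification, not the paper's CUDA operator: neighbor features are taken from the reference map, border windows are clipped, and the subsequent sparse cross attention over the spot areas is omitted.

```python
import torch

def seed_points_for_pixel(p, feat_ref, p_s, l=5, k=4):
    """Sketch of Eqs. (4)-(6): seed points whose matches define the spot areas of p.

    p:        (row, col) of a reference pixel at 1/8 resolution
    feat_ref: (H, W, C) reference feature map F_ref^{1/8}
    p_s:      (H*W, M) matching matrix from reference pixels to source pixels
    """
    H, W, C = feat_ref.shape
    r, y, x = l // 2, p[0], p[1]

    # N(p): the l x l neighborhood around p (clipped at the border, center excluded).
    ys = torch.arange(max(y - r, 0), min(y + r + 1, H))
    xs = torch.arange(max(x - r, 0), min(x + r + 1, W))
    ny, nx = torch.meshgrid(ys, xs, indexing="ij")
    n_idx = (ny * W + nx).reshape(-1)
    n_idx = n_idx[n_idx != y * W + x]

    # Eq. (4): similarity of p to its neighbors, normalized with softmax.
    s_sim = (feat_ref.reshape(-1, C)[n_idx] @ feat_ref[y, x]).softmax(dim=0)

    # Eq. (5): confidence = best matching score of each neighbor, Loc = its argmax.
    s_conf, loc = p_s[n_idx].max(dim=1)

    # Eq. (6): p itself plus the top-k neighbors ranked by S_sim * S_conf;
    # the matching positions of these points are the seed points Seed(p).
    top = (s_sim * s_conf).topk(min(k, n_idx.numel())).indices
    seeds = torch.cat([p_s[y * W + x].argmax().view(1), loc[top]])
    return seeds.unique()   # flat source indices; l x l spot areas are grown around them
```

Cross attention is then restricted to the union of the l × l windows around these seed positions, which is the irregular access pattern that the general sparse attention operator in the supplementary material accelerates.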
3.2.3 Spot-Guided Feature Aggregation

For the input features F^{1/32} and F^{1/8}, F^{1/32} is updated by vanilla cross attention and F^{1/8} is updated by linear cross attention for initialization. Then, the two features of different resolutions are fed into the spot-guided aggregation blocks. In each block, F^{1/32} and F^{1/8} are first fused into each other in the following way:

F̂^{1/32} = F^{1/32} + Conv_{1×1}(Down(F^{1/8})),
F̂^{1/8} = F^{1/8} + Conv_{1×1}(Up(F^{1/32})),    (7)

where F̂^{1/32} and F̂^{1/8} are the features after fusion, and Down(·) and Up(·) denote downsampling and upsampling. Then, F̂^{1/32} aggregates features across images by vanilla attention, while F̂^{1/8} aggregates features across images by spot-guided attention. After four spot-guided aggregation blocks, the 1/32-resolution features are fused into the 1/8-resolution features, which are used to obtain the coarse-matching result M_c.

3.3. Adaptive Scaling Module

[Figure 4: The illustration of our adaptive scaling module. On the left is the reference image, whose optical center is C_Ref. On the right is the source image, whose optical center is C_Src. x_i and x_j are the projections of the real-world point X.]

At the fine stage, previous methods usually crop fixed-size grids based on the coarse matching result. When there is a large scale variation, fine matching may fail since the ground-truth matching points fall outside the grids. Thus, we refer to depth information to adaptively crop grids of different sizes between the images. Section 3.3.1 describes the way to obtain depth information from the coarse-matching result. Section 3.3.2 demonstrates the process of adaptively cropping grids.

3.3.1 Depth Information

With the coarse-level matching result, we can obtain the relative pose of the two images {R, T} through RANSAC [16]. It should be noted that the T calculated here has a scale uncertainty, i.e., T_real = αT, where α is the scale factor. Given the image coordinates of any pair of matching points {x_i, x_j} from the coarse-level matching result, they satisfy the following equation:

d_j K_j^{−1}(x_j, 1)^T = d_i R K_i^{−1}(x_i, 1)^T + αT,    (8)

where d_i and d_j are the depth values of x_i and x_j, and K_i and K_j are the corresponding camera intrinsics. We let p_i = R K_i^{−1}(x_i, 1)^T and p_j = K_j^{−1}(x_j, 1)^T. From Equation (8) it can be deduced that

d_j p_j = d_i p_i + αT
⇒ (d_j/α) p_j ∧ p_i = T ∧ p_i,   0 = (d_i/α) p_i ∧ p_j + T ∧ p_j
⇒ d_j/α = mean(div(T ∧ p_i, p_j ∧ p_i)),   d_i/α = mean(div(−T ∧ p_j, p_i ∧ p_j)),    (9)

where ∧ indicates the outer product, div(·, ·) denotes element-wise division between two vectors, and mean(·) is the scalar mean of the components of a vector. In this way, we have obtained the depth information of x_i and x_j up to scale uncertainty.

3.3.2 Adaptive Scaling Strategy

As shown in Figure 4, x_i and x_j are a pair of matching points at the coarse stage, and d_i and d_j are the depth values of x_i and x_j. To begin with, we crop an s_i × s_i region centered on x_i. When the scale changes too much, the correct matching point x̃_j may be beyond the s_i × s_i region around x_j. Because objects look small in the distance and large up close, the size of the cropped grid s_j should satisfy

s_j / s_i = d_i / d_j = (d_i/α)(d_j/α)^{−1}.    (10)

Following the above approach, we can crop grids of different sizes adaptively according to the scale variation. After the same refinement as LoFTR [50], we get the final matching position x̃_j of x_i.
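A sketch of how Equations (8)-(10) can be turned into per-match grid-size ratios is shown below. The paper only states that RANSAC [16] yields {R, T}; the use of OpenCV's essential-matrix routines, the shared-intrinsics simplification in the RANSAC step, and the clamping range taken from Section 4.1 are our assumptions.

```python
import cv2
import numpy as np

def adaptive_grid_ratio(x_i, x_j, K_i, K_j, clamp=(1.0, 3.0)):
    """Sketch of Eqs. (8)-(10): grid-size ratios s_j / s_i for coarse matches.

    x_i, x_j: (N, 2) pixel coordinates of coarse matches in the two images
    K_i, K_j: (3, 3) camera intrinsics of the two views
    """
    # Relative pose {R, T} up to scale via essential matrix + RANSAC
    # (K_i is used for both views here for simplicity).
    E, _ = cv2.findEssentialMat(x_i, x_j, K_i, method=cv2.RANSAC, threshold=1.0)
    _, R, T, _ = cv2.recoverPose(E, x_i, x_j, K_i)
    T = T.reshape(3)

    # p_i = R K_i^{-1} (x_i, 1)^T and p_j = K_j^{-1} (x_j, 1)^T
    ones = np.ones((len(x_i), 1))
    p_i = (R @ np.linalg.inv(K_i) @ np.hstack([x_i, ones]).T).T   # (N, 3)
    p_j = (np.linalg.inv(K_j) @ np.hstack([x_j, ones]).T).T       # (N, 3)

    # Eq. (9): depths up to the unknown scale factor alpha.
    dj_over_a = np.mean(np.cross(T, p_i) / np.cross(p_j, p_i), axis=1)
    di_over_a = np.mean(np.cross(-T, p_j) / np.cross(p_i, p_j), axis=1)

    # Eq. (10): s_j / s_i = d_i / d_j (alpha cancels), clamped as in Section 4.1.
    return np.clip(di_over_a / dj_over_a, *clamp)
```

Because α cancels in the ratio, the scale-ambiguous depths are sufficient to size the source-image grids, which is why no metric depth is required.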
3.4. Loss Function

Our loss function mainly consists of three parts: the spot matching loss, the coarse matching loss, and the fine matching loss. The spot matching loss is the cross-entropy loss that supervises the matching matrix during spot-guided attention:

L_s = −(1/|M_c^gt|) Σ_{(i,j) ∈ M_c^gt} log P_s(i, j),    (11)

where M_c^gt is the set of ground-truth matches at the coarse resolution. The coarse matching loss is also a cross-entropy loss, supervising the coarse matching matrix:

L_c = −(1/|M_c^gt|) Σ_{(i,j) ∈ M_c^gt} log P_c(i, j).    (12)

The fine matching loss L_f is a weighted L2 loss, the same as in LoFTR [50]. Therefore, our total loss is

L_total = L_s + L_c + L_f.    (13)

4. Experiments

In this section, we evaluate our ASTR with extensive experiments. First of all, we introduce the implementation details, followed by experiments on five benchmarks and some visualizations. Finally, we conduct a series of ablation studies to verify the effectiveness of each component.

4.1. Implementation Details

We implement the proposed model in PyTorch [40]. Our ASTR is trained on the MegaDepth dataset [29]. In the training phase, we input images of size 832 × 832. The CNN extractor is a deepened ResNet-18 [20] with features down to 1/32 resolution. In spot-guided attention, we set the kernel size of the local region l to 5 and k to 4 in the top-k selection. The threshold θ_c in coarse matching is set to 0.2. At the fine stage, the window size s_i in the reference image is fixed to 5, and the window size s_j in the source image is adaptively calculated according to the depth information. In particular, s_j/s_i is clamped into [1, 3]. Our network is trained for 15 epochs with a batch size of 8 by the Adam [26] optimizer. The initial learning rate is 1 × 10⁻³. In order to run spot-guided attention efficiently, we implement a highly optimized general sparse attention operator in CUDA. Please refer to the Supplementary Material for more details about the operator.

4.2. Homography Estimation

Dataset and Metric. HPatches [1] is a popular benchmark for image matching. Following [15], we choose 56 sequences under significant viewpoint changes and 52 sequences with large illumination variation to evaluate the performance of our ASTR trained on MegaDepth [29]. We use the same evaluation protocol as LoFTR [50]. We report the area under the cumulative curve (AUC) of the corner error distance up to 3, 5, and 10 pixels, respectively. We limit the maximum number of output matches to 1k.

Table 1. Evaluation on HPatches [1] for homography estimation.

Category         Method                     Homography est. AUC          #matches
                                            @3px    @5px    @10px
Detector-based   D2Net [15]+NN              23.2    35.9    53.6         0.2K
                 R2D2 [42]+NN               50.6    63.9    76.8         0.5K
                 DISK [55]+NN               52.3    64.9    78.9         1.1K
                 SP [14]+SuperGlue [47]     53.9    68.3    81.7         0.6K
Detector-free    Patch2Pix [64]             46.4    59.2    73.1         1.0K
                 Sparse-NCNet [43]          48.9    54.2    67.1         1.0K
                 COTR [24]                  41.9    57.7    74.0         1.0K
                 DRC-Net [27]               50.6    56.2    68.3         1.0K
                 LoFTR [50]                 65.9    75.6    84.6         1.0K
                 PDC-Net+ [54]              66.7    76.8    85.8         1.0K
                 ASTR (ours)                71.7    80.3    88.0         1.0K

Results. In Table 1, we can see that our ASTR achieves new state-of-the-art performance on HPatches [1] under all error thresholds, which strongly proves the effectiveness of our method. ASTR outperforms the previous best method (PDC-Net+ [54]) by a large margin of 4.4% under 3 pixels, 3.5% under 5 pixels, and 2.5% under 10 pixels. Thanks to the proposed spot-guided aggregation module and adaptive scaling module, our method can yield more accurate matches under extreme viewpoint and illumination variations.

4.3. Relative Pose Estimation

Dataset and Metric. We use MegaDepth [29] and ScanNet [12] to demonstrate the performance of our ASTR in relative pose estimation. MegaDepth [29] is a large-scale outdoor dataset that contains 1 million internet images of 196 different outdoor scenes. Each scene is reconstructed by COLMAP [49], and the depth maps produced as intermediate results can be converted to ground-truth matches. We sample the same 1500 pairs as [50] for testing. All test images are resized such that their longer dimension is 1216. ScanNet [12] is usually used to validate the performance of indoor pose estimation. It is composed of monocular sequences with ground-truth poses and depth maps. Wide baselines and extensive textureless regions in the image pairs make ScanNet [12] challenging. For a fair comparison, we follow the same testing pairs and evaluation protocol as [50], and all test images are resized to 640 × 480. Note that we use our ASTR trained on MegaDepth [29] to evaluate its performance on ScanNet [12]. We report the AUC of the pose error at thresholds (5°, 10°, 20°), where the pose error is the maximum angular error in rotation and translation. The angular error is computed between the ground-truth pose and the predicted pose.
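For completeness, below is a sketch of this metric as it is commonly implemented in the evaluation scripts of prior work (e.g., the SuperGlue/LoFTR protocol); we assume the same convention here, so treat it as a reference rather than the authors' exact code.

```python
import numpy as np

def pose_error(R_gt, t_gt, R_est, t_est):
    """Maximum of rotation and translation angular errors, in degrees."""
    cos_r = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    err_r = np.degrees(np.arccos(np.clip(cos_r, -1.0, 1.0)))
    cos_t = np.dot(t_gt, t_est) / (np.linalg.norm(t_gt) * np.linalg.norm(t_est) + 1e-8)
    err_t = np.degrees(np.arccos(np.clip(np.abs(cos_t), 0.0, 1.0)))  # sign-invariant
    return max(err_r, err_t)

def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Area under the cumulative pose-error curve, normalized per threshold."""
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        x = np.concatenate((errors[:last], [t]))
        y = np.concatenate((recall[:last], [recall[last - 1]]))
        aucs.append(np.trapz(y, x) / t)
    return aucs
```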
Table 2. Evaluation on MegaDepth [29] for outdoor relative pose estimation.

Category         Method                     Pose estimation AUC
                                            @5°     @10°    @20°
Detector-based   SP [14]+SuperGlue [47]     42.2    59.0    73.6
                 SP [14]+SGMNet [8]         40.5    59.0    73.6
Detector-free    DRC-Net [27]               27.0    42.9    58.3
                 PDC-Net+(H) [54]           43.1    61.9    76.1
                 LoFTR [50]                 52.8    69.2    81.2
                 MatchFormer [57]           53.3    69.7    81.8
                 QuadTree [52]              54.6    70.5    82.2
                 ASpanFormer [9]            55.3    71.5    83.1
                 ASTR (ours)                58.4    73.1    83.8

Table 3. Evaluation on ScanNet [12] for indoor relative pose estimation. * indicates models trained on MegaDepth [29].

Category         Method                     Pose estimation AUC
                                            @5°     @10°    @20°
Detector-based   D2-Net [15]+NN             5.3     14.5    28.0
                 SP [14]+OANet [61]         11.8    26.9    43.9
                 SP [14]+SuperGlue [47]     16.2    33.8    51.8
Detector-free    DRC-Net [27]*              7.7     17.9    30.5
                 MatchFormer [57]*          15.8    32.0    48.0
                 LoFTR-OT [50]*             16.9    33.6    50.6
                 QuadTree [52]*             19.0    37.3    53.5
                 ASTR (ours)*               19.4    37.6    54.4

[Figure 5: qualitative matching comparison of LoFTR, MatchFormer, and our method on outdoor and indoor pairs.]
Figure 5. Qualitative results of dense matching on MegaDepth [29] and ScanNet [12].

Results. As shown in Table 2, our ASTR outperforms other state-of-the-art methods on MegaDepth [29]. In particular, our ASTR improves by 3.1% in AUC@5° and 1.6% in AUC@10°. Table 3 summarizes the performance comparison between the proposed ASTR and state-of-the-art methods on ScanNet [12]. Our ASTR ranks first when only considering models not trained on ScanNet [12], indicating the impressive generalization of our method. Thanks to the proposed spot-guided aggregation module and adaptive scaling module, our method can yield more correct matches, resulting in more accurate pose estimation. To further demonstrate the effectiveness of our ASTR, in Figure 5 we visually compare the matching results with other methods. Notably, our method can better handle challenges such as textureless areas, repetitive patterns, and scale variations.

4.4. Visual Localization

Dataset and Metric. In this experiment, InLoc [51] and Aachen Day-Night v1.1 [62] are used to verify the ability of our ASTR in visual localization. InLoc [51] is an indoor dataset with 9972 RGB-D images, of which 329 RGB images are employed as queries for visual localization. The challenge of InLoc [51] mainly comes from textureless regions and repetitive patterns under large viewpoint changes. In Aachen Day-Night v1.1 [62], 824 day-time images and 191 night-time images are chosen as queries for outdoor visual localization. Large illumination and viewpoint changes pose challenges for Aachen [62]. For both benchmarks, we evaluate the performance of our ASTR trained on MegaDepth [29] in the same way as [50]. The metrics of InLoc [51] and Aachen [62] are the same: they measure the percentage of images registered within given error thresholds.

Table 4. Visual localization evaluation on the InLoc [51] benchmark.

Method                                  DUC1                  DUC2
                                        (0.25m, 10°) / (0.5m, 10°) / (1m, 10°)
Patch2Pix [64] (w. SP [47]+CAPS [58])   42.4 / 62.6 / 76.3    43.5 / 61.1 / 71.0
LoFTR [50]                              47.5 / 72.2 / 84.8    54.2 / 74.8 / 85.5
MatchFormer [57]                        46.5 / 73.2 / 85.9    55.7 / 71.8 / 81.7
ASpanFormer [9]                         51.5 / 73.7 / 86.4    55.0 / 74.0 / 81.7
ASTR (ours)                             53.0 / 73.7 / 87.4    52.7 / 76.3 / 84.0

Table 5. Visual localization evaluation on the Aachen Day-Night benchmark v1.1 [62].

Method                        Day                   Night
                              (0.25m, 2°) / (0.5m, 5°) / (1m, 10°)
Localization with matching pairs provided in the dataset
R2D2 [42]+NN                  -                     71.2 / 86.9 / 98.9
ASLFeat [36]+NN               -                     72.3 / 86.4 / 97.9
SP [14]+SuperGlue [47]        -                     73.3 / 88.0 / 98.4
SP [14]+SGMNet [8]            -                     72.3 / 85.3 / 97.9
Localization with matching pairs generated by HLoc
LoFTR [50]                    88.7 / 95.6 / 99.0    78.5 / 90.6 / 99.0
ASpanFormer [9]               89.4 / 95.6 / 99.0    77.5 / 91.6 / 99.0
AdaMatcher [22]               89.2 / 95.9 / 99.2    79.1 / 92.1 / 99.5
ASTR (ours)                   89.9 / 95.6 / 99.2    76.4 / 92.1 / 99.5

Results. On the InLoc [51] benchmark, our method achieves the best performance on DUC1 and is on par with state-of-the-art methods on DUC2 (Table 4). On the Aachen [62] benchmark, our ASTR performs comparably to the other methods on the Day and Night scenes (Table 5). Overall, our method exhibits strong generalization ability in visual localization.

4.5. Ablation Study

Table 6. Ablation study of each component on MegaDepth [29].

Index   Multi-Level   Spot-Guided (l = 5, k = 4)   Scaling   Pose estimation AUC
                                                             @5°     @10°    @20°
1                                                            45.6    62.2    75.3
2       ✓                                                    46.7    63.1    76.3
3       ✓             ✓                                      47.7    64.5    77.4
4       ✓             ✓                            ✓         48.3    65.0    77.7

To deeply analyze the proposed method, we perform detailed ablation studies on MegaDepth [29] to evaluate the effectiveness of each component in ASTR. Here, we use images with a size of 544 for training and evaluation. As shown in Table 6, we gradually add these components to the baseline. The baseline (Index-1) we used is
slightly different from LoFTR [50]; more details can be found in the Supplementary Material.

Effectiveness of the Spot-Guided Aggregation Module. We divide the spot-guided aggregation module into multi-level cross attention and spot-guided attention for the ablation studies. We first add vanilla cross attention layers at 1/32 resolution to the baseline (Index-2 in Table 6). Comparing the results of Index-2 and Index-1, we conclude that 1/32-resolution global interaction across images is beneficial for image matching. Then, in Index-3, the linear attention layers at 1/8 resolution are replaced with our spot-guided attention layers. The performance of Index-3 is improved compared with Index-2, which verifies the effectiveness of our spot-guided attention. In Figure 6, we visualize vanilla and our spot-guided cross attention maps for contrast, showing that spot-guided attention can indeed avoid interference from unrelated areas.

[Figure 6: panels (a) Reference, (b) Vanilla Attention, (c) Ours, on outdoor and indoor examples.]
Figure 6. Visualization of vanilla and spot-guided cross attention maps on MegaDepth [29] (outdoor) and ScanNet [12] (indoor).

To maximize the effectiveness of our spot-guided attention, we explore how to set suitable parameters l and k. First, in the setting of Index-3, we fix l = 5 and vary k from 1 to 6. Observing the results in Table 7, the performance drops when k is either smaller or larger than 4. Then, we fix k = 4 and vary l from 3 to 9. As shown in Table 7, we find that the model achieves the best performance at l = 5. The reason may be that the spot area is too small to provide sufficient information from the other image when using small k or l. With large k or l, for a certain pixel, some matching areas of low-confidence or dissimilar points will damage its feature aggregation.

Table 7. Ablation study with different k and l in spot-guided attention on MegaDepth [29].

k (l = 5)   Pose estimation AUC        l (k = 4)   Pose estimation AUC
            @5°     @10°    @20°                   @5°     @10°    @20°
1           46.0    62.7    76.2       3           46.7    63.2    76.1
2           47.5    64.0    77.1       5           47.7    64.5    77.4
3           47.3    63.8    76.7       7           47.2    63.4    76.8
4           47.7    64.5    77.4       9           43.0    60.5    74.8
5           47.1    63.7    77.0
6           46.9    63.6    76.6

Effectiveness of the Adaptive Scaling Module. As shown in Table 6, comparing the results of Index-4 and Index-3, we can see that the performance is improved, which indicates that coarse-level matching results are better refined with the adaptive scaling module. In Figure 7, we visualize the cropped grids from the adaptive scaling module, showing that it can adaptively crop grids of different sizes according to scale variations.

[Figure 7: cropped grids on outdoor and indoor examples.]
Figure 7. Visualization of grids from the adaptive scaling module on MegaDepth [29] (outdoor) and ScanNet [12] (indoor).

5. Conclusion

In this paper, we propose a novel Adaptive Spot-guided Transformer (ASTR) for consistent local feature matching. To model local matching consistency, we design a spot-guided aggregation module so that most pixels avoid the impact of irrelevant areas, such as noisy and repetitive regions. To better handle large scale variation, we use the calculated depth information to adaptively adjust the size of grids at the fine stage. Extensive experimental results on five benchmarks demonstrate the effectiveness of the proposed method.

Limitation. Although our adaptive scaling module is lightweight and pluggable, it demands camera pose estimation at the coarse stage, which requires the camera intrinsic parameters. While camera intrinsic parameters are obtainable in standard datasets and most real-world scenarios, there are still some images from the wild that lack them, rendering the adaptive scaling module disabled in such cases.
Adaptive Spot-Guided Transformer for Consistent Local Feature Matching
Supplementary Material

In this supplementary material, we first introduce the general sparse attention operator in Section 6. In Section 7, we provide some details about our experiments. In Section 8, we show additional visualizations of the spot-guided attention and adaptive scaling modules.

6. General Sparse Attention Operator

[Figure 8: The illustration of our general sparse attention operator. Step 1 gathers the required query/key pairs via the index lists and computes their dot products; Step 2 applies softmax within each group of entries sharing the same query index; Step 3 multiplies by the gathered values and sums the results back into the updated queries.]

[Figure 9: Learning rate curve while training on MegaDepth [29].]

Due to the irregular number of key/value tokens for each query in spot-guided attention, a naive PyTorch [40] implementation, which uses a mask to set unwanted values in the attention map to 0, is not efficient in memory or computation. More generally, the same problem exists whenever the numbers of keys corresponding to different queries are not the same. Inspired by PointNet [?] and Stratified Transformer [?], we implement a general sparse attention operator in CUDA that is efficient in terms of memory and computation. We attempt to compute only the necessary attention between far fewer query/key tokens.

We can divide a vanilla attention operator into three steps. The inputs are grouped as query Q, key K and value V. First, the attention map A is computed by dot products as A = QK^T. Then, a softmax operator is performed on the attention map: A_s = softmax(A/√d_k). Finally, the updated query O can be obtained as O = A_s V. We optimize these three steps separately.

In step 1, because only a few entries of A are useful for sparse attention, we do not need to compute the full A. Instead, we compute the dot products between L_m pairs of queries and keys. M_q and M_k record the indexes of the query and key tokens whose dot products are needed; the lengths of M_q and M_k are both L_m. Here, we denote the sparse attention map as attn, which is calculated by

attn[i] = Q[M_q[i]] K[M_k[i]]^T,   i = 0, 1, ..., L_m − 1.    (14)

In step 2, we group the elements of attn that share the same query index and apply softmax on each group. The result is denoted as attn_s.

In step 3, we compute the updated query

O[q] = Σ_{M_q[i] = q} attn_s[i] · V[M_k[i]].    (15)

All three steps are implemented in CUDA. Compared with the naive implementation in PyTorch [40], our highly optimized implementation reduces the memory and time complexity from O(N_q · N_k · N_h · N_d²) to O(L_m · N_h · N_d²), where N_q, N_k and N_h are the numbers of query tokens, key tokens and attention heads, respectively, and N_d is the dimension of each head. Considering that L_m ≪ N_q · N_k, our implementation is much more efficient than the naive one.

In particular, we also calculate the matching matrix in spot-guided attention in this way and set the probability of unrelated pixels to 0, which greatly reduces the memory and computation cost.
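For readers who want a reference without the CUDA kernel, the following PyTorch sketch reproduces Equations (14) and (15) with gather and scatter operations. It is functionally equivalent to the three steps above but does not share the memory and speed profile of the CUDA operator; the √d_k scaling and the assumption that every query owns at least one key are ours.

```python
import torch

def sparse_attention(Q, K, V, M_q, M_k):
    """Sketch of the general sparse attention operator (Eqs. 14-15).

    Q: (N_q, d)   K, V: (N_k, d)
    M_q, M_k: (L_m,) long tensors pairing query indices with key indices.
    """
    d = Q.shape[-1]

    # Step 1 (Eq. 14): compute only the L_m required dot products.
    attn = (Q[M_q] * K[M_k]).sum(dim=-1) / d ** 0.5                 # (L_m,)

    # Step 2: softmax within each group of entries sharing a query index
    # (subtract the per-query maximum for numerical stability).
    q_max = torch.full((Q.shape[0],), float("-inf"), device=Q.device)
    q_max.scatter_reduce_(0, M_q, attn, reduce="amax")
    exp = (attn - q_max[M_q]).exp()
    denom = torch.zeros(Q.shape[0], device=Q.device).index_add_(0, M_q, exp)
    attn_s = exp / denom[M_q]                                       # (L_m,)

    # Step 3 (Eq. 15): scatter the weighted values back to their queries.
    O = torch.zeros_like(Q)
    O.index_add_(0, M_q, attn_s.unsqueeze(-1) * V[M_k])
    return O
```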
7. Experimental Details

7.1. Training Details

To reduce GPU memory, we randomly sample 50% of the ground-truth matches to supervise the matching matrix at the coarse stage, and we sample 20% of the maximum number of coarse-level possible matches at the fine stage. We train ASTR on MegaDepth [29] for 15 epochs. The initial learning rate is 1 × 10⁻³, with a linear learning rate warm-up for 15000 iterations. The learning rate curve is shown in Figure 9.

7.2. Differences between Baseline and LoFTR

There are two main differences between our baseline and LoFTR [50].

(1) Normalized Positional Encoding. LoFTR [50] adopts the absolute sinusoidal positional encoding following [7]:

PE_i(x, y) = sin(w_k · x)  if i = 4k,
             cos(w_k · x)  if i = 4k + 1,
             sin(w_k · y)  if i = 4k + 2,
             cos(w_k · y)  if i = 4k + 3,    (16)

where w_k = 1/10000^{2k/d}, d denotes the number of feature channels and i is the index of the feature channel. Considering the gap in image resolution between training and testing, we utilize the normalized positional encoding of [9], which is proven to mitigate the impact of image resolution changes in [9]. The normalized positional encoding NPE_i(·, ·) can be expressed as

NPE_i(x, y) = PE_i(x · W_train/W_test, y · H_train/H_test),    (17)

where W_train/test and H_train/test are the width and height of the training/testing images.

(2) Convolution in Attention. Chen et al. [9] find that replacing self attention with convolution can improve performance. Hence, we deprecate self attention and the MLP, and utilize a 3 × 3 convolution in our ASTR.
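A small sketch of Equations (16) and (17) above: the standard 2D sinusoidal encoding plus the coordinate rescaling that defines NPE. The channel layout follows Equation (16); the concrete sizes in the usage comment (832 for training, 1216 for the longer test dimension, 1/8 feature resolution, d = 256) are taken from the main paper, but how they are plugged in here is our assumption.

```python
import torch

def positional_encoding(h, w, d, scale_x=1.0, scale_y=1.0):
    """Eq. (16); scale_x = W_train / W_test and scale_y = H_train / H_test gives Eq. (17)."""
    y, x = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32), indexing="ij")
    x, y = x * scale_x, y * scale_y                       # NPE rescales test coordinates
    k = torch.arange(d // 4, dtype=torch.float32)
    w_k = 1.0 / 10000 ** (2 * k / d)                      # one frequency per channel group
    pe = torch.zeros(d, h, w)
    pe[0::4] = torch.sin(w_k[:, None, None] * x)          # i = 4k
    pe[1::4] = torch.cos(w_k[:, None, None] * x)          # i = 4k + 1
    pe[2::4] = torch.sin(w_k[:, None, None] * y)          # i = 4k + 2
    pe[3::4] = torch.cos(w_k[:, None, None] * y)          # i = 4k + 3
    return pe                                             # (d, H, W)

# Example: 1/8-resolution features of a 1216 x 1216 test image, 832 x 832 training size
# npe = positional_encoding(1216 // 8, 1216 // 8, d=256,
#                           scale_x=832 / 1216, scale_y=832 / 1216)
```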
7.3. CNN Backbone

Here we leverage a deepened version of the Feature Pyramid Network (FPN) [30], which reaches a minimum resolution of 1/32. The initial dimension of the stem is still 128, as in LoFTR [50], and the numbers of feature channels for the subsequent stages are [128, 196, 256, 256, 256].

8. Visualization Results

[Figure 10: panels (a) Reference, (b) Vanilla Attention, (c) Ours, on outdoor and indoor examples.]
Figure 10. Visualizations of vanilla and spot-guided attention maps on MegaDepth [29] (outdoor) and ScanNet [12] (indoor).

[Figure 11: cropped grids and the corresponding depth maps on outdoor examples.]
Figure 11. Visualizations of grids from the adaptive scaling module and the corresponding depth maps on MegaDepth [29]. Note that we use depth values with scale uncertainty to compose the depth maps.

In Figure 10, we pick two similar adjacent pixels as queries and visualize the corresponding attention maps of vanilla and our spot-guided attention for comparison. The vanilla attention mechanism is vulnerable to repetitive textures, while our spot-guided attention can focus on the correct areas in these repeated texture regions. Because large scale variation occurs frequently on outdoor datasets, we mainly visualize the grids from the adaptive scaling module and the corresponding depth maps on MegaDepth [29]. As shown in Figure 11, our adaptive scaling module can adjust the size of the grids according to the depth information.
References

[1] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In CVPR, 2017.
[2] A. Barroso-Laguna, E. Riba, D. Ponsa, and K. Mikolajczyk. Key.Net: Keypoint detection by handcrafted and learned CNN filters. In ICCV, 2019.
[3] A. Barroso-Laguna, Y. Tian, and K. Mikolajczyk. ScaleNet: A shallow architecture for scale estimation. In CVPR, 2022.
[4] A. Barroso-Laguna, Y. Verdie, B. Busam, and K. Mikolajczyk. HDD-Net: Hybrid detector descriptor with mutual interactive learning. In ACCV, 2020.
[5] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, 2008.
[6] G. Berton, C. Masone, V. Paolicelli, and B. Caputo. Viewpoint invariant dense matching for visual geolocalization. In ICCV, 2021.
[7] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[8] H. Chen, Z. Luo, J. Zhang, L. Zhou, X. Bai, Z. Hu, C.-L. Tai, and L. Quan. Learning to match features with seeded graph matching network. In ICCV, 2021.
[9] H. Chen, Z. Luo, L. Zhou, Y. Tian, M. Zhen, T. Fang, D. McKinnon, Y. Tsin, and L. Quan. ASpanFormer: Detector-free image matching with adaptive span transformer. arXiv preprint arXiv:2208.14201, 2022.
[10] Y. Chen, D. Huang, S. Xu, J. Liu, and Y. Liu. Guide local feature matching by overlap estimation. In AAAI, 2022.
[11] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[12] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
[13] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics, 36(4):1, 2017.
[14] D. DeTone, T. Malisiewicz, and A. Rabinovich. SuperPoint: Self-supervised interest point detection and description. In CVPR Workshops, 2018.
[15] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler. D2-Net: A trainable CNN for joint detection and description of local features. In CVPR, 2019.
[16] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[17] Y. Fu and Y. Wu. Scale-Net: Learning to reduce scale differences for large-scale invariant image matching. arXiv preprint arXiv:2112.10485, 2021.
[18] A. Grabner, P. M. Roth, and V. Lepetit. 3D pose estimation and 3D model retrieval for objects in the wild. In CVPR, 2018.
[19] J. He, T. Zhang, Y. Zheng, M. Xu, Y. Zhang, and F. Wu. Consistency graph modeling for semantic correspondence. IEEE Transactions on Image Processing, 30:4932–4946, 2021.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[22] D. Huang, Y. Chen, S. Xu, Y. Liu, W. Wu, Y. Ding, C. Wang, and F. Tang. Adaptive assignment for geometry aware local feature matching. arXiv preprint arXiv:2207.08427, 2022.
[23] S. Huang, Q. Wang, S. Zhang, S. Yan, and X. He. Dynamic context correspondence network for semantic alignment. In ICCV, 2019.
[24] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi. COTR: Correspondence transformer for matching across images. In ICCV, 2021.
[25] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, 2020.
[26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] X. Li, K. Han, S. Li, and V. Prisacariu. Dual-resolution correspondence networks. In NeurIPS, 2020.
[28] Z. Li, Z. Chen, X. Liu, and J. Jiang. DepthFormer: Exploiting long-range correlation and local information for accurate monocular depth estimation. arXiv preprint arXiv:2203.14211, 2022.
[29] Z. Li and N. Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In CVPR, 2018.
[30] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[31] C. Liu, J. Yuen, and A. Torralba. SIFT Flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):978–994, 2010.
[32] D. Liu, Y. Cui, L. Yan, C. Mousas, B. Yang, and Y. Chen. DenserNet: Weakly supervised visual localization using multi-scale feature aggregation. In AAAI, 2021.
[33] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
[34] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[35] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI, 1981.
[36] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan. ASLFeat: Learning local features of accurate shape and localization. In CVPR, 2020.
[37] M. Meng, T. Zhang, Z. Zhang, Y. Zhang, and F. Wu. Adversarial transformers for weakly supervised object localization. IEEE Transactions on Image Processing, 31:7130–7143, 2022.
[38] M. Meng, T. Zhang, Z. Zhang, Y. Zhang, and F. Wu. Task-aware weakly supervised object localization with transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[39] Y. Ono, E. Trulls, P. Fua, and K. M. Yi. LF-Net: Learning local features from images. In NeurIPS, 2018.
[40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
[41] M. Persson and K. Nordberg. Lambda twist: An accurate fast robust perspective three point (P3P) solver. In ECCV, 2018.
[42] J. Revaud, P. Weinzaepfel, C. R. de Souza, and M. Humenberger. R2D2: Repeatable and reliable detector and descriptor. In NeurIPS, 2019.
[43] I. Rocco, R. Arandjelović, and J. Sivic. Efficient neighbourhood consensus networks via submanifold sparse convolutions. In ECCV, 2020.
[44] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic. Neighbourhood consensus networks. In NeurIPS, 2018.
[45] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In ECCV, 2006.
[46] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
[47] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In CVPR, 2020.
[48] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, et al. Benchmarking 6DOF outdoor visual localization in changing conditions. In CVPR, 2018.
[49] J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, 2016.
[50] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou. LoFTR: Detector-free local feature matching with transformers. In CVPR, 2021.
[51] H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR, 2018.
[52] S. Tang, J. Zhang, S. Zhu, and P. Tan. QuadTree attention for vision transformers. arXiv preprint arXiv:2201.02767, 2022.
[53] P. Truong, M. Danelljan, and R. Timofte. GLU-Net: Global-local universal network for dense flow and correspondences. In CVPR, 2020.
[54] P. Truong, M. Danelljan, R. Timofte, and L. Van Gool. PDC-Net+: Enhanced probabilistic dense correspondence network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[55] M. Tyszkiewicz, P. Fua, and E. Trulls. DISK: Learning local features with policy gradient. In NeurIPS, 2020.
[56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.
[57] Q. Wang, J. Zhang, K. Yang, K. Peng, and R. Stiefelhagen. MatchFormer: Interleaving attention in transformers for feature matching. arXiv preprint arXiv:2203.09645, 2022.
[58] Q. Wang, X. Zhou, B. Hariharan, and N. Snavely. Learning feature descriptors using camera pose supervision. In ECCV, 2020.
[59] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[60] G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci. Transformer-based attention networks for continuous pixel-wise prediction. In ICCV, 2021.
[61] J. Zhang, D. Sun, Z. Luo, A. Yao, L. Zhou, T. Shen, Y. Chen, L. Quan, and H. Liao. Learning two-view correspondences and geometry using order-aware network. In ICCV, 2019.
[62] Z. Zhang, T. Sattler, and D. Scaramuzza. Reference pose generation for long-term visual localization via learned features and view synthesis. International Journal of Computer Vision, 129(4):821–844, 2021.
[63] L. Zhou, S. Zhu, T. Shen, J. Wang, T. Fang, and L. Quan. Progressive large scale-invariant image matching in scale space. In ICCV, 2017.
[64] Q. Zhou, T. Sattler, and L. Leal-Taixé. Patch2Pix: Epipolar-guided pixel-level correspondences. In CVPR, 2021.
