CSCI585
Nearest Neighbor Queries
Instructor: Cyrus Shahabi
Nearest Neighbor Search
Retrieve the nearest neighbor of query point Q.
Simple strategy: convert the nearest neighbor search into a range search.
1. Guess a range around Q that contains at least one object, say O. If the current guess does not include any objects, increase the range size until an object is found.
2. Compute the distance d between Q and O.
3. Re-execute the range query with radius d around Q.
4. Compute the distance from Q to each retrieved object. The object at minimum distance is the nearest neighbor.
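The strategy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear-scan `range_query` stands in for an index-backed range search, and the names `nn_by_range_search` and `initial_radius`, as well as the radius-doubling schedule, are choices made for this example rather than anything prescribed by the slides.

```python
import math

def range_query(points, q, r):
    """Return all points within distance r of q (a linear scan stands in for an index)."""
    return [p for p in points if math.dist(p, q) <= r]

def nn_by_range_search(points, q, initial_radius=1.0):
    """Naive NN: grow a guessed range until it is non-empty, then refine it."""
    if not points:
        return None
    r = initial_radius                      # assumed to be > 0
    hits = range_query(points, q, r)
    while not hits:                         # guess was too small: enlarge and retry
        r *= 2
        hits = range_query(points, q, r)
    d = math.dist(hits[0], q)               # distance to some object O inside the guess
    candidates = range_query(points, q, d)  # re-execute with radius d around Q
    return min(candidates, key=lambda p: math.dist(p, q))

nearest = nn_by_range_search([(5, 5), (1, 2), (9, 9)], (0, 0))
```

Each enlargement re-runs the range query from scratch, which is one reason a poor initial guess makes this approach expensive.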
Naive Approach
[Figure: query point Q over an R-tree with root A (children B and C). B contains leaves E (objects 4, 5, 6) and F (objects 1, 2, 3); C contains leaves G (objects 10, 11, 12) and H (objects 7, 8, 9).]
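Issues:
How should the range be guessed?
Retrieval may be sub-optimal if the guessed range is wrong: too small forces repeated queries, too large retrieves many unnecessary objects.
This is especially problematic in high-dimensional spaces.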
Depth-First and Best-First Search Using R-trees
Given a query location q, find the nearest object.
Goal: avoid visiting nodes that cannot contain results.
Basic Pruning Metric: MINDIST
Minimum distance between q and an MBR.
[Figure: an MBR E1, the query point q, a candidate point p, and the segment marking mindist(E1, q).]
mindist(E1, q) is a lower bound of d(o, q) for every object o in E1.
If we have found a candidate NN p, we can prune every MBR whose mindist > d(p, q).
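For axis-aligned MBRs and Euclidean distance, MINDIST can be computed one dimension at a time. The sketch below assumes an MBR is given by its lower and upper corners `lo` and `hi`; the function name and argument layout are illustrative, not taken from the slides.

```python
import math

def mindist(q, lo, hi):
    """Minimum Euclidean distance from point q to the box with corners lo and hi."""
    total = 0.0
    for x, l, h in zip(q, lo, hi):
        gap = max(l - x, 0.0, x - h)   # 0 when x lies inside [l, h] on this axis
        total += gap * gap
    return math.sqrt(total)

outside = mindist((0, 0), (3, 4), (5, 6))   # q is below and to the left of the box
inside = mindist((4, 5), (3, 4), (5, 6))    # q lies inside the box
```

With a candidate NN p in hand, an MBR can be skipped whenever `mindist(q, lo, hi) > math.dist(p, q)`.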
MINDIST Property
MINDIST is a lower bound of any k-NN distance
[Figure: query points (p1, p2) at several positions relative to a rectangle with corners (s1, s2) and (t1, t2).]
Depth-First (DF) NN Algorithm
Roussopoulos et al., SIGMOD, 1995
Main idea: starting from the root, visit nodes according to their mindist in a depth-first manner.
[Figure: data points a through m and the query point in the plane (x and y axes from 0 to 10), grouped into leaf MBRs E3 to E7 under intermediate nodes E1 and E2. The R-tree is drawn with mindist values. Root: E1 (1), E2 (2). E1: E3 (5), E4 (9), E5 (5). E2: E6 (2), E7 (13). Leaf E3: a (5). Leaf E6: i (2), j (10), k (13).]
Note: distances are not actually stored inside nodes; they are shown only for illustration.
DF Search: Visit E1
[Figure: the same data set and R-tree, with E1 (mindist 1) visited first.]
DF Search: Find Candidate NN a
[Figure: the same data set and R-tree, with object a highlighted.]
First candidate NN: a, with distance 5.
DF Search: Backtrack to Root and Visit E2
[Figure: the same data set and R-tree, with E2 (mindist 2) visited after backtracking.]
First candidate NN: a, with distance 5.
DF Search: Find Actual NN i
[Figure: the same data set and R-tree, with object i highlighted.]
First candidate NN: a, with distance 5.
Actual NN: i, with distance 2.
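The depth-first walk-through can be mirrored by a short recursive sketch. The tree representation is an assumption made for this example: a node is a dict with an `"mbr"` (lower corner, upper corner) and either `"children"` or `"points"`. The toy tree is hypothetical data, not the data set from the figures.

```python
import math

def mindist(q, mbr):
    """Lower bound on the distance from q to anything inside mbr = (lo, hi)."""
    lo, hi = mbr
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def df_nn(node, q, best=(math.inf, None)):
    """Depth-first NN search; best is (distance, point) of the current candidate."""
    if "points" in node:                       # leaf: compare actual objects
        for p in node["points"]:
            d = math.dist(p, q)
            if d < best[0]:
                best = (d, p)
        return best
    # Visit children in ascending mindist order.
    for child in sorted(node["children"], key=lambda c: mindist(q, c["mbr"])):
        if mindist(q, child["mbr"]) > best[0]:
            break                              # later children are at least as far
        best = df_nn(child, q, best)
    return best

# Toy two-level tree (hypothetical data).
leaf_a = {"mbr": ((1, 1), (2, 2)), "points": [(1, 1), (2, 2)]}
leaf_b = {"mbr": ((5, 4), (6, 7)), "points": [(5, 4), (6, 7)]}
root = {"mbr": ((1, 1), (6, 7)), "children": [leaf_a, leaf_b]}

dist, nn = df_nn(root, (3, 3))   # leaf_b is never opened: its mindist exceeds dist
```

Because siblings are sorted by mindist, the first sibling that fails the pruning test ends the loop for that node.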
Optimality
[Figure: the same data set and MBRs E1 to E7 around the query point.]
Question: Which is the minimal set of nodes that must be visited by any NN algorithm?
Answer: The set of nodes whose MINDIST is smaller than or equal to the distance between q and its NN (e.g., E1, E2, E6).
A Better Strategy for k-NN Search
Keep a priority queue sorted by MINDIST.
Nodes are traversed in queue order.
The search stops when an object is at the top of the queue (the 1-NN has been found).
k-NN can be computed incrementally.
The strategy is I/O optimal.
Priority Queue
[Figure: query point Q over an R-tree with root A (children B and C), where B contains leaves E and F and C contains leaves G and H, together with the successive priority-queue states as nodes are expanded until the 1-NN reaches the top of the queue.]
Best-First (BF) NN Algorithm (Optimal)
Hjaltason and Samet, TODS, 1999
Keep a heap H of index entries and objects, ordered by MINDIST.
Initially, H contains the root.
While H is not empty
    Extract the element with minimum MINDIST.
    If it is an index entry, insert its children into H.
    If it is an object, return it as the NN.
End while
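The loop above maps almost directly onto Python's `heapq`. The sketch below is an illustration under assumptions: nodes are dicts with an `"mbr"` and either `"children"` or `"points"`, objects are plain coordinate tuples, and a counter breaks ties so that heap entries never compare nodes directly. The toy tree is hypothetical data.

```python
import heapq
import itertools
import math

def mindist(q, mbr):
    """Lower bound on the distance from q to anything inside mbr = (lo, hi)."""
    lo, hi = mbr
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def bf_nn(root, q):
    """Best-first NN search: always expand the heap entry with the smallest distance."""
    tie = itertools.count()                    # tie-breaker for equal distances
    heap = [(mindist(q, root["mbr"]), next(tie), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):            # an object reached the top: it is the NN
            return item, d
        if "points" in item:                   # leaf node: insert its objects
            for p in item["points"]:
                heapq.heappush(heap, (math.dist(p, q), next(tie), p))
        else:                                  # index node: insert its children
            for child in item["children"]:
                heapq.heappush(heap, (mindist(q, child["mbr"]), next(tie), child))
    return None, math.inf

# Toy two-level tree (hypothetical data).
leaf_a = {"mbr": ((1, 1), (2, 2)), "points": [(1, 1), (2, 2)]}
leaf_b = {"mbr": ((5, 4), (6, 7)), "points": [(5, 4), (6, 7)]}
root = {"mbr": ((1, 1), (6, 7)), "children": [leaf_a, leaf_b]}

nn, dist = bf_nn(root, (3, 3))
```

For an object, the heap key is its actual distance, which coincides with MINDIST for a point.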
BF Search: Visit Root
[Figure: the same data set and R-tree as in the DF example.]
Action        Heap
Visit root    E1 (1), E2 (2)
BF Search: Visit E1
[Figure: the same data set and R-tree as in the DF example.]
Action        Heap
Visit root    E1 (1), E2 (2)
Follow E1     E2 (2), E3 (5), E5 (5), E4 (9)
BF Search: Visit E2
[Figure: the same data set and R-tree as in the DF example.]
Action        Heap
Visit root    E1 (1), E2 (2)
Follow E1     E2 (2), E3 (5), E5 (5), E4 (9)
Follow E2     E6 (2), E3 (5), E5 (5), E4 (9), E7 (13)
BF Search: Visit E6
[Figure: the same data set and R-tree as in the DF example.]
Action        Heap
Visit root    E1 (1), E2 (2)
Follow E1     E2 (2), E3 (5), E5 (5), E4 (9)
Follow E2     E6 (2), E3 (5), E5 (5), E4 (9), E7 (13)
Follow E6     i (2), E3 (5), E5 (5), E4 (9), j (10), E7 (13), k (13)
BF Search: Find Actual NN i
[Figure: the same data set and R-tree as in the DF example, with object i highlighted.]
Action        Heap
Visit root    E1 (1), E2 (2)
Follow E1     E2 (2), E3 (5), E5 (5), E4 (9)
Follow E2     E6 (2), E3 (5), E5 (5), E4 (9), E7 (13)
Follow E6     i (2), E3 (5), E5 (5), E4 (9), j (10), E7 (13), k (13)
Object i is at the top of the heap: report i and terminate.
Generalizations
Both DF and BF can be easily adapted to (i) extended (instead of point) objects and (ii) retrieval of k (> 1) nearest neighbors.
BF can be made incremental; i.e., it can report the NNs in increasing order of distance without a given value of k.
Example: find the 10 closest cities to Hong Kong with a population of more than 1 million. We may have to retrieve many (>> 10) cities around Hong Kong in order to answer the query.
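Incremental retrieval fits naturally into a Python generator: the caller keeps pulling neighbors until enough of them pass an extra filter. The sketch below assumes the dict-based tree layout used for illustration here (an `"mbr"` plus `"children"` or `"points"`); the set `big` is a hypothetical stand-in for a predicate such as "population over 1 million".

```python
import heapq
import itertools
import math

def mindist(q, mbr):
    """Lower bound on the distance from q to anything inside mbr = (lo, hi)."""
    lo, hi = mbr
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def incremental_nn(root, q):
    """Yield (distance, point) pairs in increasing order of distance from q."""
    tie = itertools.count()
    heap = [(mindist(q, root["mbr"]), next(tie), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):            # object: report it and keep going
            yield d, item
        elif "points" in item:                 # leaf node
            for p in item["points"]:
                heapq.heappush(heap, (math.dist(p, q), next(tie), p))
        else:                                  # index node
            for child in item["children"]:
                heapq.heappush(heap, (mindist(q, child["mbr"]), next(tie), child))

# Toy two-level tree (hypothetical data).
leaf_a = {"mbr": ((1, 1), (2, 2)), "points": [(1, 1), (2, 2)]}
leaf_b = {"mbr": ((5, 4), (6, 7)), "points": [(5, 4), (6, 7)]}
root = {"mbr": ((1, 1), (6, 7)), "children": [leaf_a, leaf_b]}

big = {(1, 1), (6, 7)}                         # hypothetical: points passing the filter
matches = (p for _, p in incremental_nn(root, (3, 3)) if p in big)
closest_two = list(itertools.islice(matches, 2))
```

No value of k is fixed in advance; the search only does as much work as the consumer asks for.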
Generalize to k-NN
Keep a sorted buffer of at most k current nearest neighbors.
Pruning is done according to the distance of the furthest nearest neighbor in this buffer.
Example:
[Figure: query point P, the k-th object in the buffer at distance Actual_dist, and an MBR R whose MINDIST is compared against Actual_dist.]
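A sorted buffer of at most k neighbors can be kept with `bisect.insort`. The sketch below adapts a depth-first search under assumptions: the dict-based tree layout is illustrative, k is at least 1, and the toy tree is hypothetical data.

```python
import bisect
import math

def mindist(q, mbr):
    """Lower bound on the distance from q to anything inside mbr = (lo, hi)."""
    lo, hi = mbr
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def knn_df(node, q, k, buf=None):
    """Depth-first k-NN; buf is a list of (distance, point) kept sorted by distance."""
    if buf is None:
        buf = []

    def kth_dist():
        # Pruning bound: distance of the furthest buffered neighbor once the buffer is full.
        return buf[-1][0] if len(buf) == k else math.inf

    if "points" in node:
        for p in node["points"]:
            d = math.dist(p, q)
            if d < kth_dist():
                bisect.insort(buf, (d, p))
                del buf[k:]                    # keep at most k entries
        return buf
    for child in sorted(node["children"], key=lambda c: mindist(q, c["mbr"])):
        if mindist(q, child["mbr"]) > kth_dist():
            break                              # later children are at least as far
        knn_df(child, q, k, buf)
    return buf

# Toy two-level tree (hypothetical data).
leaf_a = {"mbr": ((1, 1), (2, 2)), "points": [(1, 1), (2, 2)]}
leaf_b = {"mbr": ((5, 4), (6, 7)), "points": [(5, 4), (6, 7)]}
root = {"mbr": ((1, 1), (6, 7)), "children": [leaf_a, leaf_b]}

two_nearest = [p for _, p in knn_df(root, (3, 3), 2)]
```

Until the buffer holds k entries the bound is infinite, so nothing is pruned; after that, every insertion can only tighten it.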
Another filter idea based on MBR Face Property
An MBR is an n-dimensional Minimal Bounding Rectangle used in R-trees: the smallest n-dimensional rectangle that bounds its corresponding objects.
MBR face property: every face of any MBR contains at least one point of some object in the database.
MBR Face Property 2D
[Figure: the MBR face property illustrated in 2D.]
MBR Face Property 3D
[Figure: the MBR face property illustrated in 3D for a rectangle R.]
Improving the KNN Algorithm
While the MinDist based algorithm is I/O optimal, its performance may be further improved by pruning nodes from the priority queue.
Properties of MINMAXDIST
MINMAXDIST(P, R) is the minimum, over all dimensions, of the distance from P to the furthest point of the closest face of R in that dimension.
MINMAXDIST is the smallest possible upper bound on the distance from the point P to the nearest object in the rectangle R.
By the MBR face property, MINMAXDIST guarantees that there is an object within R at a distance from P less than or equal to MINMAXDIST(P, R).
MINMAXDIST is therefore an upper bound of the 1-NN distance.
MINDIST & MINMAXDIST
MINDIST(P,R) <= NN(P) <= MINMAXDIST(P,R)
Here NN(P) denotes the distance from P to its nearest object inside R.
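Both bounds can be computed from the corners of R alone. The sketch below assumes Euclidean distance and an MBR given by its lower and upper corners `lo` and `hi`; the function names are illustrative. For each dimension k it takes the nearer face along k and the farther edge along every other dimension, then keeps the smallest result.

```python
import math

def mindist(p, lo, hi):
    """Minimum Euclidean distance from p to the box with corners lo and hi."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(p, lo, hi)))

def minmaxdist(p, lo, hi):
    """Smallest distance within which the box is guaranteed to contain an object."""
    near_sq, far_sq = [], []
    for x, l, h in zip(p, lo, hi):
        mid = (l + h) / 2
        near = l if x <= mid else h            # closer of the two faces on this axis
        far = l if x >= mid else h             # farther of the two faces on this axis
        near_sq.append((x - near) ** 2)
        far_sq.append((x - far) ** 2)
    n = len(near_sq)
    # For each axis k: nearest face along k, farthest extent along all other axes.
    return math.sqrt(min(
        sum(near_sq[i] if i == k else far_sq[i] for i in range(n))
        for k in range(n)))

lower = mindist((0, 0), (3, 4), (5, 6))
upper = minmaxdist((0, 0), (3, 4), (5, 6))
```

In this example the two closest faces are x = 3 and y = 4; their farthest points from the origin are (3, 6) and (5, 4), and the nearer of those two points determines the bound.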
MinDist & MinMaxDist 3D
[Figure: query point Q and rectangle R in 3D, showing MinDist(Q, R) and MinMaxDist(Q, R).]
Pruning 1
Downward pruning: an MBR R is discarded if there exists another MBR R' such that MINDIST(P, R) > MINMAXDIST(P, R').
[Figure: point P with MBRs R and R', showing MINDIST(P, R) and MINMAXDIST(P, R').]
Pruning 2
Downward pruning: an object O is discarded if there exists an MBR R such that Actual_dist(P, O) > MINMAXDIST(P, R).
[Figure: point P, object O, and MBR R, showing Actual_dist(P, O) and MINMAXDIST(P, R).]
Pruning 3
Upward pruning: an MBR R is discarded if an object O is found such that MINDIST(P, R) > Actual_dist(P, O).
[Figure: point P, object O, and MBR R, showing MINDIST(P, R) and Actual_dist(P, O).]
MINDIST vs MINMAXDIST Ordering
MINDIST: optimistic (the box that is closer).
MINMAXDIST: pessimistic (the box that has at least one object at that distance).
Example: MINDIST ordering finds the 1-NN first
MINDIST vs MINMAXDIST Ordering
Example: MINMAXDIST ordering finds the 1-NN first
NN-search Algorithm using the mentioned pruning rules (branch and bound)
1. Initialize the nearest distance as infinite.
2. Traverse the tree depth-first starting from the root. At each index node, sort all MBRs using an ordering metric and put them in an Active Branch List (ABL).
3. Apply pruning rules 1 and 2 to the ABL.
4. Visit the MBRs from the ABL following the order until it is empty.
5. If the node is a leaf, compute actual distances, compare them with the best NN so far, and update if necessary.
6. On return from the recursion, use pruning rule 3.
7. When the ABL is empty, the NN search returns.
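The steps above can be sketched as a recursive function. This is a simplified illustration, not the published algorithm verbatim: it uses MINDIST as the ordering metric, applies pruning rules 1 and 3, and leaves rule 2 out to stay short (dropping a pruning rule costs extra work, not correctness). The dict-based tree layout and the toy data are assumptions made for the example.

```python
import math

def mindist(q, mbr):
    """Lower bound on the distance from q to anything inside mbr = (lo, hi)."""
    lo, hi = mbr
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def minmaxdist(q, mbr):
    """Upper bound on the distance from q to the nearest object inside mbr."""
    lo, hi = mbr
    near_sq, far_sq = [], []
    for x, l, h in zip(q, lo, hi):
        mid = (l + h) / 2
        near_sq.append((x - (l if x <= mid else h)) ** 2)
        far_sq.append((x - (l if x >= mid else h)) ** 2)
    n = len(near_sq)
    return math.sqrt(min(
        sum(near_sq[i] if i == k else far_sq[i] for i in range(n))
        for k in range(n)))

def bb_nn(node, q, best=(math.inf, None)):
    """Branch-and-bound NN; best is (distance, point) of the best NN so far."""
    if "points" in node:                                   # step 5: leaf node
        for p in node["points"]:
            d = math.dist(p, q)
            if d < best[0]:
                best = (d, p)
        return best
    # Step 2: build the Active Branch List, ordered by MINDIST.
    abl = sorted(node["children"], key=lambda c: mindist(q, c["mbr"]))
    # Step 3 (rule 1 only): drop MBRs farther than some sibling's MINMAXDIST.
    bound = min((minmaxdist(q, c["mbr"]) for c in abl), default=math.inf)
    abl = [c for c in abl if mindist(q, c["mbr"]) <= bound]
    for child in abl:                                      # step 4
        if mindist(q, child["mbr"]) > best[0]:             # step 6: rule 3
            continue
        best = bb_nn(child, q, best)
    return best                                            # step 7

# Toy two-level tree (hypothetical data).
leaf_a = {"mbr": ((1, 1), (2, 2)), "points": [(1, 1), (2, 2)]}
leaf_b = {"mbr": ((5, 4), (6, 7)), "points": [(5, 4), (6, 7)]}
root = {"mbr": ((1, 1), (6, 7)), "children": [leaf_a, leaf_b]}

dist, nn = bb_nn(root, (3, 3))
```

Rule 3 is re-checked before each child is opened, so it takes effect as soon as the recursion into an earlier sibling has returned with a closer candidate.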
Best-First vs Branch and Bound
Best-First is the optimal algorithm in the sense that it visits all the necessary nodes and nothing more.
However, it needs to store a large priority queue (PQ) in main memory; if the PQ becomes large, we have thrashing.
Branch and Bound (BB) uses small lists for each node. It also uses MINMAXDIST to prune some entries.
References
1. N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, pages 71-79, 1995.
2. G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Transactions on Database Systems 24, 2 (June 1999), 265-318.
3. D. Papadias. Novel Forms of Nearest Neighbor Search in Spatial and Spatiotemporal Databases. STDBM06 keynote slides.