Ranking Spatial Data by Quality Preferences
Man Lung Yiu, Hua Lu, Member, IEEE, Nikos Mamoulis, and Michail Vaitis
Abstract: A spatial preference query ranks objects based on the qualities of features in their spatial neighborhood. For example,
using a real estate agency database of flats for lease, a customer may want to rank the flats with respect to the appropriateness of their
location, defined after aggregating the qualities of other features (e.g., restaurants, cafes, hospital, market, etc.) within their spatial
neighborhood. Such a neighborhood concept can be specified by the user via different functions. It can be an explicit circular region
within a given distance from the flat. Another intuitive definition is to assign higher weights to the features based on their proximity to
the flat. In this paper, we formally define spatial preference queries and propose appropriate indexing techniques and search
algorithms for them. Extensive evaluation of our methods on both real and synthetic data reveals that an optimized branch-and-bound
solution is efficient and robust with respect to different parameters.
Index Terms: Query processing, spatial databases.
1 INTRODUCTION
Spatial database systems manage large collections of geographic entities, which apart from spatial attributes
contain nonspatial information (e.g., name, size, type, price,
etc.). In this paper, we study an interesting type of preference
queries, which select the best spatial location with respect to
the quality of facilities in its spatial neighborhood.
Given a set D of interesting objects (e.g., candidate locations), a top-k spatial preference query retrieves the k objects in D with the highest scores. The score of an
object is defined by the quality of features (e.g., facilities or
services) in its spatial neighborhood. As a motivating
example, consider a real estate agency office that holds a
database with available flats for lease. Here, "feature" refers
to a class of objects in a spatial map such as specific facilities
or services. A customer may want to rank the contents of
this database with respect to the quality of their locations,
quantified by aggregating nonspatial characteristics of other
features (e.g., restaurants, cafes, hospital, market, etc.) in the
spatial neighborhood of the flat (defined by a spatial range
around it). Quality may be subjective and query-parametric.
For example, a user may define quality with respect to
nonspatial attributes of restaurants around it (e.g., whether
they serve seafood, price range, etc.).
As another example, the user (e.g., a tourist) wishes to find a hotel p that is close to a high-quality restaurant and a high-quality cafe. Fig. 1a illustrates the locations of an object data set D (hotels) in white, and two feature data sets: the set F_1 (restaurants) in gray, and the set F_2 (cafes) in black. Feature points are labeled by quality values that can be obtained from rating providers (e.g., http://www.zagat.com/). For the ease of discussion, the qualities are normalized to values in [0, 1]. The score τ(p) of a hotel p is defined in terms of: 1) the maximum quality for each feature in the neighborhood region of p, and 2) the aggregation of those qualities.
A simple score instance, called the range score, binds the neighborhood region to a circular region at p with radius ε (shown as a circle), and the aggregate function to SUM. For instance, the maximum qualities of gray and black points within the circle of p_1 are 0.9 and 0.6, respectively, so the score of p_1 is τ(p_1) = 0.9 + 0.6 = 1.5. Similarly, we obtain τ(p_2) = 1.0 + 0.1 = 1.1 and τ(p_3) = 0.7 + 0.7 = 1.4. Hence, the hotel p_1 is returned as the top result.
In fact, the semantics of the aggregate function is relevant to the user's query. The SUM function attempts to balance the overall qualities of all features. For the MIN function, the top result becomes p_3, with the score τ(p_3) = min{0.7, 0.7} = 0.7. It ensures that the top result has reasonably high qualities in all features. For the MAX function, the top result is p_2, with τ(p_2) = max{1.0, 0.1} = 1.0. It is used to optimize the quality in a particular feature, but not necessarily all of them.
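To make the example concrete, the following short Python sketch recomputes the scores above by brute force. This is our own illustration, not the paper's algorithms: the coordinates are invented so that the distances mirror Fig. 1a, and only the quality values and the resulting scores come from the example.

import math

def range_score(p, feature_sets, eps, agg=sum):
    # Component score per feature set: max quality within distance eps, or 0.
    comps = []
    for F in feature_sets:
        within = [w for (x, y, w) in F if math.dist(p, (x, y)) <= eps]
        comps.append(max(within, default=0.0))
    return agg(comps)

# Toy layout mirroring Fig. 1a: (x, y, quality); coordinates are made up.
F1 = [(0.15, 0.20, 0.9), (0.60, 0.70, 1.0), (0.40, 0.45, 0.7)]   # restaurants
F2 = [(0.20, 0.15, 0.6), (0.65, 0.75, 0.1), (0.45, 0.40, 0.7)]   # cafes
hotels = {"p1": (0.18, 0.18), "p2": (0.62, 0.72), "p3": (0.42, 0.42)}
for name, p in hotels.items():
    print(name,
          range_score(p, [F1, F2], 0.2),            # SUM: p1 wins with 1.5
          range_score(p, [F1, F2], 0.2, agg=min),   # MIN: p3 wins with 0.7
          range_score(p, [F1, F2], 0.2, agg=max))   # MAX: p2 wins with 1.0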
The neighborhood region in the above spatial preference query can also be defined by other score functions. A meaningful score function is the influence score (see Section 4). As opposed to the crisp radius ε constraint in the range score, the influence score smoothens the effect of ε and assigns higher weights to cafes that are closer to the hotel. Fig. 1b shows a hotel p_5 and three cafes s_1, s_2, s_3 (with their quality values). The circles have their radii as multiples of ε. Now, the score of a cafe s_i is computed by multiplying its quality with the weight 2^{-j}, where j is the order of the smallest circle containing s_i. For example, the scores of s_1, s_2, and s_3 are 0.3 · 2^{-1} = 0.15, 0.9 · 2^{-2} = 0.225, and 1.0 · 2^{-3} = 0.125, respectively. The influence score of p_5 is taken as the highest value (0.225).
Traditionally, there are two basic ways for ranking
objects: 1) spatial ranking, which orders the objects
according to their distance from a reference point, and
2) nonspatial ranking, which orders the objects by an
aggregate function on their nonspatial values. Our top-k spatial preference query integrates these two types of
ranking in an intuitive way. As indicated by our examples,
this new query has a wide range of applications in service
recommendation and decision support systems.
To our knowledge, there is no existing efficient solution for processing the top-k spatial preference query. A brute-force approach (to be elaborated in Section 3.2) for evaluating it is to compute the scores of all objects in D and select the top-k ones. This method, however, is
expected to be very expensive for large input data sets. In
this paper, we propose alternative techniques that aim at
minimizing the I/O accesses to the object and feature data
sets, while being also computationally efficient. Our
techniques apply to spatial-partitioning access methods
and compute upper score bounds for the objects indexed by
them, which are used to effectively prune the search space.
Specifically, we contribute the branch-and-bound (BB)
algorithm and the feature join (FJ) algorithm for efficiently
processing the top-k spatial preference query.
Furthermore, this paper studies three relevant extensions
that have not been investigated in our preliminary work [1].
The first extension (Section 3.4) is an optimized version of BB
that exploits a more efficient technique for computing the
scores of the objects. The second extension (Section 3.6)
studies adaptations of the proposed algorithms for aggregate
functions other than SUM, e.g., the functions MIN and MAX.
The third extension (Section 4) develops solutions for the
top-k spatial preference query based on the influence score.
The rest of this paper is structured as follows: Section 2
provides background on basic and advanced queries on
spatial databases, as well as top-k query evaluation in relational databases. Section 3 defines the top-k spatial preference query and presents our solutions. Section 4 studies the query extension for the influence score. In Section 5, our
query algorithms are experimentally evaluated with real and
synthetic data. Finally, Section 6 concludes the paper with
future research directions.
2 BACKGROUND AND RELATED WORK
Object ranking is a popular retrieval task in various
applications. In relational databases, we rank tuples using
an aggregate score function on their attribute values [2]. For
example, a real estate agency maintains a database that
contains information of flats available for rent. A potential
customer wishes to view the top 10 flats with the largest
sizes and lowest prices. In this case, the score of each flat is
expressed by the sum of two qualities: size and price, after
normalization to the domain [0, 1] (e.g., 1 means the largest
size and the lowest price). In spatial databases, ranking is often associated with nearest neighbor (NN) retrieval. Given a query location, we are interested in retrieving the set of nearest objects to it that satisfy a condition (e.g., restaurants). Assuming that the set of interesting objects is
indexed by an R-tree [3], we can apply distance bounds
and traverse the index in a branch-and-bound fashion to
obtain the answer [4].
Nevertheless, it is not always possible to use multidimensional indexes for top-k retrieval. First, such indexes break down in high-dimensional spaces [5], [6]. Second, top-k queries may involve an arbitrary set of user-specified
attributes (e.g., size and price) from possible ones (e.g., size,
price, distance to the beach, number of bedrooms, floor,
etc.) and indexes may not be available for all possible
attribute combinations (i.e., they are too expensive to create
and maintain). Third, information for different rankings to
be combined (i.e., for different attributes) could appear in
different databases (in a distributed database scenario) and
unified indexes may not exist for them. Solutions for top-k
queries [7], [2], [8], [9] focus on the efficient merging of
object rankings that may arrive from different (distributed)
sources. Their motivation is to minimize the number of
accesses to the input rankings until the objects with the top-k aggregate scores have been identified. To achieve this,
upper and lower bounds for the objects seen so far are
maintained while scanning the sorted lists.
In the following sections, we first review the R-tree, which is the most popular spatial access method, and the NN search algorithm of [4]. Then, we survey recent research on feature-based spatial queries.
2.1 Spatial Query Evaluation on R-Trees
The most popular spatial access method is the R-tree [3], which indexes minimum bounding rectangles (MBRs) of objects. Fig. 2 shows a set D = {p_1, ..., p_8} of spatial objects (e.g., points) and an R-tree that indexes them. R-trees can efficiently process main spatial query types, including spatial range queries, nearest neighbor queries, and spatial joins. Given a spatial region W, a spatial range query retrieves from D the objects that intersect W. For instance, consider a range query that asks for all objects within the shaded area in Fig. 2. Starting from the root of the tree, the query is processed by recursively following entries having MBRs that intersect the query region. For instance, e_1 does not intersect the query region, thus the subtree pointed by e_1 cannot contain any query result. In contrast, e_2 is followed by the algorithm and the points in the corresponding node are examined recursively to find the query result p_7.

Fig. 1. Examples of top-k spatial preference queries. (a) Range score, ε = 0.2 km. (b) Influence score, ε = 0.2 km.
Fig. 2. Spatial queries on R-trees.
A nearest neighbor query takes as input a query object q and returns the closest object in D to q. For instance, the nearest neighbor of q in Fig. 2 is p_7. Its generalization is the k-NN query, which returns the k closest objects to q, given a positive integer k. NN (and k-NN) queries can be efficiently processed using the best-first (BF) algorithm of [4], provided that D is indexed by an R-tree. A min-heap H, which organizes R-tree entries based on the (minimum) distance of their MBRs to q, is initialized with the root entries. In order to find the NN of q in Fig. 2, BF first inserts into H the entries e_1, e_2, e_3 and their distances to q. Then, the nearest entry e_2 is retrieved from H and the objects p_1, p_7, p_8 are inserted into H. The next nearest entry in H is p_7, which is the nearest neighbor of q. In terms of I/O, the BF algorithm is shown to be no worse than any NN algorithm on the same R-tree [4].
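As an illustration of the BF principle, here is a minimal best-first NN search in Python over a hand-built two-level hierarchy standing in for the R-tree of Fig. 2. The mindist computation and the heap discipline are the essential parts; the node layout and coordinates are our own invention.

import heapq

def mindist(q, mbr):
    # Minimum distance from point q to rectangle mbr = (xlo, ylo, xhi, yhi).
    dx = max(mbr[0] - q[0], q[0] - mbr[2], 0.0)
    dy = max(mbr[1] - q[1], q[1] - mbr[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5

def best_first_nn(q, root_entries):
    """Entry: (mbr, payload); payload is a child entry list, or None for a point."""
    H, tie = [], 0
    for e in root_entries:
        heapq.heappush(H, (mindist(q, e[0]), tie, e)); tie += 1
    while H:
        d, _, (mbr, child) = heapq.heappop(H)
        if child is None:
            return mbr, d        # the first deheaped point is the NN
        for e in child:
            heapq.heappush(H, (mindist(q, e[0]), tie, e)); tie += 1

# Points get degenerate MBRs; e2 groups p1, p7, p8 as in Fig. 2.
p7 = ((4.0, 4.0, 4.0, 4.0), None)
p1 = ((2.0, 6.0, 2.0, 6.0), None)
p8 = ((5.0, 3.0, 5.0, 3.0), None)
e2 = ((2.0, 3.0, 5.0, 6.0), [p1, p7, p8])
e1 = ((7.0, 7.0, 9.0, 9.0), [((8.0, 8.0, 8.0, 8.0), None)])
print(best_first_nn((4.2, 4.1), [e1, e2]))   # p7's MBR and its distance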
The aggregate R-tree (aR-tree) [10] is a variant of the R-tree, where each nonleaf entry augments an aggregate measure for some attribute value (measure) of all points in its subtree. As an example, the tree shown in Fig. 2 can be upgraded to a MAX aR-tree over the point set, if the entries e_1, e_2, e_3 contain the maximum measure values of the sets {p_2, p_3}, {p_1, p_8, p_7}, {p_4, p_5, p_6}, respectively. Assume that the measure values of p_4, p_5, p_6 are 0.2, 0.1, 0.4, respectively. In this case, the aggregate measure augmented in e_3 would be max{0.2, 0.1, 0.4} = 0.4. In this paper, we employ MAX aR-trees for indexing the feature data sets (e.g., restaurants), in order to accelerate the processing of top-k spatial preference queries.
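The augmentation itself is a one-pass bottom-up computation. Below is a minimal sketch over dictionary-based nodes, a stand-in for a real aR-tree; the measures of the second leaf are invented for the example.

def augment_max(node):
    # Store in each nonleaf node the MAX measure of all points below it.
    if node["leaf"]:
        node["agg"] = max(node["measures"])
    else:
        node["agg"] = max(augment_max(child) for child in node["children"])
    return node["agg"]

e3 = {"leaf": True, "measures": [0.2, 0.1, 0.4]}   # p4, p5, p6 from the example
e2 = {"leaf": True, "measures": [0.7, 0.5, 0.9]}   # invented measures
root = {"leaf": False, "children": [e2, e3]}
augment_max(root)
print(e3["agg"], root["agg"])   # 0.4 0.9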
Given a feature data set F and a multidimensional region R, the range top-k query selects the tuples (from F) within the region R and returns only those with the k highest qualities. Hong et al. [11] indexed the data set by a MAX aR-tree and developed an efficient tree traversal algorithm to answer the query. Instead of finding the best k qualities from F in a specified region, our (range score) query considers multiple spatial regions based on the points from the object data set D, and attempts to find out the best k regions (based on scores derived from multiple feature data sets F_c).
2.2 Feature-Based Spatial Queries
Xia et al. [12] solved the problem of finding top-k sites (e.g., restaurants) based on their influence on feature points (e.g., residential buildings). As an example, Fig. 3a shows a set of sites (white points) and a set of features (black points with weights), such that each line links a feature point to its nearest site. The influence of a site p_i is defined by the sum of weights of the feature points having p_i as their closest site. For instance, the score of p_1 is 0.9 + 0.5 = 1.4. Similarly, the scores of p_2 and p_3 are 1.5 and 1.2, respectively. Hence, p_2 is returned as the top-1 influential site.
Related to the top-k influential sites query are the optimal
location queries studied in [13], [14]. The goal is to find the
location in space (not chosen from a specific set of sites) that
minimizes an objective function. In Figs. 3b and 3c, feature
points and existing sites are shown as black and gray
points, respectively. Assume that all feature points have the
same quality. The maximum influence optimal location
query [13] finds the location (to insert to the existing set of
sites) with the maximum influence (as defined in [12]),
whereas the minimum distance optimal location query [14]
searches for the location that minimizes the average
distance from each feature point to its nearest site. The
optimal locations for both queries are marked as white
points in Figs. 3b and 3c, respectively.
The techniques proposed in [12], [13], [14] are specific to the particular query types described above and cannot be extended to our top-k spatial preference queries. Also, they deal with a single feature data set, whereas our queries consider multiple feature data sets.
Recently, novel spatial queries and joins [15], [16], [17],
[18] have been proposed for various spatial decision
support problems. However, they do not utilize nonspatial
qualities of facilities to define the score of a location. Finally,
[19], [20] studied the evaluation of textual location-based
queries on spatial objects.
3 SPATIAL PREFERENCE QUERIES
Section 3.1 formally defines the top-k spatial preference query problem and describes the index structures for the data sets. Section 3.2 studies two baseline algorithms for processing the query. Section 3.3 presents an efficient branch-and-bound algorithm for the query, and its further optimization is proposed in Section 3.4. Section 3.5 develops a specialized spatial join algorithm for evaluating the query. Finally, Section 3.6 extends the above algorithms for answering top-k spatial preference queries involving other aggregate functions.
3.1 Definitions and Index Structures
Let F_c be a feature data set, in which each feature object s ∈ F_c is associated with a quality ω(s) and a spatial point. We assume that the domain of ω(s) is the interval [0, 1]. As an example, the quality ω(s) of a restaurant s can be obtained from a ratings provider.
Let D be an object data set, where each object p ∈ D is a spatial point. In other words, D is the set of interesting points (e.g., hotel locations) considered by the user.
Given an object data set D and m feature data sets F_1, F_2, ..., F_m, the top-k spatial preference query retrieves the k points in D with the highest score. Here, the score of an object point p ∈ D is defined as
\[ \tau^{\theta}(p) = \operatorname{AGG}\{\, \tau_c^{\theta}(p) \mid c \in [1, m] \,\}, \tag{1} \]
where AGG is an aggregate function and τ_c^θ(p) is the (cth) component score of p with respect to the neighborhood condition θ and the (cth) feature data set F_c.

Fig. 3. Influential sites and optimal location queries. (a) Top-k influential. (b) Max-influence. (c) Min-distance.
We proceed to elaborate on the aggregate function and the component score function. Typical examples of the aggregate function AGG are SUM, MIN, and MAX. We first focus on the case where AGG is SUM. In Section 3.6, we will discuss the generic scenario where AGG is an arbitrary monotone aggregate function.
An intuitive choice for the component score function τ_c^θ(p) is the range score τ_c^rng(p), taken as the maximum quality ω(s) of points s ∈ F_c that are within a given parameter distance ε from p, or 0 if no such point exists:
\[ \tau_c^{rng}(p) = \max\big(\{\, \omega(s) \mid s \in F_c \wedge \mathit{dist}(p, s) \le \varepsilon \,\} \cup \{0\}\big). \tag{2} \]
In our problem setting, the user requires that an object p ∈ D must not be considered as a result if there exists some F_c such that the neighborhood region of p does not contain any feature point of F_c.
There are other choices for the component score function τ_c^θ(p). One example is the influence score function τ_c^inf(p), which will be considered in Section 4. Another example is the NN score τ_c^nn(p) that has been studied in our previous work [1], so it will not be examined again in this paper. The condition θ is dropped whenever the context is clear.
In this paper, we assume that the object data set D is indexed by an R-tree and each feature data set F_c is indexed by a MAX aR-tree, where each nonleaf entry augments the maximum quality (of features) in its subtree. Nevertheless, our solutions are directly applicable to data sets that are indexed by other hierarchical spatial indexes (e.g., point quad-trees). The rationale for indexing different feature data sets by separate aR-trees is that: 1) a user queries for only a few features (e.g., restaurants and cafes) out of all possible features (e.g., restaurants, cafes, hospital, market, etc.), and 2) different users may consider different subsets of features.
Based on the above indexing scheme, we develop various algorithms for processing top-k spatial preference queries. Table 1 lists the notations to be used throughout the paper.
3.2 Probing Algorithms
We first introduce a brute-force solution that computes the score of every point p ∈ D in order to obtain the query results. Then, we propose a group evaluation technique that computes the scores of multiple points concurrently.

3.2.1 Simple Probing Algorithm
According to Section 3.1, the quality ω(s) of any feature point s falls into the interval [0, 1]. Thus, for a point p ∈ D whose component scores are not all known, its upper bound score τ_+(p) is defined as
\[ \tau_+(p) = \sum_{c=1}^{m} \begin{cases} \tau_c(p), & \text{if } \tau_c(p) \text{ is known}, \\ 1, & \text{otherwise}. \end{cases} \tag{3} \]
It is guaranteed that the bound τ_+(p) is greater than or equal to the actual score τ(p). The simple probing (SP) algorithm computes the score of each object p ∈ D one component at a time; as soon as the bound τ_+(p) drops to γ or below, where γ denotes the kth highest score found so far, the remaining component scores of p need not be computed. Otherwise, the exact score of p is obtained and the top-k result set W_k (and γ) are updated by p.
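The following Python sketch captures the incremental idea of SP under AGG = SUM. It is our own simplification: component scores come from an arbitrary callback instead of aR-tree traversals, and all names are ours rather than the paper's.

import heapq

def sp_topk(objects, m, component_score, k):
    """objects: list of (id, point); component_score(p, c) -> tau_c(p) in [0, 1]."""
    Wk = []          # min-heap keeping the k best (score, id) pairs
    gamma = 0.0      # kth best score found so far
    for oid, p in objects:
        comps = []
        for c in range(m):
            comps.append(component_score(p, c))
            tau_plus = sum(comps) + (m - len(comps))   # eq. (3): unknowns count as 1
            if len(Wk) == k and tau_plus <= gamma:
                break                                  # prune p, skip remaining components
        else:
            score = sum(comps)
            if len(Wk) < k:
                heapq.heappush(Wk, (score, oid))
            elif score > Wk[0][0]:
                heapq.heapreplace(Wk, (score, oid))
            gamma = Wk[0][0] if len(Wk) == k else 0.0
    return sorted(Wk, reverse=True)

# Example with m = 2 and a made-up component function:
objs = [("p1", (0.2, 0.2)), ("p2", (0.6, 0.7))]
print(sp_topk(objs, 2, lambda p, c: min(1.0, p[0] + p[1] + 0.1 * c), 1))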
3.2.2 Group Probing Algorithm
Due to separate score computations for different objects, SP is inefficient for large object data sets. In view of this, we propose the group probing (GP) algorithm, a variant of SP, that reduces I/O cost by computing the scores of objects in the same leaf node of the R-tree concurrently. In GP, when a leaf node is visited, its points are first stored in a set V and then their component scores are computed concurrently in a single traversal of the F_c tree.
We now introduce some distance notations for MBRs. Given a point p and an MBR e, the value mindist(p, e) (maxdist(p, e)) [4] denotes the minimum (maximum) possible distance between p and any point in e. Similarly, given two MBRs e_a and e_b, the value mindist(e_a, e_b) (maxdist(e_a, e_b)) denotes the minimum (maximum) possible distance between any point in e_a and any point in e_b.

TABLE 1. List of Notations.
Algorithm 2 shows the procedure for computing the cth component score for a group of points. Consider a subset V of D for which we want to compute their τ_c^rng(p) scores over the feature tree F_c. Initially, the procedure is called with N being the root node of F_c. If e is a nonleaf entry and its mindist from some point p ∈ V is within the range ε, then the procedure is applied recursively on the child node of e, since the subtree of F_c rooted at e may contribute to the component score of p. In case e is a leaf entry (i.e., a feature point), the scores of points in V are updated if they are within distance ε from e.

Algorithm 2. Group Range Score Algorithm
algorithm Group_Range(Node N, Set V, Value c, Value ε)
1: for each entry e ∈ N do
2:   if N is nonleaf then
3:     if ∃ p ∈ V, mindist(p, e) ≤ ε then
4:       read the child node N' pointed by e;
5:       Group_Range(N', V, c, ε);
6:   else
7:     for each p ∈ V such that dist(p, e) ≤ ε do
8:       τ_c(p) := max{τ_c(p), ω(e)};
3.3 Branch-and-Bound Algorithm
GP is still expensive as it examines all objects in D and computes their component scores. We now propose an algorithm that can significantly reduce the number of objects to be examined. The key idea is to compute, for nonleaf entries e in the object tree D, an upper bound T(e) of the score τ(p) for any point p in the subtree of e. If T(e) ≤ γ, then we need not access the subtree of e, thus we can save numerous score computations.
Algorithm 3 is a pseudocode of our BB algorithm, based on this idea. BB is called with N being the root node of D. If N is a nonleaf node, Lines 3-5 compute the scores T(e) of its nonleaf entries e concurrently. Recall that T(e) is an upper bound score for any point in the subtree of e. The techniques for computing T(e) will be discussed shortly. Like (3), with the component scores T_c(e) known so far, we can derive T_+(e), an upper bound of T(e). If T_+(e) ≤ γ, then the subtree of e cannot contain better results than those in W_k, and e is removed from V. In order to obtain points with high scores early, we sort the entries in descending order of T(e) before invoking the above procedure recursively on the child nodes pointed by the entries in V. If N is a leaf node, we compute the scores of all points of N concurrently and then update the set W_k of the top-k results. Since both W_k and γ are global variables, their values are updated during the recursive calls of BB.
Algorithm 3. Branch-and-Bound Algorithm
W_k : new min-heap of size k (initially empty);
γ := 0;  ▷ kth score in W_k
algorithm BB(Node N)
1: V := {e | e ∈ N};
2: if N is nonleaf then
3:   for c := 1 to m do
4:     compute T_c(e) for all e ∈ V concurrently;
5:     remove entries e in V such that T_+(e) ≤ γ;
6:   sort entries e ∈ V in descending order of T(e);
7:   for each entry e ∈ V such that T(e) > γ do
8:     read the child node N' pointed by e;
9:     BB(N');
10: else
11:   for c := 1 to m do
12:     compute τ_c(e) for all e ∈ V concurrently;
13:     remove entries e in V such that τ_+(e) ≤ γ;
14:   update W_k (and γ) by entries in V;
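The pruning skeleton of BB is easy to isolate. The sketch below is a simplification in which a callback supplies the upper bound T(e) of each subtree, so the interplay between γ, the descending-bound ordering, and the recursion is visible without any R-tree machinery; the tree and scores are invented.

import heapq

def bb(node, k, Wk, score_fn, bound_fn):
    # node: {"leaf": bool, "entries": [...]}; Wk: min-heap of the k best scores.
    if node["leaf"]:
        for obj in node["entries"]:
            s = score_fn(obj)
            if len(Wk) < k:
                heapq.heappush(Wk, s)
            elif s > Wk[0]:
                heapq.heapreplace(Wk, s)
    else:
        # Visit children in descending order of their upper bounds (Line 6).
        for child in sorted(node["entries"], key=bound_fn, reverse=True):
            gamma = Wk[0] if len(Wk) == k else 0.0   # current kth best score
            if bound_fn(child) > gamma:              # prune hopeless subtrees (Line 7)
                bb(child, k, Wk, score_fn, bound_fn)

leaf1 = {"leaf": True, "entries": [1.5, 1.1]}
leaf2 = {"leaf": True, "entries": [1.4, 0.3]}
root = {"leaf": False, "entries": [leaf1, leaf2]}
Wk = []
bb(root, 2, Wk, score_fn=lambda s: s, bound_fn=lambda n: max(n["entries"]))
print(sorted(Wk, reverse=True))    # [1.5, 1.4]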
3.3.1 Upper Bound Score Computation
It remains to clarify how the (upper bound) scores T_c(e) of nonleaf entries (within the same node N) can be computed concurrently (at Line 4). Our goal is to compute these upper bound scores such that
. the bounds are computed with low I/O cost, and
. the bounds are reasonably tight, in order to facilitate effective pruning.
To achieve this, we utilize only level-1 entries (i.e., lowest level nonleaf entries) in F_c for deriving upper bound scores because: 1) there are much fewer level-1 entries than leaf entries (i.e., points), and 2) high-level entries in F_c cannot provide tight bounds. In our experimental study, we will also verify the effectiveness and the cost of using level-1 entries for upper bound score computation.
Algorithm 2 can be modified for the above upper bound computation task (where the input V corresponds to a set of nonleaf entries), after changing Line 2 to check whether the child nodes of N are above the leaf level.
The following example illustrates how upper bound range scores are derived. In Fig. 4a, v_1 and v_2 are nonleaf entries in the object tree D and the others are level-1 entries in the feature tree F_c. For the entry v_1, we first define its Minkowski region [21] (i.e., the gray region around v_1), the area whose mindist from v_1 is within ε. Observe that only the entries e_i intersecting the Minkowski region of v_1 can contribute to the score of some point in v_1. Thus, the upper bound score T_c(v_1) is simply the maximum quality of the entries e_1, e_5, e_6, e_7, i.e., 0.9. Similarly, T_c(v_2) is computed as the maximum quality of the entries e_2, e_3, e_4, e_8, i.e., 0.7. Assuming that v_1 and v_2 are entries in the same tree node of D, their upper bounds are computed concurrently to reduce I/O cost.

Fig. 4. Examples of deriving scores. (a) Upper bound scores. (b) Optimized computation.
3.4 Optimized Branch-and-Bound Algorithm
This section develops a more efficient score computation technique to reduce the cost of the BB algorithm.

3.4.1 Motivation
Recall that Lines 11-13 of the BB algorithm are used to compute the scores of object points (i.e., leaf entries of the R-tree on D). A leaf entry e is pruned if its upper bound score τ_+(e) does not exceed the best score γ found so far. As an example, consider computing the score τ(p_1) of the point p_1 in Fig. 4b. The entry e_{11} is a nonleaf entry from the feature tree F_1. Its augmented quality value is ω(e_{11}) = 0.8. The entry points to a leaf node containing two feature points, whose quality values are 0.6 and 0.8, respectively. Similarly, e_{12} is a nonleaf entry from the tree F_2, and it points to a leaf node of feature points.
Suppose that the best score found so far in BB is γ = 1.4 (not shown in the figure). We need to check whether the score of p_1 can be higher than γ. For this, we compute the first component score τ_1(p_1) = 0.6 by accessing the child node of e_{11}. The upper bound score of p_1 can now be tightened: any feature point of F_2 relevant to p_1 lies in the subtree of e_{12}, so its quality is capped by the augmented value ω(e_{12}). If τ_1(p_1) + ω(e_{12}) ≤ γ, then p_1 can be discarded without accessing the child node of e_{12} at all. In general, let μ_c denote an upper bound on the quality of any unvisited feature point of F_c that can lie within distance ε from p; in Algorithm 4 below, μ_c is maintained as the maximum key among the pending F_c entries in the heap H_c. We define the upper bound score τ_+(p) as
\[ \tau_+(p) = \sum_{c=1}^{m} \max\big(\{\, \omega(s) \mid s \in F_c,\ \mathit{dist}(p, s) \le \varepsilon,\ \omega(s) \ge \mu_c \,\} \cup \{\mu_c\}\big). \tag{4} \]
In the max function, the first set denotes the upper bound quality of any visited feature point within distance ε from p. The following lemma shows that the value τ_+(p) is always greater than or equal to the actual score τ(p).
Lemma 1. It holds that τ_+(p) ≥ τ(p).

3.4.2 Optimized Score Computation
According to (4), the bound τ_+(p) tightens as more feature entries are visited. Algorithm 4 exploits this to compute the exact scores of a set V of object points while discarding, as early as possible, the points whose bound τ_+(p) drops to γ or below (Lines 13-15).
Next, we access the child node pointed to by e, and examine each entry e' in the node (Lines 16-17). A nonleaf entry e' is inserted into the heap H_c if its minimum distance from some p ∈ V is within ε (Lines 18-20), whereas a leaf entry e' is used to update the component score τ_c(p) of any p ∈ V within distance ε from e' (Lines 22-23). At Line 24, we apply a round-robin strategy to find the next value of c such that the heap H_c is not empty. The loop at Line 8 repeats while V is not empty and there exists a nonempty heap H_c. At the end, the algorithm derives the exact scores for the remaining points of V.
3.4.3 The BB* Algorithm
Based on the above, we extend BB (Algorithm 3) to an optimized BB* algorithm as follows: First, Lines 11-13 of BB are replaced by a call to Algorithm 4, for computing the exact scores of the object points in the set V. Second, Lines 3-5 of BB are replaced by a call to a modified Algorithm 4, for deriving the upper bound scores of the nonleaf entries (in V). Such a modified Algorithm 4 is obtained after replacing Line 18 by a check of whether the node CN is a nonleaf node above level-1.
3.5 Feature Join Algorithm
An alternative method for evaluating a top-k spatial preference query is to perform a multiway spatial join [23] on the feature trees F_1, F_2, ..., F_m to obtain combinations of feature points which can be in the neighborhood of some object from D. Spatial regions which correspond to combinations of high scores are then examined, in order to find data objects in D having the corresponding feature combination in their neighborhood. In this section, we first introduce the concept of a combination, then discuss the conditions for a combination to be pruned, and finally elaborate the algorithm used to progressively identify the combinations that correspond to query results.
A tuple ⟨f_1, f_2, ..., f_m⟩ is a combination if, for any c ∈ [1, m], f_c is an entry (either leaf or nonleaf) in the feature tree F_c. The score of the combination is defined by
\[ \tau(\langle f_1, f_2, \ldots, f_m \rangle) = \sum_{c=1}^{m} \omega(f_c). \tag{5} \]
For a nonleaf entry f_c, ω(f_c) is the MAX of all feature qualities in its subtree (stored with f_c, since F_c is an aR-tree). A combination disqualifies the query if
\[ \exists\, i \ne j,\ i, j \in [1, m]:\ \mathit{mindist}(f_i, f_j) > 2\varepsilon. \tag{6} \]
When such a condition holds, it is impossible to have a point in D whose mindist from f_i and from f_j are both within ε. The above validity check acts as a multiway join condition that significantly reduces the number of combinations to be examined.
Figs. 5a and 5b illustrate the condition for a nonleaf combination and a leaf combination, respectively, to be a candidate combination for the query.
Algorithm 5 is a pseudocode of our feature join algorithm. It employs a max-heap H for managing combinations of feature entries in descending order of their combination scores. The score of a combination ⟨f_1, f_2, ..., f_m⟩ as defined in (5) is an upper bound on the scores of all combinations ⟨s_1, s_2, ..., s_m⟩ of feature points, such that s_c is located in the subtree of f_c for each c ∈ [1, m]. Initially, the combination with the root pointers of all feature trees is enheaped. We progressively deheap the combination with the largest score. If all its entries point to leaf nodes, then we load these nodes L_1, ..., L_m and call Find_Result to traverse the object R-tree D and find potential results. Find_Result is a variant of the BB algorithm, with the following difference: L_1, ..., L_m are viewed as m tiny feature trees (each with one node) and accesses to them incur no extra I/O cost.
Algorithm 5. Feature Join Algorithm
W_k : new min-heap of size k (initially empty);
γ := 0;  ▷ kth score in W_k
algorithm FJ(Tree D, Trees F_1, F_2, ..., F_m)
1: H := new max-heap (combination score as the key);
2: insert ⟨F_1.root, F_2.root, ..., F_m.root⟩ into H;
3: while H is not empty do
4:   deheap ⟨f_1, f_2, ..., f_m⟩ from H;
5:   if ∀ c ∈ [1, m], f_c points to a leaf node then
6:     for c := 1 to m do
7:       read the child node L_c pointed by f_c;
8:     Find_Result(D.root, L_1, ..., L_m);
9:   else
10:    f_c := the highest level entry among f_1, f_2, ..., f_m;
11:    read the child node N_c pointed by f_c;
12:    for each entry e_c ∈ N_c do
13:      insert ⟨f_1, f_2, ..., e_c, ..., f_m⟩ into H if its score is greater than γ and it qualifies the query;
algorithm Find_Result(Node N, Nodes L_1, ..., L_m)
1: for each entry e ∈ N do
2:   if N is nonleaf then
3:     compute T(e) by the entries in L_1, ..., L_m;
4:     if T(e) > γ then
5:       read the child node N' pointed by e;
6:       Find_Result(N', L_1, ..., L_m);
7:   else
8:     compute τ(e) by the entries in L_1, ..., L_m;
9:     update W_k (and γ) by e (when necessary);

Fig. 5. Qualified combinations for the join. (a) Nonleaf combination. (b) Leaf combination.
In case not all entries of the deheaped combination point to leaf nodes (Line 9 of FJ), we select the one at the highest level, access its child node N_c, and then form new combinations with the entries in N_c. A new combination is inserted into H for further processing if its score is higher than γ and it qualifies the query. The loop (at Line 3) continues until H becomes empty.
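The combination machinery of FJ can be illustrated in a few lines. The sketch below is a toy of our own: entries are flat (MBR, quality) pairs rather than aR-tree nodes, and combinations are enumerated exhaustively instead of being refined level by level, but it exercises the score (5), the disqualification test (6), and the max-heap ordering.

import heapq, itertools

def mindist(mbr_a, mbr_b):
    # mbr = (xlo, ylo, xhi, yhi); 0 if the rectangles overlap.
    dx = max(mbr_b[0] - mbr_a[2], mbr_a[0] - mbr_b[2], 0.0)
    dy = max(mbr_b[1] - mbr_a[3], mbr_a[1] - mbr_b[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5

def qualifies(combo, eps):
    # eq. (6): drop the combination if two entries lie more than 2*eps apart.
    return all(mindist(a[0], b[0]) <= 2 * eps
               for a, b in itertools.combinations(combo, 2))

# Entries per feature set: (MBR, max quality in subtree); values invented.
F1 = [((0.0, 0.0, 1.0, 1.0), 0.9), ((5.0, 5.0, 6.0, 6.0), 1.0)]
F2 = [((0.5, 0.5, 1.5, 1.5), 0.6), ((9.0, 9.0, 10.0, 10.0), 0.8)]
H = []
for combo in itertools.product(F1, F2):
    if qualifies(combo, eps=0.4):
        score = sum(q for _, q in combo)       # eq. (5)
        heapq.heappush(H, (-score, combo))     # max-heap via negated keys
while H:
    score, combo = heapq.heappop(H)
    print(-score, [mbr for mbr, _ in combo])   # only one combination survives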
3.6 Extension to Monotonic Aggregate Functions
We now extend our proposed solutions to process the top-k spatial preference query defined by any monotonic aggregate function AGG. Examples of AGG include (but are not limited to) the MIN and MAX functions.

3.6.1 Adaptation of Incremental Computation
Recall that the incremental computation technique is applied by the algorithms SP, GP, and BB for reducing I/O cost. Specifically, even if some component score τ_c(p) of a point p has not been computed yet, an upper bound score τ_+(p) of p can still be derived; whenever τ_+(p) drops below the best score found so far γ, the point p can be discarded immediately without needing to compute the unknown component scores of p.
In fact, the algorithms SP, GP, and BB are directly applicable to any monotonic aggregate function AGG because (3) can be generalized for AGG. Now, the upper bound score τ_+(p) of p is defined as
\[ \tau_+(p) = \operatorname{AGG}_{c=1}^{m} \begin{cases} \tau_c(p), & \text{if } \tau_c(p) \text{ is known}, \\ 1, & \text{otherwise}. \end{cases} \tag{7} \]
Due to the monotonicity property of AGG, the bound τ_+(p) is guaranteed to be greater than or equal to the actual score τ(p).
3.6.2 Adaptation of Upper Bound Computation
The BB* and FJ algorithms compute the upper bound score of a nonleaf entry of the object tree D, or of a combination of entries from the feature trees, by summing its upper bound component scores. Both BB* and FJ are applicable to any monotonic aggregate function AGG, with only the slight modifications discussed below. For BB*, we replace the summation operator by AGG in (4), and at Lines 14 and 26 of Algorithm 4. For FJ, we replace the summation by AGG in (5).
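A three-line sketch of the generalized bound (7) also explains an effect observed later in the experiments: with one of two component scores known, the bound is tight for MIN, partially tight for SUM, and loose for MAX.

def upper_bound(known, m, agg):
    # eq. (7): every unknown component score is replaced by its maximum value 1.
    return agg(known + [1.0] * (m - len(known)))

print(upper_bound([0.6], 2, sum),   # 1.6
      upper_bound([0.6], 2, min),   # 0.6  (tight)
      upper_bound([0.6], 2, max))   # 1.0  (loose)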
4 INFLUENCE SCORE
This section first studies the influence score function that
combines both the qualities and relative locations of feature
points. It then presents the adaptations of our solutions in
Section 3 for the influence score function. Finally, we
discuss how our solutions can be used for other types of
influence score functions.
4.1 Score Definition
The range score has a drawback: the parameter ε is not easy to set. Consider, for instance, the example of the range score τ^rng in Fig. 6a, where the white points are object points in D, and the gray points and black points are feature points in the feature sets F_1 and F_2, respectively. If ε is set to 0.2 (shown by circles), then the object p_2 has the score τ^rng(p_2) = 0.9 + 0.1 = 1.0 and it cannot be the best object (as τ^rng(p_1) = 1.2). This happens because a high-quality black feature is barely outside the ε-range of p_2. Had ε been slightly larger, that black feature (with weight 0.6) would contribute to the score of p_2, making it the best object.
In the field of statistics, the Gaussian density function [24] has been used to estimate the density in the space from a set F of points. The density at location p is estimated as G(p) = Σ_{f ∈ F} exp(−dist²(p, f) / (2σ²)), where σ is a parameter. Its advantage is that the value G(p) is not sensitive to a slight change in σ. G(p) is mainly contributed by the points (of F) close to p and is weakly affected by the points far away.
Inspired by the above function, we devise a score function that is not too sensitive to the range parameter ε. In addition, the users in our application usually prefer a high-quality restaurant (i.e., a feature point) rather than a large number of low-quality restaurants. Therefore, we use the maximum operator rather than the summation in G(p). Specifically, we define the influence score of an object point p with respect to the feature set F_c as
\[ \tau_c^{inf}(p) = \max\{\, \omega(s) \cdot 2^{-\mathit{dist}(p, s)/\varepsilon} \mid s \in F_c \,\}, \tag{8} \]
where ω(s) is the quality of s, ε is a user-specified range, and dist(p, s) is the distance between p and s.
The overall score τ^inf(p) of p is then defined as
\[ \tau^{inf}(p) = \operatorname{AGG}\{\, \tau_c^{inf}(p) \mid c \in [1, m] \,\}, \tag{9} \]
where AGG is a monotone aggregate operator and m is the number of feature data sets. Again, we focus on the case where AGG is the SUM function.
Let us compute the influence score τ^inf for the points in Fig. 6a, assuming ε = 0.2. From Fig. 6a, we obtain
\[ \tau^{inf}(p_1) = \max\{0.7 \cdot 2^{-\frac{0.18}{0.20}},\ 0.9 \cdot 2^{-\frac{0.50}{0.20}}\} + \max\{0.5 \cdot 2^{-\frac{0.18}{0.20}},\ 0.1 \cdot 2^{-\frac{0.60}{0.20}},\ 0.6 \cdot 2^{-\frac{0.80}{0.20}}\} = 0.643 \]
and
\[ \tau^{inf}(p_2) = \max\{0.9 \cdot 2^{-\frac{0.18}{0.20}},\ 0.7 \cdot 2^{-\frac{0.65}{0.20}}\} + \max\{0.1 \cdot 2^{-\frac{0.19}{0.20}},\ 0.6 \cdot 2^{-\frac{0.22}{0.20}},\ 0.5 \cdot 2^{-\frac{0.70}{0.20}}\} = 0.762. \]

Fig. 6. Example of influence scores (ε = 0.2). (a) Exact score. (b) Upper bound property.

The top-1 point is p_2, implying that the influence score can capture feature points outside the range ε = 0.2. In fact, the influence score function possesses two nice properties. First, a feature point s that is barely outside the range ε (from the object point p) still has the potential to contribute to the score, provided that its quality ω(s) is sufficiently high. Second, the distance dist(p, s) has an exponentially decaying effect on the score, meaning that feature points nearer to p contribute higher scores.
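The numbers above are easy to verify. In the snippet below, the distances are hard-coded from Fig. 6a rather than derived from coordinates:

def inf_component(dists_and_quals, eps):
    # tau_inf_c(p) = max over features of quality * 2^(-dist/eps), eq. (8).
    return max(w * 2 ** (-d / eps) for d, w in dists_and_quals)

eps = 0.2
p1 = inf_component([(0.18, 0.7), (0.50, 0.9)], eps) + \
     inf_component([(0.18, 0.5), (0.60, 0.1), (0.80, 0.6)], eps)
p2 = inf_component([(0.18, 0.9), (0.65, 0.7)], eps) + \
     inf_component([(0.19, 0.1), (0.22, 0.6), (0.70, 0.5)], eps)
print(round(p1, 3), round(p2, 3))   # 0.643 0.762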
4.2 Query Processing for SP, GP, BB, and BB*
We now examine the extensions of the SP, GP, BB, and BB* algorithms for top-k spatial preference queries defined by the influence score in (8).

4.2.1 Incremental Computation Technique
Observe that the upper bound of τ_c^inf(p) is 1. Therefore, (3) still holds for the influence score, and the incremental computation technique (see Section 3.2) can still be applied in SP, GP, and BB.
4.2.2 Exact Score Computation for a Single Object
For the SP algorithm, we elaborate on how to compute the score τ_c^inf(p) (see (8)) of an object p ∈ D. This is challenging because some feature s ∈ F_c outside the ε-range of p may contribute to the score. Unlike the computation of the range score, we can no longer use the ε-range to restrict the search space.
Given an object point p and an entry e from the feature tree of F_c, we define the upper bound function
\[ \omega^{inf}(e, p) = \omega(e) \cdot 2^{-\mathit{mindist}(p, e)/\varepsilon}. \tag{10} \]
In case e is a leaf entry (i.e., a feature point s), we have ω^inf(s, p) = ω(s) · 2^{-dist(p, s)/ε}. The following lemma shows that the value ω^inf(e, p) is an upper bound of ω^inf(e', p) for any entry e' in the subtree of e.
Lemma 2. Let e and e' be entries from the feature tree F_c such that e' is in the subtree of e. It holds that ω^inf(e, p) ≥ ω^inf(e', p), for any object point p ∈ D.
Proof. Let p be any object point p ∈ D. Since e' falls into the subtree of e, we have mindist(p, e) ≤ mindist(p, e'). As F_c is a MAX aR-tree, we have ω(e) ≥ ω(e'). Thus, we have ω^inf(e, p) ≥ ω^inf(e', p). □
As an example, Fig. 6b shows an object point p and three entries e_1, e_2, e_3 from the same feature tree. Note that e_2 and e_3 are in the subtree of e_1. The dotted lines indicate the minimum distances from p to e_1 and e_3, respectively. Thus, we have ω^inf(e_1, p) = 0.8 · 2^{-0.2/0.2} = 0.4 and ω^inf(e_3, p) = 0.7 · 2^{-0.3/0.2} = 0.247. Clearly, ω^inf(e_1, p) is larger than ω^inf(e_3, p).
By using Lemma 2, we apply the best-first approach to compute the exact component score τ_c(p) of the point p. Algorithm 6 employs a max-heap H in order to visit the entries of the tree in descending order of their ω^inf values. We first insert the root of the tree F_c into H, and initialize τ_c(p) to 0. The loop at Line 4 continues as long as H is not empty. At Line 5, we deheap an entry e from H. If the value ω^inf(e, p) is above the current τ_c(p), then there is potential to update τ_c(p) by using some point in the subtree of e. In that case, we read the child node pointed to by e, and examine each entry e' in that node (Lines 7-8). If e' is a nonleaf entry, it is inserted into H provided that its ω^inf(e', p) value is above τ_c(p). Otherwise, it is used to update τ_c(p).

Algorithm 6. Object Influence Score Algorithm
algorithm Object_Influence(Point p, Value c, Value ε)
1: H := new max-heap (with ω^inf value as key);
2: insert ⟨F_c.root, 1.0⟩ into H;
3: τ_c(p) := 0;
4: while H is not empty do
5:   deheap an entry e from H;
6:   if ω^inf(e, p) > τ_c(p) then
7:     read the child node CN pointed to by e;
8:     for each entry e' of CN do
9:       if CN is a nonleaf node then
10:        if ω^inf(e', p) > τ_c(p) then
11:          insert ⟨e', ω^inf(e', p)⟩ into H;
12:      else  ▷ update component score
13:        τ_c(p) := max{τ_c(p), ω^inf(e', p)};
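Below is a compact Python rendering of the same best-first idea, over a hand-built fragment of a MAX aR-tree; the MBRs, qualities, and query point are invented, and a real tree would have larger fan-outs.

import heapq

def mindist(p, mbr):
    # mbr = (xlo, ylo, xhi, yhi)
    dx = max(mbr[0] - p[0], p[0] - mbr[2], 0.0)
    dy = max(mbr[1] - p[1], p[1] - mbr[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5

def w_inf(q, d, eps):                 # eq. (10)
    return q * 2 ** (-d / eps)

def object_influence(p, root, eps):
    """root: list of entries; a nonleaf entry is (mbr, q, child_entries),
    a feature point is (mbr, q, None) with a degenerate MBR."""
    tau, H, tie = 0.0, [], 0
    for e in root:
        heapq.heappush(H, (-w_inf(e[1], mindist(p, e[0]), eps), tie, e)); tie += 1
    while H:
        key, _, (mbr, q, child) = heapq.heappop(H)
        if -key <= tau:
            break                     # by Lemma 2, no remaining entry can raise tau
        if child is None:
            tau = max(tau, -key)      # exact contribution of a feature point
        else:
            for e in child:
                b = w_inf(e[1], mindist(p, e[0]), eps)
                if b > tau:
                    heapq.heappush(H, (-b, tie, e)); tie += 1
    return tau

# Toy feature tree echoing Fig. 6b: e1 bounds two points of quality 0.8, 0.7.
s2 = ((0.30, 0.30, 0.30, 0.30), 0.8, None)
s3 = ((0.40, 0.40, 0.40, 0.40), 0.7, None)
e1 = ((0.30, 0.30, 0.40, 0.40), 0.8, [s2, s3])
print(round(object_influence((0.10, 0.30), [e1], eps=0.2), 3))   # 0.4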
4.2.3 Group Computation and Upper Bound Computation
Recall that, for the case of range scores, both the GP and BB algorithms apply the group computation technique (Algorithm 2) for concurrently computing the component score τ_c(p) of every object point p in a given set V. Now, Algorithm 6 can be modified as follows to support the concurrent computation of influence scores. First, the parameter p is replaced by a set V of objects. Second, we initialize the value τ_c(p) of each object p ∈ V at Line 3 and perform the score update for each p ∈ V at Line 13. Third, the conditions at Lines 6 and 10 are checked on whether they are satisfied by some object p ∈ V.
In addition, the BB algorithm (see Algorithm 3) needs to compute the upper bound component scores T_c(e) of all nonleaf entries in the current node simultaneously. Again, Algorithm 6 can be modified for this purpose.
4.2.4 Optimized Computation of Scores in BB*
Given an entry e (from a feature tree), we define the upper bound score of e using a set V of points as
\[ \omega^{inf}(e, V) = \max_{p \in V} \omega^{inf}(e, p). \tag{11} \]
The BB* algorithm applies Algorithm 4 to compute the range scores of a set V of object points. With (11), we can modify Algorithm 4 to compute the influence score, with the following changes. First, the heap H_c (at Line 2) is used to organize its entries e in descending order of the key ω^inf(e, V), and the value ω(e) (at Line 10) is replaced by ω^inf(e, V). Second, the restrictions based on the ε-range (at Lines 11-12, 19, and 22) are removed. Third, when the component scores are updated for a leaf entry, the quality value of the entry is replaced by its ω^inf value.

5 EXPERIMENTAL EVALUATION
5.1 Experimental Setting
In the synthetic data sets, for each feature data set F_c, an anchor point s* is selected such that its neighborhood region contains a high number of points. Let dist_min (dist_max) be the minimum (maximum) distance of a point in F_c from the anchor s*. Then, the quality of a feature point s is generated as
\[ \omega(s) = \left( \frac{\mathit{dist}_{max} - \mathit{dist}(s, s^{*})}{\mathit{dist}_{max} - \mathit{dist}_{min}} \right)^{\theta}, \tag{12} \]
where dist(s, s*) is the distance between s and the anchor s*, and θ controls the skewness (default: 1.0) of the quality distribution. In this way, the qualities of points in F_c lie in [0, 1] and the points closer to the anchor have higher qualities. Also, the quality distribution is highly skewed for large values of θ.
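For instance, a minimal generator following (12) could look as follows; the anchor is fixed by hand here, whereas the actual setup picks it inside a dense region:

import math, random

def gen_qualities(points, anchor, theta=1.0):
    d = [math.dist(s, anchor) for s in points]
    dmin, dmax = min(d), max(d)
    # eq. (12): quality in [0, 1], higher near the anchor, skewed by theta.
    return [((dmax - di) / (dmax - dmin)) ** theta for di in d]

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(5)]
print([round(q, 2) for q in gen_qualities(pts, anchor=(0.5, 0.5), theta=2.0)])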
We study the performance of our algorithms with
respect to various parameters, which are displayed in
Table 2 (their default values are shown in bold). In each
experiment, only one parameter varies while the others are
fixed to their default values.
5.2 Performance on Queries with Range Scores
This section studies the performance of our algorithms for
top-k spatial preference queries on range scores.
Table 3 shows the I/O cost and execution time of the algorithms for the different aggregate functions (SUM, MIN, and MAX). GP has lower cost than SP because GP computes the scores of points within the same leaf node concurrently. The incremental computation technique (used by SP and GP) derives a tight upper bound score (of each point) for the MIN function, a partially tight bound for SUM, and a loose bound for MAX (see Section 3.6). This explains the performance of SP and GP across the different aggregate functions. The costs of the other methods, however, are mainly influenced by the effectiveness of pruning. BB employs an effective technique to prune unqualified nonleaf entries in the object tree, so it outperforms GP. The optimized score computation method enables BB* to save, on average, 20 percent of the I/O and 30 percent of the time of BB. FJ outperforms its competitors as it discovers qualified combinations of feature entries early.
We ignore SP in subsequent experiments, and compare
the cost of the remaining algorithms on synthetic data sets
with respect to different parameters.
Next, we empirically justify the choice of using level-1 entries of the feature trees F_c for the upper bound score computation routine in the BB algorithm (see Section 3.3). In this experiment, we use the default parameter setting and study how the number of node accesses of BB is affected by the level of F_c used. Table 4 shows the decomposition of node accesses over the tree D and the trees F_c, and the statistics of the upper bound score computation. Each accessed nonleaf node of D invokes a call of the upper bound score computation routine.
When level-0 entries of F_c are used, each upper bound computation call incurs a high number (617.5) of node accesses (of F_c). On the other hand, using level-2 entries for the upper bound computation leads to very loose bounds, making it difficult to prune the leaf nodes of D. Observe that the total cost is minimized when level-1 entries (of F_c) are used. In that case, the number of node accesses per upper bound computation call is low (15), and yet the obtained bounds are tight enough for pruning most leaf nodes of D.
Fig. 8 plots the cost of the algorithms as a function of the
buffer size. As the buffer size increases, the I/O of all
algorithms drops. FJ remains the best method, BB* the
second, and BB the third; all of them outperform GP by a
wide margin. Since the buffer size does not affect the pruning
effectiveness of the algorithms, it has a small impact on the
execution time.
Fig. 9 compares the cost of the algorithms with respect to the object data size |D|. Since the cost of FJ is dominated by the cost of joining the feature data sets, it is insensitive to |D|. On the other hand, the cost of the other methods (GP, BB, and BB*) increases with |D|, as score computations need to be done for more objects in D.
TABLE 2. Range of Parameter Values.
TABLE 3. Effect of the Aggregate Function, Range Scores.
TABLE 4. Effect of the Level of F_c Used for Upper Bound Score Computation in the BB Algorithm.
Fig. 10 plots the I/O cost of the algorithms with respect to the feature data size |F| (of each feature data set). As |F| increases, the cost of GP, BB, and FJ increases. In contrast, BB* experiences a slight cost reduction, as its optimized score computation method (for objects and nonleaf entries) is able to perform pruning early at a large |F| value.
Fig. 11 plots the cost of the algorithms with respect to the number m of feature data sets. The costs of GP, BB, and BB* rise linearly as m increases because the number of component score computations is at most linear in m. On the other hand, the cost of FJ increases significantly with m, because the number of qualified combinations of entries is exponential in m.
Fig. 12 shows the cost of the algorithms as a function of the number k of requested results. GP, BB, and BB* compute the scores of objects in D in batches, so their performance is insensitive to k. As k increases, FJ has weaker pruning power and its cost increases slightly.
Fig. 13 shows the cost of the algorithms when varying the query range ε. As ε increases, all methods access more nodes in the feature trees to compute the scores of the points. The difference in execution time between BB* and FJ shrinks as ε increases. In summary, although FJ is the clear winner in most of the experimental instances, its performance is significantly affected by the number m of feature data sets. BB* is the most robust algorithm to parameter changes and it is recommended for problems with large m.
5.3 Performance on Queries with Influence Scores
We proceed to examine the cost of our algorithms for top-k spatial preference queries on influence scores.
Fig. 14 compares the cost of the algorithms with respect to the number m of feature data sets. The cost follows the trend in Fig. 11. Again, the number of combinations examined by FJ increases exponentially with m, so its cost increases rapidly.
Fig. 15 plots the cost of the algorithms by varying the number k of requested results. Observe that FJ becomes more expensive than BB* (in both I/O and time) when the value of k is beyond 8. This is attributed to two reasons. First, FJ incurs extra computational cost as it needs to invoke Algorithm 7 for computing the upper bound score of a combination of feature entries. Second, FJ incurs a high I/O cost to identify the objects in D that produce high scores with the current combination of features.
Fig. 16 shows the cost of the algorithms as a function of the parameter ε. Interestingly, the trend here is different from the one in Fig. 13. According to (8), when ε decreases, the influence score also decreases, rendering it more difficult to distinguish the scores among different objects. Thus, the cost of BB, BB*, and FJ becomes high at a low ε value. Summing up, for the newly introduced influence score, FJ is more sensitive to parameter changes and it loses to BB* not only when there are multiple feature data sets, but also at large k.
Fig. 8. Effect of buffer size, range scores. (a) I/O. (b) Time.
Fig. 9. Effect of |D|, range scores. (a) I/O. (b) Time.
Fig. 10. Effect of |F|, range scores. (a) I/O. (b) Time.
Fig. 11. Effect of m, range scores. (a) I/O. (b) Time.
Fig. 12. Effect of k, range scores. (a) I/O. (b) Time.
Fig. 13. Effect of ε, range scores. (a) I/O. (b) Time.
5.4 Results on Real Data
In this section, we conduct experiments on real object and feature data sets in order to demonstrate the application of top-k spatial preference queries.
We obtained three real spatial data sets from a travel portal website, http://www.allstays.com/. Locations in these data sets correspond to (longitude and latitude) coordinates in the US. We cleaned the data sets by discarding records without longitude and latitude. Each remaining location is normalized to a point in the 2D space [0, 10000]^2. One data set is used as the object data set and the other two are used as feature data sets. The object data set D contains 11,399 camping locations. The feature data set F_1 contains 30,921 hotel records, each with a room price (quality) and a location. The feature data set F_2 has 3,848 records of Wal-Mart stores, each with a gasoline availability (quality) and a location. The domain of each quality attribute (e.g., room price and gasoline availability) is normalized to the unit interval [0, 1]. Intuitively, a camping location is considered good if it is close to a Wal-Mart store with high gasoline availability (i.e., convenient supply) and a hotel with a high room price (which indirectly reflects the quality of the nearby outdoor environment).
Fig. 17 plots the cost of the algorithms with respect to ε, for queries with range scores. At a very small ε value, most of the objects have a zero score as they have no feature points within their neighborhood. This forces BB, BB*, and FJ to access a large number of objects (or feature combinations) before finding an object with a nonzero score, which can then be used for pruning other unqualified objects.
Fig. 18 compares the cost of the algorithms with respect to ε, for queries with influence scores. In general, the cost follows the trend in Fig. 16. BB* outperforms BB at low ε values, whereas BB incurs a slightly lower cost than BB* at high ε values. Observe that the cost of BB and BB* is close to that of FJ when ε is sufficiently high. In summary, the relative performance of the algorithms in all experiments is consistent with the results on synthetic data.
6 CONCLUSION
In this paper, we studied top-k spatial preference queries, which provide a novel type of ranking for spatial objects based on the qualities of features in their neighborhood. The neighborhood of an object p is captured by the scoring function: 1) the range score restricts the neighborhood to a crisp region centered at p, whereas 2) the influence score relaxes the neighborhood to the whole space and assigns higher weights to locations closer to p.
We presented five algorithms for processing top-k spatial preference queries. The baseline algorithm SP computes the score of every object by querying the feature data sets. The algorithm GP is a variant of SP that reduces I/O cost by computing the scores of objects in the same leaf node concurrently. The algorithm BB derives upper bound scores for nonleaf entries in the object tree, and prunes those that cannot lead to better results. The algorithm BB* is a variant of BB that utilizes an optimized method for computing the scores of objects (and the upper bound scores of nonleaf entries). The algorithm FJ performs a multiway join on the feature trees to obtain qualified combinations of feature points, and then searches for their relevant objects in the object tree.
Based on our experimental findings, BB* is scalable to large data sets and it is the most robust algorithm with respect to various parameters. However, FJ is the best algorithm in cases where the number m of feature data sets is low and each feature data set is small.
Fig. 14. Effect of m, influence scores. (a) I/O. (b) Time.
Fig. 15. Effect of k, influence scores. (a) I/O. (b) Time.
Fig. 16. Effect of ε, influence scores. (a) I/O. (b) Time.
Fig. 17. Effect of ε, range scores, real data. (a) I/O. (b) Time.
Fig. 18. Effect of ε, influence scores, real data. (a) I/O. (b) Time.
In the future, we will study the top-k spatial preference query on a road network, in which the distance between two points is defined by their shortest path distance rather than their Euclidean distance. The challenge is to develop alternative methods for computing the upper bound scores for a group of points on a road network.
ACKNOWLEDGMENTS
This work was supported by grant HKU 715509E from
Hong Kong RGC.
REFERENCES
[1] M.L. Yiu, X. Dai, N. Mamoulis, and M. Vaitis, "Top-k Spatial Preference Queries," Proc. IEEE Int'l Conf. Data Eng. (ICDE), 2007.
[2] N. Bruno, L. Gravano, and A. Marian, "Evaluating Top-k Queries over Web-Accessible Databases," Proc. IEEE Int'l Conf. Data Eng. (ICDE), 2002.
[3] A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching," Proc. ACM SIGMOD, 1984.
[4] G.R. Hjaltason and H. Samet, "Distance Browsing in Spatial Databases," ACM Trans. Database Systems, vol. 24, no. 2, pp. 265-318, 1999.
[5] R. Weber, H.-J. Schek, and S. Blott, "A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces," Proc. Int'l Conf. Very Large Data Bases (VLDB), 1998.
[6] K.S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is Nearest Neighbor Meaningful?" Proc. Seventh Int'l Conf. Database Theory (ICDT), 1999.
[7] R. Fagin, A. Lotem, and M. Naor, "Optimal Aggregation Algorithms for Middleware," Proc. Int'l Symp. Principles of Database Systems (PODS), 2001.
[8] I.F. Ilyas, W.G. Aref, and A. Elmagarmid, "Supporting Top-k Join Queries in Relational Databases," Proc. 29th Int'l Conf. Very Large Data Bases (VLDB), 2003.
[9] N. Mamoulis, M.L. Yiu, K.H. Cheng, and D.W. Cheung, "Efficient Top-k Aggregation of Ranked Inputs," ACM Trans. Database Systems, vol. 32, no. 3, p. 19, 2007.
[10] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao, "Efficient OLAP Operations in Spatial Data Warehouses," Proc. Int'l Symp. Spatial and Temporal Databases (SSTD), 2001.
[11] S. Hong, B. Moon, and S. Lee, "Efficient Execution of Range Top-k Queries in Aggregate R-Trees," IEICE Trans. Information and Systems, vol. 88-D, no. 11, pp. 2544-2554, 2005.
[12] T. Xia, D. Zhang, E. Kanoulas, and Y. Du, "On Computing Top-t Most Influential Spatial Sites," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB), 2005.
[13] Y. Du, D. Zhang, and T. Xia, "The Optimal-Location Query," Proc. Int'l Symp. Spatial and Temporal Databases (SSTD), 2005.
[14] D. Zhang, Y. Du, T. Xia, and Y. Tao, "Progressive Computation of the Min-Dist Optimal-Location Query," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB), 2006.
[15] Y. Chen and J.M. Patel, "Efficient Evaluation of All-Nearest-Neighbor Queries," Proc. IEEE Int'l Conf. Data Eng. (ICDE), 2007.
[16] P.G.Y. Kumar and R. Janardan, "Efficient Algorithms for Reverse Proximity Query Problems," Proc. 16th ACM Int'l Conf. Advances in Geographic Information Systems (GIS), 2008.
[17] M.L. Yiu, P. Karras, and N. Mamoulis, "Ring-Constrained Join: Deriving Fair Middleman Locations from Pointsets via a Geometric Constraint," Proc. 11th Int'l Conf. Extending Database Technology (EDBT), 2008.
[18] M.L. Yiu, N. Mamoulis, and P. Karras, "Common Influence Join: A Natural Join Operation for Spatial Pointsets," Proc. IEEE Int'l Conf. Data Eng. (ICDE), 2008.
[19] Y.-Y. Chen, T. Suel, and A. Markowetz, "Efficient Query Processing in Geographic Web Search Engines," Proc. ACM SIGMOD, 2006.
[20] V.S. Sengar, T. Joshi, J. Joy, S. Prakash, and K. Toyama, "Robust Location Search from Text Queries," Proc. 15th Ann. ACM Int'l Symp. Advances in Geographic Information Systems (GIS), 2007.
[21] S. Berchtold, C. Boehm, D. Keim, and H. Kriegel, "A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space," Proc. ACM Symp. Principles of Database Systems (PODS), 1997.
[22] E. Dellis, B. Seeger, and A. Vlachou, "Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data," Proc. Seventh Int'l Conf. Data Warehousing and Knowledge Discovery (DaWaK), pp. 243-253, 2005.
[23] N. Mamoulis and D. Papadias, "Multiway Spatial Joins," ACM Trans. Database Systems, vol. 26, no. 4, pp. 424-475, 2001.
[24] A. Hinneburg and D.A. Keim, "An Efficient Approach to Clustering in Large Multimedia Databases with Noise," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (KDD), 1998.
Man Lung Yiu received the bachelor's degree in computer engineering and the PhD degree in
computer science from the University of Hong
Kong in 2002 and 2006, respectively. Prior to his
current post, he worked at Aalborg University for
three years starting in the Fall of 2006. He is now
an assistant professor in the Department of
Computing, Hong Kong Polytechnic University.
His research focuses on the management of
complex data, in particular, query processing
topics on spatiotemporal data and multidimensional data.
Hua Lu received the BSc and MSc degrees
from Peking University, China, in 1998 and
2001, respectively, and the PhD degree in
computer science from the National University
of Singapore in 2007. He is currently an
assistant professor in the Department of
Computer Science, Aalborg University, Den-
mark. His research interests include skyline
queries, spatiotemporal databases, geographic
information systems, and mobile computing.
He is a member of the IEEE.
Nikos Mamoulis received the diploma in com-
puter engineering and informatics from the
University of Patras, Greece, in 1995, and the
PhD degree in computer science from the Hong
Kong University of Science and Technology in
2000. He is currently an associate professor in
the Department of Computer Science, University
of Hong Kong, which he joined in 2001. In the
past, he has worked as a research and devel-
opment engineer at the Computer Technology
Institute, Patras, Greece and as a postdoctoral researcher at the
Centrum voor Wiskunde en Informatica (CWI), the Netherlands. During
2008-2009, he was on leave to the Max-Planck Institute for Informatics
(MPII), Germany. His research focuses on the management and mining
of complex data types, including spatial, spatiotemporal, object-
relational, multimedia, text, and semistructured data. He has served
on the program committees of more than 70 international conferences
and workshops on data management and data mining. He was the
general chair of SSDBM 2008, the PC chair of SSTD 2009, and he
organized the SSTDM 2006 and DBRank 2009 workshops. He has
served as PC vice chair of ICDM 2007, ICDM 2008, and CIKM 2009. He
was the publicity chair of ICDE 2009. He is an editorial board member
for Geoinformatica Journal and was a field editor of the Encyclopedia of
Geographic Information Systems.
Michail Vaitis received the diploma in 1992 and the PhD degree in 2001, both in computer engineering and informatics from the University of Patras, Greece. He was collaborating for five
years with the Computer Technology Institute
(CTI), Greece, as a research and development
engineer, working on hypertext and database
systems. Now, he is an assistant professor at
the Department of Geography, University of the
Aegean, Greece, which he joined in 2003. His
research interests include geographical databases, spatial data infra-
structures, and hypermedia models and services.