Best Keyword Cover Search
Abstract—It is common for the objects in a spatial database (e.g., restaurants/hotels) to be associated with keyword(s) indicating their businesses/services/features. An interesting problem, known as Closest Keywords search, is to query a group of objects, called a keyword cover, which together cover a set of query keywords and have the minimum inter-objects distance. In recent years, we observe the increasing availability and importance of keyword rating in object evaluation for better decision making. This motivates us to investigate a generic version of Closest Keywords search, called Best Keyword Cover, which considers the inter-objects distance as well as the keyword rating of objects. The baseline algorithm is inspired by the methods of Closest Keywords search and is based on exhaustively combining objects from different query keywords to generate candidate keyword covers. When the number of query keywords increases, the performance of the baseline algorithm drops dramatically as a result of the massive number of candidate keyword covers generated. To address this drawback, this work proposes a much more scalable algorithm called keyword nearest neighbor expansion (keyword-NNE). Compared to the baseline algorithm, keyword-NNE significantly reduces the number of candidate keyword covers generated. In-depth analysis and extensive experiments on real data sets justify the superiority of our keyword-NNE algorithm.
Index Terms—Spatial database, Points of Interest, Keywords, Keyword Rating, Keyword Cover
1 INTRODUCTION

[Fig. 1: Example objects of the keywords "Hotel", "Restaurant" and "Bar" (t1, t2, s1, s2, c1, ...), each labeled with its keyword rating.]

With the wide availability of extensive digital maps and satellite imagery (e.g., Google Maps and Microsoft Virtual Earth services), the spatial keywords search problem has attracted much attention recently [3, 4, 5, 6, 8, 10, 15, 16, 18].

In a spatial database, each tuple represents a spatial object which is associated with keyword(s) to indicate information such as its businesses/services/features. Given a set of query keywords, an essential task of spatial keywords search is to identify spatial object(s) associated with keywords relevant to the query keywords and with desirable spatial relationships.
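To make the later discussion concrete, the examples in this paper can be modeled with a minimal data structure such as the following sketch. The class and field names (and the toy coordinates and ratings) are illustrative, not from the paper: each object carries a location, one keyword, and a keyword rating.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpatialObject:
    """One tuple of the spatial database: a point of interest with a
    single keyword (e.g., "hotel") and a keyword rating."""
    x: float        # x coordinate (e.g., longitude)
    y: float        # y coordinate (e.g., latitude)
    keyword: str    # the keyword the object is associated with
    rating: float   # keyword rating, here normalized to (0, 1]

# A toy database in the spirit of Fig. 1: hotels (t), restaurants (s), bars (c).
DB = [
    SpatialObject(1.0, 2.0, "hotel", 0.6),
    SpatialObject(4.0, 1.0, "hotel", 0.2),
    SpatialObject(2.0, 3.0, "restaurant", 0.5),
    SpatialObject(5.0, 2.0, "restaurant", 0.1),
    SpatialObject(3.0, 2.5, "bar", 0.9),
]
```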
Definition 2 (Keyword Cover): Given a set of query keywords T, a set of objects O is a keyword cover of T if the objects in O together cover all keywords in T and each object in O is associated with one and only one keyword in T.

Definition 3 (Best Keyword Cover Query (BKC)): Given a spatial database D and a set of query keywords T, the BKC query returns a keyword cover O of T (O ⊂ D) such that O.score ≥ O′.score for any keyword cover O′ of T (O′ ⊂ D).

The notations used in this work are summarized in Table 1.

TABLE 1
Summary of Notations

Notation       Interpretation
D              A spatial database.
T              A set of query keywords.
Ok             The set of objects associated with keyword k.
ok             An object in Ok.
KC_o           The set of keyword covers in each of which o is a member.
kc_o           A keyword cover in KC_o.
lbkc_o         The local best keyword cover of o, i.e., the keyword cover in KC_o with the highest score.
ok.nn^n_ki     ok's n-th keyword nearest neighbor in query keyword ki.
KRR*k-tree     The keyword rating R*-tree of Ok.
Nk             A node of the KRR*k-tree.
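Definitions 2 and 3 can be read directly as a brute-force procedure, sketched below. This is not the paper's algorithm, only a specification-level baseline; the score function is assumed to follow the α-weighted form that Equation (2) is later described as having (diameter versus minimum keyword rating), with illustrative max_dist and max_rating normalizers.

```python
from itertools import product
from math import dist

def diameter(objs):
    """Maximum pairwise distance among the objects: their inter-objects distance."""
    return max((dist((a.x, a.y), (b.x, b.y)) for a in objs for b in objs), default=0.0)

def score(objs, alpha=0.4, max_dist=10.0, max_rating=1.0):
    """Assumed shape of the score function (Equation (2)): higher is better,
    rewarding a small diameter and a high minimum keyword rating."""
    min_rating = min(o.rating for o in objs)
    return alpha * (1 - diameter(objs) / max_dist) + (1 - alpha) * min_rating / max_rating

def bkc_bruteforce(db, T, **kw):
    """Definition 3, verbatim: enumerate every keyword cover of T
    (Definition 2: one object per query keyword) and return the best one."""
    groups = [[o for o in db if o.keyword == k] for k in T]
    best = None
    for cover in product(*groups):   # exhaustive combination across keywords
        if best is None or score(cover, **kw) > score(best, **kw):
            best = cover
    return best

# e.g., bkc_bruteforce(DB, ["hotel", "restaurant", "bar"])
```

The enumeration grows multiplicatively in the number of query keywords, which is exactly the scalability problem the rest of the paper attacks.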
3 RELATED WORK

Spatial Keyword Search

Recently, spatial keyword search has received considerable attention from the research community. Some existing works focus on retrieving individual objects by specifying a query consisting of a query location and a set of query keywords (also known as a document in some contexts). Each retrieved object is associated with keywords relevant to the query keywords and is close to the query location [3, 5, 6, 8, 10, 15, 16]. Similarity measures between documents (e.g., [14]) are applied to measure the relevance between two sets of keywords.

Since it is likely that no individual object is associated with all query keywords, some other works aim to retrieve multiple objects which together cover all query keywords [4, 17, 18]. While potentially a large number of object combinations satisfy this requirement, the research problem is that the retrieved objects must have a desirable spatial relationship. In [4], the authors put forward the problem of retrieving objects which 1) cover all query keywords, 2) have minimum inter-objects distance, and 3) are close to a query location. The works [17, 18] study a similar problem called m Closest Keywords (mCK). mCK aims to find objects which cover all query keywords and have the minimum inter-objects distance. Since no query location is given in mCK, the search space is not constrained by a query location. The problem studied in this paper is a generic version of the mCK query which also considers the keyword rating of objects.

Access Methods

The approaches proposed by Cong et al. [5] and Li et al. [10] employ a hybrid index that augments the non-leaf nodes of an R/R*-tree with inverted indexes. The inverted index at each node refers to a pseudo-document that represents the keywords under the node. Therefore, in order to verify whether a node is relevant to a set of query keywords, the inverted index is accessed at each node to evaluate the matching between the query keywords and the pseudo-document associated with the node.

In [18], the bR*-tree was proposed, where a bitmap is kept for each node instead of a pseudo-document. Each bit corresponds to a keyword. If a bit is "1", it indicates that some object(s) under the node is associated with the corresponding keyword; "0" otherwise. A bR*-tree example is shown in Figure 2(a), where a non-leaf node N has four child nodes N1, N2, N3, and N4. The bitmaps of N1, N2, N4 are 111 and the bitmap of N3 is 101. Specifically, the bitmap 101 indicates that some objects under N3 are associated with the keywords "hotel" and "restaurant" respectively, and no object under N3 is associated with the keyword "bar". The bitmap makes it possible to combine nodes to generate candidate keyword covers. If a node contains all query keywords, this node is a candidate keyword cover. If multiple nodes together cover all query keywords, they constitute a candidate keyword cover. Suppose the query keywords are 111. When N is visited, its child nodes N1, N2, N3, N4 are processed. N1, N2, N4 are associated with all query keywords and N3 is associated with two query keywords. The candidate keyword covers generated are {N1}, {N2}, {N4}, {N1, N2}, {N1, N3}, {N1, N4}, {N2, N3}, {N2, N4}, {N3, N4}, {N1, N2, N3}, {N1, N3, N4} and {N2, N3, N4}. Among the candidate keyword covers, the one with the best evaluation is processed by combining its child nodes to generate more candidates. However, the number of candidates generated can be very large. Thus, a depth-first bR*-tree browsing strategy is applied in order to access the objects in leaf nodes, and thereby obtain the current best solution, as soon as possible. The current best solution is used to prune candidate keyword covers. In the same way, the remaining candidates are processed and the current best solution is updated once a better solution is identified. When all candidates have been pruned, the current best solution is returned as the answer to the mCK query.

In [17], a virtual bR*-tree based method is introduced to handle the mCK query, with the aim of handling data sets with a massive number of keywords. Compared to the method in [18], a different index structure is utilized. In the virtual bR*-tree based method, an R*-tree is used to index the locations of objects and an inverted index is used to label the leaf nodes in the R*-tree associated with each keyword. Since only leaf nodes have keyword information, the mCK query is processed by browsing the index bottom-up. At first, the m inverted lists corresponding to the query keywords are retrieved, and all objects from the same leaf node are fetched to construct a virtual node in memory. Clearly, it has a counterpart in the original R*-tree. Each time a virtual node is constructed, it is treated as a subtree which is browsed in the same way as in [18]. Compared to the bR*-tree, the number of nodes in the R*-tree is greatly reduced, such that I/O cost is saved.
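The bitmap test of [18] is cheap to illustrate. The sketch below reproduces the Figure 2(a) example; note that it enumerates every node subset whose OR'ed bitmaps cover the query, which yields a superset of the twelve candidates listed above. The actual bR*-tree algorithm additionally restricts and prunes the combinations it generates.

```python
from itertools import combinations

# Keyword bitmaps, one bit per keyword: bit2 = hotel, bit1 = restaurant, bit0 = bar.
NODES = [("N1", 0b111), ("N2", 0b111), ("N3", 0b101), ("N4", 0b111)]

def covers(query_bits: int, node_bits: int) -> bool:
    """A (merged) bitmap covers the query if every query bit is set in it."""
    return query_bits & node_bits == query_bits

def candidate_covers(nodes, query_bits):
    """Enumerate node subsets whose combined bitmaps cover the query keywords."""
    cands = []
    for r in range(1, len(nodes) + 1):
        for combo in combinations(nodes, r):
            merged = 0
            for _, bits in combo:
                merged |= bits          # OR the bitmaps of the combined nodes
            if covers(query_bits, merged):
                cands.append([name for name, _ in combo])
    return cands

print(candidate_covers(NODES, 0b111))   # query keywords 111: hotel, restaurant, bar
```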
[Figure: candidate combinations of KRR*-tree nodes N1k1, N2k1, N1k2, N2k2, N1k3, N2k3 of query keywords k1, k2, k3.]
combining objects (or their MBRs). Even though pruning techniques have been explored, it has been observed that the performance drops dramatically when the number of query keywords increases, because of the fast increase in the number of candidate keyword covers generated. This motivates us to develop a different algorithm called keyword nearest neighbor expansion (keyword-NNE).

We focus on a particular query keyword, called the principal query keyword. The objects associated with the principal query keyword are called principal objects. Let k be the principal query keyword. The set of principal objects is denoted as Ok.

Definition 4 (Local Best Keyword Cover): Given a set of query keywords T and the principal query keyword k ∈ T, the local best keyword cover of a principal object ok is

  lbkc_ok = {kc_ok | kc_ok ∈ KC_ok, kc_ok.score = max_{kc ∈ KC_ok} kc.score},   (5)

where KC_ok is the set of keyword covers in each of which the principal object ok is a member.

For each principal object ok ∈ Ok, lbkc_ok is identified. Among all principal objects, the lbkc_ok with the highest score is called the global best keyword cover (GBKC_k).

Lemma 3: GBKC_k is the solution of the BKC query.

Proof: Assume the solution of the BKC query is a keyword cover kc other than GBKC_k, i.e., kc.score > GBKC_k.score. Let ok be the principal object in kc. By definition, lbkc_ok.score ≥ kc.score, and GBKC_k.score ≥ lbkc_ok.score. So, GBKC_k.score ≥ kc.score must be true. This conflicts with the assumption that the BKC solution is a keyword cover kc other than GBKC_k.

The sketch of the keyword-NNE algorithm is as follows:

Sketch of Keyword-NNE Algorithm
Step 1. One query keyword k ∈ T is selected as the principal query keyword;
Step 2. For each principal object ok ∈ Ok, lbkc_ok is computed;
Step 3. In Ok, GBKC_k is identified;
Step 4. Return GBKC_k.

Conceptually, any query keyword can be selected as the principal query keyword. Since computing lbkc is required for each principal object, the query keyword with the minimum number of objects is selected as the principal query keyword in order to achieve high performance.

6.1.1 Keyword Nearest Neighbor

Definition 5 (Keyword Nearest Neighbor (Keyword-NN)): Given a set of query keywords T, the principal query keyword is k ∈ T and a non-principal query keyword is ki ∈ T \ {k}. Ok is the set of principal objects and Oki is the set of objects of keyword ki. The keyword nearest neighbor of a principal object ok ∈ Ok in keyword ki is oki ∈ Oki iff {ok, oki}.score ≥ {ok, o′ki}.score for all o′ki ∈ Oki.

The first keyword-NN of ok in keyword ki is denoted as ok.nn1_ki, the second keyword-NN as ok.nn2_ki, and so on. These keyword-NNs can be retrieved by browsing the KRR*ki-tree. Let Nki be a node in the KRR*ki-tree:

  {ok, Nki}.score = score(A, B),   (6)
  A = dist(ok, Nki),
  B = min(Nki.maxrating, ok.rating),

where dist(ok, Nki) is the minimum distance between ok and Nki in the 2-dimensional geographical space defined by the x and y dimensions, and Nki.maxrating is the maximum value of Nki in the keyword rating dimension.

Lemma 4: For any object oki under node Nki in the KRR*ki-tree,

  {ok, Nki}.score ≥ {ok, oki}.score.   (7)

Proof: It is a special case of Lemma 2.

To retrieve the keyword-NNs of a principal object ok in keyword ki, the KRR*ki-tree is browsed in the best-first strategy [9]. The root node of the KRR*ki-tree is visited first by keeping its child nodes in a heap H. For each node Nki ∈ H, {ok, Nki}.score is computed. The node in H with the highest score is replaced by its child nodes. This operation is repeated until an object oki (not a KRR*ki-tree node) is visited. {ok, oki}.score is denoted as current_best and oki is the current best object. According to Lemma 4, any node Nki ∈ H is pruned if {ok, Nki}.score ≤ current_best. When H is empty, the current best object is ok.nn1_ki. In a similar way, ok.nnj_ki (j > 1) can be identified.

6.1.2 lbkc Computing Algorithm

Computing lbkc_ok amounts to incrementally retrieving the keyword-NNs of ok in each non-principal query keyword. An example is shown in Figure 4, where the query keywords are "hotel", "restaurant" and "bar".
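The best-first retrieval just described can be sketched as follows. The node interface (children(), is_leaf_object, and the quantities used by Equation (6)) is an assumption, not taken from the paper; objects are assumed to expose the same interface with maxrating equal to their own rating. The code is written in the standard best-first nearest-neighbor form of [9], which is equivalent to the current_best/pruning description above because node scores upper-bound object scores (Lemma 4).

```python
import heapq

def pair_score(ok, entry, alpha=0.4, max_dist=10.0, max_rating=1.0):
    """Equation (6): A = dist(ok, entry) (minimum geographic distance),
    B = min(entry.maxrating, ok.rating), combined as in the score function."""
    a = entry.mindist(ok.x, ok.y)
    b = min(entry.maxrating, ok.rating)
    return alpha * (1 - a / max_dist) + (1 - alpha) * b / max_rating

def keyword_nns(ok, root, n=1):
    """Retrieve ok's first n keyword-NNs by best-first browsing of the
    KRR*_ki-tree rooted at `root`. By Lemma 4 a node's score upper-bounds
    the score of every object below it, so popping in score order is safe."""
    heap = [(-pair_score(ok, root), id(root), root)]
    result = []
    while heap and len(result) < n:
        _, _, entry = heapq.heappop(heap)
        if entry.is_leaf_object:      # an object: the next best keyword-NN
            result.append(entry)
        else:                         # a tree node: expand its children
            for child in entry.children():
                heapq.heappush(heap, (-pair_score(ok, child), id(child), child))
    return result
```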
[Figure: Computing lbkc_t3, with objects of the keywords "Hotel", "Restaurant" and "Bar" (s2, s3, t3, c2, c4, ...) and their keyword ratings.]

Lemma 6: lbkc_ok ≡ kc once kc.score > max_{ki ∈ T \ {k}} (ki.score).

Input: A set of query keywords T, a principal object ok
Output: lbkc_ok
1 foreach non-principal query keyword ki ∈ T do
2   S ← retrieve ok.nn1_ki;
3   ki.score ← {ok, ok.nn1_ki}.score;
4   ki.n = 1;
5 kc ← the keyword cover in S;
6 while T ≠ ∅ do
  …
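The recovered pseudocode breaks off inside the while-loop, so the following is only a plausible completion consistent with the surrounding text and Lemma 6: keep retrieving further keyword-NNs for the most promising non-principal keyword, form candidate covers around ok, and stop once the best cover found so far beats every remaining ki.score. It reuses keyword_nns and score from the earlier sketches and assumes every query keyword has at least one object.

```python
def lbkc(ok, T, trees, score):
    """Sketch of lbkc_ok computation for principal object ok; trees[ki] is
    the KRR*_ki-tree root. A plausible completion, not the verbatim algorithm."""
    others = [ki for ki in T if ki != ok.keyword]
    S = {ki: keyword_nns(ok, trees[ki], n=1) for ki in others}   # lines 1-2
    ki_score = {ki: score([ok, S[ki][0]]) for ki in others}      # line 3
    kc = [ok] + [S[ki][0] for ki in others]                      # line 5
    best = score(kc)
    active = set(others)
    while active:                                                # line 6 ...
        if best > max(ki_score[ki] for ki in active):
            break                          # Lemma 6: kc is final
        ki = max(active, key=lambda k: ki_score[k])   # most promising keyword
        got = keyword_nns(ok, trees[ki], n=len(S[ki]) + 1)  # re-browses for brevity
        if len(got) <= len(S[ki]):         # keyword ki is exhausted
            active.discard(ki)
            continue
        S[ki].append(got[-1])              # the next keyword-NN of ok in ki
        ki_score[ki] = score([ok, got[-1]])
        # combine the retrieved keyword-NNs into a new candidate cover of ok
        cand = [ok] + [max(S[k], key=lambda o: score([ok, o])) for k in others]
        if score(cand) > best:
            best, kc = score(cand), cand
        if ki_score[ki] <= best:
            active.discard(ki)   # no later NN of ki can improve a cover with ok
    return kc, best
```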
If lbkc_ok.score exceeds the score of the current best solution bkc (bkc.score = 0 initially), bkc is updated to be lbkc_ok. For any Nk ∈ H, Nk is pruned if lbkc_Nk.score ≤ bkc.score, since lbkc_ok.score ≤ lbkc_Nk.score for every ok under Nk in the KRR*k-tree according to Lemma 7. Once H is empty, bkc is returned as the answer to the BKC query.
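Putting the pieces together, the browsing just described looks roughly like the sketch below. Here lbkc is assumed to work for principal tree nodes as well as objects (computing lbkc_Nk from the node's MBR and maxrating bounds), which is how the text uses it; the function names are carried over from the earlier sketches.

```python
import heapq

def keyword_nne(principal_root, T, trees, score):
    """Sketch of the main keyword-NNE loop: a max-heap H of principal
    nodes/objects keyed by lbkc score; entries are expanded only while
    their lbkc score exceeds the current best solution bkc."""
    bkc, bkc_score = None, 0.0                    # bkc.score = 0 initially
    H = []
    for child in principal_root.children():
        cover, s = lbkc(child, T, trees, score)
        heapq.heappush(H, (-s, id(child), child, cover))
    while H:
        neg_s, _, entry, cover = heapq.heappop(H)
        if -neg_s <= bkc_score:
            continue                              # pruned: lbkc_Nk.score <= bkc.score
        if entry.is_leaf_object:                  # a principal object: a real cover
            bkc, bkc_score = cover, -neg_s        # update current best solution
        else:                                     # a principal node: expand it
            for child in entry.children():
                c_cover, c_s = lbkc(child, T, trees, score)
                heapq.heappush(H, (-c_s, id(child), child, c_cover))
    return bkc, bkc_score
```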
7 WEIGHTED AVERAGE OF KEYWORD RATINGS

To this point, the minimum keyword rating of the objects in O is used in O.score. However, it is unsurprising that a user may prefer the weighted average of the keyword ratings of the objects in O to measure O.score:

  O.score = α × (1 − diam(O)/max_dist) + (1 − α) × W_Average(O)/max_rating,   (8)

where W_Average(O), defined in Equation (9), is the weighted average of the keyword ratings of the objects in O: each query keyword ki has a weight wki and Σ_{ki ∈ T} wki = 1. For example, a user may give a higher weight to "hotel" but a lower weight to "restaurant" in a BKC query. Given the score function in Equation (8), the baseline algorithm and the keyword-NNE algorithm can be used to process the BKC query with minor modification. The core is to maintain the properties in Lemma 1 and Lemma 2, which are the foundation of the pruning techniques in the baseline algorithm and the keyword-NNE algorithm.

However, the property in Lemma 1 is invalid given the score function defined in Equation (9). To maintain this property, if a combination does not cover a query keyword ki, the combination is modified by inserting a virtual object associated with ki. This virtual object does not change the diameter of the combination, but it has the maximum rating of ki (for a combination of nodes, a virtual node …
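Under this reading of Equations (8) and (9) (the exact form of Equation (9) is not recoverable here, so W_Average(O) is assumed to be a weighted average of ratings with per-keyword weights summing to 1), the modified score and the virtual-object padding look like the following sketch, reusing diameter from the earlier sketch:

```python
def score_weighted(objs, weights, alpha=0.4, max_dist=10.0, max_rating=1.0):
    """Equation (8): the minimum keyword rating is replaced by a weighted
    average of the ratings; weights[k] sums to 1 over the query keywords."""
    w_avg = sum(weights[o.keyword] * o.rating for o in objs)
    return alpha * (1 - diameter(objs) / max_dist) + (1 - alpha) * w_avg / max_rating

def padded_score(objs, T, weights, max_rating_of, alpha=0.4,
                 max_dist=10.0, max_rating=1.0):
    """Virtual-object trick from the text: a combination missing some query
    keywords is padded with virtual objects that leave the diameter unchanged
    but carry the maximum rating of the missing keyword, so the partial
    combination's score upper-bounds every completion (restoring Lemma 1)."""
    covered = {o.keyword for o in objs}
    virtual = sum(weights[k] * max_rating_of[k] for k in T if k not in covered)
    w_avg = sum(weights[o.keyword] * o.rating for o in objs) + virtual
    return alpha * (1 - diameter(objs) / max_dist) + (1 - alpha) * w_avg / max_rating
```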
All candidate keyword covers generated in the BF-baseline algorithm are grouped into independent groups. Each group is associated with one principal node (or object). That is, candidate keyword covers fall in the same group if they have the same principal node (or object). Given a principal node Nk, let G_Nk be the associated group. The example in Figure 5 shows G_Nk, where some keyword covers such as kc1, kc2 have scores greater than BKC.score, denoted as G1_Nk, and some keyword covers such as kc3, kc4 have scores not greater than BKC.score, denoted as G2_Nk. In the BF-baseline algorithm, G_Nk is maintained in H before the first current best solution is obtained, and every keyword cover in G1_Nk needs to be further processed.

In the keyword-NNE algorithm, the keyword cover in G_Nk with the highest score, i.e., lbkc_Nk, is identified and maintained in memory. That is, each principal node (or object) keeps its lbkc only. The total number of principal nodes (or objects) is O(n log n), where n is the number of principal objects. So, the memory requirement for maintaining H is O(n log n). This (almost) linear memory requirement makes the best-first browsing strategy practical in the keyword-NNE algorithm. Due to the best-first browsing strategy, lbkc_Nk is further processed in the keyword-NNE algorithm only if lbkc_Nk.score > BKC.score.

8.2.1 Instance Optimality

Instance optimality [7] corresponds to optimality in every instance, as opposed to just the worst case or the average case. There are many algorithms that are optimal in a worst-case sense but are not instance optimal. An example is binary search. In the worst case, binary search is guaranteed to require no more than log N probes for N data items. With linear search, which scans through the sequence of data items, N probes are required in the worst case. However, binary search is not better than linear search in all instances. When the search item is in the very first position of the sequence, a positive answer can be obtained in one probe and a negative answer in two probes using linear search. Binary search may still require log N probes.
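The binary-versus-linear contrast is easy to verify by counting probes; a small sketch (counting equality and order checks separately, as the example above does):

```python
def linear_probes(sorted_seq, target):
    """Comparisons made by linear search on a sorted sequence."""
    probes = 0
    for v in sorted_seq:
        probes += 1                 # probe: v == target?
        if v == target:
            return probes           # positive answer
        probes += 1                 # probe: v > target?
        if v > target:
            return probes           # negative answer: target cannot appear later
    return probes

def binary_probes(sorted_seq, target):
    """Comparisons (loop iterations) made by binary search."""
    lo, hi, probes = 0, len(sorted_seq) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_seq[mid] == target:
            return probes
        lo, hi = (mid + 1, hi) if sorted_seq[mid] < target else (lo, mid - 1)
    return probes

seq = list(range(1, 1025))                       # 1024 sorted items
print(linear_probes(seq, 1), binary_probes(seq, 1))   # 1 vs ~10 probes
print(linear_probes(seq, 0), binary_probes(seq, 0))   # 2 vs ~10 probes
```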
Instance optimality can be formally defined as follows: for a class of correct algorithms A and a class of valid inputs D to the algorithms, cost(A, D) represents the amount of a resource consumed by running algorithm A ∈ A on input D ∈ D. An algorithm B ∈ A is instance optimal over A and D if cost(B, D) = O(cost(A, D)) for all A ∈ A and all D ∈ D. This cost could be the running time of algorithm A over input D.

Theorem 1: Let D be the class of all possible spatial databases where each tuple is a spatial object and is associated with a keyword. Let A be the class of correct BKC processing algorithms over D ∈ D in which multiple KRR*-trees, one for each keyword, are explored by combining nodes at the same hierarchical level down to the leaf nodes, and no combination of objects (or nodes of KRR*-trees) has been pre-processed. The keyword-NNE algorithm is optimal over A and D in terms of the number of candidate keyword covers which are further processed.

Proof: Due to the best-first browsing strategy, lbkc_Nk is further processed in the keyword-NNE algorithm only if lbkc_Nk.score > BKC.score. In any algorithm A ∈ A, a number of candidate keyword covers need to be generated and assessed, since no combination of objects (or nodes of KRR*-trees) has been pre-processed. Given a node (or object) N, the candidate keyword covers generated can be organized in a group if they contain N. In this group, if one keyword cover has a score greater than BKC.score, the possibility exists that the solution of the BKC query is related to this group. In this case, A needs to process at least one keyword cover in this group; if A fails to do so, it may return an incorrect solution. That is, no algorithm in A can process fewer candidate keyword covers than the keyword-NNE algorithm.

8.2.2 Candidate Keyword Covers Processing

Every candidate keyword cover in G1_Nk is further processed in the BF-baseline algorithm. In the example in Figure 5, kc1 is further processed, as is every kc ∈ G1_Nk. Let us look closer at the processing of kc1 = {Nk, Nk1, Nk2}. As introduced in section 4, each node N in a KRR*-tree is defined as N(x, y, r, lx, ly, lr), which can be represented with 48 bytes. If the disk page size is 4096 bytes, a reasonable fan-out of the KRR*-tree is 40-50. That is, each node in kc1 (i.e., Nk, Nk1 and Nk2) has 40-50 child nodes. When kc1 is processed in the BF-baseline algorithm, these child nodes are combined to generate candidate keyword covers using Algorithm 3.

In the keyword-NNE algorithm, one and only one keyword cover in G1_Nk, i.e., lbkc_Nk, is further processed. For each child node cNk of Nk, lbkc_cNk is computed. For computing lbkc_cNk, a number of keyword-NNs of cNk are retrieved and combined to generate more candidate keyword covers using Algorithm 3. The experiments on real data sets illustrate that, on average, only 2-4 keyword-NNs in each non-principal query keyword are retrieved in the computation of lbkc_cNk.

When further processing a candidate keyword cover, the keyword-NNE algorithm typically generates far fewer new candidate keyword covers than the BF-baseline algorithm. Since the number of candidate keyword covers further processed in the keyword-NNE algorithm is optimal (Theorem 1), the number of keyword covers generated in the BF-baseline algorithm is much greater than that in the keyword-NNE algorithm. In turn, we conclude that the number of keyword covers generated in the baseline algorithm is much greater than that in the keyword-NNE algorithm. This conclusion is independent of the principal query keyword, since the analysis does not apply any constraint on the selection strategy of the principal query keyword.
9 EXPERIMENT

In this section we experimentally evaluate the keyword-NNE algorithm and the baseline algorithm. We use four real data sets, namely Yelp, Yellow Page, AU, and US. Specifically, Yelp is a data set extracted from the Yelp Academic Dataset (www.yelp.com), which contains 7707 POIs (i.e., points
of interest, which are equivalent to the objects in this work) with 27 keywords, where the average, maximum and minimum numbers of POIs per keyword are 285, 1353 and 120 respectively. Yellow Page is a data set obtained from yellowpage.com.au in Sydney which contains 30444 POIs with 26 keywords, where the average, maximum and minimum numbers of POIs per keyword are 1170, 10284 and 154 respectively. All POIs in Yelp have been rated by customers from 1 to 10. About half of the POIs in Yellow Page have been rated by Yelp; the unrated POIs are assigned the average rating 5. AU and US are extracted from a public source.3 AU contains 678581 POIs in Australia with 187 keywords, where the average, maximum and minimum numbers of POIs per keyword are 3728, 53956 and 403 respectively. US contains 1541124 POIs with 237 keywords, where the average, maximum and minimum numbers of POIs per keyword are 6502, 122669 and 400. In AU and US, keyword ratings from 1 to 10 are randomly assigned to POIs. The ratings follow a normal distribution with mean µ = 5 and standard deviation σ = 1. The distribution of POIs over keywords is illustrated in Figure 6. For each data set, the POIs of each keyword are indexed using a KRR*-tree.

[Fig. 6. The distribution of keyword size in the test data sets (averages: Yelp 285, Yellow Page 1170, AU 3728, US 6502).]

We are interested in 1) the number of candidate keyword covers generated, 2) the BKC query response time, 3) the maximum memory consumed, and 4) the average number of keyword-NNs of each principal node (or object) retrieved for computing lbkc, and the number of lbkcs computed for answering the BKC query. In addition, we test the performance in the situation where the weighted average of keyword ratings is applied, as discussed in section 7. All algorithms are implemented in Java 1.7.0, and all experiments have been performed on a Windows XP PC with a 3.0 GHz CPU and 3 GB main memory.

In Figure 7, the number of keyword covers generated in the baseline algorithm is compared to that in the algorithms directly extended from [17, 18] when the number of query keywords m changes from 2 to 9. It shows that the baseline algorithm has better performance in all settings. This is consistent with the analysis in section 5. The test results on the Yellow Page and Yelp data sets are shown in Figure 7(a), which represents data sets with a small number of keywords. The test results on the AU and US data sets are shown in Figure 7(b), which represents data sets with a large number of keywords. As observed, when the number of keywords in a data set is small, the difference between the baseline algorithm and the directly extended algorithms is reduced. The reason is that the single tree index in the directly extended algorithms has more pruning power in this case (as discussed in section 4).

[Fig. 7. Baseline, Virtual bR*-tree and bR*-tree: (a) Yellow Page and Yelp; (b) AU and US.]

9.1 Effect of m

The number of query keywords m has a significant impact on query processing efficiency. In this test, m is changed from 2 to 9 with α = 0.4. Each BKC query is generated by randomly selecting m keywords from all keywords as the query keywords. For each setting, we generate and perform 100 BKC queries, and the averaged results are reported.4

Figure 8 shows the number of candidate keyword covers generated for BKC query processing. When m increases, the number of candidate keyword covers generated increases dramatically in the baseline algorithm. In contrast, the keyword-NNE algorithm shows much better scalability. The reason has been explained in section 8.

Figure 9 reports the average response time of BKC queries when m changes. The response time is closely related to the candidate keyword covers generated during query processing. In the baseline algorithm, the response time increases very fast when m increases. This is consistent with the fast increase of the candidate keyword covers generated as m increases. Compared to the baseline algorithm, the keyword-NNE algorithm shows a much slower increase as m increases.

For BKC queries with the same settings of m and α, the performance differs across the datasets US, AU, Yellow Page and Yelp, as shown in Figure 8 and Figure 9. The reason is that the average numbers of objects per keyword in the datasets US, AU, Yellow Page and Yelp are 6502, 3728, 1170 and 285 respectively, as shown in Figure 6; in turn, the average numbers of objects in one query keyword in the BKC queries on the datasets US,

3. http://s3.amazonaws.com/simplegeo-public/places_dump_20110628.zip
4. In this work, all experimental results are obtained in the same way.
AU, Yellow Page and Yelp are expected to be the same. The experimental results show that a higher average number, such as on the dataset US, leads to more candidate keyword covers and more processing time.

9.2 Effect of α

This test shows the impact of α on the performance. As shown in Equation (2), α is an application-specific parameter to balance the weight of the keyword rating and the diameter in the score function. Compared to m, the impact of α on the performance is limited. When α = 1, the BKC query degrades to the mCK query, where the distance between objects is the sole factor and the keyword rating is ignored. When α changes from 1 to 0, more weight is assigned to the keyword rating. In Figure 10, an interesting observation is that, with the decrease of α, the number of keyword covers generated in both the baseline algorithm and the keyword-NNE algorithm shows a constant trend of slight decrease. The reason behind this is that the KRR*-tree has a keyword rating dimension. Objects close to each other geographically may have very different ratings, and thus they fall in different nodes of the KRR*-tree. If more weight is assigned to keyword ratings, the KRR*-tree tends to have more pruning power by distinguishing objects that are close to each other but have different keyword ratings. As a result, fewer candidate keyword covers are generated. Figure 11 presents
the average response time of the queries, which is consistent with the number of candidate keyword covers generated.

The BKC query provides robust solutions to meet various practical requirements while the mCK query cannot. Suppose we have three query keywords in the Yelp dataset, namely "bars", "hotels & travel", and "fast food". When α = 1, the BKC query (equivalent to the mCK query) returns Pita House, Scottsdale Neighborhood Trolley, and Schlotzskys (the names of the selected objects in the keywords "bars", "hotels & travel", and "fast food" respectively), where the lowest keyword rating is 2.5 and the maximum distance is 0.045 km. When α = 0.4, the BKC query returns The Attic, Enterprise Rent-A-Car and Chick-Fil-A, where the lowest keyword rating is 4.5 and the maximum distance is 0.662 km.

Only around 10% of the principal nodes (or objects) compute their lbkcs across different sizes of data sets; in other words, 90% of the overall principal nodes (or objects) are pruned during the query processing.

[Fig. 13. Features of keyword-NNE (α = 0.4): (a) average size of S vs. m; (b) number of lbkcs computed vs. m.]
The number of candidate keyword covers which need to be further processed in the keyword-NNE algorithm is optimal, and processing each candidate keyword cover typically generates far fewer new candidate keyword covers in the keyword-NNE algorithm than in the baseline algorithm.
REFERENCES

[1] Rakesh Agrawal and Ramakrishnan Srikant. "Fast algorithms for mining association rules in large databases". In: VLDB. 1994, pp. 487–499.
[2] T. Brinkhoff, H. Kriegel, and B. Seeger. "Efficient processing of spatial joins using R-trees". In: SIGMOD (1993), pp. 237–246.
[3] Xin Cao, Gao Cong, and Christian S. Jensen. "Retrieving top-k prestige-based relevant spatial web objects". In: Proc. VLDB Endow. 3.1-2 (2010), pp. 373–384.
[4] Xin Cao et al. "Collective spatial keyword querying". In: ACM SIGMOD. 2011.
[5] G. Cong, C. Jensen, and D. Wu. "Efficient retrieval of the top-k most relevant spatial web objects". In: Proc. VLDB Endow. 2.1 (2009), pp. 337–348.
[6] Ian De Felipe, Vagelis Hristidis, and Naphtali Rishe. "Keyword Search on Spatial Databases". In: ICDE. 2008, pp. 656–665.
[7] R. Fagin, A. Lotem, and M. Naor. "Optimal Aggregation Algorithms for Middleware". In: Journal of Computer and System Sciences 66 (2003), pp. 614–656.
[8] Ramaswamy Hariharan et al. "Processing Spatial-Keyword (SK) Queries in Geographic Information Retrieval (GIR) Systems". In: Proceedings of the 19th International Conference on Scientific and Statistical Database Management. 2007, pp. 16–23.
[9] G. R. Hjaltason and H. Samet. "Distance browsing in spatial databases". In: TODS 2 (1999), pp. 256–318.
[10] Z. Li et al. "IR-tree: An efficient index for geographic document search". In: TKDE 99.4 (2010), pp. 585–599.
[11] N. Mamoulis and D. Papadias. "Multiway spatial joins". In: TODS 26.4 (2001), pp. 424–475.
[12] D. Papadias, N. Mamoulis, and B. Delis. "Algorithms for querying by spatial structure". In: VLDB (1998), p. 546.
[13] D. Papadias, N. Mamoulis, and Y. Theodoridis. "Processing and optimization of multiway spatial joins using R-trees". In: PODS (1999), pp. 44–55.
[14] J. M. Ponte and W. B. Croft. "A language modeling approach to information retrieval". In: SIGIR (1998), pp. 275–281.
[15] João B. Rocha-Junior et al. "Efficient processing of top-k spatial keyword queries". In: Proceedings of the 12th International Conference on Advances in Spatial and Temporal Databases. 2011, pp. 205–222.
[16] S. B. Roy and K. Chakrabarti. "Location-Aware Type Ahead Search on Spatial Databases: Semantics and Efficiency". In: SIGMOD (2011).
[17] Dongxiang Zhang, Beng Chin Ooi, and Anthony K. H. Tung. "Locating mapped resources in web 2.0". In: ICDE (2010).
[18] Dongxiang Zhang et al. "Keyword Search in Spatial Databases: Towards Searching by Document". In: ICDE. 2009, pp. 688–699.

Ke Deng was awarded a PhD degree in computer science from The University of Queensland, Australia, in 2007 and a Master degree in Information and Communication Technology from Griffith University, Australia, in 2001. His research background includes high performance database systems, spatiotemporal data management, data quality control and business information systems. His current research interest focuses on big spatiotemporal data management and mining.

Xin Li is a Master candidate in Computer Science at Renmin University of China. He received the Bachelor degree from the School of Information at Renmin University of China in 2007. His current research focuses on spatial databases and geo-positioning.

Jiaheng Lu is a professor in Computer Science at the Renmin University of China. He received the Ph.D. degree from the National University of Singapore in 2007 and an M.S. from Shanghai Jiaotong University in 2001. His research interests span many aspects of data management. His current research focuses on developing an academic search engine, XML data management, and big data management. He has served in the organization and program committees for various conferences, including SIGMOD, VLDB, ICDE and CIKM.

Xiaofang Zhou received the BS and MS degrees in computer science from Nanjing University, China, in 1984 and 1987, respectively, and the PhD degree in computer science from The University of Queensland, Australia, in 1994. He is a professor of computer science with The University of Queensland. He is the head of the Data and Knowledge Engineering Research Division, School of Information Technology and Electrical Engineering. He is also a specially appointed Adjunct Professor at Soochow University, China. From 1994 to 1999, he was a senior research scientist and project leader in CSIRO. His research is focused on finding effective and efficient solutions to managing, integrating and analyzing very large amounts of complex data for business and scientific applications. His research interests include spatial and multimedia databases, high performance query processing, web information systems, data mining, and data quality management. He is a senior member of the IEEE.