

Best Keyword Cover Search


Ke Deng, Xin Li, Jiaheng Lu, and Xiaofang Zhou

Abstract—It is common that the objects in a spatial database (e.g., restaurants/hotels) are associated with keyword(s) to indicate their businesses/services/features. An interesting problem known as Closest Keywords search is to query objects, called a keyword cover, which together cover a set of query keywords and have the minimum inter-objects distance. In recent years, we observe the increasing availability and importance of keyword rating in object evaluation for better decision making. This motivates us to investigate a generic version of Closest Keywords search called Best Keyword Cover, which considers inter-objects distance as well as the keyword rating of objects. The baseline algorithm is inspired by the methods of Closest Keywords search and is based on exhaustively combining objects from different query keywords to generate candidate keyword covers. When the number of query keywords increases, the performance of the baseline algorithm drops dramatically as a result of the massive number of candidate keyword covers generated. To attack this drawback, this work proposes a much more scalable algorithm called keyword nearest neighbor expansion (keyword-NNE). Compared to the baseline algorithm, the keyword-NNE algorithm significantly reduces the number of candidate keyword covers generated. In-depth analysis and extensive experiments on real data sets have justified the superiority of our keyword-NNE algorithm.

Index Terms—Spatial database, Point of Interest, Keywords, Keyword Rating, Keyword Cover

• Ke Deng is with Huawei Noah's Ark Research Lab, Hong Kong. This work was completed when he was with the University of Queensland. E-mail: [email protected]
• Xin Li and Jiaheng Lu are with Renmin University, China. E-mail: lixin2007,[email protected]
• Xiaofang Zhou is with the University of Queensland, Australia. E-mail: [email protected]

1 INTRODUCTION

Driven by mobile computing, location-based services and the wide availability of extensive digital maps and satellite imagery (e.g., Google Maps and Microsoft Virtual Earth services), the spatial keyword search problem has attracted much attention recently [3, 4, 5, 6, 8, 10, 15, 16, 18].

[Fig. 1. BKC vs. mCK. Objects of keywords "Hotel", "Restaurant" and "Bar" are shown with their keyword ratings; BKC returns {t1, s1, c1} while mCK returns {t2, s2, c2}.]

In a spatial database, each tuple represents a spatial object which is associated with keyword(s) to indicate information such as its businesses/services/features. Given a set of query keywords, an essential task of spatial keyword search is to identify the spatial object(s) which are associated with keywords relevant to the query keywords and have desirable spatial relationships (e.g., close to each other and/or close to a query location). This problem has unique value in various applications because users' requirements are often expressed as multiple keywords. For example, a tourist who plans to visit a city may have particular shopping, dining and accommodation needs. It is desirable that all these needs can be satisfied without long-distance traveling.

Due to its remarkable value in practice, several variants of the spatial keyword search problem have been studied. The works [5, 6, 8, 15] aim to find a number of individual objects, each of which is close to a query location and whose associated keywords (also called a document) are very relevant to a set of query keywords (also called a query document). Document similarity (e.g., [14]) is applied to measure the relevance between two sets of keywords. Since it is likely that none of the individual objects is associated with all query keywords, the studies [4, 17, 18] are motivated to retrieve multiple objects, called a keyword cover, which together cover (i.e., are associated with) all query keywords and are close to each other. This problem is known as the m Closest Keywords (mCK) query in [17, 18]. The problem studied in [4] additionally requires the retrieved objects to be close to a query location.

This paper investigates a generic version of the mCK query, called the Best Keyword Cover (BKC) query, which considers inter-objects distance as well as keyword rating. It is motivated by the observation of the increasing availability and importance of keyword rating in decision making. Millions of businesses/services/features around the world have been rated by customers through online business review sites such as Yelp, Citysearch, ZAGAT and Dianping. For example, a restaurant is rated 65 out of 100 (ZAGAT.com) and a hotel is rated 3.9 out of 5 (hotels.com). According to a survey conducted in 2013 by Dimensional Research (www.dimensionalresearch.com),

an overwhelming 90 percent of respondents claimed that buying decisions are influenced by online business reviews/ratings. Due to the consideration of keyword rating, the solution of a BKC query can be very different from that of an mCK query. Figure 1 shows an example. Suppose the query keywords are "Hotel", "Restaurant" and "Bar". The mCK query returns {t2, s2, c2} since it considers only the distance between the returned objects. The BKC query returns {t1, s1, c1} since the keyword ratings of objects are considered in addition to the inter-objects distance. Compared to the mCK query, the BKC query supports more robust object evaluation and thus underpins better decision making.

This work develops two BKC query processing algorithms, baseline and keyword-NNE. The baseline algorithm is inspired by the mCK query processing methods [17, 18]. Both the baseline algorithm and the keyword-NNE algorithm are supported by indexing the objects with an R*-tree-like index, called the KRR*-tree. In the baseline algorithm, the idea is to combine nodes in higher hierarchical levels of the KRR*-trees to generate candidate keyword covers; the most promising candidate is then assessed in priority by combining its child nodes to generate new candidates. Even though the BKC query can be effectively resolved this way, when the number of query keywords increases, the performance drops dramatically as a result of the massive number of candidate keyword covers generated.

To overcome this critical drawback, we developed the much more scalable keyword nearest neighbor expansion (keyword-NNE) algorithm, which applies a different strategy. Keyword-NNE selects one query keyword as the principal query keyword. The objects associated with the principal query keyword are principal objects. For each principal object, the local best solution (known as the local best keyword cover (lbkc)) is computed. Among them, the lbkc with the highest evaluation is the solution of the BKC query. Given a principal object, its lbkc can be identified by simply retrieving a few nearby and highly rated objects in each non-principal query keyword (2-4 objects on average, as illustrated in the experiments). Compared to the baseline algorithm, the number of candidate keyword covers generated in the keyword-NNE algorithm is significantly reduced. The in-depth analysis reveals that the number of candidate keyword covers further processed in the keyword-NNE algorithm is optimal, and that processing each candidate keyword cover generates much fewer new candidate keyword covers than in the baseline algorithm.

The remainder of this paper is organized as follows. The problem is formally defined in section 2 and the related work is reviewed in section 3. After that, section 4 discusses the keyword rating R*-tree (KRR*-tree). The baseline algorithm is introduced in section 5 and the keyword-NNE algorithm is proposed in section 6. Section 7 discusses the situation of the weighted average of keyword ratings. An in-depth analysis is given in section 8. Then section 9 reports the experimental results. The paper is concluded in section 10.

2 PRELIMINARY

Given a spatial database, each object may be associated with one or multiple keywords. Without loss of generality, an object with multiple keywords is transformed into multiple objects located at the same location, each with a distinct single keyword. So, an object is in the form ⟨id, x, y, keyword, rating⟩ where x, y define the location of the object in a 2-dimensional geographical space. No data quality problem such as misspelling exists in keywords.

Definition 1 (Diameter): Let O be a set of objects {o_1, ..., o_n}. For o_i, o_j ∈ O, dist(o_i, o_j) is the Euclidean distance between o_i and o_j in the 2-dimensional geographical space. The diameter of O is:

  diam(O) = max_{o_i, o_j ∈ O} dist(o_i, o_j).    (1)

The score of O is a function with respect to not only the diameter of O but also the keyword ratings of the objects in O. Users may have different interests in the keyword ratings of objects. We first discuss the situation where a user expects to maximize the minimum keyword rating of objects in a BKC query; in section 7 we discuss another situation, where a user expects to maximize the weighted average of keyword ratings.

The linear interpolation function [5, 16] is used to obtain the score of O such that the score is a linear interpolation of the individually normalized diameter and the minimum keyword rating of O:

  O.score = score(A, B) = α(1 − A/max_dist) + (1 − α)(B/max_rating),    (2)
  A = diam(O),
  B = min_{o ∈ O} o.rating,

where B is the minimum keyword rating of objects in O and α (0 ≤ α ≤ 1) is an application-specific parameter. If α = 1, the score of O is solely determined by the diameter of O; in this case, the BKC query degrades to the mCK query. If α = 0, the score of O considers only the minimum keyword rating of objects in O. max_dist and max_rating are used to normalize the diameter and the keyword rating into [0,1] respectively: max_dist is the maximum distance between any two objects in the spatial database D, and max_rating is the maximum keyword rating of objects.

Lemma 1: The score has the monotone property.
Proof: Given a set of objects O_i, suppose O_j is a subset of O_i. The diameter of O_i must be no less than that of O_j, and the minimum keyword rating of objects in O_i must be no greater than that of objects in O_j. Therefore, O_i.score ≤ O_j.score.
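To make Equations (1)-(2) and the monotonicity of Lemma 1 concrete, the following is a minimal Java sketch of the score computation. The class and constant names (SpatialObject, MAX_DIST, MAX_RATING) are illustrative assumptions, not part of the paper:

  // Minimal sketch of the score function in Equations (1)-(2).
  // SpatialObject, MAX_DIST and MAX_RATING are illustrative names.
  import java.util.List;

  class SpatialObject {
      double x, y, rating;
      String keyword;
      SpatialObject(double x, double y, String keyword, double rating) {
          this.x = x; this.y = y; this.keyword = keyword; this.rating = rating;
      }
  }

  class Score {
      static final double MAX_DIST = 1.0;    // max distance between any two objects in D
      static final double MAX_RATING = 10.0; // max keyword rating in D

      static double dist(SpatialObject a, SpatialObject b) {
          return Math.hypot(a.x - b.x, a.y - b.y);
      }

      // diam(O): the maximum pairwise distance, Equation (1)
      static double diameter(List<SpatialObject> objects) {
          double d = 0;
          for (int i = 0; i < objects.size(); i++)
              for (int j = i + 1; j < objects.size(); j++)
                  d = Math.max(d, dist(objects.get(i), objects.get(j)));
          return d;
      }

      // O.score, Equation (2): linear interpolation of the normalized
      // diameter and the minimum keyword rating, weighted by alpha
      static double score(List<SpatialObject> objects, double alpha) {
          double a = diameter(objects);
          double b = objects.stream().mapToDouble(o -> o.rating).min().orElse(0);
          return alpha * (1 - a / MAX_DIST) + (1 - alpha) * (b / MAX_RATING);
      }
  }

With α = 1 the rating term vanishes and only the diameter matters (the mCK case); with α = 0 only the minimum rating matters. Adding an object to a set can only grow the diameter and shrink the minimum rating, which is exactly the monotonicity used by Lemma 1.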

Definition 2 (Keyword Cover): Let T be a set of keywords {k_1, ..., k_n} and O a set of objects {o_1, ..., o_n}. O is a keyword cover of T if each object in O is associated with one and only one keyword in T.

Definition 3 (Best Keyword Cover Query (BKC)): Given a spatial database D and a set of query keywords T, the BKC query returns a keyword cover O of T (O ⊂ D) such that O.score ≥ O′.score for any keyword cover O′ of T (O′ ⊂ D).

The notations used in this work are summarized in Table 1.

TABLE 1
Summary of Notations

  Notation        Interpretation
  D               A spatial database.
  T               A set of query keywords.
  O_k             The set of objects associated with keyword k.
  o_k             An object in O_k.
  KC_o            The set of keyword covers in each of which o is a member.
  kc_o            A keyword cover in KC_o.
  lbkc_o          The local best keyword cover of o, i.e., the keyword cover in KC_o with the highest score.
  o_k.nn^n_ki     o_k's n-th keyword nearest neighbor in query keyword k_i.
  KRR*_k-tree     The keyword rating R*-tree of O_k.
  N_k             A node of KRR*_k-tree.

3 RELATED WORK

Spatial Keyword Search

Recently, spatial keyword search has received considerable attention from the research community. Some existing works focus on retrieving individual objects by specifying a query consisting of a query location and a set of query keywords (also known as a document in some contexts). Each retrieved object is associated with keywords relevant to the query keywords and is close to the query location [3, 5, 6, 8, 10, 15, 16]. The similarity between documents (e.g., [14]) is applied to measure the relevance between two sets of keywords.

Since it is likely that no individual object is associated with all query keywords, some other works aim to retrieve multiple objects which together cover all query keywords [4, 17, 18]. While potentially a large number of object combinations satisfy this requirement, the research problem is that the retrieved objects must have desirable spatial relationships. In [4], the authors put forward the problem of retrieving objects which 1) cover all query keywords, 2) have minimum inter-objects distance and 3) are close to a query location. The works [17, 18] study a similar problem called m Closest Keywords (mCK). mCK aims to find objects which cover all query keywords and have the minimum inter-objects distance. Since no query location is given in mCK, the search space is not constrained by a query location. The problem studied in this paper is a generic version of the mCK query which also considers the keyword ratings of objects.

Access Methods

The approaches proposed by Cong et al. [5] and Li et al. [10] employ a hybrid index that augments the non-leaf nodes of an R/R*-tree with inverted indexes. The inverted index at each node refers to a pseudo-document that represents the keywords under the node. Therefore, in order to verify whether a node is relevant to a set of query keywords, the inverted index is accessed at each node to evaluate the matching between the query keywords and the pseudo-document associated with the node.

In [18], the bR*-tree was proposed, where a bitmap is kept for each node instead of a pseudo-document. Each bit corresponds to a keyword. If a bit is "1", it indicates that some object(s) under the node are associated with the corresponding keyword; "0" otherwise. A bR*-tree example is shown in Figure 2(a), where a non-leaf node N has four child nodes N1, N2, N3 and N4. The bitmaps of N1, N2, N4 are 111 and the bitmap of N3 is 101. Specifically, the bitmap 101 indicates that some objects under N3 are associated with the keywords "hotel" and "restaurant" respectively, and no object under N3 is associated with the keyword "bar". The bitmap allows nodes to be combined to generate candidate keyword covers. If a node contains all query keywords, this node is a candidate keyword cover. If multiple nodes together cover all query keywords, they constitute a candidate keyword cover. Suppose the query keywords are 111. When N is visited, its child nodes N1, N2, N3, N4 are processed. N1, N2, N4 are associated with all query keywords and N3 is associated with two query keywords. The candidate keyword covers generated are {N1}, {N2}, {N4}, {N1,N2}, {N1,N3}, {N1,N4}, {N2,N3}, {N2,N4}, {N3,N4}, {N1,N2,N3}, {N1,N3,N4} and {N2,N3,N4}. Among the candidate keyword covers, the one with the best evaluation is processed by combining its child nodes to generate more candidates. However, the number of candidates generated can be very large. Thus, the depth-first bR*-tree browsing strategy is applied in order to access the objects in the leaf nodes, and hence obtain the current best solution, as soon as possible. The current best solution is used to prune candidate keyword covers. In the same way, the remaining candidates are processed and the current best solution is updated once a better solution is identified. When all candidates have been pruned, the current best solution is returned as the answer to the mCK query.

In [17], a virtual bR*-tree based method is introduced to handle the mCK query, with the aim of handling data sets with a massive number of keywords. Compared to the method in [18], a different index structure is utilized: an R*-tree is used to index the locations of objects, and an inverted index is used to label the leaf nodes of the R*-tree associated with each keyword. Since only leaf nodes have keyword information, the mCK query is processed by browsing the index bottom-up. At first, m inverted lists corresponding to the query keywords are retrieved, and all objects from the same leaf node are fetched to construct a virtual node in memory. Clearly, it has a counterpart in the original R*-tree. Each time a virtual node is constructed, it is treated as a subtree which is browsed in the same way as in [18]. Compared to the bR*-tree, the number of nodes in the R*-tree has been greatly reduced such that I/O cost is saved.

[Fig. 2. (a) A bR*-tree with keyword bitmaps: a non-leaf node N has child nodes N1, N2, N3, N4 with bitmaps 111, 111, 101 and 111 over the keywords "hotel", "restaurant" and "bar". (b) The KRR*-tree for keyword "restaurant": objects are indexed by location (X, Y) plus a keyword rating dimension.]

As opposed to employing a single R*-tree embedded with keyword information, multiple R*-trees have been used to process multiway spatial join (MWSJ), which involves data of different keywords (or types). Given a number of R*-trees, one for each keyword, the MWSJ technique of Papadias et al. [13] (later extended by Mamoulis and Papadias [11]) uses the synchronous R*-tree (SRT) approach [2] and the window reduction (WR) approach [12]. Given two R*-tree indexed relations, SRT performs two-way spatial join via synchronous traversal of the two R*-trees, based on the property that if two intermediate R*-tree nodes do not satisfy the spatial join predicate, then the MBRs below them will not satisfy the spatial predicate either. WR uses window queries to identify spatial regions which may contribute to MWSJ results.

4 INDEXING KEYWORD RATINGS

To process the BKC query, we augment the R*-tree with one additional dimension to index keyword ratings. The keyword rating dimension and the spatial dimensions are inherently different measures with different ranges, so an adjustment is necessary. In this work, a 3-dimensional R*-tree called the keyword rating R*-tree (KRR*-tree) is used. The ranges of both the spatial and keyword rating dimensions are normalized into [0,1]. Suppose we need to construct a KRR*-tree over a set of objects D. Each object o ∈ D is mapped into a new space using the following mapping function:

  f : o(x, y, rating) → o(x/max_x, y/max_y, rating/max_rating),    (3)

where max_x, max_y and max_rating are the maximum values of objects in D on the x, y and keyword rating dimensions respectively. In the new space, the KRR*-tree can be constructed in the same way as a conventional 3-dimensional R*-tree. Each node N in a KRR*-tree is defined as N(x, y, r, lx, ly, lr), where x is the value of N on the x axis closest to the origin (0,0,0,0,0,0) and lx is the width of N on the x axis; y, ly and r, lr are defined analogously. Figure 2(b) gives an example illustrating the nodes of the KRR*-tree indexing the objects of keyword "restaurant".

In [17, 18], a single tree structure is used to index objects of different keywords. In a similar way as discussed above, the single tree can be extended with an additional dimension to index keyword ratings. A single tree structure suits the situation where most keywords are query keywords. For the above-mentioned example, all keywords, i.e., "hotel", "restaurant" and "bar", are query keywords. However, it is more frequent that only a small fraction of keywords are query keywords; in the experiments, for example, less than 5% of the keywords are query keywords. In this situation, a single tree approximates the spatial relationship between objects of a few specific keywords poorly. Therefore, multiple KRR*-trees are used in this work, one for each keyword (if the total number of objects associated with a keyword is very small, no index is needed for this keyword and these objects are simply processed one by one). The KRR*-tree for keyword k_i is denoted as KRR*_ki-tree.

Given an object, the rating of an associated keyword is typically the mean of the ratings given by a number of customers over a period of time. Changes do happen, but slowly. Even if a dramatic change occurs, the KRR*-tree is updated in the standard way of R*-tree updates.
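As a concrete illustration of Equation (3), the sketch below normalizes a set of objects into the unit cube before the KRR*-tree is built. The loader class and the array-based object representation are illustrative assumptions:

  // Sketch of the normalization in Equation (3): each object is mapped
  // into [0,1]^3 before being inserted into a standard 3-dimensional
  // R*-tree. Names are illustrative, not from the paper.
  import java.util.ArrayList;
  import java.util.List;

  class KrrTreeLoader {
      // Maps o(x, y, rating) -> o(x/max_x, y/max_y, rating/max_rating)
      static List<double[]> normalize(List<double[]> objects) { // entry: {x, y, rating}
          double maxX = 0, maxY = 0, maxRating = 0;
          for (double[] o : objects) {
              maxX = Math.max(maxX, o[0]);
              maxY = Math.max(maxY, o[1]);
              maxRating = Math.max(maxRating, o[2]);
          }
          List<double[]> mapped = new ArrayList<>();
          for (double[] o : objects)
              mapped.add(new double[] { o[0] / maxX, o[1] / maxY, o[2] / maxRating });
          return mapped; // bulk-load these points into the 3-dimensional R*-tree
      }
  }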

5 BASELINE ALGORITHM

The baseline algorithm is inspired by the mCK query processing methods [17, 18]. For mCK query processing, the method in [18] browses the index in a top-down manner while the method in [17] browses bottom-up. Given the same hierarchical index structure, the top-down browsing manner typically performs better than the bottom-up one, since the search in lower hierarchical levels is always guided by the search results in the higher hierarchical levels. However, a significant advantage of the method in [17] over the method in [18] has been reported. This is because of the different index structures applied. Both of them use a single tree structure to index data objects of different keywords, but the number of nodes of the index in [17] has been greatly reduced, to save I/O cost, by keeping the keyword information separately in an inverted index. Since only leaf nodes and their keyword information are maintained in the inverted index, the bottom-up index browsing manner is used. When designing the baseline algorithm for BKC query processing, we take the advantages of both methods [17, 18]: first, we apply multiple KRR*-trees which contain no keyword information in their nodes, such that the number of nodes of the index is not more than that of the index in [17]; second, the top-down index browsing method can be applied since each keyword has its own index.

Suppose the KRR*-trees, one for each keyword, have been constructed. Given a set of query keywords T = {k_1, .., k_n}, the child nodes of the root of each KRR*_ki-tree (1 ≤ i ≤ n) are retrieved and combined to generate candidate keyword covers. Given a candidate keyword cover O = {N_k1, ..., N_kn}, where N_ki is a node of KRR*_ki-tree, its score is

  O.score = score(A, B),    (4)
  A = max_{N_i, N_j ∈ O} dist(N_i, N_j),
  B = min_{N ∈ O} N.maxrating,

where N.maxrating is the maximum value of the objects under N in the keyword rating dimension, and dist(N_i, N_j) is the minimum Euclidean distance between N_i and N_j in the 2-dimensional geographical space defined by the x and y dimensions.

Lemma 2: Given two keyword covers O and O′, where O′ consists of objects {o_k1, .., o_kn} and O consists of nodes {N_k1, .., N_kn}, if o_ki is under N_ki in KRR*_ki-tree for 1 ≤ i ≤ n, it is true that O′.score ≤ O.score.

Algorithm 1: Baseline(T, Root)
  Input: A set of query keywords T, the root nodes of all KRR*-trees Root.
  Output: Best keyword cover.
  1  bkc ← ∅;
  2  H ← Generate_Candidate(T, Root, bkc);
  3  while H is not empty do
  4    can ← the candidate in H with the highest score;
  5    Remove can from H;
  6    Depth_First_Tree_Browsing(H, T, can, bkc);
  7    foreach candidate ∈ H do
  8      if candidate.score ≤ bkc.score then
  9        remove candidate from H;
  10 return bkc;

Algorithm 2: Depth_First_Tree_Browsing(H, T, can, bkc)
  Input: A set of query keywords T, a candidate can, the candidate set H, and the current best solution bkc.
  1  if can consists of leaf nodes then
  2    S ← objects in can;
  3    bkc′ ← the keyword cover with the highest score identified in S;
  4    if bkc.score < bkc′.score then
  5      bkc ← bkc′;
  6  else
  7    New_Cans ← Generate_Candidate(T, can, bkc);
  8    Replace can by New_Cans in H;
  9    can ← the candidate in New_Cans with the highest score;
  10   Depth_First_Tree_Browsing(H, T, can, bkc);

Algorithm 3: Generate_Candidate(T, can, bkc)
  Input: A set of query keywords T, a candidate can, the current best solution bkc.
  Output: A set of new candidates.
  1  New_Cans ← ∅;
  2  COM ← combine the child nodes of can to generate keyword covers;
  3  foreach com ∈ COM do
  4    if com.score > bkc.score then
  5      New_Cans ← New_Cans ∪ {com};
  6  return New_Cans;

Algorithm 1 shows the pseudo-code of the baseline algorithm. Given a set of query keywords T, it first generates candidate keyword covers using the Generate_Candidate function, which combines the child nodes of the roots of the KRR*_ki-trees for all k_i ∈ T (line 2). These candidates are maintained in a heap H. Then, the candidate with the highest score in H is selected and its child nodes are combined using the Generate_Candidate function to generate more candidates. Since the number of candidates can be very large, the depth-first KRR*_ki-tree browsing strategy is applied to access the leaf nodes as soon as possible (line 6). The first candidate consisting of objects (not nodes of KRR*-trees) is the current best solution, denoted as bkc, which is an intermediate solution. According to Lemma 2, the candidates in H are pruned if they have a score less than bkc.score (line 8). The remaining candidates are processed in the same way, and bkc is updated whenever a better intermediate solution is found. Once no candidate remains in H, the algorithm terminates by returning the current bkc as the answer to the BKC query.

In the Generate_Candidate function, it is unnecessary to actually generate all possible keyword covers of the input nodes (or objects). In practice, the keyword covers are generated by incrementally combining individual nodes (or objects). The example in Figure 3 shows all possible combinations of input nodes incrementally generated bottom-up; there are three keywords k1, k2 and k3, and each keyword has two nodes. Due to the monotone property in Lemma 1, the idea of the Apriori algorithm [1] can be applied. Initially, each node is a combination with score = ∞. The combination with the highest score is always processed in priority, combining one more input node in order to cover a keyword which is not covered yet. If a combination has a score less than bkc.score, any superset of it must have a score less than bkc.score; thus, it is unnecessary to generate the superset. For example, if {N2_k2, N2_k3}.score < bkc.score, any superset of {N2_k2, N2_k3} must have a score less than bkc.score, so it is not necessary to generate {N2_k2, N2_k3, N1_k1} and {N2_k2, N2_k3, N2_k1}.
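The pruning logic of Generate_Candidate and Figure 3 can be sketched as follows. This is a simplified, level-wise variant of the incremental combination described above (the paper expands the highest-scoring combination first); the list-based candidate representation and the score oracle are illustrative assumptions:

  // Sketch of Apriori-style candidate generation (Algorithm 3 / Figure 3).
  // By Lemma 1 the score is monotone: a superset never scores higher than
  // its subsets, so a low-scoring partial combination prunes all supersets.
  import java.util.ArrayList;
  import java.util.List;

  class CandidateGenerator {
      interface ScoreOracle { double score(List<Object> combination); }

      // entries.get(i) holds the child nodes (or objects) of keyword k_i
      static List<List<Object>> generate(List<List<Object>> entries,
                                         double bkcScore, ScoreOracle scoreOf) {
          List<List<Object>> partial = new ArrayList<>();
          partial.add(new ArrayList<>());            // empty combination, score = infinity
          for (List<Object> keywordNodes : entries) {
              List<List<Object>> grown = new ArrayList<>();
              for (List<Object> comb : partial)
                  for (Object node : keywordNodes) {
                      List<Object> c = new ArrayList<>(comb);
                      c.add(node);
                      // prune: no superset of c can beat bkc (Lemma 1)
                      if (scoreOf.score(c) > bkcScore) grown.add(c);
                  }
              partial = grown;
          }
          return partial;  // candidate keyword covers, one node per keyword
      }
  }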
6 KEYWORD NEAREST NEIGHBOR EXPANSION (KEYWORD-NNE)

Using the baseline algorithm, the BKC query can be effectively resolved. However, it is based on exhaustively

combining objects (or their MBRs). Even though pruning techniques have been explored, it has been observed that the performance drops dramatically when the number of query keywords increases, because of the fast increase of candidate keyword covers generated. This motivates us to develop a different algorithm called keyword nearest neighbor expansion (keyword-NNE).

[Fig. 3. Generating candidates: the lattice of combinations of the input nodes N1_k1, N2_k1, N1_k2, N2_k2, N1_k3, N2_k3, built bottom-up from single nodes to full keyword covers.]

We focus on a particular query keyword, called the principal query keyword. The objects associated with the principal query keyword are called principal objects. Let k be the principal query keyword. The set of principal objects is denoted as O_k.

Definition 4 (Local Best Keyword Cover): Given a set of query keywords T and the principal query keyword k ∈ T, the local best keyword cover of a principal object o_k is

  lbkc_ok = {kc_ok | kc_ok ∈ KC_ok, kc_ok.score = max_{kc ∈ KC_ok} kc.score},    (5)

where KC_ok is the set of keyword covers in each of which the principal object o_k is a member.

For each principal object o_k ∈ O_k, lbkc_ok is identified. Among all principal objects, the lbkc_ok with the highest score is called the global best keyword cover (GBKC_k).

Lemma 3: GBKC_k is the solution of the BKC query.
Proof: Assume the solution of the BKC query is a keyword cover kc other than GBKC_k, i.e., kc.score > GBKC_k.score. Let o_k be the principal object in kc. By definition, lbkc_ok.score ≥ kc.score, and GBKC_k.score ≥ lbkc_ok.score. So, GBKC_k.score ≥ kc.score must be true. This contradicts the assumption that the BKC is a keyword cover kc other than GBKC_k.

The sketch of the keyword-NNE algorithm is as follows:

Sketch of Keyword-NNE Algorithm
  Step 1. One query keyword k ∈ T is selected as the principal query keyword;
  Step 2. For each principal object o_k ∈ O_k, lbkc_ok is computed;
  Step 3. In O_k, GBKC_k is identified;
  Step 4. Return GBKC_k.

Conceptually, any query keyword can be selected as the principal query keyword. Since computing lbkc is required for each principal object, the query keyword with the minimum number of objects is selected as the principal query keyword in order to achieve high performance.

6.1 LBKC Computation

Given a principal object o_k, lbkc_ok consists of o_k and the objects in each non-principal query keyword which are close to o_k and have high keyword ratings. This motivates us to compute lbkc_ok by incrementally retrieving the keyword nearest neighbors of o_k.

6.1.1 Keyword Nearest Neighbor

Definition 5 (Keyword Nearest Neighbor (Keyword-NN)): Given a set of query keywords T, the principal query keyword is k ∈ T and a non-principal query keyword is k_i ∈ T/{k}. O_k is the set of principal objects and O_ki is the set of objects of keyword k_i. The keyword nearest neighbor of a principal object o_k ∈ O_k in keyword k_i is o_ki ∈ O_ki iff {o_k, o_ki}.score ≥ {o_k, o′_ki}.score for all o′_ki ∈ O_ki.

The first keyword-NN of o_k in keyword k_i is denoted as o_k.nn^1_ki, the second keyword-NN as o_k.nn^2_ki, and so on. These keyword-NNs can be retrieved by browsing KRR*_ki-tree. Let N_ki be a node in KRR*_ki-tree:

  {o_k, N_ki}.score = score(A, B),    (6)
  A = dist(o_k, N_ki),
  B = min(N_ki.maxrating, o_k.rating),

where dist(o_k, N_ki) is the minimum distance between o_k and N_ki in the 2-dimensional geographical space defined by the x and y dimensions, and N_ki.maxrating is the maximum value of N_ki in the keyword rating dimension.

Lemma 4: For any object o_ki under node N_ki in KRR*_ki-tree,

  {o_k, N_ki}.score ≥ {o_k, o_ki}.score.    (7)

Proof: It is a special case of Lemma 2.

To retrieve the keyword-NNs of a principal object o_k in keyword k_i, KRR*_ki-tree is browsed with the best-first strategy [9]. The root node of KRR*_ki-tree is visited first, keeping its child nodes in a heap H. For each node N_ki ∈ H, {o_k, N_ki}.score is computed. The node in H with the highest score is replaced by its child nodes. This operation is repeated until an object o_ki (not a KRR*_ki-tree node) is visited. {o_k, o_ki}.score is denoted as current_best and o_ki is the current best object. According to Lemma 4, any node N_ki ∈ H is pruned if {o_k, N_ki}.score ≤ current_best. When H is empty, the current best object is o_k.nn^1_ki. In a similar way, o_k.nn^j_ki (j > 1) can be identified.
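The best-first traversal of section 6.1.1 can be sketched in Java as follows; by Lemma 4, a node's score upper-bounds the score of every object under it, so the first object popped from the max-heap is the next keyword-NN, which is equivalent to the pruning formulation above. The Entry interface is an illustrative assumption:

  // Sketch of best-first keyword-NN retrieval over KRR*_ki-tree (section 6.1.1).
  // An Entry is either an internal node or an object; pair scores follow Eq. (6).
  import java.util.Comparator;
  import java.util.List;
  import java.util.PriorityQueue;

  class KeywordNN {
      interface Entry {
          boolean isObject();
          List<Entry> children();                 // child entries, if internal node
          double scoreWith(Object principal);     // {o_k, N_ki}.score, Equation (6)
      }

      // Returns the best-scoring object of keyword k_i for the principal
      // object; keeping the heap alive between calls yields nn^2, nn^3, ...
      static Entry nextKeywordNN(Entry root, Object principal) {
          PriorityQueue<Entry> heap = new PriorityQueue<>(
              Comparator.comparingDouble((Entry e) -> -e.scoreWith(principal)));
          heap.add(root);
          while (!heap.isEmpty()) {
              Entry head = heap.poll();
              if (head.isObject()) return head;   // best-first guarantee (Lemma 4)
              heap.addAll(head.children());       // expand the most promising node
          }
          return null;                            // no objects of keyword k_i remain
      }
  }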
6.1.2 lbkc Computing Algorithm

Computing lbkc_ok means incrementally retrieving the keyword-NNs of o_k in each non-principal query keyword. An example is shown in Figure 4, where the query keywords

are "bar", "restaurant" and "hotel". The principal query keyword is "bar". Suppose we are computing lbkc_t3. The first keyword-NNs of t3 in "restaurant" and "hotel" are c2 and s3 respectively. A set S is used to keep t3, s3, c2. Let kc be the keyword cover in S which has the highest score (the idea of the Apriori algorithm can be used, see section 5). After step 1, kc.score = 0.3. In step 2, "hotel" is selected and the second keyword-NN of t3 in "hotel" is retrieved, i.e., s2. Since {t3, s2}.score < kc.score, s2 can be pruned, and more importantly all objects not yet accessed in "hotel" can be pruned according to Lemma 5. In step 3, the second keyword-NN of t3 in "restaurant" is retrieved, i.e., c3. Since {t3, c3}.score > kc.score, c3 is inserted into S. As a result, kc.score is updated to 0.4. Then, the third keyword-NN of t3 in "restaurant" is retrieved, i.e., c4. Since {t3, c4}.score < kc.score, c4 and all objects not yet accessed in "restaurant" can be pruned according to Lemma 5. At this point, the current kc is lbkc_t3.

[Fig. 4. An example of lbkc computation: incrementally retrieving the keyword-NNs of t3.
  Step  Keyword-NN retrieved                         S                        kc.score
  1     {t3, c2}.score = 0.7, {t3, s3}.score = 0.8   t3, s3, c2               0.3
  2     {t3, s2}.score = 0.28                        t3, s3, s2, c2           0.3
  3     {t3, c3}.score = 0.6                         t3, s3, s2, c2, c3       0.4
  4     {t3, c4}.score = 0.35                        t3, s3, s2, c2, c3, c4   0.4]

Lemma 5: If kc.score > {o_k, o_k.nn^t_ki}.score, then o_k.nn^t_ki and o_k.nn^t′_ki (t′ > t) must not be in lbkc_ok.
Proof: By definition, kc.score ≤ lbkc_ok.score. Since {o_k, o_k.nn^t_ki}.score < kc.score, we have {o_k, o_k.nn^t_ki}.score < lbkc_ok.score, and in turn {o_k, o_k.nn^t′_ki}.score < lbkc_ok.score. If o_k.nn^t_ki were in lbkc_ok, then {o_k, o_k.nn^t_ki}.score ≥ lbkc_ok.score would hold according to Lemma 1, and similarly for o_k.nn^t′_ki. Thus, o_k.nn^t_ki and o_k.nn^t′_ki must not be in lbkc_ok.

For each non-principal query keyword k_i, after retrieving the first t keyword-NNs of o_k in keyword k_i, we use k_i.score to denote {o_k, o_k.nn^t_ki}.score. For example in Figure 4, "restaurant".score is 0.7, 0.6 and 0.35 after retrieving the 1st, 2nd and 3rd keyword-NN of t3 in "restaurant". From Lemma 5, we obtain:

Lemma 6: lbkc_ok ≡ kc once kc.score > max_{k_i ∈ T/{k}} (k_i.score).

Algorithm 4: Local_Best_Keyword_Cover(o_k, T)
  Input: A set of query keywords T, a principal object o_k.
  Output: lbkc_ok.
  1  foreach non-principal query keyword k_i ∈ T do
  2    S ← retrieve o_k.nn^1_ki;
  3    k_i.score ← {o_k, o_k.nn^1_ki}.score;
  4    k_i.n ← 1;
  5  kc ← the best keyword cover identified in S;
  6  while T ≠ ∅ do
  7    Find k_i ∈ T/{k} such that k_i.score = max_{k_j ∈ T/{k}}(k_j.score);
  8    k_i.n ← k_i.n + 1;
  9    S ← S ∪ retrieve o_k.nn^{k_i.n}_ki;
  10   k_i.score ← {o_k, o_k.nn^{k_i.n}_ki}.score;
  11   temp_kc ← the best keyword cover identified in S;
  12   if temp_kc.score > kc.score then
  13     kc ← temp_kc;
  14   foreach k_i ∈ T/{k} do
  15     if k_i.score ≤ kc.score then
  16       remove k_i from T;
  17 return kc;

Algorithm 4 shows the pseudo-code of the lbkc_ok computing algorithm. For each non-principal query keyword k_i, the first keyword-NN of o_k is retrieved and k_i.score = {o_k, o_k.nn^1_ki}.score. The retrieved objects are kept in S, and the best keyword cover kc in S is identified using the Generate_Candidate function in Algorithm 3: the objects of different keywords are combined, and each time the most promising combination is selected for further combination until the best keyword cover is identified. When the second keyword-NN of o_k in k_i is retrieved, k_i.score is updated to {o_k, o_k.nn^2_ki}.score, and so on. Each time, one non-principal query keyword is selected and its next keyword-NN is searched. Note that we always select the keyword k_i ∈ T/{k} where k_i.score = max_{k_j ∈ T/{k}}(k_j.score) in order to minimize the number of keyword-NNs retrieved (line 7). After the next keyword-NN of o_k in this keyword is retrieved, it is inserted into S and kc is updated. If k_i.score < kc.score, all objects in k_i can be pruned by deleting k_i from T according to Lemma 5. When T is empty, kc is returned as lbkc_ok according to Lemma 6.

6.2 Keyword-NNE Algorithm

In the keyword-NNE algorithm, the principal objects are processed in blocks instead of individually. Let k be the principal query keyword. The principal objects are indexed using KRR*_k-tree. Given a node N_k in KRR*_k-tree, also known as a principal node, the local best keyword cover of

N_k, denoted lbkc_Nk, consists of N_k and the corresponding nodes of N_k in each non-principal query keyword.

Definition 6 (Corresponding Node): Let N_k be a node of KRR*_k-tree at hierarchical level i. Given a non-principal query keyword k_i, the corresponding nodes of N_k are the nodes in KRR*_ki-tree at hierarchical level i.

The root of a KRR*-tree is at hierarchical level 1, its child nodes are at hierarchical level 2, and so on. For example, if N_k is a node at hierarchical level 4 in KRR*_k-tree, the corresponding nodes of N_k in keyword k_i are the nodes at hierarchical level 4 in KRR*_ki-tree. From the corresponding nodes, the keyword-NNs of N_k are retrieved incrementally for computing lbkc_Nk.

Lemma 7: If a principal object o_k is an object under a principal node N_k in KRR*_k-tree, then

  lbkc_Nk.score ≥ lbkc_ok.score.

Proof: Suppose lbkc_Nk = {N_k, N_k1, .., N_kn} and lbkc_ok = {o_k, o_k1, .., o_kn}. For each non-principal query keyword k_i, o_ki is under a corresponding node of N_k, say N′_ki. Note that N′_ki may or may not be in lbkc_Nk. By definition, lbkc_Nk.score ≥ {N_k, N′_k1, .., N′_kn}.score. According to Lemma 2, {N_k, N′_k1, .., N′_kn}.score ≥ lbkc_ok.score. So, we have lbkc_Nk.score ≥ lbkc_ok.score. The lemma is proved.

Algorithm 5: Keyword_NNE(T, D)
  Input: A set of query keywords T, a spatial database D.
  Output: Best keyword cover.
  1  bkc.score ← 0;
  2  k ← select the principal query keyword from T;
  3  H ← child nodes of the root of KRR*_k-tree;
  4  foreach N_k ∈ H do
  5    Compute lbkc_Nk.score;
  6  H.head ← N_k ∈ H with maximum lbkc_Nk.score;
  7  while H ≠ ∅ do
  8    while H.head is a node in KRR*_k-tree do
  9      N ← child nodes of H.head;
  10     foreach N_k in N do
  11       Compute lbkc_Nk.score;
  12       Insert N_k into H;
  13     Remove H.head from H;
  14     H.head ← N_k ∈ H with maximum lbkc_Nk.score;
     /* H.head is a principal object (i.e., not a node in KRR*_k-tree) */
  15   o_k ← H.head;
  16   Compute lbkc_ok.score;
  17   if bkc.score < lbkc_ok.score then
  18     bkc ← lbkc_ok;
  19   foreach N_k in H do
  20     if lbkc_Nk.score ≤ bkc.score then
  21       Remove N_k from H;
  22 return bkc;

The pseudo-code of the keyword-NNE algorithm is presented in Algorithm 5. The keyword-NNE algorithm starts by selecting a principal query keyword k ∈ T (line 2). Then, the root node of KRR*_k-tree is visited, keeping its child nodes in a heap H. For each node N_k in H, lbkc_Nk.score is computed (line 5). In H, the one with the maximum score, denoted as H.head, is processed. If H.head is a node of KRR*_k-tree (lines 8-14), it is replaced in H by its child nodes, and for each child node N_k we compute lbkc_Nk.score. Correspondingly, H.head is updated. If H.head is a principal object o_k rather than a node in KRR*_k-tree (lines 15-21), lbkc_ok is computed. If lbkc_ok.score is greater than the score of the current best solution bkc (bkc.score = 0 initially), bkc is updated to be lbkc_ok. For any N_k ∈ H, N_k is pruned if lbkc_Nk.score ≤ bkc.score, since lbkc_ok.score ≤ lbkc_Nk.score for every o_k under N_k in KRR*_k-tree according to Lemma 7. Once H is empty, bkc is returned as the answer to the BKC query.
7 WEIGHTED AVERAGE OF KEYWORD RATINGS

To this point, the minimum keyword rating of the objects in O has been used in O.score. However, it is unsurprising that a user may prefer the weighted average of the keyword ratings of the objects in O to measure O.score:

  O.score = α × (1 − diam(O)/max_dist) + (1 − α) × W_Average(O)/max_rating,    (8)

where W_Average(O) is defined as:

  W_Average(O) = (Σ_{o_ki ∈ O} w_ki · o_ki.rating) / |O|,    (9)

where w_ki is the weight associated with the query keyword k_i and Σ_{k_i ∈ T} w_ki = 1. For example, a user may give a higher weight to "hotel" but a lower weight to "restaurant" in a BKC query. Given the score function in Equation (8), the baseline algorithm and the keyword-NNE algorithm can be used to process the BKC query with minor modification. The core is to maintain the properties in Lemma 1 and Lemma 2, which are the foundation of the pruning techniques in the baseline algorithm and the keyword-NNE algorithm.

However, the property in Lemma 1 is invalid given the score function defined in Equation (9). To maintain this property, if a combination does not cover a query keyword k_i, the combination is modified by inserting a virtual object associated with k_i. This virtual object does not change the diameter of the combination, but it carries the maximum rating of k_i (for a combination of nodes, a virtual node

is inserted in a similar way). W_Average(O) is redefined as W_Average*(O):

  W_Average*(O) = (E + F) / |T|,    (10)
  E = Σ_{o_ki ∈ O} w_ki · o_ki.rating,
  F = Σ_{k_j ∈ T/O.T} w_kj · O_kj.maxrating,

where T/O.T is the set of query keywords not covered by O, and O_kj.maxrating is the maximum rating of the objects in O_kj. For example in Figure 1, suppose the query keywords are "restaurant", "hotel" and "bar". For a combination O = {t1, c1}, W_Average(O) involves w_t · t1.rating + w_c · c1.rating, while W_Average*(O) involves w_t · t1.rating + w_c · c1.rating + w_s · max_rating_s, where w_t, w_c and w_s are the weights assigned to "bar", "restaurant" and "hotel" respectively, and max_rating_s is the highest keyword rating of the objects in "hotel".

Given O.score with W_Average*(O), it is easy to prove that the property in Lemma 1 is valid. Note that the purpose of W_Average*(O) is to enable the pruning techniques in the baseline algorithm and the keyword-NNE algorithm; it does not affect the correctness of the algorithms. In addition, the property in Lemma 2 is valid no matter which of W_Average*(O) and W_Average(O) is used in O.score.
8 ANALYSIS

To help the analysis, we assume a special baseline algorithm, BF-baseline, which is similar to the baseline algorithm except that the best-first KRR*-tree browsing strategy is applied. For each query keyword, the child nodes of the KRR*-tree root are retrieved. The child nodes from different query keywords are combined to generate candidate keyword covers (in the same way as in the baseline algorithm, see section 5), which are stored in a heap H. The candidate kc ∈ H with the maximum score is processed by retrieving the child nodes of kc; these child nodes are then combined to generate more candidates, which replace kc in H. This process continues until a keyword cover consisting of objects only is obtained. This keyword cover is the current best solution bkc. Any candidate kc ∈ H is pruned if kc.score ≤ bkc.score. The remaining candidates in H are processed in the same way. Once H is empty, the current bkc is returned as the answer to the BKC query. In the BF-baseline algorithm, a candidate keyword cover kc is further processed, by retrieving its child nodes and combining them to generate more candidates, only if kc.score > BKC.score.

[Fig. 5. BF-baseline vs. keyword-NNE: for a principal node N_k (and similarly for a child node cN_k), BF-baseline maintains the whole group G_Nk of candidate keyword covers, split into G¹_Nk (covers such as kc1, kc2 with score > BKC.score) and G²_Nk (covers such as kc3, kc4 with score ≤ BKC.score), while keyword-NNE keeps only lbkc_Nk (resp. lbkc_cNk).]

8.1 Baseline

However, the BF-baseline algorithm is not feasible in practice. The main reason is that BF-baseline requires maintaining H in memory, and the peak size of H can be very large because of the exhaustive combination performed before the first current best solution bkc is obtained. To relieve this memory bottleneck, the depth-first browsing strategy is applied in the baseline algorithm such that the current best solution is obtained as soon as possible (see section 5). Compared to the best-first browsing strategy, which is globally optimal, the depth-first browsing strategy is a greedy, locally optimal approach. As a consequence, a candidate keyword cover kc is further processed whenever kc.score > bkc.score, and bkc.score only gradually increases from 0 to BKC.score over the course of the baseline algorithm. Therefore, the candidate keyword covers which are further processed in the baseline algorithm can be many more than those in the BF-baseline algorithm.

Given a candidate keyword cover kc, it is further processed in the same way in both the baseline algorithm and the BF-baseline algorithm, i.e., by retrieving the child nodes of kc and combining them to generate more candidates using the Generate_Candidate function in Algorithm 3. Since the candidate keyword covers further processed in the baseline algorithm can be many more than those in the BF-baseline algorithm, the total number of candidate keyword covers generated in the baseline algorithm can be much more than that in the BF-baseline algorithm.

Note that this analysis captures the key characteristics of the baseline algorithm in BKC query processing, which are inherited from the methods for mCK query processing [17, 18]. The analysis is still valid if the methods [17, 18] are directly extended to process the BKC query as introduced in section 4.

8.2 Keyword-NNE

In the keyword-NNE algorithm, the best-first browsing strategy is applied as in BF-baseline, but the large memory requirement is avoided. For better explanation, we can imagine

all candidate keyword covers generated in BF -baseline Proof: Due to the best-first browsing strategy, lbkcN k
algorithm are grouped into independent groups. Each group is further processed in keyword-NNE algorithm only if
is associated with one principal node (or object). That is, lbkcN k .score > BKC.score. In any algorithm A ∈ A, a
the candidate keyword covers fall in the same group if they number of candidate keyword covers need to be generated
have the same principal node (or object). Given a principal and assessed since no combination of objects (or nodes
node Nk , let GN k be the associated group. The example of KRR*-trees) has been pre-processed. Given a node (or
in Figure 5 shows GN k where some keyword covers such object) N , the candidate keyword covers generated can be
as kc1 , kc2 have score greater than BKC.score, denoted organized in a group if they contain N . In this group, if
as G1N k , and some keyword covers such as kc3 , kc4 have one keyword cover has score greater than BKC.score, the
score not greater than BKC.score, denoted as G2N k . In possibility exists that the solution of BKC query is related
BF -baseline algorithm, GN k is maintained in H before the to this group. In this case, A needs to process at least one
first current best solution is obtained, and every keyword keyword cover in this group. If A fails to do this, it may
cover in G1N k needs to be further processed. lead to an incorrect solution. That is, no algorithm in A can
In keyword-NNE algorithm, the keyword cover in GN k process less candidate keyword covers than keyword-NNE
with the highest score, i.e., lbkcN k , is identified and main- algorithm.
tained in memory. That is, each principal node (or object)
keeps its lbkc only. The total number of principal nodes (or 8.2.2 Candidate Keyword Covers Processing
objects) is O(n log n) where n is the number of principal Every candidate keyword cover in G1N k is further processed
objects. So, the memory requirement for maintaining H is in BF -baseline algorithm. In the example in Figure 5, kc1
O(n log n). The (almost) linear memory requirement makes is further processed, so does every kc ∈ G1N k . Let us look
the best-first browsing strategy practical in keyword-NNE closer at kc1 = {Nk , Nk1 , Nk2 } processing. As introduced
algorithm. Due to the best-first browsing strategy, lbkcN k in section 4, each node N in KRR*-tree is defined as
is further processed in keyword-NNE algorithm only if N (x, y, r, lx , ly , lr ) which can be represented with 48 bytes.
lbkcN k .score > BKC.score. If the disk pagesize is 4096 bytes, the reasonable fan-out
of KRR*-tree is 40-50. That is, each node in kc1 (i.e., Nk ,
8.2.1 Instance Optimality Nk1 and Nk2 ) has 40-50 child nodes. In kc1 processing in
The instance optimality [7] corresponds to the optimality BF -baseline algorithm, these child nodes are combined to
in every instance, as opposed to just the worst case or the generate candidate keyword covers using Algorithm 3.
average case. There are many algorithms that are optimal In keyword-NNE algorithm, one and only one keyword
in a worst-case sense, but are not instance optimal. An cover in G1N k , i.e., lbkcN k , is further processed. For
example is binary search. In the worst case, binary search each child node cNk of Nk , lbkccN k is computed. For
is guaranteed to require no more than log N probes for computing lbkccN k , a number of keyword-NNs of cNk
N data items. By linear search which scans through the are retrieved and combined to generate more candidate
sequence of data items, N probes are required in the worst keyword covers using Algorithm 3. The experiments on real
case. However, binary search is not better than linear search data sets illustrate that only 2-4 keyword-NNs in average in
in all instances. When the search item is in the very first each non-principal query keyword are retrieved in lbkccN k
position of the sequence, a positive answer can be obtained computation.
in one probe and a negative answer in two probes using When further processing a candidate keyword cover,
linear search. The binary search may still require log N keyword-NNE algorithm typically generates much less new
probes. candidate keyword covers compared to BF -baseline al-
Instance optimality can be formally defined as follows: gorithm. Since the number of candidate keyword covers
for a class of correct algorithms A and a class of valid input further processed in keyword-NNE algorithm is optimal
D to the algorithms, cost(A, D) represents the amount of a (Theorem 1), the number of keyword covers generated in
resource consumed by running algorithm A ∈ A on input BF -baseline algorithm is much more than that in keyword-
D ∈ D. An algorithm B ∈ A is instance optimal over A NNE algorithm. In turn, we conclude that the number of
and D if cost(B, D) = O(cost(A, D)) for ∀A ∈ A and keyword covers generated in baseline algorithm is much
∀D ∈ D. This cost could be running time of algorithm A more than that in keyword-NNE algorithm. This conclusion
over input D. is independent of the principal query keyword since the
Theorem 1: Let D be the class of all possible spatial analysis does not apply any constraint on the selection
databases where each tuple is a spatial object and is strategy of principal query keyword.
associated with a keyword. Let A be the class of any
correct BKC processing algorithm over D ∈ D. For
all algorithms in A, multiple KRR*-trees, each for one 9 E XPERIMENT
keyword, are explored by combining nodes at the same In this section we experimentally evaluate keyword-NNE
hierarchical level until leaf node, and no combination of algorithm and the baseline algorithm. We use four real data
objects (or nodes of KRR*-trees) has been pre-processed, sets, namely Yelp, Yellow Page, AU, and DE. Specifically,
keyword-NNE algorithm is optimal in terms of the number Yelp is a data set extracted from Yelp Academic Dataset
of candidate keyword covers which are further processed. (www.yelp.com) which contains 7707 POIs (i.e., points

of interest, which are equivalent to the objects in this work) with 27 keywords, where the average, maximum and minimum numbers of POIs per keyword are 285, 1353 and 120 respectively. Yellow Page is a data set obtained from yellowpage.com.au in Sydney which contains 30444 POIs with 26 keywords, where the average, maximum and minimum numbers of POIs per keyword are 1170, 10284 and 154 respectively. All POIs in Yelp have been rated by customers from 1 to 10. About half of the POIs in Yellow Page have been rated by Yelp; the unrated POIs are assigned the average rating 5. AU and US are extracted from a public source (http://s3.amazonaws.com/simplegeo-public/places_dump_20110628.zip). AU contains 678581 POIs in Australia with 187 keywords, where the average, maximum and minimum numbers of POIs per keyword are 3728, 53956 and 403 respectively. US contains 1541124 POIs with 237 keywords, where the average, maximum and minimum numbers of POIs per keyword are 6502, 122669 and 400. In AU and US, keyword ratings from 1 to 10 are randomly assigned to POIs; the ratings follow a normal distribution with mean µ = 5 and standard deviation σ = 1. The distribution of POIs over keywords is illustrated in Figure 6. For each data set, the POIs of each keyword are indexed using a KRR*-tree.
[Figure 6: four panels showing the keyword-size distribution of each data set, with averages 285 (Yelp), 1170 (Yellow Page), 3728 (AU) and 6502 (US).]
Fig. 6. The distribution of keyword size in test data sets.

[Figure 7: (a) Yellow Page and Yelp; (b) AU and US.]
Fig. 7. Baseline, Virtual bR*-tree and bR*-tree.

We are interested in 1) the number of candidate keyword covers generated, 2) the BKC query response time, 3) the maximum memory consumed, and 4) the average number of keyword-NNs of each principal node (or object) retrieved for computing an lbkc, together with the number of lbkcs computed for answering a BKC query. In addition, we test the performance when the weighted average of keyword ratings is applied, as discussed in section 7. All algorithms are implemented in Java 1.7.0, and all experiments have been performed on a Windows XP PC with a 3.0 GHz CPU and 3 GB of main memory.
to the candidate keyword covers generated during the query
processing. In the baseline algorithm, the response time
increases very fast when m increases. This is consistent
with the fast increase of the candidate keyword covers
generated when m increases. Compared to the baseline
algorithm, the keyword-NNE algorithm shows much slower
increase when m increases.
For processing the BKC queries at the same settings
of m and α, the performances are different on datasets
(a) Yellow Page and Yelp (b) AU and US US, AU, YELP and Yellow Page as shown in Figure 8
and Figure 9. The reason is that the average number of
Fig. 7. Baseline, Virtual bR*-tree and bR*-tree. objects in one keyword in datasets US, AU, YELP and
Yellow Page are 6502, 3728, 1170 and 285 respectively as
We are interested in 1) the number of candidate keyword shown in Figure 6; in turn, the average numbers of objects
covers generated, 2) BKC query response time, 3) the in one query keyword in the BKC queries on dataset US,
3. http://s3.amazonaws.com/simplegeo-public/places dump 20110628.zip 4. In this work, all experimental results are obtained in the same way.
[Figure 8: panels (a)-(d), one per data set.]
Fig. 8. Number of candidate keyword covers generated vs. m (α=0.4).

[Figure 9: panels (a)-(d), one per data set.]
Fig. 9. Response time vs. m (α=0.4).

[Figure 10: panels (a)-(d), one per data set.]
Fig. 10. Number of candidate keyword covers generated vs. α (m=4).

[Figure 11: panels (a)-(d), one per data set.]
Fig. 11. Response time vs. α (m=4).
9.2 Effect of α

This test shows the impact of α on performance. As shown in Equation (2), α is an application-specific parameter that balances the weight of the keyword rating against the diameter in the score function. Compared to m, the impact of α on performance is limited. When α = 1, the BKC query degenerates to the mCK query, where the distance between objects is the sole factor and the keyword rating is ignored; as α changes from 1 to 0, more weight is assigned to the keyword rating.
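Equation (2) is defined earlier in the paper; purely as an illustration, a linear blend of the following shape (our notation here; the normaliser diam_max and the 10-point rating scale are assumptions) exhibits the behaviour described in this section:

    score(O) = α · (1 − diam(O)/diam_max) + (1 − α) · minrating(O)/10,

so that α = 1 ranks keyword covers by diameter alone (the mCK behaviour), while smaller values of α let the minimum keyword rating dominate.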
In Figure 10, an interesting observation is that, as α decreases, the number of keyword covers generated by both the baseline algorithm and the keyword-NNE algorithm shows a consistent trend of slight decrease. The reason is that the KRR*-tree has a keyword-rating dimension: objects that are close to each other geographically may have very different ratings and thus fall into different nodes of the KRR*-tree. When more weight is assigned to keyword ratings, the KRR*-tree tends to have more pruning power, since it distinguishes objects that are close to each other but have different keyword ratings. As a result, fewer candidate keyword covers are generated. Figure 11 presents the average response time of the queries, which is consistent with the number of candidate keyword covers generated.
The BKC query provides robust solutions to meet various practical requirements, while the mCK query cannot. Suppose we have three query keywords in the Yelp data set, namely "bars", "hotels & travel", and "fast food". When α = 1, the BKC query (equivalent to the mCK query) returns Pita House, Scottsdale Neighborhood Trolley, and Schlotzskys (the selected objects in keywords "bars", "hotels & travel", and "fast food", respectively), where the lowest keyword rating is 2.5 and the maximum distance is 0.045 km. When α = 0.4, the BKC query returns The Attic, Enterprise Rent-A-Car, and Chick-Fil-A, where the lowest keyword rating is 4.5 and the maximum distance is 0.662 km.
9.3 Maximum Memory Consumption

The maximum memory consumed by the baseline algorithm and the keyword-NNE algorithm is reported in Figure 12 (the average results of 100 BKC queries on each of the four data sets). The maximum memory consumed by the keyword-NNE algorithm is at most 0.5 MB in all settings of m, while in the baseline algorithm it increases very fast as m increases. As discussed in section 8, the keyword-NNE algorithm only maintains the principal nodes (or objects) in memory, while the baseline algorithm maintains candidate keyword covers in memory.

Fig. 12. Maximum memory consumed vs. m (α=0.4).
9.4 Keyword-NNE

The high performance of the keyword-NNE algorithm is due to the fact that each principal node (or object) retrieves only a few keyword-NNs in each non-principal query keyword. Suppose all keyword-NNs retrieved by the keyword-NNE algorithm are kept in a set S. Figure 13 (a) shows the average size of S. The data sets are randomly sampled so that the number of objects in each query keyword of a BKC query ranges from 100 to 3000. The results illustrate that the number of objects in the query keywords has limited impact on the size of S; on the contrary, the size of S is clearly influenced by m. When m increases from 2 to 9, S grows linearly. On average, a principal node (or object) retrieves only 2-4 keyword-NNs in each non-principal query keyword. Figure 13 (b) shows the number of lbkcs computed during query processing. Less than 10% of the principal nodes (or objects) need to compute their lbkcs across the different sizes of data sets; in other words, 90% of all principal nodes (or objects) are pruned during query processing.

[Figure 13: (a) Average size of S vs. m; (b) Number of lbkcs computed vs. m.]
Fig. 13. Features of keyword-NNE (α=0.4).

9.5 Weighted Average of Keyword Ratings

These tests compare the performance of the weighted average of keyword ratings against the minimum keyword rating. The average experimental results of 100 BKC queries on each of the four data sets are reported in Figure 14. The difference between the two settings is negligible. This is because the score computation with the minimum keyword rating is fundamentally equivalent to that with the weighted average: in the former, if a combination O of objects (or their MBRs) does not cover a keyword, the rating of this keyword used for computing O.score is 0, while in the latter it is the maximum rating of this keyword.

[Figure 14: (a) Response time vs. m; (b) Number of nodes visited vs. m.]
Fig. 14. Weighted average vs. minimum (α=0.4).
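The equivalence can be made concrete with a small sketch (hypothetical helper code, again not the paper's implementation); the two aggregations differ only in which rating stands in for an uncovered keyword:

    import java.util.Map;

    // 'covered' maps each query keyword to the rating of the covering
    // object, or to null when the combination O does not cover it.
    public class RatingAggregation {

        // Minimum-rating variant: an uncovered keyword contributes rating 0.
        static double minimumRating(Map<String, Double> covered) {
            double min = Double.MAX_VALUE;
            for (Double r : covered.values())
                min = Math.min(min, r == null ? 0.0 : r);
            return min;
        }

        // Weighted-average variant: an uncovered keyword contributes the
        // maximum rating of that keyword, as described above.
        static double weightedAverage(Map<String, Double> covered,
                                      Map<String, Double> maxRating,
                                      Map<String, Double> weight) {
            double sum = 0, wsum = 0;
            for (Map.Entry<String, Double> e : covered.entrySet()) {
                double r = e.getValue() == null ? maxRating.get(e.getKey())
                                                : e.getValue();
                double w = weight.get(e.getKey());
                sum += w * r;
                wsum += w;
            }
            return wsum == 0 ? 0 : sum / wsum;
        }
    }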
10 CONCLUSION

Compared to the most relevant mCK query, the BKC query provides an additional dimension to support more sensible decision making. The introduced baseline algorithm is inspired by the methods for processing the mCK query. The baseline algorithm generates a large number of candidate keyword covers, which leads to a dramatic performance drop when more query keywords are given. The proposed keyword-NNE algorithm applies a different processing strategy, i.e., searching for the local best solution for each object in a certain query keyword. As a consequence, the number of candidate keyword covers generated is significantly reduced. The analysis reveals that the number of candidate keyword covers which need to be further processed in the keyword-NNE algorithm is optimal, and that processing each candidate keyword cover typically generates far fewer new candidate keyword covers in the keyword-NNE algorithm than in the baseline algorithm.

REFERENCES

[1] Rakesh Agrawal and Ramakrishnan Srikant. "Fast algorithms for mining association rules in large databases". In: VLDB. 1994, pp. 487-499.
[2] T. Brinkhoff, H. Kriegel, and B. Seeger. "Efficient processing of spatial joins using R-trees". In: SIGMOD (1993), pp. 237-246.
[3] Xin Cao, Gao Cong, and Christian S. Jensen. "Retrieving top-k prestige-based relevant spatial web objects". In: Proc. VLDB Endow. 3.1-2 (2010), pp. 373-384.
[4] Xin Cao et al. "Collective spatial keyword querying". In: ACM SIGMOD. 2011.
[5] G. Cong, C. Jensen, and D. Wu. "Efficient retrieval of the top-k most relevant spatial web objects". In: Proc. VLDB Endow. 2.1 (2009), pp. 337-348.
[6] Ian De Felipe, Vagelis Hristidis, and Naphtali Rishe. "Keyword Search on Spatial Databases". In: ICDE. 2008, pp. 656-665.
[7] R. Fagin, A. Lotem, and M. Naor. "Optimal Aggregation Algorithms for Middleware". In: Journal of Computer and System Sciences 66 (2003), pp. 614-656.
[8] Ramaswamy Hariharan et al. "Processing Spatial-Keyword (SK) Queries in Geographic Information Retrieval (GIR) Systems". In: Proceedings of the 19th International Conference on Scientific and Statistical Database Management. 2007, pp. 16-23.
[9] G. R. Hjaltason and H. Samet. "Distance browsing in spatial databases". In: TODS 2 (1999), pp. 256-318.
[10] Z. Li et al. "IR-tree: An efficient index for geographic document search". In: TKDE 99.4 (2010), pp. 585-599.
[11] N. Mamoulis and D. Papadias. "Multiway spatial joins". In: TODS 26.4 (2001), pp. 424-475.
[12] D. Papadias, N. Mamoulis, and B. Delis. "Algorithms for querying by spatial structure". In: VLDB (1998), p. 546.
[13] D. Papadias, N. Mamoulis, and Y. Theodoridis. "Processing and optimization of multiway spatial joins using R-trees". In: PODS (1999), pp. 44-55.
[14] J. M. Ponte and W. B. Croft. "A language modeling approach to information retrieval". In: SIGIR (1998), pp. 275-281.
[15] João B. Rocha-Junior et al. "Efficient processing of top-k spatial keyword queries". In: Proceedings of the 12th International Conference on Advances in Spatial and Temporal Databases. 2011, pp. 205-222.
[16] S. B. Roy and K. Chakrabarti. "Location-Aware Type Ahead Search on Spatial Databases: Semantics and Efficiency". In: SIGMOD (2011).
[17] Dongxiang Zhang, Beng Chin Ooi, and Anthony K. H. Tung. "Locating mapped resources in web 2.0". In: ICDE (2010).
[18] Dongxiang Zhang et al. "Keyword Search in Spatial Databases: Towards Searching by Document". In: ICDE. 2009, pp. 688-699.

Ke Deng was awarded a PhD degree in computer science from The University of Queensland, Australia, in 2007 and a Master degree in Information and Communication Technology from Griffith University, Australia, in 2001. His research background includes high performance database systems, spatiotemporal data management, data quality control and business information systems. His current research interest focuses on big spatiotemporal data management and mining.

Xin Li is a Master candidate in Computer Science at Renmin University of China. He received the Bachelor degree from the School of Information at Renmin University of China in 2007. His current research focuses on spatial databases and geo-positioning.

Jiaheng Lu is a professor in Computer Science at the Renmin University of China. He received the Ph.D. degree from the National University of Singapore in 2007 and the M.S. degree from Shanghai Jiaotong University in 2001. His research interests span many aspects of data management. His current research focuses on developing an academic search engine, XML data management, and big data management. He has served in the organization and program committees for various conferences, including SIGMOD, VLDB, ICDE and CIKM.

Xiaofang Zhou received the BS and MS degrees in computer science from Nanjing University, China, in 1984 and 1987, respectively, and the PhD degree in computer science from The University of Queensland, Australia, in 1994. He is a professor of computer science with The University of Queensland. He is the head of the Data and Knowledge Engineering Research Division, School of Information Technology and Electrical Engineering. He is also a specially appointed Adjunct Professor at Soochow University, China. From 1994 to 1999, he was a senior research scientist and project leader in CSIRO. His research is focused on finding effective and efficient solutions to managing, integrating and analyzing very large amounts of complex data for business and scientific applications. His research interests include spatial and multimedia databases, high performance query processing, web information systems, data mining, and data quality management. He is a senior member of the IEEE.