PPQP Ieee 04092014
date_of_journey BETWEEN ? AND ? may not reveal any private information, the complete query with the values of the parameters replacing the placeholders makes it sensitive [9]. This is a concern for the customer.

1 An incomplete SQL statement within a software system, meant to be fully constructed and executed at runtime [11]
2 Question marks (?) are used to denote the placeholders for the constants

We approach the problem of preserving query privacy as that of hiding these constants from the service provider and, optionally, the databases. Hiding constants during query processing has an obvious cost because the system must use some hiding (encryption and decryption) mechanism to achieve it. Thus, instead of treating every constant identically, we divide the set of constants into three disjoint sets: nonsensitive (NS), sensitive (S) and highly sensitive (HS), where each category has its defined protection level as shown in Table II. The application program allows the customer to assign each constant to one of these sets, but in specific cases the choice could be predefined by the system depending on the severity of the constant. For example, in searching crime records a criminal’s name should normally belong to HS. Depending on the cardinality of the sets HS, S and NS we get four types of query privacy, as shown in Table III. The nonsensitive constants provide increased efficiency as they are visible to everyone. Both sensitive and nonsensitive constants are visible to the databases. This adds to the interest of the data sources in participating, because they know what data they are revealing to the (unknown) customer and the service provider, whereas the highly sensitive constants primarily protect the interests of the customer, who is reluctant to reveal these even to the (unknown) databases. The data, being the private asset of the data sources, should always be protected from the service provider, the customer and other data sources. Similarly, the query result obtained from a data source needs to be protected from the service provider and the other data sources. Similar privacy constraints would also apply to any intermediate results.

TABLE II. Graded Query Constants
Query Constant Set       Protected From
Highly sensitive (HS)    Both service provider and the databases
Sensitive (S)            Only service provider
Nonsensitive (NS)        None

Our query processing framework is based on a semi-honest model where each participant is expected to follow the protocol, while they have the freedom of using intermediate results to their own advantage. As such, no database would have any real incentive to keep quiet or to change its part of the result. The reason is that the customer and the data sources are not known to each other; their identities are kept secret. If a database remains silent, the customer would think that either the database is not interested in disclosing anything or it obtained an empty result; there is no great distinction between these two for the customer. The service provider, being a reputed intermediary, will never risk its reputation by keeping silent or falsifying; rather it prefers to remain oblivious to both query and result. Query privacy of the customer and the data privacy (which includes result privacy) of the data sources are the two primary considerations besides maintaining identity privacy in this semi-honest model.

B. Privacy of Sensitive Query Constants

Our approach to deal with the three sets of query constants is as follows:

a) Elements of NS are substituted in the query text by the application program before sending the query to the service provider, who in turn sends it to the databases.

b) Elements of S in encrypted form are sent to the databases via the service provider along with the query text. The service provider acts as a conduit in this data transfer. The databases in turn get them decrypted with the help of the customer, again via the service provider used as a conduit. After obtaining S, each database substitutes these values in the respective placeholders and executes its local query. We have proposed commutative encryption for this secret message transfer.

c) Elements of HS in encrypted form are sent to the service provider, who uses them for selecting the tuples of interest of the customer’s query from the result sets obtained by local query processing in each database. The part or whole of the WHERE clause involving these constants is stripped from the original query, and the attributes lost from the WHERE clause in this process are added to the SELECT list of attributes. This ensures that the answer of the original query is contained in the query result of the transformed query.

In this paper we have provided a solution for the complete case in which all the three sets HS, S and NS are non-empty. In case one or more sets are empty, the specific solution can be derived by suitably modifying the above solution.
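The treatment of the three constant sets in Section B can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Predicate` class and the `transform_query` helper are hypothetical names, each predicate is assumed to carry constants of a single grade, the WHERE clause is assumed to be a pure conjunction (so dropping an HS predicate can only enlarge the result), and quoting of string constants is ignored.

```python
from dataclasses import dataclass

@dataclass
class Predicate:
    column: str       # attribute the predicate refers to
    template: str     # SQL fragment with {0}, {1}, ... for its constants
    constants: tuple  # constant values supplied by the customer
    level: str        # "NS", "S" or "HS" (hypothetical tagging)

def transform_query(select_cols, table, predicates):
    """Build the query text sent to the service provider.

    NS constants are substituted, S constants stay as '?' placeholders,
    and HS predicates are stripped while their attribute joins the SELECT list.
    Returns (query_text, S constants, HS constants).
    """
    cols = list(select_cols)
    where, s_consts, hs_consts = [], [], []
    for p in predicates:
        if p.level == "NS":
            where.append(p.template.format(*p.constants))
        elif p.level == "S":
            where.append(p.template.format(*["?"] * len(p.constants)))
            s_consts.extend(p.constants)
        elif p.level == "HS":
            if p.column not in cols:
                cols.append(p.column)
            hs_consts.extend(p.constants)
        else:
            raise ValueError(f"unknown grade: {p.level}")
    sql = f"SELECT {', '.join(cols)} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, s_consts, hs_consts

# Hypothetical query with one predicate of each grade.
query, s_set, hs_set = transform_query(
    ["col1", "col2"], "table1",
    [Predicate("col1", "col1 <= {0}", (70,), "NS"),
     Predicate("col2", "col2 BETWEEN {0} AND {1}", (1000, 2000), "S"),
     Predicate("col3", "col3 = {0}", ("ABC Limited",), "HS")])
print(query)
```

The S and HS constants returned by the helper never appear in the query text; they would travel separately in encrypted form.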
C. Data Integration – Horizontal and Vertical

The query service collects and integrates data from multiple data sources through the service provider, virtually forming a distributed database. We can classify the data integration into one of the following types:

a. Horizontal Integration: A query can be resolved by retrieving data from a set of databases each of which has in its schema all the attributes (possibly along with other attributes) of the target list and of the predicate in the customer query. The final query result is obtained by taking the union of the local query results of individual databases. This is akin to horizontal fragmentation in distributed databases, although the databases need not have identical sets of attributes and can be heterogeneous.

b. Vertical Integration: A query can be resolved by retrieving data from a set of databases each of which has only a part (but not all) of the attributes from the target list and the predicate in the customer query. The final query result is obtained by joining the local query results of individual databases on a common key. This is akin to vertical fragmentation in distributed databases.

c. Mixed/Hybrid Integration: This is a combination of horizontal and vertical integration.

Any query processing would be accomplished through a sequence of horizontal integration and vertical integration operations. In this paper we have worked on horizontal data integration.

II. PROBLEM DEFINITION

In our system, we have a service provider SP, a customer C and a number of independent heterogeneous databases D1, …, Dn of n data sources. The databases are assumed to be relational. We assume that the data sources cannot communicate with each other. SP itself is not a data source. It receives the customer’s query Q and solves the query with the help of a set of databases. C does not have any knowledge about the databases that can resolve the given query. The data sources registered with SP share their data catalogue for the part of their data they are willing to share for the query service. With the help of the catalogues, SP locates the relevant databases by comparing the schema of the databases with the attributes in the target list and the predicate of the query. Alternatively, SP can also send the formatted query to a set of potential data sources based on the catalogues available or prior knowledge about them. The data sources would then match their database schemas with Q, examine the constant sets {NS, S, HS} (without looking into the constant values) against their business objectives and decide whether to participate in the query resolution. Finally the service provider reformulates (and splits) the query into local queries and sends them to the relevant databases for processing. Each database processes its local query. The query processing engine at each database generates results in accordance with a common schema and sends the result in encrypted form to SP. With the help of SP and the databases, C combines the partial query results to obtain the final query result. The idea is to perform the entire operation in a privacy preserving manner so that SP does not learn anything about the query constants S and HS, about the contents of the databases, or about the results (except possibly their sizes). No database learns about the contents of other databases, the results of others, or the identities of C and the other data sources. C does not learn anything about the contents of the databases or the identity of the corresponding data source. It only learns the overall result R or individual results Ri, without being able to locate the data source.

III. RELATED WORK

Aggarwal et al. [1] proposed a two-party storage model to enable secure database query service for outsourced data. The key idea in their approach is to partition data into two logically independent database systems according to the privacy constraint; these databases cannot communicate with each other yet execute database queries in that distributed architecture. Their proposed scheme does not allow queries to execute on multiple databases. Agrawal, Evfimievski, and Srikant [2] developed protocols for secure intersection, intersection size and equijoin database operations for two databases using commutative encryption and hashing. Their work exposes partial information such as table sizes and the query to the databases and does not support aggregation queries. In [5] Chow, Lee and Subramanian proposed a two-party computation model for privacy-preserving queries over distributed databases in an honest-but-curious adversarial model comprising two semi-honest parties, a randomizer and a computing engine, other than the customer and the databases. Scalability of query computation over large databases was their focus area, though the proposed model does not support comparison across databases. Their work assumes that the randomizer picks a random string and sends it to the databases and the customer, for de-randomization, via confidential channels. Their model supports data privacy and result privacy but does not consider query privacy. Emekci, Agrawal, Abbadi and Gulbeden [6] proposed a privacy preserving intersection, equijoin and aggregation query solution over a hash-based P2P system in which selected third parties perform query computation to speed up query response while preserving privacy of the data sources. These third parties are selected from a peer-to-peer (P2P) system, namely Chord [12], for computation of the query results. The secrets are distributed to the third parties using Shamir’s secret sharing method [10]. Their model considers data privacy but not query privacy. Most of the existing privacy preserving query processing solutions deal with data and query privacy separately. However, in a recent work Hu, Xu, Ren and Choi [7] dealt with data, query and result privacy. Their model is for a single database. They used homomorphic encryption, a computationally heavy encryption algorithm, to compute Euclidean distance for distance-based queries such as kNN queries and distance range queries. Moreover, homomorphic encryption enforces some restrictions on the domain of plaintext. There are a number of privacy preserving techniques for query processing over distributed databases [5, 6, 7, 9].

$$(m^{e_2} \bmod n)^{e_1} \bmod n = m^{e_1 e_2} \bmod n = (m^{e_1} \bmod n)^{e_2} \bmod n = E_{k_B}(m^{e_1} \bmod n) = E_{k_B}(E_{k_A}(m))$$
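The chain of equalities above can be checked numerically. The snippet below is a minimal sketch with toy values chosen only for illustration (the modulus 23 × 47 and the exponents 3 and 5 are assumptions, not parameters taken from the paper); `commutes` is a hypothetical helper name.

```python
def commutes(m, e1, e2, n):
    """Evaluate the three exponentiation orders in the identity."""
    b_then_a = pow(pow(m, e2, n), e1, n)   # (m^e2 mod n)^e1 mod n
    combined = pow(m, e1 * e2, n)          # m^(e1*e2) mod n
    a_then_b = pow(pow(m, e1, n), e2, n)   # (m^e1 mod n)^e2 mod n
    return b_then_a, combined, a_then_b

n = 23 * 47                 # toy modulus, far too small for real use
x, y, z = commutes(42, 3, 5, n)
print(x == y == z)          # prints True
```

Because the order of the two exponentiations does not matter, either party can add or remove its own layer of encryption without disturbing the other party's layer.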
3 $\varphi(n) = (p - 1)(q - 1)$ is Euler’s totient function.
4 A prime p is safe if p = 2p’ + 1 where p’ is an odd prime.
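The two footnoted definitions translate directly into code. The sketch below is hedged: it assumes RSA-style exponent pairs (e, d) with e·d ≡ 1 mod φ(n), uses trial division that is only suitable for toy sizes, and the helper names `is_safe_prime`, `totient` and `make_keypair` are hypothetical. The three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
from math import gcd, isqrt

def is_prime(k):
    """Trial-division primality test (adequate for small illustrative values)."""
    if k < 2:
        return False
    return all(k % d for d in range(2, isqrt(k) + 1))

def is_safe_prime(p):
    """p is safe if p = 2p' + 1 with p' an odd prime (footnote 4)."""
    half = (p - 1) // 2
    return is_prime(p) and half % 2 == 1 and is_prime(half)

def totient(p, q):
    """Euler's totient of n = p*q for distinct primes p and q (footnote 3)."""
    return (p - 1) * (q - 1)

def make_keypair(p, q, e):
    """Return (e, d) with e*d = 1 mod phi(n); e must be coprime to phi(n)."""
    phi = totient(p, q)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime to phi(n)")
    return e, pow(e, -1, phi)

p, q = 23, 47                      # toy safe primes: 2*11 + 1 and 2*23 + 1
e, d = make_keypair(p, q, 3)
m = 42
print(pow(pow(m, e, p * q), d, p * q) == m)   # prints True
```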
V. BUILDING BLOCKS OF OUR PRIVACY PRESERVING PROTOCOL

The following protocol is used by SP at the beginning of query execution.

A. Setup

a. SP chooses two safe primes p and q and shares these with C and each D.
b. Using p and q, C and each D create their own secret keys KC and KD following RSA. Let ED and EC (DD and DC) denote the encryption (decryption) functions with the secret key of D and C respectively. Both (E, D) pairs satisfy the commutative property.

The following protocols are used to privately exchange a secret message between two parties via an intermediary.

B. Privacy Preserving Message^5 Passing from C to D via SP: PPMP (Source C, Destination D, Intermediary SP, Message m)

a. C encrypts m with its encryption key and sends EC(m) to D via SP.
b. D encrypts EC(m) with its encryption key and sends ED(EC(m)) to C via SP.
c. C decrypts ED(EC(m)) to obtain ED(m) and sends ED(m) to D via SP.
d. D decrypts ED(m) to obtain m.

C. Privacy Preserving Message Passing from D to C via SP: PPMP (Source D, Destination C, Intermediary SP, Message m)

This is the symmetric counterpart of the above protocol, with the roles of C and D exchanged.

D. Privacy Preserving Encrypted Message Passing from C to SP via SP: PPMP (Source C, Destination SP, Intermediary SP, Message m, Encryption D)

a. C encrypts m with its encryption key and sends EC(m) to D via SP.
b. D encrypts EC(m) with its encryption key and sends ED(EC(m)) to C via SP.
c. C decrypts ED(EC(m)) to obtain ED(m) and sends ED(m) to SP.

E. Privacy Preserving Message Decryption by C with the help of D via SP: PPMP (Source C, Intermediary SP, Message ED(m), Encryption D)

a. C encrypts ED(m) with its encryption key and sends EC(ED(m)) to D via SP.
b. D decrypts EC(ED(m)) with its decryption key to obtain EC(m) and sends EC(m) to C via SP.
c. C decrypts EC(m) to obtain m.

5 Message m can be a scalar or a vector

VI. PROPOSED SOLUTION

In our query service, the customer’s query Q has three sets of constants {NS, S, HS}. The privacy requirement of each type of constant is listed in Table II. How the privacy of each type of constant is handled has been discussed in Section I. Before we present our algorithm, we take a simple example query to explain our methodology.

Example: Consider a hypothetical query of the form

SELECT col1, col2 FROM table1
WHERE col1 <= NS1 AND col2 BETWEEN S1 AND S2
AND col3 = HS1

Let us assume that the constants entered by the customer through the query service’s web interface are as follows:

NS1 = 70, S1 = 1000, S2 = 2000 and HS1 = “ABC Limited”

Before it is sent to SP, the query text goes through the following transformations:

1. The value of the nonsensitive constant NS1 is substituted in the query text, as this can be made public as per the user’s choice.
2. The predicate on the attribute corresponding to the highly sensitive constant HS1 is stripped from the query text and the attribute is added to the select list, so that the answer of the original query is found within the query result of the transformed query.

The sensitive constants S1 and S2 remain as placeholders (?) in the query text and are sent to the databases separately (Step 4 below).

The modified query Q’ after the above transformations is

SELECT col1, col2, col3 FROM table1
WHERE col1 <= 70 AND col2 BETWEEN ? AND ?

A. Privacy Preserving Query Service Protocol

Step 1. SP chooses a set of databases D1, ..., Dn to process Customer C’s query Q and runs the Setup process to distribute safe primes p and q to C and each Di.
Step 2. C transforms Q into Q’ [by the client side program at C] and sends it to SP.
Step 3. SP sends Q’ to each Di.
Step 4. C sends the set of sensitive constants S to each Di using PPMP (C, Di, SP, S).
Step 5. Each Di executes Q’ to generate result set Ri, encrypts it with its key and sends the encrypted Ri to SP for further processing.
Step 6. C sends the set of highly sensitive constants HS to SP for each Di using PPMP (C, SP, SP, HS, Di).
Step 7. On receipt of the query result Ri from each data source, SP picks the tuples of interest by (equality) matching the encrypted data E(Ri) with each s in E(HS) and sends the resultant data E(ri) to C.
Step 8. For each data source Di, C decrypts E(ri) using PPMP (C, SP, EDi(ri), Di).
Step 9. C combines the result sets ri to get the final answer.

The procedure above covers the case where all the constant sets are non-empty. Each variation (see Table III) calls for a suitable modification of the algorithm. For example, if HS is empty, we need not strip col3 from the WHERE clause and the databases need not send the partial result to SP for tuple selection. Instead each database can send its local query result to C using PPMP (D, C, SP, m).
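Steps 4 to 9 can be traced end to end with a toy simulation for a single database Di. This is a sketch under several assumptions that are not part of the protocol description: the safe primes 1019 and 1187 and the exponents 3 and 5 are illustrative values far too small for real use, string values are mapped to integers through a hypothetical `CODES` table, the service provider is modelled only implicitly (every value passed between C and Di would be relayed by it), and the function names are invented for this example. The three-argument `pow` with exponent -1 requires Python 3.8 or later.

```python
from math import gcd

P, Q = 1019, 1187                  # toy safe primes (2*509 + 1 and 2*593 + 1)
N, PHI = P * Q, (P - 1) * (Q - 1)

def keypair(e):
    """RSA-style (encrypt, decrypt) exponents over the shared modulus N."""
    assert gcd(e, PHI) == 1
    return e, pow(e, -1, PHI)

def enc(key, m):
    return pow(m, key[0], N)

def dec(key, c):
    return pow(c, key[1], N)

K_C, K_D = keypair(3), keypair(5)  # secret keys of customer C and database Di

# Hypothetical integer codes for col3 values (plaintexts must lie in [0, N)).
CODES = {"ABC Limited": 101, "XYZ Corp": 102}

def ppmp_to_database(m):           # protocol B, used in Step 4
    c1 = enc(K_C, m)               # C  -> SP -> Di : EC(m)
    c2 = enc(K_D, c1)              # Di -> SP -> C  : ED(EC(m))
    c3 = dec(K_C, c2)              # C  -> SP -> Di : ED(m)
    return dec(K_D, c3)            # Di recovers m

def ppmp_to_sp(m):                 # protocol D, used in Step 6
    c2 = enc(K_D, enc(K_C, m))     # ED(EC(m)) comes back to C
    return dec(K_C, c2)            # C strips its layer; SP keeps ED(m)

def ppmp_decrypt(c):               # protocol E, used in Step 8; c = ED(m)
    c1 = enc(K_C, c)               # C  -> SP -> Di : EC(ED(m))
    c2 = dec(K_D, c1)              # Di -> SP -> C  : EC(m)
    return dec(K_C, c2)            # C recovers m

def run_query(table, s1, s2, hs):
    lo, hi = ppmp_to_database(s1), ppmp_to_database(s2)         # Step 4
    local = [r for r in table if r[0] <= 70 and lo <= r[1] <= hi]
    enc_ri = [tuple(enc(K_D, v) for v in (a, b, CODES[c]))      # Step 5
              for a, b, c in local]
    enc_hs = ppmp_to_sp(CODES[hs])                              # Step 6
    selected = [r for r in enc_ri if r[2] == enc_hs]            # Step 7
    return [(ppmp_decrypt(a), ppmp_decrypt(b))                  # Steps 8-9
            for a, b, _ in selected]

table1 = [(60, 1500, "ABC Limited"), (65, 1800, "XYZ Corp"),
          (80, 1200, "ABC Limited"), (50, 2500, "ABC Limited")]
print(run_query(table1, 1000, 2000, "ABC Limited"))   # prints [(60, 1500)]
```

The equality match in Step 7 works only because the encryption here is deterministic: equal plaintexts under the same key give equal ciphertexts, so SP can compare ED(col3) with ED(HS) without decrypting either.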
B. Security Analysis

Our query processing framework is based on a semi-honest or honest-but-curious model, which means that all the parties follow the protocol correctly but may record any intermediate input received during the protocol execution and try to derive some benefit out of it. Databases are not known to the Customer; they are chosen by the Service Provider from a set of data sources based on the query type. Moreover, they do not have any common interest. So practically they have no chance of colluding with the Customer. However, being known to the Service Provider, the databases may collude with each other. Our attempt is to prevent the collusion between them. We analyze the security aspect of each protocol. Query privacy of the Customer and the data privacy of the data sources are the primary considerations in this semi-honest model. The Service Provider remains completely in the dark during the process but provides the service without learning the query, the data or the result of computation.

Any party, whether the Customer, the Service Provider or a data source, may act as an adversary. The encryption/decryption algorithms are known to all the parties but the secret key pair of each party is unknown to others. The following situations may arise:

a. The service provider may like to know the values of the constants of the parametric query given by the customer, the data values of the attributes corresponding to the sensitive constants as well as the query result.
b. The customer may like to know the data values of the attributes corresponding to the sensitive constants.
c. The data source may like to know the values of the sensitive constants and other data sources’ data.

Security of data depends on the security of the encryption/decryption key pair of individual players. We have proposed the one-way accumulator as our choice of commutative encryption. Benaloh and de Mare [3] showed that inverting it is computationally infeasible.

VII. CONCLUSION AND FUTURE WORK

The problem of preserving privacy of sensitive constants in a customer’s query in a query service framework has been studied in this paper. In this model a service provider provides query service with the help of different databases virtually forming a distributed database with horizontal partitioning, though the databases could be heterogeneous. Query processing in this model preserves data privacy of the data sources and the query privacy of the customer. It also preserves identity privacy of the customer and the data sources, and the result privacy. The protocols have been built on commutative encryption for secretly transferring data/messages between two parties via a third party. We have suggested the one-way accumulator as our choice of commutative encryption scheme because inverting it is computationally infeasible. For efficiency of processing, the problem of hiding the sensitive query constants has been studied in three different practical scenarios depending on the degree of disclosure of the constants allowed by the customer to other players (Table II, Section I). The computation and communication complexity will vary depending on the degree of disclosure. Our future plan is to work on privacy preserving queries on vertically distributed databases.

REFERENCES

[1] Aggarwal G., M. Bawa, P. Ganesan, H. Garcia-Molina, K. Kenthapadi, R. Motwani, U. Srivastava, D. Thomas, Y. Xu. “Two can keep a secret: A distributed architecture for secure database services”. In Proc. of CIDR 2005.
[2] Agrawal R., A. Evfimievski, and R. Srikant. “Information sharing across private databases”. In Proc. of the 2003 ACM SIGMOD international conference on Management of data, pages 86–97, 2003.
[3] Benaloh J. and M. de Mare. “One-way accumulators: A decentralized alternative to digital signatures”. In EUROCRYPT ’93: Workshop on the theory and application of cryptographic techniques on Advances in cryptology, pages 274–285. Springer-Verlag New York, Inc., 1994.
[4] Boss G., P. Malladi, D. Quan, L. Legregni, H. Hall. Service provider computing. IBM Corporation, 2007.
[5] Chow S. S. M., J. H. Lee, and L. Subramanian. “Two-party computation model for privacy-preserving queries over distributed databases”. In NDSS, 2009.
[6] Emekci F., D. Agrawal, A. E. Abbadi, and A. Gulbeden. “Privacy preserving query processing using third parties”. In ICDE 2006, page 27. IEEE Computer Society, 2006.
[7] Hu H., J. Xu, C. Ren and B. Choi. “Processing private queries over untrusted data service provider through privacy homomorphism”. In ICDE, IEEE Computer Society, 2011, pages 601–612.
[8] Kantarcioglu M. and C. Clifton. “Privacy-preserving distributed mining of association rules on horizontally partitioned data”. In The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'02), pages 24–31, June 2, 2002.
[9] Olumofin F. and I. Goldberg. “Privacy-preserving queries over relational databases”. In PETS’10, Berlin, 2010.
[10] Shamir A. “How to share a secret”. Commun. ACM, vol. 22, no. 11, pages 612–613, 1979.
[11] Silberschatz A., H. F. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill, Inc., New York, NY, USA, 5th edition, 2005.
[12] Stoica I., R. Morris, D. R. Karger, M. F. Kaashoek, and H. Balakrishnan. “Chord: A scalable peer-to-peer lookup service for internet applications”. In SIGCOMM 2001, pages 149–160, 2001.