Privacy for SQL Queries

The document discusses the use of Private Information Retrieval (PIR) to enhance privacy in SQL queries over relational databases by concealing sensitive constants. It highlights the limitations of existing privacy technologies and presents a novel approach that allows users to retrieve data without revealing sensitive information to the database. The proposed method demonstrates significant performance improvements and aims to address the growing demand for privacy-preserving solutions in various online activities.
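
As a rough illustration of what concealing sensitive constants can look like, the sketch below separates a query into a public part that the server executes and a private constant that never leaves the client. It is a minimal sketch under stated assumptions, not the authors' implementation: the table and column names are borrowed from the paper's domain-registration example, the rows are made up, and the client-side filter uses the trivial download-everything retrieval as a stand-in for a real PIR protocol.

```python
import sqlite3

# Hypothetical data; table/column names follow the paper's domain example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE registrar (reg_id INTEGER PRIMARY KEY, contact TEXT);
    CREATE TABLE regdomains (domain TEXT, expiry TEXT, reg_id INTEGER);
    INSERT INTO registrar VALUES (1, 'alice@example.org'), (2, 'bob@example.org');
    INSERT INTO regdomains VALUES
        ('somedomain.com', '2030-01-01', 1),
        ('other.net', '2029-06-30', 2);
""")

# Public part: the query with the sensitive predicate (domain = ?) removed.
PUBLIC_SUBQUERY = """
    SELECT t1.domain, t1.expiry, t2.contact
    FROM regdomains t1, registrar t2
    WHERE t1.reg_id = t2.reg_id
"""


def server_run_public(connection):
    """Server side: sees and executes only the desensitized query."""
    return connection.execute(PUBLIC_SUBQUERY).fetchall()


def client_lookup(rows, private_domain):
    """Client side: apply the private constant locally.

    Fetching every row is the trivial, communication-heavy retrieval;
    a real deployment would replace this step with a PIR protocol.
    """
    return [row for row in rows if row[0] == private_domain]


rows = server_run_public(conn)
result = client_lookup(rows, "somedomain.com")
print(result)
```

The point of the split is that the server only ever handles `PUBLIC_SUBQUERY`; the string `"somedomain.com"` appears solely in client code.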


Privacy-preserving Queries over Relational Databases

Femi Olumofin and Ian Goldberg
Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1

Abstract: We explore how Private Information Retrieval (PIR) can help users keep their sensitive information from being leaked in an SQL query. We show how to retrieve data from a relational database with PIR by hiding sensitive constants contained in the predicates of a query. Experimental results and microbenchmarking tests show our approach incurs reasonable storage overhead for the added privacy benefit and performs between 7 and 480 times faster than previous work.

I. INTRODUCTION

Most software systems request sensitive information from users to construct a query, but privacy concerns can make a user unwilling to provide such information. The problem addressed by private information retrieval (PIR) [4], [13] is to provide such a user with the means to retrieve data from a database without the database (or the database administrator) learning any information about the particular item that was retrieved. Development of practical PIR schemes is crucial to maintaining user privacy in important application domains like patent databases, pharmaceutical databases, online censuses, real-time stock quotes, location-based services, and Internet domain registration. For instance, the current process for Internet domain name registration requires a user to first disclose the name for the new domain to an Internet domain registrar. The registrar could then use this inside information to preemptively register the new domain and thereby deprive the user of the registration privilege for that domain. This practice is known as front running [26]. The registrar is motivated to engage in front running because of the revenue to be derived from reselling the domain at an inflated price, and from placing ads on the domain’s landing page. Many users, therefore, find it unacceptable to disclose the sensitive information contained in their queries by the simple act of querying a server.

Users’ concern for query privacy and our proposed approach to address it are by no means limited to domain names; they apply to publicly accessible databases in several application domains, as suggested by the examples above. Although ICANN claims the practice of domain front running has subsided [26], we nevertheless use the domain name example in this paper to enable head-to-head performance comparisons with a similar approach by Reardon et al. [35], which is based on this same example.

While today’s most developed and deployed privacy techniques, such as onion routers and mix networks, offer anonymizing protection for users’ identities, they cannot preserve the privacy of the users’ queries. For the front running example, the user could tunnel the query through Tor [17] to preserve the privacy of his or her network address. Nevertheless, the server could still observe the user’s desired domain name, and launch a successful front running attack.

The development of a practical PIR-based technique for protecting query privacy offers users and service providers an attractive value proposition. Users are increasingly aware of the problem of privacy and the need to maintain privacy in their online activities. The growing awareness is partly due to increased dependence on the Internet for performing daily activities (including online banking, Twittering, and social networking) and partly because of the rising trend of online privacy invasion. Privacy-conscious users will accept a service built on PIR for query privacy protection because no currently deployed security or privacy mechanism offers the needed protection; they will likely be willing to trade off query performance for query privacy and even pay to subscribe to such a service. Similarly, service providers may adopt such a system because of its potential for revenue generation through subscriptions and ad displays. As more Internet users value privacy, most online businesses would be motivated to embrace privacy-preserving technologies that can improve their competitiveness to win this growing user population. Since the protection of a user’s identity is not a problem addressed by PIR, existing service models relying on service providers being able to identify a user for the purpose of targeted ads will not be disabled by this proposal. In other words, protection of query privacy will provide additional revenue generation opportunities for these service providers, while still allowing for the utilization of information collected through other means to send targeted ads to the users. Thus, users and service providers have plausible incentives to use a PIR-based solution for maintaining query privacy. In addition, the very existence of a practical privacy-preserving database query technique could be enough to persuade privacy legislators that it is reasonable to demand that certain sorts of databases enforce privacy policies, since
it is possible to deploy these techniques without severely limiting the utility of such databases.

To address the protection of the query, we study how client applications embed personal information into queries, particularly for systems that use SQL for data access. We focus on the protection of SQL queries over relational databases because such databases are widely deployed.

Our goal of preserving the privacy of sensitive information within an SQL query requires an extension to the rudimentary data access model of PIR. These models are limited to retrieving a single bit, a block of bits [4], [13], [28], or a textual keyword [12]. These theoretical primitives are a limiting factor in deploying successful PIR-based systems. There is therefore a need for an extension to a more expressive data access model, and to a model that enables data retrieval from structured data sources, such as from a relational database.

Dynamic SQL is an incomplete SQL statement within a software system, meant to be fully constructed and executed at runtime [39]. It provides a flexible, efficient, and secure way of using SQL in software systems. The flexibility enables systems to construct and submit SQL queries to the database at runtime. Dynamic SQL is efficient because it requires only a single compilation that prepares the query for its subsequent executions. In addition, dynamic SQL is more secure because malicious SQL code injection is much more difficult. We observe that the shape or textual content of an SQL query prepared within a system is not private, but the constants the user supplies at runtime are private, and must be protected. For domain name registration, the textual content of the query is exposed to the database, but only the textual keyword for the domain name is really private. For example, the shape of the dynamic query in Listing 1 is not private; the question mark ? is used as a placeholder for a private value to be provided before the query is executed at runtime. Of note is the related observation made between parameterized SQL queries and parse tree validation [9], [23]. In this context, runtime parse trees obtained from combining user inputs with parameterized queries are validated to ensure consistency with parse trees for programmer-specified queries, thereby defeating SQL injection. Unlike valid inputs, which only alter the semantics of a parse tree, SQL injection attempts to change both the syntax and semantics of a parse tree [24].

Listing 1 Example Dynamic SQL query (see Appendix A for the corresponding database schema)
    SELECT t1.domain, t1.expiry, t2.contact
    FROM regdomains t1, registrar t2
    WHERE (t1.reg_id = t2.reg_id) AND
          (t1.domain = ?)

Our approach to preserving query privacy over a relational database is based on hiding such private constants of a query. The client sends a desensitized version of the prepared SQL query, appropriately modified to remove private information. The database executes this public SQL query, and generates appropriate cached indices to support further rounds of interaction with the client. The client subsequently performs a number of keyword-based PIR operations [12] using the value for the placeholders against the indices to obtain the result for the query.

None of the existing proposals related to enabling privacy-preserving queries and robust data access models for private information retrieval makes the noted observation about the privacy of constants within an otherwise-public query. These include techniques that eliminate database optimization by localizing query processing to the user’s computer [35], problems on querying Database-as-a-Service [25], [22], those that require an encrypted database before permitting private data access [38], and those restricted to simple keyword search on textual data sources [6]. This observation is crucial for preserving the expressiveness and benefits of SQL, and for keeping the interface between a database and existing software systems from changing while building in support for user query privacy. Our approach improves over previous work with additional database optimization opportunities and fewer PIR operations needed to retrieve data. To the best of our knowledge, we are the first to propose a practical technique that leverages PIR to preserve the privacy of sensitive information in an SQL query over existing commercial and open-source relational database systems.

Our contributions. We address the problem of preserving the privacy of sensitive information within an SQL query using PIR. In doing this, we address two obstacles to deploying successful PIR-based systems. First, we develop a generic data access model for private information retrieval from a relational database using SQL. We show how to hide sensitive data within a query and how to use PIR to retrieve data from a relational database. Second, we develop an approach for embedding PIR schemes into the well-established context and organization of relational database systems. It has been argued that performing a trivial PIR operation, which involves having a database send its entire data to the user, and having the user select the item of interest, is more efficient than running a computational PIR scheme [1], [40]; however, information-theoretic PIR schemes are much more efficient. We show how the latter PIR schemes can be applied in realistic scenarios, achieving both efficiency and query expressivity. Since relational databases and SQL are the most influential of all database models and query languages, we argue that many realistic systems needing query privacy protection will find our approach quite useful.

The rest of this paper is organized as follows: Section II provides background information on PIR, the relational model, SQL, and database indexing. Section III discusses related work, while Section IV details the threat model,
security, and assumptions for the paper. Section V provides a description of the approach for hiding sensitive constants within an SQL query. We provide detailed discussions of the algorithm in Section VI. Section VII gives an overview of the prototype implementation and microbenchmarking results of this prototype privacy mechanism. Section VIII highlights results and discussions of the experiment used to evaluate the prototype in greater depth. Section IX concludes the paper and suggests some future work.

II. PRELIMINARIES

A. Private Information Retrieval (PIR)

PIR provides a means to retrieve data from a database without revealing any information about which item is retrieved. In its simplest form, the database stores an n-bit string X, organized as r data blocks, each of size b bits. The user’s private input or query is an index i ∈ {1, ..., r} representing the i-th data block. A trivial solution for PIR is for the database to send all r blocks to the user and have the user select the block of interest at index i (i.e., X_i), but this carries a very poor communication complexity.

The three important requirements for any PIR scheme are correctness, privacy and non-triviality [14]. The requirement of correctness ensures that the scheme returns the correct block X_i to the user. The requirement of privacy assures the scheme does not leak any information to the database about the user’s private input i and the retrieved block X_i. The non-triviality requirement expects a communication complexity that is better than the trivial solution; that is, sublinear in n. An additional requirement, which is not often addressed in the published literature, is implementation efficiency. In fact, the literature has dedicated most attention to reducing communication complexity at the expense of computational complexity [1], [40]. While the performance of information-theoretic PIR schemes is generally better [21], this neglect of computational overhead has led to single-database PIR schemes that are slow for large databases [40]. On the other hand, multi-server information-theoretic PIR schemes are much more efficient than the trivial solution and their use is justified in situations where the user lacks the bandwidth and local storage resources required for the trivial download of data. Recent attempts at building practical single-database PIR [45] using general-purpose secure coprocessors offer several orders of magnitude improvement in performance. Nevertheless, the potential application of PIR in several practical domains has been largely unrealized with no “fruitful” or “real world” practical application.

When the PIR problem was first introduced in 1995 [13], it was proven that a better-than-trivial solution with information-theoretic privacy is impossible to achieve with a single database. Information-theoretic privacy ensures that the adversary cannot learn the user’s query, regardless of its current or future computational abilities. Using at least two replicated databases, however, PIR schemes with information-theoretic privacy are possible, and sometimes hold attractive properties like robustness and byzantine robustness [21]. The first single-database PIR proposal was in 1997 [11]. This PIR scheme assures privacy against an adversary with limited computational capability only; i.e., polynomially bounded attackers. This type of privacy protection is known as computational privacy, and is a weaker notion than information-theoretic privacy. However, computational PIR (CPIR) [11], [28] offers the benefit of being able to field a single database, unlike information-theoretic PIR [4], [13], which requires replication and some form of restriction on how the databases can communicate.

Basic PIR schemes place no restriction on information leaked about other items in the database, which are not of interest to the user. However, an extension of PIR, known as Symmetric PIR (SPIR) [29], adds that restriction by insisting that a user learns only the result of her query. The restriction is crucial in situations where the database privacy is equally of concern.

Another cryptographic construction related to PIR is oblivious transfer (OT) [30], [31]. In OT, a database (or sender) transmits some of its items to a user (or chooser), in a manner that preserves their mutual privacy. The database has assurance that the user does not learn any information beyond what he or she is entitled to, and the user has assurance that the database is oblivious or unaware of which particular items it received. OT and SPIR can thus be seen to be generalizations of PIR. Those protocols could easily be used in place of PIR in our work, with the concomitant extra computational cost.

Freedman et al. [18] provide a solution for database search with keywords in various settings including OT, using oblivious polynomial evaluation and homomorphic encryption. However, each database tuple, which they refer to as a payload, still needs to be tagged with an appropriate keyword. The key improvements over earlier results [30], [31] are the preservation of privacy against a fixed number of queries after an initial setup, a fixed number of rounds for oblivious query evaluation, and the ability to deal with exponential domain sizes.

B. The relational model and SQL

The relational model forms the basis for data storage in many database systems. Data in this model is organized as a collection of tables and the relationships between them. Tables are also called relations. Each record or row of a relation is a tuple, and each column represents an attribute. SQL is a language for manipulating and retrieving data from relations of a database [39].

The basic form of an SQL query consists of the SELECT, FROM, and WHERE clauses (see Listing 2). The SELECT clause produces a relation consisting of the attributes in the set a_1, a_2, ..., a_n. The FROM clause performs a cross product (or Cartesian product) operation on the relations,
by combining each tuple of R_1 with each tuple of R_2 (and similarly for R_3, ..., R_n); each of the resulting tuples has all the attributes of all of the relations. The WHERE clause selects the tuples from the cross product that satisfy a given condition or predicate P. The predicate P is a boolean expression on constants and the attributes of R_1, R_2, ..., R_n and involves comparison operators =, <>, <, >, <=, >= as well as logical operators AND, OR, and NOT. Often, the predicate includes a join that further constrains the tuples for the cross product.

Listing 2 Basic form of an SQL query.
    SELECT a_1, a_2, ..., a_n
    FROM R_1, R_2, ..., R_n
    WHERE P

Another clause of interest to our work is the HAVING clause. This clause is similar to the WHERE clause; however, it allows aggregate expressions, such as SUM(*) and COUNT(*), in its predicate expressions. In practice, the predicates of these two clauses constrain the result of a query to some selected tuples. For example, a predicate “domain = ‘somedomain.com’” restricts the tuples of some selection to those with the domain value ‘somedomain.com’.

C. Indexing

A database index is a supplementary data structure used to efficiently access data from the database. Data are indexed either directly by the values of one or more attributes or by hashes (generally not cryptographic hashes) of those values. The attributes used to define an index form the key. Indices are typically organized into tree structures, such as B+ trees. The number of nodes between the root and any leaf of a B+ tree is constant, because the tree is balanced. Internal or non-leaf nodes do not contain data; they only maintain references to children or leaf nodes. Data are either stored in the leaf nodes, or the leaf nodes maintain references to the corresponding tuples in the database. Furthermore, the leaf nodes of B+ trees may be linked together to enable sequential data access during range queries over the index; range queries return all data with an index attribute value in a specified range.

Hashed indices are specifically useful for point queries, which return a single data item for a given key. For many situations where efficient retrieval over a set of unique keys is needed, hashed indices are preferred over B+ tree indices. However, it is challenging to generate hash functions that will hash each key to a unique hash value. Many hashed indices used in commercial databases, for this reason, use data partitioning (bucketization) [25] techniques to hash a range of values to a single bucket, instead of to individual buckets. Recent advances [8] in perfect hash functions (PHF) have produced a family of hash functions that can efficiently map a large set of n key values to a set of m integers without collisions, where n is less than or equal to m. A perfect hash function is minimal when n = m. These PHFs can work with large sets of keys (on the order of billions), unlike earlier developments, such as gperf [37], that can only manage small sets of keys.

Performance parameters of PHF are generation or construction speed to index a set of keys, representation size or bits stored per key, and evaluation time. The state-of-the-art construction [5] takes linear time; the representation size can be as low as 0.67 bits per key for m = 2n. The evaluation time is O(1). In addition to point queries, an order-preserving PHF [15] can be useful for evaluating range queries over a B+ tree index.

III. RELATED WORK

A common assumption for PIR schemes is that the user knows the index or address of the item to be retrieved. However, Chor et al. [12] proposed a way to access data with PIR using keyword searches over three data structures: binary search tree, trie and perfect hashing. Our work extends keyword-based PIR to B+ trees and PHF. In addition, we provide an implemented system and combine the technique with the expressive SQL. The technique in [12] neither explores B+ trees nor considers executing SQL queries using keyword-based PIR.

Reardon et al. [35] similarly explore using SQL for private information retrieval, and proposed the TransPIR prototype system. This work is the closest to our proposal and will be used as the basis for comparisons. TransPIR performs traditional database functions (such as parsing and optimization) locally on the client; it uses PIR for data block retrieval from the database server, whose function has been reduced to a block-serving PIR server. The benefit of TransPIR is that the database will not learn any information even about the textual content of the user’s query. The drawbacks are poor query performance because the database is unable to perform any optimization, and the lack of interoperability with any existing relational database system.

Private matching and private set intersection schemes [19], [27] consider the problem of computing the intersection of two private sets from two users, such that each user only learns the sets’ intersection. Our work is significantly different from private intersection schemes because SQL queries are richer and more complex than simple set intersection. In addition, an SQL query describes the expected result of a query, which may not contain any itemized listing of the data, whereas private set intersection schemes require the exact data to be the input of a query. The differences between these schemes and our work remain even if one considers a modified private matching scheme, where only one party (the user) needs to learn the result of the intersection.

Significant effort has been devoted to the problem of searching on encrypted data [3], [41], [47]. Shi et al. [38] consider the problem of storing encrypted data in an untrusted repository. To retrieve a subset of the encrypted data,
the user must possess a key that will only decrypt the data matching some preauthorized attributes or keywords. They considered the encryption and auditing of network flows, but the approach is also applicable to financial audit logs, medical privacy, and identity-based biometric encryption systems. Our work is different from encrypted data search in three ways. First, we do not require encryption of the data, regardless of the assumption of the adversary being an insider to the database server. The privacy provided with PIR aims to hide the particular data that is of interest, in the midst of the entire unencrypted data set. Second, the type of query supported with our approach is much more extensive. Third, encrypted data search is typically performed on unstructured or textual data, whereas our approach deals with structured data in the repository of relational databases.

A closely related research stream is the problem of privately searching an encrypted index over an outsourced database in the computing context of Database-as-a-Service [25], [22]. Hacigümüş et al. [22] present a technique for executing SQL queries over a user-encrypted database hosted in a service provider’s server. The goal is to protect the data from the service provider, but still enable the user to query the database. The context of use for the Database-as-a-Service paradigm differs from that of PIR. The service provider typically owns the data that multiple users query with PIR. The goal is not to hide data from the server, but to hide data access patterns, which could leak information about users’ requests.

A problem related to PIR is that of privately searching an unencrypted stream of documents [6], [33]. In these schemes, the client selects some keywords, and then encrypts them before sending them to a server. The server performs a search using the keywords over a stream of unencrypted documents and returns the list of documents containing the keywords back to the client. The server remains oblivious of which particular document it returns, and the confidentiality of the keywords is preserved. Existing constructions are limited to returning documents that give exact matches on a keyword list, or two keyword lists combined with logical “OR” or “AND”. These types of queries are much simpler than a relational database query, which may contain multiple operators (comparison, logical, and so on). In addition, range queries are not presently possible with private stream searching because exact keywords must be specified for the search. The performance of private stream searching constructions is also comparable with that of a single database PIR, because most such schemes rely on homomorphic encryption using the Paillier cryptosystem [34].

An interesting attempt to build a practical pseudonymous message retrieval system using the technique of PIR is presented in [36]. The system, known as the Pynchon Gate, helps preserve the anonymity of users as they privately retrieve messages using pseudonyms from a centralized server. Unlike our use of PIR to preserve a user’s query privacy, the goal of the Pynchon Gate is to maintain privacy for users’ identities. It does this by ensuring the messages a user retrieves cannot be linked to his or her pseudonym. The construction resists traffic analysis, though users may need to perform some dummy PIR queries to prevent a passive observer from learning the number of messages she has received.

IV. THREAT MODEL, SECURITY AND ASSUMPTIONS

A. Security and adversary capabilities

Our main assumption is that the shape of SQL queries submitted by the users is public or known to the database administrator. Applicable practical scenarios include design-time specification of dynamic SQL by programmers, who expect the users to supply sensitive constants at runtime. Moreover, the database schema and all dynamic SQL queries expected to be submitted to, for example, a patent database, are not really hidden from the patent database administrator. Simultaneous protection of both the shape and constants of a query is outside of the scope of this work, and would likely require treating the database management system as other than a black box.

The approach presented in this paper is sufficiently generic to allow an application to rely on any block-based PIR system, including single-server, multi-server, and coprocessor-assisted variants. We assume an adversary with the same capability as that assumed for the underlying PIR protocol. The two common adversary capabilities considered in theoretical private information retrieval schemes are the curious passive adversary and the byzantine adversary [4], [13]. Either of these adversaries can be a database administrator or any other insider to a PIR server.

A curious passive adversary can observe PIR-encoded queries, but should be incapable of decoding the content. In addition, it should not be possible to differentiate between queries or identify the data that makes up the result of a query. In our context, the information this adversary can observe is the desensitized SQL query from the client and the PIR queries. The information obtained from the desensitized query does not compromise the privacy of the user’s query, since it does not contain any private constants. Similarly, the adversary cannot obtain any information from the PIR queries because PIR protocols are designed to be resistant against an adversary of this capability.

A byzantine adversary with additional capabilities is assumed for some multi-server PIR protocols [4], [21]. In this model, the data in some of the servers could be outdated, or some of the servers could be down, malfunctioning or malicious. Nevertheless, the client is still able to compute the correct result and determine which servers misbehaved, and the servers are still unable to learn the client’s query. Again, in our specific context, the adversary may compromise some of the servers in a multi-server PIR scenario by generating and obtaining the result for a substitute fake query or
executing the original query on these servers, but modifying some of the tuples in the results arbitrarily. The adversary may respond to a PIR request with a corrupted query result or even desist from acting on the request. Nevertheless, all of these active attack scenarios can be effectively mitigated with a byzantine-robust multi-server PIR scheme.

B. Data size assumptions
We service PIR requests using indexed data extracted from relational databases. The size of these data depends on the number of tuples resulting from the desensitized query. We note that even in the event that this desensitized query yields a small number of tuples (including just one), the privacy of the sensitive part of the SQL query is not compromised. The properties of PIR ensure that the adversary gains no information about the sensitive constants from observing the PIR protocol, over what he already knew by observing the desensitized query.
On the other hand, many database schemas are designed in a way that a number of relations will contain very few rows of data, all of which are meant to be retrieved and used by every user. Therefore, it is pointless to perform PIR operations on these items, since every user is expected to retrieve them all at some point. The adversary does not violate a user's query privacy by observing this public retrieval.

C. Avoiding server collusion
Information-theoretic PIR is generally more computationally efficient than computational PIR, but requires that the servers not collude if privacy is to be preserved; this is the same assumption commonly made in other privacy-preserving technologies, such as mix networks [10] and Tor [17]. We present scenarios in which collusion among servers is unlikely, yielding an opportunity to use the more efficient information-theoretic PIR.
The first scenario is when several independent service providers host a copy of the database. This applies to naturally distributed databases, such as Internet domain registries. In this particular instance, the problem of colluding servers is mitigated by practical business concerns. Realistically, the Internet domain database is maintained by different geographically dispersed organizations that are independent of the registrars that a user may query. However, different registrars would be responsible for the content's distribution to end users as well as integration of partners through banner ads and promotions. Since the registrars are operating in the same line of business where they compete to win users and deliver domain registry services, as well as having their own advertising models to reap economic benefits, there is no real incentive to collude in order to break the privacy of any user. In this model, it is feasible that a user would perform a domain name registration query on multiple registrars' servers concurrently. The user would then combine the results, without fear of the queries revealing its content. Additionally, individual service agreements can foreclose any chance of collusion with a third party on legal grounds. Users then enjoy greater confidence in using the service, and the registrars in turn can capitalize on revenue generation opportunities such as pay-per-use subscriptions and revenue-sharing ad opportunities.
The second scenario that offers less danger of collusion is when the query needs to be private only for a short time. In this case, the user may be comfortable with knowing that by the time the servers collude in order to learn her query, the query's privacy is no longer required.
Note that even in scenarios where collusion cannot be forestalled, our system can still use any computational PIR protocol; recent such protocols [1], [45] offer considerable efficiency improvements over previous work in the area.

V. HIDING SENSITIVE CONSTANTS
A. Overview
Our approach is to preserve the privacy of sensitive data within the WHERE and HAVING predicates of an SQL query. For brevity, we will focus on the WHERE clause; a similar processing procedure applies to the HAVING clause. This may require the user (or application) to specify the constants that may be sensitive. For the example query in Listing 3, the domain name is sensitive because it could presumably be used for domain name front running, and the creation date may be sensitive as well.

Listing 3 Example SQL query with a WHERE clause featuring sensitive domain name information.
    SELECT t1.contact, t1.email,
           t2.created, t2.expiry
    FROM registrar t1, regdomains t2
    WHERE (t1.reg_id = t2.reg_id) AND
          (t2.created > 20090101) AND
          (t2.domain = 'anydomain.com')

Our approach splits the processing of SQL queries containing sensitive data into two stages. In the first stage, the client computes a public subquery, which is simply the original query that has been stripped of the predicate conditions containing sensitive data. The client sends this subquery to the server, and the server executes it to obtain a result for the subquery. The desired result for the original query is contained within the subquery result, but the database is not aware of the particular tuples that are of interest.
In the second stage, the client performs PIR operations to retrieve the tuples of interest from the subquery result. To enable this, the database creates a cached index on the subquery result and sends metadata for querying the index to the client. The client subsequently performs PIR retrievals on the index and finally combines the retrieved items to build the result for the original query. An alternative approach to storing materialized tuples or subquery results in an index

is to maintain index entries as references to actual database tuples. In other words, each index entry will simply store keys and reference database tuples. An index built using this approach can be considered as maintaining a 'view' of the subquery result (i.e., no data materialization). The approach offers space savings, but will incur considerable performance overhead. PIR queries over such indices necessarily require individual fetching of all tuples in the original subquery result (at worst), or systematic range-based fetches (at best). These operations will be slow and much more complex to implement. For these reasons, our approach explores indices built on materialized data.
The important benefits of this approach as compared with the previous approach [35] are the optimizations realizable from having the database execute the non-private subquery, and the smaller number of PIR operations required to retrieve the data of interest. In addition, the PIR operations are performed against a cached index which will usually be smaller than the complete database. This is particularly true if there are joins and non-private conditions in the WHERE clause that constrain the tuples in the query result. In particular, a single PIR query is needed for point queries on hash table indices, while range queries on B+ tree indices are performed on fewer data blocks. Figure 1 illustrates the sequence of events during a query evaluation.

[Figure 1: sequence diagram. Participants: Alice (client); PIR Server, Database, and File System (server). Messages, in order: subquery; subquery; subquery result; index on subquery result; index helper data; PIR query q(i) on index; PIR retrieval of q(i); PIR result; PIR result; ...; compute query result.]
Figure 1. A sequence diagram for evaluating Alice's private query over a PIR-enabled relational database.

We note that often, the non-private subqueries will be common to many users, and the database does not need to execute them every time a user makes a request. Nevertheless, our algorithm details, presented next in Section V-B, show the steps for processing a subquery and generating indices. Such details are useful in an ad hoc environment, where the shape of a query is unknown to the database a priori; each user writes his or her own query as needed. Our assumption is that revealing the shape of a query will not violate the users' privacy (see Section IV).

B. Algorithm
We describe our algorithm with an example by assuming an information-theoretic PIR setup with two replicated servers. We focus on hiding sensitive constants in the predicates of the WHERE clause. The algorithm details for the SELECT query in Listing 3 follow. We assume the date 20090101 and the domain anydomain.com are private.
Step 1: The client builds an attribute list, a constraint list, and a desensitized SELECT query, using the attribute names and the WHERE conditions of the input query. We refer to the desensitized query as a subquery.
To begin, initialize the attribute list to the attribute names in the query's SELECT clause, the constraint list to be empty, and the subquery to the SELECT and FROM clauses of the original query.
• Attribute list: {t1.contact, t1.email, t2.created, t2.expiry}
• Constraint list: {}
• Subquery:
    SELECT t1.contact, t1.email,
           t2.created, t2.expiry
    FROM registrar t1, regdomains t2
Next, consider each WHERE condition in turn. If a condition features a private constant, then
• add the attribute name to the attribute list (if not already in the list)
• add (attribute name, constant value, operator) to the constraint list
Otherwise
• add the condition to the subquery
On completing the above steps, the attribute list and the constraint list for the input query become:
• Attribute list: {t1.contact, t1.email, t2.created, t2.expiry, t2.domain}
• Constraint list: {(t2.created, 20090101, >), (t2.domain, 'anydomain.com', =)}
The subquery, which is a SELECT query with reduced conditions, is shown in Listing 4.

Listing 4 Example subquery with reduced conditions.
    SELECT t1.contact, t1.email,
           t2.created, t2.expiry, t2.domain
    FROM registrar t1, regdomains t2
    WHERE (t1.reg_id = t2.reg_id)

Step 2: The client sends to each server
• the subquery
• a key attribute name
• an index file type
The key attribute name is selected from the attribute names in the constraint list, namely t2.created,

t2.domain in our example. The choice may either be • builds the desired query result from the data retrieved
random, made by the application designer, or determined by with PIR.
a client optimizer component with some domain knowledge The encoding of a private constant in a PIR query proceeds
that could enable it to make an optimal choice. One way to as follows. For PIR queries over a hash-based index, the
make a good choice is to consider the selectivity — the ratio client computes the hash for the private constant using the
of the number of distinct values taken to the total number PHF functions derived from the metadata (see footnote 1). This hash is also
of tuples — expected for each constraint list attribute, and the block number in the hash table index on the servers.
then choose the one that is most selective. This ensures the This block number is input to the PIR scheme to compute
selection of attributes with unique key values before less the PIR query for each server. For a B+ tree index, the user
selective attributes. For example, in a patent database, the compares the private value for the key attribute with the
patent number is a better choice for a key than the author’s values in the root of the tree. The root of the tree is extracted
gender. A poor choice of key can lead to more rounds from the metadata it receives from the server. Each key value
of PIR queries than necessary. Point queries on a unique in this root maintains block numbers for the children blocks
key attribute can be completed with a single PIR query. or nodes. The block number corresponding to the appropriate
Similarly, a good choice of key will reduce the number of child node will be the input to the PIR scheme.
PIR queries for range queries. For the example query, we For hash-based indices, a single PIR query is sufficient
choose t2.domain as the key attribute name. to retrieve the block containing the data of interest from
For the index file type, either a PHF or a B+ tree index the hash table. For B+ tree indices, however, the client
type is specified. Other index structures may be possible, uses PIR to traverse the tree. Each block can hold some
with additional investigation, but these are the ones we number m of keys, and at a block level, the B+ tree can be
currently support. More details on the selection of index considered an m-ary tree. The client has already been sent
types is provided below. the root block of the tree, which contains the top m keys.
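Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the prototype's code: the `Condition` record, the `desensitize` and `choose_key` helpers, and the selectivity estimates are all hypothetical, and the WHERE conditions are supplied already parsed instead of being extracted from SQL text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    """One WHERE condition (hypothetical record, already parsed)."""
    attribute: str          # e.g. "t2.domain"
    operator: str           # e.g. "=", ">"
    value: object           # a constant, or another attribute name for a join
    private: bool = False   # True when the constant is sensitive

def desensitize(select_attrs, from_clause, conditions):
    """Step 1: build the attribute list, constraint list, and subquery."""
    attr_list = list(select_attrs)
    constraint_list = []
    public_conditions = []
    for c in conditions:
        if c.private:
            # The client needs this attribute later for index keys and filtering.
            if c.attribute not in attr_list:
                attr_list.append(c.attribute)
            constraint_list.append((c.attribute, c.value, c.operator))
        else:
            public_conditions.append(f"({c.attribute} {c.operator} {c.value})")
    subquery = f"SELECT {', '.join(attr_list)} FROM {from_clause}"
    if public_conditions:
        subquery += " WHERE " + " AND ".join(public_conditions)
    return attr_list, constraint_list, subquery

def choose_key(constraint_list, selectivity):
    """Step 2: pick the constraint attribute with the highest selectivity."""
    return max((attr for attr, _, _ in constraint_list),
               key=lambda attr: selectivity[attr])

conditions = [
    Condition("t1.reg_id", "=", "t2.reg_id"),
    Condition("t2.created", ">", 20090101, private=True),
    Condition("t2.domain", "=", "anydomain.com", private=True),
]
attrs, constraints, subquery = desensitize(
    ["t1.contact", "t1.email", "t2.created", "t2.expiry"],
    "registrar t1, regdomains t2",
    conditions,
)
# Selectivity estimates below are made up for illustration only.
key = choose_key(constraints, {"t2.created": 0.002, "t2.domain": 1.0})
print(subquery)
print(constraints, key)
```

With the example query, the sketch reproduces the subquery of Listing 4 and selects t2.domain as the key attribute, since a domain name is assumed to be (nearly) unique per tuple.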
Step 3: Each server Using this information, the client can perform a single PIR
• executes the subquery on its relational database block query to fetch one of the m blocks so referenced.
• generates a cached index of the specified type on the It repeats this process until it reaches the leaves of the
subquery result, using the key attribute name tree, at which point it fetches the required data with further
• returns metadata for searching the indices to the client PIR queries. The actual number of PIR queries depends on
The server computes the size of the subquery result. If it the height of the (balanced) tree, and the number of tuples
can send the entire result more cheaply than performing in the result set. Traversals of B+ tree indices with our
PIR operations on it, it does so. Otherwise, it proceeds with approach are oblivious in that they leak no information about
the index generation. For hash table indices, the server first nodes’ access pattern; we realize retrieval of a node’s data
computes the perfect hash functions for the key attribute as a PIR operation over the data set of all nodes in the
values. Then it evaluates each key and inserts each tuple tree. In other words, it does not matter which particular
into a hash table. The metadata that is returned to the client branch of a B+ tree is the location for the next block to be
for hash-based indices consists of the PHF parameters, the retrieved. We do not restrict PIR operations to the subset of
count of tuples in the hash table, and some PIR-specific blocks in the subtree rooted at that branch. Instead, each PIR
initialization parameters. operation considers the set of blocks in the entire B + tree.
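The hash-index path can be illustrated with a toy example. The sketch below makes several assumptions that are not from the prototype: it finds a perfect hash by brute-force seed search instead of using CMPH, it uses the classic two-server XOR scheme as a stand-in for the PIR scheme in Percy++, and the records, block size, and function names are hypothetical. It shows the flow described above: the server returns the hash parameters and tuple count as metadata, the client derives the block number locally, and each server sees only a random-looking bit vector.

```python
import hashlib
import secrets

BLOCK_SIZE = 32  # bytes per block; an assumed value for the sketch

def slot(key: str, seed: int, n: int) -> int:
    """Hash a key into one of n slots under the given seed."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

def build_hash_index(rows, key_of):
    """Server side: search for a seed that maps the n keys to n distinct slots."""
    keys = [key_of(r) for r in rows]
    n = len(keys)
    if len(set(keys)) != n:
        raise ValueError("a hash index needs unique key values")
    seed = 0
    while len({slot(k, seed, n) for k in keys}) != n:
        seed += 1
    blocks = [b""] * n
    for r in rows:
        blocks[slot(key_of(r), seed, n)] = r.encode().ljust(BLOCK_SIZE, b"\0")
    metadata = {"seed": seed, "count": n}  # what the client gets back
    return blocks, metadata

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def server_answer(blocks, query_bits):
    """Each server XORs together the blocks selected by its query vector."""
    acc = bytes(BLOCK_SIZE)
    for bit, block in zip(query_bits, blocks):
        if bit:
            acc = xor_bytes(acc, block)
    return acc

def pir_fetch(replica_a, replica_b, block_no):
    """Client side: two query vectors that differ only at block_no."""
    q_a = [secrets.randbelow(2) for _ in replica_a]
    q_b = list(q_a)
    q_b[block_no] ^= 1
    return xor_bytes(server_answer(replica_a, q_a), server_answer(replica_b, q_b))

# Hypothetical subquery result, one "domain|created" record per tuple.
rows = ["anydomain.com|20090315", "example.org|20080101",
        "sample.net|20090512", "test.info|20071130"]
blocks, metadata = build_hash_index(rows, key_of=lambda r: r.split("|")[0])

# The client computes the block number from the metadata alone.
block_no = slot("anydomain.com", metadata["seed"], metadata["count"])
record = pir_fetch(blocks, list(blocks), block_no).rstrip(b"\0").decode()
print(record)
```

Because the two query vectors are individually uniform, neither replica learns which block was requested as long as the two do not collude; a single PIR round suffices for the point query.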
For B+ tree indices, the server bulk inserts the subquery Range queries that retrieve data from different subtrees leak
result into a new B+ tree index file. B+ tree bulk insertion no information about to which subtree a particular piece
algorithms provide a high-speed technique for building a of data belongs. The only information the server learns is
tree from existing data [2]. The server also returns metadata the number of blocks retrieved by such a query. Therefore,
to the client, including the size of the tree and its first data specific implementations may utilize dummy queries to
block (the root). Generated indices are stored in a disk cache, prevent the server from learning the amount of useful data
external to the database, unlike native database indices. retrieved by a query [36].
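The B+ tree traversal can be sketched as follows. This is an illustrative toy, not the TPIE-based index: the bulk loader, the block layout, and the `PirServerStub` class are hypothetical, and the stub simply returns the requested block where a real PIR scheme would hide the block number. The point it shows is that the client starts from the root it already holds, issues one PIR fetch per tree level, and every fetch is made against the full set of blocks, so the server can count fetches but cannot tell which branch was followed. As a simplification, leaf blocks here hold the tuple data directly.

```python
from bisect import bisect_right

class PirServerStub:
    """Stand-in for a PIR server (assumption: a real scheme hides block_no)."""
    def __init__(self, blocks):
        self.blocks = blocks
        self.fetches = 0  # the only thing the server observes in this model

    def fetch(self, block_no):
        self.fetches += 1
        return self.blocks[block_no]

def bulk_load(sorted_items, m):
    """Pack sorted (key, value) pairs into leaves of m entries, then build
    internal levels bottom-up. Returns (all blocks, root block)."""
    blocks, level = [], []
    for i in range(0, len(sorted_items), m):
        chunk = sorted_items[i:i + m]
        blocks.append({"leaf": True,
                       "keys": [k for k, _ in chunk],
                       "values": [v for _, v in chunk]})
        level.append((chunk[0][0], len(blocks) - 1))
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), m):
            group = level[i:i + m]
            blocks.append({"leaf": False,
                           "keys": [k for k, _ in group[1:]],  # separators
                           "children": [b for _, b in group]})
            parents.append((group[0][0], len(blocks) - 1))
        level = parents
    return blocks, blocks[level[0][1]]

def lookup(root, server, key):
    """Client side: walk down from the root with one PIR fetch per level."""
    node = root
    while not node["leaf"]:
        child = node["children"][bisect_right(node["keys"], key)]
        node = server.fetch(child)  # fetched from the set of ALL blocks
    if key in node["keys"]:
        return node["values"][node["keys"].index(key)]
    return None

items = [(k, f"tuple-{k}") for k in range(27)]   # hypothetical subquery result
blocks, root = bulk_load(items, m=3)             # root is sent as metadata
server = PirServerStub(blocks)
print(lookup(root, server, 14), server.fetches)
```

With 27 keys and m = 3 the tree has a root, one internal level, and the leaves, so a point lookup costs two PIR fetches beyond the root.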
Step 4: The client receives the responses from the servers To compute the final query result, the client applies
and verifies they are of the appropriate length. For a byzan- the other private conditions in the constraint list to the
tine robust multi-server PIR, a client may choose to proceed result obtained with PIR. For the example query, the client
in spite of errors resulting from non-responding servers or filters out all tuples with t2.created not greater than
from responses that are of inconsistent length. 20090101 from the tuple data returned with PIR. The
Next, the client remaining tuples give the final query result.
• performs one or more keyword-based PIR queries, Footnote 1: Using the CMPH Library [7] for example, the client saves the PHF data
using the value associated with the key attribute name from the metadata into a file. It reopens this file and uses it to compute a
from the constraint list, and hash by following appropriate API call sequences.
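The final client-side filtering in Step 4 amounts to evaluating the remaining constraint-list entries against the tuples retrieved with PIR. The sketch below is a minimal illustration under assumptions: the tuples are represented as dictionaries, the `apply_constraints` helper is hypothetical, and the retrieved rows are invented for the example.

```python
import operator

# Relational operators as they appear in the constraint list.
OPERATORS = {
    "=": operator.eq, "<>": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}

def apply_constraints(tuples, constraint_list, key_attribute):
    """Keep the tuples that satisfy every private condition other than the
    one on the key attribute, which the PIR retrieval already enforced."""
    remaining = [c for c in constraint_list if c[0] != key_attribute]
    return [
        t for t in tuples
        if all(OPERATORS[op](t[attr], value) for attr, value, op in remaining)
    ]

constraint_list = [("t2.created", 20090101, ">"),
                   ("t2.domain", "anydomain.com", "=")]

# Hypothetical tuples returned by PIR for the key value 'anydomain.com'.
retrieved = [
    {"t1.contact": "A. Registrar", "t1.email": "a@example.org",
     "t2.created": 20081231, "t2.expiry": 20101231,
     "t2.domain": "anydomain.com"},
    {"t1.contact": "B. Registrar", "t1.email": "b@example.org",
     "t2.created": 20090315, "t2.expiry": 20110315,
     "t2.domain": "anydomain.com"},
]

result = apply_constraints(retrieved, constraint_list, key_attribute="t2.domain")
print(result)
```

Only the second tuple survives, because its creation date is greater than 20090101; the server never sees either private constant.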

Capabilities for dealing with complex queries can be built into the client. For example, it may be more efficient to request a single index keyed on the concatenation of two attributes than separate indices. If the client requests separate indices, it will subsequently perform PIR queries on each of those indices, using the private value associated with each attribute from the constraint list. Finally, the client combines the partial results obtained from the queries with set operations (union, intersection), and performs local filtering on the combined result, using private constant values for any remaining conditions in the constraint list to compute the final query result. The client thus needs query-optimization capabilities in addition to the regular query optimization performed by the server. This is an open area of work closely related to database optimization.

VI. DISCUSSION
In this section, we discuss important architectural components and design decisions related to the algorithm presented in Section V.

A. Parsing SQL queries
The algorithm parses an input query, the WHERE and HAVING clauses in particular. Other subclauses of the SELECT statements, such as GROUP BY and ORDER BY, can either be processed as part of a subquery or applied on the result obtained with PIR. Specific implementations can adopt the mature parsers developed with open source and commercial databases.
The expression tree provides an easy way to construct the desensitized query and the constraint list. The parsing process builds an expression tree representation for the WHERE clause conditions. The internal nodes of this expression tree typically contain arithmetic, relational, and logical operators, while the leaf nodes consist of attribute names and constants. Any WHERE clause predicate expression can be a join, a non-private condition or a private condition. The latter contains a sensitive constant value, whereas the former two do not. Our parser allows the user to tag sensitive constants with the symbol "#" to differentiate them from public constants. For example, the sensitive constant '20090511' is tagged in this query: SELECT * FROM table WHERE n = 20090605 AND p = #20090511. Each WHERE clause condition is related to another condition with the logical AND. Logical OR conditions are not considered as expression delimiters, but disjunct multiple subexpressions in the same condition. Typically, relational databases convert the WHERE clause conditions in the input query to an equivalent set of conditions in the conjunctive normal form, to facilitate query optimization.
For an example of AND and OR, consider the two SELECT queries below, which differ only in their WHERE clause conditions.
    (i) SELECT * FROM table WHERE a = 'SQL' AND b = 'LEX'
    (ii) SELECT * FROM table WHERE a = 'SQL' OR b = 'LEX'
The client can compute the result for (i) using either one or two indices, whereas it requires two indices to compute the result for (ii). To compute the result for (i) with a single index, the client requests an index for a or b because both of the conditions in the WHERE clause can only be true if one of them is true. If it requests an index for a, it will first perform keyword-based PIR using the literal 'SQL' over this index, and then filter the result obtained with the second condition b = 'LEX'. To compute either (i) or (ii) with two indices, the client requests indices for both a and b, and then performs two keyword-based PIR searches using the string literals 'SQL' and 'LEX' over the respective indices. Finally, the client computes the intersection of the tuples in the two PIR results to obtain the result for (i), or it computes the union to obtain the result for (ii).
We note that a worst case query scenario having several private conditions combined with an OR operator will have storage and computational costs linear in the number of unique attribute names used with the private conditions. In certain circumstances, it may be possible to eliminate the storage cost by maintaining references to the tuples data in the database rather than maintaining a materialized copy in an index.
Currently, logical NOT conditions cannot be processed with PIR. We are unable to find any practical PIR scenario to justify its use. For example, performing PIR queries on a patent database will generally not require a NOT operator. We prescribe client-side processing for NOTs, after the data required for evaluating the condition are retrieved with PIR.
This expression tree is traversed twice. The first traversal lists the desensitized query's WHERE conditions, which includes all joins and all non-private conditions. The logical AND operator combines the joins and the non-private conditions. The boolean true value can serve as a placeholder for every private condition. For example, the actual WHERE clause for the subquery in Listing 4 can be WHERE (t1.reg_id = t2.reg_id) AND true AND true, which can be subsequently optimized. The second traversal lists the private conditions, which are used to build the constraint list.

B. Indexing subquery results
For many general purposes, it may be impractical to execute the desensitized query and generate an index on the query result for every request. The use of an index cache addresses some of the cost, because the database can use the same cached index to serve multiple PIR queries (with the same private attributes, though not necessarily the same private constants) from multiple users. This mitigates the computational costs for generating indices. An exception for

the use of a cache is when the shape of the input query is first retrieve the data from the server with PIR, and then
unpredictable, especially in an environment where the users perform a more sophisticated filtering on the result, using
make ad hoc queries. In this case, a separate index must be the wildcard expression.
generated for each unique query. An IN condition has the general form column IN
(literal1, literal2, ...). If the attribute column has unique
C. Database servers values, then the tuple associated with each literal can be
Practical implementations could use any commercial or retrieved with a point query on the same index over the
open-source database server to execute the desensitized column attribute. Some PIR implementations, such as [20],
query. The client does not need to install database client can simultaneously retrieve multiple blocks for a set of literal
programs to query the database server in the privacy- values in a single query. Otherwise, a combination of range
friendly manner we describe; however, the client will need and point queries will be required. The client optimizer can
an installation of the private SQL client that implements the be built to intelligently combine literal values to reduce the
client-side logic of the algorithm. Similarly, a program that overall number of PIR queries.
implements the server-side logic of the algorithm must be Client-side support for database function evaluation is
installed at the server. required when private constants are used as function param-
eters in a WHERE clause expression. Such functions can be
D. Processing specific conditions
evaluated before the data required are retrieved with PIR, or
We provide an overview on how to deal with private afterwards. The latter follows for functions that take private
constants in specific conditions of the WHERE clause. constants and attribute names as parameters.
In particular, we consider simple conditions, as well as We note that special WHERE clause conditions, such
specialized conditions such as BETWEEN, LIKE, and IN. as IS NULL and IS NOT NULL, do not require any
A simple WHERE clause condition consists of the general private constants. It would suffice to include them in the
form column relop literal or literal relop column, where desensitized query in many situations. Alternatively, they
relop is a relational operator, such as =, <>, <, >, <=, and could be processed locally, especially for ad hoc queries,
>=. If the column is used to index a query result, then the if they are considered to reveal sensitive information about
literal will be used as input to the keyword-based PIR. The the tuples of interest.
operator “=” indicates a point query. If the key attribute Finally, an implementation may decide to localize the
column is unique, then a single result is expected; either processing of all the above conditions, as well as other
a hash or a B+ tree index is appropriate. On the other conditions of the WHERE clause. The approach to adopt
hand, a B+ tree is preferred for non-unique key values, depends on the amount of optimization the client is capable
since there may be multiple tuples in the query result. The of performing and the requirements of the application
other operators, which imply range queries, require B+ domain.
tree indices. The literal or its next or previous neighbours
from the domain of values for the data type, in sorted or VII. I MPLEMENTATION AND M ICROBENCHMARKS
lexicographical order, provide one of the values for the range
search. The other value is determined from the smallest or A. Implementation
largest value in the domain for the data type. The input We developed a prototype implementation of our algo-
values for range search for the condition t2.created > rithm to hide the sensitive portions of SQL queries using
20090101, for example, are (20090102, 99991231). generally available open source C++ libraries and databases.
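The mapping from a relational operator to a pair of range-search values can be illustrated for the integer-encoded dates used in the running example. This is a sketch under assumptions: the `range_bounds` helper is hypothetical, the domain is taken to be the calendar dates representable by Python's `datetime.date` (whose largest value, 9999-12-31, matches the 99991231 bound in the text), and the smallest domain value is an assumption.

```python
from datetime import date, timedelta

DOMAIN_MIN = date.min   # 0001-01-01, an assumed smallest value for the domain
DOMAIN_MAX = date.max   # 9999-12-31, i.e. 99991231 as in the text

def to_int(d: date) -> int:
    """Encode a date as a YYYYMMDD integer."""
    return d.year * 10000 + d.month * 100 + d.day

def from_int(n: int) -> date:
    """Decode a YYYYMMDD integer into a date."""
    return date(n // 10000, n // 100 % 100, n % 100)

def range_bounds(op: str, literal: int):
    """Return the (low, high) values for a range search on the key attribute.

    One end comes from the literal or its neighbour in sorted order, the
    other from the smallest or largest value in the domain. Literals at the
    very edge of the domain are not handled in this sketch.
    """
    d = from_int(literal)
    one_day = timedelta(days=1)
    if op == ">":
        low, high = d + one_day, DOMAIN_MAX
    elif op == ">=":
        low, high = d, DOMAIN_MAX
    elif op == "<":
        low, high = DOMAIN_MIN, d - one_day
    elif op == "<=":
        low, high = DOMAIN_MIN, d
    else:
        raise ValueError(f"not a range operator: {op}")
    return to_int(low), to_int(high)

print(range_bounds(">", 20090101))
print(range_bounds("<=", 20090131))
```

Using real date arithmetic for the neighbour, instead of adding 1 to the integer, keeps the bound a valid date at month and year boundaries (the next neighbour of 20090131 is 20090201).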
A BETWEEN condition has the general form column We developed a command-line tool to act as the client,
BETWEEN literal1 AND literal2, which is equivalent and a server-side database adapter to provide the functions
to the condition column >= literal1 AND column <= of a PIR server. For the PIR functions, we used the
literal2 . This condition is processed as a range query on Percy++ PIR Library [20], [21], which offers three varieties
the two literal values. of privacy protection: computational, information theoretic
A LIKE condition has the form column LIKE literal, and hybrid (a combination of both). We extended Percy++
where literal is a search condition that involves one or more to support keyword-based PIR. For generating hash table
wildcards, such as % and _. The _ allows for the matching indices for point queries, we used the C Minimal Perfect
of a single character, while the % allows for matching strings Hash (CMPH) Library [7], [8], version 0.9. We used the
of any length, including zero-length strings. Prefix-based API for CMPH to generate minimum perfect hash functions
conditions, such as domain LIKE 'some%', and suffix- for large data sets from query results; these perfect hash
based ones, such as domain LIKE '%main.com' can functions require small amounts of disk storage per key. For
easily be processed with a B+ tree index over the attribute, building B+ tree indices for range queries on large data sets,
or the reverse of the attribute, respectively. Other variants we used the Transparent Parallel I/O Environment (TPIE)
are more easily processed in the client; the client would Library [16], [44]. Finally, we base the implementation

on the MySQL [42] relational database, version 5.1.37-1ubuntu5.1.

B. Experimental setup
We began evaluating our prototype implementation using a set of six whois-style queries from Reardon et al. [35], which is the most appropriate existing microbenchmark for our approach. We explored tests using industry-standard database benchmarks, such as the Transaction Processing Performance Council (TPC) [43] benchmarks, and open-source benchmarking kits such as Open Source Development Labs Database Test Suite (OSDL DTS) [46], but none of the tests from these benchmarks is suitable for evaluating our prototype, as their test databases cannot be readily fitted into a scenario that would make applying PIR meaningful. For example, a database schema that is based on completing online orders will only serve very limited purpose to our goal of protecting the privacy of sensitive information within a query.
We ran the microbenchmark tests using two whois-style data sets, similar to those generated for the evaluation of TransPIR [35]. The smaller data set consists of 10^6 domain name registration tuples, and 0.75 × 10^6 registrar and registrant contact information tuples. The second data set similarly consists of 4 × 10^6 and 3 × 10^6 tuples respectively. We describe the evaluation queries and the two database relations in Appendices B and C. We choose the predicate parameters for the benchmark queries to ensure query selectivity values (ratio of the number of matching tuples to the total number of tuples) similar to those used in the original benchmarking of TransPIR [35]. The respective values for benchmark queries Q1 through Q6 for the small data set are 1.00 × 10^-6, 2.00 × 10^-5, 4.20 × 10^-5, 5.90 × 10^-5, 1.33 × 10^-6, and 3.87 × 10^-2. For the large data set they are 2.50 × 10^-7, 2.00 × 10^-5, 4.20 × 10^-5, 5.90 × 10^-5, 2.50 × 10^-7, and 4.20 × 10^-5.
The measurements for all test queries are based on the default behaviour of the TPIE Library with respect to determining the branching factor λ for B+ tree indices. The following expression shows the computation of branching factor with this default configuration:

\[
\lambda = \left\lfloor \frac{\gamma \times \mathrm{size}(\mathit{os\_block}) - \mathrm{size}(\mathit{BID}) - \mathrm{size}(\mathit{size\_t})}{\mathrm{size}(\mathit{Key}) + \mathrm{size}(\mathit{BID})} \right\rfloor
\]

Where γ, os_block, BID, size_t, and Key are respectively the data logical blocking factor, operating system block, block ID, C++ size_t data type, and the key. size(x) is the size of x in bytes. Specifically for our experimental setup for the large data set, the branching factor for indices over integer keys is 2730, and it is 409 for indices over character keys. For the small data set, these values are respectively 1634 and 215. Our implementation stores integer keys in 8 bytes, and character keys in 72 bytes. The branching factor values are based on a block size of 32 KB (γ = 8), and 16 KB (γ = 4), where size(os_block) = 4096 bytes. The actual fill factor (again, the default for the TPIE Library) is 0.6 for internal B+ tree nodes. We report the disk sizes for indices built for our experiment on the queries with complex conditions (see Table II).
We ran all the experiments on a server with two quad-core 2.50 GHz Intel Xeon E5420 CPUs, 8 GB RAM, and running Ubuntu Linux 9.10. We used the information-theoretic PIR support of Percy++, with two database replicas. The server also runs a local installation of a MySQL database.

C. Result overview
The results from the benchmark tests indicate that while our current prototype incurs some storage and computational costs over non-private queries, the costs seem entirely acceptable for the added privacy benefit (see Table I later in this section and Table II in Section VIII). In addition to being able to deal with complex queries and leverage database optimization opportunities, our prototype performs much better than the TransPIR prototype from Reardon et al. [35]: between 7 and 480 times faster for equivalent data sets. The most indicative factor of performance improvements with our prototype is the reduction in the number of PIR queries in most cases. Other factors that may affect the validity of the result, such as variations in implementation libraries, are assumed to have negligible impact on performance. Our work is based on the same PIR library as that of [35]. Our comparison is based on the measurements we took by compiling and running the code for TransPIR on the same experimental hardware platform as our prototype. We also used the same underlying PIR library as TransPIR. We initially attempted to run the microbenchmarking tests for the larger data with TransPIR on the development hardware platform for our prototype, but were limited by this commodity hardware because TransPIR requires a 64-bit processor and a minimum of 6 GB RAM to index or preprocess the larger data set. The development hardware had a 32-bit processor and only 3 GB RAM.

D. Microbenchmark experiment
We executed the six whois-style benchmark queries over the data sets and obtained measurements for the time to execute the private query, the number of PIR queries performed, the number of tuples in the query results, the time to execute the subquery and generate the cached index, and the total data transfer between the client and the two PIR servers. Table I shows the results of the experiment. The cost of indexing (QI) can be amortized over multiple queries. The indexing measurements for BTREE (and HASH) consist of the time spent retrieving data from the database (subquery execution), writing the data (subquery result) to a file and building an index from this file. Since TransPIR is not integrated with any relational database, it does not incur the same database retrieval and file writing costs. However,

Table I
Experimental results for benchmark tests on the small data set compared with those of Reardon et al. [35]. BTREE = result for our B+ tree prototype, HASH = result for our hash table prototype, and TransPIR = result from TransPIR [35]; Time = time to evaluate private query, PIRs = number of PIR operations performed, Tuples = count of rows in query result, QI = timing for subquery execution and cached index generation, Xfer = total data transfer between the client and the two PIR servers.

Small database with .75 M contact records, 1 M registration records, and 16 KB blocks
Query  Approach  Time (s)  PIRs  Tuples  QI (s)  Xfer (KB)
Q1     HASH             0     1       1       4         64
       BTREE            6     4       1       9        256
       TransPIR         7     2       1     120        128
Q2     BTREE            3     3      20       7        192
       TransPIR        76    23      20     120      1,472
Q3     BTREE            3     3      42       7        192
       TransPIR       149    45      42     120      2,880
Q4     BTREE           13     3      59       8        256
       TransPIR       217    62      59     120      3,968
Q5     BTREE            5     4       1      13        256
       TransPIR        10     3       1     120        192
Q6‡    BTREE            5     3      29      13        192
       TransPIR       558   111      42    n/a‡      7,104

Large database with 3 M contact records, 4 M registration records, and 32 KB blocks
Query  Approach  Time (s)  PIRs  Tuples  QI (s)  Xfer (KB)
Q1     HASH             2     1       1      16        128
       BTREE            4     3       1      38        384
       TransPIR        25     2       1   1,017        256
Q2     BTREE            5     4      80      32        512
       TransPIR       999    83      80   1,017     10,624
Q3     BTREE            5     4     168      32        512
       TransPIR     2,055   171     168   1,017     21,888
Q4     BTREE            6     5     236      37        640
       TransPIR     2,885   240     236   1,017     30,720
Q5     BTREE            5     3       1      67        384
       TransPIR        37     3       1   1,017        384
Q6‡    BTREE            5     4     168      66        512
       TransPIR     3,087   253     127    n/a‡     32,384

‡ We reproduced TransPIR's measurements from [35] for query Q6 because we could not get TransPIR to run Q6 due to program errors. The 'n/a' under QI indicates measurements missing from [35].

TransPIR incurs a one-time preprocessing cost (QI) which prepares the database for subsequent query runs. Comparing this cost to its indexing counterpart with our BTREE and HASH prototypes shows that our methods are over an order of magnitude faster.

E. Discussion
The empirical results for the benchmark tests reflect the benefit of our approach. For all of the tests, we mostly base our comparison on the timing for query evaluation with PIR (Time), and sometimes on the index generation timing (QI). The time to transfer data between the client and the servers is directly proportional to the amount of data (Xfer), but we will not use it for comparison purposes because the test queries were not run over a network.
Our hash index (HASH) prototype performs the best for query Q1 on both data sets, followed by our B+ tree (BTREE) prototype; it achieves better performance for the large set. Query Q1 is a point query having a single condition on the domain name attribute.
Query Q2 is a point query on the expiry_date attribute, with the query result expected to have multiple tuples. Again, our BTREE prototype outperforms TransPIR by a significant margin for both data sets; the improvement is most noticeable for the large data set. The number of PIR queries required to evaluate Q2 with BTREE is 5% of the number required by TransPIR. A similar trend is repeated for Q3, Q4 and Q6. Note that the HASH prototype could not be used for Q2 because hash indices accept unique key attributes only; it can only return a single tuple in its query result.
Query Q3 is a range query on the expiry_date attribute. Our BTREE prototype was approximately 50 and 411 times faster than TransPIR for the small and large data sets, respectively. Of note is the large number of PIR queries that TransPIR needs to evaluate the query; for the large data set, our BTREE prototype requires only 2% of that number. We observed a similar trend for Q4, where BTREE was 17 and 480 times faster for the small and large sets respectively. This query features two conditions in the SQL WHERE clause. The combined measured time for BTREE (the time taken both to build an index to support the query and to run the query itself) is still 10 and 67 times faster than the time it takes TransPIR to execute the query alone.
Query Q5 is a point query with a single join. For the large data set, it took BTREE only about 14% of the time it took TransPIR. We observed the time our BTREE spent in executing the subquery to dominate; only a small fraction of the time is spent building the B+ tree index.
Our BTREE prototype similarly performs faster for Q6, with an order of magnitude similar to Q2, Q3, and Q4.
In all of the benchmark queries, the proposed approach performs better than TransPIR because it leverages database optimization opportunities, such as for the processing of subqueries. In contrast, TransPIR assumes a type of block-serving database that cannot give any optimization opportunity. Therefore, in our system the client is relieved from having to perform many traditional database functions, such as query processing, in addition to its regular PIR client functions.

VIII. COMPLEX QUERY EVALUATION
In addition to the above microbenchmarks, we performed two other experiments to evaluate our prototype. The first of these studies the behaviour of our prototype on complex input queries, such as aggregate queries, BETWEEN and LIKE queries, and queries with multiple WHERE clause conditions and joins. Each of these complex queries has varying privacy requirements for its sensitive constants. We ran the first experiment on the same hardware configuration as the microbenchmark tests, and the second experiment

on the developmental hardware platform. For the test data, query selectivity for complex queries CQ1 through CQ5 are 3.70 × 10^-6, 5.05 × 10^-2, 2.11 × 10^-6, 1.30 × 10^-3, and 5.34 × 10^-6 for the small data set. Similarly for the large data set, the values are 5.70 × 10^-7, 5.12 × 10^-2, 1.58 × 10^-6, 9.52 × 10^-7, and 1.50 × 10^-6.
In addition to the above microbenchmarks, we performed two other experiments to evaluate our prototype. The first of these studies the behaviour of our prototype on complex input queries, such as aggregate queries, BETWEEN and LIKE queries, and queries with multiple WHERE clause conditions and joins. Each of these complex queries has varying privacy requirements for its sensitive constants. The second experiment tests whether our prototype leverages database optimization. Both experiments were performed on the same commodity hardware configuration as the microbenchmark tests.

A. Result overview
The results obtained from the experiments demonstrate the benefits of our approach for dealing with complex queries. While the storage and computational costs (over non-private querying) remain, the overall performance and resource requirements are still reasonable for the added privacy benefit. The prototype requires additional storage for two types of indices used for PIR operations. The first type of index is generated for a particular shape of query, over one or more key attributes or combinations of attributes. These types of indices need permanent storage in the same manner as native indices for relational databases. The second type of index is used in an ad hoc environment, where the tuples in a subquery result can be constrained in an unpredictable manner, with one or more WHERE clause conditions and joins. These latter indices must be generated as needed. In most practical situations, it should be possible for indices to be based on the former type, just like most software systems rely on indices prebuilt by the database to efficiently run their queries.

B. Experiments on queries with complex conditions
We describe and present the results of experiments that examined the behaviour of our prototype when supplied with SQL queries that are more complex than the above microbenchmarks. We provide a number of synthetic query scenarios having different requirements for privacy, the corresponding SQL queries with appropriate tagging for the condition involving sensitive data, and the measurements. As mentioned above, our SQL parser uses the "#" character to tag private conditions; we include that tag in the SQL queries we present in Listing 5. We used the same database schema (see Appendix C) as the microbenchmarks. The measurements show execution duration for the original query without privacy provision over the MySQL database, the same query after removal of conditions with sensitive information over the MySQL database, and several other measurements taken from within our prototype using a B+ tree index. All of the measurements are reported in Table II.

Listing 5 Experimental SQL queries indicating private constants or conditions with "#".
CQ1 – Private point query on a range with sensitive domain name information.
SELECT domain, name, address,
       email, reg_date, expiry_date
FROM registration, contact
WHERE (contact_id = registrant) AND
      (reg_date > 20090501) AND
      (domain = #'somedomain.org')

CQ2 – Private range query with sensitive registrar ID range.
SELECT domain, name,
       address, email, expiry_date
FROM registration, contact
WHERE (contact_id = registrant) AND
      (status IN (1,4,5,7,9)) AND
      (registrar #BETWEEN 198542 AND 749999) AND
      (expiry_date BETWEEN 20090101 AND 20091031)

CQ3 – Private aggregate point query with sensitive registrar ID value.
SELECT registrar, count(domain)
FROM registration, contact
WHERE (contact_id = registrar) AND
      (registrar = #635393)
GROUP BY registrar
HAVING count(domain) > 0
ORDER BY registrar ASC

CQ4 – Non-private LIKE query revealing only the prefix of a domain name.
SELECT domain, name, address,
       email, reg_date, expiry_date
FROM registration, contact
WHERE (contact_id = registrant) AND
      (domain LIKE 'some%') AND
      (domain = #'somedomain.com')

CQ5 – Private LIKE query with domain name prefix as wildcard.
SELECT domain, name, address,
       email, reg_date, expiry_date
FROM registration, contact
WHERE (contact_id = registrant) AND
      (domain #LIKE 'some%')

Private point query (CQ1). The task is to obtain a domain name record from the whois server without revealing the sensitive domain name information.
Private range query (CQ2). The ICANN Security and Stability Advisory Committee may be interested in performing an investigation on some registrars, with IDs ranging from 198542 to 749999. The task is to privately obtain some domain name information without revealing the range of IDs for the registrars. We show the query to obtain registration records with status in the set (1, 4, 5, 7, 9), and expiration

Table II
Measurements taken from executing five complex SQL queries with varying requirements for privacy. oQm = timing for executing original SQL query directly against a MySQL database, BTREE = overall timing for meeting privacy requirements with our B+ tree prototype, rQp = subquery execution duration within BTREE, cI = timing for generating cached index within BTREE, Time = time to evaluate private query within BTREE, PIRs = number of PIR operations performed, Tuples = number of records in final query result, rTuples = number of indexed records in subquery result, Xfer = total data transfer between the client and the two PIR servers, Size = temporary storage space for cached index.

Small database with .75 M contact records, 1 M registration records, and 16 KB blocks
Query  oQm (s)  BTREE (s) =  rQp (s) +  cI (s) +  Time (s)  PIRs  Tuples    rTuples  Xfer (KB)  Size (MB)
CQ1          0            8          4         2         2     3       1    328,805        192     110.57
CQ2          0            2          0         0         2    17     686     13,594      1,088       4.57
CQ3          0           13          9         2         2     3       1    473,646        192     157.82
CQ4          0            0          0         0         0     2       1        768        128       0.32
CQ5          0           17          9         5         3     4       4    749,472        256     251.82

Large database with 3 M contact records, 4 M registration records, and 32 KB blocks
Query  oQm (s)  BTREE (s) =  rQp (s) +  cI (s) +  Time (s)  PIRs  Tuples    rTuples  Xfer (KB)  Size (MB)
CQ1          2           31         19        10         2     3       1  1,753,144        384     579.63
CQ2          1           15          2         0        13    41   3,716     72,568      5,248      25.13
CQ3          0           80         74         3         3     3       1    631,806        384     209.38
CQ4          2           25         12         7         5     3       1  1,050,300        384     348.63
CQ5          2           69         42        24         3     3       6  4,000,000        256   1,324.13
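
As a quick check of how the total time in Table II splits between index preparation and PIR evaluation, the following minimal sketch recomputes, for the large data set, the share of BTREE time spent on the subquery and cached index (rQp + cI). The numbers are copied from Table II; the variable names and the 50% threshold are illustrative assumptions, not part of the prototype.

```python
# Rows copied from Table II (large data set): (BTREE, rQp, cI, Time), in seconds.
large = {
    "CQ1": (31, 19, 10, 2),
    "CQ2": (15, 2, 0, 13),
    "CQ3": (80, 74, 3, 3),
    "CQ4": (25, 12, 7, 5),
    "CQ5": (69, 42, 24, 3),
}

# Share of the overall time spent preparing the index (subquery + index build).
prep_share = {q: (rqp + ci) / btree for q, (btree, rqp, ci, _time) in large.items()}

# Queries where index preparation does NOT dominate (illustrative 50% threshold).
exceptions = [q for q, share in prep_share.items() if share < 0.5]

for q, share in prep_share.items():
    print(f"{q}: {share:.0%} of BTREE time is subquery + index generation")
print("PIR-dominated queries:", exceptions)
```

Under these numbers only CQ2 falls below the threshold, since its small subquery result is cheap to index while its range query needs many PIR operations.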

dates between 20090101 and 20091031, without revealing the registrar ID range.
Private aggregate point query (CQ3). The task is to privately compute the total number of registrations sponsored by a particular registrar. The registrar ID is sensitive.
Non-private LIKE query (CQ4). The task is to efficiently retrieve a single domain name record from a whois server with some amount of privacy. In other words, a user wants to reveal a prefix of the domain name to improve performance, while still preventing the adversary from learning the exact textual domain name. Since many long domain names have a common prefix, the user intends to leverage that knowledge to improve query performance.
Private LIKE query (CQ5). The task is to retrieve registration records from a whois server without revealing the LIKE wildcard.
Results. We see from Table II that in most cases, the cost to evaluate the subquery and create the index dominates the total time to privately evaluate the query (BTREE), while the time to evaluate the query on the already-built index (Time) is minor. An exception is CQ2, which has a relatively small subquery result (rTuples), while having to do dozens of (consequently smaller) PIR operations to return thousands of results to the overall range query. Note that in all but CQ2, the time to privately evaluate the query on the already-built index is at most a few seconds longer than performing the query with no privacy at all; this underscores the advantage of using cached indices.
We note from our results that it is much more costly to have the client simply download the cached indices. We observe, for example, that it will take about 5 times as long, for a user with 10 Mbps download bandwidth, to download the index for CQ5 on the large data set. Moreover, this trivial download of data is impractical for devices with low bandwidth and storage (e.g., mobile devices).

C. Database optimization experiments
We studied the overall response of our prototype to determine the benefits accrued from database optimization. The experimental MySQL database runs mostly with the default settings. The only change made was reducing the default number of user connections to free up memory for other processes running on the machine. In other words, we did not tune the database for optimal performance in the course of our previous experiments.
Most databases cache query plans and small-sized query results when a query is executed for the very first time. Subsequent executions of the same query will be more responsive by reusing the cached plan and result. For this experiment, we disabled cache usage by flushing the relations in our database before running each query. Flushing relations in MySQL closes the open relations and flushes the query cache. This ensures the database obtains a fresh query plan and result set every time.
We ran this experiment on the less powerful hardware platform we used to develop the prototypes, because the time to build a fresh query plan does not take a significant time for the more powerful hardware platform we used for running the previous tests.
Table III presents the measurements taken for CQ1 through CQ5 over the large data set under default database behaviour, and the measurements taken when the relations of the database are flushed. The result obtained for these queries validates the claim that our approach leverages database optimization to improve performance. The most interesting measurements taken in this experiment are the subquery execution durations (rQp). For CQ1, CQ3, and CQ5, the difference in measurements is obvious. However, the effect is not quite obvious for CQ2 and CQ4. For the latter pair, the fraction of the overall timing spent for PIR

Table III
Effects of database optimization on query responsiveness, over the large data set. BTREE = overall timing for meeting privacy requirements with our B+ tree prototype, rQp = subquery execution duration within BTREE, cI = timing for generating cached index within BTREE, Time = time to evaluate private query within BTREE.

       With optimization: default database settings    Without optimization: query cache disabled
Query  BTREE (s) =  rQp (s) +  cI (s) +  PIR (s)       BTREE (s) =  rQp (s) +  cI (s) +  PIR (s)
CQ1          104         50        52         2              122         69        50         3
CQ2           22          3         1        18               29          4         2        23
CQ3          375        347        22         5              454        434        15         6
CQ4           66         25        32         9               66         26        29        11
CQ5          214         90       118         6              436        310       120         6
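
The share of the overall time spent on PIR, which the discussion of Table III quotes as percentages, can be recomputed from the table. The sketch below uses the "without optimization" columns, which appear to be the ones the quoted percentages correspond to (an observation from the arithmetic, not a statement in the text); the variable names are illustrative.

```python
# Table III, "without optimization" columns: (BTREE total, PIR time), in seconds.
no_cache = {
    "CQ1": (122, 3),
    "CQ2": (29, 23),
    "CQ3": (454, 6),
    "CQ4": (66, 11),
    "CQ5": (436, 6),
}

# Percentage of the overall private-query time that is spent on PIR itself.
pir_share = {q: round(100 * pir / total) for q, (total, pir) in no_cache.items()}

# Subquery durations (rQp) with and without the query cache, from the same table.
rqp = {"CQ1": (50, 69), "CQ2": (3, 4), "CQ3": (347, 434), "CQ4": (25, 26), "CQ5": (90, 310)}
rqp_slowdown = {q: without / with_opt for q, (with_opt, without) in rqp.items()}

print(pir_share)
print({q: round(s, 2) for q, s in rqp_slowdown.items()})
```

The queries with a small PIR share (CQ1, CQ3, CQ5) are the ones whose subquery time grows most visibly once the cache is disabled.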

queries is nonnegligible: 79% and 17% respectively. For CQ1, CQ3, and CQ5, the portion of the time spent for PIR is respectively 2%, 1%, and 1%. The results for CQ1, CQ3, and CQ5 clearly indicate the contributions of database optimization to query responsiveness with our approach.

D. Improving performance by revealing keyword prefixes
The performance of a query may be improved by revealing a prefix or suffix of the sensitive keyword in the query. Revealing a substring of a keyword helps to constrain the result set that will be indexed and retrieved with PIR. We have demonstrated the feasibility of this technique with complex query CQ4 (Listing 5 and Table II). While this technique may be infeasible in some application domains, due to the sensitive nature of the keyword, it does improve performance in others. This technique does, of course, trade off improved performance for some loss of privacy, though it is in fact the user (who can best make this trade-off decision) who can decide to what extent to use it. Making the best trade-off decision necessarily requires some knowledge of the data distribution in terms of the number of tuples there are for each value in the domain of values for a sensitive constant. This information can be included in the metadata a server sends to the client, and the client can make this trade-off decision on behalf of the user based on the user's preset preferences. We are considering this extension as part of our future work.
The processing of queries that allow users to reveal either a prefix or suffix of their private constant will proceed as follows on a prebuilt index. A user would first request the root to a particular subtree in a prebuilt B+ tree index (indexed either on the attribute or the reverse of the attribute, as above), by supplying a substring for that root. The server would search for and return the requested root, without PIR. Subsequent PIR queries by the user will be based on the subtree with the retrieved root, instead of the entire B+ tree. In other words, revealing a substring of a user's private keyword reveals the portion of the data that is of interest to the user. However, the level of privacy protection may still be sufficient for many user and application purposes.
The only realistic situations where performance cannot be easily improved with this technique are when users must make ad hoc queries that are unknown to the server before a system is deployed. In such situations, it is difficult to make a single index generic enough to serve the diversity of the constraints in an ad hoc query.

E. Limitations
Our approach can preserve the privacy of sensitive data within the WHERE and HAVING clauses of an SQL query, with the exception of complex LIKE query expressions, negated conditions with sensitive constants, and SELECT nested queries within a WHERE clause. The complexity of complex search strings for LIKE queries, such as (LIKE 'do%abs%.c%m'), is beyond the current capability of keyword-based PIR. Similarly, negated WHERE clause conditions, such as (NOT registrant = 45444), are infeasible to compute with keyword-based PIR. Our solution to dealing with these conditions in a privacy-friendly manner is to compute them on the client, after the data for the computation has been retrieved with PIR; converting NOT = queries into their equivalent range queries is generally less efficient than our proposed client-based evaluation method. In addition, our prototype cannot process a nested query within a WHERE clause. We propose that the same processing described for a general SQL query be recursively applied for nested queries in the WHERE clause. The result obtained from a nested query will become an input to the client optimizer, for recursively computing the enclosing query for the next round. There is need for further investigation of the approach for nested queries returning large result sets and for deeply nested queries.

IX. CONCLUSION AND FUTURE WORK
We have provided a privacy mechanism that leverages private information retrieval to preserve the privacy of sensitive constants in an SQL query. We described techniques to hide sensitive constants found in the WHERE clause of an SQL query, and to retrieve data from hash table and B+ tree indices using a private information retrieval scheme. We developed a prototype privacy mechanism for our approach offering practical keyword-based PIR and enabled a practical transition from bit- and block-based PIR to SQL-enabled PIR. We evaluated the feasibility of our approach with experiments. The results of the experiments indicate our approach incurs reasonable performance and storage demands, considering the added advantage of being able to perform private SQL queries. We hope that our

work will provide valuable insight on how to preserve the privacy of sensitive information for many existing and future database applications.
Future work can improve on some limitations of our prototype, such as the processing of nested queries and enhancing the client to utilize statistical information on the data distribution to enhance privacy. The same technique proposed in this paper can be extended to preserve the privacy of sensitive information for other query systems, such as URL query, XQuery, SPARQL and LINQ. Private information retrieval is only the first step for preserving a user's query privacy. An extension to this work can explore private information storage (PIS) [32], and how to use it for augmenting the privacy of users in real-world scenarios. An interesting focus would be to extend PIS to SQL in the manner of this paper, in order to preserve the privacy of sensitive data within SQL INSERT, UPDATE and DELETE data manipulation statements.

ACKNOWLEDGMENTS
We would like to thank Urs Hengartner, Ryan Henry, Aniket Kate, Can Tang, Mashael AlSabah, John Akinyemi, Carol Fung, Meredith L. Patterson, and the anonymous reviewers for their helpful comments for improving this paper. We also gratefully acknowledge NSERC and MITACS for funding this research.

REFERENCES
[1] C. Aguilar-Melchor and P. Gaborit. A Lattice-Based Computationally-Efficient Private Information Retrieval Protocol. Cryptol. ePrint Arch., Report 446, 2007.
[2] L. Arge, O. Procopiuc, and J. S. Vitter. Implementing I/O-efficient Data Structures Using TPIE. In Annual European Symposium on Algorithms, pages 88–100, 2002.
[3] J. Baek, R. Safavi-Naini, and W. Susilo. On the Integration of Public Key Data Encryption and Public Key Encryption with Keyword Search. In S. K. Katsikas, J. Lopez, M. Backes, S. Gritzalis, and B. Preneel, editors, ISC, volume 4176 of Lecture Notes in Computer Science, pages 217–232, 2006.
[4] A. Beimel and Y. Stahl. Robust Information-Theoretic Private Information Retrieval. J. Cryptol., 20(3):295–321, 2007.
[5] D. Belazzougui, F. C. Botelho, and M. Dietzfelbinger. Hash, Displace, and Compress. In ESA 2009: Proceedings of the 17th Annual European Symposium, September 7-9, 2009, pages 682–693, 2009.
[6] J. Bethencourt, D. Song, and B. Waters. New Techniques for Private Stream Searching. ACM Trans. Inf. Syst. Secur., 12(3):1–32, 2009.
[7] F. C. Botelho, D. Reis, and N. Ziviani. CMPH: C minimal perfect hashing library on SourceForge. http://cmph.sourceforge.net/.
[8] F. C. Botelho and N. Ziviani. External perfect hashing for very large key sets. In ACM CIKM, pages 653–662, 2007.
[9] G. Buehrer, B. W. Weide, and P. A. G. Sivilotti. Using parse tree validation to prevent SQL injection attacks. In SEM, pages 106–113, 2005.
[10] D. L. Chaum. Untraceable electronic mail, return addresses, and digital pseudonyms. Commun. ACM, 24(2):84–90, 1981.
[11] B. Chor and N. Gilboa. Computationally private information retrieval (extended abstract). In STOC '97: Proceedings of the twenty-ninth annual ACM Symposium on Theory of Computing, pages 304–313, New York, NY, USA, 1997.
[12] B. Chor, N. Gilboa, and M. Naor. Private information retrieval by keywords. Technical Report TR CS0917, Dept. of Computer Science, Technion, Israel, 1997.
[13] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan. Private information retrieval. In FOCS, pages 41–50, Oct 1995.
[14] G. D. Crescenzo. Towards Practical Private Information Retrieval. Achieving Practical Private Information Retrieval (Panel @ Securecomm 2006), Aug. 2006.
[15] Z. J. Czech, G. Havas, and B. S. Majewski. An optimal algorithm for generating minimal perfect hash functions. Inf. Process. Lett., 43(5):257–264, 1992.
[16] Department of Computer Science at Duke University. The TPIE (Templated Portable I/O Environment). http://madalgo.au.dk/Trac-tpie/.
[17] R. Dingledine, N. Mathewson, and P. Syverson. Tor: the second-generation onion router. In USENIX Security Symposium, pages 21–21, 2004.
[18] M. J. Freedman, Y. Ishai, B. Pinkas, and O. Reingold. Keyword search and oblivious pseudorandom functions. In J. Kilian, editor, TCC, volume 3378 of Lecture Notes in Computer Science, pages 303–324. Springer, 2005.
[19] M. J. Freedman, K. Nissim, and B. Pinkas. Efficient Private Matching and Set Intersection. In C. Cachin and J. Camenisch, editors, EUROCRYPT, volume 3027 of Lecture Notes in Computer Science, pages 1–19. Springer, 2004.
[20] I. Goldberg. Percy++ project on SourceForge. http://percy.sourceforge.net/.
[21] I. Goldberg. Improving the Robustness of Private Information Retrieval. In IEEE Symposium on Security and Privacy, pages 131–148, 2007.
[22] H. Hacigümüş, B. Iyer, C. Li, and S. Mehrotra. Executing sql over encrypted data in the database-service-provider model. In ACM SIGMOD, pages 216–227, 2002.
[23] R. J. Hansen and M. L. Patterson. Guns and butter: Towards formal axioms of input validation. In Black Hat USA 2005, Las Vegas, July 2005.
[24] R. J. Hansen and M. L. Patterson. Stopping injection attacks with computational theory. In Black Hat USA 2005, Las Vegas, July 2005.
[25] B. Hore, S. Mehrotra, and G. Tsudik. A privacy-preserving index for range queries. In VLDB, pages 720–731, 2004.
[26] ICANN Security and Stability Advisory Committee (SSAC). Report on Domain Name Front Running, February 2008.
[27] S. Jarecki and X. Liu. Efficient Oblivious Pseudorandom Function with Applications to Adaptive OT and Secure Computation of Set Intersection. In TCC '09: Proceedings of the 6th Theory of Cryptography Conference on Theory of Cryptography, pages 577–594, Berlin, Heidelberg, 2009.
[28] E. Kushilevitz and R. Ostrovsky. Replication is not needed: single database, computationally-private information retrieval. In FOCS, page 364, 1997.
[29] S. K. Mishra and P. Sarkar. Symmetrically Private Information Retrieval. In INDOCRYPT, pages 225–236, 2000.
[30] M. Naor and B. Pinkas. Oblivious transfer and polynomial evaluation. In ACM Symposium on Theory of Computing, pages 245–254, 1999.
[31] M. Naor and B. Pinkas. Efficient oblivious transfer protocols. In ACM-SIAM SODA, pages 448–457, 2001.
[32] R. Ostrovsky and V. Shoup. Private information storage (extended abstract). In STOC '97: Proceedings of the twenty-ninth annual ACM Symposium on Theory of Computing, pages 294–303, New York, NY, USA, 1997.
[33] R. Ostrovsky and W. E. Skeith, III. Private Searching on Streaming Data. J. Cryptol., 20(4):397–430, 2007.
[34] P. Paillier. Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. In Advances in Cryptology - Eurocrypt '99, Lecture Notes in Computer Science 1592, pages 223–238, 1999.
[35] J. Reardon, J. Pound, and I. Goldberg. Relational-Complete Private Information Retrieval. Technical report, CACR 2007-34, University of Waterloo, 2007.
[36] L. Sassaman, B. Cohen, and N. Mathewson. The Pynchon Gate: a Secure Method of Pseudonymous Mail Retrieval. In ACM WPES, pages 1–9, 2005.
[37] D. C. Schmidt. More C++ gems, chapter GPERF: a perfect hash function generator, pages 461–491. Cambridge University Press, New York, NY, USA, 2000.
[38] E. Shi, J. Bethencourt, T.-H. H. Chan, D. Song, and A. Perrig. Multi-Dimensional Range Query over Encrypted Data. In IEEE SSP, pages 350–364, 2007.
[39] A. Silberschatz, H. F. Korth, and S. Sudarshan. Database System Concepts. McGraw-Hill, Inc., New York, NY, USA, 5th edition, 2005.
[40] R. Sion and B. Carbunar. On the Computational Practicality of Private Information Retrieval. In Network and Distributed Systems Security Symposium, 2007.
[41] D. X. Song, D. Wagner, and A. Perrig. Practical Techniques for Searches on Encrypted Data. In SP '00: Proceedings of the 2000 IEEE Symposium on Security and Privacy, page 44, Washington, DC, USA, 2000.
[42] Sun Microsystems. MySQL. http://www.mysql.com/.
[43] Transaction Processing Performance Council. Benchmark C. http://www.tpc.org/.
[44] D. E. Vengroff and J. Scott Vitter. Supporting I/O-efficient scientific computation in TPIE. In IEEE Symp. on Parallel and Distributed Processing, page 74, 1995.
[45] P. Williams and R. Sion. Usable PIR. In Network and Distributed System Security Symposium. The Internet Society, 2008.
[46] M. Wong and C. Thomas. Database Test Suite project on SourceForge. http://osdldbt.sourceforge.net/.
[47] S. S. Yau and Y. Yin. Controlled privacy preserving keyword search. In ASIACCS '08: Proceedings of the 2008 ACM Symposium on Information, Computer and Communications Security, pages 321–324, New York, NY, USA, 2008.

APPENDIX
A. Database schema for examples
CREATE TABLE registrar (
  reg_id int(11) NOT NULL,
  contact char(60) default NULL,
  phone char(80) default NULL,
  address char(80) default NULL,
  email char(60) default NULL,
  PRIMARY KEY (reg_id));

CREATE TABLE regdomains (
  id int(11) NOT NULL,
  domain char(80) default NULL,
  created int(8) default NULL,
  expiry int(8) default NULL,
  reg_id int(11) NOT NULL,
  status varchar(2) default NULL,
  PRIMARY KEY (reg_id));

B. Microbenchmark queries from [35]
Q1 – Point query with single result
SELECT domain, reg_date
FROM registration WHERE domain = ?
Q2 – Point query with multiple results
SELECT domain FROM registration
WHERE expiry_date = ?
Q3 – Range query with single condition
SELECT domain, status FROM
registration WHERE expiry_date > ?
Q4 – Range query with multiple conditions
SELECT * FROM registration
WHERE expiry_date > ? AND reg_date < ?
Q5 – Point query with join
SELECT domain, name, email
FROM contact, registration
WHERE domain=? AND registrant = contact_id
Q6 – Range query with join
SELECT * FROM contact,registration WHERE
expiry_date>? AND registrar=contact_id

C. Database schema for microbenchmarks and experiments
CREATE TABLE contact (
contact_id int(11) NOT NULL,
name char(60) default NULL,
address char(80) default NULL,
email char(60) default NULL,
PRIMARY KEY (contact_id));
CREATE TABLE registration (
reg_id int(11) NOT NULL,
domain char(80) default NULL,
expiry_date int(8) default NULL,
reg_date int(8) default NULL,
registrant int(11) default NULL,
registrar int(11) default NULL,
status varchar(2) default NULL,
PRIMARY KEY (reg_id));
ALTER TABLE registration
ADD FOREIGN KEY fk_registrant (registrant)
REFERENCES contact(contact_id);
ALTER TABLE registration
ADD FOREIGN KEY fk_registrar (registrar)
REFERENCES contact(contact_id);
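
To make the subquery-then-index idea concrete on this schema, here is a minimal sketch using Python's built-in sqlite3 module in place of MySQL. It runs CQ1 with the "#"-tagged condition removed, builds an index on domain over the subquery result, and then looks up the sensitive value on the client side. The sample rows are invented, and the plain dictionary lookup is only a stand-in: in the prototype the index is a B+ tree (or hash table) and the lookup is performed with PIR so that the server does not learn the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (
  contact_id int(11) NOT NULL, name char(60) default NULL,
  address char(80) default NULL, email char(60) default NULL,
  PRIMARY KEY (contact_id));
CREATE TABLE registration (
  reg_id int(11) NOT NULL, domain char(80) default NULL,
  expiry_date int(8) default NULL, reg_date int(8) default NULL,
  registrant int(11) default NULL, registrar int(11) default NULL,
  status varchar(2) default NULL, PRIMARY KEY (reg_id));
""")

# Invented sample data for illustration only.
conn.executemany("INSERT INTO contact VALUES (?, ?, ?, ?)", [
    (1, "Alice", "1 Main St", "alice@example.org"),
    (2, "Bob", "2 Side St", "bob@example.org"),
])
conn.executemany("INSERT INTO registration VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (10, "somedomain.org", 20100601, 20090601, 1, 2, "1"),
    (11, "other.org", 20100701, 20090701, 2, 2, "1"),
    (12, "old.org", 20091201, 20080101, 1, 2, "4"),
])

# CQ1 with the private condition (domain = #'somedomain.org') removed.
subquery = """
SELECT domain, name, address, email, reg_date, expiry_date
FROM registration, contact
WHERE (contact_id = registrant) AND (reg_date > 20090501)
"""
rows = conn.execute(subquery).fetchall()

# Index the subquery result on the sensitive attribute (stand-in for the cached index).
index = {row[0]: row for row in rows}

# Stand-in for the private lookup; a real client would retrieve this with PIR.
record = index["somedomain.org"]
print(record)
```

The server only ever sees the subquery, which contains no sensitive constant; the sensitive domain name is used solely in the final lookup.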

