DS517 – Data Security
Lecture 6
Encrypted Search
Elias Athanasopoulos
Data and storage
• Data is collected by several applications or
services and it is further processed
– Data can be sensitive
• We store data in database management systems
(DBMSs)
– They support storing, searching and retrieval of data,
among others
• A database can be conceptualized as
– The core database that focuses on indexing/searching
– The DBMS, which is the software that performs data
accessing
2
DBMSs
• Essentially, they perform several actions
beyond storing and searching for data
– Enforcing data access policies
– Defining data structures
– Providing transaction guarantees
– Visualization and analytics
• We focus on the traditional database’s core
functions
– Data insertion, indexing, and search
3
Protected database search
• Use cryptography to separate the roles of providing,
administering, and accessing data
• Server is not aware of the data stored
– A breach of the server does not expose plaintext data
– Server performs actions on encrypted data, without
reading the plaintext data
• Wide variety of techniques
– Property-preserving encryption, searchable symmetric
encryption, private information retrieval by keyword,
oblivious RAM
• A protected search system must balance between
security, functionality, performance, and usability
4
Our goals
• Understand the current and future state of
database technology, enabling focus on
techniques that will be useful in future DBMSs
• Help security and database experts
understand the tradeoffs between protected
search systems so they can make an informed
decision about which technology, if any, is
most appropriate for their setting
5
Overview of database systems
• Relational databases (SQL)
– Strong transactional guarantees
– Vertically scalable: better performance through greater
hardware resources
– ACID (Atomicity, Consistency, Isolation, and Durability)
• NoSQL (not only SQL)
– Fast data insertion, flexible data structures, relaxed
transactional guarantees
– For large amounts of unstructured data
• NewSQL
– Scalability of NoSQL databases
– Transactional guarantees of relational databases
• Future systems
6
Query bases
• We define a small set of base operations that can
be combined to provide complex search
functionality
– Relational algebra (SQL): set union, set difference,
cartesian product (joins), projection and selection
– Associative arrays (key-value store for NoSQL):
construction, find, addition, element-wise
multiplication, array multiplication
– Linear algebra (NewSQL): construction, find, matrix
addition, matrix multiplication, element-wise
multiplication
7
Example systems
8
Database roles
• Provider
– Provides and modifies the data
• Querier
– Wishes to learn about the data
• Server
– Handles storage and processing
• Authorizer
– Specifies data- and query-based rules
• Enforcer
– Ensures the rules are applied
9
Database operations
Init/Query
• Init
– The initialization protocol occurs between the provider
and the server
– The server obtains a protected database representing the
loaded data
• Query
– The query protocol occurs between the querier (with a
query), the server (with the protected database), the
enforcer (with the rules), and possibly the provider
– The querier obtains the query results if the rules are
satisfied
• All systems we discuss support Init/Query
10
Database operations
Update/Refresh
• Update
– The update protocol occurs between the provider (with a set of
updates) and the server
– The server obtains an updated protected database
– Updates include insertions, deletions, and record modifications
• Refresh
– The refresh protocol occurs between the provider and the
server
– The server obtains a new protected database that represents
the same data but is designed to achieve better performance
and/or security
• Some of the systems we discuss support Update/Refresh
11
Protected database
search systems
• A system that supports the roles and operations, in which
each party learns only its intended outputs (informally)
• Ensures that the server learns nothing about the data
stored in the protected database or about the queries, and
the querier learns nothing beyond the query results
– Formalized using the real-ideal style of cryptographic definition
• Ideal
– A protected search system, in which a trusted external party
performs storage, queries, and modifications correctly and
reveals only the intended outputs to each party
• The real system is secure if no party can learn more from its
real world interactions than it can learn in the ideal system
12
Formal guarantees
• We focus on systems that provide formally
defined security guarantees
• Some other commercial systems are based on
cryptographic techniques with security proofs
– Further analysis needs to be conducted to
conclude whether there are security deviations
13
Scenarios
• A few existing protected search systems consider the
enforcement of rules using an authorizer and enforcer
• Three-party scenario
– A provider, a querier, and a server
• Two-party scenario
– A single user (the client) acts as both the provider and the
querier
– E.g., a cloud-storage app in which a client uploads files to
the cloud that she can later search
– Client knows all information, so we consider security
against an adversarial server
• We focus on a single provider and single querier setting
14
Threats
• An adversary can be either an insider or an outsider
• Semi-honest adversary (honest-but-curious)
– They follow the protocol but may passively attempt to
learn additional information
• Malicious adversary
– May actively deviate from the protocol to reveal secret information
• An adversary may be persistent for the lifetime of the
database or have access only to a snapshot
• Most common threat model
– Semi-honest, persistent, insider adversary
15
Performance and leakage
• Unprotected databases are I/O bound, while protected
databases are CPU/network bound
– Cryptography may be computationally heavy (asymmetric)
or relatively cheap (symmetric)
• A very slow system can be very secure but not
usable
– A faster system typically leaks more information
• Leakage profile
– A sequence of functions that formally describe all
information that is revealed to each party beyond the
intended output
– Can be complex
16
Common Leakage Profiles
• A leakage profile is composed of
– Objects that leak
– The type of information that is leaked
– Which operation leaks
– The party that learns the leakage
17
Common Leakage Profiles
Objects
• Objects vulnerable to leakage
– Data items and indexing data structures
– Queries
– Records returned in response to a query, or other
relationships between query and data
– Access-control rules
18
Common Leakage Profiles
Information that is leaked
• Structure
– Properties of an object only concealable via padding
– E.g., length of a string, the cardinality of a set
• Predicates
– Identifiers plus additional information on objects
– E.g., “matches the intersection of 2 clauses within a query”
and “within a common (known) range.”
• Equalities
– Which objects have the same value
• Order (or more)
– Numerical or lexicographic ordering of objects, or perhaps
even partial plaintext data
19
Common Leakage Profiles
Operation
• Init
– The server may learn about the initial data
• Query
– The querier may learn about the rules and the current data
– The server may learn about the query, the rules, and the
current data
– The provider may learn about the query and rules
– The enforcer may learn about the query and current data
• Update
– The server may learn about prior/new data records
• Refresh
– The server may learn about the current data
20
Protected search systems
approach
• Legacy
– The approach can be used with an unprotected
database server
• Custom index
– Based on special-purpose protected indices and
customized protocols
• Oblivious Index
– Subset of Custom index that, additionally,
obscures object identifiers
21
Base queries supported
• There are cryptographic protocols for
supporting base queries
– Equality, range, and boolean queries
• Additional query types have been developed
– Denoted as “Other” in the final systemization
22
Performance and usability
• Scale
– The scale of updates and queries that each
scheme has been shown to support
• Crypto
– The type and amount of cryptography required to
support updates and queries
• Network
– The network latency and bandwidth
characteristics
23
Review of proposals
Legacy
• Property-preserving encryption allows operations
(e.g., equality or order) on ciphertexts that
preserve some property of the underlying
plaintexts
• Legacy databases can support those actions by
simply encrypting the data
– No other changes needed in the database
• Types of encryption
– Deterministic encryption (DET) for equality
– Order-preserving encryption (OPE) for range queries
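A minimal sketch of why a legacy server needs no changes for equality search over deterministic ciphertexts. As an assumption for illustration, an HMAC-SHA256 tag stands in for a real DET scheme (it is deterministic but cannot be decrypted); `det_tag`, `KEY`, and the sample column are hypothetical names, not part of any system named in the lecture.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"  # hypothetical client-side secret


def det_tag(value: str, key: bytes = KEY) -> str:
    """Deterministic tag: equal plaintexts always map to equal tags."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# The provider protects the column before uploading it.
rows = ["Smith", "Jones", "Smith"]
server_column = [det_tag(v) for v in rows]


def server_equality_search(column, token):
    """Unmodified server logic: a plain equality comparison on stored values."""
    return [i for i, stored in enumerate(column) if stored == token]


# The querier sends the tag of the search term instead of the term itself.
matches = server_equality_search(server_column, det_tag("Smith"))
```

The same determinism that enables the search is also the leakage: the server can see that rows 0 and 2 hold equal values without any query being issued. An OPE column would be queried in the same way, with order comparisons in place of equality.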
24
Review of proposals
Custom inverted index
• Support for equality searches
– On single-table databases via a reverse lookup
that maps each keyword to a list of identifiers for
the database records containing the keyword
• Support for Boolean queries
– The inverted index finds the set of records
matching the first term in a query, and a second
index containing a list of (record identifier,
keyword) pairs is used to check whether the
remaining terms of the query are also satisfied
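The two-index idea can be sketched as follows. This is a simplified illustration under stated assumptions, not a published scheme: keyword tokens are HMAC tags, record identifiers are stored in the clear, and the (record identifier, keyword) pairs are tagged with a key derived from the keyword token so the server can test them itself. All names (`keyword_token`, `pair_tag`, `build_indices`, `server_search`) are hypothetical.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"  # hypothetical client-side secret


def keyword_token(keyword: str) -> bytes:
    """Search token the querier sends to the server for one keyword."""
    return hmac.new(KEY, keyword.encode("utf-8"), hashlib.sha256).digest()


def pair_tag(kw_token: bytes, record_id: int) -> bytes:
    """Tag for a (record identifier, keyword) pair, keyed by the keyword token."""
    return hmac.new(kw_token, str(record_id).encode("utf-8"), hashlib.sha256).digest()


def build_indices(records):
    """Provider side: build the inverted index and the pair index."""
    inverted, pairs = {}, set()
    for rid, keywords in records.items():
        for kw in set(keywords):
            t = keyword_token(kw)
            inverted.setdefault(t, []).append(rid)
            pairs.add(pair_tag(t, rid))
    return inverted, pairs


def server_search(inverted, pairs, tokens):
    """Server side: look up the first term, then filter with the pair index."""
    first, rest = tokens[0], tokens[1:]
    candidates = inverted.get(first, [])
    return sorted(rid for rid in candidates
                  if all(pair_tag(t, rid) in pairs for t in rest))


records = {1: ["smith", "john"], 2: ["smith", "mathew"], 3: ["jones", "john"]}
inverted, pairs = build_indices(records)
single = server_search(inverted, pairs, [keyword_token("smith")])
both = server_search(inverted, pairs,
                     [keyword_token("smith"), keyword_token("john")])
```

Note that the server learns the full result set of the first term even when the conjunction matches fewer records, which is one reason the ordering of terms in a query matters for both leakage and performance.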
25
Review of proposals
Custom tree traversal
• Based on indices with a tree-based structure
• A query is executed by traversing the tree and
returning the leaf nodes at which the query
terminates
• The main cryptographic challenge here is to
hide the traversal pattern through the tree,
which can depend upon the data and query
26
Review of proposals
Other custom indices
• These schemes mostly work by building
encrypted indices out of specialized data
structures for performing the specific query
computation
27
Review of proposals
Oblivious index (Obliv)
• Systems that implement Oblivious RAM
(ORAM) protocols aim to hide access patterns
in memory
• The main idea is to re-arrange the contents of
data in memory, for each query, in order to
obscure the relationships between data and
memory access
• The challenge is to do this efficiently
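A toy illustration of the re-arrangement idea, assuming a client that keeps a secret position map and reshuffles the whole store after every access. This is not a real ORAM construction: encryption is omitted (a real system would re-encrypt every block on each shuffle), and the full reshuffle costs linear work per access, which is exactly the inefficiency practical ORAM protocols try to avoid. `ToyShuffleStore` is a hypothetical name.

```python
import random


class ToyShuffleStore:
    """Client-side wrapper that reshuffles server storage after each access."""

    def __init__(self, blocks, rng=None):
        self._rng = rng or random.Random()
        self.server_slots = list(blocks)            # what the server stores
        self._position = list(range(len(blocks)))   # secret: logical -> physical
        self.trace = []                             # physical slots the server observes
        self._reshuffle()

    def _reshuffle(self):
        n = len(self.server_slots)
        # Recover the logical order, then place blocks under a fresh permutation.
        logical = [self.server_slots[slot] for slot in self._position]
        new_position = list(range(n))
        self._rng.shuffle(new_position)
        new_slots = [None] * n
        for addr, slot in enumerate(new_position):
            new_slots[slot] = logical[addr]
        self.server_slots = new_slots
        self._position = new_position

    def access(self, addr, new_value=None):
        """Read (and optionally overwrite) the block at logical address addr."""
        slot = self._position[addr]
        self.trace.append(slot)
        value = self.server_slots[slot]
        if new_value is not None:
            self.server_slots[slot] = new_value
        self._reshuffle()
        return value


store = ToyShuffleStore(["a", "b", "c", "d"], rng=random.Random(0))
first_read = store.access(2)
second_read = store.access(2)   # same logical block, freshly chosen physical slot
old_value = store.access(1, "B")
```

Because the permutation is redrawn after each access, repeated reads of the same logical block land on physical slots that are independent of earlier ones, so the observed trace does not reveal which logical address was requested.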
28
Systemization
29
Leakage inference attacks
• Protected search systems are evaluated
against leakage
• A protected search scheme is affected by an
attack if the scheme’s leakage to the server is
at least as large as the attack’s required
minimum leakage
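The applicability rule above can be sketched as a simple comparison. As a simplifying assumption, leakage is collapsed into a single totally ordered level, following the order in which the information types were listed earlier (structure, predicates, equalities, order); real leakage profiles are per object and per operation. `Leakage` and `is_affected` are hypothetical names.

```python
from enum import IntEnum


class Leakage(IntEnum):
    """Assumed ordering, from least to most information revealed."""
    STRUCTURE = 1
    PREDICATES = 2
    EQUALITIES = 3
    ORDER = 4


def is_affected(scheme_leakage: Leakage, attack_minimum: Leakage) -> bool:
    """A scheme is affected if it leaks at least what the attack requires."""
    return scheme_leakage >= attack_minimum


# A scheme revealing order is exposed to an attack that only needs equalities.
order_scheme_exposed = is_affected(Leakage.ORDER, Leakage.EQUALITIES)
# A scheme revealing only structure is not.
structure_scheme_exposed = is_affected(Leakage.STRUCTURE, Leakage.EQUALITIES)
```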
30
Leakage inference attacks
Attack requirements
• Attacker goal
– Recover a set of queries asked by the querier (query recovery)
or the data being stored at the server (data recovery)
• Required leakage
– Cardinality of a response set, the ordering of records in the
database, and identifiers of the returned records, etc.
• Attacker model
– Semi-honest or data injection (the attacker can insert records into the database)
• Attacker prior knowledge
– Contents of full dataset (for attackers that want to recover
queries), contents of a subset of dataset, distributional
knowledge of dataset, distributional knowledge of queries,
keyword universe (knowledge of the possible values of each
field)
31
Leakage inference attacks
Attack efficacy
• The runtime of the attack, including time
required to create any inserted records
• The sensitivity of the recovery rate to the
amount of prior knowledge
• The keyword universe size attacked
32
Leakage inference attacks
Attack techniques
• Many attacks published, but in principle they
all rely on two facts
– Different keywords are associated with different
numbers of records
– Most systems reveal keyword identifiers for a
record either at rest or when it is returned across
multiple queries
33
Leakage inference attacks
Example
• Assume the attacker has full knowledge of the
database and is trying to learn the query
– With 80% of the dataset known, the attack can
yield a 40% keyword recovery rate
• The attacker sees how many records are
returned in response to a query
– If the number is unique (per query) then the
query is identified
– Also the attacker can tell that every returned
record is associated with the keyword
34
Leakage inference attacks
Example
• Suppose that the attacker learns that the first query was for
LastName = ‘Smith’ (a unique number of records in the response)
• Consider a second query that does not return a unique
number of records in response
– FirstName may be ‘John’ or ‘Mathew’ and both return 1,000
records
• The attacker checks how many records overlap with the
first query
– For example, there may be 100 records with ‘Mathew Smith’
and only 10 with ‘John Smith’
– By checking overlaps, the attacker can reveal the first name
• The attacker can iteratively identify queries and create
constraints for further identifying unknown queries
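The two steps of this example (unique response sizes first, then overlap constraints) can be sketched as below. This is a scaled-down illustration, assuming the attacker knows the keyword sets of every record and observes, per query, the set of opaque record identifiers returned. The function name `count_attack`, the six-record dataset, and the identifiers are made up for the example.

```python
def count_attack(known_docs, observed):
    """Map query ids to keywords using response sizes and result overlaps.

    known_docs: list of keyword sets (attacker's prior knowledge).
    observed:   dict query_id -> set of opaque record ids seen by the server.
    """
    keywords = set().union(*known_docs)
    count = {k: sum(k in d for d in known_docs) for k in keywords}

    def cooc(a, b):
        return sum(a in d and b in d for d in known_docs)

    recovered = {}
    # Step 1: a response size matched by exactly one keyword identifies the query.
    for q, ids in observed.items():
        cands = [k for k in keywords if count[k] == len(ids)]
        if len(cands) == 1:
            recovered[q] = cands[0]

    # Step 2: use overlaps with already identified queries as extra constraints.
    progress = True
    while progress:
        progress = False
        for q, ids in observed.items():
            if q in recovered:
                continue
            cands = [k for k in keywords
                     if count[k] == len(ids)
                     and all(cooc(k, kw) == len(ids & observed[q2])
                             for q2, kw in recovered.items())]
            if len(cands) == 1:
                recovered[q] = cands[0]
                progress = True
    return recovered


known_docs = [{"smith", "mathew"}, {"smith", "john"}, {"smith", "mathew"},
              {"jones", "john"}, {"smith", "john"}, {"smith", "mathew"}]
# Opaque ids as the server sees them; the attacker does not know the mapping.
observed = {
    "q1": {"d", "a", "f", "e", "c"},  # 5 records: only 'smith' has this count
    "q2": {"d", "f", "c"},            # 3 records: 'john' or 'mathew'
    "q3": {"a", "b", "e"},            # 3 records: 'john' or 'mathew'
}
recovered = count_attack(known_docs, observed)
```

Here `q2` shares three records with the identified 'smith' query and `q3` shares two, which matches the known co-occurrence counts of 'mathew' and 'john' with 'smith' respectively and resolves the tie.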
35
Systemization
36
Leakage inference attacks
Discussion
• The provider and querier should be protected
against the server
– Privacy, the server may be compromised, etc.
• Which technique should be used?
1) How large is the keyword universe?
2) How much of the dataset or query keyword universe
(and frequency) can the attacker predict?
3) Can an attacker reasonably insert crafted records?
4) Does the adversary have persistent access to the
server, or to a snapshot at a given time?
37
Leakage inference attacks
Discussion
• Answers to the first three questions depend upon the
intended use case
– A system with a smaller leakage profile may be necessary
in a setting where the keyword universe is small and the
attacker has the ability to add records
– A system with a larger leakage profile may suffice in a
setting where the keyword universe is very large
• The fourth question relates to adversaries that
compromise the server
– Legacy schemes leak the entire database (one snapshot
should be enough)
– Custom schemes leak information during query (persistent
compromise is needed)
38
Leakage inference attacks
Summary
• Each protected search approach has a distinct
leakage profile that results in qualitatively
different attacks
– If queries only touch a small portion of the dataset
or the adversary only has a snapshot, the impact
of leakage from Custom systems is less than from
Legacy schemes
– If queries regularly return a large fraction of the
dataset, this distinction disappears and an Obliv
scheme may be appropriate
39
Extending Functionality
• There are techniques for combining base
queries (equality, Boolean, etc.) into richer ones
• Schemes that support a given query type by
composing base queries tend to have more
leakage than schemes that natively support
the same query
• But, in query composition, a scheme can be
extended straightforwardly to support
multiple query types
40
Extending functionality
41
Extending database systems
Controls, rules and enforcement
• Database systems support several types of access
control mechanisms which may constrain the
interaction of users (or programs) with data
• In addition, query control limits which queries are
acceptable
– In contrast to typical access control which mandates
which data is accessible
– For example, a query must specify at least five
columns, in order to be sufficiently targeted
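A minimal sketch of a query-control rule like the one in the example, assuming a query is represented as a mapping from column name to the value being searched for. The representation and the names `MIN_COLUMNS` and `query_allowed` are hypothetical; in a protected system the enforcer would have to evaluate such a rule without seeing the plaintext query.

```python
MIN_COLUMNS = 5  # threshold taken from the example above


def query_allowed(query_terms: dict, min_columns: int = MIN_COLUMNS) -> bool:
    """Accept a query only if it constrains enough distinct columns."""
    specified = {col for col, val in query_terms.items() if val is not None}
    return len(specified) >= min_columns


broad_query = {"last_name": "Smith"}
targeted_query = {
    "last_name": "Smith",
    "first_name": "John",
    "city": "Nicosia",
    "birth_year": 1980,
    "department": "CS",
}
broad_ok = query_allowed(broad_query)
targeted_ok = query_allowed(targeted_query)
```

Unlike an access-control rule, this check never looks at which records would be returned; it depends only on the shape of the query.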
42
Extending database systems
Performance characterization
• Adding protections to the database system,
such as encryption, may affect the
performance
• Response times depend heavily on
– Network capacity, load and number of records
returned by the query
– Ordering of terms in subclauses within a query
– Complexity of rules based on query policy and
access control
43
Extending database systems
User perceptions and performance
• Users may not be ready to use protected search
systems
• In a controlled user study, it was evident that
– When response times were unpredictable,
participants were unsure whether they should wait for
a query to complete or do something else
– Participants felt the protected technologies were
slower than an unprotected system
– Participants were surprised that different types of
queries might have different performance
characteristics
44
Extending database systems
Current protected search databases
45
References
• SoK: Cryptographically Protected Database
Search, in IEEE S&P (Oakland) 2017, by Benjamin Fuller,
Mayank Varia, Arkady Yerukhimovich, Emily
Shen, Ariel Hamlin, Vijay Gadepally, Richard
Shay, John Darby Mitchell, and Robert K.
Cunningham
46