Principles of Distributed Database
Systems
M. Tamer Özsu
Patrick Valduriez
© 2020, M.T. Özsu & P. Valduriez 1
Outline
■ Introduction
■ Distributed and Parallel Database Design
■ Distributed Data Control
■ Distributed Query Processing
■ Distributed Transaction Processing
■ Data Replication
■ Database Integration – Multidatabase Systems
■ Parallel Database Systems
■ Peer-to-Peer Data Management
■ Big Data Processing
■ NoSQL, NewSQL and Polystores
■ Web Data Management
Outline
■ Introduction
❑ What is a distributed DBMS
❑ History
❑ Distributed DBMS promises
❑ Design issues
❑ Distributed DBMS architecture
Distributed Computing
■ A number of autonomous processing elements (not
necessarily homogeneous) that are interconnected by a
computer network and that cooperate in performing their
assigned tasks.
■ What is being distributed?
❑ Processing logic
❑ Function
❑ Data
❑ Control
Current Distribution – Geographically
Distributed Data Centers
What is a Distributed Database System?
A distributed database is a collection of multiple, logically
interrelated databases distributed over a computer network
A distributed database management system (Distributed
DBMS) is the software that manages the DDB and provides
an access mechanism that makes this distribution
transparent to the users
What is not a DDBS?
■ A timesharing computer system
■ A loosely or tightly coupled multiprocessor system
■ A database system which resides at one of the nodes of
a network of computers - this is a centralized database
on a network node
Distributed DBMS Environment
Implicit Assumptions
■ Data stored at a number of sites → each site logically
consists of a single processor
■ Processors at different sites are interconnected by a
computer network → not a multiprocessor system
❑ Parallel database systems
■ Distributed database is a database, not a collection of
files → data logically related as exhibited in the users’
access patterns
❑ Relational data model
■ Distributed DBMS is a full-fledged DBMS
❑ Not remote file system, not a TP system
Important Point
Logically integrated
but
Physically distributed
History – File Systems
History – Database Management
History – Early Distribution
Peer-to-Peer (P2P)
History – Client/Server
History – Data Integration
History – Cloud Computing
On-demand, reliable services provided over the Internet in
a cost-efficient manner
■ Cost savings: no need to maintain dedicated compute
power
■ Elasticity: better adaptivity to changing workload
Data Delivery Alternatives
■ Delivery modes
❑ Pull-only
❑ Push-only
❑ Hybrid
■ Frequency
❑ Periodic
❑ Conditional
❑ Ad-hoc or irregular
■ Communication Methods
❑ Unicast
❑ One-to-many
■ Note: not all combinations make sense
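The difference between pull and push delivery can be sketched with a toy data source. The Server class and its method names are hypothetical, chosen only for illustration; the sketch assumes a single in-process server and ignores networking entirely.

```python
class Server:
    """Toy data source supporting both delivery modes (illustrative only)."""

    def __init__(self):
        self.value = None
        self.subscribers = []  # callbacks registered for push delivery

    def get(self):
        """Pull: the client asks and the server answers."""
        return self.value

    def subscribe(self, callback):
        """Push: the client registers once and is notified on each change."""
        self.subscribers.append(callback)

    def update(self, value):
        """Store a new value and push it to every subscriber (one-to-many)."""
        self.value = value
        for notify in self.subscribers:
            notify(value)


server = Server()
pushed = []                      # values received through push delivery
server.subscribe(pushed.append)
server.update("v1")
server.update("v2")
pulled = server.get()            # value received through an explicit pull
```

The push client sees every update as it happens, while the pull client sees only the value current at the moment it asks. A hybrid mode would combine both calls.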
Distributed DBMS Promises
❶ Transparent management of distributed, fragmented, and
replicated data
❷ Improved reliability/availability through distributed
transactions
❸ Improved performance
❹ Easier and more economical system expansion
Transparency
■ Transparency is the separation of the higher-level
semantics of a system from the lower level
implementation issues.
■ Fundamental issue is to provide
data independence
in the distributed environment
❑ Network (distribution) transparency
❑ Replication transparency
❑ Fragmentation transparency
■ horizontal fragmentation: selection
■ vertical fragmentation: projection
■ hybrid
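The two basic fragmentation types can be sketched in a few lines of Python. The EMP rows, the CITY attribute, and the helper names below are illustrative assumptions, not taken from the slides. The sketch only shows that horizontal fragments are defined by selection and rebuilt by union, while vertical fragments are defined by projection (keeping the key) and rebuilt by a join on that key.

```python
# Hypothetical EMP relation; the rows and the CITY attribute are made up.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng.", "CITY": "Boston"},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal.", "CITY": "Paris"},
    {"ENO": "E3", "ENAME": "A. Lee", "TITLE": "Mech. Eng.", "CITY": "Boston"},
]


def horizontal_fragment(relation, predicate):
    """Selection: keep whole rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, attributes, key="ENO"):
    """Projection: keep some columns, always including the key."""
    cols = [key] + [a for a in attributes if a != key]
    return [{c: row[c] for c in cols} for row in relation]


def join_on_key(left, right, key="ENO"):
    """Rebuild rows from two vertical fragments by joining on the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


emp_boston = horizontal_fragment(EMP, lambda r: r["CITY"] == "Boston")
emp_paris = horizontal_fragment(EMP, lambda r: r["CITY"] == "Paris")
emp_names = vertical_fragment(EMP, ["ENAME"])
emp_jobs = vertical_fragment(EMP, ["TITLE", "CITY"])

# Reconstruction: union for horizontal fragments, join for vertical ones.
rebuilt_h = sorted(emp_boston + emp_paris, key=lambda r: r["ENO"])
rebuilt_v = join_on_key(emp_names, emp_jobs)
```

Fragmentation transparency means a user queries EMP as a whole and never refers to fragments such as emp_boston or emp_names directly.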
Example
Transparent Access
SELECT ENAME, SAL
FROM EMP, ASG, PAY
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND PAY.TITLE = EMP.TITLE
[Figure: sites in Tokyo, Boston, Paris, Montreal, and New York connected by a communication network. Fragment labels at the sites include Paris, Boston, Montreal, and New York projects, employees, and assignments, plus New York projects with budget > 200000.]
Distributed Database - User View
Distributed
Database
Distributed DBMS - Reality
[Figure: user queries and user applications at several sites, each site running DBMS software, all connected through a communication subsystem.]
Types of Transparency
■ Data independence
■ Network transparency (or distribution transparency)
❑ Location transparency
❑ Naming transparency
■ Fragmentation transparency
■ Replication transparency
Reliability Through Transactions
■ Replicated components and data should make distributed
DBMS more reliable.
■ Distributed transactions provide
❑ Concurrency transparency
❑ Failure atomicity
■ Distributed transaction support requires implementation of
❑ Distributed concurrency control protocols
❑ Commit protocols
■ Data replication
❑ Great for read-intensive workloads, problematic for updates
❑ Replication protocols
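As one illustration of a commit protocol, here is a minimal sketch of two-phase commit. The slide does not name a specific protocol, so the choice of two-phase commit is an assumption. The Participant class and the site names are hypothetical, and the sketch leaves out logging, timeouts, and failure handling.

```python
class Participant:
    """A site holding part of a distributed transaction (hypothetical)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote. A real site would first force its log to stable storage.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return 'commit' only if every participant votes yes, else 'abort'."""
    votes = [p.prepare() for p in participants]   # phase 1: collect all votes
    decision = "commit" if all(votes) else "abort"
    for p in participants:                        # phase 2: broadcast decision
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision


all_yes = [Participant("Boston"), Participant("Paris")]
one_no = [Participant("Boston"), Participant("Paris", can_commit=False)]
result_all_yes = two_phase_commit(all_yes)
result_one_no = two_phase_commit(one_no)
```

Failure atomicity shows up in the second run: a single "no" vote aborts the transaction at every site, so no site is left with partial effects.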
Potentially Improved Performance
■ Proximity of data to its points of use
❑ Requires some support for fragmentation and replication
■ Parallelism in execution
❑ Inter-query parallelism
❑ Intra-query parallelism
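The two forms of parallelism can be contrasted with a small sketch built on Python's standard concurrent.futures module. The salary data, the partitioning, and the count_above "query" are made-up assumptions. Threads are used only to show the structure; in CPython they do not give real CPU parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical salary data, split into two partitions (for example, two sites).
partitions = [[30_000, 45_000, 52_000], [61_000, 38_000]]


def count_above(rows, threshold):
    """A tiny 'query': count salaries above a threshold."""
    return sum(1 for salary in rows if salary > threshold)


all_rows = [salary for part in partitions for salary in part]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Inter-query parallelism: two independent queries run at the same time.
    q1 = pool.submit(count_above, all_rows, 40_000)
    q2 = pool.submit(count_above, all_rows, 60_000)
    inter_results = (q1.result(), q2.result())

    # Intra-query parallelism: one query is split across the partitions,
    # and the partial counts are combined into the final answer.
    partials = pool.map(count_above, partitions, [40_000] * len(partitions))
    intra_result = sum(partials)
```

The intra-query result matches the first inter-query result, because both answer the same question; only the way the work is divided differs.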
Scalability
■ Issue is database scaling and workload scaling
■ Adding processing and storage power
❑ Scale-out: add more servers
❑ Scale-up: increase the capacity of one server → has limits
Distributed DBMS Issues
■ Distributed database design
❑ How to distribute the database
❑ Replicated & non-replicated database distribution
❑ A related problem in directory management
■ Distributed query processing
❑ Convert user transactions to data manipulation instructions
❑ Optimization problem
■ min{cost = data transmission + local processing}
❑ General formulation is NP-hard
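The cost formula above can be made concrete with a small sketch that compares a handful of candidate strategies. The strategies, the byte and tuple counts, and the unit costs are all hypothetical numbers chosen for illustration.

```python
def strategy_cost(bytes_shipped, tuples_processed,
                  cost_per_byte=10, cost_per_tuple=1):
    """cost = data transmission + local processing (unit costs are assumed)."""
    return bytes_shipped * cost_per_byte + tuples_processed * cost_per_tuple


# Hypothetical candidate strategies for joining EMP (site 1) with ASG (site 2).
strategies = {
    "ship EMP to site 2": {"bytes_shipped": 40_000, "tuples_processed": 1_400},
    "ship ASG to site 1": {"bytes_shipped": 100_000, "tuples_processed": 1_400},
    "ship both to site 3": {"bytes_shipped": 140_000, "tuples_processed": 1_400},
}

costs = {name: strategy_cost(**s) for name, s in strategies.items()}
best = min(costs, key=costs.get)   # strategy with the lowest total cost
```

Listing every strategy works only for tiny search spaces like this one. Because the general problem is NP-hard, query optimizers usually rely on heuristics or restricted search rather than exhaustive enumeration.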
Distributed DBMS Issues
■ Distributed concurrency control
❑ Synchronization of concurrent accesses
❑ Consistency and isolation of transactions' effects
❑ Deadlock management
■ Reliability
❑ How to make the system resilient to failures
❑ Atomicity and durability
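One common way to handle the deadlock management item above is to detect cycles in a wait-for graph. The slide does not prescribe a technique, so this is an assumed approach, and the transaction names are hypothetical. The sketch runs a depth-first search over a graph held in a single dictionary; in a distributed DBMS the graph would be spread across sites.

```python
def has_deadlock(wait_for):
    """Return True if the wait-for graph contains a cycle.

    wait_for maps each transaction to the set of transactions it waits on.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}

    def visit(txn):
        color[txn] = GREY
        for other in wait_for.get(txn, ()):
            state = color.get(other, WHITE)
            if state == GREY:      # back edge: a cycle of waiting transactions
                return True
            if state == WHITE and visit(other):
                return True
        color[txn] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in wait_for)


no_cycle = {"T1": {"T2"}, "T2": {"T3"}, "T3": set()}
cycle = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}
```

Calling has_deadlock(cycle) reports a deadlock because T1, T2, and T3 each wait on the next. A real system would then pick a victim transaction to abort.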
Distributed DBMS Issues
■ Replication
❑ Mutual consistency
❑ Freshness of copies
❑ Eager vs lazy
❑ Centralized vs distributed
■ Parallel DBMS
❑ Objectives: high scalability and performance
❑ Not geo-distributed
❑ Cluster computing
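Returning to the replication bullets above, the eager versus lazy distinction can be sketched with a toy replicated item. The ReplicatedItem class, its single master copy, and its method names are illustrative assumptions. The sketch ignores transactions, failures, and concurrent writers.

```python
class ReplicatedItem:
    """One logical data item with a master copy and replicas (toy model)."""

    def __init__(self, n_replicas, value=0):
        self.master = value
        self.replicas = [value] * n_replicas
        self.pending = []   # updates not yet propagated (lazy mode)

    def write_eager(self, value):
        # Eager: all copies are updated before the write is considered done.
        self.master = value
        self.replicas = [value] * len(self.replicas)

    def write_lazy(self, value):
        # Lazy: update the master now, propagate to the replicas later.
        self.master = value
        self.pending.append(value)

    def propagate(self):
        # Apply deferred updates in order; replicas end at the latest value.
        for value in self.pending:
            self.replicas = [value] * len(self.replicas)
        self.pending.clear()

    def is_mutually_consistent(self):
        return all(r == self.master for r in self.replicas)


item = ReplicatedItem(n_replicas=2)
item.write_eager(1)
consistent_after_eager = item.is_mutually_consistent()
item.write_lazy(2)
consistent_before_propagate = item.is_mutually_consistent()
item.propagate()
consistent_after_propagate = item.is_mutually_consistent()
```

After a lazy write the replicas are stale until propagation runs, which is the freshness concern. An eager write keeps the copies mutually consistent at the price of touching every copy on each update.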
Related Issues
■ Alternative distribution approaches
❑ Modern P2P
❑ World Wide Web (WWW or Web)
■ Big data processing
❑ 4V: volume, variety, velocity, veracity
❑ MapReduce & Spark
❑ Stream data
❑ Graph analytics
❑ NoSQL
❑ NewSQL
❑ Polystores
DBMS Implementation Alternatives
Dimensions of the Problem
■ Distribution
❑ Whether the components of the system are located on the same machine or
not
■ Heterogeneity
❑ Various levels (hardware, communications, operating system)
❑ DBMS heterogeneity is the most important
■ data model, query language, transaction management algorithms
■ Autonomy
❑ Not well understood and most troublesome
❑ Various versions
■ Design autonomy: Ability of a component DBMS to decide on issues related to
its own design.
■ Communication autonomy: Ability of a component DBMS to decide whether and
how to communicate with other DBMSs.
■ Execution autonomy: Ability of a component DBMS to execute local operations
in any manner it wants to.
Client/Server Architecture
Advantages of Client-Server
Architectures
■ More efficient division of labor
■ Horizontal and vertical scaling of resources
■ Better price/performance on client machines
■ Ability to use familiar tools on client machines
■ Client access to remote data (via standards)
■ Full DBMS functionality provided to client workstations
■ Overall better system price/performance
Database Server
Distributed Database Servers
Peer-to-Peer Component Architecture
MDBS Components & Execution
Mediator/Wrapper Architecture
Cloud Computing
On-demand, reliable services provided over the Internet in
a cost-efficient manner
■ IaaS – Infrastructure-as-a-Service
■ PaaS – Platform-as-a-Service
■ SaaS – Software-as-a-Service
■ DaaS – Database-as-a-Service
Simplified Cloud Architecture