Big Data Using Relational Technologies

The document provides an overview of big data management using relational technologies, covering data storage hierarchy, query processing, and database scaling techniques. It discusses various storage devices, indexing mechanisms, and the architecture of distributed database systems. Additionally, it highlights the challenges and benefits of distributed databases, including transparency, reliability, and performance improvements.

Uploaded by

202418051
Big Data using Relational Technologies

Overview

• Data storage hierarchy
• Data on Disk
• Query Processing
Data Storage

Storage Devices

Computer Storage

Data Size
Data created, captured, copied, and consumed worldwide:

• 2002: 5 exabytes of data created (reported in 2003), 92% of it stored on hard disk drives
• 2010: 2 zettabytes; 2025: 181 zettabytes

Source: https://www.statista.com/statistics/871513/worldwide-data-created/
Storage Hierarchy
• primary storage: Fastest media but volatile (cache, main memory).
• secondary storage: next level in hierarchy, non‐volatile, moderately
fast access time
• Also called on‐line storage
• E.g., flash memory, magnetic disks
• tertiary storage: lowest level in hierarchy, non‐volatile, slow access
time
• also called off‐line storage and used for archival storage
• e.g., magnetic tape, optical storage
• Magnetic tape
• Sequential access, 1 to 12 TB capacity
• A few drives with many tapes
• Juke boxes with petabytes (1000’s of TB) of storage
Storage Interfaces

• Disk interface standards families
• SATA (Serial ATA)
• SATA 3 supports data transfer speeds of up to 6 gigabits/sec
• SAS (Serial Attached SCSI)
• SAS Version 3 supports 12 gigabits/sec
• NVMe (Non‐Volatile Memory Express) interface
• Works with PCIe connectors to support lower latency and higher transfer
rates
• Supports data transfer rates of up to 24 gigabits/sec
• Disks usually connected directly to computer system
• In Storage Area Networks (SAN), a large number of disks are
connected by a high‐speed network to a number of servers
• In Network Attached Storage (NAS) networked storage provides a file
system interface using networked file system protocol, instead of
providing a disk system interface
Magnetic Hard Disk Mechanism

[Figures: schematic diagram of a magnetic disk drive; photo of a magnetic disk drive]

Organising Data
• Database: a collection of files
• File: a sequence of records
• Record: a sequence of fields
• Records: fixed length or variable length
• A data block contains records
• Data partitioning: vertical, horizontal, hybrid
• Column-oriented storage

Database Scaling

• Indexing: Analyze the query patterns of the application and create the appropriate indexes
• Denormalization: Reduce complex joins to improve query performance
• Database Caching: Store frequently accessed data in a faster storage layer
• Materialized Views: Precompute complex query results and store them for faster access
• Vertical Scaling: Boost the DB server by adding more CPU, RAM, or storage
• Sharding: Split the data into smaller partitions (shards) and spread them across multiple servers, so that both reads and writes scale
• Replication: Create replicas of the primary database on different servers for scaling the reads
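
As a rough illustration of sharding, the sketch below routes each row to one of several servers by hashing its key. The in-memory dictionaries standing in for servers, the `shard_for` routing rule, and the key format are all assumptions made for the example, not something the slides prescribe.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of database servers

# Four in-memory dictionaries stand in for four database servers.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard number with a stable hash (hypothetical routing rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def put(key: str, row: dict) -> None:
    """Write the row to the single shard that owns the key."""
    shards[shard_for(key)][key] = row


def get(key: str):
    """Read from the single shard that owns the key; None if absent."""
    return shards[shard_for(key)].get(key)


for i in range(100):
    put(f"user{i}", {"id": i})
```

Every read or write touches exactly one shard, which is why sharding can scale writes as well as reads. A plain modulo rule like this one forces most keys to move when the number of shards changes; real systems often use range-based or consistent-hash routing instead.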
Basic Concepts

• Indexing mechanisms are used to speed up access to desired data.
• E.g., author catalog in library
• Search key: an attribute or set of attributes used to look up records in a file.
• An index file consists of records (called index entries) of the form (search-key, pointer).
• Index files are typically much smaller than the original file
• Two basic kinds of indices:
• Ordered indices: search keys are stored in sorted order
• Hash indices: search keys are distributed uniformly across “buckets”
using a “hash function”.
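
The two kinds of index can be sketched in a few lines. In this toy example a Python list plays the role of the file, list positions play the role of record pointers, and the record contents are made up; the ordered index uses binary search from the standard `bisect` module, and the hash index relies on Python's dictionary as the hash table.

```python
import bisect
from collections import defaultdict

# A toy "file": list positions act as record pointers.
records = [
    {"id": 17, "author": "Knuth"},
    {"id": 4, "author": "Codd"},
    {"id": 42, "author": "Gray"},
    {"id": 8, "author": "Codd"},
]

# Ordered index: (search-key, pointer) entries kept sorted by search key.
ordered_index = sorted((r["id"], pos) for pos, r in enumerate(records))
keys = [k for k, _ in ordered_index]


def lookup_range(lo, hi):
    """Return records with lo <= id <= hi via binary search on the sorted keys."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return [records[ptr] for _, ptr in ordered_index[start:end]]


# Hash index: each search key hashes to a bucket of pointers.
hash_index = defaultdict(list)
for pos, r in enumerate(records):
    hash_index[r["author"]].append(pos)


def lookup_equal(author):
    """Return records whose author equals the given value."""
    return [records[ptr] for ptr in hash_index.get(author, [])]
```

The ordered index answers both equality and range lookups, while the hash index only helps with equality lookups.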
Index Evaluation Metrics

• Access types supported efficiently. E.g.,
• Records with a specified value in the attribute
• Records with an attribute value falling in a specified range of values.
• Access time
• Insertion time
• Deletion time
• Space overhead
Ordered Indices

• In an ordered index, index entries are stored sorted on the search key value.
• Clustering index: in a sequentially ordered file, the index whose
search key specifies the sequential order of the file.
• Also called primary index
• The search key of a primary index is usually but not necessarily the
primary key.
• Secondary index: an index whose search key specifies an order
different from the sequential order of the file. Also called
nonclustering index.
• Index‐sequential file: sequential file ordered on a search key, with a
clustering index on the search key.
Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
• Parsing and translation: the parser checks syntax and verifies relations, then translates the query into tokens and an internal form, which is in turn translated into relational algebra.
• Optimization: among all equivalent evaluation plans, choose the one with the lowest cost. Cost is estimated using statistical information.
• Evaluation: the query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
Parsing and Translation

• Lexical analysis: Break down the query into tokens, removing white spaces and
comments
• Syntactic analysis: Check that the query follows the rules of SQL
• Semantic analysis: Check that the query is semantically correct
• Translate to relational algebra: Convert the query into relational algebra
operations and operands
• Verify syntax: Check that the original query string is valid

The output of parsing and translation is a relational algebra expression, query tree, or query
graph. This is a form that can be used by the optimization engine.
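
The lexical-analysis step can be sketched with a small regular-expression tokenizer. The token categories, the `KEYWORDS` set, and the handling of `--` comments are assumptions chosen for a tiny SQL subset; a real parser covers far more of the language.

```python
import re

KEYWORDS = {"SELECT", "FROM", "WHERE", "AND"}  # assumed, tiny subset of SQL

TOKEN_RE = re.compile(r"""
    \s+                                                             | # whitespace (skipped)
    --[^\n]*                                                        | # line comment (skipped)
    (?P<number>\d+(?:\.\d+)?)                                       |
    (?P<ident>[A-Za-z_][A-Za-z_0-9]*(?:\.[A-Za-z_][A-Za-z_0-9]*)?)  |
    (?P<op><=|>=|<>|[=<>,*()])
""", re.VERBOSE)


def tokenize(sql: str):
    """Break a query string into (kind, text) tokens, dropping spaces and comments."""
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {sql[pos]!r} at position {pos}")
        if m.lastgroup:  # whitespace and comments match no named group
            kind, text = m.lastgroup, m.group()
            if kind == "ident" and text.upper() in KEYWORDS:
                kind, text = "keyword", text.upper()
            tokens.append((kind, text))
        pos = m.end()
    return tokens
```

The resulting token list is what a syntactic analyzer would consume next to check the query against the grammar.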
Query Optimization

• Relational Algebra
• Evaluation : Query plan, equivalent expressions, transformations
• Measurement of Query Cost: cost function
• Statistics for Cost estimation: size estimation of selection, join, projection, aggregate
• Materialization, Pipelining
• Multiquery Optimization
• Distributed Query Optimization
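
A small worked example may help show how statistics feed a cost function. The sketch uses two common textbook size estimates (uniform-distribution selection and equi-join estimates), invented table statistics, and a deliberately crude cost measure, the size of the intermediate result; none of these numbers come from the slides.

```python
def selection_size(n_rows: int, distinct_values: int) -> float:
    """Estimated rows matching A = constant, assuming uniformly distributed values."""
    return n_rows / distinct_values


def join_size(n_r: int, n_s: int, v_r: int, v_s: int) -> float:
    """Estimated rows in an equi-join on A; v_r and v_s count distinct A values."""
    return n_r * n_s / max(v_r, v_s)


# Hypothetical statistics: EMP(ENO, TITLE), ASG(ENO, ...); ENO is a key of EMP.
EMP_ROWS, EMP_TITLES = 10_000, 50
ASG_ROWS, ASG_ENOS = 50_000, 10_000

# Plan A: join EMP with ASG first, then apply TITLE = constant.
plan_a_intermediate = join_size(EMP_ROWS, ASG_ROWS, EMP_ROWS, ASG_ENOS)

# Plan B: apply TITLE = constant to EMP first, then join with ASG.
plan_b_intermediate = selection_size(EMP_ROWS, EMP_TITLES)

# Toy cost function: size of the intermediate result each plan materializes.
plans = {
    "join_then_select": plan_a_intermediate,
    "select_then_join": plan_b_intermediate,
}
best = min(plans, key=plans.get)
```

Both plans are equivalent expressions that return the same answer, but pushing the selection down shrinks the intermediate result from an estimated 50,000 rows to 200, so the optimizer would prefer it under this cost model.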

(Keyword based) Query Processing

(Knowledge based) Query Processing

Semantic Query Processing

Resources for Query Processing

• Resources
• Resource Optimization
• Maximizing Utilization
• Cloud Based Systems

Machine Learning Techniques for Databases

• Database problems
• normalization
• partitioning
• query processing
• estimation

Relational Technologies

Database System Architecture

• Centralized Database Systems
• Server System Architectures
• Parallel Systems
• Distributed Systems
• Network Types

Distributed Systems
• Data spread over multiple machines (also referred to as sites or nodes).
• Local-area networks (LANs)
• Wide-area networks (WANs)
• Higher latency

[Figure: sites A, B, and C communicating via a network]

Distributed Databases
• Homogeneous distributed databases
• Same software/schema on all sites, data may be partitioned among
sites
• Goal: provide a view of a single database, hiding details of
distribution
• Heterogeneous distributed databases
• Different software/schema on different sites
• Goal: integrate existing databases to provide useful functionality
• Differentiate between local transactions and global
transactions
• A local transaction accesses data in the single site at which the
transaction was initiated.
• A global transaction either accesses data in a site different from the
one at which the transaction was initiated or accesses data in
several different sites.
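
The local/global distinction reduces to a simple set test, sketched below. The function name and the representation of sites as strings are assumptions for illustration.

```python
def classify(origin_site: str, accessed_sites: set) -> str:
    """Return 'local' if only the originating site is accessed, else 'global'."""
    return "local" if accessed_sites <= {origin_site} else "global"
```

For example, a transaction started at site A that reads only site A is local, while one that touches site B, or both A and C, is global.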
Parallel and Distributed Systems

• Data Storage
• Query Processing : Interquery parallelism, Intraquery parallelism, Query plan evaluation

Distributed Databases
• A distributed database (DDB) is a collection of multiple, logically
interrelated databases distributed over a computer network.

• A distributed database management system (D–DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.

• Distributed database system (DDBS) = DDB + D–DBMS

Centralized DBMS on a Network

[Figure: Sites 1 to 5 connected by a communication network]

Distributed DBMS Environment

[Figure: Sites 1 to 5 connected by a communication network]

File Systems

[Figure: programs 1, 2, and 3 each carry their own data description and access their own file (File 1, File 2, File 3)]

Database Management

[Figure: application programs 1, 2, and 3 (with data semantics) access a shared database through the DBMS, which handles description, manipulation, and control]

Motivation

[Figure: database technology and computer networks (data distribution) combine into distributed database systems (integration)]

integration ≠ centralization

Distributed DBMS Promises

• Transparent management of distributed, fragmented, and replicated data
• Separation of higher level semantics of a system from lower level implementation issues
• Improved reliability/availability through distributed transactions
• Improved performance
• Easier and more economical system expansion
Transparency

• Transparency is the separation of the higher level semantics of a system from the lower level
implementation issues.

• Fundamental issue is to provide data independence in the distributed environment
• Network (distribution) transparency
• Replication transparency
• Fragmentation transparency
• horizontal fragmentation: selection
• vertical fragmentation: projection
• hybrid
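
The link between fragmentation and relational operators can be made concrete with a toy relation. In the sketch below the employee rows, attribute names, and helper functions are invented for illustration: horizontal fragments are produced by selection and recombined by union, and vertical fragments are produced by projection (each keeping the key) and recombined by a join on that key.

```python
emp = [
    {"eno": 1, "ename": "Ana", "title": "Engineer", "city": "Paris"},
    {"eno": 2, "ename": "Bob", "title": "Analyst", "city": "Boston"},
    {"eno": 3, "ename": "Chen", "title": "Engineer", "city": "Boston"},
]


def select(rows, predicate):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Relational projection: keep only the named columns of every row."""
    return [{c: r[c] for c in columns} for r in rows]


# Horizontal fragmentation: one fragment per city, defined by selection.
emp_paris = select(emp, lambda r: r["city"] == "Paris")
emp_boston = select(emp, lambda r: r["city"] == "Boston")

# Vertical fragmentation: column groups defined by projection; each keeps the key.
emp_names = project(emp, ["eno", "ename"])
emp_jobs = project(emp, ["eno", "title", "city"])


def union(a, b):
    """Rebuild a relation from horizontal fragments (sorted by key for comparison)."""
    return sorted(a + b, key=lambda r: r["eno"])


def join_on_key(a, b, key):
    """Rebuild a relation from vertical fragments by joining on the shared key."""
    by_key = {r[key]: r for r in b}
    return [{**r, **by_key[r[key]]} for r in a]
```

A hybrid scheme applies both steps, for instance projecting the Boston fragment into its own column groups.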
Example

Transparent Access

SELECT ENAME, SAL
FROM EMP, ASG, PAY
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND PAY.TITLE = EMP.TITLE

[Figure: sites in Tokyo, Paris, Boston, Montreal, and New York connected by a communication network; each site stores fragments and copies of the employees, projects, and assignments data, e.g. New York projects with budget > 200000]

Distributed Database ‐ User View

[Figure: a single logical distributed database, as seen by users]

Distributed DBMS ‐ Reality

[Figure: user queries and user applications at several sites, each site running its own DBMS software, linked by a communication subsystem]

Types of Transparency

• Data independence
• Logical: immunity of user applications to changes in the logical structure
• Physical: hiding details of the storage structure from user applications
• Network transparency (or distribution transparency)
• Location transparency
• Replication transparency (the user is not aware of how many copies exist or where they are)
• Fragmentation transparency (queries specified on the original unfragmented relations are executed by the DBMS on the fragmented relations)

Reliability Through Transactions
• Replicated components and data should make distributed DBMS more reliable.
• Distributed transactions provide
• Concurrency transparency
• Failure atomicity
• Distributed transaction support requires implementation of
• Distributed concurrency control protocols
• Commit protocols
• Data replication
• Great for read‐intensive workloads, problematic for updates
• Replication protocols
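
The slides do not name a specific commit protocol. Two-phase commit is one widely used example, and the sketch below shows only its happy-path logic. The `Participant` class and its `can_commit` flag are hypothetical, and timeouts, logging, and coordinator failure are deliberately left out.

```python
class Participant:
    """A site taking part in a distributed transaction (simplified, in-memory)."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit  # assumed flag: would this site vote yes?
        self.state = "active"

    def prepare(self) -> bool:
        """Phase 1: vote on whether the local part of the transaction can commit."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> str:
    """Coordinator logic: commit everywhere only if every site votes yes."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if all(votes):
        for p in participants:  # phase 2: global commit
            p.commit()
        return "committed"
    for p in participants:  # phase 2: global abort
        p.abort()
    return "aborted"
```

Either all sites end in the committed state or all end in the aborted state, which is the failure-atomicity property listed above.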

Potentially Improved Performance

• Proximity of data to its points of use : It requires some support for fragmentation and replication
• Parallelism in execution
• Inter‐query parallelism
• Intra‐query parallelism

Parallelism Requirements

• Have as much of the data required by each application at the site where the application executes
(Full replication)
• How about updates?
• Mutual consistency
• Freshness of copies

System Expansion

• Issue is database scaling
• Emergence of microprocessor and workstation technologies
• Demise of Grosch's law
• I believe that there is a fundamental rule, which I modestly call Grosch's law, giving added economy only as the
square root of the increase in speed — that is, to do a calculation ten times as cheaply you must do it hundred
times as fast.

• Client‐server model of computing
• Data communication cost vs telecommunication cost

Distributed DBMS Issues

• Distributed Database Design
• How to distribute the database
• Replicated & non‐replicated database distribution
• A related problem in directory management
• Query Processing
• Convert user transactions to data manipulation instructions
• Optimization problem
• min{cost = data transmission + local processing}
• General formulation is NP‐hard
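
The cost expression above can be tried on a toy case: a join of two relations stored at different sites, where the only choice is which relation to ship. The `Plan` fields, the per-byte and per-tuple weights, and the sizes are all made-up values used to show the shape of the computation.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    bytes_shipped: int      # data transmitted between sites
    tuples_processed: int   # local processing work at the executing site


def cost(plan: Plan, per_byte: int = 10, per_tuple: int = 1) -> int:
    """cost = data transmission + local processing, with assumed unit weights."""
    return plan.bytes_shipped * per_byte + plan.tuples_processed * per_tuple


plans = [
    Plan("ship EMP to the ASG site", bytes_shipped=1_000_000, tuples_processed=60_000),
    Plan("ship ASG to the EMP site", bytes_shipped=5_000_000, tuples_processed=60_000),
]

best = min(plans, key=cost)
```

With only two candidate plans the minimum is found by enumeration; the NP-hardness noted above is why real optimizers rely on heuristics once many relations and sites are involved.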

Distributed DBMS Issues

• Concurrency Control
• Synchronization of concurrent accesses
• Consistency and isolation of transactions
• Deadlock management
• Reliability
• How to make the system resilient to failures
• Atomicity and durability
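
One common way to detect deadlocks, sketched here as an assumption since the slides do not specify a method, is to look for a cycle in a wait-for graph, where an edge from T1 to T2 means T1 is waiting for a lock held by T2. In a distributed setting the graph would have to be assembled from several sites, which this single-process sketch ignores.

```python
def has_deadlock(wait_for: dict) -> bool:
    """Detect a cycle in a wait-for graph using depth-first search."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    nodes = set(wait_for) | {t for targets in wait_for.values() for t in targets}
    color = {n: WHITE for n in nodes}

    def visit(node) -> bool:
        color[node] = GREY
        for nxt in wait_for.get(node, ()):
            if color[nxt] == GREY:  # back edge: a cycle, hence a deadlock
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)
```

Once a cycle is found, a system would typically pick one transaction on the cycle as a victim and abort it.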

Related Issues

• Operating System Support
• Operating system with proper support for database operations
• Dichotomy between general purpose processing requirements and database processing
requirements
• Open Systems and Interoperability
• Distributed Multidatabase Systems
• More probable scenario
• Parallel issues

Architecture

• Defines the structure of the system
• components identified
• functions of each component defined
• interrelationships and interactions between components defined

ANSI/SPARC Architecture

[Figure: three-level ANSI/SPARC architecture]
• Users
• External schema: several external views
• Conceptual schema: conceptual view
• Internal schema: internal view
Database Server

Distributed Database Servers
