Big Data Using Relational Technologies

The document provides an overview of big data management using relational technologies, covering data storage hierarchy, query processing, and database scaling techniques. It discusses various storage devices, indexing mechanisms, and the architecture of distributed database systems. Additionally, it highlights the challenges and benefits of distributed databases, including transparency, reliability, and performance improvements.

Uploaded by

202418051


Big Data using Relational Technologies

Overview

• Data storage hierarchy

• Data on Disk
• Query Processing
Data Storage

Storage Devices

Computer Storage
Data Size
Volume of data created, captured, copied, and consumed worldwide:

• A 2003 estimate: 5 exabytes of data were created in 2002, 92% of it stored on hard disk drives
• 2010: 2 zettabytes; 2025: 181 zettabytes (projected)

Source: https://www.statista.com/statistics/871513/worldwide-data-created/
Storage Hierarchy
• Primary storage: fastest media, but volatile (cache, main memory)
• Secondary storage: next level in the hierarchy; non-volatile, moderately fast access time
• Also called on-line storage
• E.g., flash memory, magnetic disks
• Tertiary storage: lowest level in the hierarchy; non-volatile, slow access time
• Also called off-line storage; used for archival storage
• E.g., magnetic tape, optical storage
• Magnetic tape
• Sequential access, 1 to 12 TB capacity
• A few drives with many tapes
• Jukeboxes with petabytes (1000s of TB) of storage
Storage Interfaces

• Disk interface standards families

• SATA (Serial ATA)
• SATA 3 supports data transfer speeds of up to 6 gigabits/sec
• SAS (Serial Attached SCSI)
• SAS Version 3 supports 12 gigabits/sec
• NVMe (Non‐Volatile Memory Express) interface
• Works with PCIe connectors to support lower latency and higher transfer
rates
• Supports data transfer rates of up to 24 gigabits/sec
• Disks are usually connected directly to the computer system
• In Storage Area Networks (SAN), a large number of disks are connected by a high-speed network to a number of servers
• In Network Attached Storage (NAS), networked storage provides a file system interface using a networked file system protocol, instead of providing a disk system interface
Magnetic Hard Disk Mechanism

[Figures: schematic diagram of a magnetic disk drive; photo of a magnetic disk drive]

Organising Data
• Database: collection of files
• File: sequence of records
• Record: sequence of fields

• Records: Fixed length, variable length

• Data block contains records
• Data Partitioning : Vertical, Horizontal, Hybrid
• Column Oriented Storage

Database Scaling

• Indexing: Analyze the query patterns of the application and create the appropriate indexes
• Denormalization: Reduce complex joins to improve query performance
• Database Caching: Store frequently accessed data in a faster storage layer
• Materialized Views: Precompute complex query results and store them for faster access
• Vertical Scaling: Boost the DB server by adding more CPU, RAM, or storage
• Sharding: Split the data horizontally across multiple database servers (shards) so that each server stores and serves only a subset of the rows
• Replication: Create replicas of the primary database on different servers for scaling the reads
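
As a rough illustration of sharding, that is, spreading rows over several servers, the sketch below routes each row to a shard by hashing its key. The shard count, the key format, and the helper names `shard_for`, `put`, and `get` are assumptions made for this example; each shard is just an in-memory dictionary standing in for a database server.

```python
import hashlib

NUM_SHARDS = 4  # assumed number of shards, chosen only for illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each shard is modelled as a dict standing in for one database server.
shards = [dict() for _ in range(NUM_SHARDS)]


def put(key: str, row: dict) -> None:
    """Store the row on the shard that owns the key."""
    shards[shard_for(key)][key] = row


def get(key: str):
    """Look the row up on the single shard that owns the key."""
    return shards[shard_for(key)].get(key)


put("user:1", {"name": "Ada"})
put("user:2", {"name": "Grace"})
```

A lookup touches only one shard, which is why sharding can scale writes as well as reads. Simple modulo hashing moves most keys when the shard count changes, so real deployments often use a different placement scheme.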
Basic Concepts

• Indexing mechanisms are used to speed up access to desired data.
• E.g., author catalog in a library
• Search key: an attribute or set of attributes used to look up records in a file.
• An index file consists of records (called index entries) of the form (search-key, pointer)
• Index files are typically much smaller than the original file
• Two basic kinds of indices:
• Ordered indices: search keys are stored in sorted order
• Hash indices: search keys are distributed uniformly across “buckets”
using a “hash function”.
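
The two kinds of indices can be sketched in a few lines. This is a toy model under stated assumptions: the records, the bucket count, and the function names are invented for illustration, and a list position plays the role of the pointer.

```python
import bisect

# A tiny "file" of records; the list position plays the role of the pointer.
records = [
    {"id": 17, "author": "Knuth"},
    {"id": 4, "author": "Codd"},
    {"id": 42, "author": "Gray"},
    {"id": 8, "author": "Stonebraker"},
]

# Ordered index: (search-key, pointer) entries kept sorted on the search key.
ordered_index = sorted((rec["id"], pos) for pos, rec in enumerate(records))
keys = [k for k, _ in ordered_index]

# Hash index: search keys spread over buckets by a hash function.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]
for pos, rec in enumerate(records):
    buckets[hash(rec["id"]) % NUM_BUCKETS].append((rec["id"], pos))


def lookup_ordered(key):
    """Binary search over the sorted index entries."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[ordered_index[i][1]]
    return None


def lookup_hash(key):
    """Scan only the bucket the key hashes to."""
    for k, pos in buckets[hash(key) % NUM_BUCKETS]:
        if k == key:
            return records[pos]
    return None


def range_ordered(lo, hi):
    """Range lookup (lo <= key <= hi), which the ordered index supports naturally."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return [records[pos] for _, pos in ordered_index[start:end]]
```

Both indices answer point lookups, but only the ordered index answers a range query without scanning every bucket.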
Index Evaluation Metrics

• Access types supported efficiently. E.g.,

• Records with a specified value in the attribute
• Records with an attribute value falling in a specified range of values.
• Access time
• Insertion time
• Deletion time
• Space overhead
Ordered Indices

• In an ordered index, index entries are stored sorted on the search key value.
• Clustering index: in a sequentially ordered file, the index whose
search key specifies the sequential order of the file.
• Also called primary index
• The search key of a primary index is usually but not necessarily the
primary key.
• Secondary index: an index whose search key specifies an order
different from the sequential order of the file. Also called
nonclustering index.
• Index‐sequential file: sequential file ordered on a search key, with a
clustering index on the search key.
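
The difference between a clustering and a secondary index can be sketched as follows. The data, the block size, and the function names are hypothetical. Making the clustering index sparse (one entry per block) is a choice for this example; the secondary index has to point at every record because the file is not ordered on its search key.

```python
import bisect
from collections import defaultdict

# Sequential file: records stored in order of the search key (the first field).
file_records = [
    (101, "Paris"), (205, "Boston"), (310, "Paris"),
    (412, "Tokyo"), (518, "Boston"), (633, "Paris"),
]
BLOCK_SIZE = 2  # assumed records per block
blocks = [file_records[i:i + BLOCK_SIZE]
          for i in range(0, len(file_records), BLOCK_SIZE)]

# Clustering index (sparse here): first key of each block, in file order.
clustering_keys = [blk[0][0] for blk in blocks]


def find_by_id(key):
    """Locate the block via the clustering index, then scan that block."""
    b = bisect.bisect_right(clustering_keys, key) - 1
    if b < 0:
        return None
    for rec in blocks[b]:
        if rec[0] == key:
            return rec
    return None


# Secondary index on city: one (block, slot) pointer per record.
secondary = defaultdict(list)
for b, blk in enumerate(blocks):
    for slot, (_, city) in enumerate(blk):
        secondary[city].append((b, slot))


def find_by_city(city):
    """Follow every pointer stored for the city; matches may sit in many blocks."""
    return [blocks[b][slot] for b, slot in secondary.get(city, [])]
```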
Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
• Parsing and translation: translate the query into tokens and an internal form, which is then translated into relational algebra. The parser checks syntax and verifies relations.
• Optimization: among all equivalent evaluation plans, choose the one with the lowest cost. Cost is estimated using statistical information.
• Evaluation: the query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
Parsing and Translation

• Lexical analysis: Break down the query into tokens, removing white spaces and
comments
• Syntactic analysis: Check that the query follows the rules of SQL
• Semantic analysis: Check that the query is semantically correct
• Translate to relational algebra: Convert the query into relational algebra
operations and operands
• Verify syntax: Check that the original query string is valid

The output of parsing and translation is a relational algebra expression, query tree, or query
graph. This is a form that can be used by the optimization engine
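
A minimal sketch of the first and last steps above, assuming a very restricted `SELECT ... FROM ... [WHERE ...]` query. The token pattern and the function names are invented for this example; a real parser handles a full grammar and checks relation names against the catalog.

```python
import re

# One token per match: a comment, a word (name or keyword), a number, or a symbol.
TOKEN_RE = re.compile(
    r"\s*(?:(--[^\n]*)|([A-Za-z_][A-Za-z_0-9.]*)|(\d+)|(<=|>=|<>|[=<>,*()]))"
)


def tokenize(sql):
    """Lexical analysis: split into tokens, dropping white space and comments."""
    sql = sql.rstrip()
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}: {sql[pos]!r}")
        pos = m.end()
        comment, word, number, symbol = m.groups()
        if comment is None:
            tokens.append(word or number or symbol)
    return tokens


def to_relational_algebra(tokens):
    """Translate SELECT cols FROM rels [WHERE cond] into an algebra string."""
    upper = [t.upper() for t in tokens]
    if not upper or upper[0] != "SELECT" or "FROM" not in upper:
        raise SyntaxError("expected SELECT ... FROM ...")
    f = upper.index("FROM")
    w = upper.index("WHERE") if "WHERE" in upper else len(tokens)
    columns = [t for t in tokens[1:f] if t != ","]
    relations = [t for t in tokens[f + 1:w] if t != ","]
    expr = " × ".join(relations)
    if w < len(tokens):
        condition = " ".join(tokens[w + 1:])
        expr = f"σ[{condition}]({expr})"
    return "π[" + ", ".join(columns) + "](" + expr + ")"


tokens = tokenize("SELECT ENAME, SAL -- names\nFROM EMP, PAY WHERE DUR > 12")
algebra = to_relational_algebra(tokens)
```

Here `algebra` is a projection over a selection over a Cartesian product, which is the kind of expression the optimizer can then rewrite.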
Query Optimization

• Relational Algebra
• Evaluation : Query plan, equivalent expressions, transformations
• Measurement of Query Cost: cost function
• Statistics for Cost estimation: size estimation of selection, join, projection, aggregate
• Materialization, Pipelining
• Multiquery Optimization
• Distributed Query Optimization
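
To make cost-based plan selection concrete, the sketch below compares join orders using the textbook size estimate for a join of r and s on attribute A, n_r · n_s / max(V(A, r), V(A, s)). All statistics are made up. The cost function (total tuples in intermediate results) and the treatment of intermediate distinct-value counts are simplifying assumptions, not a real optimizer's cost model.

```python
from itertools import permutations

# Hypothetical catalog statistics: tuple counts and distinct values per attribute.
rows = {"EMP": 10_000, "ASG": 50_000, "PAY": 50}
distinct = {
    ("EMP", "ENO"): 10_000, ("ASG", "ENO"): 8_000,
    ("EMP", "TITLE"): 50, ("PAY", "TITLE"): 50,
}
# Join predicates as (relation, relation, attribute).
predicates = [("EMP", "ASG", "ENO"), ("EMP", "PAY", "TITLE")]


def plan_cost(order):
    """Cost of a left-deep join order = sum of estimated intermediate sizes."""
    joined = {order[0]}
    size = float(rows[order[0]])
    cost = 0.0
    for rel in order[1:]:
        new_size = size * rows[rel]  # Cartesian product if no predicate applies
        for a, b, attr in predicates:
            if rel in (a, b):
                other = b if rel == a else a
                if other in joined:
                    # Assumption: an intermediate result cannot have more
                    # distinct values than it has tuples.
                    v_other = min(distinct[(other, attr)], size)
                    new_size /= max(v_other, distinct[(rel, attr)])
        joined.add(rel)
        size = new_size
        cost += size
    return cost


# Enumerate all equivalent join orders and keep the cheapest estimate.
best = min(permutations(rows), key=plan_cost)
```

With these numbers, joining EMP with the small PAY relation first keeps the intermediate result small, while any order that starts with ASG and PAY pays for a Cartesian product.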

(Keyword based) Query Processing

(Knowledge based) Query Processing

Semantic Query Processing
Resources for Query Processing

• Resources
• Resource Optimization
• Maximizing Utilization
• Cloud Based Systems

Machine Learning Techniques for Databases

• Database problems
• normalization
• partitioning
• query processing
• estimation

Relational Technologies

Database System Architecture

• Centralized Database Systems

• Server System Architectures
• Parallel Systems
• Distributed Systems
• Network Types

Distributed Systems
• Data spread over multiple machines (also referred to as sites or nodes).
• Local-area networks (LANs)
• Wide-area networks (WANs)
• Higher latency

[Figure: sites A, B, and C communicating via a network]
Distributed Databases
• Homogeneous distributed databases
• Same software/schema on all sites, data may be partitioned among
sites
• Goal: provide a view of a single database, hiding details of
distribution
• Heterogeneous distributed databases
• Different software/schema on different sites
• Goal: integrate existing databases to provide useful functionality
• Differentiate between local transactions and global
transactions
• A local transaction accesses data in the single site at which the
transaction was initiated.
• A global transaction either accesses data in a site different from the
one at which the transaction was initiated or accesses data in
several different sites.
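
The local/global distinction can be stated as a one-line check. The function name and the site labels are hypothetical.

```python
def classify_transaction(origin_site, accessed_sites):
    """Return "local" if only the originating site is accessed, else "global"."""
    return "local" if set(accessed_sites) <= {origin_site} else "global"


# A transaction started at site A:
only_a = classify_transaction("A", {"A"})       # touches only its own site
only_b = classify_transaction("A", {"B"})       # touches one remote site
a_and_c = classify_transaction("A", {"A", "C"})  # touches several sites
```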
Parallel and Distributed Systems

• Data Storage
• Query Processing : Interquery parallelism, Intraquery parallelism, Query plan evaluation

Distributed Databases
• A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.

• A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.

• Distributed database system (DDBS) = DDB + D-DBMS
Centralized DBMS on a Network

[Figure: Sites 1 to 5 connected through a communication network]
Distributed DBMS Environment

[Figure: Sites 1 to 5 connected through a communication network]
File Systems

[Figure: programs 1, 2, and 3, each with its own data description (1, 2, 3) and its own file (File 1, File 2, File 3)]
Database Management

[Figure: application programs 1, 2, and 3 (each with data semantics) access the database through the DBMS, which handles description, manipulation, and control]
Motivation

[Figure: database technology and computer networks (data distribution) combine into distributed database systems (integration)]

integration ≠ centralization
Distributed DBMS Promises

• Transparent management of distributed, fragmented, and replicated data

• Separation of higher level semantics of a system from lower level implementation issues.

• Improved reliability/availability through distributed transactions

• Improved performance
• Easier and more economical system expansion
Transparency

• Transparency is the separation of the higher-level semantics of a system from the lower-level implementation issues.

• The fundamental issue is to provide data independence in the distributed environment
• Network (distribution) transparency
• Replication transparency
• Fragmentation transparency
• horizontal fragmentation: selection
• vertical fragmentation: projection
• hybrid
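
The fragmentation types listed above can be sketched on a small relation: horizontal fragments come from a selection, vertical fragments from a projection that keeps the key, and hybrid fragments from applying both. The sample rows, the CITY attribute, and the helper names are assumptions for illustration only.

```python
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng.", "CITY": "Paris"},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Syst. Anal.", "CITY": "Boston"},
    {"ENO": "E3", "ENAME": "A. Lee", "TITLE": "Mech. Eng.", "CITY": "Paris"},
]


def horizontal_fragment(relation, predicate):
    """Selection: keep whole rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, attributes, key="ENO"):
    """Projection: keep some columns, always including the key."""
    columns = [key] + [a for a in attributes if a != key]
    return [{a: row[a] for a in columns} for row in relation]


paris = horizontal_fragment(EMP, lambda r: r["CITY"] == "Paris")
others = horizontal_fragment(EMP, lambda r: r["CITY"] != "Paris")
names = vertical_fragment(EMP, ["ENAME"])
jobs = vertical_fragment(EMP, ["TITLE", "CITY"])

# Hybrid: a vertical fragment of a horizontal fragment.
paris_names = vertical_fragment(paris, ["ENAME"])


def reconstruct_horizontal(*fragments):
    """Union of the horizontal fragments."""
    return [row for frag in fragments for row in frag]


def reconstruct_vertical(left, right, key="ENO"):
    """Join of the vertical fragments on the shared key."""
    by_key = {row[key]: row for row in right}
    return [{**row, **by_key[row[key]]} for row in left]
```

Fragmentation transparency means a query written against EMP still works. The system rewrites it over the fragments, using a union for horizontal fragments and a join for vertical ones.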
Example

Transparent Access

SELECT ENAME, SAL
FROM EMP, ASG, PAY
WHERE DUR > 12
AND EMP.ENO = ASG.ENO
AND PAY.TITLE = EMP.TITLE

[Figure: sites in Tokyo, Paris, Boston, Montreal, and New York connected by a communication network; the employees, projects, and assignments data for Paris, Boston, Montreal, and New York are spread across the sites, some of it replicated (for example, New York projects with budget > 200000)]
Distributed Database ‐ User View

Distributed Database

Distributed DBMS ‐ Reality

[Figure: several sites, each running DBMS software and serving user queries and user applications, linked by a communication subsystem]
Types of Transparency

• Data independence
• Logical: immunity of user applications to changes in the logical structure
• Physical: hiding details of the storage structure from user applications
• Network transparency (or distribution transparency)
• Location transparency
• Replication transparency (the user is not aware of how many copies exist or where they are)
• Fragmentation transparency (queries specified on the original, unfragmented relations are executed by the DBMS on the fragments)
Reliability Through Transactions
• Replicated components and data should make distributed DBMS more reliable.
• Distributed transactions provide
• Concurrency transparency
• Failure atomicity
• Distributed transaction support requires implementation of
• Distributed concurrency control protocols
• Commit protocols
• Data replication
• Great for read‐intensive workloads, problematic for updates
• Replication protocols
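
The slides do not name a specific commit protocol. As one well-known example, the sketch below shows the basic shape of two-phase commit with an in-memory coordinator. Class and method names are hypothetical, and timeouts, logging, and recovery, which a real implementation needs, are left out.

```python
class Participant:
    """One site taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        """Phase 1: vote yes (and become prepared) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: commit everywhere only if every site votes yes."""
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): send the same outcome to every participant.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


ok = two_phase_commit([Participant("Paris"), Participant("Boston")])
sites = [Participant("Paris"), Participant("Boston", can_commit=False)]
failed = two_phase_commit(sites)
```

Failure atomicity shows in the second run: one "no" vote aborts the transaction at every site, including those that voted yes.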

Potentially Improved Performance

• Proximity of data to its points of use: this requires some support for fragmentation and replication
• Parallelism in execution
• Inter‐query parallelism
• Intra‐query parallelism

Parallelism Requirements

• Have as much of the data required by each application at the site where the application executes
(Full replication)
• How about updates?
• Mutual consistency
• Freshness of copies

System Expansion

• Issue is database scaling

• Emergence of microprocessor and workstation technologies
• Demise of Grosch's law
• I believe that there is a fundamental rule, which I modestly call Grosch's law, giving added economy only as the square root of the increase in speed; that is, to do a calculation ten times as cheaply you must do it a hundred times as fast.

• Client‐server model of computing

• Data communication cost vs telecommunication cost

Distributed DBMS Issues

• Distributed Database Design

• How to distribute the database
• Replicated & non‐replicated database distribution
• A related problem in directory management
• Query Processing
• Convert user transactions to data manipulation instructions
• Optimization problem
• min{cost = data transmission + local processing}
• General formulation is NP‐hard
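
The objective min{cost = data transmission + local processing} can be illustrated by pricing a few candidate strategies and keeping the cheapest. The strategies, tuple counts, and per-tuple cost constants are invented for this sketch; a real optimizer derives them from statistics and has a far larger search space.

```python
from dataclasses import dataclass

# Hypothetical unit costs: shipping a tuple is assumed to cost ten times
# as much as processing it locally.
TRANSMISSION_COST_PER_TUPLE = 10.0
PROCESSING_COST_PER_TUPLE = 1.0


@dataclass
class Strategy:
    name: str
    tuples_shipped: int    # tuples sent over the network
    tuples_processed: int  # tuples handled by local processing


def total_cost(s: Strategy) -> float:
    """cost = data transmission + local processing"""
    return (s.tuples_shipped * TRANSMISSION_COST_PER_TUPLE
            + s.tuples_processed * PROCESSING_COST_PER_TUPLE)


# Three ways to join EMP (400 tuples at one site) with ASG (1000 tuples at another).
strategies = [
    Strategy("ship EMP to the ASG site", 400, 1400),
    Strategy("ship ASG to the EMP site", 1000, 1400),
    Strategy("ship both to the query site", 1400, 1400),
]

best = min(strategies, key=total_cost)
```

With only three strategies an exhaustive comparison is trivial. The number of candidate strategies grows quickly with the number of relations and sites, which is why practical optimizers use heuristics.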

Distributed DBMS Issues

• Concurrency Control
• Synchronization of concurrent accesses
• Consistency and isolation of transactions
• Deadlock management
• Reliability
• How to make the system resilient to failures
• Atomicity and durability
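
For the deadlock management item above, one common approach is to look for a cycle in a wait-for graph, where an edge T1 → T2 means T1 is waiting for a lock held by T2. The sketch assumes the global graph has already been assembled, for example by merging each site's local graph. The function name and the transaction labels are hypothetical.

```python
def find_deadlock(wait_for):
    """Return the transactions on one wait-for cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    nodes = set(wait_for) | {t for ts in wait_for.values() for t in ts}
    colour = {t: WHITE for t in nodes}
    path = []

    def visit(t):
        colour[t] = GREY
        path.append(t)
        for u in wait_for.get(t, ()):
            if colour[u] == GREY:
                # u is already on the current path, so the path closes a cycle.
                return path[path.index(u):]
            if colour[u] == WHITE:
                cycle = visit(u)
                if cycle:
                    return cycle
        path.pop()
        colour[t] = BLACK
        return None

    for t in sorted(nodes):
        if colour[t] == WHITE:
            cycle = visit(t)
            if cycle:
                return cycle
    return None


deadlocked = find_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]})
clean = find_deadlock({"T1": ["T2"], "T2": ["T3"]})
```

Once a cycle is found, the system picks a victim transaction on it and aborts that transaction to break the deadlock.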

Related Issues

• Operating System Support

• Operating system with proper support for database operations
• Dichotomy between general purpose processing requirements and database processing
requirements
• Open Systems and Interoperability
• Distributed Multidatabase Systems
• More probable scenario
• Parallel issues

Architecture

• Defines the structure of the system

• components identified
• functions of each component defined
• interrelationships and interactions between components defined

ANSI/SPARC Architecture

[Figure: the three ANSI/SPARC levels, from the users down]
• Users
• External schema: external views
• Conceptual schema: conceptual view
• Internal schema: internal view
Database Server

Distributed Database Servers
