Sodapdf Merged

The document outlines the curriculum for the Big Data & Data Engineering subject at Rai Technology University, focusing on data storage concepts, distributed file systems, batch processing, SQL queries, and database management systems. It includes detailed sections on various types of storage devices, their security, and the principles of distributed file systems. The reference book for this module is 'Learning Spark: Lightning-Fast Big Data Analysis'.

Uploaded by

18yashwanthyadav
University: Rai Technology University, Bangalore

Program: B.Tech CS/IT AIML

Semester: 2nd Sem

Subject Name: Big Data & Data Engineering

Subject Code: 0CS323T

Module-3: Basics of Big Data Storage and Processing

Module Content: Introduction to data storage concepts, Basics of distributed file systems, Basics of
batch processing, Introduction to basic SQL queries (SELECT, INSERT, UPDATE, DELETE), Introduction
to database management systems (DBMS).

Reference Book for Module-3:

1. "Learning Spark: Lightning-Fast Big Data Analysis" by Holden Karau, Andy Konwinski, Patrick
Wendell, and Matei Zaharia, O'Reilly Media
Contents
1. Introduction to data storage concepts
  1.1 Data storage devices
    1.1.1 What's the difference between NAS and SAN?
  1.2 Types of storage devices and systems
  1.3 Forms of data storage
  1.4 Data storage security
2. Basics of distributed file systems
  2.1 Components of DFS
  2.2 Design Principles of Distributed File System
3. Basics of batch processing
  3.1 Key Concepts
  3.2 Benefits
  3.3 Use Cases
4. Introduction to basic SQL queries
  4.1 What is SQL?
  4.2 Components of a SQL System
  4.3 SQL Data Types
  4.4 SQL Queries
    4.4.1 Create Query
    4.4.2 Insert Query
    4.4.3 Select Query
    4.4.4 Primary key
    4.4.5 FOREIGN KEY
    4.4.6 SQL CHECK
    4.4.7 Where clause
    4.4.8 SQL Update
    4.4.9 Delete query
    4.4.10 Sum, Average, multiply, count in SQL
5. Introduction to DBMS
  5.1 Types of DBMS
  5.2 Advantages of DBMS
  5.3 Disadvantages of DBMS
  5.4 Difference between File System and DBMS
  5.5 ACID Properties in DBMS
  5.6 Data Integrity
  5.7 Data Consistency
  5.8 Example of DBMS in Various Fields
    5.8.1 Banking System
    5.8.2 College ERP System (Enterprise Resource Planning)
    5.8.3 Social Media Platforms
  5.9 DATABASE MODELS
    5.9.1 Relational Model
    5.9.2 ER (Entity-Relationship) Model
    5.9.3 Relational Vs ER model
Assignment
1. Introduction to data storage concepts

Data storage refers to magnetic, optical or mechanical media that record and preserve digital information for ongoing or future operations. There are two types of digital information: input and output data. Users provide the input data, and computers provide the output data. However, a computer's CPU can't compute anything or produce output data without the user's input. Today, organizations and users require data storage to meet high-level computational needs for big data analytics, artificial intelligence (AI), machine learning (ML) and the Internet of Things (IoT). The other side of requiring vast data storage is protecting against data loss due to disaster, failure or fraud. So, to avoid data loss, organizations can also employ data storage as a backup and restore solution.

1.1 Data storage devices


To store data, regardless of form, users need storage devices. Data storage devices come in two main categories: direct area storage and network-based storage.

Direct area storage, also known as direct-attached storage (DAS), is as the name implies. This storage is often in the immediate area and directly connected to the computing machine accessing it. Often, it's the only machine connected to it. DAS can also provide decent local backup services, but sharing is limited. DAS devices include diskettes, optical discs such as compact discs (CDs) and digital video discs (DVDs), hard disk drives (HDD), flash drives and solid-state drives (SSD).

Network-based storage allows multiple computers to access it through a network, making it better for data sharing and collaboration. Its off-site storage capability is also better suited for backups and data protection. Two standard network-based storage setups are network-attached storage (NAS) and storage area network (SAN).

NAS is often a single device made up of redundant storage containers or a redundant array of independent disks (RAID). SAN storage can be a network of multiple devices of various types, including SSD and flash storage, hybrid storage, hybrid cloud storage, cloud storage and backup software and appliances.

1.1.1 What's the difference between NAS and SAN?


Here's how NAS and SAN differ:

NAS                              SAN
Single storage device or RAID    Network of multiple devices
File storage system              Block storage system
TCP/IP Ethernet network          Fibre Channel network
Limited users                    Optimized for multiple users
Limited speed                    Faster performance
Limited expansion options        Highly expandable
Lower cost and easy setup        Higher cost and complex setup

1.2 Types of storage devices and systems
SSD and flash storage
Flash storage is a solid-state technology that uses flash memory chips to write and store data. A solid-state drive (SSD) stores data by using flash memory. Compared to hard disk drives (HDDs), a solid-state system has no moving parts and therefore less latency, so fewer SSDs are needed. Because most modern SSDs are flash-based, flash storage is synonymous with a solid-state system.
Hybrid storage
SSDs and flash offer higher throughput than HDDs, but all-flash arrays can be more expensive. Many organizations adopt a hybrid approach, mixing the speed of flash with the storage capacity of hard disk drives. A balanced storage infrastructure enables companies to apply specific technology to meet different storage needs. Hybrid storage offers an economical way to transition from traditional HDDs without going entirely to flash.
Cloud storage
Cloud storage delivers a cost-effective, scalable alternative to storing files on on-premises hard disks or storage networks. Cloud service providers (CSPs) such as Google Cloud, Microsoft Azure, IBM Cloud® and Amazon Web Services (AWS) allow you to save data and files in an off-site location that you can access through the public internet or a dedicated private network connection. The provider hosts, secures, manages and maintains the servers and associated infrastructure and ensures you can access the data whenever needed.
Hybrid cloud storage
Hybrid cloud storage combines private and public cloud elements. With hybrid cloud storage, organizations can choose which cloud to store data in. For instance, highly regulated data subject to strict archiving and replication requirements is more suited to a private cloud environment, while less sensitive data can be stored in the public cloud. Some organizations use hybrid clouds to supplement their internal storage networks with public cloud storage.
Storage backup software and appliances
Backup storage and appliances protect against data loss from disaster, failure or fraud. They make periodic data and application copies to a separate, secondary device and then use those copies for disaster recovery. Backup appliances range from HDDs and SSDs to tape drives and servers.

Cloud service providers (CSPs) also offer backup storage as a service, called backup-as-a-service (BaaS). Like most as-a-service solutions, BaaS provides a low-cost option to protect data, saving it in a remote location with scalability.

1.3 Forms of data storage


Data can be recorded and stored in three primary forms: file storage, block storage and object
storage.
File storage

File storage, or file-based storage, is a hierarchical storage methodology used to organize and store
data. In other words, data is stored in files, which are organized in folders, which are organized under
a hierarchy of directories and subdirectories.

Block storage

Block storage, sometimes called block-level storage, is a technology for storing data in blocks. The blocks are then stored as separate pieces, each with a unique identifier. Developers favor block storage for computing situations that require fast, efficient and reliable data transfer.

Object storage

Object storage, often called object-based storage, is a data storage architecture for handling large amounts of unstructured data. This data doesn't conform to, or can't be organized easily into, a traditional relational database with rows and columns. Examples include email, videos, photos, web pages, audio files, sensor data and other media and web content (textual or nontextual). Other use cases include building cloud-native applications or transforming legacy applications into next-generation cloud applications by using cloud-based object storage as a persistent data store.
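
The block storage idea described above can be sketched in a few lines of Python. This is only an illustration, not how any real storage product is implemented: the 4-byte block size, the in-memory dictionary standing in for a device, and the function names are all assumptions made for the example.

```python
import uuid

BLOCK_SIZE = 4  # bytes per block; tiny on purpose so the example is readable


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks, each stored under a unique identifier."""
    store = {}   # block id -> block bytes (stands in for the storage device)
    order = []   # block ids in the order needed to rebuild the data
    for start in range(0, len(data), block_size):
        block_id = uuid.uuid4().hex
        store[block_id] = data[start:start + block_size]
        order.append(block_id)
    return store, order


def reassemble(store, order) -> bytes:
    """Rebuild the original data by reading the blocks back in order."""
    return b"".join(store[block_id] for block_id in order)


store, order = split_into_blocks(b"big data storage")
restored = reassemble(store, order)
```

Because each block is addressed by its identifier rather than by a file path, the blocks can be placed wherever it is most efficient and still be reassembled in the right order.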

1.4 Data storage security


Data storage security protects data on-premises and in cloud-based environments against data breaches, cyberattacks and other security threats.

Data breaches are costly and present an ongoing risk for enterprise businesses. According to the IBM Cost of a Data Breach Report 2023, the global average data breach cost in that year was USD 4.45 million, a 15% increase over three years. The report also revealed that the average savings for organizations that use security AI and automation extensively is USD 1.76 million when compared to organizations that don't.

Enterprises deploy data security measures to enhance visibility into data storage. Storage security hardware and software features include special permissions, encryption, data masking and redaction of sensitive files. The latest security storage software solutions also help to automate reporting to streamline audits and adhere to regulatory requirements.

Moreover, cyber resilience (an organization's ability to prevent, withstand and recover from cybersecurity incidents) has become an integral part of data storage security. Cyber resilience takes data security to a new level by combining business continuity and disaster recovery (BCDR), information systems security and organizational resilience to help organizations ward off threats and safeguard their data.

Today, industries that need to preserve records and maintain data integrity (for example, healthcare, government) can opt for immutable storage, which protects stored data by preventing any changes or alterations for a set or indefinite amount of time. These file systems allow stored data to be accessed repeatedly once created, but not modified, and can help protect data from tampering, cyberattacks and ransomware.
2. Basics of distributed file systems
A Distributed File System (DFS) is a file system that is distributed across multiple file servers or multiple locations. It allows programs to access and store remote files as they do local ones, so users can reach their files from any computer on the network. This section covers the components and design principles of a DFS.

2.1 Components of DFS
A DFS has two components:

• Location Transparency: achieved through the namespace component.

• Redundancy: achieved through the file replication component.

In the case of failure or heavy load, these components together improve data availability by allowing data shared in different locations to be logically grouped under one folder, known as the "DFS root". The two components do not have to be used together: the namespace component can be used without the file replication component, and file replication between servers can be used without the namespace component.
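
As a rough illustration of how the two components cooperate, the sketch below maps logical folders under a DFS root to one or more physical replicas and falls back to another replica when a server is down. The server names, paths and the `resolve` function are hypothetical and exist only for this example; a real DFS handles this inside the file system client.

```python
# Hypothetical server and share names, used only for illustration.
namespace = {
    "/dfsroot/reports": ["serverA:/export/reports", "serverB:/export/reports"],
    "/dfsroot/logs": ["serverC:/export/logs"],
}


def resolve(logical_path, available_servers):
    """Return the first replica of a logical folder hosted on a reachable server."""
    for target in namespace[logical_path]:
        server = target.split(":", 1)[0]
        if server in available_servers:
            return target
    raise LookupError(f"no reachable replica for {logical_path}")


# With every server up, the first replica is used.
primary = resolve("/dfsroot/reports", {"serverA", "serverB", "serverC"})
# If serverA fails, the same logical path resolves to the replica on serverB.
fallback = resolve("/dfsroot/reports", {"serverB", "serverC"})
```

The caller only ever sees the logical path (location transparency), while the list of replicas behind it provides the redundancy.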

2.2 Design Principles of Distributed File System

1. Scalability: The system must handle increasing amounts of data and users efficiently without degradation in performance.
Example:
Hadoop Distributed File System (HDFS): HDFS is designed to scale out by adding more DataNodes to the cluster. Each DataNode stores data blocks, and the system can handle petabytes of data across thousands of nodes. When more storage or processing power is needed, new nodes can be added to the cluster without significant downtime or performance degradation.

2. Consistency: Ensuring that all users see the same data at the same time. This can be achieved through different consistency models.
Example:
Google File System (GFS): GFS provides a relaxed consistency model to achieve high availability and performance. It allows concurrent mutations and uses version numbers and timestamps to maintain consistency. Changes are made in a primary replica and then propagated to secondary replicas, ensuring eventual consistency.

3. Availability: Ensuring that the system is operational and accessible even during failures.
Example:
Amazon S3: Amazon S3 achieves high availability by replicating data across multiple Availability Zones (AZs). If one AZ fails, data is still accessible from another, ensuring minimal downtime and high availability.

4. Performance: Optimizing the system for speed and efficiency in data access.
Example:
Ceph: Ceph is designed to provide high performance by using techniques such as object storage, which allows for efficient, parallel data access. It uses a dynamic distributed hashing algorithm called CRUSH (Controlled Replication Under Scalable Hashing) to distribute data evenly across storage nodes, optimizing data retrieval times.

5. Security: Protecting data from unauthorized access and ensuring data integrity.
Example:
Azure Blob Storage: Azure Blob Storage offers comprehensive security features, including role-based access control (RBAC), encryption of data at rest and in transit, and integration with Azure Active Directory for authentication. This ensures that only authorized users can access or modify the data.

6. Data Management: Efficiently distributing, replicating, and caching data to ensure optimal performance and reliability.
Example:
Cassandra: Apache Cassandra is a distributed NoSQL database that uses consistent hashing to distribute data evenly across all nodes in the cluster. It also provides tunable consistency levels and replication strategies to ensure data is available and performant even during node failures.

7. Metadata Management: Efficient management of metadata, which is crucial for tracking the location, size, and permissions of files.
Example:
HDFS: HDFS uses a centralized NameNode to manage metadata. The NameNode stores information about the file system namespace and the locations of data blocks. While this centralization simplifies management, it also requires robust fault tolerance mechanisms to ensure the NameNode is always available.

8. File Access and Operations: Providing efficient and secure methods for file operations, including reading, writing, and modifying files.
Example:
Google File System (GFS): GFS divides files into chunks. Clients obtain the metadata from the master server and then access the chunks directly on the chunk servers. Keeping the master out of the data path allows for efficient data access and modification.

9. Fault Tolerance and Recovery: Ensuring the system can detect, handle, and recover from failures without data loss or significant downtime.
Example:
HDFS: HDFS is designed for fault tolerance with data replication. Each data block is replicated across multiple DataNodes. If a DataNode fails, the system automatically re-replicates the blocks from the remaining replicas to ensure data integrity and availability.
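
The re-replication behaviour described for HDFS in item 9 can be sketched as follows. This is a simplified model, not HDFS code: the replication factor of 3, the node and block names, and the `re_replicate` function are assumptions chosen for the example, and the placement rule (pick the alphabetically first free node) is far cruder than a real placement policy.

```python
REPLICATION_FACTOR = 3  # assumed target number of copies per block


def re_replicate(block_locations, live_nodes, replication_factor=REPLICATION_FACTOR):
    """Drop replicas on dead nodes, then copy blocks to live nodes until each
    block is back at the target replication factor (when enough nodes exist)."""
    repaired = {}
    for block, nodes in block_locations.items():
        holders = [n for n in nodes if n in live_nodes]
        if not holders:
            raise RuntimeError(f"all replicas of {block} are lost")
        candidates = sorted(live_nodes - set(holders))
        while len(holders) < replication_factor and candidates:
            holders.append(candidates.pop(0))
        repaired[block] = holders
    return repaired


# Hypothetical cluster: four DataNodes, two blocks, three replicas each.
locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}
# dn2 fails; the remaining nodes must take over its replicas.
after_failure = re_replicate(locations, live_nodes={"dn1", "dn3", "dn4"})
```

After the failure of `dn2`, both blocks are again held by three live nodes, which is the property the fault tolerance principle asks for.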
3. Basics of batch processing
Batch processing is a computational technique where a group of tasks or data is collected and processed together in a single operation, often without real-time interaction. This method is particularly useful for handling large volumes of data and can be automated, making it ideal for tasks like payroll processing, end-of-month reconciliation, or generating reports.

3.1 Key Concepts:
• Grouping of Tasks: Batch processing involves grouping multiple tasks or transactions together and processing them as a single unit, rather than individually.

• Automated Execution: Once the batch is initiated, it runs autonomously without user intervention until it completes or encounters an error.

• Efficiency for Large Data: This approach is well-suited for processing large datasets or tasks that require significant computing power and time, such as data analysis or ETL (Extract, Transform, Load) operations.

• Scheduled Execution: Batch processing is often scheduled to run during off-peak hours or at specific intervals to optimize resource utilization.
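
The grouping and automated execution ideas above can be shown with a minimal Python sketch. The function names, the batch size and the sample hours-worked figures are hypothetical and chosen only to keep the example short; scheduling (for example with a job scheduler) is left out.

```python
def make_batches(records, batch_size):
    """Group records into consecutive batches of at most batch_size items."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def run_batch_job(records, batch_size, process):
    """Process every batch without user interaction; an exception raised by
    process stops the job at the failing batch."""
    results = []
    for batch in make_batches(records, batch_size):
        results.append(process(batch))
    return results


# Hypothetical payroll-style input: hours worked per employee.
hours_worked = [160, 152, 168, 140, 176]
batch_totals = run_batch_job(hours_worked, batch_size=2, process=sum)
```

Each batch is handled as one unit, so the per-record overhead of starting a task is paid once per batch rather than once per record.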

3.2 Benefits:
• Cost-Effective: By processing multiple tasks at once, batch processing can reduce the overall processing time and cost.

• Resource Optimization: Jobs can be run when computing resources are available, so those resources are used efficiently.

• Data Consistency: Processing in batches applies the same rules to every record in a run, which helps keep data consistent and reduces the risk of errors.

• Scalability: Batch processing systems can be designed to handle large volumes of data and scale to meet growing needs.

3.3 Use Cases:
• Payroll Processing: Calculating and distributing employee salaries at the end of a pay period.

• End-of-Month Reconciliation: Processing and reconciling financial transactions at the end of each month.

• Data Analytics: Performing complex analytical computations on large datasets.

• ETL Operations: Extracting, transforming, and loading data from multiple sources.

• System Backups: Consolidating and storing data in bulk at regular intervals.


4. Introduction to basic SQL queries
4.1 What is SQL?
SQL (Structured Query Language) is the language used to manage and interact with data stored in relational databases. It allows users to perform tasks like querying data, creating and modifying database structures, and managing access permissions. SQL statements are used to communicate with the database, and these statements are executed by the database management system to perform the desired operations. SQL is a fundamental skill for anyone working with relational databases, including database administrators, developers, and data analysts. Examples of SQL databases: MySQL, PostgreSQL, Oracle, SQL Server.

4.2 Components of a SQL System

• Databases are structured collections of data organized into tables, rows, and columns.
• Tables are the fundamental building blocks of a database, consisting of rows (records) and columns (attributes or fields).
• Queries are SQL commands used to interact with databases.
• Constraints are rules applied to tables to maintain data integrity.

4.3 SQL Data Types


• Numeric Data Types

• Character and String Data Types

• Date and Time Data Types

• Binary Data Types

• Boolean Data Types

• Special Data Types

4.4 SQL Queries
SQL commands are the fundamental building blocks for communicating with a database management system (DBMS). They are used to interact with the database and to perform specific tasks, functions, and queries on data, such as creating a table, adding data to tables, dropping a table, modifying a table, and setting permissions for users.

SQL commands are mainly categorized into five categories:

• DDL – Data Definition Language

• DQL – Data Query Language

• DML – Data Manipulation Language

• DCL – Data Control Language

• TCL – Transaction Control Language


4.4.1 Create Query
To create a database, use the following syntax:

CREATE DATABASE databasename;

Example:
CREATE DATABASE Person;

To create a table, use the following syntax:

CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
    ....
);

Example:
CREATE TABLE Persons (
    PersonID int,
    LastName varchar(255),
    FirstName varchar(255),
    Address varchar(255),
    City varchar(255)
);
4.4.2 Insert Query
To insert data into a table, use the following syntax:

INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);

Example:
INSERT INTO Persons (PersonID, LastName, FirstName, Address, City)
VALUES (1, 'Sharma', 'Amit', '123 Main St', 'Delhi');
INSERT INTO Persons (PersonID, LastName, FirstName, Address, City)
VALUES (2, 'Khan', 'Fatima', '456 Oak St', 'Mumbai');
INSERT INTO Persons (PersonID, LastName, FirstName, Address, City)
VALUES (3, 'Singh', 'Rajesh', '789 Pine St', 'Chennai');

4.4.3 Select Query
To display data, use the following syntax:

SELECT column1, column2, ...
FROM table_name;

An asterisk (*) indicates all fields/columns:

SELECT * FROM Persons;
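
The CREATE, INSERT and SELECT statements above can be tried end to end without installing a database server. The sketch below is one way to do that, assuming Python's built-in `sqlite3` module and an in-memory database; SQLite accepts the same statements, although its handling of types such as `varchar(255)` is looser than in MySQL or SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

conn.execute("""
    CREATE TABLE Persons (
        PersonID int,
        LastName varchar(255),
        FirstName varchar(255),
        Address varchar(255),
        City varchar(255)
    )
""")

rows = [
    (1, "Sharma", "Amit", "123 Main St", "Delhi"),
    (2, "Khan", "Fatima", "456 Oak St", "Mumbai"),
    (3, "Singh", "Rajesh", "789 Pine St", "Chennai"),
]
# Placeholders (?) let the driver pass values safely instead of string formatting.
conn.executemany(
    "INSERT INTO Persons (PersonID, LastName, FirstName, Address, City) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

# SELECT * returns every column; naming columns returns only those columns.
all_rows = conn.execute("SELECT * FROM Persons").fetchall()
names = conn.execute(
    "SELECT FirstName, City FROM Persons ORDER BY PersonID"
).fetchall()
```

`all_rows` holds full five-column tuples, while `names` holds only the two requested columns for each person.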

4.4.4 Primary key
In SQL, a primary key is a column or a set of columns that uniquely identifies each row in a table. It ensures that no two rows have the same primary key value, and it cannot contain NULL values. A table can have only one primary key.

CREATE TABLE Persons (
    PersonID int NOT NULL PRIMARY KEY,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int
);
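
To see the uniqueness rule in action, the following sketch inserts two rows with the same `PersonID` and shows that the second one is rejected. It assumes Python's built-in `sqlite3` module; the names and ages in the rows are made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Persons (
        PersonID int NOT NULL PRIMARY KEY,
        LastName varchar(255) NOT NULL,
        FirstName varchar(255),
        Age int
    )
""")
conn.execute("INSERT INTO Persons VALUES (1, 'Sharma', 'Amit', 30)")

try:
    # Same PersonID as the existing row, so the primary key rejects it.
    conn.execute("INSERT INTO Persons VALUES (1, 'Khan', 'Fatima', 28)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Persons").fetchone()[0]
```

The table still contains only the first row afterwards, because the database refuses the duplicate key instead of overwriting existing data.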

4.4.5 FOREIGN KEY


In SQL, a foreign key is a constraint that establishes a link between two tables, ensuring data integrity by preventing actions that would violate relationships between tables. It's a column or a set of columns in one table that references the primary key of another table. Essentially, a foreign key in the "child" table refers to the primary key in the "parent" table, maintaining the integrity of the data and relationships between tables.

CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    …,
    CONSTRAINT constraint_name FOREIGN KEY (column1, column2, …)
    REFERENCES parent_table(column1, column2, …)
);

Example:

CREATE TABLE Orders (
    OrderID int NOT NULL,
    OrderNumber int NOT NULL,
    PersonID int,
    PRIMARY KEY (OrderID),
    FOREIGN KEY (PersonID) REFERENCES Persons(PersonID)
);
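
The parent and child relationship can be checked with a short experiment. The sketch below assumes Python's built-in `sqlite3` module; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, which is specific to SQLite and not needed in most other systems. The order numbers and the reduced `Persons` table are sample values for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute(
    "CREATE TABLE Persons (PersonID int NOT NULL PRIMARY KEY, LastName varchar(255))"
)
conn.execute("""
    CREATE TABLE Orders (
        OrderID int NOT NULL,
        OrderNumber int NOT NULL,
        PersonID int,
        PRIMARY KEY (OrderID),
        FOREIGN KEY (PersonID) REFERENCES Persons(PersonID)
    )
""")
conn.execute("INSERT INTO Persons VALUES (1, 'Sharma')")
conn.execute("INSERT INTO Orders VALUES (10, 5001, 1)")  # parent row exists

try:
    # No Persons row has PersonID 99, so this order would be an orphan.
    conn.execute("INSERT INTO Orders VALUES (11, 5002, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
```

Only the order that points at an existing person is stored, which is the integrity guarantee the foreign key provides.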

4.4.6 SQL CHECK


The CHECK constraint enforces domain integrity by limiting the values that can be inserted or updated in a column. By using CHECK, we can define conditions on a column's values and ensure that they adhere to specific rules.

CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int,
    CHECK (Age>=18)
);
Explanation:
NOT NULL: The field must contain a value (can't be left empty).

CHECK (Age >= 18): Ensures that any value entered into the Age column is 18 or greater. If a value below 18 is inserted, the database will reject it.
CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int,
    City varchar(255),
    CONSTRAINT CHK_Person CHECK (Age>=18 AND City='Sandnes')
);
Explanation:
ID int NOT NULL: Each person must have a non-null ID. NOT NULL alone does not make the ID unique; uniqueness requires a PRIMARY KEY or UNIQUE constraint.
The named constraint CHK_Person enforces a rule:
• The Age must be at least 18,
• AND the City must be exactly 'Sandnes'.
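
A quick way to observe a CHECK constraint rejecting data is sketched below, assuming Python's built-in `sqlite3` module. The helper `try_insert` and the sample rows are hypothetical and only serve the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Persons (
        ID int NOT NULL,
        LastName varchar(255) NOT NULL,
        FirstName varchar(255),
        Age int,
        CHECK (Age >= 18)
    )
""")


def try_insert(person_id, last_name, first_name, age):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO Persons VALUES (?, ?, ?, ?)",
            (person_id, last_name, first_name, age),
        )
        return True
    except sqlite3.IntegrityError:
        return False


adult_ok = try_insert(1, "Sharma", "Amit", 30)   # satisfies Age >= 18
minor_ok = try_insert(2, "Khan", "Fatima", 16)   # violates Age >= 18
```

The first insert succeeds and the second is refused, so invalid ages never reach the table regardless of which application writes to it.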

4.4.7 Where clause
The WHERE clause in SQL is used to filter records based on a specified condition, extracting only those that meet the criteria. It's a fundamental part of SQL queries, enabling you to retrieve specific subsets of data from a database table. The WHERE clause can be used in SELECT, UPDATE, and DELETE statements.

SELECT column_name(s)
FROM table_name
WHERE condition;
Example:
SELECT * FROM Customers WHERE City = 'Aralumallige';

4.4.8 SQL Update
The UPDATE statement is used to modify the existing records in a table.

UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
Example:
UPDATE Customers
SET ContactName = 'Alfred Schmidt', City= 'Frankfurt'
WHERE CustomerID = 1;

4.4.9 Delete query
The SQL DELETE query is used to remove records from a table. Here's the basic syntax:

DELETE FROM table_name
WHERE condition;

Example:
DELETE FROM Persons
WHERE ID = 3;

Delete all table data (omitting the WHERE clause removes every row but keeps the table):

DELETE FROM Persons;

Permanently delete a table or a database:

DROP TABLE table_name;

Example:
DROP TABLE Persons;

DROP DATABASE database_name;

Example:
DROP DATABASE StudentDB;
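
The effect of the WHERE clause on UPDATE and DELETE is easiest to see by counting affected rows. The sketch below assumes Python's built-in `sqlite3` module; the simplified three-column `Persons` table and the replacement city 'Pune' are sample values chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Persons (ID int, LastName varchar(255), City varchar(255))")
conn.executemany("INSERT INTO Persons VALUES (?, ?, ?)", [
    (1, "Sharma", "Delhi"),
    (2, "Khan", "Mumbai"),
    (3, "Singh", "Chennai"),
])

# UPDATE with WHERE changes only the matching row.
updated = conn.execute("UPDATE Persons SET City = 'Pune' WHERE ID = 2").rowcount
# DELETE with WHERE removes only the matching row.
deleted = conn.execute("DELETE FROM Persons WHERE ID = 3").rowcount
remaining = conn.execute("SELECT ID, City FROM Persons ORDER BY ID").fetchall()

# DELETE without WHERE empties the table but keeps its definition.
conn.execute("DELETE FROM Persons")
after_delete_all = conn.execute("SELECT COUNT(*) FROM Persons").fetchone()[0]
```

Forgetting the WHERE clause in UPDATE or DELETE applies the statement to every row, which is why it is worth checking the condition with a SELECT first.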

4.4.10 Sum, Average, multiply, count in SQL

emp_id  name          age  department
101     Alice Sharma  30   HR
102     Raj Patel     24   IT
103     Neha Verma    28   Finance
104     Arjun Mehta   35   IT
105     Kavita Joshi  26   HR
106     Vikram Singh  29   Marketing

The queries below assume that the employees table also has a numeric salary column, which is not shown in the sample rows above.

Sum query:  SELECT SUM(salary) AS total_salary FROM employees;
AVG query:  SELECT AVG(salary) AS average_salary FROM employees;
Min query:  SELECT MIN(salary) AS lowest_salary FROM employees;
Max query:  SELECT MAX(salary) AS highest_salary FROM employees;
Count:      SELECT COUNT(*) AS total_employees FROM employees;
Group by:   SELECT department, SUM(salary) AS dept_total_salary
            FROM employees
            GROUP BY department;
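
Since the sample rows do not include salaries, the sketch below applies the same aggregate functions to the age column instead, so that every number comes from the table in the text. It assumes Python's built-in `sqlite3` module; swapping `age` for `salary` would give the queries listed above once that column exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(emp_id int, name varchar(255), age int, department varchar(255))"
)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", [
    (101, "Alice Sharma", 30, "HR"),
    (102, "Raj Patel", 24, "IT"),
    (103, "Neha Verma", 28, "Finance"),
    (104, "Arjun Mehta", 35, "IT"),
    (105, "Kavita Joshi", 26, "HR"),
    (106, "Vikram Singh", 29, "Marketing"),
])

# Aggregates over the whole table return a single row.
total_age, avg_age, youngest, oldest, headcount = conn.execute(
    "SELECT SUM(age), AVG(age), MIN(age), MAX(age), COUNT(*) FROM employees"
).fetchone()

# GROUP BY returns one row per department.
per_department = conn.execute(
    "SELECT department, COUNT(*), SUM(age) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()
```

Without GROUP BY the aggregates collapse the table into one row; with GROUP BY they are computed separately for each department.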
5. Introduction to DBMS
A Database Management System (DBMS) is a software solution designed to efficiently manage, organize, and retrieve data in a structured manner. It serves as a critical component in modern computing, enabling organizations to store, manipulate, and secure their data effectively. From small applications to enterprise systems, DBMS plays a vital role in supporting data-driven decision-making and operational efficiency.

Key Features of DBMS

1. Data Modeling: Tools to create and modify data models, defining the structure and relationships within the database.

2. Data Storage and Retrieval: Efficient mechanisms for storing data and executing queries to retrieve it quickly.

3. Concurrency Control: Ensures multiple users can access the database simultaneously without conflicts.

4. Data Integrity and Security: Enforces rules to maintain accurate and secure data, including access controls and encryption.

5. Backup and Recovery: Protects data with regular backups and enables recovery in case of system failures.

5.1 Types of DBMS
There are several types of Database Management Systems (DBMS), each tailored to different data structures, scalability requirements, and application needs. The most common types are as follows:

1. Relational Database Management System (RDBMS)

RDBMS organizes data into tables (relations) composed of rows and columns. It uses primary keys to uniquely identify rows and foreign keys to establish relationships between tables. Queries are written in SQL (Structured Query Language), which allows for efficient data manipulation and retrieval.

Examples: MySQL, Oracle, Microsoft SQL Server and PostgreSQL.

2. NoSQL DBMS

NoSQL systems are designed to handle large-scale data and provide high performance for scenarios where relational models might be restrictive. They store data in various non-relational formats, such as key-value pairs, documents, graphs, or columns. These flexible data models enable rapid scaling and are well-suited for unstructured or semi-structured data.

Examples: MongoDB, Cassandra, DynamoDB and Redis.

3. Object-Oriented DBMS (OODBMS)

OODBMS integrates object-oriented programming concepts into the database environment, allowing data to be stored as objects. This approach supports complex data types and relationships, making it ideal for applications requiring advanced data modeling and real-world simulations.

Examples: ObjectDB, db4o.


5.2 Advantages of DBMS
1. Data organization: A DBMS allows for the organization and storage of data in a structured
manner, making it easy to retrieve and query the data as needed.

2. Data integrity: A DBMS provides mechanisms for enforcing data integrity constraints, such as
constraints on the values of data and access controls that restrict who can access the data.

3. Concurrent access: A DBMS provides mechanisms for controlling concurrent access to
the database, to ensure that multiple users can access the data without conflicting with each
other.

4. Data security: A DBMS provides tools for managing the security of the data, such as controlling
access to the data and encrypting sensitive data.

5. Backup and recovery: A DBMS provides mechanisms for backing up and recovering the data
in the event of a system failure.

6. Data sharing: A DBMS allows multiple users to access and share the same data, which can be
useful in a collaborative work environment.

5.3 Disadvantages of DBMS

1. Complexity: A DBMS can be complex to set up and maintain, requiring specialized knowledge
and skills.

2. Performance overhead: The use of a DBMS can add overhead to the performance of an
application, especially in cases where high levels of concurrency are required.

3. Scalability: The use of a DBMS can limit the scalability of an application, since it relies on
locking and other synchronization mechanisms to ensure data consistency.

4. Cost: The cost of purchasing, maintaining and upgrading a DBMS can be high, especially for
large or complex systems.

5. Limited Use Cases: Not all use cases are suitable for a DBMS; some solutions do not need high
reliability, consistency or security and may be better served by other types of data storage.

5.4 Difference between File System and DBMS

Aspect | File System | DBMS
1. Data Storage | Stores data in files, manually managed | Stores data in a structured way using tables and schemas
2. Data Redundancy | High: same data may be stored in multiple files | Low: normalization helps reduce redundancy
3. Data Consistency | Difficult to maintain across files | Easier due to central control and integrity constraints
4. Data Security | Limited: file-level protection only | High: user roles, access control, authentication
5. Data Access | Manual and sequential (using file handlers) | Query-based (using SQL), faster and flexible
6. Data Integrity | Hard to enforce rules | Easy with constraints (e.g., primary key, foreign key)
7. Backup and Recovery | Manual and error-prone | Automated tools and recovery mechanisms available
8. Concurrency Control | No support: leads to conflicts | In-built support for concurrent access
9. Scalability | Not scalable for large data volumes | Designed to handle large-scale applications
10. Examples | Text files, Excel files | MySQL, Oracle, PostgreSQL, MongoDB

5.5 ACID Properties in DBMS

ACID properties (Atomicity, Consistency, Isolation, Durability) are crucial for ensuring reliable
and consistent database transactions. They guarantee that operations are either fully completed or
rolled back to their original state, maintaining data integrity and preventing inconsistencies, even in
the face of failures.

Property | Meaning
A - Atomicity | A transaction is all or nothing. Either everything executes, or nothing does.
C - Consistency | Ensures that the database is in a valid state before and after the transaction.
I - Isolation | Transactions do not interfere with each other. They execute as if they were independent.
D - Durability | Once a transaction is committed, the changes are permanent, even if there is a crash.

Quick Real-Life Example: Bank Transfer

• Scenario: Transfer ₹1,000 from A's account to B's account

Step | Operation
1 | Debit ₹1,000 from A's account
2 | Credit ₹1,000 to B's account

Property | In the Bank Example
Atomicity | If the debit happens but the credit fails, the entire transaction is rolled back.
Consistency | Total money before and after the transfer remains the same.
Isolation | If two users are transferring money at the same time, they won't affect each other.
Durability | Once the transfer is successful, it remains recorded even after a power failure.
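
Atomicity in the bank transfer can be illustrated with a short sketch. It assumes Python's built-in sqlite3 module; the accounts table, the starting balances, and the transfer helper are hypothetical names chosen for this example. The credit step is made to fail by targeting an account that does not exist, so the earlier debit has to be rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
# Assumed starting balances for accounts A and B.
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 1500), ("B", 200)])
conn.commit()


def transfer(conn, source, target, amount):
    """Debit source and credit target as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
            credited = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            if credited.rowcount != 1:
                raise ValueError("credit failed: unknown target account")
        return True
    except (ValueError, sqlite3.IntegrityError):
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "A", "B", 1000)      # both steps succeed
failed = transfer(conn, "A", "Z", 100)   # credit fails, so the debit is undone
print(ok, failed, balances(conn))
```

After the failed transfer, A's balance is unchanged and the total across both accounts is the same as before, which is the atomicity and consistency behaviour described in the table.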

5.6 Data Integrity
• Definition: Data integrity ensures that the data is accurate, complete, and valid according to
rules and constraints.
• Example:
  • Alice's email address is stored as alice@example.com.
  • The system requires that the email field must not be empty and must follow a proper
  email format.
  • If someone tries to insert aliceexample.com (missing @), the system rejects it due to
  a data integrity constraint.
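
One way such a rule could be enforced is with NOT NULL and CHECK constraints. The sketch below uses Python's built-in sqlite3 module; the users table, the try_insert helper, the address alice@example.com, and the simple LIKE pattern are assumptions for illustration, and the pattern is only a rough format check rather than full email validation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    " name TEXT NOT NULL,"
    # Rough format rule: something@something.something
    " email TEXT NOT NULL CHECK (email LIKE '_%@_%._%')"
    ")"
)


def try_insert(name, email):
    """Return True if the row satisfies the constraints, False if rejected."""
    try:
        with conn:
            conn.execute("INSERT INTO users VALUES (?, ?)", (name, email))
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("Alice", "alice@example.com")
rejected_format = try_insert("Alice", "aliceexample.com")  # missing @
rejected_empty = try_insert("Bob", None)                   # empty email
print(accepted, rejected_format, rejected_empty)
```

The database itself refuses the invalid rows, so the rule holds no matter which application performs the insert.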

5.7 Data Consistency

• Definition: Data consistency ensures that data agrees across all parts of the system
after a transaction or operation.
• Example:
  • Alice places an order and pays ₹10,000.
  • The system should update:
    • The inventory (reduce stock by 1),
    • The user's order history (add the new order),
    • The payment system (mark as paid).
  • If only some of these updates are applied, the records disagree and the data is inconsistent.

5.8 Examples of DBMS in Various Fields

5.8.1 Banking System
Where DBMS is used:

• Storing customer information (name, account number, balance)

• Recording transactions (deposits, withdrawals, transfers)

• ATM systems, online banking portals

Why DBMS?

• ACID compliance ensures each transaction is secure and consistent.

• Avoids duplicate accounts or mismatched balances.

• Allows multiple users (bank branches) to access the same central database.

• Ensures data security with controlled access to customer records.

5.8.2 College ERP System (Enterprise Resource Planning)

Where DBMS is used:

• Managing student records (admission, marks, attendance)

• Faculty data, course scheduling

• Fee payments, results generation

Why DBMS?

• Centralized data for all departments (exam cell, academics, accounts)

• Ensures data consistency, e.g., a student's marks match their subjects

• Facilitates report generation for result cards, attendance reports

• Supports role-based access (admin, student, faculty)


5.8.3 Social Media Platforms
Where DBMS is used:

• Storing user profiles, posts, comments, likes, followers

• Managing messaging, notifications, media storage

Why DBMS?

• Scalability: handles millions of users and posts every day

• Supports real-time data (like updates, comments, live feeds)

• Efficient search queries (find friends, hashtags, posts)

• Ensures privacy and data protection for user content

5.9 DATABASE MODELS

A database model defines how data is logically structured, stored, and accessed in a database
system. There are several models (e.g., hierarchical, network, relational), but in modern systems,
the Relational and ER models are most widely used.

5.9.1 Relational Model

The Relational Model organizes data into tables (also called relations), where:

• Each table consists of rows (tuples) and columns (attributes).

• Every table has a unique name.

• Each row represents a record.

• Each column represents a field or attribute.

5.9.2 ER (Entity-Relationship) Model

The ER Model is a conceptual design model used for planning and designing a database before
implementation. It uses entities, attributes, and relationships.
Component | Description | Example
Entity | Real-world object or concept | Student, Course
Attribute | Property of an entity | Student has Name, Age
Entity Set | Collection of similar entities | All Students
Relationship | Association between entities | Student enrolls in Course

Symbol | Meaning
Rectangle | Entity
Ellipse | Attribute
Diamond | Relationship
Line | Connection

Example:

5.9.3 Relational vs ER Model

Feature | Relational Model | ER Model
Purpose | Logical structure (tables) | Conceptual design (diagram)
Used in | Implementation | Design phase
Structure | Tables, Rows, Columns | Entities, Attributes, Relationships
Tools | SQL, RDBMS | ER Diagrams, Modeling tools
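
To connect the two models, the sketch below shows one common way an ER design such as "Student enrolls in Course" might be turned into relational tables: each entity becomes a table, and the relationship becomes a linking table with foreign keys. It uses Python's built-in sqlite3 module; the Enrollment table name, the column names, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE Student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER);
CREATE TABLE Course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- The "enrolls in" relationship becomes its own table.
CREATE TABLE Enrollment (
    student_id INTEGER NOT NULL REFERENCES Student(student_id),
    course_id  INTEGER NOT NULL REFERENCES Course(course_id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO Student VALUES (1, 'Asha', 19);
INSERT INTO Course VALUES (10, 'Databases');
INSERT INTO Enrollment VALUES (1, 10);
""")

# An enrollment that points at a non-existent student violates the foreign key.
try:
    conn.execute("INSERT INTO Enrollment VALUES (2, 10)")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False

rows = conn.execute(
    "SELECT s.name, c.title FROM Enrollment e "
    "JOIN Student s ON s.student_id = e.student_id "
    "JOIN Course c ON c.course_id = e.course_id"
).fetchall()
print(orphan_allowed, rows)
```

The primary keys identify each row, and the foreign keys keep every enrollment tied to an existing student and course.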


Assignment
1.
Module-4:

Module Content: Overview of data warehousing concepts, Basics of data warehousing solutions,
Introduction to basic SQL queries for data retrieval (SELECT statements), Basics of data analysis and
reporting.

Reference Book for Module-4:

1. “Building a Data Warehouse With Examples in SQL Server,


Contents
1. Overview of data warehousing concepts
1.1 Need for Data Warehousing
1.2 Components of Data Warehouse
1.3 Characteristics of Data Warehousing
1.4 Types of Data Warehouses
1.5 Example Applications of Data Warehousing
1.6 Advantages of Data Warehousing
1.7 Disadvantages of Data Warehousing
2. Basics of data warehousing solutions
2.1 Popular Data Warehousing Solutions
2.1.1 On-Premise in Data Warehousing
2.1.2 Cloud-Based Data Warehousing
2.1.3 Comparison of Popular Data Warehousing Solutions
2.1.4 On-Premise vs Cloud Summary
3. Introduction to basic SQL queries
3.1 SQL Joins
3.2 SQL INNER JOIN
3.3 SQL LEFT JOIN
3.4 SQL RIGHT JOIN
3.5 SQL FULL JOIN
4. Basics of data analysis and reporting
4.1 Data Analysis
4.1.1 Predictive Analytics
4.1.2 Descriptive Analytics
4.1.3 Diagnostic Analytics
4.1.5 Descriptive vs. Predictive vs. Prescriptive Analytics
4.1.6 Microsoft: Data warehousing and analytics
4.2 Reporting
1. Overview of data warehousing concepts
A data warehouse is a centralized system used for storing and managing large volumes of data from
various sources. It is designed to help businesses analyze historical data and make informed decisions.
Data from different operational systems is collected, cleaned, and stored in a structured way, enabling
efficient querying and reporting.

• The goal is to produce statistical results that may help in decision-making.

• Ensures fast data retrieval even with vast datasets.

1.1 Need for Data Warehousing

1. Handling Large Volumes of Data: Operational databases are usually sized for current transactional
data (often MBs to GBs), whereas a data warehouse is designed to handle much larger datasets (TBs and
beyond), allowing businesses to store and manage massive amounts of historical data.

2. Enhanced Analytics: Transactional databases are not optimized for analytical purposes. A data
warehouse is built specifically for data analysis, enabling businesses to perform complex queries and
gain insights from historical data.

3. Centralized Data Storage: A data warehouse acts as a central repository for all organizational data,
helping businesses to integrate data from multiple sources and have a unified view of their operations
for better decision-making.

4. Trend Analysis: By storing historical data, a data warehouse allows businesses to analyze trends over
time, enabling them to make strategic decisions based on past performance and predict future
outcomes.

5. Support for Business Intelligence: Data warehouses support business intelligence tools and
reporting systems, providing decision-makers with easy access to critical information, which enhances
operational efficiency and supports data-driven strategies.
1.2 Components of Data Warehouse
The main components of a data warehouse include:

• Data Sources: These are the various operational systems, databases, and external data feeds
that provide raw data to be stored in the warehouse.

• ETL (Extract, Transform, Load) Process: The ETL process is responsible for extracting data from
different sources, transforming it into a suitable format, and loading it into the data
warehouse.

• Data Warehouse Database: This is the central repository where cleaned and transformed data
is stored. It is typically organized in a multidimensional format for efficient querying and
reporting.

• Metadata: Metadata describes the structure, source, and usage of data within the warehouse,
making it easier for users and systems to understand and work with the data.

• Data Marts: These are smaller, more focused data repositories derived from the data
warehouse, designed to meet the needs of specific business departments or functions.

• OLAP (Online Analytical Processing) Tools: OLAP tools allow users to analyse data in multiple
dimensions, providing deeper insights and supporting complex analytical queries.

• End-User Access Tools: These are reporting and analysis tools, such as dashboards or Business
Intelligence (BI) tools, that enable business users to query the data warehouse and generate
reports.
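
The ETL component described above can be sketched in a few lines. This is a simplified illustration, not a production pipeline: the two source lists, the fact_sales table, and the cleaning rules are all assumptions, and Python's built-in sqlite3 module stands in for the warehouse database.

```python
import sqlite3

# Extract: two hypothetical operational sources with inconsistent formats.
sales_system = [{"customer": " asha ", "amount": "1200.50", "date": "2024-01-05"}]
web_orders = [
    {"customer": "RAVI", "amount": "300", "date": "2024-01-06"},
    {"customer": "", "amount": "n/a", "date": "2024-01-07"},  # unusable record
]


def transform(record, source):
    """Clean one raw record; return None if it cannot be repaired."""
    name = record["customer"].strip().title()
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    if not name:
        return None
    return (name, amount, record["date"], source)


def run_etl(conn):
    """Extract from both sources, transform each record, load the clean rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(customer TEXT, amount REAL, sale_date TEXT, source TEXT)"
    )
    rows = []
    for source, records in (("sales_system", sales_system), ("web_orders", web_orders)):
        for record in records:
            cleaned = transform(record, source)
            if cleaned is not None:
                rows.append(cleaned)
    with conn:  # load all cleaned rows in one transaction
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", rows)
    return len(rows)


warehouse = sqlite3.connect(":memory:")
loaded = run_etl(warehouse)
loaded_rows = warehouse.execute(
    "SELECT customer, amount FROM fact_sales ORDER BY customer"
).fetchall()
print(loaded, loaded_rows)
```

The source column plays the role of simple metadata here: it records where each warehouse row came from.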

1.3 Characteristics of Data Warehousing

Data warehousing is essential for modern data management, providing a strong foundation for
organizations to consolidate and analyze data strategically. Its distinguishing features empower
businesses with the tools to make informed decisions and extract valuable insights from their data.

• Centralized Data Repository: Data warehousing provides a centralized repository for all
enterprise data from various sources, such as transactional databases, operational systems,
and external sources. This enables organizations to have a comprehensive view of their data,
which can help in making informed business decisions.

• Data Integration: Data warehousing integrates data from different sources into a single,
unified view, which can help in eliminating data silos and reducing data inconsistencies.

• Historical Data Storage: Data warehousing stores historical data, which enables organizations
to analyze data trends over time. This can help in identifying patterns and anomalies in the
data, which can be used to improve business performance.

• Query and Analysis: Data warehousing provides powerful query and analysis capabilities that
enable users to explore and analyze data in different ways. This can help in identifying patterns
and trends, and can also help in making informed business decisions.

• Data Transformation: Data warehousing includes a process of data transformation, which
involves cleaning, filtering, and formatting data from various sources to make it consistent and
usable. This can help in improving data quality and reducing data inconsistencies.

• Data Mining: Data warehousing provides data mining capabilities, which enable organizations
to discover hidden patterns and relationships in their data. This can help in identifying new
opportunities, predicting future trends, and mitigating risks.

• Data Security: Data warehousing provides robust data security features, such as access
controls, data encryption, and data backups, which ensure that the data is secure and
protected from unauthorized access.

1.4 Types of Data Warehouses

The different types of Data Warehouses are:

1. Enterprise Data Warehouse (EDW): A centralized warehouse that stores data from across the
organization for analysis and reporting.

2. Operational Data Store (ODS): Stores real-time operational data used for day-to-day
operations, not for deep analytics.

3. Data Mart: A subset of a data warehouse, focusing on a specific business area or department.

4. Cloud Data Warehouse: A data warehouse hosted in the cloud, offering scalability and
flexibility.

5. Big Data Warehouse: Designed to store vast amounts of unstructured and structured data for
big data analysis.

6. Virtual Data Warehouse: Provides access to data from multiple sources without physically
storing it.

7. Hybrid Data Warehouse: Combines on-premises and cloud-based storage to offer flexibility.

8. Real-time Data Warehouse: Designed to handle real-time data streaming and analysis for
immediate insights.

1.5 Example Applications of Data Warehousing

Data warehousing can be applied wherever we have a huge amount of data and want to
see statistical results that help in decision making.

• Social Media Websites: Social networking websites like Facebook, Twitter, LinkedIn, etc.
are based on analyzing large data sets. These sites gather data related to members, groups,
locations, etc., and store it in a single central repository. Because the volume of data is so
large, a data warehouse is needed to implement this.

• Banking: Most banks these days use warehouses to see the spending patterns of
account holders and cardholders. They use this to provide them with special offers, deals, etc.

• Government: Governments use a data warehouse to store and analyze tax payments, which
helps detect tax theft.

1.6 Advantages of Data Warehousing

• Intelligent Decision-Making: With centralized data in warehouses, decisions may be made
more quickly and intelligently.

• Business Intelligence: Provides strong operational insights through business intelligence.

• Data Quality: Helps ensure data quality and consistency for trustworthy reporting.

• Scalability: Capable of managing massive data volumes and expanding to meet changing
requirements.

• Effective Queries: Fast and effective data retrieval is made possible by an optimized structure.

• Cost reductions: Data warehousing can result in cost savings over time by reducing data
management procedures and increasing overall efficiency, even when there are setup costs
initially.

• Data security: Data warehouses employ security protocols to safeguard confidential
information, so that only authorized personnel are granted access to certain data.

• Faster Queries: A data warehouse is designed for large analytical queries, so it typically runs
them faster than an operational database.

• Historical Insight: The warehouse stores all your historical data, which contains details about
the business, so that one can analyze it at any time and extract insights from it.

1.7 Disadvantages of Data Warehousing

• Cost: Building a data warehouse can be expensive, requiring significant investments in
hardware, software, and personnel.

• Complexity: Data warehousing can be complex, and businesses may need to hire specialized
personnel to manage the system.

• Time-consuming: Building a data warehouse can take a significant amount of time, requiring
businesses to be patient and committed to the process.

• Data integration challenges: Data from different sources can be challenging to integrate,
requiring significant effort to ensure consistency and accuracy.

• Data security: Data warehousing can pose data security risks, and businesses must take
measures to protect sensitive data from unauthorized access or breaches.
2. Basics of data warehousing solutions
Data warehousing is a system that consolidates data from various sources into a central repository for
efficient querying and analysis, supporting business intelligence activities like reporting and data
mining. It involves processes like data integration, transformation, and loading (ETL) to prepare data
for analysis.

2.1 Popular Data Warehousing Solutions

2.1.1 On-Premise in Data Warehousing
On-premise (or on-prem) means that the data warehouse infrastructure (hardware, software,
databases) is physically located within the organization's premises (e.g., university data center), and
managed entirely by the organization's IT team.

Example in a University ERP Context:

In a university ERP system:

• All student data (admissions, results, fees, attendance) is stored in servers located within the
university campus.

• The university's IT department is responsible for managing servers, databases, security,
backups, and updates.

Key Characteristics of On-Premise Data Warehousing:

Feature | Description
Ownership | University owns the hardware and software.
Control | Full control over data, security, and compliance.
Cost | High upfront cost (servers, licenses), but no recurring cloud bills.
Maintenance | Requires in-house IT team for support and upgrades.
Customization | Highly customizable to specific needs.

Benefits:

• Data Privacy: Sensitive data (e.g., student grades, ID cards, salaries) stays local.

• Custom Security Policies: Tailored to university needs.

• No Internet Dependency: Internal apps may work even without internet.

Drawbacks:

• High Initial Cost: Infrastructure is expensive to purchase and set up.

• Scalability Issues: Hard to scale quickly when data grows.

• Maintenance Overhead: Requires skilled IT staff for uptime and monitoring.

On-Premise:

• Oracle
• Microsoft SQL Server
• Teradata

2.1.2 Cloud-Based Data Warehousing

A cloud-based data warehouse stores and manages your data on remote servers provided by a
third-party cloud provider (like Amazon, Microsoft, Google, etc.) rather than on your local campus or data
center.

We access it via the internet, and the cloud provider handles infrastructure, storage, security, and
scalability.

Key Features of Cloud-Based Data Warehousing:

Feature | Description
Hosted Remotely | Data is stored in data centers of cloud vendors (AWS, Azure, GCP).
Managed Service | Vendor manages hardware, updates, and backups.
Scalability | Easily scale up/down based on usage.
Subscription Pricing | Pay for what you use (no large upfront cost).
Accessibility | Access anytime, anywhere via internet.

Benefits:

Benefit | Description
Cost-Effective | No hardware purchase, pay-as-you-go model.
Scalable | Handles growing student data and performance needs automatically.
Fast Deployment | Ready in hours or days, not weeks.
Automatic Backups | Built-in redundancy and disaster recovery.
Remote Access | Useful for remote learning/management (especially post-COVID).

Drawbacks:

Limitation | Description
Internet Dependency | Requires stable internet connection.
Less Control | Vendor manages infrastructure; limited deep customization.
Ongoing Costs | Monthly/annual fees based on usage.
Data Security Concerns | Data is stored off-campus, which may raise privacy questions.

Cloud-Based:

• Amazon Redshift

• Google BigQuery

• Snowflake

• Azure Synapse Analytics

2.1.3 Comparison of Popular Data Warehousing Solutions

Feature | Amazon Redshift | Google BigQuery | Snowflake | Azure Synapse Analytics
Deployment | Cloud | Cloud | Cloud | Cloud (Azure)
Architecture | Cluster-based | Serverless | Multi-cluster, shared data | Hybrid (MPP + serverless)
Performance | High with tuning | Very fast for ad-hoc queries | High, auto-scaling | Good for hybrid workloads
Pricing | Pay-per-node (hourly) | Pay-per-query (on-demand) | Storage + compute separated | Pay-per-use or reserved
Ease of Use | Moderate (needs tuning) | Very easy, SQL-based | Very easy, intuitive | Easy (integrates with MS tools)
Integration Tools | AWS Glue, S3, QuickSight | Dataflow, Cloud Storage, Looker | Snowpipe, BI tools | Power BI, Data Factory
Best For | Large-scale analytics | Real-time & on-demand analytics | Flexible scaling, multi-cloud | Microsoft ecosystem users

2.1.4 On-Premise vs Cloud Summary

Feature | On-Premise | Cloud
Cost | High upfront | Pay-as-you-go
Control | Full control | Shared control
Maintenance | Organization's responsibility | Cloud provider's responsibility
Scalability | Limited | Easy to scale
Setup Time | Weeks to months | Hours to days


3. Introduction to basic SQL queries
3.1 SQL Joins
An SQL JOIN clause is used to query data from multiple tables by establishing logical
relationships between them. It combines rows from different tables using common
key values shared across those tables. A JOIN can involve more than two tables, and it can be
paired with other clauses; the most common pairing is JOIN with a WHERE clause to filter the rows
retrieved.

Before diving into the specifics, here is how each join type operates:

• INNER JOIN: Returns only the rows where there is a match in both tables.

• LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table, and the matched rows from
the right table. If there's no match, NULL values are returned for columns from the right table.

• RIGHT JOIN (RIGHT OUTER JOIN): Returns all rows from the right table, and the matched rows
from the left table. If there's no match, NULL values are returned for columns from the left
table.

• FULL JOIN (FULL OUTER JOIN): Returns all rows from both tables, matched where possible. If
there's no match, NULL values are returned for columns from the table without a match.

3.2 SQL INNER JOIN

The INNER JOIN keyword selects rows from both tables as long as the join condition is satisfied. It
builds the result set by combining each pair of rows, one from each table, for which the condition
holds, i.e., the value of the common field is the same.

Syntax

SELECT table1.column1, table1.column2, table2.column1, ...
FROM table1
INNER JOIN table2
ON table1.matching_column = table2.matching_column;

Note: We can also write JOIN instead of INNER JOIN. JOIN is the same as INNER JOIN.

Example of INNER JOIN

Consider the two tables, Student and StudentCourse, which share a common column ROLL_NO. Using
SQL JOINS, we can combine data from these tables based on their relationship, allowing us to retrieve
meaningful information like student details along with their enrolled courses.

Student Table

StudentCourse Table

Query:

SELECT StudentCourse.COURSE_ID, Student.NAME, Student.AGE
FROM Student
INNER JOIN StudentCourse
ON Student.ROLL_NO = StudentCourse.ROLL_NO;
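
The INNER JOIN query can be run end to end with Python's built-in sqlite3 module. In this sketch the names and course IDs follow the result shown in the FULL JOIN section, while the roll numbers and ages are assumed values chosen to be consistent with that result.

```python
import sqlite3

# Assumed sample data: roll numbers and ages are illustrative values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.execute("CREATE TABLE StudentCourse (COURSE_ID INTEGER, ROLL_NO INTEGER)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?)", [
    (1, "HARSH", 18), (2, "PRATIK", 19), (3, "RIYANKA", 20), (4, "DEEP", 18),
    (5, "SAPTARHI", 19), (6, "DHANRAJ", 20), (7, "ROHIT", 18), (8, "NIRAJ", 19),
])
conn.executemany("INSERT INTO StudentCourse VALUES (?, ?)", [
    (1, 1), (2, 2), (2, 3), (3, 4), (1, 5), (4, 9), (5, 10), (4, 11),
])

# Only roll numbers present in BOTH tables appear in the result.
inner_rows = conn.execute(
    """
    SELECT StudentCourse.COURSE_ID, Student.NAME, Student.AGE
    FROM Student
    INNER JOIN StudentCourse
    ON Student.ROLL_NO = StudentCourse.ROLL_NO
    ORDER BY Student.ROLL_NO
    """
).fetchall()

for row in inner_rows:
    print(row)
```

Students without a course and course rows without a student are both left out, because an inner join keeps only matching pairs.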
3.3 SQL LEFT JOIN
A LEFT JOIN returns all rows from the left table, along with matching rows from the right table. If there
is no match, NULL values are returned for columns from the right table. LEFT JOIN is also known as LEFT
OUTER JOIN.

Syntax:

SELECT table1.column1, table1.column2, table2.column1, ...
FROM table1
LEFT JOIN table2
ON table1.matching_column = table2.matching_column;

LEFT JOIN Example

In this example, the LEFT JOIN retrieves all rows from the Student table and the matching rows from
the StudentCourse table based on the ROLL_NO column.

SELECT Student.NAME, StudentCourse.COURSE_ID
FROM Student
LEFT JOIN StudentCourse
ON StudentCourse.ROLL_NO = Student.ROLL_NO;
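
A runnable version of the LEFT JOIN example is sketched below with Python's built-in sqlite3 module. The roll numbers and ages are assumed sample values; the names and course IDs follow the result table shown in the FULL JOIN section.

```python
import sqlite3

# Assumed sample data: roll numbers and ages are illustrative values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.execute("CREATE TABLE StudentCourse (COURSE_ID INTEGER, ROLL_NO INTEGER)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?)", [
    (1, "HARSH", 18), (2, "PRATIK", 19), (3, "RIYANKA", 20), (4, "DEEP", 18),
    (5, "SAPTARHI", 19), (6, "DHANRAJ", 20), (7, "ROHIT", 18), (8, "NIRAJ", 19),
])
conn.executemany("INSERT INTO StudentCourse VALUES (?, ?)", [
    (1, 1), (2, 2), (2, 3), (3, 4), (1, 5), (4, 9), (5, 10), (4, 11),
])

# Every student is kept; COURSE_ID is NULL (None in Python) when no course matches.
left_rows = conn.execute(
    """
    SELECT Student.NAME, StudentCourse.COURSE_ID
    FROM Student
    LEFT JOIN StudentCourse
    ON StudentCourse.ROLL_NO = Student.ROLL_NO
    ORDER BY Student.ROLL_NO
    """
).fetchall()

for row in left_rows:
    print(row)
```

All eight students appear; the three with no enrollment carry None in the COURSE_ID position.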
3.4 SQL RIGHT JOIN
RIGHT JOIN returns all the rows of the table on the right side of the join and the matching rows of the
table on the left side of the join. It mirrors LEFT JOIN: for rows of the right table that have no
matching row on the left side, the result set contains NULL in the left table's columns. RIGHT JOIN is
also known as RIGHT OUTER JOIN.

SELECT table1.column1, table1.column2, table2.column1, ...
FROM table1
RIGHT JOIN table2
ON table1.matching_column = table2.matching_column;

In this example, the RIGHT JOIN retrieves all rows from the StudentCourse table and the matching
rows from the Student table based on the ROLL_NO column.

SELECT Student.NAME, StudentCourse.COURSE_ID
FROM Student
RIGHT JOIN StudentCourse
ON StudentCourse.ROLL_NO = Student.ROLL_NO;
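
The sketch below reproduces the RIGHT JOIN result with Python's built-in sqlite3 module. Because SQLite versions before 3.39 do not accept RIGHT JOIN, it swaps the table order and uses LEFT JOIN instead, which returns the same rows as "Student RIGHT JOIN StudentCourse". The roll numbers and ages are assumed sample values.

```python
import sqlite3

# Assumed sample data: roll numbers and ages are illustrative values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.execute("CREATE TABLE StudentCourse (COURSE_ID INTEGER, ROLL_NO INTEGER)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?)", [
    (1, "HARSH", 18), (2, "PRATIK", 19), (3, "RIYANKA", 20), (4, "DEEP", 18),
    (5, "SAPTARHI", 19), (6, "DHANRAJ", 20), (7, "ROHIT", 18), (8, "NIRAJ", 19),
])
conn.executemany("INSERT INTO StudentCourse VALUES (?, ?)", [
    (1, 1), (2, 2), (2, 3), (3, 4), (1, 5), (4, 9), (5, 10), (4, 11),
])

# Equivalent of: Student RIGHT JOIN StudentCourse, written as a LEFT JOIN
# with StudentCourse on the left so every course row is kept.
right_rows = conn.execute(
    """
    SELECT Student.NAME, StudentCourse.COURSE_ID
    FROM StudentCourse
    LEFT JOIN Student
    ON StudentCourse.ROLL_NO = Student.ROLL_NO
    ORDER BY StudentCourse.ROLL_NO
    """
).fetchall()

for row in right_rows:
    print(row)
```

Every StudentCourse row appears; the three rows whose roll number has no student carry None in the NAME position.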
3.5 SQL FULL JOIN
FULL JOIN creates the result set by combining the results of both LEFT JOIN and RIGHT JOIN. The
result set will contain all the rows from both tables. For rows that have no match, the
result set will contain NULL values.

SELECT table1.column1, table1.column2, table2.column1, ...
FROM table1
FULL JOIN table2
ON table1.matching_column = table2.matching_column;

This example demonstrates the use of a FULL JOIN, which combines the results of both LEFT
JOIN and RIGHT JOIN. The query retrieves all rows from the Student and StudentCourse tables. If a
record in one table does not have a matching record in the other table, the result set will include that
record with NULL values for the missing fields.

SELECT Student.NAME, StudentCourse.COURSE_ID
FROM Student
FULL JOIN StudentCourse
ON StudentCourse.ROLL_NO = Student.ROLL_NO;

NAME | COURSE_ID
HARSH | 1
PRATIK | 2
RIYANKA | 2
DEEP | 3
SAPTARHI | 1
DHANRAJ | NULL
ROHIT | NULL
NIRAJ | NULL
NULL | 4
NULL | 5
NULL | 4
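
The result above can be reproduced with Python's built-in sqlite3 module. Since SQLite versions before 3.39 do not accept FULL JOIN, this sketch emulates it: a LEFT JOIN keeps every student, and UNION ALL appends the course rows that have no matching student. The roll numbers and ages are assumed values chosen to be consistent with the result table.

```python
import sqlite3

# Assumed sample data: roll numbers and ages are illustrative values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.execute("CREATE TABLE StudentCourse (COURSE_ID INTEGER, ROLL_NO INTEGER)")
conn.executemany("INSERT INTO Student VALUES (?, ?, ?)", [
    (1, "HARSH", 18), (2, "PRATIK", 19), (3, "RIYANKA", 20), (4, "DEEP", 18),
    (5, "SAPTARHI", 19), (6, "DHANRAJ", 20), (7, "ROHIT", 18), (8, "NIRAJ", 19),
])
conn.executemany("INSERT INTO StudentCourse VALUES (?, ?)", [
    (1, 1), (2, 2), (2, 3), (3, 4), (1, 5), (4, 9), (5, 10), (4, 11),
])

full_rows = conn.execute(
    """
    -- Part 1: all students, with their course if one matches.
    SELECT Student.NAME, StudentCourse.COURSE_ID
    FROM Student
    LEFT JOIN StudentCourse
    ON StudentCourse.ROLL_NO = Student.ROLL_NO
    UNION ALL
    -- Part 2: course rows whose roll number matches no student.
    SELECT NULL, StudentCourse.COURSE_ID
    FROM StudentCourse
    LEFT JOIN Student
    ON StudentCourse.ROLL_NO = Student.ROLL_NO
    WHERE Student.ROLL_NO IS NULL
    """
).fetchall()

for row in full_rows:
    print(row)
```

The combined result has eleven rows: five matched pairs, three students without a course, and three course rows without a student.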
4. Basics of data analysis and reporting
4.1 Data Analysis
Data analytics draws conclusions by processing raw data. It is helpful in various businesses
because it lets a company make decisions based on those conclusions. Basically, data
analytics helps to convert a large number of figures into plain-English
conclusions, which are further helpful in making in-depth decisions. Data analysis can be broadly
categorized into four primary types: descriptive, diagnostic, predictive, and prescriptive. Each type
serves a unique purpose in understanding data and making informed decisions.

Beyond production optimization, data analytics is utilized in diverse sectors. Gaming firms utilize it to
design reward systems that engage players effectively, while content providers leverage analytics to
optimize content placement and presentation, ultimately driving user engagement.

Types of Data Analytics

There are four major types of data analytics:

1. Predictive (forecasting)

2. Descriptive (business intelligence and data mining)

3. Prescriptive (optimization and simulation)

4. Diagnostic

4.1.1 Predictive Analytics
Predictive analytics turns data into valuable, actionable information. It uses data
to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive
analytics draws on a variety of statistical techniques from modeling, machine learning, data mining,
and game theory that analyze current and historical facts to make predictions about future
events. Techniques that are used for predictive analytics are:

• Linear Regression

• Time Series Analysis and Forecasting

• Data Mining

Basic Cornerstones of Predictive Analytics

• Predictive modeling

• Decision Analysis and optimization

• Transaction profiling

Predictive analysis answers the question, "What might happen in the future?"

How Does Predictive Analytics Modeling Work?

1. Define a Problem

• First, data scientists or data analysts define the problem.

• Defining the problem means clearly stating the challenge that the organization aims to address
using data analysis.

• A well-defined problem statement helps determine the appropriate predictive analytics approach
to employ.

2. Gather and Organize Data

• Once the problem statement is defined, it is important to acquire and organize data properly.

• Acquiring data for predictive analytics means collecting and preparing relevant information
from sources such as databases, data warehouses, external data providers, APIs, logs, and
surveys, which can then be used to build and train predictive models.

3. Pre-process Data

• After collecting and organizing the data, we need to pre-process it.

• Raw data collected from different sources is rarely in an ideal state for analysis, so it must
be pre-processed before a predictive model is developed.

• Pre-processing involves cleaning the data to remove anomalies, handling missing data points,
addressing outliers that could be caused by input errors, and transforming the data into a
form that can be used for further analysis.

• Pre-processing ensures that the data is of high quality and ready for model development.

4. Develop Predictive Models

• Data scientists or data analysts use a range of tools and techniques to develop predictive
models, based on the problem statement and the nature of the datasets.

• Machine learning algorithms, regression models, decision trees, and neural networks are among
the most common techniques.

• These models are trained on the prepared data to identify correlations and patterns that can
be used for making predictions.

5. Validate and Deploy Results

• After the predictive model is built, validation is a critical step for assessing the accuracy
and reliability of its predictions.

• Data scientists rigorously evaluate the model's performance against known outcomes or test
datasets.

• If required, modifications are made to improve the accuracy of the model.

• Once the model achieves satisfactory results, it can be deployed to deliver predictions to
stakeholders.

• This can be done through applications, websites, or data dashboards, making the insights easily
accessible to decision makers and other stakeholders.
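
The workflow above can be illustrated with a minimal Python sketch that uses linear regression, one of the techniques listed earlier. Everything in it is an assumption made for illustration: the advertising-spend and sales figures are invented, the helper name fit_line is hypothetical, and a real project would normally use a dedicated library and far more data.

```python
import math

# Steps 1-2: assumed problem and data. Predict sales from advertising spend.
spend = [10, 20, 30, 40, 50, 60, 70, 80]
sales = [25, 44, 68, 81, 105, 118, 142, 165]

# Step 3: hold out the last two observations as a test set.
train_x, test_x = spend[:6], spend[6:]
train_y, test_y = sales[:6], sales[6:]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Step 4: train the model on the training data only.
slope, intercept = fit_line(train_x, train_y)

# Step 5: validate against the held-out outcomes.
predictions = [slope * x + intercept for x in test_x]
errors = [p - y for p, y in zip(predictions, test_y)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"MAE={mae:.2f} RMSE={rmse:.2f}")
```

MAE and RMSE measure how far the predictions fall from the known outcomes. If they are too large for the business purpose, the model is revised before deployment, as step 5 describes.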

4.1.2 Descriptive Analytics
Descriptive analytics analyzes past events for insight into how to approach future events. It
examines past performance by mining historical data to understand the reasons for past success or
failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses
this type of analysis.

The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model, which focuses on predicting the
behavior of a single customer, descriptive analytics identifies many different relationships between
customers and products.

How Does Descriptive Analytics Work?

Let's take a closer look at how descriptive analytics works. Descriptive analytics involves analyzing
and summarizing historical data to provide insights into previous events, trends, and patterns. It is
much closer to reporting than to what most people think of as analytics.

1. Data Collection: Collecting useful information is the initial stage in the descriptive analytics
process. The data comes from multiple sources such as databases, spreadsheets, and other data
repositories. The accuracy and quality of this data are extremely important, since they
directly affect how accurate the descriptive analytics is.

2. Data Cleaning and Preprocessing: The collected data usually needs to be cleaned and
preprocessed before analysis can start. This includes converting data into a uniform structure,
standardizing formats, and handling missing or incorrect values. Clean, well-preprocessed data
ensures that the subsequent analytics is reliable.

3. Data Analysis: This step provides an understanding of the structure and features of the dataset.
Exploratory data analysis (EDA) methods help find patterns, trends, and possible outliers in
the data. These methods include histograms, scatter plots, and summary statistics.

4. Compilation and Summary: The goal of descriptive analytics is to offer a high-level overview of
the data. This frequently requires aggregating the data to obtain important metrics and
statistics such as the mean, median, mode, range, and standard deviation.

5. Visualization: Visualizations are extremely useful tools in descriptive analytics. A variety of
charts, graphs, and other visual representations are used to communicate complex information.
Visualization highlights patterns and trends in the data and makes it easier to convey insights
to a wide range of audiences.

6. Narrative Creation: In addition to visuals, descriptive analytics can include written
descriptions that offer a logical, contextualized explanation of the data. This is especially
helpful when communicating findings to audience members who may not be familiar with the
complexities of the data.

7. Interpretation: Analysts interpret the outcomes of descriptive analytics to obtain meaningful
knowledge. This involves understanding the implications of the trends and patterns seen in the
data. Descriptive analytics concentrates on "what happened", while interpretation provides the
foundation for more in-depth analyses that investigate "why" and "what might happen in the
future".

8. Continuous Iteration: Descriptive analytics is not a one-time process. Organizations repeat it
whenever new data becomes available in order to stay informed about the latest developments
and patterns, so that decision makers always have the most recent information.
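
The Compilation and Summary step can be sketched in a few lines of Python using the standard-library statistics module. The monthly sales figures are assumed values used only for illustration.

```python
import statistics

# Assumed monthly sales figures (illustrative only).
monthly_sales = [120, 135, 150, 135, 160, 175, 135, 190]

# The summary metrics named in the Compilation and Summary step.
summary = {
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "mode": statistics.mode(monthly_sales),
    "range": max(monthly_sales) - min(monthly_sales),
    "std_dev": statistics.stdev(monthly_sales),  # sample standard deviation
}

for metric, value in summary.items():
    print(f"{metric}: {value:.2f}")
```

These few numbers describe what happened in the period. They do not say why it happened or what will happen next, which is the boundary between descriptive analytics and the other types.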

Advantages of Descriptive Analytics

• Data-driven decision making: By evaluating and summarizing data, it supports well-informed
decisions based on facts rather than gut instincts.

• Presents data clearly: Descriptive analytics simplifies complex data, making it easy to
understand through reports and visualizations like charts and graphs.

• Easy to Interpret: Data that has been summarized and represented graphically is easier for a
wider audience to interpret and evaluate.

• Identifies Relevant Data Points: It offers straightforward metrics that give an accurate
estimate of important data points.

• Simple and cost-effective: Descriptive analytics is simple to use and requires only basic
arithmetic knowledge.

• Efficient with tools: Tools like Python or MS Excel make the work fast and easy.

Disadvantages of Descriptive Analytics

• No Cause Analysis: The main goal of descriptive analytics is to describe historical events. It
does not explore the root causes of the patterns that are observed.

• Limited to Simple Analysis: The reach of descriptive analytics is restricted to basic analyses
that look at the relationships between a small number of variables.

• Doesn't Explain Why: It reports the facts of what happened, but it provides neither causes nor
predictions.

• Unsuitable for Real-Time Decisions: Descriptive analytics normally produces summary
information at regular intervals, which may not be the best option when time matters. In many
situations fast responsiveness is vital, so relying only on descriptive analytics can leave
you lagging behind.

• Limited Ability to Handle Unstructured Data: Descriptive analytics is better suited to
structured, well-organized datasets. It can be challenging to produce insightful analysis of
semi-structured or unstructured data such as text, photos, or multimedia.

4.1.3 Diagnostic Analytics
Diagnostic analytics plays an important role in today’s data-driven world by helping businesses
understand not just what happened but also why it happened. Think of it like solving a mystery by
asking questions such as “Why did sales fall?”, “Why are customers leaving?” or “What caused this
system breakdown?” By examining data, businesses can identify the main reasons behind these issues
and take action to resolve them.

Purpose of Diagnostic Analytics

1. Identify Root Causes: It helps businesses understand their data better and identify the key
factors behind specific outcomes, which leads to clear, actionable insights.

2. Solve Problems: By identifying root causes, companies can choose targeted solutions to
resolve issues and improve performance.

3. Inform Future Decisions: Understanding past events and their causes helps businesses make
data-driven decisions and develop smarter strategies.

Key Steps in Diagnostic Analytics

1. Identify the Anomaly: Detect irregularities in data from sources such as website logs,
customer feedback, and financial records. This helps identify issues that require further
investigation.

2. Data Collection: Collect data from sources such as transaction records, surveys, and system
logs that provide the information needed to understand the situation better.

3. Data Exploration: Explore the collected data to identify trends, patterns, and correlations.
Techniques like statistical analysis and data visualization help find insights that explain
the anomaly.

4. Pattern Identification: Data analysis methods like machine learning and correlation analysis
help detect recurring patterns or trends. This step helps link the anomalies to potential
causes.

5. Root Cause Analysis: Examine the identified patterns to find the main cause of the issue. This
step helps answer questions such as whether the cause is an operational issue, an external
factor, or a system issue.

6. Testing and Confirmation: Test the hypothesis using various tests or simulations, for example
by testing whether a decline in user engagement was caused by a new website feature or by a
marketing change.
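
Two of the steps above, identifying the anomaly and correlation analysis, can be sketched with standard-library Python. The daily sales and page-load figures are invented for illustration, and the z-score threshold of 2 is an assumed convention rather than a rule from this module.

```python
import math
import statistics

# Assumed daily figures for one week (illustrative only).
daily_sales = [200, 205, 198, 210, 202, 120, 207]
page_load_seconds = [1.1, 1.0, 1.2, 1.0, 1.1, 4.8, 1.0]


def z_scores(values):
    """Distance of each value from the mean, in sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Step 1: flag days whose sales are unusually far from the mean.
anomalous_days = [i for i, z in enumerate(z_scores(daily_sales)) if abs(z) > 2]

# Step 4: check whether sales move together with page-load time.
r = pearson(daily_sales, page_load_seconds)

print("anomalous day indexes:", anomalous_days)
print(f"correlation between sales and page-load time: {r:.3f}")
```

A strong negative correlation only suggests a candidate cause. It still has to be confirmed in the Testing and Confirmation step, because correlation alone does not establish causation.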

Benefits of Diagnostic Analytics

1. Deeper Insights: It helps businesses find hidden patterns and trends, which provides a clearer
understanding of the data. This helps them make informed decisions based on facts rather than
assumptions.

2. Improved Problem-Solving: By identifying root causes, businesses can apply targeted solutions
to their problems.

3. Optimized Processes: It highlights inefficiencies in workflows, which allows businesses to
streamline processes. This improves productivity, speeds up delivery, and leads to better
resource utilization.

4. Enhanced Decision-Making: With data-driven insights, businesses can make more informed and
strategic decisions. This minimizes risk and ensures that actions align with long-term goals.

5. Risk Reduction: Early detection of issues helps businesses address risks before they grow. By
taking timely measures, companies can prevent disruptions and avoid costly mistakes.

6. Customer Satisfaction: It helps businesses understand customer needs and preferences, which
helps in creating personalized experiences, improving satisfaction, and building stronger
customer loyalty.

4.1.4 Prescriptive Analytics
Prescriptive analytics is the area of business analytics dedicated to finding the best solution for
day-to-day problems. It is closely related to two comparable processes, descriptive and predictive
analytics. Prescriptive analytics can be defined as a type of data analytics that uses algorithms
and analysis of raw data to reach better and more effective decisions over both the short and the
long term. It suggests strategies based on possible scenarios, accumulated statistics, and past and
present data collected from the consumer community.

Prescriptive Analytics Approach

The steps below use a supply chain delivery problem as the running example.

Step 1 Data Collection: Gather data on customers' locations, their requirements, company
warehouses, and transportation.

Step 2 Mathematical Modeling: Create mathematical models that handle supply chain data such as
customer location, time, warehouse location, and routes, and define an optimization function that
minimizes company cost and delivery time.

Step 3 Optimization: Use an optimization approach such as linear programming or differential
calculus to solve the mathematical models and find optimal locations.

Step 4 Scenario Analysis: Perform a scenario analysis on the assumptions and variables of the
models.

Step 5 Decision Support: Based on the data modeling and the business knowledge gained from the raw
data, create dashboards and visualizations that help stakeholders make decisions.

Step 6 Implementation: The final and most important part, after completing the five steps above, is
to implement the recommendations with changes that maximize the company’s revenue.
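
Steps 2 and 3 can be illustrated with a deliberately small Python sketch. It does not use linear programming or calculus. Instead it swaps in exhaustive search over a handful of candidate sites, which is enough to show what "define a cost function, then minimize it" means. The customer coordinates, candidate warehouse sites, and the helper name total_distance are all assumptions made for illustration.

```python
import math

# Step 1: assumed (x, y) locations on a simple grid (illustrative only).
customers = {"A": (2, 3), "B": (5, 4), "C": (9, 6), "D": (4, 7)}
candidate_warehouses = {"North": (4, 8), "Central": (5, 5), "East": (9, 5)}


def total_distance(site, customer_locations):
    """Step 2: cost function, the summed straight-line distance to all customers."""
    return sum(math.dist(site, location) for location in customer_locations.values())


# Step 3: evaluate every candidate site and keep the cheapest one.
costs = {
    name: total_distance(site, customers)
    for name, site in candidate_warehouses.items()
}
best_site = min(costs, key=costs.get)

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: total distance {cost:.2f}")
print("recommended site:", best_site)
```

Exhaustive search works here only because there are three candidates. With many sites, routes, and constraints, a proper optimization method such as the linear programming mentioned in Step 3 would be needed. Re-running the sketch with different customer locations is a simple form of the scenario analysis in Step 4.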

Advantages of Prescriptive Analytics

• Maps business analysis onto the steps necessary to avoid failure and achieve success.

• Accurate and comprehensive data aggregation and analysis reduces human error and bias.

• Supports decision-making grounded in the problem at hand rather than jumping to unreliable
conclusions based on instinct.

• Removing immediate uncertainties helps prevent fraud, limits risk, increases efficiency, and
creates loyal customers.

4.1.5 Descriptive vs. Predictive vs. Prescriptive Analytics

| Feature | Descriptive Analytics | Predictive Analytics | Prescriptive Analytics |
|---|---|---|---|
| Purpose | Understand what happened in the past. | Forecast what might happen in the future. | Recommend actions to achieve desired outcomes. |
| Focus | Historical data analysis. | Future trends and patterns. | Decision-making and optimization. |
| Time Frame | Past events and trends. | Future events and probabilities. | Future actions and recommendations. |
| Examples | Summarizing sales data from the previous month. | Predicting future sales based on market trends and historical data. | Recommending product pricing strategies to maximize profits. |
| Tools | Reporting tools, dashboards, data visualization. | Statistical models, machine learning algorithms. | Optimization algorithms, decision support systems. |
| Key Metrics | Descriptive statistics: mean, median, mode, etc. | Predictive accuracy metrics: RMSE, MAE, etc. | Prescriptive performance metrics: ROI, cost-benefit analysis, etc. |
| Decision Support | Provides insights for informed decision-making. | Guides future actions and strategies. | Offers actionable recommendations to achieve specific goals. |
| Example Application | Analyzing website traffic to understand user behavior. | Predicting customer churn to anticipate and prevent losses. | Suggesting personalized marketing campaigns based on customer segmentation. |
| Objective | Historical understanding and trend analysis. | Future prediction and risk assessment. | Optimal decision-making and performance improvement. |
| Impact | Historical insights for strategy refinement. | Anticipating future scenarios for proactive decision-making. | Maximizing outcomes and efficiency through informed actions. |
| Data Requirements | Historical data sets. | Historical data sets, future predictors. | Historical data sets, future predictors, decision variables. |

4.1.6 Microsoft: Data warehousing and analytics

This example scenario demonstrates a data pipeline that integrates large amounts of data from
multiple sources into a unified analytics platform in Azure. This specific scenario is based on a
sales and marketing solution, but the design patterns are relevant for many industries requiring
advanced analytics of large datasets such as e-commerce, retail, and healthcare.

Dataflow

The data flows through the solution as follows:

1. For each data source, any updates are exported periodically into a staging area in Azure Data
Lake Storage.

2. Azure Data Factory incrementally loads the data from Azure Data Lake Storage into staging
tables in Azure Synapse Analytics. The data is cleansed and transformed during this process.
PolyBase can parallelize the process for large datasets.

3. After loading a new batch of data into the warehouse, a previously created Azure Analysis
Services tabular model is refreshed. This semantic model simplifies the analysis of business
data and relationships.

4. Business analysts use Microsoft Power BI to analyze warehoused data via the Analysis Services
semantic model.

Components

The company has data sources on many different platforms:

• SQL Server on-premises

• Oracle on-premises

• Azure SQL Database

• Azure table storage

• Azure Cosmos DB

Data is loaded from these different data sources using several Azure components:

• Azure Data Lake Storage is used to stage source data before it's loaded into Azure Synapse.

• Data Factory orchestrates the transformation of staged data into a common structure in Azure
Synapse. Data Factory uses PolyBase when loading data into Azure Synapse to maximize
throughput.

• Azure Synapse is a distributed system for storing and analyzing large datasets. Its use of
massive parallel processing (MPP) makes it suitable for running high-performance analytics.
Azure Synapse can use PolyBase to rapidly load data from Azure Data Lake Storage.

• Analysis Services provides a semantic model for your data. It can also increase system
performance when analyzing your data.

• Power BI is a suite of business analytics tools to analyze data and share insights. Power BI can
query a semantic model stored in Analysis Services, or it can query Azure Synapse directly.

• Microsoft Entra ID authenticates users who connect to the Analysis Services server through
Power BI. Data Factory can also use Microsoft Entra ID to authenticate to Azure Synapse via a
service principal or managed identity for Azure resources.

4.2 Reporting
Reports on data analysis are essential for communicating data-driven insights to decision-makers,
stakeholders, and other relevant parties. These reports provide an organized format for presenting
conclusions, analyses, and suggestions derived from data set analysis.

These reports are crucial in fields such as business, science, healthcare, finance, and government,
where data-driven decision-making is essential. A data analysis report combines quantitative and
qualitative data to evaluate past performance, understand current trends, and make informed
recommendations for the future. Think of it as a translator, taking the language of numbers and
transforming it into a clear and concise story that guides decision-making.

Why is Data Analysis Reporting Important?

1. Making decisions: Data analysis reports give decision-makers insightful information that helps
them make well-informed choices. By summarizing and analyzing data, these reports help
stakeholders understand trends, patterns, and relationships that can guide strategic planning
and decision-making.

2. Performance Evaluation: Organizations use data analysis reports to assess how well procedures,
products, or services are working. By examining pertinent metrics and key performance
indicators (KPIs), enterprises can pinpoint opportunities for improvement and maximize
productivity.

3. Risk management: Within a company, data analysis reports can be used to detect possible risks,
difficulties, or opportunities. By examining past data and predicting future patterns,
businesses can reduce risks and take advantage of new possibilities.

4. Communication and Transparency: By providing a concise and impartial summary of study
findings, data analysis reports promote communication and transparency within enterprises.
These reports help stakeholders understand complicated data insights, so that they can
cooperate effectively to solve problems and accomplish goals.

How to Write a Data Analysis Report?

1. Map Your Report with an Outline

Creating a well-structured outline is like drawing a roadmap for your report. It acts as a guide for
organizing your thoughts and content logically. Begin by identifying the key sections of the report,
such as introduction, methodology, findings, analysis, conclusions, and recommendations. Within each
section, break down the specific points or subtopics you want to address. This step-by-step approach
not only streamlines the writing process but also ensures that you cover all essential elements of
your analysis. Moreover, an outline helps you maintain focus and prevents you from veering off track,
ensuring that your report remains coherent and easy to follow for your audience.

2. Prioritize Key Performance Indicators (KPIs)

In a data analysis report, it's crucial to prioritize the most relevant Key Performance Indicators
(KPIs) to avoid overwhelming your audience with unnecessary information. Start by identifying the
KPIs that directly impact your business objectives and overall performance. These could include
metrics like revenue growth, customer retention rates, conversion rates, or website traffic. By
focusing on these key metrics, you give your audience a report with actionable insights that drive
strategic decision-making. Additionally, consider contextualizing these KPIs within your industry or
market landscape to provide a comprehensive understanding of your performance relative to competitors
or benchmarks.
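
As a rough illustration of how such KPIs are calculated, the Python sketch below uses commonly used definitions of revenue growth, customer retention rate, and conversion rate. The definitions are assumptions, since organizations often define these metrics differently, and all input figures are invented.

```python
def revenue_growth_pct(previous_revenue, current_revenue):
    """Percentage change in revenue from one period to the next."""
    return 100 * (current_revenue - previous_revenue) / previous_revenue


def retention_rate_pct(customers_at_start, customers_at_end, new_customers):
    """Share of the starting customers who are still customers at period end."""
    return 100 * (customers_at_end - new_customers) / customers_at_start


def conversion_rate_pct(conversions, visitors):
    """Share of visitors who completed the desired action."""
    return 100 * conversions / visitors


# Assumed figures for one reporting period (illustrative only).
kpis = {
    "revenue_growth_pct": revenue_growth_pct(80_000, 92_000),
    "retention_rate_pct": retention_rate_pct(500, 520, 70),
    "conversion_rate_pct": conversion_rate_pct(300, 12_000),
}

for name, value in kpis.items():
    print(f"{name}: {value:.1f}")
```

Reporting three or four such figures, each alongside an industry benchmark, is usually clearer than listing every metric that is available.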

3. Visualize Data with Impact

Data visualization plays a pivotal role in conveying complex information in a clear and engaging
manner. When selecting visualization tools, consider the nature of the data and the story you want
to tell. For instance, if you're illustrating historical trends, timelines or line graphs can
effectively showcase patterns over time. On the other hand, if you're comparing categorical data,
pie charts or bar graphs might be more suitable. The key is to choose visualization methods that
accurately represent your findings and facilitate comprehension for your audience. Additionally, pay
attention to design principles such as color contrast, labeling, and scale to ensure that your
visuals are both informative and visually appealing.

4. Craft a Compelling Data Narrative

Transforming your data into a compelling narrative is essential for engaging your audience and
highlighting key insights. Instead of presenting raw data, strive to tell a story that
contextualizes the numbers and unveils their significance.

Start by identifying specific events or trends in the data and explore the underlying reasons behind
them. For example, if you notice a sudden spike in sales, investigate the marketing campaign or
external factors that may have contributed to this increase. By weaving these insights into a
cohesive narrative, you can guide your audience through your analysis and make your findings more
memorable and impactful. Remember to keep your language clear and concise, avoiding jargon or
technical terms that may confuse your audience.

5. Organize for Clarity

Establishing a clear information hierarchy is essential for ensuring that your report is easy to
navigate and understand. Start by outlining the main points or sections of your report and consider
the logical flow of information. Typically, it's best to start with broader, more general
information and gradually delve into specifics as needed. This approach helps orient your audience
and provides them with a framework for understanding the rest of the report.

6. Summarize Key Findings

A concise summary at the beginning of your report serves as a roadmap for your audience, providing
them with a quick overview of the report's objectives and key findings. However, it's important to
write this summary after completing the report, as it requires a comprehensive understanding of the
data and analysis.

To create an effective summary, distill the main points of the report into a few succinct
paragraphs. Focus on highlighting the most significant insights and outcomes, avoiding unnecessary
details or technical language. Consider the needs of your audience and tailor the summary to address
their interests and priorities. By providing a clear and concise summary upfront, you set the stage
for the rest of the report and help busy readers grasp the essence of your analysis quickly.

7. Offer Actionable Recommendations

Effective communication of data analysis findings goes beyond simply reporting the numbers; it
involves providing actionable recommendations that drive decision-making and facilitate
improvements. When offering recommendations, remain objective and avoid assigning blame for any
negative outcomes. Instead, focus on identifying solutions and suggesting practical steps for
addressing challenges or leveraging opportunities.

Consider the implications of your findings for the broader business strategy and provide specific
guidance on how to implement changes or initiatives. Moreover, prioritize recommendations that are
realistic, achievable, and aligned with the organization's goals and resources. By offering
actionable recommendations, you demonstrate the value of your analysis and empower stakeholders to
take proactive steps towards improvement.

8. Leverage Interactive Dashboards for Enhanced Presentation

The presentation format of the report is as crucial as its content, as it directly impacts the
effectiveness of your communication and engagement with your audience. Interactive dashboards offer
a dynamic and visually appealing way to present data, allowing users to explore and interact with
the information in real time.
