Module Content: Introduction to data storage concepts, Basics of distributed file systems, Basics of batch processing, Introduction to basic SQL queries (SELECT, INSERT, UPDATE, DELETE), Introduction to database management systems (DBMS).
1. ""Learning Spark: Lightning-Fast Big Data Analysis" by Holden Karau, Andy Konwinski, Patrick
Wendell, and Matei Zaharia, O'Reilly Media
Contents
1. Introduction to data storage concepts
1.1 Data storage devices
1.1.1 What's the difference between NAS and SAN?
1.2 Types of storage devices and systems
1.3 Forms of data storage
1.4 Data storage security
2. Basics of distributed file systems
2.1 Components of DFS
2.2 Design Principles of Distributed File System
3. Basics of batch processing
3.1 Key Concepts
3.2 Benefits
3.3 Use Cases
4. Introduction to basic SQL queries
4.1 What is SQL?
4.2 Components of a SQL System
4.3 SQL Data Types
4.4 SQL Queries
4.4.1 Create Query
4.4.2 Insert Query
4.4.3 Select Query
4.4.4 Primary Key
4.4.5 Foreign Key
4.4.6 SQL CHECK
4.4.7 WHERE Clause
4.4.8 SQL Update
4.4.9 Delete Query
4.4.10 Sum, Average, Multiply, Count in SQL
5. Introduction to DBMS
5.1 Types of DBMS
5.2 Advantages of DBMS
5.3 Disadvantages of DBMS
5.4 Difference between File System and DBMS
5.5 ACID Properties in DBMS
5.6 Data Integrity
5.7 Data Consistency
5.8 Examples of DBMS in Various Fields
5.8.1 Banking System
5.8.2 College ERP System (Enterprise Resource Planning)
5.8.3 Social Media Platforms
5.9 DATABASE MODELS
5.9.1 Relational Model
5.9.2 ER (Entity-Relationship) Model
5.9.3 Relational vs ER Model
Assignment
1. Introduction to data storage concepts
Data storage refers to magnetic, optical or mechanical media that record and preserve digital information for ongoing or future operations. There are two types of digital information: input and output data. Users provide the input data, and computers provide the output data. However, a computer's CPU can't compute anything or produce output data without the user's input. Today, organizations and users require data storage to meet high-level computational needs for big data analytics, artificial intelligence (AI), machine learning (ML) and the Internet of Things (IoT). The other side of requiring vast data storage is protecting against data loss due to disaster, failure or fraud. So, to avoid data loss, organizations can also employ data storage as a backup and restore solution.
Direct-attached storage (DAS) is just what the name implies: storage in the immediate area, directly connected to the computing machine accessing it. Often, it's the only machine connected to it. DAS can also provide decent local backup services, but sharing is limited. DAS devices include diskettes, optical discs such as compact discs (CDs) and digital video discs (DVDs), hard disk drives (HDDs), flash drives and solid-state drives (SSDs).
Network-based storage allows multiple computers to access it through a network, making it better for data sharing and collaboration. Its off-site storage capability is also better suited for backups and data protection. Two standard network-based storage setups are network-attached storage (NAS) and storage area network (SAN).
File storage, or file-based storage, is a hierarchical storage methodology used to organize and store
data. In other words, data is stored in files, which are organized in folders, which are organized under
a hierarchy of directories and subdirectories.
Block storage
Block storage, sometimes called block-level storage, is a technology for storing data in blocks. The blocks are then stored as separate pieces, each with a unique identifier. Developers favor block storage for computing situations that require fast, efficient and reliable data transfer.
Object storage
Object storage, often called object-based storage, is a data storage architecture for handling large amounts of unstructured data. This data doesn't conform to, or can't be organized easily into, a traditional relational database with rows and columns. Examples include email, videos, photos, web pages, audio files, sensor data and other media and web content (textual or nontextual). Other use cases include building cloud-native applications or transforming legacy applications into next-generation cloud applications by using cloud-based object storage as a persistent data store.
Data breaches are costly and present an ongoing threat to enterprise businesses. According to the IBM Cost of a Data Breach Report 2023, the global average data breach cost in that year was USD 4.45 million, a 15% increase over three years. The report also revealed that organizations that use security AI and automation extensively save an average of USD 1.76 million compared to organizations that don't.
Enterprises deploy data security measures to enhance visibility into data storage. Storage security hardware and software features include special permissions, encryption, data masking and redaction of sensitive files. The latest storage security software solutions also help to automate reporting to streamline audits and adhere to regulatory requirements.
Moreover, cyber resilience, an organization's ability to prevent, withstand and recover from cybersecurity incidents, has become an integral part of data storage security. Cyber resilience takes data security to a new level by combining business continuity and disaster recovery (BCDR), information systems security and organizational resilience to help organizations ward off threats and safeguard their data.
Today, industries that need to preserve records and maintain data integrity (for example, healthcare, government) can opt for immutable storage, which protects stored data by preventing any changes or alterations for a set or indefinite amount of time. These file systems allow stored data to be accessed repeatedly once created, but not modified, and can help protect data from tampering, cyberattacks and ransomware.
2. Basics of distributed file systems
A Distributed File System (DFS) is a file system that is distributed across multiple file servers or multiple locations. It allows programs to access or store isolated files as they do with local ones, allowing programmers to access files from any network or computer.
2.1 Components of DFS
DFS has two main components: the namespace component and the file replication component. In the case of failure and heavy load, these components together improve data availability by allowing data shared in different locations to be logically grouped under one folder, known as the "DFS root". It is not necessary to use both components of DFS together: it is possible to use the namespace component without the file replication component, and it is perfectly possible to use the file replication component between servers without the namespace component.
3.1 Key Concepts
Grouping of Tasks: Batch processing involves grouping multiple tasks or transactions together and processing them as a single unit, rather than individually.
Automated Execution: Once the batch is initiated, it runs autonomously without user intervention until it completes or encounters an error.
Efficiency for Large Data: This approach is well-suited for processing large datasets or tasks that require significant computing power and time, such as data analysis or ETL (Extract, Transform, Load) operations.
Scheduled Execution: Batch processing is often scheduled to run during off-peak hours or at specific intervals to optimize resource utilization.
3.2 Benefits
Cost-Effective: By processing multiple tasks at once, batch processing can reduce the overall processing time and cost.
Data Consistency: Processing in batches ensures data consistency and reduces the risk of errors.
Scalability: Batch processing systems can be designed to handle large volumes of data and scale to meet growing needs.
3.3 Use Cases
Payroll Processing: Calculating and distributing employee salaries at the end of a pay period.
End-of-Month Reconciliation: Processing and reconciling financial transactions at the end of each month.
Data Analytics: Performing complex analytical computations on large datasets.
ETL Operations: Extracting, transforming, and loading data from multiple sources.
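The payroll use case can be sketched as a tiny batch job in Python. The function name, pay rate, and timesheet data below are invented for illustration; the point is that records are grouped and processed as one unit with no user intervention once the run starts.

```python
# Minimal batch-processing sketch: all tasks in the batch are processed
# together in a single automated pass.
def run_payroll_batch(timesheets, hourly_rate=20):
    """Process every (employee, hours) record in one run."""
    payslips = []
    for employee, hours in timesheets:
        payslips.append((employee, hours * hourly_rate))
    return payslips

# One batch covering the whole pay period.
payslips = run_payroll_batch([("alice", 160), ("bob", 152)])
print(payslips)  # [('alice', 3200), ('bob', 3040)]
```

In practice such a script would be launched by a scheduler (for example, cron) during off-peak hours rather than run by hand.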
4.4 SQL Queries
SQL commands are the fundamental building blocks for communicating with a database management system (DBMS). They are used to interact with the database and to perform specific tasks, functions, and queries of data. SQL can perform various tasks such as creating a table, adding data to tables, dropping a table, modifying a table, and setting permissions for users.
4.4.1 Create Query
The CREATE TABLE statement is used to create a new table in the database.
Example:
CREATE TABLE Persons (
PersonID int,
LastName varchar(255),
FirstName varchar(255),
Address varchar(255),
City varchar(255)
);
4.4.2 Insert Query
To insert data into a table, we use the INSERT INTO statement:
INSERT INTO table_name (column1, column2, ...)
VALUES (value1, value2, ...);
4.4.3 Select Query
To display the data, we use the SELECT statement:
SELECT column1, column2, ... FROM table_name;
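The create/insert/select cycle can be run end to end with Python's built-in sqlite3 module. This is an illustrative sketch reusing the Persons table from the notes; the inserted row values are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = con.cursor()

# Create the Persons table from the example above.
cur.execute("""CREATE TABLE Persons (
    PersonID int,
    LastName varchar(255),
    FirstName varchar(255),
    Address varchar(255),
    City varchar(255)
)""")

# Insert one row, then read it back with SELECT.
cur.execute("INSERT INTO Persons (PersonID, LastName, FirstName, Address, City) "
            "VALUES (1, 'Doe', 'John', 'Main St 4', 'Oslo')")
rows = cur.execute("SELECT PersonID, FirstName, City FROM Persons").fetchall()
print(rows)  # [(1, 'John', 'Oslo')]
```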
4.4.4 Primary key
In SQL, a primary key is a column or a set of columns that uniquely identifies each row in a table. It ensures that no two rows have the same primary key value, and it cannot contain NULL values. A table can have only one primary key.
Example:
Query:
CREATE TABLE Orders (
OrderID int NOT NULL,
OrderNumber int NOT NULL,
PersonID int,
PRIMARY KEY (OrderID),
FOREIGN KEY (PersonID) REFERENCES Persons(PersonID)
);
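The Orders/Persons example above can be exercised with sqlite3 to see both constraints reject bad rows. Note one assumption: SQLite only enforces foreign keys after PRAGMA foreign_keys = ON. The inserted values are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.execute("CREATE TABLE Persons (PersonID int PRIMARY KEY)")
con.execute("""CREATE TABLE Orders (
    OrderID int NOT NULL,
    OrderNumber int NOT NULL,
    PersonID int,
    PRIMARY KEY (OrderID),
    FOREIGN KEY (PersonID) REFERENCES Persons(PersonID)
)""")

con.execute("INSERT INTO Persons VALUES (1)")
con.execute("INSERT INTO Orders VALUES (10, 77895, 1)")  # OK: person 1 exists

pk_rejected = fk_rejected = False
try:  # duplicate OrderID violates the primary key
    con.execute("INSERT INTO Orders VALUES (10, 77896, 1)")
except sqlite3.IntegrityError:
    pk_rejected = True
try:  # PersonID 99 does not exist, so the foreign key rejects it
    con.execute("INSERT INTO Orders VALUES (11, 77897, 99)")
except sqlite3.IntegrityError:
    fk_rejected = True

print(pk_rejected, fk_rejected)  # True True
```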
4.4.6 SQL CHECK
The CHECK constraint limits the values that can be stored in a column. For example, CHECK (Age >= 18) ensures that any value entered into the Age column must be 18 or older. If a value below 18 is inserted, the database will reject it.
CREATE TABLE Persons (
ID int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int,
City varchar(255),
CONSTRAINT CHK_Person CHECK (Age>=18 AND City='Sandnes')
);
Explanation:
ID int NOT NULL: Each person must have a unique, non-null ID.
This named constraint CHK_Person enforces a rule:
The Age must be at least 18,
AND the City must be exactly 'Sandnes'.
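A quick sqlite3 sketch shows the CHK_Person constraint in action; the names and ages below are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Persons (
    ID int NOT NULL,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int,
    City varchar(255),
    CONSTRAINT CHK_Person CHECK (Age>=18 AND City='Sandnes')
)""")

# Satisfies both conditions of CHK_Person: accepted.
con.execute("INSERT INTO Persons VALUES (1, 'Hansen', 'Ola', 30, 'Sandnes')")

# Age below 18 violates the named constraint: rejected.
rejected = False
try:
    con.execute("INSERT INTO Persons VALUES (2, 'Olsen', 'Kari', 15, 'Sandnes')")
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```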
4.4.7 WHERE clause
The WHERE clause in SQL is used to filter records based on a specified condition, extracting only those that meet the criteria. It's a fundamental part of SQL queries, enabling you to retrieve specific subsets of data from a database table. The WHERE clause can be used in SELECT, UPDATE, and DELETE statements.
SELECT column_name(s)
FROM table_name
WHERE condition;
Example:
SELECT * FROM Customers WHERE City = 'Aralumallige';
4.4.8 SQL Update
The UPDATE statement is used to modify the existing records in a table.
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
Example:
UPDATE Customers
SET ContactName = 'Alfred Schmidt', City= 'Frankfurt'
WHERE CustomerID = 1;
4.4.9 Delete query
The SQL DELETE query is used to remove records from a table. Here's the basic syntax:
DELETE FROM table_name WHERE condition;
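UPDATE and DELETE with a WHERE clause can be demonstrated together in sqlite3, reusing the Customers example from the notes (the second customer row is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customers (CustomerID int, ContactName text, City text)")
con.executemany("INSERT INTO Customers VALUES (?, ?, ?)",
                [(1, 'Maria Anders', 'Berlin'), (2, 'Ana Trujillo', 'Mexico City')])

# UPDATE: the WHERE clause limits the change to CustomerID 1.
con.execute("UPDATE Customers SET ContactName = 'Alfred Schmidt', City = 'Frankfurt' "
            "WHERE CustomerID = 1")

# DELETE: remove only the row matching the condition.
con.execute("DELETE FROM Customers WHERE CustomerID = 2")

remaining = con.execute("SELECT * FROM Customers").fetchall()
print(remaining)  # [(1, 'Alfred Schmidt', 'Frankfurt')]
```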
5. Introduction to DBMS
A Database Management System (DBMS) provides the following core capabilities:
1. Data Modeling: Tools to create and modify data models, defining the structure and relationships within the database.
2. Data Storage and Retrieval: Efficient mechanisms for storing data and executing queries to retrieve it quickly.
3. Concurrency Control: Ensures multiple users can access the database simultaneously without conflicts.
4. Data Integrity and Security: Enforces rules to maintain accurate and secure data, including access controls and encryption.
5. Backup and Recovery: Protects data with regular backups and enables recovery in case of system failures.
5.1 Types of DBMS
There are several types of Database Management Systems (DBMS), each tailored to different data structures, scalability requirements, and application needs. The most common types are as follows:
1. Relational DBMS (RDBMS)
An RDBMS organizes data into tables (relations) composed of rows and columns. It uses primary keys to uniquely identify rows and foreign keys to establish relationships between tables. Queries are written in SQL (Structured Query Language), which allows for efficient data manipulation and retrieval.
2. NoSQL DBMS
NoSQL systems are designed to handle large-scale data and provide high performance for scenarios where relational models might be restrictive. They store data in various non-relational formats, such as key-value pairs, documents, graphs, or columns. These flexible data models enable rapid scaling and are well-suited for unstructured or semi-structured data.
3. Object-Oriented DBMS (OODBMS)
An OODBMS integrates object-oriented programming concepts into the database environment, allowing data to be stored as objects. This approach supports complex data types and relationships, making it ideal for applications requiring advanced data modeling and real-world simulations.
5.2 Advantages of DBMS
2. Data integrity: A DBMS provides mechanisms for enforcing data integrity constraints, such as constraints on the values of data and access controls that restrict who can access the data.
4. Data security: A DBMS provides tools for managing the security of the data, such as controlling access to the data and encrypting sensitive data.
5. Backup and recovery: A DBMS provides mechanisms for backing up and recovering the data in the event of a system failure.
6. Data sharing: A DBMS allows multiple users to access and share the same data, which can be useful in a collaborative work environment.
5.3 Disadvantages of DBMS
2. Performance overhead: The use of a DBMS can add overhead to the performance of an application, especially in cases where high levels of concurrency are required.
3. Scalability: The use of a DBMS can limit the scalability of an application, since it requires the use of locking and other synchronization mechanisms to ensure data consistency.
4. Cost: The cost of purchasing, maintaining and upgrading a DBMS can be high, especially for large or complex systems.
5. Limited use cases: Not all use cases are suitable for a DBMS; some solutions don't need high reliability, consistency or security and may be better served by other types of data storage.
5.4 Difference between File System and DBMS

Feature | File System | DBMS
1. Data Storage | Stores data in files, manually managed | Stores data in a structured way using tables and schemas
2. Data Redundancy | High – same data may be stored in multiple files | Low – normalization helps reduce redundancy
3. Data Consistency | Difficult to maintain across files | Easier due to central control and integrity constraints
4. Data Security | Limited – file-level protection only | High – user roles, access control, authentication
5. Data Access | Manual and sequential (using file handlers) | Query-based (using SQL), faster and flexible
6. Data Integrity | Hard to enforce rules | Easy with constraints (e.g., primary key, foreign key)
9. Scalability | Not scalable for large data volumes | Designed to handle large-scale applications
10. Examples | Text files, Excel files | MySQL, Oracle, PostgreSQL, MongoDB
5.5 ACID Properties in DBMS

Property | Meaning
A – Atomicity | A transaction is all or nothing: either every operation in it completes, or none of them do.
C – Consistency | Ensures that the database is always in a valid state before and after the transaction.
I – Isolation | Transactions do not interfere with each other. They execute as if they were independent.
D – Durability | Once a transaction is committed, the changes are permanent, even if there's a crash.

Example (money transfer):
Property | Example
Isolation | If two users are transferring money at the same time, they won't affect each other.
Durability | Once the transfer is successful, it remains recorded even after power failure.
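Atomicity in the money-transfer example can be demonstrated with sqlite3: either both legs of the transfer commit, or the whole transaction rolls back. The account names, balances, and overdraft rule below are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [('A', 100), ('B', 0)])
con.commit()

# Try to transfer more money than A has, inside one transaction.
try:
    with con:  # this block commits on success and rolls back on an exception
        con.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'A'")
        (bal,) = con.execute("SELECT balance FROM accounts WHERE name = 'A'").fetchone()
        if bal < 0:  # hypothetical business rule: no overdrafts
            raise ValueError("insufficient funds")
        con.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'B'")
except ValueError:
    pass  # atomicity: the partial debit above was rolled back

balances = con.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall()
print(balances)  # [('A', 100), ('B', 0)]
```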
5.6 Data Integrity
• Definition: Data integrity ensures that the data is accurate, complete, and valid according to rules and constraints.
• Example:
• Alice's email address is stored as alice@example.com.
• The system requires that the email field must not be empty and must follow a proper email format.
• If someone tries to insert aliceexample.com (missing @), the system rejects it due to a data integrity constraint.
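The email rule can be approximated in sqlite3 with a crude CHECK constraint; the table name and the "must contain @" rule are simplifications chosen for illustration, not a full email-format validator.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Users (
    name  TEXT NOT NULL,
    email TEXT NOT NULL CHECK (email LIKE '%@%')
)""")

# Well-formed address: accepted.
con.execute("INSERT INTO Users VALUES ('Alice', 'alice@example.com')")

# Missing '@': rejected by the integrity constraint.
rejected = False
try:
    con.execute("INSERT INTO Users VALUES ('Alice', 'aliceexample.com')")
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```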
5.8 Examples of DBMS in Various Fields
5.8.1 Banking System
Why DBMS?
• Allows multiple users (bank branches) to access the same central database.
5.8.2 College ERP System (Enterprise Resource Planning)
Why DBMS?
• Ensures data consistency, e.g., a student's marks match with their subjects.
5.8.3 Social Media Platforms
Why DBMS?
Symbol | Meaning
Rectangle | Entity
Ellipse | Attribute
Line | Connection
Module-4:
Module Content: Overview of data warehousing concepts, Basics of data warehousing solutions, Introduction to basic SQL queries for data retrieval (SELECT statements), Basics of data analysis and reporting.
2. Enhanced Analytics: Transactional databases are not optimized for analytical purposes. A data warehouse is built specifically for data analysis, enabling businesses to perform complex queries and gain insights from historical data.
3. Centralized Data Storage: A data warehouse acts as a central repository for all organizational data, helping businesses to integrate data from multiple sources and have a unified view of their operations for better decision-making.
4. Trend Analysis: By storing historical data, a data warehouse allows businesses to analyze trends over time, enabling them to make strategic decisions based on past performance and predict future outcomes.
5. Support for Business Intelligence: Data warehouses support business intelligence tools and reporting systems, providing decision-makers with easy access to critical information, which enhances operational efficiency and supports data-driven strategies.
1.2 Components of Data Warehouse
The main components of a data warehouse include:
Data Sources: These are the various operational systems, databases, and external data feeds that provide raw data to be stored in the warehouse.
ETL (Extract, Transform, Load) Process: The ETL process is responsible for extracting data from different sources, transforming it into a suitable format, and loading it into the data warehouse.
Data Warehouse Database: This is the central repository where cleaned and transformed data is stored. It is typically organized in a multidimensional format for efficient querying and reporting.
Metadata: Metadata describes the structure, source, and usage of data within the warehouse, making it easier for users and systems to understand and work with the data.
Data Marts: These are smaller, more focused data repositories derived from the data warehouse, designed to meet the needs of specific business departments or functions.
OLAP (Online Analytical Processing) Tools: OLAP tools allow users to analyse data in multiple dimensions, providing deeper insights and supporting complex analytical queries.
End-User Access Tools: These are reporting and analysis tools, such as dashboards or Business Intelligence (BI) tools, that enable business users to query the data warehouse and generate reports.
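The ETL process described above can be sketched in a few lines of Python. The two "sources", their column layout, and the fact-table name are all invented for illustration; sqlite3 stands in for the warehouse database.

```python
import sqlite3

# Extract: raw rows from two hypothetical operational sources.
sales_rows = [("2024-01-03", "101", " 250.0"), ("2024-01-04", "102", "99.5")]
crm_rows = [{"customer_id": 101, "region": "South"},
            {"customer_id": 102, "region": "North"}]

# Transform: fix types, strip whitespace, and join the sources into one shape.
region_by_id = {c["customer_id"]: c["region"] for c in crm_rows}
clean = [(date, int(cid), float(amount), region_by_id[int(cid)])
         for date, cid, amount in sales_rows]

# Load: write the cleaned rows into the warehouse fact table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales "
           "(sale_date TEXT, customer_id INT, amount REAL, region TEXT)")
dw.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", clean)

# A downstream analytical query over the loaded data.
totals = dw.execute("SELECT region, SUM(amount) FROM fact_sales "
                    "GROUP BY region ORDER BY region").fetchall()
print(totals)  # [('North', 99.5), ('South', 250.0)]
```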
Centralized Data Repository: Data warehousing provides a centralized repository for all enterprise data from various sources, such as transactional databases, operational systems, and external sources. This enables organizations to have a comprehensive view of their data, which can help in making informed business decisions.
Data Integration: Data warehousing integrates data from different sources into a single, unified view, which can help in eliminating data silos and reducing data inconsistencies.
Historical Data Storage: Data warehousing stores historical data, which enables organizations to analyze data trends over time. This can help in identifying patterns and anomalies in the data, which can be used to improve business performance.
Query and Analysis: Data warehousing provides powerful query and analysis capabilities that enable users to explore and analyze data in different ways. This can help in identifying patterns and trends, and can also help in making informed business decisions.
Data Transformation: Data warehousing includes a process of data transformation, which involves cleaning, filtering, and formatting data from various sources to make it consistent and usable. This can help in improving data quality and reducing data inconsistencies.
Data Mining: Data warehousing provides data mining capabilities, which enable organizations to discover hidden patterns and relationships in their data. This can help in identifying new opportunities, predicting future trends, and mitigating risks.
Data Security: Data warehousing provides robust data security features, such as access controls, data encryption, and data backups, which ensure that the data is secure and protected from unauthorized access.
1. Enterprise Data Warehouse (EDW): A centralized warehouse that stores data from across the organization for analysis and reporting.
2. Operational Data Store (ODS): Stores real-time operational data used for day-to-day operations, not for deep analytics.
3. Data Mart: A subset of a data warehouse, focusing on a specific business area or department.
4. Cloud Data Warehouse: A data warehouse hosted in the cloud, offering scalability and flexibility.
5. Big Data Warehouse: Designed to store vast amounts of unstructured and structured data for big data analysis.
6. Virtual Data Warehouse: Provides access to data from multiple sources without physically storing it.
7. Hybrid Data Warehouse: Combines on-premises and cloud-based storage to offer flexibility.
8. Real-time Data Warehouse: Designed to handle real-time data streaming and analysis for immediate insights.
Social Media Websites: Social networking websites like Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a single central repository. Because of the sheer volume of data involved, a data warehouse is needed to implement this.
Banking: Most banks these days use warehouses to see the spending patterns of account/cardholders. They use this to provide them with special offers, deals, etc.
Government: Governments use data warehouses to store and analyze tax payments, which are used to detect tax thefts.
Business Intelligence: Provides strong operational insights through business intelligence.
Data Quality: Guarantees data quality and consistency for trustworthy reporting.
Scalability: Capable of managing massive data volumes and expanding to meet changing requirements.
Effective Queries: Fast and effective data retrieval is made possible by an optimized structure.
Cost Reductions: Data warehousing can result in cost savings over time by streamlining data management procedures and increasing overall efficiency, even when there are setup costs initially.
Faster Queries: The data warehouse is designed to handle large queries, which is why it runs them faster than an operational database.
Historical Insight: The warehouse stores all your historical data, which contains details about the business, so that one can analyze it at any time and extract insights from it.
Complexity: Data warehousing can be complex, and businesses may need to hire specialized personnel to manage the system.
Time-consuming: Building a data warehouse can take a significant amount of time, requiring businesses to be patient and committed to the process.
Data integration challenges: Data from different sources can be challenging to integrate, requiring significant effort to ensure consistency and accuracy.
Data security: Data warehousing can pose data security risks, and businesses must take measures to protect sensitive data from unauthorized access or breaches.
2. Basics of data warehousing solutions
Data warehousing is a system that consolidates data from various sources into a central repository for efficient querying and analysis, supporting business intelligence activities like reporting and data mining. It involves processes like data integration, transformation, and loading (ETL) to prepare data for analysis.
All student data (admissions, results, fees, attendance) is stored in servers located within the university campus.

Feature | Description
Cost | High upfront cost (servers, licenses), but no recurring cloud bills.

Benefits:
Data Privacy: Sensitive data (e.g., student grades, ID cards, salaries) stays local.

Drawbacks:

On-Premise examples:
• Oracle
• Microsoft SQL Server
• Teradata
We access it via the internet, and the cloud provider handles infrastructure, storage, security, and scalability.

Feature | Description
Hosted Remotely | Data is stored in data centers of cloud vendors (AWS, Azure, GCP).
Subscription Pricing | Pay for what you use (no large upfront cost).

Benefits:
Benefit | Description
Scalable | Handles growing student data and performance needs automatically.

Drawbacks:
Limitation | Description
Data Security Concerns | Data is stored off-campus, which may raise privacy questions.

Cloud-Based examples:
• Amazon Redshift
• Google BigQuery
• Snowflake
Feature | Amazon Redshift | Google BigQuery | Snowflake | Azure Synapse
Performance | High with tuning | Very fast for ad-hoc queries | High, auto-scaling | Good for hybrid workloads
Ease of Use | Moderate (needs tuning) | Very easy, SQL-based | Very easy, intuitive | Easy (integrates with MS tools)
Integration Tools | AWS Glue, S3, QuickSight | Dataflow, Cloud Storage, Looker | Snowpipe, BI tools | Power BI, Data Factory
Before diving into the specifics, let's visualize how each join type operates:
INNER JOIN: Returns only the rows where there is a match in both tables.
LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table, and the matched rows from the right table. If there's no match, NULL values are returned for columns from the right table.
RIGHT JOIN (RIGHT OUTER JOIN): Returns all rows from the right table, and the matched rows from the left table. If there's no match, NULL values are returned for columns from the left table.
FULL JOIN (FULL OUTER JOIN): Returns all rows when there is a match in one of the tables. If there's no match, NULL values are returned for columns from the table without a match.
Syntax
Note: We can also write JOIN instead of INNER JOIN. JOIN is the same as INNER JOIN.
Student Table
StudentCourse Table
Query:
Syntax:
SELECT table1.column1,table1.column2,table2.column1,….
FROM table1
LEFT JOIN table2
ON table1.matching_column = table2.matching_column;
In this example, the LEFT JOIN retrieves all rows from the Student table and the matching rows from
the StudentCourse table based on the ROLL_NO column.
SELECT Student.NAME,StudentCourse.COURSE_ID
FROM Student
LEFT JOIN StudentCourse
ON StudentCourse.ROLL_NO = Student.ROLL_NO;
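The LEFT JOIN above can be run in sqlite3 with a small made-up subset of the Student/StudentCourse data (three students, two course rows), contrasted with an INNER JOIN on the same data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Student (ROLL_NO INT, NAME TEXT)")
con.execute("CREATE TABLE StudentCourse (COURSE_ID INT, ROLL_NO INT)")
con.executemany("INSERT INTO Student VALUES (?, ?)",
                [(1, 'HARSH'), (2, 'PRATIK'), (6, 'DHANRAJ')])
con.executemany("INSERT INTO StudentCourse VALUES (?, ?)", [(1, 1), (2, 2)])

# LEFT JOIN keeps every student; DHANRAJ has no course row, so COURSE_ID
# is NULL (surfaced as None in Python).
left = con.execute("""SELECT Student.NAME, StudentCourse.COURSE_ID
                      FROM Student
                      LEFT JOIN StudentCourse
                      ON StudentCourse.ROLL_NO = Student.ROLL_NO
                      ORDER BY Student.ROLL_NO""").fetchall()
print(left)   # [('HARSH', 1), ('PRATIK', 2), ('DHANRAJ', None)]

# INNER JOIN drops the unmatched student entirely.
inner = con.execute("""SELECT Student.NAME, StudentCourse.COURSE_ID
                       FROM Student
                       JOIN StudentCourse
                       ON StudentCourse.ROLL_NO = Student.ROLL_NO
                       ORDER BY Student.ROLL_NO""").fetchall()
print(inner)  # [('HARSH', 1), ('PRATIK', 2)]
```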
3.4 SQL RIGHT JOIN
RIGHT JOIN returns all the rows of the table on the right side of the join and the matching rows from the table on the left side of the join. It is very similar to LEFT JOIN: for the rows for which there is no matching row on the left side, the result set will contain NULL. RIGHT JOIN is also known as RIGHT OUTER JOIN.
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1
RIGHT JOIN table2
ON table1.matching_column = table2.matching_column;
In this example, the RIGHT JOIN retrieves all rows from the StudentCourse table and the matching
rows from the Student table based on the ROLL_NO column.
SELECT Student.NAME,StudentCourse.COURSE_ID
FROM Student
RIGHT JOIN StudentCourse
ON StudentCourse.ROLL_NO = Student.ROLL_NO;
3.5 SQL FULL JOIN
FULL JOIN creates the result set by combining the results of both LEFT JOIN and RIGHT JOIN. The result set will contain all the rows from both tables. For the rows for which there is no match, the result set will contain NULL values.
SELECT table1.column1,table1.column2,table2.column1,....
FROM table1
FULL JOIN table2
ON table1.matching_column = table2.matching_column;
This example demonstrates the use of a FULL JOIN, which combines the results of both LEFT
JOIN and RIGHT JOIN. The query retrieves all rows from the Student and StudentCourse tables. If a
record in one table does not have a matching record in the other table, the result set will include that
record with NULL values for the missing fields.
SELECT Student.NAME,StudentCourse.COURSE_ID
FROM Student
FULL JOIN StudentCourse
ON StudentCourse.ROLL_NO = Student.ROLL_NO;
NAME COURSE_ID
HARSH 1
PRATIK 2
RIYANKA 2
DEEP 3
SAPTARHI 1
DHANRAJ NULL
ROHIT NULL
NIRAJ NULL
NULL 4
NULL 5
NULL 4
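Because older SQLite versions lack FULL JOIN, a common workaround is to UNION two LEFT JOINs taken in opposite directions. The sketch below demonstrates this with hypothetical rows (not the full tables shown above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (ROLL_NO INTEGER, NAME TEXT);
CREATE TABLE StudentCourse (COURSE_ID INTEGER, ROLL_NO INTEGER);
INSERT INTO Student VALUES (1, 'HARSH'), (2, 'PRATIK'), (3, 'ROHIT');
INSERT INTO StudentCourse VALUES (1, 1), (5, 9);  -- ROLL_NO 9 not in Student
""")

# FULL JOIN emulation: LEFT JOIN in both directions, merged with UNION
# (UNION also removes the duplicated matching rows).
rows = conn.execute("""
SELECT Student.NAME, StudentCourse.COURSE_ID
FROM Student LEFT JOIN StudentCourse
  ON StudentCourse.ROLL_NO = Student.ROLL_NO
UNION
SELECT Student.NAME, StudentCourse.COURSE_ID
FROM StudentCourse LEFT JOIN Student
  ON Student.ROLL_NO = StudentCourse.ROLL_NO;
""").fetchall()
print(sorted(rows, key=str))
```

Unmatched rows from either side survive with None in the missing column, mirroring the NULL entries in the result table above.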
4.Basics of data analysis and reporting
4.1Data Analysis
Data analytics is used to draw conclusions by processing raw data. It is helpful in various businesses as it helps a company make decisions based on the conclusions drawn from the data. Essentially, data analytics converts a large number of figures into plain English, i.e., conclusions that are further helpful in making in-depth decisions. Data analysis can be broadly categorized into four primary types: descriptive, diagnostic, predictive, and prescriptive. Each type serves a unique purpose in understanding data and making informed decisions.
Beyond production optimization, data analytics is utilized in diverse sectors. Gaming firms utilize it to design reward systems that engage players effectively, while content providers leverage analytics to optimize content placement and presentation, ultimately driving user engagement.
4.1.1Predictive Analytics
Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about a future event. Techniques that are used for predictive analytics are:
Linear Regression
Data Mining
Predictive modeling
Transaction profiling
Predictive analysis answers the question, "What might happen in the future?"
1. Define a Problem:
• Defining the problem means clearly expressing the challenge that the organization aims to address using data analysis.
2. Acquire and Organize Data:
• Once you define a problem statement, it is important to acquire and organize data properly.
• Acquiring data for predictive analytics means collecting and preparing relevant information and data from various sources like databases, data warehouses, external data providers, APIs, logs, surveys, and more that can be used to build and train predictive models.
3. Pre-process Data:
• Raw data collected from different sources is rarely in an ideal state for analysis. So, before developing a predictive model, the data needs to be pre-processed properly.
• Pre-processing involves cleaning the data to remove any kind of anomalies, handling missing data points, and addressing outliers that could be caused by errors of input, or transforming the data for further analysis.
• Pre-processing ensures that the data is of high quality and ready for model development.
4. Develop Predictive Models:
• Machine learning algorithms, regression models, decision trees, and neural networks are among the common techniques for this.
• These models are trained on the prepared data to identify correlations and patterns that can be used for making predictions.
5. Validate and Deploy Results:
• After building the predictive model, validation is a critical step to assess the accuracy and reliability of predictions.
• Data scientists rigorously evaluate the model's performance against known outcomes or test datasets.
• If required, modifications are implemented to improve the accuracy of the model.
• Deployment can be done through applications, websites, or data dashboards, making the insights easily accessible to decision makers or stakeholders.
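The pipeline above, from acquiring data through validating a model, can be sketched with a toy least-squares regression. All the numbers here are hypothetical, and a real project would use a proper library rather than hand-rolled formulas:

```python
# Toy predictive-analytics pipeline: fit y = a*x + b by least squares on
# hypothetical historical data, then validate on held-out observations.
history = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1)]   # (feature, outcome) pairs
holdout = [(5, 10.0), (6, 12.1)]                     # validation data

n = len(history)
mean_x = sum(x for x, _ in history) / n
mean_y = sum(y for _, y in history) / n

# Ordinary least-squares slope and intercept.
a = (sum((x - mean_x) * (y - mean_y) for x, y in history)
     / sum((x - mean_x) ** 2 for x, _ in history))
b = mean_y - a * mean_x

def predict(x):
    return a * x + b

# Validation step: mean absolute error on data the model never saw.
mae = sum(abs(predict(x) - y) for x, y in holdout) / len(holdout)
print(f"slope={a:.2f}, intercept={b:.2f}, holdout MAE={mae:.2f}")
```

If the holdout error is acceptable, the fitted model can be exposed through a dashboard or application, matching the deployment step described above.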
4.1.2Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight as to how to approach future events. It looks at past performance and understands it by mining historical data to understand the cause of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customer and product.
Let's have a closer look at how descriptive analytics works. Descriptive analytics involves analyzing and simplifying historical data to provide insights into previous events, trends, and patterns. It is much closer to reporting than to what most people think of as analytics.
1. Data Collection: Collecting useful information is the initial stage in the descriptive analytics process, drawing on multiple resources such as databases, spreadsheets, and other data repositories. Since they directly affect how accurate the descriptive analytics is, the accuracy and standard of the data are extremely important.
2. Cleaning the Data and Preprocessing: The obtained data usually needs to be cleaned and preprocessed before analysis can start. This includes converting data into a uniform structure, standardizing formats, and handling missing or incorrect values. Clean and well-preprocessed data ensures that the subsequent analytics is reliable.
3. Data Analysis: This provides an understanding of the structure and features of the dataset. Here EDA (exploratory data analysis) methods help to find the patterns, trends, and possible outliers in the data. These methods include making histograms, scatter plots, and summary statistics.
4. Compilation and Summary: The goal of descriptive analytics is to offer an overview of the data at a high level. This frequently requires aggregating the data to obtain important metrics and statistics, such as mean, median, mode, range, and standard deviation.
5. Visualization: In descriptive analytics, visualizations are extremely useful tools. To communicate complex information, a variety of charts, graphs, and other visual representations are employed. Data patterns and trends can be highlighted with the use of visualization, which also makes it easier to convey insights to a wide range of audiences.
6. Narrative Creation: Descriptive analytics can include the creation of narratives that offer a logical and contextualized explanation of the data, in addition to visuals. When communicating findings to those in the audience who might not be familiar with the complexities of the data, this can be especially helpful.
7. Interpretation: To obtain significant knowledge, analysts interpret the outcomes of descriptive analytics. This involves knowing the effects of the trends and patterns seen in the data. While interpretation provides the foundation for more in-depth analyses that investigate "why" and "what might happen in the future," descriptive analytics concentrates on the "what happened" topic.
8. Testing Actively: The process of descriptive analytics is not one-time. Organizations continually repeat descriptive analytics when new data becomes available in order to stay informed about the latest developments and patterns. This way, people making decisions get the newest information.
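The compilation-and-summary step above maps directly onto Python's standard statistics module. A minimal sketch, using an invented list of monthly sales figures:

```python
import statistics

# Hypothetical monthly sales figures to summarize (compilation step above).
sales = [120, 135, 150, 150, 160, 175, 200]

summary = {
    "mean":   statistics.mean(sales),            # average value
    "median": statistics.median(sales),          # middle value
    "mode":   statistics.mode(sales),            # most frequent value
    "range":  max(sales) - min(sales),           # spread of the data
    "stdev":  round(statistics.stdev(sales), 2), # sample standard deviation
}
print(summary)
```

These five numbers are exactly the high-level overview the step describes; a report would typically pair them with a chart from the visualization step.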
Advantages of Descriptive Analytics:
Presents data clearly: Descriptive analytics simplifies complex data, making it easy to understand through reports and visualizations like charts and graphs.
Convenient to realize: Data that has been summarized and graphically represented is easier to clarify and evaluate for a larger audience.
Identifies relevant data points: It offers straightforward metrics that give an accurate estimation of important data points.
Simple and cost-effective: Descriptive analytics is simple to use and requires just basic arithmetic knowledge for execution.
Efficient with tools: Tools like Python or MS Excel make descriptive analytics fast and easy.
Limitations of Descriptive Analytics:
Inability to analyze causes: The main goal of descriptive analytics is to describe historical events. It doesn't explore the root causes or reasons for the patterns that are seen.
Simplicity of analysis: The reach of descriptive analytics is restricted to basic analyses that look at the relationships between a small number of variables.
Doesn't explain why: It offers facts about the past, but causes and predictions are not provided to the readers.
Inappropriate for real-time decision-making: Normally, descriptive analytics involves getting summary information at intervals, and this might not be the best option for decision-making when time matters. In many situations fast responsiveness is vital, so relying only on descriptive analytics might drag you behind.
Inability to handle unstructured data: Structured and well-organized datasets are better suited for descriptive analytics. When analyzing semi-structured or unstructured data, such as text, photos, or multimedia, it can be challenging to offer insightful analysis.
4.1.3Diagnostic Analytics
Diagnostic analytics plays an important role in today's world of data in helping businesses to understand not just what happened but also why it happened. Think of it like solving a mystery, asking questions like "Why did sales fall?", "Why are customers leaving?" or "What caused this system breakdown?" By understanding data, businesses can identify the main reasons behind these issues and can take action to resolve them.
Purpose of Diagnostic Analytics
1. Identify Root Causes: It helps businesses understand data in a better way and to identify the key factors behind specific outcomes, which leads to clear, actionable insights.
2. Solve Problems: By identifying root causes, companies can choose targeted solutions to resolve issues and improve performance.
3. Inform Future Decisions: Understanding past events and their causes helps businesses to make data-driven decisions and develop smarter strategies.
Steps in Diagnostic Analytics:
1. Identify the Anomaly: Detect irregularities in data by using sources such as website logs, customer feedback, and financial records. This helps in identifying issues that require further investigation.
2. Data Collection: Collect data from various sources, which include transaction records, surveys, system logs, or other sources that provide important information to understand the situation better.
3. Data Exploration: Explore the collected data to identify trends, patterns, and correlations. Techniques like statistical analysis and data visualization help us to find insights that explain the anomaly.
4. Pattern Identification: Using data analysis methods like machine learning and correlation analysis helps to detect recurring patterns or trends. This step helps in linking the anomalies to potential causes.
5. Root Cause Analysis: Check the identified patterns to find the main cause of the issue. This step helps to answer questions like whether the cause is due to operational issues, external factors, or system issues.
6. Testing and Confirmation: Check the hypothesis using various tests or simulations, for example, testing whether a new website feature caused a decline in user engagement or whether it was due to a marketing change.
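The first step, spotting the anomaly, is often automated with a simple statistical rule. A minimal sketch using a z-score threshold on invented daily sign-up counts (the data and the threshold of 2 are assumptions for illustration):

```python
import statistics

# Flag anomalies in hypothetical daily sign-up counts using a z-score
# threshold; flagged days would then feed the collection/exploration steps.
signups = [50, 52, 49, 51, 48, 50, 12, 53]   # day 7 looks suspicious

mu = statistics.mean(signups)
sigma = statistics.stdev(signups)

# A day is anomalous if it lies more than 2 standard deviations from the mean.
anomalies = [(day, x) for day, x in enumerate(signups, start=1)
             if abs(x - mu) / sigma > 2]
print(anomalies)  # [(7, 12)]
```

In practice the threshold is tuned to the data; a value of 2 to 3 standard deviations is a common starting point.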
2. Improved Problem-Solving: By identifying root causes, businesses can follow targeted solutions that help in solving problems.
4. Enhanced Decision-Making: With data-driven insights, businesses can make more informed and strategic decisions. This helps in minimizing risks and ensures that all actions align with long-term goals.
5. Risk Reduction: Early detection of issues helps businesses to avoid risks before they escalate. By taking timely measures, companies can prevent disruptions and avoid costly mistakes.
6. Customer Satisfaction: It helps businesses to understand customer needs and preferences, which helps in creating personalized experiences, improving satisfaction, and building stronger customer loyalty.
4.1.4Prescriptive Analytics
Prescriptive analytics is the area of business analytics dedicated to finding the best solution for day-to-day problems. It is directly related to the other comparable processes, i.e., descriptive and predictive analytics. Prescriptive analytics can be defined as a type of data analytics that uses algorithms and analysis of raw data to achieve better and more effective decisions over both the long and short term. It suggests a strategy over possible scenarios, accumulated statistics, and past/present databases collected through the consumer community.
Step 1 Data Collection: Gather data for customers' locations, their requirements, company warehouses, and transportation.
Step 2 Mathematical Modeling: We will create mathematical models that handle supply chain data like customer location, time, warehouse location, and routes. We will also finalize an optimization function that minimizes company cost and delivery time.
Step 3 Optimization: We will use an optimization approach like linear programming or differential calculus to solve the mathematical models and find optimal locations.
Step 4 Scenario Analysis: We will perform a scenario analysis for our assumptions about the model variables.
Step 5 Decision Support: Based on our data modeling and the business knowledge we gained from the raw data, we will create dashboards and visualization graphs that will help stakeholders in making decisions.
Step 6 Implementation: The final and most important part after doing all the previous steps is to implement the changes that maximize the company's revenue.
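The optimization step can be illustrated in miniature. Instead of a full linear program, the sketch below brute-forces a tiny warehouse-placement problem over hypothetical coordinates, minimizing total straight-line distance to customers (all numbers are invented; a real supply-chain model would use an LP solver):

```python
# Toy prescriptive step: pick the warehouse site that minimizes total
# distance to customers (brute force stands in for linear programming).
customers = [(0, 0), (4, 0), (2, 6)]        # hypothetical customer locations
candidate_sites = [(1, 1), (2, 2), (3, 5)]  # hypothetical warehouse options

def total_distance(site):
    # Sum of Euclidean distances from this site to every customer.
    return sum(((site[0] - cx) ** 2 + (site[1] - cy) ** 2) ** 0.5
               for cx, cy in customers)

best = min(candidate_sites, key=total_distance)
print(best, round(total_distance(best), 2))
```

Scenario analysis (Step 4) would amount to re-running this with different assumed customer locations or an adjusted cost function.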
An accurate and comprehensive form of data aggregation and analysis also reduces human error and bias.
Removing immediate uncertainties helps in the prevention of fraud, limits risk, increases efficiency, and creates loyal customers.
Comparison of the three types of analytics:
Purpose — Descriptive: understand what happened in the past. Predictive: forecast what might happen in the future. Prescriptive: recommend actions to achieve desired outcomes.
Objective — Descriptive: historical understanding and trend analysis. Predictive: future prediction and risk assessment. Prescriptive: optimal decision-making and performance improvement.
Impact — Descriptive: historical insights for strategy refinement. Predictive: anticipating future scenarios for proactive decision-making. Prescriptive: maximizing outcomes and efficiency through informed actions.
1. For each data source, any updates are exported periodically into a staging area in Azure Data Lake Storage.
2. Azure Data Factory incrementally loads the data from Azure Data Lake Storage into staging tables in Azure Synapse Analytics. The data is cleansed and transformed during this process. PolyBase can parallelize the process for large datasets.
3. After loading a new batch of data into the warehouse, a previously created Azure Analysis Services tabular model is refreshed. This semantic model simplifies the analysis of business data and relationships.
4. Business analysts use Microsoft Power BI to analyze warehoused data via the Analysis Services semantic model.
Components
Oracle on-premises
Azure Cosmos DB
Data is loaded from these different data sources using several Azure components:
Azure Data Lake Storage is used to stage source data before it's loaded into Azure Synapse.
Data Factory orchestrates the transformation of staged data into a common structure in Azure Synapse. Data Factory uses PolyBase when loading data into Azure Synapse to maximize throughput.
Azure Synapse is a distributed system for storing and analyzing large datasets. Its use of massively parallel processing (MPP) makes it suitable for running high-performance analytics. Azure Synapse can use PolyBase to rapidly load data from Azure Data Lake Storage.
Analysis Services provides a semantic model for your data. It can also increase system performance when analyzing your data.
Power BI is a suite of business analytics tools to analyze data and share insights. Power BI can query a semantic model stored in Analysis Services, or it can query Azure Synapse directly.
Microsoft Entra ID authenticates users who connect to the Analysis Services server through Power BI. Data Factory can also use Microsoft Entra ID to authenticate to Azure Synapse via a service principal or Managed Identity for Azure resources.
4.2Reporting
Reports on data analysis are essential for communicating data-driven insights to decision-makers, stakeholders, and other pertinent parties. These reports provide an organized format for presenting conclusions, analyses, and suggestions derived from data set analysis.
These reports are crucial in various fields such as business, science, healthcare, finance, and government, where data-driven decision-making is essential. A report combines quantitative and qualitative data to evaluate past performance, understand current trends, and make informed recommendations for the future. Think of it as a translator, taking the language of numbers and transforming it into a clear and concise story that guides decision-making.
2. Performance Evaluation: Data analysis reports are used by organizations to assess how well procedures, goods, or services are working. Through the examination of pertinent metrics and key performance indicators (KPIs), enterprises may pinpoint opportunities for improvement and maximize productivity.
3. Risk Management: Within a company, data analysis reports may be used to detect possible dangers, difficulties, or opportunities. Businesses may reduce risks and take advantage of new possibilities by examining past data and predicting future patterns.
Creating a well-structured outline is like drawing a roadmap for your report. It acts as a guide to organize your thoughts and content logically. Begin by identifying the key sections of the report, such as the introduction, methodology, findings, analysis, conclusions, and recommendations. Within each section, break down the specific points or subtopics you want to address. This step-by-step approach not only streamlines the writing process but also ensures that you cover all essential elements of your analysis. Moreover, an outline helps you maintain focus and prevents you from veering off track, ensuring that your report remains coherent and easy to follow for your audience.
In a data analysis report, it's crucial to prioritize the most relevant key performance indicators (KPIs) to avoid overwhelming your audience with unnecessary information. Start by identifying the KPIs that directly impact your business objectives and overall performance. These could include metrics like revenue growth, customer retention rates, conversion rates, or website traffic. By focusing on these key metrics, you give the audience a report with actionable insights that drive strategic decision-making. Additionally, consider contextualizing these KPIs within your industry or market landscape to provide a comprehensive understanding of your performance relative to competitors or benchmarks.
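Two of the KPIs mentioned above reduce to simple ratios over raw counts. A minimal sketch with entirely invented figures:

```python
# Hypothetical KPI computation for a data analysis report: conversion
# and retention rates from raw counts (all figures are invented).
visitors, purchases = 12_500, 375
customers_start, customers_end, new_customers = 800, 860, 120

# Conversion rate: share of visitors who made a purchase.
conversion_rate = purchases / visitors * 100

# Retention rate: customers kept from the starting base, excluding
# customers acquired during the period.
retention_rate = (customers_end - new_customers) / customers_start * 100

print(f"conversion: {conversion_rate:.1f}%  retention: {retention_rate:.1f}%")
```

Definitions of retention vary between organizations; the formula above is one common convention and should be stated explicitly in the report.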
Data visualization plays a pivotal role in conveying complex information in a clear and engaging manner. When selecting visualization tools, consider the nature of the data and the story you want to tell. For instance, if you're illustrating historical trends, timelines or line graphs can effectively showcase patterns over time. On the other hand, if you're comparing categorical data, pie charts or bar graphs might be more suitable. The key is to choose visualization methods that accurately represent your findings and facilitate comprehension for your audience. Additionally, pay attention to design principles such as color contrast, labeling, and scale to ensure that your visuals are both informative and visually appealing.
Transforming your data into a compelling narrative is essential for engaging your audience and highlighting key insights. Instead of presenting raw data, strive to tell a story that contextualizes the numbers and unveils their significance.
Start by identifying specific events or trends in the data and explore the underlying reasons behind them. For example, if you notice a sudden spike in sales, investigate the marketing campaign or external factors that may have contributed to this increase. By weaving these insights into a cohesive narrative, you can guide your audience through your analysis and make your findings more memorable and impactful. Remember to keep your language clear and concise, avoiding jargon or technical terms that may confuse your audience.
Establishing a clear information hierarchy is essential for ensuring that your report is easy to navigate and understand. Start by outlining the main points or sections of your report and consider the logical flow of information. Typically, it's best to start with broader, more general information and gradually delve into specifics as needed. This approach helps orient your audience and provides them with a framework for understanding the rest of the report.
To create an effective summary, distill the main points of the report into a few succinct paragraphs. Focus on highlighting the most significant insights and outcomes, avoiding unnecessary details or technical language. Consider the needs of your audience and tailor the summary to address their interests and priorities. By providing a clear and concise summary upfront, you set the stage for the rest of the report and help busy readers grasp the essence of your analysis quickly.
Effective communication of data analysis findings goes beyond simply reporting the numbers; it involves providing actionable recommendations that drive decision-making and facilitate improvements. When offering recommendations, remain objective and avoid assigning blame for any negative outcomes. Instead, focus on identifying solutions and suggesting practical steps for addressing challenges or leveraging opportunities.
Consider the implications of your findings for the broader business strategy and provide specific guidance on how to implement changes or initiatives. Moreover, prioritize recommendations that are realistic, achievable, and aligned with the organization's goals and resources. By offering actionable recommendations, you demonstrate the value of your analysis and empower stakeholders to take proactive steps towards improvement.
The presentation format of the report is as crucial as its content, as it directly impacts the effectiveness of your communication and engagement with your audience. Interactive dashboards offer a dynamic and visually appealing way to present data, allowing users to explore and interact with the information in real time.