Exploratory Data Analytics and Data
Visualization
UNIT I
VII SEMESTER
DS-433T
CSE Dept., BVCOE, New Delhi. Subject: EDADV. Instructor: Dr. Srishti Vashishtha
What is Exploratory Data Analysis
(EDA)?
Exploratory Data Analysis (EDA) is a crucial initial step in data
science projects. It involves analyzing and visualizing data sets to
understand their key characteristics, uncover patterns, locate
outliers, and identify relationships between variables.
EDA is normally carried out as a preliminary step before
undertaking more formal statistical analyses or modeling.
Key aspects of EDA include:
• Distribution of Data: Examining the distribution of data points
to understand their range, central tendencies (mean, median),
and dispersion (variance, standard deviation).
• Graphical Representations: Utilizing charts such as histograms,
box plots, scatter plots, and bar charts to visualize relationships
within the data and distributions of variables.
• Outlier Detection: Identifying unusual values that deviate from
other data points. Outliers can influence statistical analyses and
might indicate data entry errors or unique cases.
• Correlation Analysis: Checking the relationships between
variables to understand how they might affect each other. This
includes computing correlation coefficients and creating
correlation matrices.
• Handling Missing Values: Detecting and deciding how to
address missing data points, whether by imputation or removal,
depending on their impact and the amount of missing data.
• Summary Statistics: Calculating key statistics that provide
insight into data trends and nuances.
• Testing Assumptions: Many statistical tests and models
assume the data meet certain conditions (like normality or
homoscedasticity). EDA helps verify these assumptions.
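The aspects above can be sketched on a small sample using only Python's standard library. The dataset here is invented for illustration; the outlier rule shown is the common 1.5 × IQR fence:

```python
import statistics

# Hypothetical sample: daily sales figures and ad spend (illustrative only)
sales = [120, 115, 130, 125, 118, 122, 410, 128, 119, 124]
ads = [10, 9, 12, 11, 9, 10, 11, 12, 9, 10]

# Distribution of data: central tendency and dispersion
print("mean:", statistics.mean(sales))
print("median:", statistics.median(sales))
print("stdev:", statistics.stdev(sales))

# Outlier detection with the 1.5*IQR rule
q1, q2, q3 = statistics.quantiles(sales, n=4)
iqr = q3 - q1
outliers = [x for x in sales if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print("outliers:", outliers)   # 410 stands out from the rest

# Correlation analysis: Pearson coefficient computed by hand
mx, my = statistics.mean(sales), statistics.mean(ads)
cov = sum((x - mx) * (y - my) for x, y in zip(sales, ads))
corr = cov / (sum((x - mx) ** 2 for x in sales) ** 0.5
              * sum((y - my) ** 2 for y in ads) ** 0.5)
print("correlation:", round(corr, 3))
```

In practice libraries such as pandas wrap all of these steps, but the underlying statistics are the same.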
Why is Exploratory Data Analysis Important?
Exploratory Data Analysis (EDA) is important for several reasons,
especially in the context of data science and statistical modeling. Here
are some of the key reasons why EDA is a critical step in the data
analysis process:
1. Understanding Data Structures: EDA helps in getting familiar with the
dataset, understanding the number of features, the type of data in each
feature, and the distribution of data points. This understanding is
crucial for selecting appropriate analysis or prediction techniques.
2. Identifying Patterns and Relationships: Through visualizations and
statistical summaries, EDA can reveal hidden patterns and intrinsic
relationships between variables. These insights can guide further
analysis and enable more effective feature engineering and model
building.
3. Detecting Anomalies and Outliers: EDA is essential for identifying
errors or unusual data points that may adversely affect the results of
your analysis. Detecting these early can prevent costly mistakes in
predictive modeling and analysis.
4. Testing Assumptions: Many statistical models assume that data
follow a certain distribution or that variables are independent. EDA
involves checking these assumptions. If the assumptions do not
hold, the conclusions drawn from the model could be invalid.
5. Informing Feature Selection and Engineering: Insights gained
from EDA can inform which features are most relevant to include in
a model and how to transform them (scaling, encoding) to improve
model performance.
6. Optimizing Model Design: By understanding the data’s
characteristics, analysts can choose appropriate modeling
techniques, decide on the complexity of the model, and better tune
model parameters.
7. Facilitating Data Cleaning: EDA helps in spotting missing values
and errors in the data, which are critical to address before further
analysis to improve data quality and integrity.
8. Enhancing Communication: Visual and statistical summaries from
EDA can make it easier to communicate findings and convince
others of the validity of your conclusions, particularly when
explaining data-driven insights to stakeholders without technical
backgrounds.
Introduction to Data Analytics
Data analytics is a multidisciplinary field that employs a wide
range of analysis techniques, including math, statistics, and
computer science, to draw insights from data sets. Data
analytics is a broad term that includes everything from simply
analyzing data to theorizing ways of collecting data and
creating the frameworks needed to store it.
It is the process of examining data sets to find trends and draw
conclusions about the information they contain. Increasingly,
data analytics is done with the help of specialized systems and
software. Data analytics technologies and techniques are
widely used in commercial industries to enable organizations to
make more informed business decisions.
Different Sources of Data
In the process of big data analysis, “Data collection” is the initial
step before starting to analyze the patterns or useful information
in data.
The data which is to be analyzed must be collected from different
valid sources.
The main goal of data collection is to collect information-rich data.
Collected data is divided into two main types:
1. Primary Data
2. Secondary Data
1. Primary data:
Primary data is raw, original data extracted directly from
official sources. It is collected first-hand through techniques
such as questionnaires, interviews, and surveys. The data
collected must match the demands and requirements of the
target audience on which the analysis is performed; otherwise
it becomes a burden during data processing. A few methods of
collecting primary data:
A. Interview method:
Here data is collected by interviewing the target audience: the
person asking the questions is the interviewer and the person
answering them is the interviewee. Basic business- or product-related
questions are asked and recorded as notes, audio, or video, and this
data is stored for processing. Interviews can be structured or
unstructured, and conducted face to face, by telephone, over
email, etc.
B. Survey method:
In the survey method, a list of relevant questions is asked and the
answers are recorded as text, audio, or video. Surveys can be conducted
both online and offline, for example through website forms, email, or
social media polls, and the responses are then stored for analysis.
C. Observation method:
In the observation method, the researcher keenly observes the behavior
and practices of the target audience using a data-collecting tool and
stores what is observed as text, audio, video, or another raw format.
Rather than posing questions, the researcher gathers data by watching
participants directly; for example, observing a group of customers and
their behavior towards a product. The data obtained is then sent for
processing.
D. Experimental method:
The experimental method collects data by performing experiments,
research, and investigation. The most frequently used experimental
designs are CRD, RBD, LSD, and FD.
• CRD – Completely Randomized Design is a simple experimental design
  based on randomization and replication. It is mostly used for
  comparing treatments.
• RBD – Randomized Block Design divides the experiment into small
  units called blocks. Random experiments are performed on each block
  and the results are analyzed using a technique known as analysis of
  variance (ANOVA). RBD originated in the agriculture sector.
• LSD – Latin Square Design is similar to CRD and RBD but arranges
  treatments in rows and columns. It is an N×N arrangement in which
  each symbol (letter) occurs exactly once in every row and every
  column, so treatment differences can be found with fewer errors.
  A Sudoku puzzle is an example of a Latin square.
• FD – Factorial Design studies two or more factors, each with several
  possible levels; trials are run over the combinations of factor
  levels so that the effect of each factor, and of their interactions,
  can be derived.
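The Latin square idea is easy to see in code. A minimal sketch that builds an N×N Latin square by cyclic shifts (one simple construction among many):

```python
def latin_square(n):
    """Build an n x n Latin square: each symbol occurs exactly once
    in every row and every column, here produced by cyclic shifts."""
    symbols = [chr(ord('A') + i) for i in range(n)]
    return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]

square = latin_square(4)
for row in square:
    print(" ".join(row))
# Because every letter appears once per row and once per column,
# treatment differences can be separated from row and column effects.
```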
2. Secondary data:
Secondary data is data that has already been collected and is
reused for another valid purpose. It is previously recorded
(often derived from primary data) and comes from two types of
sources: internal and external.
A. Internal source:
This data can easily be found within the organization, such as
market records, sales records, transactions, customer data,
accounting resources, etc. Obtaining data from internal sources
costs less time and money.
B. External source:
Data that cannot be found within the organization and must be
obtained through external third-party resources is external
source data. The cost and time required are greater because
these sources contain huge amounts of data. Examples of
external sources are government publications, news publications,
the Registrar General of India, the Planning Commission, the
International Labour Bureau, syndicate services, and other
non-governmental publications.
C. Other sources:
• Sensor data: With the advancement of IoT devices, sensors
  collect data that can be used for sensor-data analytics to
  track the performance and usage of products.
• Satellite data: Satellites capture terabytes of images and
  data daily, which can be mined for useful information.
• Web traffic: With fast and cheap internet access, many forms
  of data uploaded by users on different platforms can be
  collected, with their permission, for data analysis. Search
  engines also provide data on the keywords and queries searched
  most often.
Classification of Data
Big Data involves huge volume, high velocity, and an extensive
variety of data. There are three types: structured data,
semi-structured data, and unstructured data.
1. Structured data – Structured data is data whose elements are
addressable for effective analysis. It has been organized into a
formatted repository, typically a database. It covers all data
that can be stored in a SQL database in tables with rows and
columns. Such data has relational keys and can easily be mapped
into pre-designed fields. Structured data is the most processed
in application development and the simplest kind of information
to manage. Example: relational data.
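As a quick illustration of structured data with relational keys, here is a sketch using Python's built-in sqlite3 module; the table and values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# Structured data: a pre-designed table with rows, columns, and a relational key
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Asha", "Delhi"), (2, "Ravi", "Mumbai")])

# Every element is addressable for effective analysis
cur.execute("SELECT name FROM customers WHERE city = ?", ("Delhi",))
rows = cur.fetchall()
print(rows)   # [('Asha',)]
conn.close()
```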
2. Semi-structured data – Semi-structured data is information
that does not reside in a relational database but has some
organizational properties that make it easier to analyze. With
some processing you can store it in a relational database
(though this can be very hard for some kinds of semi-structured
data); semi-structured formats exist to ease the storage of
such data. Example: XML data.
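A small sketch of why XML counts as semi-structured: tags and nesting give it organizational properties, but records need not share one fixed schema. The document below is made up; it is parsed with Python's built-in xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# Semi-structured: tags organize the data, but the two <order> records
# do not share an identical schema (the second has no <qty>)
doc = """<orders>
  <order id="1"><item>pen</item><qty>3</qty></order>
  <order id="2"><item>notebook</item></order>
</orders>"""

root = ET.fromstring(doc)
for order in root.findall("order"):
    qty = order.findtext("qty", default="?")   # tolerate the missing field
    print(order.get("id"), order.findtext("item"), qty)
```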
3. Unstructured data – Unstructured data is data that is not
organized in a predefined manner and does not have a predefined
data model, so it is not a good fit for a mainstream relational
database. Alternative platforms exist for storing and managing
unstructured data; it is increasingly prevalent in IT systems
and is used by organizations in a variety of business
intelligence and analytics applications. Example: Word, PDF,
text, media logs.
Difference between types of data
Technology:
• Structured data – based on a relational database table.
• Semi-structured data – based on XML/RDF (Resource Description Framework).
• Unstructured data – based on character and binary data.
Transaction management:
• Structured – matured transactions and various concurrency techniques.
• Semi-structured – transactions adapted from the DBMS, not matured.
• Unstructured – no transaction management and no concurrency.
Version management:
• Structured – versioning over tuples, rows, and tables.
• Semi-structured – versioning over tuples or graphs is possible.
• Unstructured – versioned as a whole.
Flexibility:
• Structured – schema-dependent and less flexible.
• Semi-structured – more flexible than structured data but less flexible than unstructured data.
• Unstructured – more flexible; there is no schema.
Scalability:
• Structured – scaling the DB schema is very difficult.
• Semi-structured – scaling is simpler than for structured data.
• Unstructured – more scalable.
Robustness:
• Structured – very robust.
• Semi-structured – newer technology, not widely spread.
• Unstructured – not characterized.
Query performance:
• Structured – structured queries allow complex joins.
• Semi-structured – queries over anonymous nodes are possible.
• Unstructured – only textual queries are possible.
Big Data Platform
What is Big Data?
Big data is a term used to describe data of great variety, huge
volumes, and even more velocity. Apart from the significant
volume, big data is also complex such that none of the
conventional data management tools can effectively store or
process it. The data can be structured or unstructured.
Examples of big data include mobile phone details, social media content,
health records, transactional data, web searches, financial documents,
and weather information.
Big data can be generated by users (emails, images, transactional data,
etc.) or by machines (IoT devices, ML algorithms, etc.). Depending on the
owner, the data may be made commercially available to the public through
an API or FTP; in some instances, a subscription may be required to gain
access to it.
Characteristics of a Big Data Platform
Any good big data platform should have the following important
features:
• Ability to accommodate new applications and tools
depending on the evolving business needs
• Support several data formats
• Ability to accommodate large volumes of streaming or at-
rest data
• Have a wide variety of conversion tools to transform data to
different preferred formats
• Capacity to accommodate data at any speed
• Provide tools for searching through massive data sets
• The ability for quick deployment
• Have the tools for data analysis and reporting requirements
Big Data Platform
The constant stream of information from various sources is
becoming more intense, especially with the advance in technology.
And this is where big data platforms come in to store and analyze
the ever-increasing mass of information.
A big data platform is an integrated computing solution that
combines numerous software systems, tools, and hardware for
big data management. It is a one-stop architecture that solves all
the data needs of a business regardless of the volume and size of
the data at hand. Due to their efficiency in data management,
enterprises are increasingly adopting big data platforms to gather
tons of data and convert them into structured, actionable business
insights.
Currently, the marketplace is flooded with numerous open-source and
commercially available big data platforms, boasting different features
and capabilities for use in a big data environment.
Big Data Platforms are a complete system that helps
organisations work with large and complex datasets.
It gives them the tools and technology to turn raw
data into valuable information. This, in turn, helps
them make decisions based on data and come up
with innovative solutions.
Importance of Big Data Platforms
Big Data has transformed the way businesses operate, and it has become
a valuable resource that, when harnessed effectively, can drive
innovation, enhance decision-making, and create competitive advantages.
This is where Big Data Platforms play a pivotal role.
a) Data-driven decision-making: In the past, decisions relied on intuition,
but today, data-backed decisions are crucial. Big Data Platforms enable
real-time collection, processing, and analysis of vast datasets,
empowering businesses to make informed decisions, identify trends, and
predict outcomes more accurately.
b) Improved customer experiences: Understanding customer behaviour is
vital for personalised experiences. Big Data Platforms gather and analyse
data from various touchpoints, like websites and social media. This allows
companies to tailor products, services, and marketing to individual needs,
boosting customer satisfaction and loyalty.
c) Enhanced operational efficiency: Big Data Platforms
streamline operations by optimising processes and reducing
waste. For instance, in manufacturing, real-time data
analysis identifies bottlenecks and maintenance needs, saving
costs. In logistics, it optimises routes and reduces fuel
consumption, improving overall efficiency.
d) Innovation and product development: Big Data Platforms
drive innovation by revealing market trends and consumer
behaviour. Analysing large datasets helps companies identify gaps
and develop products that meet demand, driving revenue and
maintaining a competitive edge.
e) Fraud detection and security: In an era of cyber threats, Big
Data Platforms swiftly detect and mitigate risks by analysing real-
time patterns and anomalies, bolstering security with robust
access controls and encryption.
f) Healthcare advancements: Big Data Platforms revolutionise
healthcare by analysing patient data and genomic information,
leading to advances in disease detection, drug development, and
personalised medicine.
g) Competitive advantage: Firms leveraging Big Data Platforms
adapt swiftly to market changes, capitalise on opportunities, and
deliver superior products and services, gaining a competitive edge.
h) Scientific and research advancements: Beyond business, Big Data
Platforms accelerate scientific research by analysing vast datasets,
facilitating breakthroughs in fields like climate science and genomics.
i) Government and social impact: Public organisations utilise data
analytics from Big Data Platforms to enhance services, allocate
resources optimally, and make informed decisions, improving citizens'
quality of life.
Popular Big Data Platforms
Big Data Platforms are like the superheroes of the digital world, capable
of handling massive amounts of data and turning it into valuable
information. Here, we'll introduce you to a list of Big Data Platforms:
a) Apache Hadoop: Apache Hadoop is a platform that's excellent at
storing and processing large volumes of data. It's like a robust storage
and data processing system that companies use to handle and manage
massive datasets.
b) Apache Spark: Apache Spark is known for its speed and efficiency
in analysing data. It's like a powerful tool that helps organisations quickly
make sense of their data and extract valuable insights from it.
c) Apache Flink: Apache Flink is another data processing
platform, similar to Spark, that specializes in real-time data analysis. It's
used for tasks where speed and low latency are critical, like monitoring
online activities or financial transactions.
d) Amazon Web Services (AWS) Big Data services: AWS offers
a suite of Big Data services that run in the cloud. These services
make it easier for companies to store, process, and analyse data
without the need for extensive infrastructure management.
e) Google Cloud Platform (GCP) Big Data services: Similar
to AWS, Google Cloud Platform provides a range of Big Data
services in the cloud. These services help organisations leverage
Google's computing power and data analytics capabilities.
f) Microsoft Azure Big Data services: Microsoft Azure offers
various Big Data services, including data storage, processing, and
analytics tools. These services are designed to help businesses
work with their data efficiently and effectively.
Benefits of using Big Data Platforms
There are various benefits of using Big Data Platforms, which are
discussed below:
a) Better decision-making: Big Data Platforms help organisations
make smarter decisions by providing insights from vast datasets,
ensuring choices are based on facts rather than guesswork.
b) Cost efficiency: These platforms streamline data storage and
processing, reducing infrastructure costs and making Data
Management more affordable.
c) Real-time insights: Big Data Platforms enable real-time Data
Analysis, allowing companies to respond quickly to changing
situations and seize opportunities as they arise.
d) Data Integration: They help integrate data from various
sources, creating a unified view of information and facilitating
comprehensive analysis.
e) Enhanced decision-making: With better insights, organisations can
tailor their products, services, and strategies to meet customer needs
more effectively, increasing satisfaction and loyalty.
f) Scalability: Big Data Platforms can expand effortlessly to
accommodate growing data volumes, ensuring they remain effective as
organisations evolve.
g) Competitive advantage: Those who harness Big Data gain an edge by
staying ahead of the competition and providing superior products and
services.
h) Innovation: These platforms spark innovation by revealing trends,
gaps, and opportunities, driving the development of new products and
services.
i) Security: Big Data Platforms offer robust security features to protect
sensitive data, mitigating risks in an increasingly complex cybersecurity
landscape.
j) Efficiency across industries: They improve operations across sectors,
from manufacturing to healthcare, increasing efficiency and reducing
waste.
Need for Data Analytics
1. Data analytics gives product development a reliable
understanding of future requirements. The company can understand
the current market situation of a product and use analytic
techniques to develop new products as per market requirements.
Companies also hire fresh talent for their data analytics teams,
so learning the skills and tools of data analytics helps you
understand a company's requirements for developing new products
and focus on the competitive advantages needed to predict future
trends.
2. Data analytics targets the main audience of the business by
identifying trends and patterns in the data sets, helping the
business grow and optimise its performance.
3. Data analytics helps to identify opportunities and solve
problems, for example by eliminating non-required data. A company
always looks for ways to maximize its profits, and analytics
identifies the main areas where mistakes can be rectified. The
data analyst helps to analyze the situation and provide solutions,
and several types of tools help remove errors and surface the
best results. The company can then examine those areas, improve
them, and tailor its offerings to maximize profits.
4. Data analysis also helps in the marketing and advertising of the
business to make it popular and thus more customers will know
about the business.
5. Data analysis shows the areas where the business needs more
resources, products, and money, and where the right amount of
interaction with customers is not happening. By identifying these
problems and then working on them, the business can grow.
6. The valuable information extracted from raw data can benefit
the organisation by illuminating present situations and predicting
future outcomes.
7. Through data analytics the business can target the right
audience and learn about customers' disposable income and spending
habits, which helps it set prices according to the interests and
budgets of customers.
Data Analytic Process
Steps for Data Analysis Process
1. Define the Problem or Research Question
2. Collect Data
3. Data Cleaning
4. Analyzing the Data
5. Data Visualization
6. Presenting Data
1. Define the Problem or Research Question
In the first step of process the data analyst is given a
problem/business task. The analyst has to understand the
task and the stakeholder’s expectations for the solution. A
stakeholder is a person who has invested their money and
resources in a project. The analyst must be able to ask
different questions in order to find the right solution to their
problem. The analyst has to find the root cause of the
problem in order to fully understand the problem. The
analyst must make sure that he/she doesn’t have any
distractions while analyzing the problem. Communicate
effectively with the stakeholders and other colleagues to
completely understand what the underlying problem is.
2. Collect Data
The second step is to prepare or collect the data. This step includes
collecting data and storing it for further analysis. The analyst
collects the data for the given task from multiple sources, internal
or external.
Internal data is the data available in the organization that you work for
while external data is the data available in sources other than your
organization. The data that is collected by an individual from their own
resources is called first-party data. The data that is collected and sold is
called second-party data. Data that is collected from outside sources is
called third-party data.
The common sources from where the data is collected are Interviews,
Surveys, Feedback, Questionnaires. The collected data can be stored in a
spreadsheet or SQL database.
The best tools to store the data are MS Excel or Google Sheets in the
case of spreadsheets, and there are many databases, such as Oracle or
Microsoft SQL Server, for storing the data.
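The storage step above can be sketched with Python's csv module, writing hypothetical survey responses in a spreadsheet-style format (an in-memory buffer stands in for a file on disk):

```python
import csv
import io

# Hypothetical survey responses collected from a questionnaire
responses = [
    {"respondent": "R1", "age": 25, "satisfied": "yes"},
    {"respondent": "R2", "age": 40, "satisfied": "no"},
]

buffer = io.StringIO()                      # stands in for a .csv file on disk
writer = csv.DictWriter(buffer, fieldnames=["respondent", "age", "satisfied"])
writer.writeheader()
writer.writerows(responses)

# The stored rows can be read back for the analysis steps that follow
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["satisfied"])   # yes
```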
3. Data Cleaning
The third step is Clean and Process Data. After the data is
collected from multiple sources, it is time to clean the data. Clean
data means data that is free from misspellings, redundancies, and
irrelevance.
Clean data largely depends on data integrity. There might be
duplicate data, or the data might not be in a consistent format;
such unnecessary data is removed and the rest is cleaned.
There are different functions provided by SQL and Excel to clean
the data. This is one of the most important steps in Data Analysis
as clean and formatted data helps in finding trends and solutions.
4. Analyzing the Data
The fourth step is to Analyze. The cleaned data is used for
analyzing and identifying trends. It also performs calculations and
combines data for better results. The tools used for performing
calculations are Excel and SQL. These tools provide built-in
functions to perform calculations, or queries can be written in
SQL to perform them.
Using Excel, we can create pivot tables and perform calculations
while SQL creates temporary tables to perform calculations.
Programming languages are another way of solving problems.
They make it much easier to solve problems by providing
packages. The most widely used programming languages for data
analysis are R and Python.
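The pivot-table idea mentioned above can be approximated in plain Python with a grouped aggregation; the transaction data is invented:

```python
from collections import defaultdict

# Cleaned transactions as (region, amount) pairs -- illustrative values
transactions = [("North", 100), ("South", 80), ("North", 120), ("South", 60)]

# Group and combine data, like a pivot table's row totals
totals = defaultdict(int)
for region, amount in transactions:
    totals[region] += amount

for region, total in sorted(totals.items()):
    print(region, total)
# North 220
# South 140
```

In Excel this is a pivot table; in SQL it is `SELECT region, SUM(amount) ... GROUP BY region`.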
5. Data Visualization
Nothing is more compelling than a visualization. The data now
transformed has to be made into a visual (chart, graph). The reason for
making data visualizations is that there might be people, mostly
stakeholders that are non-technical.
Visualizations are made for a simple understanding of complex data.
Tableau and Looker are the two popular tools used for compelling data
visualizations. Tableau is a simple drag and drop tool that helps in
creating compelling visualizations. Looker is a data visualization
tool that directly connects to the database and creates
visualizations. R and Python also have packages that produce
beautiful data visualizations; R, for example, has the ggplot2
package, which offers a wide variety of plots. A presentation is
then given based on the data findings.
Sharing the insights with the team members and stakeholders will help
in making better decisions. It helps in making more informed decisions
and it leads to better outcomes.
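As a library-free stand-in for the chart tools named above, a tiny text histogram shows the same idea of turning numbers into a visual; the ratings data is made up:

```python
from collections import Counter

# Hypothetical ratings given by customers (1-5 stars)
ratings = [5, 4, 5, 3, 4, 5, 2, 4, 5, 3]

counts = Counter(ratings)
for value in sorted(counts):
    print(f"{value} | {'#' * counts[value]}")
# 2 | #
# 3 | ##
# 4 | ###
# 5 | ####
```

Tableau, Looker, matplotlib, or ggplot2 would render the same distribution as a proper bar chart or histogram.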
6. Presenting the Data
Presenting the data involves transforming raw information into a
format that is easily comprehensible and meaningful for various
stakeholders. This process encompasses the creation of visual
representations, such as charts, graphs, and tables, to effectively
communicate patterns, trends, and insights gleaned from the data
analysis.
The goal is to facilitate a clear understanding of complex information,
making it accessible to both technical and non-technical audiences.
Effective data presentation involves thoughtful selection of
visualization techniques based on the nature of the data and the
specific message intended. It goes beyond mere display to storytelling,
where the presenter interprets the findings, emphasizes key points, and
guides the audience through the narrative that the data unfolds.
Whether through reports, presentations, or interactive dashboards, the
art of presenting data involves balancing simplicity with depth,
ensuring that the audience can easily grasp the significance of the
information presented and use it for informed decision-making.
Reporting vs. Analytics
Reporting is the process of gathering and presenting
data in a structured format such as graphs and tables.
Organizing information in predefined KPIs and metrics
makes it easier for you to understand what is
happening.
Analytics is the process of analyzing your data to
identify patterns and gain insights. Using techniques
such as predictive and prescriptive analytics helps you
understand why things are happening and what to do
next.
Analytics:
• Analytics is the method of examining and analyzing summarized data
to make business decisions.
• Questioning the data, understanding it, investigating it, and
presenting it to the end users are all part of analytics.
• The purpose of analytics is to draw conclusions based on data.
• Analytics is used by data analysts, scientists, and business people
to make effective decisions.

Reporting:
• Reporting is an action that includes all the needed information and
data, put together in an organized way.
• Identifying business events, gathering the required information,
organizing, summarizing, and presenting existing data are all part
of reporting.
• The purpose of reporting is to organize the data into meaningful
information.
• Reporting is provided to the appropriate business leaders to perform
effectively and efficiently within a firm.
Modern Data Analytic Tools
Apache Hadoop :-
• Apache Hadoop is a free, Java-based big data analytics framework.
• It enables effective storage of huge amounts of data in a group of
machines known as a cluster.
• It runs in parallel on a cluster and can process huge data across
all the nodes in it.
• Hadoop's storage system, popularly known as the Hadoop Distributed
File System (HDFS), splits large volumes of data into blocks and
distributes them across the many nodes in a cluster.
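The splitting-and-distributing behavior described above can be sketched conceptually in Python. This is illustrative only, not the real HDFS API: real HDFS uses 128 MB blocks by default and also replicates each block for fault tolerance, while the node names here are invented.

```python
# Conceptual sketch of how HDFS splits a file into fixed-size blocks
# and assigns them to cluster nodes. Block size and node names are
# illustrative, not the real HDFS defaults or API.
BLOCK_SIZE = 4                     # bytes, for the demo; 128 MB in real HDFS
NODES = ["node1", "node2", "node3"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut the data into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES):
    """Assign each block to a node round-robin."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"abcdefghij")   # 10 bytes -> 3 blocks
placement = place_blocks(blocks)
print(blocks)      # [b'abcd', b'efgh', b'ij']
print(placement)   # {0: 'node1', 1: 'node2', 2: 'node3'}
```

Because each node holds only some blocks, processing can run on all nodes in parallel, which is the point of the cluster design.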
• KNIME :-
• The KNIME analytics platform is one of the leading open
solutions for data-driven innovation.
• This tool helps in discovering the potential hidden in huge
volumes of data; it can also mine for fresh insights or
predict new outcomes.
• OpenRefine :-
• The OpenRefine tool is one of the most efficient tools for
working with messy, large volumes of data.
• Its features include cleansing data and transforming it from
one format to another.
• It helps to explore large data sets easily.
• Orange :-
• Orange is a famous open-source data visualization and data
analysis tool for beginners as well as experts.
• It provides interactive workflows, with a large toolbox for
creating them, which help in analyzing and visualizing data.
• RapidMiner :-
• The RapidMiner tool operates using visual programming and is
highly capable of manipulating, analyzing, and modeling data.
• RapidMiner makes data science teams more productive by
providing an open-source platform for all of their jobs, such
as machine learning, data preparation, and model deployment.
• R-programming :-
• R is a free, open-source programming language and software
environment for statistical computing and graphics.
• It is used by data miners for developing statistical software
and for data analysis.
• It has become a highly popular tool for big data in
recent years.
• Datawrapper :-
• It is an online data visualization tool for making interactive
charts.
• It accepts data files in CSV, PDF, or Excel format.
• Datawrapper generates visualizations in the form of bar charts,
line charts, maps, etc. These can be embedded into any other
website as well.
• Tableau :-
• Tableau is another popular big data tool. It is simple and
very intuitive to use.
• It communicates the insights of the data through data
visualization.
• Through Tableau, an analyst can check a hypothesis and
explore the data before starting to work on it extensively.
Data Analytics Applications
Healthcare
Data analytics is revolutionizing the healthcare
industry by enabling better patient care, disease
prevention, and resource optimization. For
example, hospitals can analyze patient data to
identify high-risk individuals and provide
personalized treatment plans. Data analytics can
also help detect disease outbreaks, monitor the
effectiveness of treatments, and improve
healthcare operations.
Finance
In the financial sector, data analytics plays a crucial
role in fraud detection, risk assessment, and
investment strategies. Banks and financial
institutions analyze large volumes of data to
identify suspicious transactions, predict
creditworthiness, and optimize investment
portfolios. Data analytics also enables
personalized financial advice and the development
of creative financial products and services.
E-commerce
E-commerce platforms utilize data analytics to
understand customer behavior, personalize
shopping experiences, and optimize marketing
campaigns. By analyzing customer preferences,
purchase history, and browsing patterns, e-
commerce companies can offer personalized
product recommendations, target specific customer
segments, and improve customer satisfaction and
retention.
Cybersecurity
Data analytics plays a vital role in cybersecurity by
detecting and preventing cyber threats and attacks.
Security systems analyze network traffic, user
behavior, and system logs to identify anomalies
and potential security breaches. By leveraging data
analytics, organizations can proactively strengthen
their security measures, detect and respond to
threats in real-time, and safeguard sensitive
information.
Supply Chain Management
Data analytics improves supply chain management
by optimizing inventory levels, reducing costs, and
enhancing overall operational efficiency.
Organizations can identify bottlenecks, forecast
demand, and improve logistics and distribution
processes by analyzing supply chain data. Data
analytics also enables better supplier management
and enhances transparency throughout the supply
chain.
Banking
Banks use data analytics to gain insights into
customer behavior, manage risks, and personalize
financial services. Banks can tailor their offerings,
identify potential fraud, and assess
creditworthiness by analyzing transaction data,
customer demographics, and credit histories. Data
analytics also helps banks detect money
laundering activities and improve regulatory
compliance.
Logistics
In the logistics industry, data analytics plays a crucial
role in optimizing transportation routes, managing fleet
operations, and improving overall supply chain
efficiency. Logistics companies can minimize costs,
reduce delivery times, and enhance customer
satisfaction by analyzing data on routes, delivery times,
and vehicle performance. Data analytics also enables
better demand forecasting and inventory management.
Retail
Data analytics transforms the retail industry by
providing insights into customer preferences,
optimizing pricing strategies, and improving
inventory management. Retailers analyze sales
data, customer feedback, and market trends to
identify popular products, personalize offers, and
forecast demand. Data analytics also helps
retailers enhance their marketing efforts, improve
customer loyalty, and optimize store layouts.
Manufacturing
Data analytics is revolutionizing the manufacturing
sector by enabling predictive maintenance,
optimizing production processes, and improving
product quality. Manufacturers can predict
equipment failures, minimize downtime, and
ensure efficient operations by analyzing sensor
data, machine performance, and historical
maintenance records. Data analytics also enables
real-time monitoring of production lines, leading to
higher productivity and cost savings.
Internet Searching
Data analytics powers internet search engines,
enabling users to find relevant information quickly
and accurately. Search engines analyze vast
amounts of data, including web pages, user
queries, and click-through rates, to deliver the
most relevant search results. Data analytics
algorithms continuously learn and adapt to user
behavior, providing increasingly accurate and
personalized search results.
Risk Management
Data analytics plays a crucial role in risk
management across various industries, including
insurance, finance, and project management.
Organizations can assess risks, develop mitigation
strategies, and make informed decisions by
analyzing historical data, market trends, and
external factors. Data analytics helps organizations
identify potential risks, quantify their impact, and
implement risk mitigation measures.
Data Analytics Lifecycle: Key roles for
successful analytic projects
1. Business User: Someone who understands the domain area
and usually benefits from the results. This person can consult
and advise the project team on the context of the project, the
value of the results, and how the outputs will be
operationalized. Usually a business analyst, line manager, or
deep subject matter expert in the project domain fulfills this
role.
2. Project Sponsor: Responsible for the genesis of the project.
Provides the impetus and requirements for the project and
defines the core business problem. Generally provides the
funding and gauges the degree of value from the final outputs
of the working team. This person sets the priorities for the
project and clarifies the desired outputs.
3. Project Manager: Ensures that key milestones and objectives
are met on time and at the expected quality.
4. Business Intelligence Analyst: Provides business domain
expertise based on a deep understanding of the data, key
performance indicators (KPIs), key metrics, and business
intelligence from a reporting perspective. Business Intelligence
Analysts generally create dashboards and reports and have
knowledge of the data feeds and sources.
5. Database Administrator (DBA): Provisions and configures the
database environment to support the analytics needs of the
working team. These responsibilities may include providing
access to key databases or tables and ensuring the
appropriate security levels are in place related to the data
repositories.
6. Data Engineer: Leverages deep technical skills to assist
with tuning SQL queries for data management and data
extraction, and provides support for data ingestion into
the analytic sandbox. Whereas the DBA sets up and
configures the databases to be used, the data engineer
executes the actual data extractions and performs
substantial data manipulation to facilitate the analytics.
The data engineer works closely with the data scientist
to help shape data in the right ways for analyses.
7. Data Scientist: Provides subject matter expertise for
analytical techniques, data modeling, and applying valid
analytical techniques to given business problems.
Ensures overall analytics objectives are met. Designs
and executes analytical methods and approaches with
the data available to the project.
Data Analytics Lifecycle
The Data Analytics Lifecycle is designed specifically for Big
Data problems and data science projects. The lifecycle has six
phases, and project work can occur in several phases at once. For
most phases in the lifecycle, movement can be either forward or
backward.
Phase 1: Discovery
Phase 2: Data Preparation
Phase 3: Model Planning
Phase 4: Model Building
Phase 5: Communicate Results
Phase 6: Operationalize
Phase 1—Discovery: In Phase 1, the team learns the business
domain, including relevant history such as whether the
organization or business unit has attempted similar projects in the
past from which they can learn. The team assesses the resources
available to support the project in terms of people, technology,
time, and data. Important activities in this phase include framing
the business problem as an analytics challenge that can be
addressed in subsequent phases and formulating initial
hypotheses (IHs) to test and begin learning the data.
Phase 2—Data preparation: Phase 2 requires the presence of an
analytic sandbox, in which the team can work with data and
perform analytics for the duration of the project. The team needs
to execute extract, load, and transform (ELT) or extract, transform
and load (ETL) to get data into the sandbox. The ELT and ETL are
sometimes abbreviated as ETLT. Data should be transformed in
the ETLT process so the team can work with it and analyze it. In
this phase, the team also needs to familiarize itself with the data
thoroughly and take steps to condition the data.
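The extract-transform-load work of Phase 2 can be sketched with a toy example. The sketch below uses only the Python standard library; the field names and values are made up, and the "sandbox" is just an in-memory list standing in for the team's analytic workspace.

```python
# A toy ETL pass in the spirit of Phase 2: Extract rows from a source,
# Transform them into an analyzable shape, and Load them into the
# analytic "sandbox" (here, an in-memory list). Data is illustrative.
import csv
import io

raw = "name,revenue\nacme, 1200 \nglobex,950\n"   # hypothetical source extract

def etl(source_text):
    sandbox = []
    for row in csv.DictReader(io.StringIO(source_text)):  # Extract
        row["name"] = row["name"].strip().title()         # Transform: clean text
        row["revenue"] = float(row["revenue"])            # Transform: fix types
        sandbox.append(row)                               # Load
    return sandbox

rows = etl(raw)
print(rows[0])   # {'name': 'Acme', 'revenue': 1200.0}
```

Whether the transform happens before loading (ETL) or after (ELT), the goal is the same: data in the sandbox the team can actually analyze.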
Phase 3—Model planning: Phase 3 is model planning, where the
team determines the methods, techniques, and workflow it
intends to follow for the subsequent model building phase. The
team explores the data to learn about the relationships between
variables and subsequently selects key variables and the most
suitable models.
Phase 4—Model building: In Phase 4, the team develops datasets
for testing, training, and production purposes. In addition, in this
phase the team builds and executes models based on the work
done in the model planning phase. The team also considers
whether its existing tools will suffice for running the models, or if
it will need a more robust environment for executing models and
workflows (for example, fast hardware and parallel processing, if
applicable).
Phase 5—Communicate results: In Phase 5, the team, in
collaboration with major stakeholders, determines if the results of
the project are a success or a failure based on the criteria
developed in Phase 1. The team should identify key findings,
quantify the business value, and develop a narrative to summarize
and convey findings to stakeholders.
Phase 6—Operationalize: In Phase 6, the team delivers final
reports, briefings, code, and technical documents. In addition, the
team may run a pilot project to implement the models in a
production environment.
Phase I Discovery
In this phase, the data science team must learn and investigate the
problem, develop context and understanding, and learn about the
data sources needed and available for the project.
1. Learning the Business Domain
2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources
Phase II Data Preparation
1. Preparing the Analytic Sandbox
2. Performing ETLT
3. Learning about the data
4. Data Conditioning
5. Survey and Visualize
6. Common Tools for the Data Preparation Phase
Phase III Model Planning
The data science team identifies candidate models to
apply to the data for clustering, classifying, or finding
relationships in the data depending on the goal of the
project.
1. Data Exploration and Variable Selection
2. Model Selection
3. Common Tools for the Model Planning Phase
R, SQL Analysis Services, SAS/ACCESS
Phase IV Model Building
The data science team needs to develop datasets for
training, testing, and production purposes.
1. Common Tools for the Model Building Phase
Commercial tools: SAS Enterprise Miner, SPSS Modeler,
MATLAB, Alpine Miner, STATISTICA
Free or Open Source tools: R, Octave, WEKA, Python,
SQL
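Developing the training and testing datasets mentioned above usually starts with a random split of the available records. A minimal sketch (an 80/20 split over illustrative data, using only the standard library):

```python
# Minimal sketch of Phase 4 dataset development: shuffle the available
# records and split them into training and testing sets (80/20).
import random

records = list(range(100))   # stand-in for 100 labeled examples
random.seed(42)              # fixed seed so the split is reproducible
random.shuffle(records)

split = int(0.8 * len(records))
train, test = records[:split], records[split:]
print(len(train), len(test))   # 80 20
```

The model is then fit on `train` and evaluated on the held-out `test` set, with a separate production dataset reserved for deployment.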
Phase V Communicate Results
After executing the model, the team needs to compare
the outcomes of the modeling to the criteria
established for success and failure. In Phase 5 the team
considers how best to articulate the findings and
outcomes to the various team members and
stakeholders, taking into account caveats, assumptions,
and any limitations of the results.
Phase VI Operationalize
In the final phase, the team communicates the benefits
of the project more broadly and sets up a pilot project
to deploy the work in a controlled way before
broadening the work to a full enterprise or ecosystem
of users.