
EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis, or EDA, is a statistical approach or technique for analyzing data sets to summarize their main characteristics, generally by using visual aids.
The EDA approach can be used to gather knowledge about the following aspects of the data:
• Main characteristics or features of the data.
• The variables and their relationships.
• Finding out the important variables that can be used in our problem.
We perform EDA under two broad classifications:

• Descriptive Statistics, which includes mean, median, mode, inter-quartile range, and so on.
• Graphical Methods, which includes histograms, density estimation, box plots, and so on.
Exploring a new dataset

 The read.csv() function is used to import a CSV file into R. The argument header = TRUE tells R that a header row is included in the data we are going to import.

mydata <- read.csv("C:/Users/Deepanshu/Documents/Book1.csv", header = TRUE)
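Once the data is imported, a few quick base-R calls give a first look at it. A minimal sketch, assuming the mydata object created above:

dim(mydata)      # number of rows and columns
str(mydata)      # variable names and types
head(mydata)     # first six rows
summary(mydata)  # basic descriptive statistics for each column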
What are Anomalies?

 Anomalies, also known as outliers, are data points that significantly deviate from the normal behavior or expected patterns within a dataset.
 They can be caused by various factors such as errors
in data collection, system glitches, fraudulent
activities, or genuine but rare occurrences.
ANOMALIES IN NUMERICAL DATA

 Anomaly detection is a critical aspect of data analysis, allowing us to identify unusual patterns, outliers, or abnormalities within datasets.
 It plays a pivotal role across various domains such as finance, cybersecurity, healthcare, and more.
 This process is crucial for decision-making, risk management, and maintaining the integrity of datasets.

Anomalies can manifest in various forms and fields.

1. Financial Transactions
In financial data, anomalies might include fraudulent activities like:
 Unusually large transactions compared to typical spending patterns for an individual.
 Transactions occurring at odd hours or from atypical geographic locations.
 Sudden, unexplained changes in spending behavior.

2. Network Security
In cybersecurity, anomalies could be:
 Unusual spikes in network traffic that differ significantly from regular
patterns.
 Unexpected login attempts from unrecognized IP addresses.
 Unusual file access or transfer patterns that deviate from typical user
behavior.
3. Healthcare and Medical Data
In medical data, anomalies might include:
 Outliers in patient vital signs that deviate significantly from the norm.
 Irregularities in medical imaging (like X-rays, MRIs) indicating potential health issues.
 Unexpected patterns in patient records, such as sudden, significant changes in
medication or treatment adherence.
4. Manufacturing and IoT
In manufacturing or IoT (Internet of Things), anomalies could be:
 Abnormal sensor readings in machinery indicating potential faults or malfunctions.
 Sudden temperature, pressure, or vibration changes in equipment beyond usual operating
ranges.
 Deviations in product quality or output that fall outside standard tolerances.

5. Climate and Environmental Data
In environmental datasets, anomalies might be:
 Unusual weather patterns or extreme weather events that deviate from historical records.
 Unexpected changes in air quality measurements indicating potential pollution events.
 Abnormal fluctuations in ocean temperatures or ice melting rates.
Detecting anomalies in numerical data is a critical
step in data preprocessing, and R provides several
techniques to help identify these outliers.
Here are a few popular methods for identifying anomalies
in numerical data in R:

1. Using Z-Scores
Calculate the Z-score for each value in a numeric column. A value whose Z-score exceeds a certain threshold in absolute value (typically 3) is considered an anomaly.
 Z-scores are a statistical measure of how far a data point is from the mean of a data set. They are expressed in terms of standard deviations and can be used to determine whether a value is above or below the mean.
 Positive Z-score: the value is above the mean. For example, a Z-score of 1.0 means the value is one standard deviation above the mean.
 Negative Z-score: the value is below the mean. For example, a Z-score of -1.0 means the value is one standard deviation below the mean.
 Z-score of 0: the value is equal to the mean.
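As an illustration, a minimal base-R sketch of this approach; the column name your_numeric_column is a placeholder, matching the IQR example at the end of these notes:

x <- data$your_numeric_column
z_scores <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)  # standardize each value
anomalies <- x[abs(z_scores) > 3]                              # flag values more than 3 SDs from the mean
print(anomalies)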
2. USING THE INTERQUARTILE RANGE (IQR)
METHOD

The IQR method considers values outside the range [Q1 − 1.5×IQR, Q3 + 1.5×IQR] as outliers.
3. USING DENSITY-BASED METHODS (E.G.,
DBSCAN)
Density-based spatial clustering of applications with
noise (DBSCAN) can identify clusters and outliers based
on density.
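A minimal sketch with the dbscan package; the columns, eps, and minPts values are illustrative choices, and points assigned to cluster 0 are DBSCAN's noise points (potential anomalies):

library(dbscan)
X <- scale(mtcars[, c("mpg", "wt")])        # illustrative numeric columns, standardized
res <- dbscan(X, eps = 0.8, minPts = 5)     # density-based clustering
mtcars[res$cluster == 0, ]                  # rows labelled as noise (cluster 0)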
4. USING THE FORECAST PACKAGE FOR TIME-
SERIES DATA
If working with time-series data,
the forecast package’s tsoutliers() function can identify
anomalies based on model residuals.
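A minimal sketch using tsoutliers() from the forecast package, with a built-in monthly series standing in for real time-series data:

library(forecast)
out <- tsoutliers(AirPassengers)   # flags outliers based on model residuals
out$index                          # positions of the flagged observations
out$replacements                   # suggested replacement values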
5. USING ISOLATION FORESTS
An isolation forest is a machine learning approach for anomaly detection. The isofor package in R can be used for this.
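A rough sketch only: the isofor package lives on GitHub rather than CRAN, and the iForest()/predict() interface shown here is an assumption based on its documentation; the CRAN package isotree offers a comparable isolation.forest() function if isofor is unavailable.

library(isofor)
X <- mtcars[, c("mpg", "wt", "hp")]   # illustrative numeric columns
mod <- iForest(X, nt = 100)           # build 100 isolation trees (assumed signature)
scores <- predict(mod, X)             # higher scores indicate more anomalous rows
head(X[order(-scores), ])             # inspect the most anomalous observations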
VISUALIZING RELATIONS BETWEEN VARIABLES
 To visualize relationships between variables in R, here are several common and effective types of plots, depending on the data type and structure (a short ggplot2 sketch follows this list):
1. Scatter Plot
• Best for visualizing the relationship between two continuous
variables.
• Use ggplot2 for enhanced customization.
2. Line Plot
• Ideal for showing trends in time-series data or other sequential
relationships.
 3. Heatmap
• Useful for visualizing the correlation matrix of
multiple variables.
• This requires converting correlations to a matrix
format.
 4. Pair Plot (Scatterplot Matrix)
• Good for exploring relationships between multiple
pairs of variables.
• GGally’s ggpairs function is helpful for this.
 5. Box Plot
• For visualizing the relationship between a categorical
variable and a continuous variable.
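A minimal ggplot2 sketch of two of these plot types, using the built-in mtcars data set; the variable choices are purely illustrative:

library(ggplot2)
# Scatter plot: two continuous variables, with a linear trend line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
# Box plot: a continuous variable across the levels of a categorical variable
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()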
ASSUMPTIONS OF LINEAR REGRESSION

 Linear regression is the simplest machine learning algorithm for predictive analysis.
 It is widely used for predicting a continuous target variable based on one or more predictor variables. While linear regression is powerful and interpretable, its validity relies heavily on certain assumptions about the data.
KEY ASSUMPTIONS OF LINEAR
REGRESSION

1. Linearity
 Assumption: The relationship between the
independent and dependent variables is linear.
 The first and foremost assumption of linear regression is
that the relationship between the predictor(s) and the
response variable is linear.
 This means that a change in the independent
variable results in a proportional change in the
dependent variable. This can be visually assessed
using scatter plots or residual plots.
 2. Homoscedasticity of Residuals in Linear
Regression
 Homoscedasticity is one of the key assumptions of
linear regression, which asserts that the residuals (the
differences between observed and predicted
values) should have a constant variance across all
levels of the independent variable(s).
 In simpler terms, it means that the spread of the
errors should be relatively uniform, regardless
of the value of the predictor.
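A minimal sketch of how this assumption is commonly checked: a residuals-versus-fitted plot plus the Breusch-Pagan test from the lmtest package. The model and data are illustrative:

library(lmtest)
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                    # residuals should scatter evenly around zero
bptest(fit)                               # a small p-value suggests heteroscedasticity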
3. Multivariate Normality – Normal Distribution

 Multivariate normality is a key assumption for linear regression models when making statistical inferences.
 Specifically, it means that the residuals (the differences between observed and predicted values) should follow a normal distribution when considering multiple predictors together.
 This assumption ensures that hypothesis tests, confidence intervals, and p-values are valid.
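A minimal sketch of a residual-normality check using a Q-Q plot and the Shapiro-Wilk test; the model is illustrative:

fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model
qqnorm(resid(fit))                        # points should lie close to the reference line
qqline(resid(fit))
shapiro.test(resid(fit))                  # a small p-value suggests non-normal residuals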
VALIDATING LINEAR
ASSUMPTION
 Validating the linear assumption is a critical step in
linear regression, ensuring that the relationship
between the independent (predictor) and
dependent (response) variables is indeed linear.
 This validation step is essential, as non-linear
relationships can lead to inaccurate predictions and
interpretations. Here are some methods commonly used
to validate the linearity assumption:
 1. Visual Inspection with Scatter Plots
• Scatter Plot of Dependent vs. Independent Variables:
Plot the dependent variable (Y) against each independent
variable (X) individually. A linear pattern (i.e., points roughly
following a straight line) suggests a linear relationship.
• Residuals vs. Fitted Values Plot: After fitting the linear
model, plot the residuals (differences between observed and
predicted values) against the fitted values. If the residuals are
randomly scattered around zero with no discernible pattern,
this suggests linearity. Curvature or other patterns may
indicate non-linearity.
2. Correlation Coefficients
• Calculate the correlation coefficient between each
independent variable and the dependent variable.
While a high correlation does not guarantee linearity, it
often indicates a linear relationship.
 3. Using Polynomial and Interaction Terms
• Add polynomial terms (e.g., X²) or interaction terms to the model and check if they are significant. If polynomial terms significantly improve the model, it may indicate that a non-linear relationship exists.
• Likelihood Ratio Test: Compare models with and without
these terms to see if including them significantly improves
the fit.
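A minimal sketch of this comparison; anova() on two nested lm fits performs an F-test (lrtest() from the lmtest package gives a likelihood-ratio version), and the model and variables are illustrative:

fit_linear <- lm(mpg ~ wt, data = mtcars)
fit_poly   <- lm(mpg ~ wt + I(wt^2), data = mtcars)   # add a quadratic term
anova(fit_linear, fit_poly)   # a significant result suggests the quadratic term improves the fit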
 4. Statistical Tests
• Lack-of-Fit Test: This test checks whether a linear model fits the data
well or if a more complex model is needed. If the test is significant, it
suggests that the linear model may not be adequate.
 5. Box-Cox Transformation
• The Box-Cox transformation can help identify if a transformation of the
dependent variable might make the relationship linear. By examining
the best λ (lambda) value, one can determine if the data needs a
transformation like a square root, log, or reciprocal.
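A minimal sketch using boxcox() from the MASS package; the model is illustrative and the response must be positive. A λ near 0 suggests a log transformation, near 0.5 a square root, and near 1 no transformation:

library(MASS)
fit <- lm(mpg ~ wt, data = mtcars)   # illustrative model
bc  <- boxcox(fit)                   # plots the profile log-likelihood over a range of lambda
bc$x[which.max(bc$y)]                # lambda value with the highest likelihood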
 6. Partial Residual (Component-Plus-Residual) Plots
• These plots are useful when dealing with multiple predictors. They
show the relationship between the response and each predictor
variable after accounting for other predictors in the model. Non-linear
patterns in partial residual plots indicate a need to reconsider the
linearity assumption.
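A minimal sketch using crPlots() from the car package; the model is illustrative:

library(car)
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative multiple-regression model
crPlots(fit)                              # one component-plus-residual plot per predictor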
1. Missing Values:
1. Definition: Missing values refer to absent data points in a
dataset, which may occur due to errors in data collection,
processing, or respondents not providing information.
2. Handling: Various methods exist for dealing with missing
values, including:
1. Deletion: Remove rows or columns with missing values (useful when
missing values are few).
2. Imputation: Fill missing values using strategies like mean, median, mode,
or more sophisticated methods like regression or machine learning models.
3. Indicator Variables: Add a binary variable to indicate if data was missing,
preserving information.
3. Consideration: It’s crucial to assess why data is missing (e.g.,
missing completely at random or missing not at random) since
this impacts how we handle it.
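A minimal base-R sketch of the three handling strategies; the data frame and column names are placeholders:

df <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40))
df_complete <- na.omit(df)                       # deletion: drop rows with any NA
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)    # imputation: fill NAs with the column mean
df$y_missing <- as.integer(is.na(df$y))          # indicator variable: record where y was missing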
Covariation:
1. Definition: Covariation indicates how two variables vary together.
Positive covariation means that as one variable increases, the other
tends to increase too, while negative covariation indicates that as one
variable rises, the other tends to decrease.
2. Measures: The common measures
include Covariance and Correlation.
1. Covariance is a measure of how much two random variables vary together.
It can be positive, negative, or zero.
2. Correlation standardizes covariance to make it dimensionless, typically
ranging between -1 and +1, which makes it easier to interpret.
3. Visualization: Scatter plots or heatmaps help visualize covariation
patterns between variables.
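A minimal sketch computing both measures in base R, with mtcars used purely for illustration:

cov(mtcars$wt, mtcars$mpg)              # covariance: sign shows direction, scale depends on units
cor(mtcars$wt, mtcars$mpg)              # correlation: dimensionless, between -1 and +1
cor(mtcars[, c("mpg", "wt", "hp")])     # correlation matrix, e.g. as input to a heatmap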
Patterns:
1. Definition: Patterns are recurring structures or relationships observed within data.
2. Types: Patterns may include linear, non-linear, seasonal, or cyclic patterns.
3. Analysis: Finding patterns is central to understanding trends, anomalies, and relationships within datasets, and can be achieved through visualizations (e.g., line plots, histograms) or by using machine learning algorithms.
4. Example: In time series data, a recurring peak might signify a seasonal pattern.
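For instance, a minimal sketch that exposes the seasonal pattern in a built-in monthly time series with stl():

decomp <- stl(log(AirPassengers), s.window = "periodic")   # trend + seasonal + remainder
plot(decomp)   # the seasonal panel shows the recurring yearly peak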
Models:
1. Definition: Models are representations or mathematical
approximations of relationships in data, used to understand,
predict, or simulate phenomena.
2. Types: Models can be statistical (like linear regression)
or machine learning models (like decision trees, neural
networks).
3. Purpose: Models help to make predictions, explain
variability, and uncover insights from data.
4. Modeling Process: Involves selecting a model based on
data characteristics, training the model on a subset of the
data, and validating its performance on unseen data.
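A minimal sketch of that last step, training on a random subset and validating on the held-out rows; the data set and model are illustrative:

set.seed(42)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
fit   <- lm(mpg ~ wt + hp, data = train)    # train on roughly 70% of the rows
pred  <- predict(fit, newdata = test)       # predict on unseen data
sqrt(mean((test$mpg - pred)^2))             # RMSE on the held-out set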
1. GGPLOT2 CALLS
2. DETECTING ANOMALIES IN NUMERICAL DATA

# Detect anomalies using IQR
numerical_data <- data$your_numeric_column
Q1  <- quantile(numerical_data, 0.25)
Q3  <- quantile(numerical_data, 0.75)
IQR <- Q3 - Q1
outliers <- numerical_data[numerical_data < (Q1 - 1.5 * IQR) |
                           numerical_data > (Q3 + 1.5 * IQR)]
print(outliers)
