
EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis, or EDA, is a statistical approach or technique for analyzing data sets to summarize their main characteristics, generally by using visual aids.
The EDA approach can be used to gather knowledge about the following aspects of the data:
• Main characteristics or features of the data.
• The variables and their relationships.
• Finding out the important variables that can be used in our problem.
We perform EDA under two broad classifications:

• Descriptive Statistics, which includes mean, median, mode, inter-quartile range, and so on.
• Graphical Methods, which includes histograms, density estimation, box plots, and so on.
Exploring a new dataset

 The read.csv() function is used to import a CSV file into R. The argument header = TRUE tells R that a header row is included in the data we are going to import.

mydata <- read.csv("C:/Users/Deepanshu/Documents/Book1.csv", header = TRUE)
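Once the data is imported, a few quick base-R calls give a first look at it. A minimal sketch, assuming the mydata object created above:

dim(mydata)      # number of rows and columns
str(mydata)      # variable names and types
head(mydata)     # first six rows
summary(mydata)  # basic descriptive statistics for each column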
What are Anomalies?

 Anomalies, also known as outliers, are data points that significantly deviate from the normal behavior or expected patterns within a dataset.
 They can be caused by various factors such as errors
in data collection, system glitches, fraudulent
activities, or genuine but rare occurrences.
ANOMALIES IN NUMERICAL DATA

 Anomaly detection is a critical aspect of data analysis, allowing us to identify unusual patterns, outliers, or abnormalities within datasets.
 It plays a pivotal role across various domains such as finance, cybersecurity, healthcare, and more.
 This process is crucial for decision-making, risk management, and maintaining the integrity of datasets.

Anomalies can manifest in various forms and fields.

1. Financial Transactions
In financial data, anomalies might include fraudulent activities like:
 Unusually large transactions compared to typical spending patterns for an individual.
 Transactions occurring at odd hours or from atypical geographic locations.
 Sudden, unexplained changes in spending behavior.

2. Network Security
In cybersecurity, anomalies could be:
 Unusual spikes in network traffic that differ significantly from regular
patterns.
 Unexpected login attempts from unrecognized IP addresses.
 Unusual file access or transfer patterns that deviate from typical user
behavior.
3. Healthcare and Medical Data
In medical data, anomalies might include:
 Outliers in patient vital signs that deviate significantly from the norm.
 Irregularities in medical imaging (like X-rays, MRIs) indicating potential health issues.
 Unexpected patterns in patient records, such as sudden, significant changes in
medication or treatment adherence.
4. Manufacturing and IoT
In manufacturing or IoT (Internet of Things), anomalies could be:
 Abnormal sensor readings in machinery indicating potential faults or malfunctions.
 Sudden temperature, pressure, or vibration changes in equipment beyond usual operating
ranges.
 Deviations in product quality or output that fall outside standard tolerances.

5. Climate and Environmental Data
In environmental datasets, anomalies might be:
 Unusual weather patterns or extreme weather events that deviate from historical records.
 Unexpected changes in air quality measurements indicating potential pollution events.
 Abnormal fluctuations in ocean temperatures or ice melting rates.
Detecting anomalies in numerical data is a critical
step in data preprocessing, and R provides several
techniques to help identify these outliers.
Here are a few popular methods for identifying anomalies
in numerical data in R:

1. Using Z-Scores
Calculate the Z-score for each value in a numeric column. A value whose Z-score exceeds a certain threshold in absolute value (typically 3) is considered an anomaly.
 Z-scores are a statistical measure of how far a data point is from the mean of a data set. They are expressed in terms of standard deviations and can be used to determine whether a value is above or below the mean.
 Positive Z-score: the value is above the mean. For example, a Z-score of 1.0 means the value is one standard deviation above the mean.
 Negative Z-score: the value is below the mean. For example, a Z-score of -1.0 means the value is one standard deviation below the mean.
 Z-score of 0: the value is equal to the mean.
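As an illustration, a minimal base-R sketch of this approach; the column name your_numeric_column is a placeholder, matching the IQR example at the end of these notes:

x <- data$your_numeric_column
z_scores <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)  # standardize each value
anomalies <- x[abs(z_scores) > 3]                              # flag values more than 3 SDs from the mean
print(anomalies)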
2. USING THE INTERQUARTILE RANGE (IQR)
METHOD

The IQR method considers values outside the range [Q1 − 1.5×IQR, Q3 + 1.5×IQR] as outliers.
3. USING DENSITY-BASED METHODS (E.G.,
DBSCAN)
Density-based spatial clustering of applications with
noise (DBSCAN) can identify clusters and outliers based
on density.
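A minimal sketch with the dbscan package; the columns, eps, and minPts values are illustrative choices, and points assigned to cluster 0 are DBSCAN's noise points (potential anomalies):

library(dbscan)
X <- scale(mtcars[, c("mpg", "wt")])        # illustrative numeric columns, standardized
res <- dbscan(X, eps = 0.8, minPts = 5)     # density-based clustering
mtcars[res$cluster == 0, ]                  # rows labelled as noise (cluster 0)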
4. USING THE FORECAST PACKAGE FOR TIME-
SERIES DATA
If working with time-series data,
the forecast package’s tsoutliers() function can identify
anomalies based on model residuals.
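A minimal sketch using tsoutliers() from the forecast package, with a built-in monthly series standing in for real time-series data:

library(forecast)
out <- tsoutliers(AirPassengers)   # flags outliers based on model residuals
out$index                          # positions of the flagged observations
out$replacements                   # suggested replacement values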
5. USING ISOLATION FORESTS
An isolation forest is a machine learning approach for anomaly detection. The isofor package in R can be used for this.
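A rough sketch only: the isofor package lives on GitHub rather than CRAN, and the iForest()/predict() interface shown here is an assumption based on its documentation; the CRAN package isotree offers a comparable isolation.forest() function if isofor is unavailable.

library(isofor)
X <- mtcars[, c("mpg", "wt", "hp")]   # illustrative numeric columns
mod <- iForest(X, nt = 100)           # build 100 isolation trees (assumed signature)
scores <- predict(mod, X)             # higher scores indicate more anomalous rows
head(X[order(-scores), ])             # inspect the most anomalous observations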
VISUALIZING RELATIONS BETWEEN VARIABLES
 To visualize relationships between variables in R, here are several common and effective types of plots, depending on the data type and structure (a short ggplot2 sketch follows this list):
1. Scatter Plot
• Best for visualizing the relationship between two continuous
variables.
• Use ggplot2 for enhanced customization.
2. Line Plot
• Ideal for showing trends in time-series data or other sequential
relationships.
 3. Heatmap
• Useful for visualizing the correlation matrix of
multiple variables.
• This requires converting correlations to a matrix
format.
 4. Pair Plot (Scatterplot Matrix)
• Good for exploring relationships between multiple
pairs of variables.
• GGally’s ggpairs function is helpful for this.
 5. Box Plot
• For visualizing the relationship between a categorical
variable and a continuous variable.
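A minimal ggplot2 sketch of two of these plot types, using the built-in mtcars data set; the variable choices are purely illustrative:

library(ggplot2)
# Scatter plot: two continuous variables, with a linear trend line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
# Box plot: a continuous variable across the levels of a categorical variable
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()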
ASSUMPTIONS OF LINEAR REGRESSION

 Linear regression is the simplest machine learning algorithm for predictive analysis.
 It is widely used for predicting a continuous target variable based on one or more predictor variables. While linear regression is powerful and interpretable, its validity relies heavily on certain assumptions about the data.
KEY ASSUMPTIONS OF LINEAR
REGRESSION

1. Linearity
 Assumption: The relationship between the
independent and dependent variables is linear.
 The first and foremost assumption of linear regression is
that the relationship between the predictor(s) and the
response variable is linear.
 This means that a change in the independent
variable results in a proportional change in the
dependent variable. This can be visually assessed
using scatter plots or residual plots.
 2. Homoscedasticity of Residuals in Linear
Regression
 Homoscedasticity is one of the key assumptions of
linear regression, which asserts that the residuals (the
differences between observed and predicted
values) should have a constant variance across all
levels of the independent variable(s).
 In simpler terms, it means that the spread of the
errors should be relatively uniform, regardless
of the value of the predictor.
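A minimal sketch of how this assumption is commonly checked: a residuals-versus-fitted plot plus the Breusch-Pagan test from the lmtest package. The model and data are illustrative:

library(lmtest)
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                    # residuals should scatter evenly around zero
bptest(fit)                               # a small p-value suggests heteroscedasticity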
3. Multivariate Normality – Normal Distribution

 Multivariate normality is a key assumption for linear regression models when making statistical inferences.
 Specifically, it means that the residuals (the differences between observed and predicted values) should follow a normal distribution when considering multiple predictors together.
 This assumption ensures that hypothesis tests, confidence intervals, and p-values are valid.
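A minimal sketch of a residual-normality check using a Q-Q plot and the Shapiro-Wilk test; the model is illustrative:

fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative model
qqnorm(resid(fit))                        # points should lie close to the reference line
qqline(resid(fit))
shapiro.test(resid(fit))                  # a small p-value suggests non-normal residuals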
VALIDATING LINEAR
ASSUMPTION
 Validating the linear assumption is a critical step in
linear regression, ensuring that the relationship
between the independent (predictor) and
dependent (response) variables is indeed linear.
 This validation step is essential, as non-linear
relationships can lead to inaccurate predictions and
interpretations. Here are some methods commonly used
to validate the linearity assumption:
 1. Visual Inspection with Scatter Plots
• Scatter Plot of Dependent vs. Independent Variables:
Plot the dependent variable (Y) against each independent
variable (X) individually. A linear pattern (i.e., points roughly
following a straight line) suggests a linear relationship.
• Residuals vs. Fitted Values Plot: After fitting the linear
model, plot the residuals (differences between observed and
predicted values) against the fitted values. If the residuals are
randomly scattered around zero with no discernible pattern,
this suggests linearity. Curvature or other patterns may
indicate non-linearity.
2. Correlation Coefficients
• Calculate the correlation coefficient between each
independent variable and the dependent variable.
While a high correlation does not guarantee linearity, it
often indicates a linear relationship.
 3. Using Polynomial and Interaction Terms
• Add polynomial terms (e.g., X²) or interaction terms to the model and check if they are significant. If polynomial terms significantly improve the model, it may indicate that a non-linear relationship exists.
• Likelihood Ratio Test: Compare models with and without
these terms to see if including them significantly improves
the fit.
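A minimal sketch of this comparison; anova() on two nested lm fits performs an F-test (lrtest() from the lmtest package gives a likelihood-ratio version), and the model and variables are illustrative:

fit_linear <- lm(mpg ~ wt, data = mtcars)
fit_poly   <- lm(mpg ~ wt + I(wt^2), data = mtcars)   # add a quadratic term
anova(fit_linear, fit_poly)   # a significant result suggests the quadratic term improves the fit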
 4. Statistical Tests
• Lack-of-Fit Test: This test checks whether a linear model fits the data
well or if a more complex model is needed. If the test is significant, it
suggests that the linear model may not be adequate.
 5. Box-Cox Transformation
• The Box-Cox transformation can help identify if a transformation of the
dependent variable might make the relationship linear. By examining
the best λ (lambda) value, one can determine if the data needs a
transformation like a square root, log, or reciprocal.
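A minimal sketch using boxcox() from the MASS package; the model is illustrative and the response must be positive. A λ near 0 suggests a log transformation, near 0.5 a square root, and near 1 no transformation:

library(MASS)
fit <- lm(mpg ~ wt, data = mtcars)   # illustrative model
bc  <- boxcox(fit)                   # plots the profile log-likelihood over a range of lambda
bc$x[which.max(bc$y)]                # lambda value with the highest likelihood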
 6. Partial Residual (Component-Plus-Residual) Plots
• These plots are useful when dealing with multiple predictors. They
show the relationship between the response and each predictor
variable after accounting for other predictors in the model. Non-linear
patterns in partial residual plots indicate a need to reconsider the
linearity assumption.
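A minimal sketch using crPlots() from the car package; the model is illustrative:

library(car)
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative multiple-regression model
crPlots(fit)                              # one component-plus-residual plot per predictor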
1. Missing Values:
1. Definition: Missing values refer to absent data points in a
dataset, which may occur due to errors in data collection,
processing, or respondents not providing information.
2. Handling: Various methods exist for dealing with missing
values, including:
1. Deletion: Remove rows or columns with missing values (useful when
missing values are few).
2. Imputation: Fill missing values using strategies like mean, median, mode,
or more sophisticated methods like regression or machine learning models.
3. Indicator Variables: Add a binary variable to indicate if data was missing,
preserving information.
3. Consideration: It’s crucial to assess why data is missing (e.g.,
missing completely at random or missing not at random) since
this impacts how we handle it.
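A minimal base-R sketch of the three handling strategies; the data frame and column names are placeholders:

df <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40))
df_complete <- na.omit(df)                       # deletion: drop rows with any NA
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)    # imputation: fill NAs with the column mean
df$y_missing <- as.integer(is.na(df$y))          # indicator variable: record where y was missing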
Covariation:
1. Definition: Covariation indicates how two variables vary together.
Positive covariation means that as one variable increases, the other
tends to increase too, while negative covariation indicates that as one
variable rises, the other tends to decrease.
2. Measures: The common measures
include Covariance and Correlation.
1. Covariance is a measure of how much two random variables vary together.
It can be positive, negative, or zero.
2. Correlation standardizes covariance to make it dimensionless, typically
ranging between -1 and +1, which makes it easier to interpret.
3. Visualization: Scatter plots or heatmaps help visualize covariation
patterns between variables.
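A minimal sketch computing both measures in base R, with mtcars used purely for illustration:

cov(mtcars$wt, mtcars$mpg)              # covariance: sign shows direction, scale depends on units
cor(mtcars$wt, mtcars$mpg)              # correlation: dimensionless, between -1 and +1
cor(mtcars[, c("mpg", "wt", "hp")])     # correlation matrix, e.g. as input to a heatmap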
Patterns:
1. Definition: Patterns are recurring structures or relationships observed within data.
2. Types: Patterns may include linear, non-linear, seasonal, or cyclic patterns.
3. Analysis: Finding patterns is central to understanding trends, anomalies, and relationships within datasets, and can be achieved through visualizations (e.g., line plots, histograms) or by using machine learning algorithms.
4. Example: In time series data, a recurring peak might signify a seasonal pattern.
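For instance, a minimal sketch that exposes the seasonal pattern in a built-in monthly time series with stl():

decomp <- stl(log(AirPassengers), s.window = "periodic")   # trend + seasonal + remainder
plot(decomp)   # the seasonal panel shows the recurring yearly peak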
Models:
1. Definition: Models are representations or mathematical
approximations of relationships in data, used to understand,
predict, or simulate phenomena.
2. Types: Models can be statistical (like linear regression)
or machine learning models (like decision trees, neural
networks).
3. Purpose: Models help to make predictions, explain
variability, and uncover insights from data.
4. Modeling Process: Involves selecting a model based on
data characteristics, training the model on a subset of the
data, and validating its performance on unseen data.
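A minimal sketch of that last step, training on a random subset and validating on the held-out rows; the data set and model are illustrative:

set.seed(42)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
fit   <- lm(mpg ~ wt + hp, data = train)    # train on roughly 70% of the rows
pred  <- predict(fit, newdata = test)       # predict on unseen data
sqrt(mean((test$mpg - pred)^2))             # RMSE on the held-out set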
1. GGPLOT2 CALLS
2. DETECTING ANOMALIES IN NUMERICAL DATA

# Detect anomalies using IQR
numerical_data <- data$your_numeric_column
Q1  <- quantile(numerical_data, 0.25)
Q3  <- quantile(numerical_data, 0.75)
IQR <- Q3 - Q1
outliers <- numerical_data[numerical_data < (Q1 - 1.5 * IQR) |
                           numerical_data > (Q3 + 1.5 * IQR)]
print(outliers)
