Descriptive statistics, hypothesis testing, and basic regression analysis
Pre-Processing Data
• The data should be clean, accurate, and suitable for analysis
• Data pre-processing involves tasks such as cleaning the data to handle
missing values and outliers, transforming variables to meet
assumptions, reducing dimensionality, integrating data from multiple
sources, and normalizing or standardizing data.
How to deal with missing data
• Check with the data collection source
• Drop the missing value
• Drop the variable
• Drop the data entry
• Replace the missing value
• Replace it with an average
• Replace it with 0
• Replace it based on another function
• Leave it as missing data
How to check missing values in R
• Use is.na()
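A minimal sketch of is.na() in use. The data frame `flights` and its columns are a hypothetical toy example, not the course dataset.

```r
# Hypothetical toy data frame; names and values are illustrative only
flights <- data.frame(
  ArrDelay = c(5, NA, 12, NA),
  Carrier  = c("AA", "AS", NA, "AA")
)

# Logical matrix: TRUE wherever a value is missing
is.na(flights)

# Number of missing values in each column
missing_per_column <- colSums(is.na(flights))
missing_per_column

# Indices of rows that contain at least one missing value
rows_with_na <- which(!complete.cases(flights))
rows_with_na
```

Counting per column first helps decide whether to drop the rows, drop the variable, or replace the values.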
Drop rows
• Use data[-c(n1, n2, …), ] to drop specific rows
Drop columns
• Use data[ , -c(n1, n2, …)] to drop specific columns
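A short sketch of negative indexing on a hypothetical toy data frame `df`. The row and column numbers are placeholders for n1, n2, and so on.

```r
# Hypothetical toy data frame with 5 rows and 3 columns
df <- data.frame(id = 1:5, a = letters[1:5], b = 5:1)

# Drop rows 2 and 4 (negative indices before the comma)
without_rows <- df[-c(2, 4), ]

# Drop column 3 (negative indices after the comma)
without_cols <- df[, -c(3)]

# drop = FALSE keeps the result a data frame even if one column remains
one_col <- df[, -c(2, 3), drop = FALSE]
```

The original `df` is unchanged; each result is a new data frame.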
How to drop missing values in R
• Use drop_na()
Don’t forget
How to replace missing values in R
• Use replace_na()
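A sketch of dropping and replacing missing values, assuming the tidyr package is installed (drop_na() and replace_na() come from tidyr). The data frame `flights` is a hypothetical toy example.

```r
library(tidyr)

# Hypothetical toy data frame; names and values are illustrative only
flights <- data.frame(
  ArrDelay = c(5, NA, 12, NA),
  Carrier  = c("AA", "AS", NA, "AA")
)

# Drop every row that has a missing value in any column
complete_rows <- drop_na(flights)

# Drop rows only where ArrDelay is missing
has_delay <- drop_na(flights, ArrDelay)

# Replace missing delays with 0 and missing carriers with a label
filled_zero <- replace_na(flights, list(ArrDelay = 0, Carrier = "unknown"))

# Replace missing delays with the column average (na.rm = TRUE skips the NAs)
mean_delay  <- mean(flights$ArrDelay, na.rm = TRUE)
filled_mean <- replace_na(flights, list(ArrDelay = mean_delay))
```

Replacing with 0 changes the column average while replacing with the mean does not, so the choice depends on what a missing value means in the data.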
Data Formatting
• Data is usually collected from different places and stored in different
formats
• Bringing data into a common standard of expression allows you to
make meaningful comparisons.
Reformat an entire column
• Separate FlightDate in the airline dataset
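One possible way to split the column, assuming FlightDate is stored as text in "YYYY-MM-DD" form and that tidyr is installed. The toy data frame and the new column names year, month, and day are assumptions for illustration.

```r
library(tidyr)

# Hypothetical stand-in for the airline dataset
flights <- data.frame(
  FlightDate = c("2005-11-03", "2010-01-15"),
  stringsAsFactors = FALSE
)

# Split FlightDate at each "-" into three new columns
split_dates <- separate(flights, FlightDate,
                        into = c("year", "month", "day"),
                        sep = "-")
split_dates
```

The new columns are still character strings after the split, so they need a type conversion before being used as numbers.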
Incorrect data types
• Sometimes the wrong data type is assigned to a feature
Correcting data types
• To identify data types
• To convert data types
Correcting data types
• Convert data type to integer in columns year, month, day
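A sketch of identifying and converting data types, assuming dplyr (version 1.0 or later, for across()) is installed. The toy data frame `dates` stands in for the separated year, month, and day columns.

```r
library(dplyr)

# Hypothetical columns stored as text after splitting a date
dates <- data.frame(
  year  = c("2005", "2010"),
  month = c("11", "01"),
  day   = c("03", "15"),
  stringsAsFactors = FALSE
)

# Identify the data type of every column
sapply(dates, class)

# Convert the three columns to integer
dates_int <- dates %>%
  mutate(across(c(year, month, day), as.integer))

# Check the result
sapply(dates_int, class)
```

str(dates) is another common way to inspect the types of all columns at once.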
Data normalization
• Not normalized
• Age and income are in different ranges
• Hard to compare
• Income will influence the result more
• Normalized
• Similar value range
• Similar influence on the analytical model
Methods of normalizing data
Simple feature scaling in R
Min-max in R
Z-score in R
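The three methods can be sketched in base R as follows. The function names and the `income` values are hypothetical, and the formulas are the standard definitions: simple feature scaling divides by the maximum, min-max rescales to the range 0 to 1, and the z-score subtracts the mean and divides by the standard deviation.

```r
# Hypothetical income values
income <- c(100000, 20000, 500000)

# Simple feature scaling: x / max(x)
simple_scale <- function(x) x / max(x)

# Min-max: (x - min) / (max - min), result lies between 0 and 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Z-score: (x - mean) / sd, result has mean 0 and standard deviation 1
z_score <- function(x) (x - mean(x)) / sd(x)

simple_scale(income)
min_max(income)
z_score(income)
```

In a data frame, each function would be applied to one column at a time, for example data$income <- min_max(data$income).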
Turning Categorical Values into Numeric Variables in R
• Problem: Most statistical models cannot take in objects or strings as
input
Categorical -> numerical
• Solution:
• Add dummy variables for each unique category
• Assign 0 or 1 in each category
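One way to build the dummy variables in base R is model.matrix(). This is a sketch on a hypothetical toy data frame, and other approaches (for example with tidyr) work as well.

```r
# Hypothetical toy data; names and values are illustrative only
flights <- data.frame(
  Reporting_Airline = c("AA", "AS", "AA", "DL"),
  ArrDelay          = c(5, 0, 12, 3)
)

# "- 1" removes the intercept so every category gets its own 0/1 column
dummies <- model.matrix(~ Reporting_Airline - 1, data = flights)

# Shorten the column names to the category labels
colnames(dummies) <- sub("^Reporting_Airline", "", colnames(dummies))

# Attach the dummy columns to the original data
flights_dummy <- cbind(flights, dummies)
flights_dummy
```

Each row has exactly one 1 across the dummy columns, marking the airline that row belongs to.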
Analysis of Variance (ANOVA)
• Question: “How do different
categories of the reporting
airline feature (as a categorical
variable) impact flight delays?”
ANOVA
• Analysis of Variance (ANOVA)
• Why perform an ANOVA test?
• To find the correlation between different groups of a categorical variable
• Null hypothesis (H0): The mean flight delay is the same for all reporting airlines.
• What do you obtain from ANOVA?
• F-test score: the ratio of the variation between the group means to the variation within each of the sample groups.
• P-value: indicates whether the obtained result is statistically significant.
ANOVA F-test
• A small F-test score implies a weak correlation between the variable's categories and the target variable
• A large F-test score implies a strong correlation between the variable's categories and the target variable
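To illustrate how the F-test score behaves, the sketch below simulates two hypothetical cases: three groups with the same mean and three groups with clearly different means. All data, group labels, and function names are made up for illustration. The score is computed by hand as between-group variation over within-group variation, then checked against aov().

```r
set.seed(1)

# Three hypothetical groups with 50 observations each
groups <- factor(rep(c("A", "B", "C"), each = 50))

# Case 1: all groups share the same mean
y_same <- rnorm(150, mean = 10, sd = 1)
# Case 2: the group means differ
y_diff <- rnorm(150, mean = rep(c(10, 15, 20), each = 50), sd = 1)

# F = (between-group mean square) / (within-group mean square)
f_manual <- function(y, g) {
  k <- nlevels(g)
  n <- length(y)
  between <- sum(tapply(y, g, length) * (tapply(y, g, mean) - mean(y))^2) / (k - 1)
  within  <- sum((y - ave(y, g))^2) / (n - k)
  between / within
}

# The same score as reported by aov()
f_aov <- function(y, g) summary(aov(y ~ g))[[1]][["F value"]][1]

f_same <- f_aov(y_same, groups)
f_diff <- f_aov(y_diff, groups)
f_same
f_diff
```

The groups with different means give a much larger F-test score than the groups with the same mean.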
ANOVA
• ANOVA between AA and AS
• ANOVA between AA and PA
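ANOVA in R
• ANOVA between AS and AA
• library(dplyr)
# Keep only the delay and airline columns for the two airlines being compared
aa_as_subset <- data_flight %>%
  select(ArrDelay, Reporting_Airline) %>%
  filter(Reporting_Airline == "AA" | Reporting_Airline == "AS")
# One-way ANOVA of arrival delay by airline
ad_aov <- aov(ArrDelay ~ Reporting_Airline, data = aa_as_subset)
summary(ad_aov)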
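Example
• ANOVA between AA and PA
• library(dplyr)
# Assumes the second airline is coded "PA"; check unique(data_flight$Reporting_Airline) for the exact code
aa_pa_subset <- data_flight %>%
  select(ArrDelay, Reporting_Airline) %>%
  filter(Reporting_Airline == "AA" | Reporting_Airline == "PA")
aa_pa_aov <- aov(ArrDelay ~ Reporting_Airline, data = aa_pa_subset)
summary(aa_pa_aov)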
EXAMPLE
• Install the gapminder data package
• install.packages("gapminder")
• Research question: Is the life expectancy in three continents different?
• Hypothesis testing: H0: Mean life expectancy is the same
H1: Mean life expectancy is not the same
Reject H0 if p < 0.05
• Observation: A difference in means is observed in the sample data, but is it statistically significant?
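A possible way to run this example, assuming the gapminder package is installed. The slide does not say which three continents or which year to use, so Africa, Asia, and Europe in 2007 are assumptions made here for illustration.

```r
library(gapminder)

# Assumption: compare Africa, Asia, and Europe using the 2007 records only
three <- droplevels(subset(
  gapminder,
  year == 2007 & continent %in% c("Africa", "Asia", "Europe")
))

# Sample means of life expectancy per continent
aggregate(lifeExp ~ continent, data = three, FUN = mean)

# One-way ANOVA: does mean life expectancy differ between the continents?
life_aov <- aov(lifeExp ~ continent, data = three)
summary(life_aov)

# Extract the p-value and compare it with the 0.05 threshold
p_value <- summary(life_aov)[[1]][["Pr(>F)"]][1]
p_value < 0.05
```

If the p-value is below 0.05, H0 is rejected and the difference in mean life expectancy is considered statistically significant.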
Correlation
• What is correlation?
• Measures to what extent different variables are interdependent
• Example
• Lung cancer -> Smoking
• Rain -> Umbrella
• Correlation does not imply causation
Correlation
• Positive correlation
• Both variables move in the same direction
• Negative correlation
• Two variables move in the opposite direction
• Correlation coefficient
Positive linear relationship
• Correlation between ArrDelayMinutes and DepDelayMinutes
• library(ggplot2)
# Scatter plot with a fitted straight line
ggplot(data_flight, aes(DepDelayMinutes, ArrDelayMinutes)) +
  geom_point() +
  geom_smooth(method = "lm")
Positive linear relationship
• Weak correlation between ArrDelayMinutes and WeatherDelay
• library(ggplot2)
# Scatter plot with a fitted straight line
ggplot(data_flight, aes(WeatherDelay, ArrDelayMinutes)) +
  geom_point() +
  geom_smooth(method = "lm")
Negative linear relationship
• Weak correlation between CarrierDelay and LateAircraftDelay
• library(ggplot2)
# Scatter plot with a fitted straight line
ggplot(data_flight, aes(CarrierDelay, LateAircraftDelay)) +
  geom_point() +
  geom_smooth(method = "lm")
Pearson correlation
• Measures the strength of the correlation between two features
• Correlation coefficient
• P-value
• Correlation coefficient
• Close to +1: large positive relationship
• Close to -1: large negative relationship
• Close to 0: no relationship
• P-value
• P-value < 0.001: Strong certainty in the result
• P-value < 0.05: Moderate certainty in the result
• P-value < 0.1: Weak certainty in the result
• P-value > 0.1: No certainty in the result
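Both values can be obtained with cor.test() from base R. The sketch below uses simulated data. The vectors `dep_delay` and `arr_delay` are hypothetical stand-ins for DepDelayMinutes and ArrDelayMinutes, not the course dataset.

```r
set.seed(42)

# Simulated delays in minutes; arrival delay depends on departure delay plus noise
dep_delay <- runif(200, min = 0, max = 120)
arr_delay <- 0.8 * dep_delay + rnorm(200, sd = 10)

# Pearson correlation test
result <- cor.test(dep_delay, arr_delay, method = "pearson")

# Correlation coefficient (between -1 and +1)
result$estimate

# P-value of the test
result$p.value
```

With real data, missing values can be skipped by using cor(x, y, use = "complete.obs") when only the coefficient is needed.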
Pearson Correlation
Source: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#/media/File:Correlation_examples2.svg
Example
Model development
• “Can we predict the arrival delay of a flight?”
• A model can be thought of as a mathematical equation used to predict a value given one or more other values
• It relates one or more independent variables to a dependent variable
Model development
• Usually, the more relevant data you have the more accurate your
model is.
Model Development
• Consider the following situation
• You have two flights at the same time
• The flights are from different airlines
Regression Models
SLR
• The predictor (independent) variable: x
• The target (dependent) variable: y
• y = b0 + b1*x
• b0 = the intercept
• b1 = the slope
SLR: prediction
• ArrDelayMinutes = 17.35 + 0.7523*DepDelayMinutes
• 17.35 + 0.7523*(20)
• 17.35 + 15.046
• 32.396
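The same arithmetic in R, using the intercept and slope from the slide. The function name `predict_arr_delay` is a hypothetical helper.

```r
# Coefficients taken from the estimated model on the slide
b0 <- 17.35   # intercept
b1 <- 0.7523  # slope

# Predicted arrival delay for a given departure delay (both in minutes)
predict_arr_delay <- function(dep_delay) b0 + b1 * dep_delay

predict_arr_delay(20)
```

The function is vectorized, so predict_arr_delay(c(10, 20, 30)) returns one prediction per departure delay.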
SLR: Fit
SLR
Fitting a SLR
Estimated Linear Model
Prediction
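A sketch of fitting a simple linear regression and predicting with lm() and predict(). The data frame `toy_flights` is simulated around the slide's coefficients purely for illustration. With the real data, `data_flight` would take its place.

```r
set.seed(123)

# Simulated stand-in for the flight data
toy_flights <- data.frame(DepDelayMinutes = runif(100, min = 0, max = 120))
toy_flights$ArrDelayMinutes <-
  17.35 + 0.7523 * toy_flights$DepDelayMinutes + rnorm(100, sd = 5)

# Fit the model: target ~ predictor
linear_model <- lm(ArrDelayMinutes ~ DepDelayMinutes, data = toy_flights)

# Estimated intercept (b0) and slope (b1)
coef(linear_model)

# Predict arrival delays for new departure delays
new_flights <- data.frame(DepDelayMinutes = c(12, 19, 24))
predictions <- predict(linear_model, newdata = new_flights)
predictions
```

The column name in `new_flights` must match the predictor name used in the model formula.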
Multiple Linear Regression
MLR
MLR prediction
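Multiple linear regression uses the same lm() call with more than one predictor. In this sketch the choice of LateAircraftDelay as the second predictor, the simulated data, and the coefficients used to generate it are all assumptions for illustration.

```r
set.seed(123)

# Simulated stand-in for the flight data with two predictors
toy_flights <- data.frame(
  DepDelayMinutes   = runif(100, min = 0, max = 120),
  LateAircraftDelay = runif(100, min = 0, max = 60)
)
toy_flights$ArrDelayMinutes <-
  15 + 0.75 * toy_flights$DepDelayMinutes +
  0.3 * toy_flights$LateAircraftDelay + rnorm(100, sd = 5)

# Fit the model: target ~ predictor1 + predictor2
mlr <- lm(ArrDelayMinutes ~ DepDelayMinutes + LateAircraftDelay,
          data = toy_flights)

# Estimated intercept and one slope per predictor
coef(mlr)

# Predict the arrival delay for one new flight
new_flight <- data.frame(DepDelayMinutes = 20, LateAircraftDelay = 10)
mlr_prediction <- predict(mlr, newdata = new_flight)
mlr_prediction
```

The prediction is the intercept plus each slope multiplied by the corresponding predictor value.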