Business Analytics with Excel
Data Analysis Using Statistics
Learning Objectives
By the end of this lesson, you will be able to:
Create a moving average chart
Perform ANOVA to compare means of different groups
Identify relationships between variables using covariance and
correlation
Calculate regression for the given data
Create normal distribution for the given data
A Day in the Life of a Business Analyst
As a business analyst of an organization:
You are required to forecast and plan using sales data.
Along with building prediction models, you need to correlate existing data and test
hypotheses.
This lesson will help you understand the usage of statistics for data analytics and
predictions.
Introduction to Statistical Analysis
Statistical Analysis
It involves the collection, examination, summarization, manipulation, and interpretation of
quantitative data to discover underlying causes, patterns, relationships, and trends.
Need for Statistical Analysis
It reveals the overall pattern and behaviour of the data.
It is useful when you have a set of data and want to see a summary of that data set.
Statistical Analysis: Example
ABC LLC is a financial analytics and research organization that needs to determine how stock
prices are fluctuating in various emerging economies.
The firm can use the moving average tool based on
the historical records and stock market data.
This tool forecasts the price trends for any
number of days.
It predicts the trends for the upcoming month by
creating a moving average chart.
Statistical Analysis: Tools
Moving Average
Hypothesis Testing
ANOVA
Covariance
Correlation
Regression
Normal Distribution
Statistical Analysis in Excel
Excel is widely used to understand statistical concepts and perform calculations.
You provide the data and parameters for each tool.
The tool uses the appropriate statistical macro functions.
It calculates and displays the results in an output table.
It can also generate charts.
Data Analysis Command
Data analysis tools are available under the Data Analysis command on the Data tab.
The Analysis ToolPak add-in needs to be loaded if the Data Analysis command is not available.
Moving Average: Introduction
Moving Average
It evaluates data points by creating a series of averages of different subsets of the
complete dataset.
[Chart: Actual and Forecast values (0 to 16,000) plotted over periods 1 to 31]
A moving average is used to smooth out irregularities and easily recognize trends.
It is mainly used to forecast long-term trends in the data.
Moving Average can be calculated for any period of time.
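The idea of averaging successive subsets can be sketched in a few lines of Python. This is a minimal illustration, not the Excel tool itself: the `moving_average` helper and the price list are invented for the example, and the window size plays the role of the interval chosen in the Moving Average dialog.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` for the given window size."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    averages = []
    # Slide a fixed-size window over the data and average each subset.
    for end in range(window, len(values) + 1):
        subset = values[end - window:end]
        averages.append(sum(subset) / window)
    return averages


# Hypothetical daily closing prices, invented for illustration.
prices = [10, 12, 11, 13, 15, 14, 16]
smoothed = moving_average(prices, window=3)
print(smoothed)  # [11.0, 12.0, 13.0, 14.0, 15.0]
```

A larger window smooths more aggressively but reacts more slowly to recent changes.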
Assisted Practice: Create Moving Average Chart
Problem statement:
Demonstrate how to create a Moving Average chart in Excel.
Assisted Practice Guidelines
Steps to follow:
Step 1: Open the Excel file
Step 2: Moving average
Hypothesis Testing: Introduction
Hypothesis Testing
It is used to determine whether there is enough evidence in a data sample to infer that a certain
condition is true for the entire population.
Hypothesis Testing
To understand the characteristics of general population:
Take a random sample.
Analyze the properties of the sample.
Test whether the identified conclusions represent the population correctly or not.
Hypothesis Testing
A hypothesis about a population parameter is generated.
Sample statistics are used to assess the likelihood that the hypothesis is true.
Hypothesis Testing
It is formulated in terms of two hypotheses:
Null Hypothesis, which is referred to as H0, is assumed to be true unless there is strong evidence to the contrary.
Alternate Hypothesis, which is referred to as H1, is assumed to be true when the null hypothesis is false.
Hypothesis Testing
The Hypothesis Test (t-test) is used to test the null hypothesis (H0), which assumes that the means
of two populations are equal.
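As a rough sketch of what a two-sample t-test computes, the Python snippet below derives the t statistic under the assumption that both groups have equal variances. The function name and both samples are hypothetical; the resulting statistic would still need to be compared with a critical value (or converted to a p-value) to decide whether to reject H0.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (mean(sample_a) - mean(sample_b)) / std_err
    return t_stat, n_a + n_b - 2


# Hypothetical measurements for two groups, invented for illustration.
group_a = [20, 22, 19, 24, 25]
group_b = [28, 30, 27, 32, 33]
t_stat, dof = pooled_t_statistic(group_a, group_b)
print(round(t_stat, 3), dof)
```

A t statistic far from zero (relative to the critical value for the given degrees of freedom) is evidence against the null hypothesis of equal means.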
Assisted Practice: How to use Hypothesis Testing
Problem statement:
Demonstrate how to use Hypothesis Testing to test the Null Hypothesis for two variables.
Assisted Practice Guidelines
Steps to follow:
Step 1: Open the Excel file
Step 2: Hypothesis testing
ANOVA
ANOVA
It is a statistical method that stands for analysis of variance.
The logic behind this analysis is to identify variance in the population.
ANOVA is a collection of statistical methods
used to compare the means of different
groups.
T-Test
The t-test helps analyze variance between two groups only.
ANOVA helps test the Null Hypothesis of two or more groups.
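A single-factor ANOVA boils down to one F statistic: the variance between the group means divided by the variance within the groups. The sketch below computes it by hand in Python; the helper name and the three small groups are assumptions made for this example.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df between, df within) for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    # Variation of each group mean around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical observations for three groups, invented for illustration.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(round(f_stat, 3), df_between, df_within)
```

The F statistic is then compared with the critical value (F crit) for the same degrees of freedom, which Excel reports alongside F in its ANOVA output.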
Assisted Practice: How to use ANOVA
Problem statement:
Demonstrate how to use ANOVA to test the Null Hypothesis for two or more variables.
Assisted Practice Guidelines
Steps to follow:
Step 1: Open the Excel file
Step 2: ANOVA testing
Covariance
Covariance: Introduction
Covariance determines the relationship between two random variables and
how they change together.
Covariance: Types
Let us suppose that X and Y are two random variables.
Positive Covariance
If variable X increases as Y increases or X decreases as Y decreases, then covariance is positive.
[Chart: X and Y plotted for observations 1 to 10]
X  12 13 1 2 10 1 4 10 8 10
Y  48 52 4 8 40 4 16 40 32 40
Covariance: Types
Negative Covariance
If variable X decreases as Y increases or X increases as Y decreases, then covariance is negative.
[Chart: X and Y plotted for observations 1 to 10]
X  15 16 4 5 13 4 7 13 11 13
Y  $38 $20 $85 $82 $46 $85 $70 $46 $65 $46
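The sign of the covariance for the two X/Y data sets shown in the slides can be checked with a short Python sketch. The `sample_covariance` helper is written for this example and divides by n - 1 (sample covariance); dividing by n instead would give the population covariance.

```python
from statistics import mean


def sample_covariance(xs, ys):
    """Return the sample covariance (n - 1 denominator) of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = mean(xs), mean(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


# Values taken from the positive covariance slide.
x_pos = [12, 13, 1, 2, 10, 1, 4, 10, 8, 10]
y_pos = [48, 52, 4, 8, 40, 4, 16, 40, 32, 40]

# Values taken from the negative covariance slide (dollar signs dropped).
x_neg = [15, 16, 4, 5, 13, 4, 7, 13, 11, 13]
y_neg = [38, 20, 85, 82, 46, 85, 70, 46, 65, 46]

cov_pos = sample_covariance(x_pos, y_pos)
cov_neg = sample_covariance(x_neg, y_neg)
print(cov_pos > 0, cov_neg < 0)  # True True
```

Only the sign is easy to interpret here; the magnitude depends on the units of X and Y, which is why correlation is often preferred for comparing strength.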
Assisted Practice: How to use Covariance
Problem statement:
Demonstrate how to use Covariance in Excel.
Assisted Practice Guidelines
Steps to follow:
Step 1: Open the Excel file
Step 2: Use Covariance
Correlation
Correlation: Introduction
Correlation is a statistical measure that indicates the extent
to which two or more variables fluctuate together.
Correlation Coefficient
The correlation coefficient tells us how strongly two variables are related to
each other and it has a value between -1 and +1.
A correlation coefficient with value +1 indicates
a perfect positive correlation.
Correlation Coefficient
In Excel, the CORREL function is used to calculate correlation.
A correlation coefficient with value -1 indicates a perfect negative correlation.
A correlation coefficient with value 0 indicates no correlation.
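To make the -1, 0, and +1 cases concrete, here is a small Python sketch of the Pearson correlation coefficient. The function and the three tiny data sets are invented for illustration; they are chosen so that each extreme case appears once.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5]
r_positive = pearson_correlation(xs, [2, 4, 6, 8, 10])   # y rises with x
r_negative = pearson_correlation(xs, [10, 8, 6, 4, 2])   # y falls as x rises
r_none = pearson_correlation(xs, [2, 1, 0, 1, 2])        # no linear trend
print(round(r_positive, 6), round(r_negative, 6), round(abs(r_none), 6))
```

The last data set shows that a coefficient near 0 only rules out a linear relationship: y still depends on x there, just not along a straight line.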
Assisted Practice: How to use Correlation
Problem statement:
Demonstrate how to use Correlation in Excel.
Assisted Practice Guidelines
Steps to follow:
Step 1: Open the Excel file
Step 2: Use Correlation
Regression
Regression: Introduction
Regression is a statistical method for determining the strength of a relationship between one
dependent variable and a set of independent variables that change over time.
Assisted Practice: How to use Regression
Problem statement:
Demonstrate how to use Regression to determine relationships between variables.
Assisted Practice Guidelines
Steps to follow:
Step 1: Open the Excel file
Step 2: Use Regression
Multiple Linear Regression
Simple Linear Regression
Simple Linear Regression (SLR) tries to find a linear representation between two variables x and y.
y = function(x)
Simple Linear Regression
A linear relation of the temperature and number of ice creams sold can be observed using a
scatter plot.
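A least-squares line for the temperature and ice cream example can be sketched as follows. The numbers are hypothetical (the slide does not list its data), and `fit_line` is a helper written for this illustration using the usual closed-form formulas for slope and intercept.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical temperatures (degrees Celsius) and ice creams sold per day.
temps = [20, 22, 24, 26, 28, 30]
sales = [110, 125, 138, 155, 166, 182]

intercept, slope = fit_line(temps, sales)
predicted_at_25 = intercept + slope * 25
print(round(slope, 3), round(intercept, 3), round(predicted_at_25, 1))
```

The slope estimates how many additional ice creams are sold per extra degree, within the range of temperatures observed.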
Multiple Linear Regression
Multiple Linear Regression (MLR) tries to find the relationship between multiple independent x’s and
a single dependent y.
Source: https://medium.com/analytics-vidhya/new-aspects-to-consider-while-moving-from-simple-linear-regression-to-multiple-linear-regression-dad06b3449ff
Multiple Linear Regression
The approach is to build a fitting line in n-dimensional space to:
• Explain the effects of the independent variables on the y variable.
• Predict the y value for a new set of x values.
Multiple Linear Regression
The data is fit into the following equation:
y = β0 + β1x1 + β2x2 + … + βixi + e
Where:
• y: dependent or resultant variable
• x1, x2, x3, …, xi: independent variables
• β0: constant term in the equation
• βi: slope coefficient for each independent variable
• e: error term
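One way to see how the coefficients β0 … βi are obtained is to solve the least-squares normal equations directly. The Python sketch below does this with plain Gaussian elimination; both helper functions and the data are assumptions made for this example, and the data is generated without noise from y = 1 + 2·x1 + 3·x2 so the fit should recover those coefficients.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bi] minimizing the squared error of y = b0 + sum(bi * xi)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 models the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefficients = fit_multiple_regression(rows, ys)
print([round(c, 6) for c in coefficients])
```

With real data the error term e is not zero, so the recovered coefficients are estimates rather than exact values.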
Multiple Linear Regression
A multiple linear regression model can be built using Excel with at least 30 data points.
The mathematical equation with the coefficients is derived instantly and used to predict new
values.
Multiple Linear Regression
Consider the boston_housing.csv as the input data to build our model.
Multiple Linear Regression
The data set contains 13 independent variables which define the dependent variable MEDV.
MEDV is the median value of a house in Boston according to
the data provided.
Multiple Linear Regression
A model built using this data can be used to predict the median value of a new house with the
attributes of the house.
Multiple Linear Regression
The meaning of each attribute is given in the Column description tab.
Create a Linear Regression Model
Choose the complete data after checking for any junk data.
Click Data Analysis on the Data tab.
If it does not appear, click File > Options > Add-ins, select Excel Add-ins in the Manage box, and click Go.
Select the Analysis ToolPak check box to enable Data Analysis on the Data tab.
Choose Regression from the Data Analysis dialog box.
• Under Regression, choose the rows and columns for the X range and the column for the Y range.
• Select the Labels check box and set the Confidence Level to 95%.
The results appear in a new worksheet, showing the regression output for the chosen data set.
Linear Regression Model
R-squared indicates how much of the variance of y is explained by all the x’s.
The closer it is to 1.0, the better the model fit.
The intercept coefficient is β0 in the multiple regression equation.
The other coefficients are βi in the multiple regression equation.
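R-squared can be computed from the actual values and the values predicted by the regression equation. The following Python sketch shows the calculation; the helper name and the four data points are hypothetical.

```python
def r_squared(actual, predicted):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Hypothetical actual values and model predictions, invented for illustration.
actual = [3, 5, 7, 9]
predicted = [2.8, 5.1, 7.2, 8.9]

score = r_squared(actual, predicted)
print(round(score, 3))  # 0.995
```

A value of 1.0 would mean the predictions match the actual values exactly; a value near 0 means the model explains little more than the mean of y does.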
Linear Regression Model
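Standard error measures how far the actual values typically deviate from the values on the line of best fit.
P-value indicates how significant the effect of a feature on the dependent variable is; a small value
(commonly below 0.05) suggests that the feature is significant.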
Linear Regression Model
From the results it is understood that:
• The most and least important features for determining the median price of the house
can be identified.
• The value of y can be determined by using the
equation with a new set of x values.
Logistic Regression
Logistic Regression
It is an algorithm for classification problems.
Though the name has the word regression, it is not a regression algorithm.
Logistic Regression
We have seen the following equation in linear regression:
y=β0 + β1x1+ β2x2 +… + βixi + e
This equation cannot be used directly because:
• The value of y is not a ln(odds) value
• The dependent variable y represents classes
• y is no longer a continuous variable, unlike in regression
• ln(odds) can instead help to arrive at a similar equation
Logistic Regression
The linear regression equation can be reused for logistic regression:
• By converting the y value in the classification problem to the ln(odds) value of the event
• ln(odds(E)) = β0 + β1x1 + β2x2 + … + βixi
Odds of Event
Odds of event (E) is defined as the probability of E happening divided by the probability of E not
happening.
odds(E) = P(E) / (1 - P(E))
• The resulting probability P(E) is then converted to categorical values.
• Example: If P(E) <= 0.5, then it is negative, or else it is positive.
Sigmoid Equation
If we solve for P(E) using the two odds equations, we get:
• P(E) = 1 / (1 + e^-(β0 + β1x1 + β2x2 + … + βixi))
• The equation in this form is called the sigmoid equation.
• Example: If you take a numeric value of Y, it converts it into
a probability value between 0 and 1.
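The relationship between probability, odds, ln(odds), and the sigmoid can be verified numerically. In this minimal Python sketch, `odds` and `sigmoid` are helper names chosen for the example; applying the sigmoid to ln(odds) should give back the original probability.

```python
import math


def odds(probability):
    """Return the odds of an event: P(E) / (1 - P(E))."""
    return probability / (1 - probability)


def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))


p = 0.8
log_odds = math.log(odds(p))   # ln(odds(E)), the left side of the logistic equation
recovered = sigmoid(log_odds)  # solving back for P(E)

print(round(odds(p), 6), round(recovered, 6))  # 4.0 0.8
print(sigmoid(0))                              # 0.5
```

A ln(odds) of 0 corresponds to a probability of 0.5, which is why 0.5 is a natural cut-off between the two classes.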
Logistic Regression in Excel
To perform logistic regression in Excel, a multiple regression equation is used, which is created by using the Data
Analysis add-in.
• It forms the equation of P(E), and
• Segregates the target values based on P(E)
Logistic Regression in Excel
When new data is given to the model, P(E) is calculated, and the target value is derived.
Steps to Derive Target Value
These are the steps to derive target values.
Step 1: Data items are encoded to numeric values
Step 2: The target values are encoded to numeric values
Step 3: Use the Data Analysis add-in to calculate the intercept and coefficients
Step 4: The linear regression equation is evaluated for each data row. The result can be called y.
Step 5: P(E) is calculated as 1 / (1 + e^(-y))
Step 6: A rule is applied on P(E) to get the target values
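Steps 1 and 4 to 6 can be traced in a short Python sketch. Everything specific here is an assumption for illustration: the intercept and coefficients stand in for the values that Step 3 would produce in Excel, the two features (age and home ownership) are invented, and the 0.5 cut-off is one possible rule for Step 6.

```python
import math

# Hypothetical output of Step 3 (intercept and one coefficient per feature).
INTERCEPT = -4.0
COEFFICIENTS = [0.05, 1.5]

# Step 1: encode a categorical data item as a number (hypothetical encoding).
OWNS_HOME = {"no": 0, "yes": 1}


def predict_probability(features):
    """Steps 4 and 5: evaluate the linear equation, then apply the sigmoid."""
    y = INTERCEPT + sum(b * x for b, x in zip(COEFFICIENTS, features))
    return 1 / (1 + math.exp(-y))


def classify(features, threshold=0.5):
    """Step 6: P(E) above the threshold is positive (1), otherwise negative (0)."""
    return 1 if predict_probability(features) > threshold else 0


younger = [40, OWNS_HOME["yes"]]  # y = -4 + 2.0 + 1.5 = -0.5
older = [60, OWNS_HOME["yes"]]    # y = -4 + 3.0 + 1.5 = 0.5

print(classify(younger), classify(older))  # 0 1
```

Note that coefficients fitted by ordinary linear regression are only an approximation of a true logistic fit, which is normally estimated by maximum likelihood.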
Normal Distribution
Normal Distribution: Introduction
All normal distributions are symmetric and have bell-shaped curves
with a single peak.
Create Normal Distribution
Normal distribution helps find the probability distribution for various variables such as rainfall,
height, weight, manufacturing error, weight error, and test scores.
Normal Distribution Curve
The mean, where the peak of the density occurs
The standard deviation, which indicates the spread of the bell curve
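The two parameters are enough to generate the points of the bell curve. The Python sketch below uses `statistics.NormalDist` from the standard library; the mean of 70 and standard deviation of 10 are hypothetical test-score parameters chosen for the example.

```python
from statistics import NormalDist

# Hypothetical test scores: mean 70, standard deviation 10.
scores = NormalDist(mu=70, sigma=10)

# Evaluate the density at evenly spaced points from mean - 3 sd to mean + 3 sd.
points = [scores.mean + step * scores.stdev / 2 for step in range(-6, 7)]
curve = [(x, scores.pdf(x)) for x in points]

# The highest point of the curve sits at the mean.
peak_x, peak_density = max(curve, key=lambda pair: pair[1])
print(peak_x)  # 70.0
```

Plotting the `curve` pairs as a line chart gives the bell shape; a larger standard deviation would make it wider and flatter.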
Normal Distribution: Empirical Rule
All normal density curves satisfy the Empirical Rule (68-95-99.7 Rule) in Statistics.
68% of the observations fall within 1 standard deviation of the mean, i.e. between Mean - Standard Deviation and Mean + Standard Deviation.
95% of the observations fall within 2 standard deviations of the mean, i.e. between Mean - 2*Standard Deviation and Mean + 2*Standard Deviation.
99.7% of the observations fall within 3 standard deviations of the mean, i.e. between Mean - 3*Standard Deviation and Mean + 3*Standard Deviation.
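The three percentages can be checked with the cumulative distribution function. This Python sketch uses `statistics.NormalDist`; the mean and standard deviation are arbitrary illustrative values, since the rule holds for any normal distribution.

```python
from statistics import NormalDist


def share_within(dist, k):
    """Return the fraction of a normal distribution within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


# Hypothetical parameters; any mean and standard deviation give the same shares.
dist = NormalDist(mu=100, sigma=15)

for k in (1, 2, 3):
    print(k, round(share_within(dist, k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```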
Assisted Practice: Create Normal Distribution graph
Problem statement:
Demonstrate how to create a Normal Distribution graph in Excel.
Assisted Practice Guidelines
Steps to follow:
Step 1: Open the Excel file
Step 2: Create Normal Distribution
Key Takeaways
A Moving Average evaluates data points by creating a series of
averages of different subsets of the complete dataset.
Hypothesis Testing is used to test the null hypothesis.
ANOVA is a collection of statistical methods used to compare the
means of different groups.
Covariance determines the relationship between two random
variables and how they change together.
Correlation is a statistical measure that indicates the extent to which
two or more variables fluctuate together.
Regression is a statistical measure that determines the strength of
the relationship between one dependent variable and a series of
other changing variables.
All Normal Distributions are symmetric and have bell-shaped
curves with a single peak.
Knowledge Check
Knowledge Check 1
Which of the following statistical methods is used to analyze variance between more than two groups?
A. Hypothesis Testing
B. Histogram
C. ANOVA
D. Covariance
The correct answer is C
ANOVA is used to analyze variance between more than two groups.
Knowledge Check 2
What conclusion will you derive for the Null Hypothesis if “F > F crit” in ANOVA testing?
A. The Null Hypothesis is not rejected
B. The Null Hypothesis is rejected
C. There is no relationship with Hypothesis Testing
D. None of the above is correct
The correct answer is B
In ANOVA testing if “F > F crit,” then the Null Hypothesis is rejected.
Knowledge Check 3
The Null Hypothesis means that the mean/average of two populations is equal.
A. True
B. False
The correct answer is A
The Null Hypothesis (H0) means that the mean/average of two populations is equal.
Knowledge Check 4
Which of the following is indicated if the Correlation Coefficient value is +1?
A. Perfect Positive Correlation
B. Zero Correlation
C. Perfect Negative Correlation
D. No Correlation
The correct answer is A
The Correlation Coefficient value of +1 indicates Perfect Positive Correlation.
Knowledge Check 5
Which statistical measure determines the strength of the relationship between a dependent variable and an independent variable?
A. Histogram
B. Hypothesis Testing
C. Moving Average
D. Regression
The correct answer is D
Regression determines the strength of the relationship between a dependent variable and an independent variable.
Knowledge Check 6
What are the mandatory fields required while creating a Normal Distribution curve?
A. Mean and Standard Deviation
B. Mean and Maximum value
C. Maximum and Minimum value
D. Standard Deviation and Minimum Value
The correct answer is A
To create a Normal Distribution curve, we need to specify two quantities: the mean, where the peak of the density
occurs, and the standard deviation, which indicates the spread of the bell curve.