Section02 ESA
Why tabular and graphical summaries? Because it is hard to see patterns and trends when you are looking at
raw data. Our goal is to create useful information from the raw data so that we can see patterns
and trends. This will help in decision making.
[Diagram: Data is Categorical or Quantitative]
Frequency distribution = Tabular summary that shows a unique list of nonoverlapping categories with the counts (frequency) for each category
The goal of a frequency distribution is to show the distribution of counts across counting categories
For categorical data the categories are a unique list of items from the column of data
For quantitative data you usually have to create lower and upper limits for each class or counting category. The counting categories must be collectively exhaustive (enough
categories so nothing is left out) and mutually exclusive (no item can fit into more than one category). Details below.
Five types of columns in a frequency distribution table:
Frequency = count for each category/class
Relative frequency = (frequency of class) / (number of observations in data set). Used to build a probability distribution based on past data (ch 4).
% Frequency = relative frequencies with percent number format applied. We will not use the method of multiplying by 100.
Cumulative Frequency = for each counting category in a grouped PivotTable, the count is made for "less than" the upper limit of class. The last class will be equal to the count of
all items in the data set.
% Cumulative Frequency = cumulative relative frequencies with percent number format applied. We will not use the method of multiplying by 100.
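The five columns can be sketched outside Excel as well. The following is a minimal Python sketch, not the PivotTable method used in class. The list `units` and the function name `frequency_table` are made up for illustration. The percent columns are the same fractions shown with a percent format, with no multiplying by 100.

```python
from collections import Counter

# Hypothetical column of discrete data (units sold per transaction).
units = [1, 6, 1, 12, 6, 1, 12, 1]

def frequency_table(values):
    """Return rows of (category, frequency, relative freq, cumulative freq, cumulative relative freq)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    running = 0
    for category in sorted(counts):          # smallest to biggest
        freq = counts[category]
        running += freq                      # running total = cumulative frequency
        rows.append((category, freq, freq / n, running, running / n))
    return rows

for category, freq, rel, cum, cum_rel in frequency_table(units):
    # The ".1%" format plays the role of Excel's percent number format:
    # the stored value stays a fraction such as 0.25.
    print(f"{category:<6} {freq:>4} {rel:>7.3f} {rel:>7.1%} {cum:>4} {cum_rel:>7.1%}")
```

The last row's cumulative frequency equals the count of all items, and its cumulative relative frequency equals 1.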
Creating classes (counting categories) for quantitative variables in a frequency distribution and histogram:
The goal is to reveal the natural distribution or shape or variation of the data. This is the "art side of statistics". It takes practice to get the hang of it.
Step 1 Determine the number of nonoverlapping classes. Goal is to have enough to show natural shape of data. One general guideline is: 2^k > n, where n = count and k = number of classes.
Step 2 Determine the width of each class with something like: approx. width = (max-min)/(number of classes). Trial and error is usually required.
Step 3 Determine the class limits, which are the lower and upper limits used in an AND Logical test to count how many values occur between the lower and upper limit. The key is to
not create classes that would double count. Once you create the class limits, you list the counting categories from smallest to biggest before you calculate the counts and make a
histogram chart. Trial and error is usually required.
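The three steps can be sketched in Python as a starting point for the trial and error. This is a sketch under assumptions: it takes the smallest k that satisfies 2^k > n, the `sales` values are made up, and rounding the width up to 50 and starting at 0 are judgment calls, not rules from these notes.

```python
def number_of_classes(n):
    """Step 1: smallest k with 2**k > n (assumes the smallest such k is wanted)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def approximate_width(values, k):
    """Step 2: approx. width = (max - min) / (number of classes)."""
    return (max(values) - min(values)) / k

def class_limits(start, width, k):
    """Step 3: list (lower, upper) limits from smallest to biggest."""
    return [(start + i * width, start + (i + 1) * width) for i in range(k)]

# Hypothetical sales amounts, n = 14.
sales = [12, 47, 3, 88, 150, 61, 199, 35, 20, 74, 5, 110, 49, 93]

k = number_of_classes(len(sales))          # 2**4 = 16 > 14, so k = 4
raw_width = approximate_width(sales, k)    # (199 - 3) / 4 = 49.0
width = 50                                 # rounded up by judgment to a convenient value
limits = class_limits(0, width, k)
print(k, raw_width, limits)
```

If the resulting classes hide the shape of the data, change k or the width and try again.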
If you have a discrete variable (or a continuous variable that is shown as a whole number) it is just a matter of getting the lower and upper limit, like: 0-9, 10-19...
If you have a continuous variable and you choose to use the upper limit from the previous class as the lower limit for the current class, be sure to include the equal sign on only
one side, either the lower or upper, but not both. Create classes like: 0 <= Sales < 20, 20 <= Sales < 40... or 0 up to 20, 20 up to 40...
When you create a set of classes, you are creating a type of category for your continuous quantitative variable
Charts are more easily interpreted if the class width is the same for all classes.
Sometimes if there are a few large values or small values, it may be efficient to create an open ended class
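The "equal sign on only one side" rule can be checked with a small Python sketch. The data and function names are made up for illustration. The first function uses lower <= value < upper. The second uses the equal sign on both sides, to show the double counting that the rule avoids.

```python
def count_in_classes(values, limits):
    """Count values with lower <= value < upper (equal sign on the lower side only)."""
    return [sum(1 for v in values if lower <= v < upper) for lower, upper in limits]

def count_with_double_equal(values, limits):
    """Incorrect on purpose: equal sign on both sides, so boundary values count twice."""
    return [sum(1 for v in values if lower <= v <= upper) for lower, upper in limits]

# Hypothetical continuous sales values, some sitting exactly on a class boundary.
sales = [0, 19.99, 20, 20, 39.5, 40, 59.99]
limits = [(0, 20), (20, 40), (40, 60)]

good = count_in_classes(sales, limits)
bad = count_with_double_equal(sales, limits)

print(good, sum(good))   # total matches the number of observations
print(bad, sum(bad))     # total is too big: values of 20 and 40 were counted twice
```

A quick check for any set of classes: the counts should add up to the number of observations in the data set.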
Column and Bar Excel charts = graphical display that compares relative differences across categories/classes.
Column chart: Height of column conveys number
Bar chart: Length of bar conveys number.
The difference between the two charts is that the bar chart, as compared to the column chart, can more forcefully emphasize differences across categories and can
accommodate longer category names.
In some statistics and math textbooks, authors will refer to column charts as bar charts. However, in Excel, column charts use vertical rectangles and bar charts use horizontal
rectangles.
Both charts are good for displaying frequency, relative frequency or % frequency
For categorical data: 1) Columns do not touch (to indicate "gap" between categories) and 2) Order of categories conveys no information.
For discrete quantitative data: 1) Columns do not touch (to indicate "gap" between categories) and 2) Order of categories is smallest to biggest to help show the distribution or
variation or pattern in data.
For continuous quantitative data: 1) Columns must touch to indicate that there is no gap between counting categories and 2) Order of categories is smallest to biggest to help show
the distribution or variation or pattern in data.
Specific types of column/bar charts:
Pareto chart = quality control categorical data plotted in a column chart with columns sorted by frequencies left to right from biggest to smallest
Used in quality control to show highest to lowest frequency of problems from left to right. Often a cumulative line is added to chart.
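The table behind a Pareto chart can be sketched in Python: count each category, sort from biggest to smallest, and add a running cumulative share for the optional cumulative line. The defect names are made up, and breaking ties alphabetically is an assumption made only so the order is repeatable.

```python
from collections import Counter

# Hypothetical quality control log: one entry per problem found.
defects = ["scratch", "dent", "scratch", "misprint", "scratch", "dent", "scratch", "crack"]

def pareto_rows(values):
    """Return (category, frequency, cumulative relative frequency), biggest frequency first."""
    counts = Counter(values)
    n = len(values)
    rows = []
    running = 0
    # Sort by frequency descending; ties are broken alphabetically (an assumption).
    for category, freq in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        running += freq
        rows.append((category, freq, running / n))
    return rows

for category, freq, cum_rel in pareto_rows(defects):
    print(f"{category:<10} {freq:>3} {cum_rel:>7.1%}")
```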
Histogram chart created from a column/bar chart = continuous quantitative data plotted in a column/bar chart using counting categories with a lower and upper limit, where
counting categories are sorted left to right from smallest to biggest and there is no gap between columns to indicate that no data can occur between the successive lower and
upper limits. This chart is used to visualize the frequencies from a frequency distribution for a continuous variable. Do not use built-in Excel Histogram Chart: it assumes a
normal distribution and does not allow you to provide the lower limit for the first class.
Correct graphical display for revealing the distribution or variation or pattern in how frequencies occur in the data set. This chart shows the shape of the data or the skew in
the data.
Histogram Notes:
Column or bar charts where columns are touching to indicate that the variable is continuous
Columns touch to indicate that no numbers can fit between classes. "No numbers can fit between columns - no gaps"
Height of columns conveys count
Order of classes is important to help reveal shape of data, or distribution of data
Skew of Histograms:
What does the distribution of histogram columns look like?
Skew left or negative means a few short histogram columns are on the low end (pull mean down)
Skew right or positive means a few short histogram columns are on the high end (pull mean up)
No skew means the distribution is bell shaped or nearly bell shaped (mean = median = mode)
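Because a few extreme values pull the mean toward them, comparing the mean to the median gives a rough hint about skew before making the histogram. The following Python sketch is a heuristic only, not a formal skewness measure. The `tolerance` parameter and the sample data are assumptions made for illustration.

```python
from statistics import mean, median

def skew_hint(values, tolerance=0.05):
    """Rough hint: mean well above median suggests right skew, well below suggests left skew.

    tolerance is a hypothetical cutoff: differences smaller than this share
    of the range (max - min) are treated as no clear skew.
    """
    m = mean(values)
    med = median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(m - med) <= tolerance * spread:
        return "no clear skew"
    return "skew right (positive)" if m > med else "skew left (negative)"

# Hypothetical data sets.
house_prices = [150, 180, 200, 210, 230, 250, 950]   # in $ thousands, one large value
test_scores = [20, 70, 75, 80, 85, 88, 90]           # one small value
symmetric = [1, 2, 3, 4, 5]

print(skew_hint(house_prices))
print(skew_hint(test_scores))
print(skew_hint(symmetric))
```

The histogram is still the correct display for judging shape. A bi-modal data set, for example, can have mean close to median without being bell shaped.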
Clustered column chart (Excel name) = graphical display for a crosstabulation. Emphasis is on comparing the categories listed in the legend. Also known as: "Side-by-side bar chart"
Stacked column (Excel name) = graphical display for a crosstabulation. Emphasis is on comparing the categories listed in the horizontal axis. Also known as: "Stacked bar chart"
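Both charts plot the counts from a crosstabulation, which is a frequency table for two variables at once. Here is a minimal Python sketch of building those counts. The records, the second variable (region), and the function name `crosstab` are made up for illustration.

```python
from collections import Counter

# Hypothetical records: (product, region) for each transaction.
records = [
    ("Aspen", "East"), ("Quad", "West"), ("Aspen", "West"),
    ("Quad", "West"), ("Carlota", "East"), ("Aspen", "East"),
]

def crosstab(pairs):
    """Return (row labels, column labels, table of counts) for two categorical variables."""
    counts = Counter(pairs)
    row_labels = sorted({row for row, _ in pairs})
    col_labels = sorted({col for _, col in pairs})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(row, col)] for col in col_labels] for row in row_labels]
    return row_labels, col_labels, table

row_labels, col_labels, table = crosstab(records)
print("         " + " ".join(f"{c:>5}" for c in col_labels))
for label, row in zip(row_labels, table):
    print(f"{label:<9}" + " ".join(f"{count:>5}" for count in row))
```

Each cell of the table becomes one column (clustered) or one segment (stacked) in the chart.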
Scatter Chart for X-Y Data = graphical display to show the relationship between two quantitative variables.
Horizontal Axis = Independent Variable = x. Vertical Axis = Dependent Variable = f(x) = y
To plot point: 1) Move along x axis, then 2) move along y axis, record point.
To use the Excel X-Y Scatter chart, the source table of data should have X values on the left of the y values, and have field names at top of each column
Use X-Y Scatter Plot Chart, not Line Chart (common mistake).
Type of relationship:
A direct or positive relationship indicates that as the x value increases, the y value tends to increase.
An indirect or negative relationship indicates that as the x value increases, the y value tends to decrease.
No relationship indicates that as x increases, it is hard to predict where the y value will be.
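The type of relationship can also be summarized with a straight-line fit (slope, intercept, and R²). The following Python sketch uses the standard least squares formulas and is not a copy of any Excel feature. The `threshold` used to call a relationship "no apparent relationship" and the sample data are assumptions made for illustration.

```python
def linear_fit(xs, ys):
    """Least squares line y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

def describe(slope, r_squared, threshold=0.1):
    """threshold is a hypothetical cutoff below which the fit explains very little."""
    if r_squared < threshold:
        return "no apparent relationship"
    return "direct (positive)" if slope > 0 else "indirect (negative)"

# Hypothetical data: X values listed with their matching Y values.
hours_studied = [1, 2, 3, 4, 5]
test_scores = [55, 62, 61, 70, 72]

slope, intercept, r_squared = linear_fit(hours_studied, test_scores)
print(f"y = {slope:.4f}x + {intercept:.3f}, R squared = {r_squared:.4f}")
print(describe(slope, r_squared))
```

A line fit only summarizes straight-line patterns, so the scatter chart should still be looked at first.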
PivotTable default settings: File, Options, in the Excel Options dialog box select Data, click Edit Default Layout button to change the settings to match your goals. I set mine to: 1) Report Layout =
Show in Tabular Form (will show Field Names in Report rather than generic "Row Labels"), 2) PivotTable Options = uncheck the check box for "Autofit Column widths on update".
Date Variable:
Number (Quantitative)
Discrete Variable (Sometimes Grouped)
But sometimes treated as Continuous, as with a Time Series Line Chart
Column Charts:
[Figure: four column charts]
1. Frequency by Product (categories: Aspen, Carlota, Quad). Categorical variable as condition for calculation = Gaps between columns/bars
2. Frequency by Product No. (categories: 100, 200, 300). Categorical number variable as condition for calculation = Gaps between columns/bars
3. Frequency by Units (categories: 1, 6, 12). Discrete number variable as condition for calculation = Gaps between columns/bars
4. Histogram: Count Sales ($) Between Lower (Included) and Upper (Not Included) Limits (categories: 0 up to 50, 50 up to 100, 100 up to 150, 150 up to 200). Continuous quantitative variable as condition for calculation = NO Gaps between columns/bars
Bar Charts:
[Figure: four bar charts with data labels]
Frequency by Product: Aspen = 5, Carlota = 3, Quad = 6
Frequency by Product No.: 100 = 5, 200 = 3, 300 = 6
Frequency by Units: 1 = 7, 6 = 4, 12 = 3
Histogram: Count Sales by Sales Category with Lower (Included) and Upper (Not Included) Limits: 0 up to 50 = 6, 50 up to 100 = 5, 100 up to 150 = 1, 150 up to 200 = 2
[Histogram: Frequency by Classes for Test Scores. Classes: 0 up to 15, 15 up to 30, 30 up to 45, 45 up to 60, 60 up to 75, 75 up to 90, 90 up to 105]
[Histogram: Frequency by Height in feet. Classes: 2 up to 3, 3 up to 4, 4 up to 5, 5 up to 6, 6 up to 7, 7 up to 8, 8 up to 9]
House Prices ($) often Skewed Right (positive) - a few large values will tend to pull the Average (Mean) up
[Histogram: Frequency by House Prices ($). Classes: $0 up to $150,000, $150,000 up to $300,000, $300,000 up to $450,000, $450,000 up to $600,000, $600,000 up to $750,000, $750,000 up to $900,000, $900,000 up to $1,050,000]
Test Scores can be bi-modal
[Histogram: Frequency by Classes for Test Scores. Classes: 0 up to 15, 15 up to 30, 30 up to 45, 45 up to 60, 60 up to 75, 75 up to 90, 90 up to 105]
[Scatter chart: Hours Studied = X. Trendline: y = 2.2498x + 54.535, R² = 0.4966]
[Scatter chart: Absences In Class = X]
[Scatter chart: Customer Age = X, Amount Spent = Y. No Apparent Relationship. Trendline: y = 0.1011x + 71.307, R² = 0.0007]
[Scatter chart: Field: # Ads on Radio During Week = X and Field: Car Sales = Y (axis in Millions) appear highly correlated. Trendline: y = 116670x - 47369, R² = 0.8012]