BSTAT HANDOUTS - DESCRIPTIVE ONLY Handouts 3
LA SALLE
Yu An Log College of Business and Accountancy
HANDOUTS 3
Recall: Statistics involves a body of techniques and procedures dealing with the collection,
organization, analysis, interpretation, and presentation of information that can be stated
numerically.
Summarizing data involves using statistical tools and procedures appropriate for answering a research
problem or objective.
Measure – a numerical representation of a particular characteristic (variable of the study) of the group
being studied
Parameter – A measure calculated from the population; usually represented by letters of the Greek
alphabet
Statistic – A measure calculated from the sample; usually represented by letters of the English alphabet
Remark: Since “sex” is a qualitative variable and the codes 0 and 1 represent nominal data, it is
not appropriate to treat the codes as numbers with values. Applying arithmetic operations
such as addition and division to get an “average sex” makes no sense for a qualitative
variable. Instead, use the proportion (or percentage) of males (or females) in the group.
Say, “Two out of 10 students are male,” or “Twenty percent of the students are male.”
Quantitative data are usually summarized in terms of the center and spread of the distribution.
The center of the distribution can be identified using an appropriate measure of central
tendency or location.
LEONARES, S. R. 1
MEASURES OF CENTRAL TENDENCY OR LOCATION (AVERAGES)
(ARITHMETIC) MEAN
computed by summing all the data values in the sample or population and dividing the sum by
the number of observations (usually referred to as “average”)
Most important measure representing the center of the distribution if the distribution is
symmetric
data must be at least interval
Most stable measure of location, especially for large data sets
When n is small, the mean is very sensitive to extreme values
Differentiate between the population and sample means by their symbols:
Population Mean: $\mu = \frac{\sum x_i}{N}$, where $x_i$ is the ith score or observation, and N is the number
of observations in the population (the parameter is $\mu$, the Greek letter “mu”)
Sample Mean: $\bar{x} = \frac{\sum x_i}{n}$, where $x_i$ is the ith score or observation, and n is the number of
observations in the sample (the statistic is $\bar{x}$, read as “x-bar”)
Why differentiate between $\mu$ and $\bar{x}$: if the research procedure is a population study, then
a population symbol (parameter) must be used; if it is a sample study, then a sample
symbol (statistic) must be used. This will be a very important distinction in inferential
statistics.
That is why it is important to determine at the beginning of the research process whether you
will be doing a population or a sample study, since this has a bearing on the
notations/symbols used for parameters or statistics.
Example 1: During a particular summer month, the eight salespeople in an appliance store sold the
following number of central air-conditioning units: 8, 11, 5, 14, 8, 11, 16, 11. Considering this month as
the statistical population of interest, the mean number of units sold is
$\mu = \frac{\sum x_i}{N} = \frac{84}{8} = 10.5$ central a/c units
Why $\mu$? Because the problem stated that the month should be considered as the statistical population of
interest.
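The computation in Example 1 can be sketched in Python as a check on the arithmetic. This is only an illustration; the names `units_sold` and `arithmetic_mean` are chosen here and are not part of the handout.

```python
# Units sold by the eight salespeople in Example 1 (treated as a population).
units_sold = [8, 11, 5, 14, 8, 11, 16, 11]


def arithmetic_mean(values):
    """Sum of all data values divided by the number of observations."""
    return sum(values) / len(values)


population_mean = arithmetic_mean(units_sold)
print(f"Sum = {sum(units_sold)}, N = {len(units_sold)}, mean = {population_mean}")
```

The same function applies to a sample; only the symbol used to report the result changes.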
WEIGHTED MEAN
$\mu_w$ or $\bar{X}_w = \frac{\sum wX}{\sum w}$
Operationally, each value in the group (X) is multiplied by the appropriate weight factor (w), and
the products are then summed and divided by the sum of the weights.
Example 2: In a multiproduct company, the profit margins for the company’s four product lines during
the past fiscal year were: line A, 4.2percent; line B, 5.5 percent; line C, 7.4 percent; and line D, 10.1
percent.
$\mu = \frac{\sum x}{N} = \frac{27.2}{4} = 6.80\%$
However, unless the four product lines are equal in sales, this unweighted average is incorrect. Using each
line's sales total as its weight (the sales totals are not all equal), the weighted mean correctly describes the
overall average:
$\mu_w = \frac{\sum wX}{\sum w} = \frac{303{,}300{,}000}{58{,}000{,}000} \approx 5.23\%$
Remark: The weighted mean is used in computing final grades when the subjects do not all carry the
same number of units. Each grade is multiplied by the number of units of the
subject, and the sum of the (grade x no. of units) products is divided by the total number of
units taken.
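A minimal Python sketch of the weighted mean follows. The per-line sales figures are not given in the text, so the values in `assumed_sales` are an assumption, chosen only because they reproduce the quoted totals of 58,000,000 (sum of weights) and 303,300,000 (sum of weight x margin). The function name `weighted_mean` is likewise illustrative.

```python
# Profit margins (in percent) for product lines A to D, from Example 2.
margins = [4.2, 5.5, 7.4, 10.1]

# ASSUMED sales per product line (not stated in the handout text); picked so
# that the totals match the quoted 58,000,000 and 303,300,000.
assumed_sales = [30_000_000, 20_000_000, 5_000_000, 3_000_000]


def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    weighted_sum = sum(w * x for w, x in zip(weights, values))
    return weighted_sum / sum(weights)


unweighted = sum(margins) / len(margins)
weighted = weighted_mean(margins, assumed_sales)
print(f"Unweighted mean: {unweighted:.2f}%")
print(f"Weighted mean:   {weighted:.2f}%")
```

The same function handles the final-grade case: pass the grades as `values` and the number of units per subject as `weights`.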
MEDIAN
Population Median: $\tilde{\mu}$ (read “mu-tilde”)
Sample Median: $\tilde{x}$ (read “x-tilde”)
Example 3: The eight salespeople described in Example 1 sold the following number of central air-
conditioning units, in ascending order: 5, 8, 8, 11, 11, 11, 14, 16. Find the median.
$\tilde{\mu} = \frac{11 + 11}{2} = 11$ central a/c units
Since the number of data values is even (N = 8), then the value of the median is the mean of the two
middle values, which are the fourth and fifth values in the ordered group. Both these values equal “11”
in this case, so adding the two 11’s and dividing by 2 gives the median which is equal to 11. Note that
there is an equal number of data points below and above the median (5, 8, 8, 11 are below; 11, 11, 14,
16 are above).
Example 4: The reaction times for a random sample of 9 subjects to a stimulant were recorded as 2.5,
3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, and 3.4 seconds. Calculate the median.
First form the array: 2.3, 2.5, 2.6, 2.9, 3.1, 3.4, 3.6, 4.1, 4.3
Since there are 9 data values (odd), then there will only be one middle value.
𝑥̃ = 3.1 seconds
NOTE: When the problem does not specifically indicate whether the group involved is a sample
or population, treat the data set as a sample.
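The median procedure used in Examples 3 and 4 (sort the data, then take the middle value or the mean of the two middle values) can be sketched as follows. The function name `median` and the variable names are illustrative choices.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


units_sold = [8, 11, 5, 14, 8, 11, 16, 11]                      # Example 3 (N = 8, even)
reaction_times = [2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 3.4]  # Example 4 (n = 9, odd)

print(median(units_sold))      # mean of the 4th and 5th ordered values
print(median(reaction_times))  # the 5th ordered value
```

Python's standard library offers `statistics.median`, which follows the same rule.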
Recall Example 1:
During a particular summer month, the eight salespeople in an appliance store sold the following number
of central air-conditioning units: 8, 11, 5, 14, 8, 11, 16, 11. Considering this month as the statistical
population of interest,
a. the mean number of units sold is
$\mu = \frac{\sum x_i}{N} = \frac{84}{8} = 10.5$ central a/c units
b. the median number of units sold is
$\tilde{\mu} = \frac{11 + 11}{2} = 11$ central a/c units
Dot plot (units sold, on an axis from 5 to 16): The mean and median are relatively close to each other.
The mean and the median values would be considered to be good representatives of the
data set since they are located in the center of the distribution (where the points are).
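Now suppose the highest value, 16, were 160 instead (the value implied by the new sum of 228).
Then the last point of the dot plot would be very far from the rest of the points (an extremely high
value); such a point can also be called an outlier.
Then:
$\mu = \frac{\sum x_i}{N} = \frac{228}{8} = 28.5$ central a/c units
$\tilde{\mu} = \frac{11 + 11}{2} = 11$ central a/c units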
The resulting value of the mean is not found at the center of where the points are
(28.5 is far from the majority of the points), while the median remains the same.
The value of the mean is affected if there are extreme values in the distribution, hence, it
cannot be used to represent the distribution if the shape is skewed. That is why, one
condition for its use as a representative value is that the shape must be symmetric.
On the other hand, the median has not changed, because only the middle value (if n is
odd) or the mean of the two middle values (if n is even) is used; the extreme value is not
used in determining the median. Therefore, the median is a better representative value if
the shape of the distribution is skewed.
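The effect of an extreme value on the mean and the median can be checked with a short sketch. It assumes the outlier arises by replacing the highest value, 16, with 160, which is the single replacement consistent with the sum of 228 used in the computation; the variable names are illustrative.

```python
from statistics import mean, median

original = [5, 8, 8, 11, 11, 11, 14, 16]
# ASSUMPTION: the highest value 16 becomes 160, which gives the sum of 228.
with_outlier = [5, 8, 8, 11, 11, 11, 14, 160]

for label, data in [("original", original), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean = {mean(data)}, median = {median(data)}")
```

Only the mean moves, because the median depends only on the two middle values of the ordered data.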
MODE
Value in the data set which has the highest frequency (occurs most often)
Can be applied to any measurement level
May not exist (the data set may not have a mode if all the values occur with the same frequency)
May not be unique, if it exists (a data set may have more than one value with the same
highest frequency)
Related to the concept of a peak or peaks in the frequency distribution
Unimodal – one peak
Bimodal – two peaks, etc.
Population Mode: Mo
Sample Mode: mo
Example 5: The eight salespeople described in Example 1 sold the following number of central air-
conditioning units: 8, 11, 5, 14, 8, 11, 16, and 11. Find the mode.
Example 6: The reaction times for a random sample of 9 subjects to a stimulant were recorded as 2.5,
3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, and 3.4 seconds. Find the mode.
Since all values occur only once (they have the same frequency), then this distribution has
no mode or we say that the mode does not exist.
This is different from saying that the mode is 0 (why?)
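The rules for the mode (highest frequency, possibly several modes, possibly none) can be sketched as below. The function name `modes` and the convention of returning an empty list when no mode exists are choices made for this illustration.

```python
from collections import Counter


def modes(values):
    """Return every value with the highest frequency, or [] when no mode exists."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all occur equally often, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


units_sold = [8, 11, 5, 14, 8, 11, 16, 11]                      # Example 5
reaction_times = [2.5, 3.6, 3.1, 4.3, 2.9, 2.3, 2.6, 4.1, 3.4]  # Example 6

print(modes(units_sold))      # the value that occurs three times
print(modes(reaction_times))  # empty list: the mode does not exist
```

An empty list signals "no mode", which is not the same as a list containing the value 0.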
Note that the shape of the distribution is important in choosing the most appropriate measure of central
tendency (and in other measures and tests as well). When there is no graph to base the shape on,
comparing the mean and median values will indicate it:
   mean = median: symmetric
   mean > median: positively skewed (skewed to the right)
   mean < median: negatively skewed (skewed to the left)
Notes: 1. Since the mode does not always exist, it is just the mean and the median that are compared.
2. A positively skewed distribution indicates that the values mostly cluster on the lower half of
the distribution but there are few extremely high values. When the mean is computed, these
high values influence the value of the computed mean and pull its value away from the center
towards where the extremely high values are. On the other hand, the median is not affected
by extreme values, so it stays closer to where most of the values are. That is why, for a
positively skewed distribution, the median is a better representative value than the mean.
3. A negatively skewed distribution has the majority of the data clustering on the upper half of the
distribution but there are a few extremely low values. For the same reason as in the positively
skewed distribution, the mean is pulled towards where the few extremely low values are.
The median is the better representative value compared to the mean.
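The comparison of mean and median described in the notes can be sketched as a small function. The name `describe_shape` and its `tolerance` parameter (an allowance for treating nearly equal values as equal) are hypothetical additions for illustration and are not part of the handout.

```python
from statistics import mean, median


def describe_shape(values, tolerance=0.0):
    """Classify skewness by comparing the mean with the median."""
    difference = mean(values) - median(values)
    if abs(difference) <= tolerance:
        return "symmetric"
    return "positively skewed" if difference > 0 else "negatively skewed"


print(describe_shape([5, 8, 8, 11, 11, 11, 14, 160]))  # mean 28.5 > median 11
print(describe_shape([5, 8, 8, 11, 11, 11, 14, 16]))   # mean 10.5 < median 11
print(describe_shape([1, 2, 3, 4, 5]))                  # mean 3 = median 3
```

With real data the mean and median are rarely exactly equal, so in practice a small difference is usually read as roughly symmetric; the tolerance argument is one way to express that judgment.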
READ: https://www.khanacademy.org/math/ap-statistics/quantitative-data-ap/describing-comparing-distributions/v/classifying-distributions
EXERCISES: Show complete solutions. For each item, identify the following needed information:
a. determine whether the data set constitutes a population or sample.
b. identify the variable of the problem (label this as X)
1. The following are scores of 50 high school students in a 150-item achievement test in Mathematics.
2. According to a survey, the average person spends 45 minutes a day listening to recorded music. The
following data were obtained for the number of minutes spent listening to recorded music for a
sample of 30 individuals.
88.3 4.3 4.6 7.0 9.2
0.0 99.2 34.9 81.7 0.0
85.4 0.0 17.5 45.0 53.3
29.1 28.8 0.0 98.9 64.5
4.4 67.9 94.2 7.6 56.6
52.9 145.6 70.4 65.1 63.6
a. Compute the mean.
Do these data appear to be consistent with the average reported by the newspaper? Explain
your answer.
3. During a 30-day period, the daily numbers of cars rented out by a car rental company were as follows:
7 10 6 7 9 4 7 9 9 8
5 5 7 8 4 6 9 7 12 7
9 10 4 7 5 9 8 9 5 7
b. If the break-even point for the company is 8 cars per day, is the company doing well? Explain.
4. Find the preferred measure of central location for the sample whose observations 18, 10, 11, 98, 22,
15, 11, 25, and 17 represent the number of automobiles sold during this past month by 9 different
automobile agencies. Justify your choice.
5. For a sample of 15 students at an elementary-school snack bar, the following sales amounts arranged
in ascending order of magnitude are observed: Php10, 10, 25, 25, 27, 30, 33, 35, 40, 43, 45, 45, 50, 55,
60.
a. Determine the mean, median, and mode for these sales amounts.
b. How would you describe the distribution from the standpoint of skewness?
6. The following table shows the percentage of defective items in an assembly department. Determine
the overall percentage defective of all items assembled during the sampled week.
7. The average IQ of 10 students in a mathematics course is 114. If 9 of the students have IQs of 101,
125, 118, 128, 106, 115, 99, 118, and 109, what must be the other IQ?
8. What is the average for a student who received grades of 85, 76, and 82 on 3 tests and a 79 on the
final examination in a certain course if the final examination counts three times as much as each of
the 3 tests?
INTRODUCTION TO VARIABILITY
Consider the following two sets (male and female) of number of bottles of soft drink consumed in a week:
A 3 4 5 6 8 9 10 12 15
B 3 7 7 7 8 8 8 9 15
Set    n    $\bar{x}$    $\tilde{x}$
A
B
Describe the two sets with respect to the two measures: _______________________________________
_____________________________________________________________________________________
Remarks:
The measures of central location do not give an adequate description of a given distribution if the
purpose is to differentiate between the two using measures (the two sets have the same mean
and median)
Set    n    $\bar{x}$       $\tilde{x}$
A      9    8 bottles    8 bottles
B      9    8 bottles    8 bottles
The two measures do not describe how the observations spread out from the average
Consider the dot plot of the two sets (Set B above the line; set A below), drawn on an axis from 3 to 15.
The dot plot shows that the points of B are more closely clustered about the center, while the
points of A are scattered, yet they have the same mean and median
Therefore, there is a need to use a measure that will differentiate between the two distributions
in terms of how they are scattered/dispersed
MEASURES OF VARIATION
RANGE
difference in value between the highest (maximum) and the lowest (minimum) observation
can be computed very quickly
but not very useful because it considers only the extremes
does not take into consideration the bulk of the observations.
The range is used when:
1. the data are too scant or too scattered to justify the computation of a more precise measure
of variability.
2. a knowledge of extreme scores or a total spread is all that is wanted.
RA = 15 – 3 = 12 bottles
RB = 15 – 3 = 12 bottles
this example shows an instance wherein range values are not able to differentiate between set
A and set B, although the dot plots present different “stories”
there is a need to have a measure that will be able to truly distinguish between the two sets
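A quick sketch of the range for the two sets of soft drink data shows why it cannot tell them apart. The names `set_a`, `set_b`, and `value_range` are illustrative.

```python
# Bottles of soft drink consumed in a week (sets A and B from the introduction).
set_a = [3, 4, 5, 6, 8, 9, 10, 12, 15]
set_b = [3, 7, 7, 7, 8, 8, 8, 9, 15]


def value_range(values):
    """Highest observation minus lowest observation."""
    return max(values) - min(values)


print(f"Range of A: {value_range(set_a)} bottles")
print(f"Range of B: {value_range(set_b)} bottles")
```

Both sets share the same minimum and maximum, so the range is identical even though the values in between are spread differently.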
STANDARD DEVIATION
most important and most commonly used measure of variation, together with the mean as a
measure of central tendency
a measure of variability that is based on the difference between the value of each observation (xi)
and the mean
difference between each xi and the mean is called a deviation about the mean
Definitional formula for the sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Raw score formula for the population standard deviation: $\sigma = \sqrt{\frac{N \sum x_i^2 - (\sum x_i)^2}{N^2}}$
Raw score formula for the sample standard deviation: $s = \sqrt{\frac{n \sum x_i^2 - (\sum x_i)^2}{n(n - 1)}}$
Remark: It would be good for you to have a scientific calculator with an SD mode so that you only
have to learn how to key in the data. Your calculator will generate the values of the measures that you
would like to solve for. Since different models work differently, search for a YouTube tutorial for the particular
calculator model that you have.
Example:
Given the following sample data set (xi) where X : score in a quiz ( n = 10):
x_i       x_i²
32 pts    (32 pts)(32 pts) = 1,024 pts²
71 pts    (71 pts)(71 pts) = 5,041 pts²
64 pts    (64 pts)(64 pts) = 4,096 pts²
50 pts    (50 pts)(50 pts) = 2,500 pts²
48 pts    (48 pts)(48 pts) = 2,304 pts²
63 pts    (63 pts)(63 pts) = 3,969 pts²
38 pts    (38 pts)(38 pts) = 1,444 pts²
41 pts    (41 pts)(41 pts) = 1,681 pts²
47 pts    (47 pts)(47 pts) = 2,209 pts²
52 pts    (52 pts)(52 pts) = 2,704 pts²
Sum of the column:    $\sum x_i = 506$ pts    $\sum x_i^2 = 26{,}972$ pts²
$s^2 = \frac{10(26{,}972 \text{ pts}^2) - (506 \text{ pts})^2}{10(9)} = \frac{269{,}720 \text{ pts}^2 - 256{,}036 \text{ pts}^2}{90} = \frac{13{,}684 \text{ pts}^2}{90} = 152.04 \text{ pts}^2$
Since it makes no sense to have a measure in terms of squared units of the original unit
of measurement (e.g., pts²), the unit has to be reverted to the original unit
(pts), which can be done by extracting the square root of the value of the variance:
$s = \sqrt{152.04 \text{ pts}^2} \approx 12.33$ pts
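The quiz-score example can be verified with a short sketch that applies both the raw score formula and the definitional formula for a sample. The function names are illustrative; `statistics.stdev` is Python's built-in sample standard deviation and is used only as a cross-check.

```python
import math
import statistics

scores = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]  # quiz scores, n = 10


def sample_variance_raw(values):
    """Raw score formula: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


def sample_variance_definitional(values):
    """Definitional formula: sum of squared deviations about the mean / (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


variance = sample_variance_raw(scores)
std_dev = math.sqrt(variance)
print(f"s^2 = {variance:.2f} pts^2, s = {std_dev:.2f} pts")
print(f"definitional s^2 = {sample_variance_definitional(scores):.2f} pts^2")
print(f"statistics.stdev = {statistics.stdev(scores):.2f} pts")
```

The two formulas are algebraically equivalent, so they agree up to floating-point rounding; the raw score form is simply easier to compute by hand from the column sums.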
Example: Solve for the standard deviations of the two sets of data (A and B) given in the Introduction to Variability.
            Set A              Set B
 i       x      x²          x      x²
 1       3       9          3       9
 2       4      16          7      49
 3       5      25          7      49
 4       6      36          7      49
 5       8      64          8      64
 6       9      81          8      64
 7      10     100          8      64
 8      12     144          9      81
 9      15     225         15     225
 Sum    72     700         72     654
NOTE: If it helps you by creating a table, you may do so, otherwise just presenting the solution in terms
of summations (like below) without the table will suffice.
Set A: $\sum x = 72$, $\sum x^2 = 700$
Set B: $\sum x = 72$, $\sum x^2 = 654$
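The remaining step, substituting the sums into the raw score formula, can be sketched as below. This assumes both sets are treated as samples (the earlier summary table uses n and x-bar); the function name `sample_std_from_sums` is illustrative.

```python
import math


def sample_std_from_sums(n, sum_x, sum_x2):
    """Sample standard deviation from n, the sum of x, and the sum of x squared."""
    variance = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
    return math.sqrt(variance)


# ASSUMPTION: sets A and B are samples, so the divisor is n(n - 1).
s_a = sample_std_from_sums(9, 72, 700)
s_b = sample_std_from_sums(9, 72, 654)
print(f"s_A = {s_a:.2f} bottles")
print(f"s_B = {s_b:.2f} bottles")
```

Unlike the range, the standard deviation separates the two sets: the smaller value for B matches the tighter clustering seen in its dot plot.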
VARIANCE
square of the standard deviation:
population variance: $\sigma^2$
sample variance: $s^2$
of little use in descriptive statistics because its calculated value is expressed in square units of
measurement
WATCH:
Statistics Fundamentals: The Mean, Variance and Standard Deviation.
https://www.youtube.com/watch?v=SzZ6GpcfoQY
APPLICATIONS OF THE STANDARD DEVIATION
A. COEFFICIENT OF VARIATION
Note: terms used interchangeably: more uniform, more homogeneous, more compact, less dispersed,
less scattered, less variable, less heterogeneous, less varied
Remark:
In the investing world, the coefficient of variation allows you to determine how much volatility (risk) you
are assuming in comparison to the amount of return you can expect from your investment. In simple
language, the lower the ratio of standard deviation to mean return, the better your risk-return tradeoff.
Example: Consider two investment proposals, A and B, with the following data:
Therefore, because the coefficient of variation is a relative measure of risk, B is considered more risky
than A. Although B has a greater mean ($250) than A ($230), B is considered the riskier investment
since B is more volatile than A, meaning your earnings with B can vary from $41.43 to $458.57, while for
A they vary from $122.93 to $337.07 (the greater the CV, the more variable the data of the group).
Example: The weights of 10 boxes of a certain brand of cereal have a mean content of 278.0 grams with
a standard deviation of 9.6 grams. If these boxes were purchased at 10 different stores and the mean
price per box is PhP64.50 with a standard deviation of PhP4.50, can you conclude that the weights are
relatively more homogeneous than the prices?
$CV_w = \frac{9.6 \text{ grams}}{278.0 \text{ grams}} \times 100\% = 3.5\%$
$CV_p = \frac{\text{PhP}4.50}{\text{PhP}64.50} \times 100\% = 6.98\%$
Yes, the weights are relatively more homogeneous than the prices, because the CV for the
weights is less than the CV for the prices.
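The coefficient of variation used in both examples is the standard deviation expressed as a percentage of the mean, which can be sketched as follows. For the investment example the standard deviations are not shown in the text; the values 107.07 and 208.57 below are an assumption, obtained by reading the quoted earning ranges as mean plus or minus one standard deviation. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean (unit-free)."""
    return std_dev / mean * 100


# Cereal example: weights (grams) versus prices (PhP).
cv_weight = coefficient_of_variation(9.6, 278.0)
cv_price = coefficient_of_variation(4.50, 64.50)
print(f"CV of weights: {cv_weight:.2f}%   CV of prices: {cv_price:.2f}%")

# Investment example. ASSUMPTION: the quoted ranges are mean +/- one SD,
# e.g. 337.07 - 230 = 107.07 for A and 458.57 - 250 = 208.57 for B.
cv_a = coefficient_of_variation(107.07, 230)
cv_b = coefficient_of_variation(208.57, 250)
print(f"CV of A: {cv_a:.2f}%   CV of B: {cv_b:.2f}%")
```

Because the units cancel, the CV allows grams to be compared with pesos, which the standard deviations alone would not permit.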
B. STANDARD SCORE
Example: Ruben got a final grade of 85 in both English and Physics. The mean final grades of his class in
these two courses are 80 in English and 75 in Physics with standard deviations of 12 and 10, respectively.
In which subject was his academic performance better in relation to his class?
Subject Ruben’s final grade (x) Class Mean Class Std. Dev.
English 85 80 12
Physics 85 75 10
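$Z_E = \frac{85 - 80}{12} = 0.42$        $Z_P = \frac{85 - 75}{10} = 1.00$
Since $Z_P > Z_E$, Ruben's performance relative to his class was better in Physics.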
Example: Different typing skills are required for secretaries depending on whether one is working in a law
office, an accounting firm, or for a research mathematical group at a major university. In order to evaluate
candidates for these positions, an employment agency administers three distinct standardized typing
samples. A time penalty has been incorporated into the scoring of each sample based on the number of
typing errors. The mean and standard deviation for each test, together with the score achieved by a recent
applicant, are given in the following table. Determine which office this particular applicant should be
assigned.
Sample        Applicant’s score ($x_i$)    Mean ($\bar{x}$)    Standard deviation (s)
Law           141 sec                      180 sec             30 sec
Accounting    7 min                        10 min              2 min
Scientific    33 min                       26 min              5 min
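$Z_L = \frac{141 \text{ sec} - 180 \text{ sec}}{30 \text{ sec}} = -1.30$
$Z_A = \frac{7 \text{ min} - 10 \text{ min}}{2 \text{ min}} = -1.50$
$Z_S = \frac{33 \text{ min} - 26 \text{ min}}{5 \text{ min}} = 1.40$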
Since a secretary is supposed to type speedily and accurately, a lower z-score is desired. This
particular applicant should be assigned to the accounting firm.
There is no need to convert to the same units, since the numerator units cancel with the
denominator units; z has no unit of measurement.
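Both standard-score examples can be checked with a short sketch. The function name `z_score` and the dictionary layout are illustrative; the figures come from the two tables in this section.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Ruben's final grades: a higher z-score means better standing in the class.
ruben = {"English": z_score(85, 80, 12), "Physics": z_score(85, 75, 10)}
best_subject = max(ruben, key=ruben.get)

# Typing samples: scores are times, so a lower z-score is better.
typing = {
    "Law": z_score(141, 180, 30),       # seconds
    "Accounting": z_score(7, 10, 2),    # minutes
    "Scientific": z_score(33, 26, 5),   # minutes
}
best_office = min(typing, key=typing.get)

for name, z in {**ruben, **typing}.items():
    print(f"{name:>10}: z = {z:.2f}")
print(f"Better subject: {best_subject}; suggested office: {best_office}")
```

Mixing seconds and minutes across rows is harmless here because each z-score is computed within a single row, where the units cancel.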
C. COEFFICIENT OF SKEWNESS
Formula: $SK = \frac{3(\text{mean} - \text{median})}{\text{std deviation}}$
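As an illustration of the SK formula, the sketch below applies it to the ten quiz scores from the standard deviation example. Using that data set here, and using the sample standard deviation in the denominator, are choices made for this illustration; the function name is also illustrative.

```python
import statistics


def skewness_coefficient(values):
    """SK = 3 * (mean - median) / standard deviation (sample SD used here)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    std_dev = statistics.stdev(values)
    return 3 * (mean - median) / std_dev


scores = [32, 71, 64, 50, 48, 63, 38, 41, 47, 52]  # quiz scores, n = 10
sk = skewness_coefficient(scores)
print(f"mean = {statistics.mean(scores)}, median = {statistics.median(scores)}")
print(f"SK = {sk:.3f}")
```

The sign of SK follows the sign of (mean - median), so a positive value points to positive skewness and a negative value to negative skewness, in line with the comparison of mean and median described earlier.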
D. EMPIRICAL RULE
When the data are believed to approximate a bell-shaped distribution, the empirical rule can be
used to determine the percentage of data values that must be within a specified number of
standard deviations of the mean, that is,
o Approximately 68% of the data values will be within 1 standard deviation of the mean:
($\mu \pm 1\sigma$) = ($\mu - 1\sigma$, $\mu + 1\sigma$).
o Approximately 95% of the data values will be within 2 standard deviations of the mean:
($\mu \pm 2\sigma$) = ($\mu - 2\sigma$, $\mu + 2\sigma$).
o Approximately 99.7% of the data values will be within 3 standard deviations of the mean:
($\mu \pm 3\sigma$) = ($\mu - 3\sigma$, $\mu + 3\sigma$).
$\mu \pm 1\sigma$: 16.00 ± 0.25 = (16.00 - 0.25, 16.00 + 0.25) = (15.75, 16.25)
Approximately 68% of the liquid detergent cartons have filling weights between
15.75 oz and 16.25 oz
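The three empirical-rule intervals can be generated with a small sketch. It assumes, from the surviving computation, a mean filling weight of 16.00 oz and a standard deviation of 0.25 oz, and that the filling weights are approximately bell-shaped; the function name is illustrative.

```python
def empirical_rule_intervals(mean, std_dev):
    """Return {k: (lower, upper, approximate % of data)} for k = 1, 2, 3 SDs."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        k: (mean - k * std_dev, mean + k * std_dev, pct)
        for k, pct in coverage.items()
    }


# ASSUMED parameters, read off the 16.00 +/- 0.25 computation.
intervals = empirical_rule_intervals(16.00, 0.25)
for k, (lower, upper, pct) in intervals.items():
    print(f"within {k} SD: ({lower:.2f} oz, {upper:.2f} oz), about {pct}% of cartons")
```

The percentages are approximations that hold only when the distribution is close to bell-shaped; for strongly skewed data they should not be relied on.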
EXERCISES
1. A goal of management is to help their company earn as much as possible relative to the capital
invested. One measure of success is return on equity – the ratio of net income to stockholders’
equity. Shown here are return on equity percentages for 25 companies. Find the range, variance,
and standard deviation.
9.0 19.6 22.9 41.6 11.4
15.8 52.7 17.3 12.3 5.1
17.3 31.1 9.6 8.6 11.2
12.8 12.2 14.5 9.2 16.6
5.0 30.3 14.7 19.2 6.2
2. During a 30-day period, the daily numbers of cars rented out by a car rental company were as follows:
7 10 6 7 9 4 7 9 9 8
5 5 7 8 4 6 9 7 12 7
9 10 4 7 5 9 8 9 5 7
Find the range, variance, and standard deviation.
3. A manufacturing firm regularly places orders with two different suppliers, A and B. The following
data are the number of days required to fill orders for these suppliers.
Supplier A: 11 10 9 10 11 11 10 11 10 10
Supplier B: 8 10 13 7 10 11 10 7 15 12
Determine which supplier provides the more consistent and reliable delivery times. Use the
range and standard deviation. Since you are comparing the two, why just use the standard
deviation and not compute for the coefficient of variation?
4. A production department uses a sampling procedure to test the quality of newly produced items.
The department employs the following decision rule at an inspection station: If a sample of 14
items has a variance of more than .005, the production line must be shut down for repairs.
Suppose the following data have been collected:
3.43 3.45 3.43 3.48 3.52 3.50 3.39
3.48 3.41 3.38 3.49 3.45 3.51 3.50
Should the production line be shut down? Why or why not?
5. Two friends want to take a summer holiday before going to college in the autumn. They are looking
for somewhere with plenty of clubs where they can party all night. Unfortunately they have left it
rather late to book and there are only two resorts, Medlena and Bistry, available within their
budget. When they ask about the ages of the holiday-makers at these resorts, their travel agent
says the only thing he can tell them is that the mean age of people going to Medlena is 19
whereas the mean age of visitors to Bistry is 22. Just as they are about to book holidays in Medlena
because it seems to attract the sort of young crowd they want to be with, the travel agent says,
‘I’ve got some more figures: the standard deviation of the ages of visitors to Medlena is 8 and the
standard deviation of the ages of visitors to Bistry is 2.’ Should they change their minds on the
basis of this new information, and if so, why?
6. Many national academic achievement and aptitude tests, such as the SAT, report standardized
test scores with the mean for the normative group used to establish scoring standards converted
to 500 with a standard deviation of 100. Suppose that the scores for such a test are
known to be approximately normally distributed. Determine the approximate percentage of
reported scores that would be
a. between 400 and 600
b. between 500 and 700
c. greater than 700
d. less than 200
Hint: Draw the bell-shaped curve and replace the values of $\mu$ and $\sigma$ on the horizontal axis:
7. A SAT test taker (refer to #6) got a score of 625. What is his standard score?
8. The same student (in #7) got the same score (625) in a different test, the mean of which is 450
and standard deviation 150. In which test did this student fare better?