Descriptive Stats
1 INTRODUCTION
For this course, you will need a scientific calculator, that is, one that has
statistical functions and, preferably, a linear regression (LR) mode. The most
cost-effective calculator for the purpose of this section is the CASIO FX-82 TL.
It will save you a tremendous amount of time: once the statistical data have been
entered, statistics like the number of observations, mean, standard deviation,
correlation and regression coefficients can be obtained readily by just pressing
buttons. Obviously, computer software like SPSS or SAS is much more powerful, but
the calculator can help you to determine basic statistics very quickly
‘on the spot’.
The summation notation is used to summarise a series, that is, the sum of
the terms of a sequence. It is denoted by the Greek capital letter sigma, ∑, as
opposed to the small letter sigma, σ, which, in Statistics, stands for standard
deviation.
$$\sum_{r=a}^{b} f(r)$$
where r is known as the index, a and b are the lower and upper limits of
summation respectively and f (r) is known as the general term. r, just like a
counter, starts at a and increases by steps of 1 until it reaches b. Each term of the
series is obtained by substituting successive values of r in the general term. The
following example illustrates the mechanism.
Example
$$\sum_{k=2}^{6} (2k + 1) = 5 + 7 + 9 + 11 + 13 = 45$$
Here, the index (counter) is k. It can be observed that k takes on an initial value of
2 (the lower limit) and increases by steps of 1 until it reaches the upper limit 6.
Every value that k assumes is substituted in the general term (2k + 1) in order to
generate a term of the series. Obviously, the terms are added up since Sigma
stands for summation.
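The mechanism can also be checked with a few lines of Python. This is only an illustrative sketch: the helper name `summation` is an assumption introduced here and is not part of the course material.

```python
def summation(general_term, lower, upper):
    """Add general_term(r) for r = lower, lower + 1, ..., upper (inclusive)."""
    return sum(general_term(r) for r in range(lower, upper + 1))

# The index k runs from 2 to 6 and the general term is 2k + 1.
terms = [2 * k + 1 for k in range(2, 7)]
total = summation(lambda k: 2 * k + 1, 2, 6)

print(terms)  # [5, 7, 9, 11, 13]
print(total)  # 45
```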
Such expressions are found in the formulae for arithmetic mean and
standard deviation. In this module, students are simply required to recognise the
summation notation and understand its meaning so that they can at least use
relevant statistical functions on calculators.
2 DISTRIBUTIONS
2 7 8 11 15
16 18 19 19 19
23 23 24 26 27
29 33 40 44 47
49 51 54 63 68
Age Frequency
19 14
20 23
21 134
22 149
23 71
24 8
Total 399
The main drawback in grouping of data is that the identity (value) of each
observation is lost, so that important descriptive statistics like the mean and
standard deviation can only be estimated and not calculated exactly. For example,
if the age group ‘21–25’ has frequency 5 (Fig. 2.2.1), nothing can be said about
the individual values of these 5 observations. Besides, several additional
quantities have to be calculated before statistical calculations and analyses can
be carried out, as will be explained in the following sections.
Note The ‘gap’ between any pair of successive cells in a frequency distribution
is equal to the degree of accuracy to which the original observations were
recorded.
In the above example, it is easy to deduce that age was recorded to the
nearest unit since the ‘gap’ between the cells ‘21–25’ and ‘26–30’ is 1. The real
limits of these two cells are therefore ‘20.5–25.5’ and ‘25.5–30.5’. Note that the
following relationships hold:
2.2.2 Mid-class values (MCV)
The mid-class value, MCV, of a cell is defined as its midpoint, that is, the
average of its limits or of its real limits. Thus, the MCV of the cell ‘21–25’ is
23. The MCV represents the cell in the sense that, since the values of the
individual observations in the cell are unknown, they are all assumed to be equal
to the MCV. This assumption is neither arbitrary nor unjustified: if the
observations are unknown, the best way of estimating statistics is to assume that
they are, at least, uniformly distributed within the cell (which could be untrue,
of course!). Mathematically, the sum of the observations is then equal to the
number of observations multiplied by the MCV (think about it!). The importance of
the mid-class value should thus never be underestimated, especially for the
calculation of crucial statistics like the mean and standard deviation.
The cell width is simply the length of the cell, that is, the difference
between its upper and lower real limits.
Note Do not make the mistake of subtracting the lower limit from the upper
limit since this will not give the exact cell width.
This can be easily verified by taking the cell ‘21–25’. Its cell width is 5
(21, 22, 23, 24 and 25), which is obtained by subtracting 20.5 from 25.5. We
therefore use the following formula:
$$\text{Cell width} = \text{upper real limit} - \text{lower real limit}$$
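The real limits, mid-class value and cell width of a cell can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that each stated limit is extended by half the ‘gap’ (here 1, since age was recorded to the nearest unit).

```python
def real_limits(lower, upper, gap=1):
    """Extend each stated limit by half the gap to obtain the real limits."""
    return lower - gap / 2, upper + gap / 2

def mid_class_value(lower, upper):
    """Midpoint of the cell (same result from stated or real limits)."""
    return (lower + upper) / 2

def cell_width(lower, upper, gap=1):
    """Difference between the upper and lower real limits."""
    low_real, high_real = real_limits(lower, upper, gap)
    return high_real - low_real

print(real_limits(21, 25))      # (20.5, 25.5)
print(mid_class_value(21, 25))  # 23.0
print(cell_width(21, 25))       # 5.0
```

Subtracting the stated limits instead (25 − 21) would give 4, which is the mistake the note above warns against.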
3 DESCRIPTION OF A DISTRIBUTION
1. Location
2. Dispersion
3. Skewness
4. Kurtosis
3.1 LOCATION (LOCALITY OR CENTRAL TENDENCY)
The arithmetic mean is the most common form of average. For a given set
of data, it is defined as the sum of the values of all the observations divided by the
total number of observations. The mean is denoted by $\bar{x}$ for a sample and by
$\mu$ for a population. Its formula, however, differs for ungrouped and grouped data.
Ungrouped data
$$\bar{x} = \frac{\sum x}{n} \qquad \mu = \frac{\sum X}{N}$$
Grouped data
$$\bar{x} = \frac{\sum fx}{\sum f} \qquad \mu = \frac{\sum fX}{N}$$
n = sample size
N = population size
f = frequency of classes
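As an illustration, the two formulas can be applied to the raw observations and to the age frequency table listed in Section 2. The sketch below is a minimal one; the function and variable names are assumptions introduced here.

```python
# Raw observations and the age frequency table from Section 2.
observations = [2, 7, 8, 11, 15, 16, 18, 19, 19, 19, 23, 23, 24,
                26, 27, 29, 33, 40, 44, 47, 49, 51, 54, 63, 68]
age_frequency = {19: 14, 20: 23, 21: 134, 22: 149, 23: 71, 24: 8}

def mean_ungrouped(values):
    """Sum of the observations divided by their number."""
    return sum(values) / len(values)

def mean_grouped(frequency):
    """Sum of f*x divided by the sum of f."""
    total_fx = sum(x * f for x, f in frequency.items())
    total_f = sum(frequency.values())
    return total_fx / total_f

print(mean_ungrouped(observations))           # 29.4
print(round(mean_grouped(age_frequency), 2))  # 21.66
```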
Merits
1. It is widely understood.
2. Its calculation involves all observations.
3. It is suited to further statistical analysis.
Limitations
3.1.2 Geometric mean
The geometric mean of n items is defined as the nth root of their product.
The general formula which is used to calculate the geometric mean is as
follows:
$$\text{Geometric mean} = \sqrt[n]{x_1 x_2 \cdots x_n}$$
Example
$$\frac{97 - 84}{84} = 0.155 = p_1$$
$$\frac{116 - 97}{97} = 0.196 = p_2$$
$$\frac{129 - 116}{116} = 0.112 = p_3$$
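The rest of this example is not shown here, so the following Python sketch rests on an assumption: that the three proportional changes are to be averaged by taking the geometric mean of the growth factors 1 + p, a common use of the geometric mean. The function name is hypothetical.

```python
from math import prod

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return prod(values) ** (1 / len(values))

rates = [0.155, 0.196, 0.112]             # p1, p2, p3 from the example
growth_factors = [1 + p for p in rates]   # assumption: average the factors 1 + p
average_rate = geometric_mean(growth_factors) - 1

print(round(average_rate, 3))  # 0.154
```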
Merit
Limitation
The harmonic mean of n items is the number of items divided by the sum of the
reciprocals of the individual items.
The general formula for calculating the harmonic mean is given as:
$$\text{Harmonic mean} = \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dfrac{1}{x_3} + \cdots + \dfrac{1}{x_n}}$$
Example
An organisation owns three lorries. Over a distance of 100 miles, one does
14 miles per gallon, one 18 miles per gallon and one 20 miles per gallon.
$$\text{Harmonic mean} = \frac{3}{\dfrac{1}{14} + \dfrac{1}{18} + \dfrac{1}{20}} = 16.95$$
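The lorry example can be reproduced in Python. The manual function below follows the formula directly (its name is an assumption); `statistics.harmonic_mean` from the standard library is used only as a cross-check.

```python
from statistics import harmonic_mean

def harmonic_mean_manual(values):
    """Number of items divided by the sum of their reciprocals."""
    return len(values) / sum(1 / x for x in values)

miles_per_gallon = [14, 18, 20]

print(round(harmonic_mean_manual(miles_per_gallon), 2))  # 16.95
print(round(harmonic_mean(miles_per_gallon), 2))         # 16.95
```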
Merit
3.1.4 Weighted mean
$$\bar{x}_{\text{weighted}} = \frac{\sum w_i x_i}{\sum w_i}$$
Example
In a certain institution, the year marks for modules are based upon a first-term
test, a second-term test and a final exam at the end of the year. Given the
number of topics to be covered for each assessment, the three assessments have a
relative importance in the ratio 2:3:5. If a student obtained 74 marks in the
first test, 63 in the second test and 55 in the final exam, what is the student's
year mark?
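One way to work this out is to apply the weighted mean formula with weights 2, 3 and 5. The sketch below is illustrative only; the function name is an assumption.

```python
def weighted_mean(values, weights):
    """Sum of w*x divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

marks = [74, 63, 55]   # first test, second test, final exam
weights = [2, 3, 5]    # relative importance 2:3:5

year_mark = weighted_mean(marks, weights)
print(year_mark)  # 61.2
```

The numerator is 2(74) + 3(63) + 5(55) = 612 and the weights add up to 10, which gives a year mark of 61.2.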
3.1.5 Median
In the case of grouped data, the determination of the value of the median is
slightly more complicated since the identity of individual observations is
unknown. We proceed as follows:
The formula for calculating the median is given by
$$\text{Median} = \text{LCB} + \left(\frac{\dfrac{n+1}{2} - \text{CF}}{f}\right) c$$
Merits
1. It is rigidly defined.
2. It is easily understood and, in some cases, it can even be located by
inspection.
3. It is not at all affected by extreme values.
Limitations
Quartiles are calculated in the same way as the median, that is, using the
same formula except, obviously, for the rank. (Formula to be explained in detail.)
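The grouped-data formula for the median, and its reuse for the quartiles with a different rank, can be sketched in Python. The reading of the symbols is an assumption inferred from the way the formula is applied in Table 4.6: LCB is the lower class boundary of the class containing the required rank, CF the cumulative frequency up to the preceding class, f the frequency of that class and c the class width. The function name is hypothetical.

```python
def grouped_quantile(fraction, n, lcb, cf, f, c):
    """Estimate a quantile from grouped data.

    fraction: 0.5 for the median, 0.25 and 0.75 for the quartiles
    n: total frequency; lcb: lower class boundary of the class holding the rank
    cf: cumulative frequency before that class; f: its frequency; c: class width
    """
    rank = fraction * (n + 1)
    return lcb + (rank - cf) / f * c

n = 143  # total frequency used in Table 4.6
median = grouped_quantile(0.50, n, lcb=35.5, cf=40, f=39, c=5)
q1 = grouped_quantile(0.25, n, lcb=30.5, cf=17, f=23, c=5)
q3 = grouped_quantile(0.75, n, lcb=40.5, cf=79, f=32, c=5)

print(round(median, 2), round(q1, 2), round(q3, 2))  # 39.6 34.63 45.03
```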
3.1.6 Mode
The mode is the observation which occurs the most or with the highest
frequency. Sometimes, it is denoted by x̂ . For ungrouped data, it may easily be
detected by inspection. If there is more than one observation with the same
highest frequency, then we either say that there is no mode or that the distribution
is multimodal.
For grouped data, we can only estimate the mode – the class with the
highest frequency is known as the modal class. Since we would prefer a single
value for the mode (instead of an entire class), a rough approximation is the mid-
class value of the modal class. However, there are two ways of estimating the
mode quite accurately. Both should theoretically lead to the same result, the first
one being numerical and the second, graphical.
$$\text{Mode} = \text{LCB} + \left(\frac{f_1}{f_1 + f_2}\right) c$$
where $f_1$ is the difference between the frequencies of the modal class and that of
the class preceding it and $f_2$ is the difference between the frequencies of the
modal class and that of the class following it.
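The numerical method can be sketched in Python as follows. The frequencies used (39 for the modal class, 23 for the class before it and 32 for the class after it) are those implied by the calculations in Table 4.6; LCB and c are taken to be the lower class boundary and the width of the modal class, which is an assumption about the notation. The function name is hypothetical.

```python
def grouped_mode(lcb, f_modal, f_before, f_after, c):
    """Estimate the mode from the modal class and its two neighbours."""
    f1 = f_modal - f_before   # excess over the preceding class
    f2 = f_modal - f_after    # excess over the following class
    return lcb + f1 / (f1 + f2) * c

mode = grouped_mode(lcb=35.5, f_modal=39, f_before=23, f_after=32, c=5)
print(round(mode, 2))  # 38.98
```

The estimate lies inside the modal class ‘36–40’, closer to the neighbour with the higher frequency.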
[Figure: histogram of frequency density against values, with the modal class indicated.]
Merits
Limitations
3.2 DISPERSION
3.2.1 Range
The range is simply the difference between the values of the maximum
and minimum observations. It can only measure the extent to which the
distribution spreads over the x-axis.
Merit
Limitations
3.2.2 Variance
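Ungrouped data
$$s^2 = \frac{\sum (x - \bar{x})^2}{n} \qquad \sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
Grouped data
$$s^2 = \frac{\sum f(x - \bar{x})^2}{\sum f} \qquad \sigma^2 = \frac{\sum f(X - \mu)^2}{N}$$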
Note The formula for variance can be simplified using the laws of summation
so that calculations may become shorter and less complicated.
$$s^2 = \frac{\sum x^2}{n} - \bar{x}^2 \qquad s^2 = \frac{\sum fx^2}{\sum f} - \bar{x}^2$$
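That the simplified formula agrees with the defining one can be checked numerically. The sketch below uses the 25 raw observations from Section 2; the function names are assumptions introduced here.

```python
from math import isclose, sqrt

observations = [2, 7, 8, 11, 15, 16, 18, 19, 19, 19, 23, 23, 24,
                26, 27, 29, 33, 40, 44, 47, 49, 51, 54, 63, 68]

def variance_definition(values):
    """Mean of the squared deviations from the mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def variance_shortcut(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2

v_def = variance_definition(observations)
v_short = variance_shortcut(observations)

print(isclose(v_def, v_short))  # True
print(round(sqrt(v_def), 3))    # 17.598
```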
3.2.4 Mean deviation
$$\text{Mean deviation} = \frac{\sum f\,|x - \bar{x}|}{n}$$
Note Obviously, the frequency f drops out of the formula when there are no
classes in the distribution, that is, only individual values.
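For ungrouped data the calculation can be sketched in Python using the 25 raw observations from Section 2. The function name is an assumption.

```python
observations = [2, 7, 8, 11, 15, 16, 18, 19, 19, 19, 23, 23, 24,
                26, 27, 29, 33, 40, 44, 47, 49, 51, 54, 63, 68]

def mean_deviation(values):
    """Average absolute distance of the observations from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

md = mean_deviation(observations)
print(round(md, 3))  # 14.752
```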
Merits
Limitations
1. The distance from the mean does not reveal whether the observation is less
than or greater than the mean.
2. It is not suitable for further statistical analysis.
$$\text{Quartile deviation} = \frac{Q_3 - Q_1}{2}$$
The quartile deviation shows the average distance between a quartile and
the median. The smaller the quartile deviation, the less dispersed is the
distribution. Just like the range, the quartile deviation can be misleading. If the
majority of the data is towards the lower end of the range, then the third quartile
will be considerably further above the median than the first quartile is below it. In
such a case, when the two distances from the median are averaged, the difference
is disguised. Then, it would be better to quote the actual values of the two
quartiles rather than the quartile deviation.
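The averaging effect described above can be seen with the quartiles quoted in Table 4.6 (Q1 = 34.63, median = 39.60, Q3 = 45.03). The sketch is illustrative and the function name is an assumption.

```python
def quartile_deviation(q1, q3):
    """Half the distance between the first and third quartiles."""
    return (q3 - q1) / 2

q1, q2, q3 = 34.63, 39.60, 45.03   # quartiles quoted in Table 4.6

below = q2 - q1   # distance from the first quartile to the median
above = q3 - q2   # distance from the median to the third quartile
qd = quartile_deviation(q1, q3)

print(round(below, 2), round(above, 2), round(qd, 2))  # 4.97 5.43 5.2
```

The quartile deviation (5.2) is the average of the two distances (4.97 and 5.43), so any difference between them is hidden in the single figure.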
3.2.6 Coefficient of variation
$$\text{Coefficient of variation} = \frac{s}{\bar{x}} \times 100$$
Example
A B
Mean 120 125
Standard deviation 50 51
Coefficient of variation 41.7 40.8
Table 3.2.6
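The figures in Table 3.2.6 can be reproduced with a short Python sketch (the function name is an assumption).

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_a = coefficient_of_variation(mean=120, sd=50)
cv_b = coefficient_of_variation(mean=125, sd=51)

print(round(cv_a, 1), round(cv_b, 1))  # 41.7 40.8
```

Although B has the larger standard deviation, A has the larger coefficient of variation, that is, A is relatively more dispersed.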
$$\text{Quartile coefficient of dispersion} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
3.2.8 Coefficient of mean deviation
$$\text{Coefficient of mean deviation} = \frac{\sum f\,|x - \bar{x}|}{n\bar{x}}$$
Again, the frequency f drops out of the formula if there are no classes in the
distribution.
3.3 SKEWNESS
3.3.1 Pearson’s coefficient of skewness
This is the most accurate measure of skewness since its formula contains
two of the most reliable statistics, the mean and standard deviation. The formula
is given as
$$\alpha = \frac{3(\bar{x} - Q_2)}{s}$$
Note The validity of the formula can be verified by looking at the positions of
the mean and median in Fig. 3.3.
$$\alpha = \frac{Q_1 + Q_3 - 2Q_2}{Q_3 - Q_1}$$
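Both skewness formulas can be sketched in Python. The first call uses the mean, median and standard deviation of the 25 raw observations (29.4, 24 and 17.598); the second uses the quartiles quoted in Table 4.6. The function names are assumptions.

```python
def pearson_skewness(mean, median, sd):
    """Three times (mean - median), divided by the standard deviation."""
    return 3 * (mean - median) / sd

def quartile_skewness(q1, q2, q3):
    """(Q1 + Q3 - 2*Q2) divided by (Q3 - Q1)."""
    return (q1 + q3 - 2 * q2) / (q3 - q1)

print(round(pearson_skewness(29.4, 24, 17.598), 2))      # 0.92
print(round(quartile_skewness(34.63, 39.60, 45.03), 3))  # 0.044
```

A positive value indicates that the mean lies above the median (or that the third quartile is further from the median than the first), which corresponds to positive skewness.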
3.4 KURTOSIS
If we start with the normal distribution and move scores from both centre
and tails towards the shoulders, the curve becomes flatter and is said to be
platykurtic. If, on the other hand, we move scores from the shoulders to the centre
and tails, the curve becomes more peaked with thicker tails. In that case, it is said
to be leptokurtic. Fig. 3.4 shows the degree of peakedness for three types of
distributions.
Fig. 3.4 Platykurtic, mesokurtic and leptokurtic distributions
$$\beta = \frac{\sum (x - \bar{x})^4}{ns^4} \quad \text{or} \quad \beta = \frac{\sum f(x - \bar{x})^4}{ns^4}$$
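The ungrouped version of the formula can be evaluated directly in Python. The sketch below applies it to the 25 raw observations from Section 2, with $s^2$ computed using the divisor n as in Section 3.2.2; the function name is an assumption.

```python
observations = [2, 7, 8, 11, 15, 16, 18, 19, 19, 19, 23, 23, 24,
                26, 27, 29, 33, 40, 44, 47, 49, 51, 54, 63, 68]

def kurtosis_coefficient(values):
    """Sum of fourth powers of the deviations, divided by n * s^4."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n   # s^2, divisor n
    fourth_powers = sum((x - mean) ** 4 for x in values)
    return fourth_powers / (n * variance ** 2)            # s^4 = (s^2)^2

beta = kurtosis_coefficient(observations)
print(round(beta, 3))
```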
4 EXAMPLES
The full descriptive statistics have been calculated and given in Tables
4.4, 4.5 and 4.6.
4.1 Example 1 (ungrouped raw data)
2 7 8 11 15
16 18 19 19 19
23 23 24 26 27
29 33 40 44 47
49 51 54 63 68
Table 4.1
Table 4.3
Table 4.4 Descriptive statistics for Example 4.1
Mean: $\bar{x} = \dfrac{\sum fx}{\sum f} = \dfrac{735}{25} = 29.4$
Median: rank of median $= \frac{1}{2}(25 + 1) = 13$; median $= 24$
Lower quartile: rank of first quartile $= \frac{1}{4}(25 + 1) = 6.5$; $Q_1 = \dfrac{16 + 18}{2} = 17$
Upper quartile: rank of third quartile $= \frac{3}{4}(25 + 1) = 19.5$; $Q_3 = \dfrac{44 + 47}{2} = 45.5$
Maximum: maximum observation $= 68$
Range: range $= 68 - 2 = 66$
Mean deviation: $\text{MD} = \dfrac{\sum f\,|x - \bar{x}|}{n} = \dfrac{368.8}{25} = 14.752$
Standard deviation: $s = \sqrt{\dfrac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\dfrac{29351}{25} - (29.4)^2} = 17.598$
Quartile coefficient of dispersion: $\dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{45.5 - 17}{45.5 + 17} = 0.456$
Coefficient of mean deviation: $\dfrac{\sum f\,|x - \bar{x}|}{n\bar{x}} = \dfrac{14.752}{29.4} = 0.50$
Pearson’s coefficient of skewness: $\alpha = \dfrac{3(\bar{x} - Q_2)}{s} = \dfrac{(3)(29.4 - 24)}{17.598} = 0.92$
Coefficient of kurtosis: $\beta = \dfrac{\sum (x - \bar{x})^4}{ns^4} = \dfrac{5651540.24}{(25)(17.598)^4} = 2.357$
Table 4.5 Descriptive statistics for Example 4.2
Mean: $\bar{x} = \dfrac{\sum fx}{\sum f} = \dfrac{8643}{399} = 21.66$
Median: rank of median $= \frac{1}{2}(399 + 1) = 200$; median $= 22$
Lower quartile: rank of first quartile $= \frac{1}{4}(399 + 1) = 100$; $Q_1 = 21$
Upper quartile: rank of third quartile $= \frac{3}{4}(399 + 1) = 300$; $Q_3 = 22$
Range: range $= 24 - 19 = 5$
Mean deviation: $\text{MD} = \dfrac{\sum f\,|x - \bar{x}|}{n} = \dfrac{328.38}{399} = 0.823$
Standard deviation: $s = \sqrt{\dfrac{\sum fx^2}{\sum f} - \bar{x}^2} = \sqrt{\dfrac{187631}{399} - (21.66)^2} = 1.013$ (using the unrounded mean)
Quartile coefficient of dispersion: $\dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{22 - 21}{22 + 21} = 0.023$
Coefficient of mean deviation: $\dfrac{\sum f\,|x - \bar{x}|}{n\bar{x}} = \dfrac{0.823}{21.66} = 0.038$
Pearson’s coefficient of skewness: $\alpha = \dfrac{3(\bar{x} - Q_2)}{s} = \dfrac{(3)(21.66 - 22)}{1.013} = -1.007$
Coefficient of kurtosis: $\beta = \dfrac{\sum f(x - \bar{x})^4}{ns^4} = \dfrac{1372.59}{(399)(1.013)^4} = 3.267$
Table 4.6 Descriptive statistics for Example 4.3
Mean: $\bar{x} = \dfrac{\sum fx}{\sum f} = \dfrac{5669}{143} = 39.64$
Median: rank of median $= \frac{1}{2}(143 + 1) = 72$; median $= 35.5 + \left(\dfrac{72 - 40}{39}\right)(5) = 39.60$
Mode: modal class 36–40; mode $= 35.5 + \left(\dfrac{16}{16 + 7}\right)(5) = 38.98$
Lower quartile: rank of first quartile $= \frac{1}{4}(143 + 1) = 36$; $Q_1 = 30.5 + \left(\dfrac{36 - 17}{23}\right)(5) = 34.63$
Upper quartile: rank of third quartile $= \frac{3}{4}(143 + 1) = 108$; $Q_3 = 40.5 + \left(\dfrac{108 - 79}{32}\right)(5) = 45.03$
Maximum: maximum observation $= 60$
Range: range $= 60 - 21 = 39$
Mean deviation: $\text{MD} = \dfrac{\sum f\,|x - \bar{x}|}{n} = \dfrac{879.6}{143} = 6.151$
Standard deviation: $s = \sqrt{\dfrac{\sum fx^2}{\sum f} - \bar{x}^2} = \sqrt{\dfrac{232977}{143} - (39.64)^2} = 7.590$ (using the unrounded mean)
Quartile coefficient of dispersion: $\dfrac{Q_3 - Q_1}{Q_3 + Q_1} = \dfrac{45.03 - 34.63}{45.03 + 34.63} = 0.13$
Coefficient of mean deviation: $\dfrac{\sum f\,|x - \bar{x}|}{n\bar{x}} = \dfrac{6.151}{39.64} = 0.155$
Pearson’s coefficient of skewness: $\alpha = \dfrac{3(\bar{x} - Q_2)}{s} = \dfrac{(3)(39.64 - 39.60)}{7.590} = 0.016$
Coefficient of kurtosis: $\beta = \dfrac{\sum f(x - \bar{x})^4}{ns^4} = \dfrac{\sum f(x - \bar{x})^4}{(143)(7.590)^4}$