Stat 101 Notes
Chapter # 01 Statistics
Variable:-
Any observed characteristic that can vary from individual to individual.
Constant:-
Any observed characteristic that cannot vary from individual to individual.
Quantitative Variable:-
A variable which can assume numerical values is called a quantitative variable. For
example, age, height, weight, marks, etc.
Qualitative Variable:-
A variable which can assume non-numerical values is called a qualitative variable. For
example, eye color, gender, etc.
Discrete Variable:-
Any quantitative variable which can assume only specific, finite or countable
values within a given range.
Continuous Variable:-
Any quantitative variable which can assume every possible value within a given
interval.
Primary Data / Raw Data / Ungrouped Data:-
Data which have not undergone any statistical treatment are called primary data.
First-hand collected data are primary data. They are also called raw data or ungrouped data.
Secondary Data / Grouped Data:-
Data which have undergone some statistical treatment are called secondary data.
Second-hand collected data are secondary data. They are also called grouped data.
Sources of Primary Data:-
(i) Direct Personal Observation
(ii) Indirect Oral Observation
(iii) Estimates through Correspondents
(iv) Investigation through Schedules/Questionnaires
Sources of Secondary Data:-
(i) Government Organization
(ii) Semi-Government Organization
(iii) Published Sources
(iv) Unpublished Sources
(v) Internet
Classification:-
The process of arranging observations into different classes or categories according to some
common characteristics is called classification.
Tabulation:-
The process of making tables or arranging data into rows and columns is called
tabulation.
Steps in Constructing Table
(i) Title:-
It is the heading at the top of the table.
(ii) Head Note / Prefatory Note:-
It appears just below the title of the table. It gives further description of the
title.
(iii) Column Captions and Box Head:-
The headings of columns are called column captions. The part of the table
containing the column captions is called the box head.
(iv) Row Captions and Stub:-
The headings of rows are called row captions. The part of the table containing
the row captions is called the stub.
(v) Body:-
The entries in the different cells formed by the columns and rows of a table are
called the body of the table.
(vi) Source Note:-
The source note is given at the end of the table and states where the data came from.
(vii) Foot Note:-
It is given at the bottom of the table.
Frequency:-
The number of values falling in a particular class is called frequency of that class.
Frequency Distribution:-
Frequency distribution is a compact form of the data in a table which displays all
categories of observations according to their magnitudes and frequencies.
Cumulative Frequency Distribution:-
It is a table that displays class intervals and the corresponding cumulative frequencies.
Relative Frequency Distribution:-
It is a table that displays class intervals and the corresponding relative frequencies.
Class Limits:-
Class limits are the values used to express non-overlapping classes, like 20-29,
30-39, 40-49 and so on.
Class Boundaries:-
Class boundaries are the values used to express overlapping (continuous) classes, like 20-30,
30-40, 40-50 and so on.
Mid-Point / Class Mark:-
The value midway between the class limits (or the class boundaries) of a class is called
the mid-point or class mark of that class.
Class size / Interval:-
Class size or class interval means the difference between the upper and lower class
boundaries of a class in the frequency distribution.
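The definitions of frequency, cumulative frequency, relative frequency and class mark can be illustrated with a small sketch. The marks, the starting limit 20 and the width 10 are hypothetical values chosen for illustration, and the function name `frequency_table` is not from the notes; the sketch assumes integer observations so that class limits such as 20-29 do not overlap.

```python
from collections import Counter

# Hypothetical marks of ten students (illustrative values only).
marks = [22, 25, 31, 38, 41, 44, 45, 47, 33, 29]

def frequency_table(values, lower, width, n_classes):
    """Rows of (class limits, class mark, frequency, cumulative, relative)."""
    counts = Counter((v - lower) // width for v in values)
    total = len(values)
    rows, cumulative = [], 0
    for k in range(n_classes):
        lo = lower + k * width
        hi = lo + width - 1                  # class limits such as 20-29
        f = counts.get(k, 0)                 # frequency of the class
        cumulative += f                      # running (cumulative) frequency
        rows.append((f"{lo}-{hi}", (lo + hi) / 2, f, cumulative, f / total))
    return rows

table = frequency_table(marks, lower=20, width=10, n_classes=3)
for row in table:
    print(row)
```

The last cumulative frequency equals the number of observations, and the relative frequencies add up to one.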
Simple Bar Diagram:-
It is used to get an impression of the distribution of a discrete or categorical data set.
Multiple Bar Chart:-
It is an extension of the simple bar diagram and is used to represent two or more
related sets of data in the form of groups of simple bars.
Sub-Divided Bar Chart / Component Bar Chart:-
There are certain situations where the simple bar diagram represents the totals and it is
possible to divide it further into different segments.
Pie Chart:-
It is a division of a circular region into different sectors. The angle of each sector is
$\text{Angle} = \dfrac{\text{Component Part}}{\text{Total Part}} \times 360^{\circ}$
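As a minimal sketch of the sector-angle rule (component part divided by total part, times 360 degrees), the following computes the angles for a hypothetical expense breakdown. The category names, amounts and the function name `sector_angles` are assumptions made for illustration.

```python
def sector_angles(parts):
    """Angle of each pie sector: component / total * 360 degrees."""
    total = sum(parts.values())
    return {name: value * 360 / total for name, value in parts.items()}

# Hypothetical monthly expenses (illustrative values only).
expenses = {"Food": 50, "Rent": 30, "Other": 20}
angles = sector_angles(expenses)
print(angles)
```

The angles always add up to 360 degrees, whatever the units of the components.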
Histogram:-
The graph of frequency distribution is called Histogram. It is a useful graphic
representation of data to get a visual impression about its distribution.
Historigram:-
The graph of time series is called Historigram.
Frequency Polygon:-
It is a closed geometric figure used to display a frequency distribution graphically.
Frequency Curve:-
A smoothed frequency polygon is called a frequency curve. It is useful for getting a
visual impression of the data.
Types of Averages:
(1) Arithmetic Mean
(2) Geometric Mean
(3) Harmonic Mean
(4) Median
(5) Mode
Arithmetic Mean:-
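The value obtained by dividing the sum of the values by their number is called the
Arithmetic Mean (A.M). It is denoted by $\bar{x}$. It is computed by the following formulas
(ungrouped data on the left, grouped data on the right).
Direct Method:
$\bar{x} = \dfrac{\sum x}{n}$,   $\bar{x} = \dfrac{\sum fx}{\sum f}$
Indirect/Shortcut Method:
$\bar{x} = A + \dfrac{\sum D}{n}$,   $\bar{x} = A + \dfrac{\sum fD}{\sum f}$
where $D = x - A$ and $A$ = assumed value or provisional mean.
Step-Deviation/Coding Method:
$\bar{x} = A + \dfrac{\sum u}{n} \times h$,   $\bar{x} = A + \dfrac{\sum fu}{\sum f} \times h$
where $u = \dfrac{x - A}{h}$ and $h$ = class interval.
The sum of squared deviations is least when taken from the mean. For any value $a \neq \bar{x}$:
$\sum f (x - \bar{x})^2 < \sum f (x - a)^2$   (for grouped data)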
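The three methods give the same mean, which a short sketch can confirm. The class marks, frequencies, assumed mean A = 35 and class interval h = 10 below are hypothetical values chosen only for illustration.

```python
# Hypothetical grouped data: class marks x and frequencies f.
x = [25, 35, 45, 55]
f = [2, 3, 4, 1]
A = 35      # assumed (provisional) mean, chosen arbitrarily
h = 10      # class interval

n = sum(f)

# Direct method: sum(f*x) / sum(f)
direct = sum(fi * xi for fi, xi in zip(f, x)) / n

# Shortcut method: A + sum(f*D) / sum(f), with D = x - A
shortcut = A + sum(fi * (xi - A) for fi, xi in zip(f, x)) / n

# Step-deviation (coding) method: A + sum(f*u) / sum(f) * h, with u = (x - A) / h
coding = A + sum(fi * ((xi - A) / h) for fi, xi in zip(f, x)) / n * h

def squared_deviations(center):
    """Sum of f * (x - center)^2."""
    return sum(fi * (xi - center) ** 2 for fi, xi in zip(f, x))

print(direct, shortcut, coding)
print(squared_deviations(direct) < squared_deviations(A))
```

The last line checks the minimum property: the sum of squared deviations about the mean is smaller than about the assumed value A.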
Merits:-
(i) It is simple to understand and easy to calculate.
(ii) It is clearly defined by a mathematical formula.
(iii) It is based on all the values.
(iv) It is capable of further algebraic treatment.
(v) It is a stable average in repeated sampling.
Demerits:-
(i) It is greatly affected by extreme values.
(ii) It cannot be accurately computed for open-end classes without assuming open
ends.
(iii) It cannot be located graphically.
(iv) It is not a suitable average for highly skewed distributions.
Geometric Mean:-
The nth root of the product of all n observations is called the Geometric Mean (G.M). It is
computed by the following formula:
$G.M = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$
(i) If there are k sets with $n_1, n_2, n_3, \ldots, n_k$ observations and $G_1, G_2, G_3, \ldots, G_k$ as
their Geometric Means, then the combined Geometric Mean of the total
observations is given by
$G.M_{combined} = \text{Antilog}\left[\dfrac{\sum_{i=1}^{k} n_i \log G_i}{\sum_{i=1}^{k} n_i}\right]$
(ii) If there are two sets, each consisting of the same number of positive observations,
$x_{11}, x_{12}, x_{13}, \ldots, x_{1k}$ with Geometric Mean $G_1$ and $x_{21}, x_{22}, x_{23}, \ldots, x_{2k}$ with Geometric Mean $G_2$, then the
Geometric Mean G of the ratios $x_{1j} / x_{2j}$ is the ratio of the two Geometric Means: $G = \dfrac{G_1}{G_2}$
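Both properties of the geometric mean can be checked numerically. This is only a sketch: the data sets are hypothetical, the function names are not from the notes, and natural logarithms with `math.exp` stand in for the log and antilog tables (any base gives the same result).

```python
import math

def geometric_mean(values):
    """nth root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def combined_gm(sizes, gms):
    """Antilog of sum(n_i * log G_i) / sum(n_i)."""
    return math.exp(sum(n * math.log(g) for n, g in zip(sizes, gms)) / sum(sizes))

# Property (i): combined G.M of two hypothetical sets equals the G.M of the pooled data.
a = [2, 8, 4]
b = [1, 3, 9, 27]
combined = combined_gm([len(a), len(b)], [geometric_mean(a), geometric_mean(b)])
pooled = geometric_mean(a + b)

# Property (ii): G.M of the ratios equals G1 / G2 for equal-sized sets.
x1 = [2, 8]
x2 = [1, 4]
gm_of_ratios = geometric_mean([p / q for p, q in zip(x1, x2)])
ratio_of_gms = geometric_mean(x1) / geometric_mean(x2)

print(combined, pooled, gm_of_ratios, ratio_of_gms)
```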
Merits:-
(i) It is rigorously defined by a mathematical formula.
(ii) It is based on all observed values.
(iii) It is amenable to mathematical treatment in certain cases.
(iv) It gives equal weightage to all the observations.
(v) It is not much affected by sampling variability / extreme values.
(vi) It is an appropriate type of average to be used in case rates of changes or ratios are
to be averaged.
Demerits:-
(i) It is neither easy to calculate nor to understand.
(ii) It becomes zero if any of the observations is zero.
(iii) In case of negative values, it cannot be computed at all.
Harmonic Mean:-
The reciprocal of the mean of the reciprocals of the observations is called the Harmonic
Mean (H.M). It is computed by the following formula:
$H.M = \dfrac{n}{\sum \frac{1}{x}}$
(i) If there are k sets with $n_1, n_2, n_3, \ldots, n_k$ observations and $H_1, H_2, H_3, \ldots, H_k$ as
their Harmonic Means, then the combined Harmonic Mean of the total
observations is given by
$H.M_{combined} = \dfrac{\sum_{i=1}^{k} n_i}{\sum_{i=1}^{k} \dfrac{n_i}{H_i}}$
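A brief sketch of the combined harmonic mean, checked against the harmonic mean of the pooled observations. The two data sets and the function names are hypothetical and serve only as an illustration.

```python
def harmonic_mean(values):
    """Reciprocal of the mean of the reciprocals: n / sum(1/x)."""
    return len(values) / sum(1 / v for v in values)

def combined_hm(sizes, hms):
    """sum(n_i) / sum(n_i / H_i)."""
    return sum(sizes) / sum(n / h for n, h in zip(sizes, hms))

# Two hypothetical sets of non-zero observations.
a = [2, 4]
b = [1, 2, 4]
combined = combined_hm([len(a), len(b)], [harmonic_mean(a), harmonic_mean(b)])
pooled = harmonic_mean(a + b)
print(combined, pooled)
```

Neither function can be used if any observation is zero, because the reciprocal is then undefined.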
Sanan Fazal Lecturer in Statistics
M.Phil Statistics
+92-313-6212440 University of Gujrat 3
STAT-101 Introduction to Statistics BS Programs
Chapter # 03 Measures of Central Tendency or Averages
Merits:-
(i) It is rigorously defined by a mathematical formula.
(ii) It is based on all the observations in the data.
(iii) It is amenable to mathematical treatment.
(iv) It is not much affected by sampling variability.
(v) It is an appropriate type for averaging rates and ratios.
Demerits:-
(i) It is not readily understood.
(ii) It cannot be calculated, if any one observation is zero.
(iii) It gives less weight to large values and more weight to small values.
Weighted Mean:-
Arithmetic mean is used when all the observations are given equal importance but
there are certain situations in which the different observations get different weights. In this
Properties of Median:-
Quartiles:-
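The three values which divide the distribution into four equal parts are called
quartiles. These values are denoted by $Q_1$, $Q_2$ and $Q_3$ respectively. $Q_1$ is called the first or
lower quartile and $Q_3$ is called the third or upper quartile. $Q_2$ is called the second quartile
and is equal to the median. (Ungrouped data on the left, grouped data on the right.)
$Q_1$ = the value of the $\dfrac{(n+1)}{4}$th observation,   $Q_1 = l + \dfrac{h}{f}\left(\dfrac{n}{4} - c\right)$
$Q_2$ = the value of the $\dfrac{2(n+1)}{4}$th observation,   $Q_2 = l + \dfrac{h}{f}\left(\dfrac{2n}{4} - c\right)$
$Q_2$ = Median
$Q_3$ = the value of the $\dfrac{3(n+1)}{4}$th observation,   $Q_3 = l + \dfrac{h}{f}\left(\dfrac{3n}{4} - c\right)$
where l = lower class boundary of the class containing the quartile, h = class interval,
f = frequency of that class, c = cumulative frequency of the preceding class and n = total frequency.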
Deciles:-
The nine values which divide the distribution into ten equal parts are called Deciles and are
denoted by $D_1, D_2, D_3, \ldots, D_9$. $D_5$ is equal to the median.
$D_1$ = the value of the $\dfrac{(n+1)}{10}$th observation,   $D_1 = l + \dfrac{h}{f}\left(\dfrac{n}{10} - c\right)$
$D_2$ = the value of the $\dfrac{2(n+1)}{10}$th observation,   $D_2 = l + \dfrac{h}{f}\left(\dfrac{2n}{10} - c\right)$
$\vdots$
$D_5$ = the value of the $\dfrac{5(n+1)}{10}$th observation,   $D_5 = l + \dfrac{h}{f}\left(\dfrac{5n}{10} - c\right)$
$D_5$ = Median
$\vdots$
$D_9$ = the value of the $\dfrac{9(n+1)}{10}$th observation,   $D_9 = l + \dfrac{h}{f}\left(\dfrac{9n}{10} - c\right)$
Percentiles:-
The ninety-nine values dividing the data into one hundred equal parts are called
Percentiles and are denoted by $P_1, P_2, P_3, \ldots, P_{99}$, where $P_{25}$ is equal to $Q_1$, $P_{50}$ is equal to the
median and $P_{75}$ is equal to $Q_3$.
$P_1$ = the value of the $\dfrac{(n+1)}{100}$th observation,   $P_1 = l + \dfrac{h}{f}\left(\dfrac{n}{100} - c\right)$
$P_2$ = the value of the $\dfrac{2(n+1)}{100}$th observation,   $P_2 = l + \dfrac{h}{f}\left(\dfrac{2n}{100} - c\right)$
$\vdots$
$P_{25}$ = the value of the $\dfrac{25(n+1)}{100}$th observation,   $P_{25} = l + \dfrac{h}{f}\left(\dfrac{25n}{100} - c\right)$
$\vdots$
$P_{50}$ = the value of the $\dfrac{50(n+1)}{100}$th observation,   $P_{50} = l + \dfrac{h}{f}\left(\dfrac{50n}{100} - c\right)$
$\vdots$
$P_{75}$ = the value of the $\dfrac{75(n+1)}{100}$th observation,   $P_{75} = l + \dfrac{h}{f}\left(\dfrac{75n}{100} - c\right)$
$\vdots$
$P_{99}$ = the value of the $\dfrac{99(n+1)}{100}$th observation,   $P_{99} = l + \dfrac{h}{f}\left(\dfrac{99n}{100} - c\right)$
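Quartiles, deciles and percentiles all follow one pattern, so a single pair of functions can cover them. The sketch below is illustrative: the data are hypothetical, the function names are not from the notes, and the linear interpolation used for a fractional position in ungrouped data is an assumption, since the notes only give the position rule.

```python
def ungrouped_quantile(values, j, parts):
    """Value of the j(n+1)/parts-th observation of the sorted data."""
    data = sorted(values)
    pos = j * (len(data) + 1) / parts        # 1-based position
    k = int(pos)
    frac = pos - k
    if k < 1:
        return data[0]
    if k >= len(data):
        return data[-1]
    # Assumption: interpolate linearly when the position is fractional.
    return data[k - 1] + frac * (data[k] - data[k - 1])

def grouped_quantile(boundaries, freqs, j, parts):
    """l + (h / f) * (j*n/parts - c) for the class containing the quantile."""
    n = sum(freqs)
    target = j * n / parts
    c = 0                                    # cumulative frequency of preceding classes
    for (lower, upper), f in zip(boundaries, freqs):
        if c + f >= target:
            return lower + (upper - lower) / f * (target - c)
        c += f
    raise ValueError("target position lies beyond the data")

# Hypothetical grouped data given by class boundaries and frequencies.
boundaries = [(19.5, 29.5), (29.5, 39.5), (39.5, 49.5), (49.5, 59.5)]
freqs = [4, 6, 8, 2]

q1 = grouped_quantile(boundaries, freqs, 1, 4)
q2 = grouped_quantile(boundaries, freqs, 2, 4)
q3 = grouped_quantile(boundaries, freqs, 3, 4)
d5 = grouped_quantile(boundaries, freqs, 5, 10)
p50 = grouped_quantile(boundaries, freqs, 50, 100)
print(q1, q2, q3, d5, p50)
print(ungrouped_quantile([7, 1, 3, 5, 2, 6, 4], 1, 4))
```

Passing `parts=4`, `10` or `100` gives quartiles, deciles or percentiles, which is why $Q_2$, $D_5$ and $P_{50}$ coincide with the median.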
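Mode:-
The most frequent or most repeated value is called the mode. For grouped data:
$\text{Mode} = l + \dfrac{(f_m - f_1)}{(f_m - f_1) + (f_m - f_2)} \times h$
Where
l = Lower class boundary of the modal group
$f_m$ = Maximum frequency
$f_1$ = Frequency of the class preceding the modal group
$f_2$ = Frequency of the class following the modal group
h = Class Interval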
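The grouped-data mode formula can be sketched as follows. The class boundaries and frequencies are hypothetical, the function name is not from the notes, and treating a missing neighbouring class as having frequency zero is an assumption.

```python
def grouped_mode(boundaries, freqs):
    """l + (fm - f1) / ((fm - f1) + (fm - f2)) * h for the modal class."""
    m = max(range(len(freqs)), key=freqs.__getitem__)    # index of the modal class
    fm = freqs[m]
    f1 = freqs[m - 1] if m > 0 else 0                    # assumption: 0 if no preceding class
    f2 = freqs[m + 1] if m + 1 < len(freqs) else 0       # assumption: 0 if no following class
    lower, upper = boundaries[m]
    return lower + (fm - f1) / ((fm - f1) + (fm - f2)) * (upper - lower)

# Hypothetical grouped data.
boundaries = [(19.5, 29.5), (29.5, 39.5), (39.5, 49.5), (49.5, 59.5)]
freqs = [4, 6, 8, 2]
mode = grouped_mode(boundaries, freqs)
print(mode)
```

If the modal class and both of its neighbours have equal frequencies the denominator is zero, which matches the remark that the mode is sometimes indeterminate.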
Properties of Mode:-
Merits:-
(i) It is simply defined and easily calculated.
(ii) In many cases, it is extremely easy to locate the mode.
(iii) It is not affected by abnormally large or small observations.
(iv) It can be determined for both the qualitative and the quantitative data.
Demerits:-
(i) It is not rigorously defined.
(ii) It is often indeterminate and indefinite.
(iii) It is not based on all the observations made.
(iv) It is not capable of lending itself to further statistical treatment.
(v) When the distribution consists of a small number of values the mode may not
exist.
Its relative measure, known as the co-efficient of range or the co-efficient of
dispersion, is defined by the following relation:
$\text{Co-efficient of range} = \dfrac{x_m - x_0}{x_m + x_0}$
Half of the difference between the upper quartile ($Q_3$) and the lower quartile ($Q_1$) is called the
quartile deviation or semi-interquartile range. It is denoted by Q.D.
$Q.D = \dfrac{Q_3 - Q_1}{2}$
Merits:-
i) It is simple to understand and easy to calculate.
ii) It is not affected by extreme values.
Demerits:-
i) It is not based on all the values.
ii) Q.D. will have the same value for all distributions having the same quartiles.
iii) It gives no information about the position of observations lying outside the two
quartiles.
iv) It is not amenable to mathematical treatment.
Mean Deviation:-
The mean of the absolute deviations of the observations from the mean, median or mode is called
mean deviation.
OR
The mean of the deviations from a central value (mean, median or mode) without
considering the algebraic sign is called mean deviation. It is denoted by M.D. It is
calculated as follows.
$M.D(\bar{x}) = \dfrac{\sum |x - \bar{x}|}{n}$
$\text{Co-efficient of } M.D(\bar{x}) = \dfrac{M.D(\bar{x})}{\bar{x}}$
$\text{Co-efficient of } M.D(\hat{x}) = \dfrac{M.D(\hat{x})}{\hat{x}}$
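A small sketch of the mean deviation and its co-efficient, taken about the mean and about the median. The data and the function name `mean_deviation` are hypothetical.

```python
from statistics import mean, median

def mean_deviation(values, center):
    """Mean of the absolute deviations from a chosen central value."""
    return sum(abs(v - center) for v in values) / len(values)

# Hypothetical observations.
data = [2, 4, 6, 8, 10]

md_mean = mean_deviation(data, mean(data))
coeff_mean = md_mean / mean(data)            # co-efficient of M.D about the mean

md_median = mean_deviation(data, median(data))
coeff_median = md_median / median(data)      # co-efficient of M.D about the median

print(md_mean, coeff_mean, md_median, coeff_median)
```

Here the mean and the median coincide, so both versions give the same result; for skewed data they generally differ.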
Merits:-
i) It is easy to calculate.
ii) It is based on all the values.
iii) It gives more information than the range or the quartile deviation.
Demerits:-
i) It is affected by the extreme values.
ii) It is not readily capable of mathematical development.
iii) It does not take into account the negative signs of the deviations from some
average.
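For population by $\sigma^2$
For sample by $S^2$
It is computed by the following formulas (ungrouped data on the left, grouped data on the right).
Direct Method:
$S^2 = \dfrac{\sum (X - \bar{X})^2}{n}$,   $S^2 = \dfrac{\sum f (X - \bar{X})^2}{\sum f}$
Shortcut Method:
$S^2 = \dfrac{\sum X^2}{n} - \left(\dfrac{\sum X}{n}\right)^2$,   $S^2 = \dfrac{\sum fX^2}{\sum f} - \left(\dfrac{\sum fX}{\sum f}\right)^2$
Deviation Method:
$S^2 = \dfrac{\sum D^2}{n} - \left(\dfrac{\sum D}{n}\right)^2$,   $S^2 = \dfrac{\sum fD^2}{\sum f} - \left(\dfrac{\sum fD}{\sum f}\right)^2$
Coding Method:
$S^2 = h^2\left[\dfrac{\sum u^2}{n} - \left(\dfrac{\sum u}{n}\right)^2\right]$,   $S^2 = h^2\left[\dfrac{\sum fu^2}{\sum f} - \left(\dfrac{\sum fu}{\sum f}\right)^2\right]$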
Standard Deviation:-
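The square root of the mean of the squared deviations from the mean is called the standard
deviation and is denoted
For population by $\sigma$
For sample by $S$
Direct Method:
$S = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n}}$,   $S = \sqrt{\dfrac{\sum f (X - \bar{X})^2}{\sum f}}$
Shortcut Method:
$S = \sqrt{\dfrac{\sum X^2}{n} - \left(\dfrac{\sum X}{n}\right)^2}$,   $S = \sqrt{\dfrac{\sum fX^2}{\sum f} - \left(\dfrac{\sum fX}{\sum f}\right)^2}$
Deviation Method:
$S = \sqrt{\dfrac{\sum D^2}{n} - \left(\dfrac{\sum D}{n}\right)^2}$,   $S = \sqrt{\dfrac{\sum fD^2}{\sum f} - \left(\dfrac{\sum fD}{\sum f}\right)^2}$
Coding Method:
$S = h\sqrt{\dfrac{\sum u^2}{n} - \left(\dfrac{\sum u}{n}\right)^2}$,   $S = h\sqrt{\dfrac{\sum fu^2}{\sum f} - \left(\dfrac{\sum fu}{\sum f}\right)^2}$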
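The direct, shortcut and coding methods all lead to the same variance, as the following sketch shows for grouped data. The class marks, frequencies, assumed mean A = 35 and class interval h = 10 are hypothetical.

```python
import math

# Hypothetical grouped data: class marks x and frequencies f.
x = [25, 35, 45, 55]
f = [2, 3, 4, 1]
A = 35      # assumed mean used for coding
h = 10      # class interval

n = sum(f)
mean = sum(fi * xi for fi, xi in zip(f, x)) / n

# Direct method: sum(f * (X - mean)^2) / sum(f)
direct = sum(fi * (xi - mean) ** 2 for fi, xi in zip(f, x)) / n

# Shortcut method: sum(f*X^2)/sum(f) - (sum(f*X)/sum(f))^2
shortcut = sum(fi * xi ** 2 for fi, xi in zip(f, x)) / n - mean ** 2

# Coding method: h^2 * [sum(f*u^2)/sum(f) - (sum(f*u)/sum(f))^2], u = (X - A) / h
u = [(xi - A) / h for xi in x]
coding = h ** 2 * (
    sum(fi * ui ** 2 for fi, ui in zip(f, u)) / n
    - (sum(fi * ui for fi, ui in zip(f, u)) / n) ** 2
)

sd = math.sqrt(direct)      # standard deviation is the square root of the variance
print(direct, shortcut, coding, sd)
```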
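$S_c^2 = \dfrac{\sum n_i \left[S_i^2 + (\bar{X}_i - \bar{X}_c)^2\right]}{\sum n_i}$,   $i = 1, 2, 3, \ldots, k$
Where, $\bar{X}_c = \dfrac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + \cdots + n_k \bar{X}_k}{n_1 + n_2 + \cdots + n_k}$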
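As a sketch, the combined variance can be computed from the sizes, means and variances of the separate sets. This assumes the standard combined-variance formula, $S_c^2 = \sum n_i [S_i^2 + (\bar{X}_i - \bar{X}_c)^2] / \sum n_i$, with each variance taken with divisor n; the data and the function name are hypothetical.

```python
from statistics import fmean, pvariance

def combined_variance(sizes, means, variances):
    """sum(n_i * (S_i^2 + (mean_i - mean_c)^2)) / sum(n_i), variances with divisor n."""
    total = sum(sizes)
    mean_c = sum(n * m for n, m in zip(sizes, means)) / total    # combined mean
    return sum(
        n * (v + (m - mean_c) ** 2) for n, m, v in zip(sizes, means, variances)
    ) / total

# Two hypothetical sets of observations.
a = [1, 2, 3]
b = [4, 6, 8, 10]
combined = combined_variance(
    [len(a), len(b)],
    [fmean(a), fmean(b)],
    [pvariance(a), pvariance(b)],
)
pooled = pvariance(a + b)    # variance of all observations taken together
print(combined, pooled)
```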
Moments:-
The moments about the mean are the means of the deviations from the mean after raising them
to integer powers. The rth population moment about the mean, denoted by $\mu_r$, is defined as
(ungrouped data on the left, grouped data on the right):
$\mu_r = \dfrac{\sum_{i=1}^{N} (X_i - \bar{X})^r}{N}$,   $\mu_r = \dfrac{\sum f_i (X_i - \bar{X})^r}{\sum f_i}$   where $r = 1, 2, 3, \ldots$
The rth sample moment about the mean, denoted by $m_r$, is defined as:
$m_r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^r}{n}$,   $m_r = \dfrac{\sum f_i (x_i - \bar{x})^r}{\sum f_i}$   where $r = 1, 2, 3, \ldots$
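$m_1 = \dfrac{\sum (x_i - \bar{x})}{n}$,   $m_1 = \dfrac{\sum f_i (x_i - \bar{x})}{\sum f_i}$,   $m_1 = 0$ (Always)
$m_2 = \dfrac{\sum (x_i - \bar{x})^2}{n}$,   $m_2 = \dfrac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}$,   $m_2$ = Variance (Always)
$m_3 = \dfrac{\sum (x_i - \bar{x})^3}{n}$,   $m_3 = \dfrac{\sum f_i (x_i - \bar{x})^3}{\sum f_i}$
$m_4 = \dfrac{\sum (x_i - \bar{x})^4}{n}$,   $m_4 = \dfrac{\sum f_i (x_i - \bar{x})^4}{\sum f_i}$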
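Moments about an arbitrary value a, with $D_i = x_i - a$:
$m_1' = \dfrac{\sum (x_i - a)}{n} = \dfrac{\sum D_i}{n}$,   $m_1' = \dfrac{\sum f_i (x_i - a)}{\sum f_i} = \dfrac{\sum f_i D_i}{\sum f_i}$
$m_2' = \dfrac{\sum (x_i - a)^2}{n} = \dfrac{\sum D_i^2}{n}$,   $m_2' = \dfrac{\sum f_i (x_i - a)^2}{\sum f_i} = \dfrac{\sum f_i D_i^2}{\sum f_i}$
$m_3' = \dfrac{\sum (x_i - a)^3}{n} = \dfrac{\sum D_i^3}{n}$,   $m_3' = \dfrac{\sum f_i (x_i - a)^3}{\sum f_i} = \dfrac{\sum f_i D_i^3}{\sum f_i}$
$m_4' = \dfrac{\sum (x_i - a)^4}{n} = \dfrac{\sum D_i^4}{n}$,   $m_4' = \dfrac{\sum f_i (x_i - a)^4}{\sum f_i} = \dfrac{\sum f_i D_i^4}{\sum f_i}$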
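Coding method, with $u_i = \dfrac{x_i - a}{h}$:
$m_1' = h \dfrac{\sum u_i}{n}$,   $m_1' = h \dfrac{\sum f_i u_i}{\sum f_i}$
$m_2' = h^2 \dfrac{\sum u_i^2}{n}$,   $m_2' = h^2 \dfrac{\sum f_i u_i^2}{\sum f_i}$
$m_3' = h^3 \dfrac{\sum u_i^3}{n}$,   $m_3' = h^3 \dfrac{\sum f_i u_i^3}{\sum f_i}$
$m_4' = h^4 \dfrac{\sum u_i^4}{n}$,   $m_4' = h^4 \dfrac{\sum f_i u_i^4}{\sum f_i}$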
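Moments about zero (origin):
$m_1' = \dfrac{\sum (x_i - 0)}{n} = \dfrac{\sum x_i}{n}$,   $m_1' = \dfrac{\sum f_i (x_i - 0)}{\sum f_i} = \dfrac{\sum f_i x_i}{\sum f_i}$
$m_2' = \dfrac{\sum (x_i - 0)^2}{n} = \dfrac{\sum x_i^2}{n}$,   $m_2' = \dfrac{\sum f_i (x_i - 0)^2}{\sum f_i} = \dfrac{\sum f_i x_i^2}{\sum f_i}$
$m_3' = \dfrac{\sum (x_i - 0)^3}{n} = \dfrac{\sum x_i^3}{n}$,   $m_3' = \dfrac{\sum f_i (x_i - 0)^3}{\sum f_i} = \dfrac{\sum f_i x_i^3}{\sum f_i}$
$m_4' = \dfrac{\sum (x_i - 0)^4}{n} = \dfrac{\sum x_i^4}{n}$,   $m_4' = \dfrac{\sum f_i (x_i - 0)^4}{\sum f_i} = \dfrac{\sum f_i x_i^4}{\sum f_i}$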
If $m_1'$, $m_2'$, $m_3'$ and $m_4'$ are given and we want to calculate the first four moments about the mean, then
we use the following formulas:
$m_1 = m_1' - m_1' = 0$
$m_2 = m_2' - (m_1')^2$ = Variance
$m_3 = m_3' - 3 m_2' m_1' + 2 (m_1')^3$
$m_4 = m_4' - 4 m_3' m_1' + 6 m_2' (m_1')^2 - 3 (m_1')^4$
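The conversion from moments about an arbitrary value to moments about the mean can be verified numerically. In this sketch the data and the arbitrary value a = 3 are hypothetical, and the function names are illustrative.

```python
def raw_moments(values, a=0.0):
    """First four moments about the value a (ungrouped data)."""
    n = len(values)
    return [sum((x - a) ** r for x in values) / n for r in (1, 2, 3, 4)]

def central_from_raw(m1, m2, m3, m4):
    """Moments about the mean obtained from moments about an arbitrary value."""
    return [
        0.0,
        m2 - m1 ** 2,
        m3 - 3 * m2 * m1 + 2 * m1 ** 3,
        m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4,
    ]

# Hypothetical observations and an arbitrary value a.
data = [1, 2, 3, 4, 10]
a = 3

converted = central_from_raw(*raw_moments(data, a))
mean = sum(data) / len(data)
direct = raw_moments(data, mean)     # moments taken directly about the mean
print(converted)
print(direct)
```

The same central moments result whatever value of a is chosen, which is why the formulas are useful with a convenient assumed value.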
$m_2(\text{corrected}) = m_2(\text{uncorrected}) - \dfrac{h^2}{12}$
$m_3(\text{corrected}) = m_3(\text{uncorrected})$
$m_4(\text{corrected}) = m_4(\text{uncorrected}) - \dfrac{h^2}{2} \cdot m_2(\text{uncorrected}) + \dfrac{7}{240} h^4$
Where h = Class Interval
Note:-
These corrections are not applicable to highly skewed distributions and distributions
having unequal class-intervals.
Moment-Ratios:-
There are certain ratios in which both the numerators and denominators are moments.
They are independent of origin and units of measurement, i.e., they are pure numbers.
The 1st moment ratio, $b_1 = \dfrac{m_3^2}{m_2^3}$, is the square of the third moment expressed in standard units and the 2nd
moment ratio, $b_2 = \dfrac{m_4}{m_2^2}$, is the fourth standardized moment.
i) Karl Pearson 1st Co-efficient of Skewness:-
$S_k = \dfrac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$
ii) Karl Pearson 2nd Co-efficient of Skewness:-
Sometimes the mode is ill-defined and is difficult to locate by simple methods;
then
$S_k = \dfrac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$,   $-3 < S_k < +3$
This co-efficient usually lies/varies between -3 (negative skewness) and +3 (positive
skewness) and the sign indicates the direction of the skewness.
iii) Bowley’s Co-efficient of Skewness Based on Quartiles:-
$S_k = \dfrac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$,   $-1 < S_k < +1$
This co-efficient lies between -1 & +1. For symmetrical distributions its value is zero.
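The three co-efficients of skewness are simple ratios, sketched below. The summary values (mean, median, mode, standard deviation and quartiles) are hypothetical numbers chosen only to show the arithmetic, and the function names are illustrative.

```python
def pearson_first(mean, mode, sd):
    """(Mean - Mode) / Standard Deviation."""
    return (mean - mode) / sd

def pearson_second(mean, median, sd):
    """3 * (Mean - Median) / Standard Deviation."""
    return 3 * (mean - median) / sd

def bowley(q1, median, q3):
    """(Q3 + Q1 - 2 * Median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Hypothetical summary measures of a positively skewed distribution.
sk1 = pearson_first(mean=50, mode=45, sd=10)
sk2 = pearson_second(mean=50, median=48, sd=10)
skb = bowley(q1=40, median=48, q3=60)
print(sk1, sk2, skb)
```

All three values are positive here, which indicates positive skewness.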
$b_1 = \dfrac{m_3^2}{m_2^3}$
If $b_1 = 0$ Distribution is Symmetric
Kurtosis:-
The word kurtosis is used to indicate the length of the tails and peakedness of
symmetrical distributions. Symmetrical distributions may be platykurtic, mesokurtic (normal)
or leptokurtic.
i) The mesokurtic is the usual normal distribution.
ii) The leptokurtic is more peaked and has many values around the mean and in the
tails away from the mean. The leptokurtic distribution may be a composite of two
normal distributions with the same mean but different variances.
iii) The platykurtic is somewhat flat and has more values between the mean and the tails. The
platykurtic distribution may be a composite of two normal distributions with the
same variance but different means.
Measures of Kurtosis:-
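If $b_2 = 3$ Distribution is Mesokurtic
If $b_2 > 3$ Distribution is Leptokurtic
If $b_2 < 3$ Distribution is Platykurtic
Percentile co-efficient of kurtosis:
$K = \dfrac{Q.D}{P_{90} - P_{10}}$,   $0 < K < 0.50$,   $K = 0.263$ for a Normal Distribution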
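The moment ratios $b_1 = m_3^2 / m_2^3$ and $b_2 = m_4 / m_2^2$ can be computed directly from the moments about the mean. The following is a minimal sketch with hypothetical symmetric data; the function name `moment_ratios` is illustrative.

```python
def moment_ratios(values):
    """Return (b1, b2) computed from the moments about the mean."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 ** 2 / m2 ** 3, m4 / m2 ** 2

# Hypothetical symmetric data.
b1, b2 = moment_ratios([1, 2, 3, 4, 5])
print(b1, b2)
```

For this data $b_1$ is zero (symmetric) and $b_2$ is below 3, so the distribution is flatter than the normal.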
A device which measures the changes or variations that have occurred in the data is called an
index number. $I = \dfrac{P_n}{P_o} \times 100$
Po
An index number computed for more than one commodity or variable is called a
composite index number.
Index numbers are used as economic barometers for measuring the prevailing
conditions as well as changes in economic variables like wholesale prices, consumer prices,
production, investment, imports, exports, business conditions and terms of trade, etc.
(i) All index numbers are not suitable for all purposes.
(ii) They are based on sampling, and sampling errors creep into the calculations.
(iii) Comparisons of changes in a variable over long periods are not reliable.
(iv) The choice of a normal period is difficult.
(v) It is not practicable to price all the goods and services.
An index number which measures the changes in the wholesale or retail prices of a
particular commodity or a number of commodities is called a price index number.
An index number which measures the changes in the quantity or volume of goods
produced, consumed, exported or imported is called a quantity or volume index number.
The percentage ratio of the price in the current year to the price in the base year is called
the price relative. It is computed by $\dfrac{P_n}{P_o} \times 100$.
Link Relative:-
The percentage ratio of the price in the current year to the price in the preceding year is
called the link relative. It is computed by $\dfrac{P_n}{P_{n-1}} \times 100$.
In the fixed base method, one of the time periods is chosen as the base and the prices
of the other time periods are divided by the base period price; the results are expressed in
percentage form. These results are also called price relatives.
In the chain base method, the price of the preceding year is taken as the base and the
link relatives are computed by the formula $\dfrac{P_n}{P_{n-1}} \times 100$. The link relatives converted to a
fixed base are called chain indices.
The chain index for a year is obtained by multiplying average of the link relatives of
that year by chain index of the preceding year and then dividing the resulting product by
hundred.
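The fixed base and chain base methods can be compared on a single price series. This sketch uses hypothetical prices and assumes the chain index of the first year is set to 100; with one commodity there is only one link relative per year, so no averaging is needed.

```python
# Hypothetical prices of one commodity in four consecutive years.
prices = [50, 60, 66, 55]

# Fixed base method: every price is divided by the base (first) year price.
price_relatives = [p / prices[0] * 100 for p in prices]

# Chain base method: every price is divided by the preceding year price.
link_relatives = [cur / prev * 100 for prev, cur in zip(prices, prices[1:])]

# Chain index = link relative * chain index of preceding year / 100.
chain_indices = [100.0]                      # assumption: first year taken as 100
for lr in link_relatives:
    chain_indices.append(lr * chain_indices[-1] / 100)

print(price_relatives)
print(link_relatives)
print(chain_indices)
```

For a single commodity the chain indices reproduce the fixed base price relatives; with several commodities the two can differ because the link relatives are averaged first.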
An index number that measures the changes in the price or quantity of a group of
commodities when the relative importance of the commodities is not taken into account is called
an un-weighted index number.
An index number that measures the changes in the price or quantity of a group of
commodities when the relative importance of the commodities has been taken into account is
called a weighted index number.
The simple aggregative price index number is the percentage ratio of the sum of
commodity prices in the current year to the sum of commodity prices in the base year.
The weighted aggregative price index number is the percentage ratio of the sum of
weighted commodity prices in the current year to the sum of weighted commodity prices in the
base year. There are four different
formulae of weighted aggregative price index number.
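Laspeyres':   $P_{on} = \dfrac{\sum p_n q_o}{\sum p_o q_o} \times 100$,   $Q_{on} = \dfrac{\sum q_n p_o}{\sum q_o p_o} \times 100$
Paasche's:   $P_{on} = \dfrac{\sum p_n q_n}{\sum p_o q_n} \times 100$,   $Q_{on} = \dfrac{\sum q_n p_n}{\sum q_o p_n} \times 100$
Marshall-Edgeworth:   $P_{on} = \dfrac{\sum p_n q_o + \sum p_n q_n}{\sum p_o q_o + \sum p_o q_n} \times 100$,   $Q_{on} = \dfrac{\sum q_n p_o + \sum q_n p_n}{\sum q_o p_o + \sum q_o p_n} \times 100$
Where
$P_{on} = \dfrac{\sum p_n q_n}{\sum p_o q_o} \times 100$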
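The Laspeyres', Paasche's and Marshall-Edgeworth price index formulas differ only in the quantities used as weights. The sketch below uses hypothetical prices and quantities for two commodities; the helper name `total` is illustrative.

```python
# Hypothetical base year (o) and current year (n) prices and quantities.
p_o = [10, 20]
q_o = [5, 2]
p_n = [12, 25]
q_n = [6, 3]

def total(prices, quantities):
    """Sum of price * quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))

# Laspeyres': base year quantities as weights.
laspeyres = total(p_n, q_o) / total(p_o, q_o) * 100

# Paasche's: current year quantities as weights.
paasche = total(p_n, q_n) / total(p_o, q_n) * 100

# Marshall-Edgeworth: base and current year quantities combined.
marshall_edgeworth = (
    (total(p_n, q_o) + total(p_n, q_n)) / (total(p_o, q_o) + total(p_o, q_n)) * 100
)

print(laspeyres, paasche, marshall_edgeworth)
```

The Marshall-Edgeworth value always lies between the Laspeyres' and Paasche's values, since it pools their numerators and denominators.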
The prices of the selected commodities are to be carefully collected from the
selected markets through enumerators, trade associations, chambers of commerce,
news correspondents, govt. price reporters, etc.
The period with which prices in other periods are to be compared is called the base
period. It should be a normal year. There are two methods for selecting the base
period: (i) Fixed Base Method (ii) Chain Base Method
The price relatives are computed by the fixed base method and the link relatives by
the chain base method for more than one commodity; these relatives
are then averaged to get the index number.
An index that measures the changes in the price of a specific basket of goods and
services between current year and base year is called cost of living index number. The basket
of goods and services contains (i) Food (ii) Clothing (iii) House Rent (iv) Education (v) Misc.
etc.
$P_{on} = \dfrac{\sum IW}{\sum W}$
Where $\sum W = \sum p_o q_o$ and $I = \dfrac{p_n}{p_o} \times 100$
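A minimal sketch of the weighted average of relatives, with weights $W = p_o q_o$ and price relatives $I = \frac{p_n}{p_o} \times 100$. The basket items, prices and quantities are hypothetical.

```python
# Hypothetical basket: item -> (base price, current price, base quantity).
basket = {
    "Food": (10, 12, 5),
    "Clothing": (20, 25, 2),
}

# Weights W = p_o * q_o and price relatives I = p_n / p_o * 100.
weights = {item: p_o * q_o for item, (p_o, p_n, q_o) in basket.items()}
relatives = {item: p_n / p_o * 100 for item, (p_o, p_n, q_o) in basket.items()}

# Index = sum(I * W) / sum(W)
index = sum(relatives[item] * weights[item] for item in basket) / sum(weights.values())
print(index)
```

With these weights the result equals $\frac{\sum p_n q_o}{\sum p_o q_o} \times 100$, because each product $I \times W$ reduces to $100\, p_n q_o$.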
(i) The price index numbers are used to measure the changes in the price of
commodities or a group of related commodities.
(ii) They measure the purchasing power of money.
(iii) They are used to measure the changes in the level of industrial production.
(iv) They are used as economic barometers to forecast the business conditions of the
country and to discover seasonal fluctuations and business cycles.
The consumer price index numbers are used to measure the changes in retail price of
specified in formulating the policies.