Introduction To Statistics
Statistics is a branch of mathematics that deals with the collection, organization, presentation,
analysis, and interpretation of numerical data. It helps in making informed decisions across various
fields, including economics, business, medicine, and the social and physical sciences.
Definition and Meaning of Statistics
The term "statistics" has two distinct meanings:
Statistics in a Plural Sense
Refers to numerical descriptions of quantitative aspects of objects or events.
Represents numerical facts related to a collection of data.
Statistics in a Singular Sense
Refers to the scientific methods used in collecting, organizing, presenting, analyzing, and
interpreting data.
Aids in making effective decisions based on numerical evidence.
Mr. owino 1
Introduction to business statistics
Includes performing estimations, hypothesis testing, determining relationships between variables,
and making predictions.
Example: Researchers use inferential statistics to analyze survey results and predict future trends.
Characteristics of Statistics
Statistics is a scientific method used to collect, analyze, and interpret numerical data. The key
characteristics of statistics include:
Statistics Means an Aggregate of Facts
Statistics refers to a collection of facts that can be analyzed.
A single fact cannot be statistically analyzed; multiple facts are required.
Example: The weights of 60 students in a class can be analyzed, but the weight of a single student is
not considered statistics.
Statistics Are Affected by Multiple Causes
Statistical facts are influenced by various interacting factors.
The results observed are not due to a single cause but a combination of several factors.
Statistics Are Numerically Expressed
Only numerical data can be analyzed statistically.
Example: A statement like "Price decreases with increasing production" is not statistics unless
expressed numerically.
Statistics Are Enumerated or Estimated with Accuracy
Statistical data should be measured or estimated with an appropriate degree of accuracy.
The level of accuracy required depends on the purpose of the study.
Statistics Are Collected Systematically
Data should be collected using planned and scientific methods.
Improper collection methods may result in misleading conclusions.
Statistics Are Collected for a Pre-Determined Purpose
Statistical data must have a clear objective.
Data collected without a specific purpose may be irrelevant or unusable.
Statistics Are Related to Each Other
Statistical facts should be arranged logically to facilitate comparison and analysis.
Only related data, when properly organized, can provide meaningful insights.
Functions of Statistics
Statistics serves various functions in data analysis and decision-making. The six major functions
include:
i. Presentation of Facts in a Precise and Definite Form
Statistics ensures clarity and avoids ambiguity in data representation.
ii. Simplification of Large Data Sets
Large amounts of data are condensed and organized for better understanding.
1. Qualitative Variables
A qualitative variable (also known as a categorical variable) represents attributes or characteristics
that cannot be expressed numerically. These variables are descriptive in nature and categorize data
into distinct groups.
Examples:
Sex: Male or Female
State of Birth: Nairobi, Mombasa, Kisumu, etc.
Cause of Death: Heart disease, Cancer, Accident, etc.
Religion: Catholic, Protestant, Muslim, Hindu, etc.
Key Characteristics:
Qualitative data is often summarized using frequency counts or proportions.
It is typically displayed using bar charts, pie charts, and frequency tables.
2. Quantitative Variables
A quantitative variable is numerical and represents measurable quantities. These values can be
ordered, ranked, and used in mathematical computations.
Quantitative variables are further divided into:
a. Discrete Variables
A discrete variable assumes specific, countable values, usually whole numbers. It cannot take
fractional or decimal values.
Examples:
Number of children in a family (e.g., 1, 2, 3, etc.)
Number of rooms in a house
Number of patients in a hospital
b. Continuous Variables
A continuous variable can take any value within a given range. These values are obtained by
measurement and may include fractions or decimals.
Examples:
Height of individuals (e.g., 170.5 cm, 180.2 cm)
Weight of a person (e.g., 65.8 kg, 72.4 kg)
Age (e.g., 25.5 years, 30.2 years)
Body temperature (e.g., 36.5°C, 37.2°C)
Key Characteristics:
Continuous data can be represented using histograms, line graphs, and scatter plots.
Measurement precision depends on the measuring instrument used.
STATISTICAL METHODS
Introduction to Statistical Methods
Statistical methods are essential for analyzing numerical data, making informed decisions, and
understanding various phenomena. The following procedures are commonly adopted:
Data Collection – Gather relevant information.
Data Organization – Arrange data systematically.
Data Presentation – Use diagrams, charts, and tables.
Data Analysis – Compute averages, measure disparities.
Data Interpretation – Understand patterns and trends.
Policy Decision – Utilize findings for improvement.
I. Collection of Data
Statistics is centered around numerical data analysis. The first step in statistical methodology is data
collection, which can be categorized into:
a. Primary Data
Primary data is collected firsthand by an investigator for a specific purpose, making it original and
reliable. It is obtained through:
Census: Examines every item in the population, ensuring completeness and representativeness but
requiring significant resources.
Sample: A subset of the population that is cost-effective and time-efficient but may lack universal
acceptability.
b. Secondary Data
Secondary data refers to information collected by other researchers or agencies. Examples include
government publications such as financial statistics and economic reports.
Advantages of Secondary Data:
Saves time, manpower, and resources.
Potential Issues with Secondary Data:
May be outdated.
Data collection and analysis methods may be unknown.
Risk of bias due to different collection purposes.
Errors in transcription or publication.
Considerations Before Using Secondary Data:
Purpose and credibility of the publishing institution.
Accuracy and relevance to the research problem.
Completeness and consistency across sources.
Homogeneity of data conditions.
Sampling Methods
Sampling allows researchers to study a portion of a population rather than the entire group. There
are two primary categories:
i. Probabilistic Methods
Simple Random Sampling – Every item in the population has an equal chance of selection.
Stratified Random Sampling – The population is divided into strata (e.g., age, gender) with
random sampling from each.
Cluster Sampling – Groups or clusters (e.g., city blocks) are randomly selected, and all
elements within them are studied.
ii. Non-Probabilistic Methods
Judgment Sampling – The researcher selects items based on expertise.
Convenience Sampling – Items are chosen based on ease of access.
Quota Sampling – Ensures the sample contains specific characteristics (e.g., political
polling).
Population vs. Sample
Population: The entire collection of items being studied.
Sample: A subset selected to represent the population.
Example:
To study high-school teachers' salaries, the population consists of all teachers' salaries, whereas a
sample could be 100 randomly chosen salaries.
In a political poll, the population includes all eligible voters, while a sample consists of 1,000
randomly selected individuals.
Misuse of Statistics
Statistical data itself is neutral, but misuse can occur due to:
Using data for unintended purposes.
Bias in data collection.
Careless analysis leading to misleading conclusions.
NB: To ensure reliability, data should be collected, analyzed, and interpreted systematically,
keeping in mind the integrity of statistical principles.
Application of Statistics in Business
Quality Control
In every industry, quality control plays a crucial role in ensuring that products meet the required
standards. Quality control departments are responsible for maintaining these standards, ensuring that
goods comply with customer expectations. In Kenya, the Kenya Bureau of Standards (KeBS) is a
national institution that inspects and certifies products on behalf of the government to guarantee
quality assurance.
To achieve this, KeBS and other regulatory bodies have developed quality control charts. These
charts serve as monitoring tools that help determine whether manufactured products meet the desired
standards. If a product deviates from the acceptable quality levels, corrective actions are taken to
maintain consistency in production and customer satisfaction.
Economic Order Quantity (EOQ) and Inventory Management
Statistics plays a significant role in inventory management, particularly in determining the Economic
Order Quantity (EOQ). EOQ is the optimal quantity of stock that should be ordered to minimize
costs while ensuring that customer demand is met efficiently.
Ordering large quantities of stock may lead to excessive storage costs and capital being tied up in
inventory, which could have been used for other business investments. Conversely, ordering small
quantities reduces storage costs but may lead to stockouts, resulting in unsatisfied customers and lost
sales.
By applying statistical analysis, businesses can calculate the EOQ that balances storage costs and
customer demand, ensuring cost-effectiveness and operational efficiency.
Forecasting in Business
Forecasting is an essential statistical application that enables business managers to predict future
trends and outcomes. Statistical methods, such as regression analysis, help in establishing
relationships between dependent and independent variables. This allows businesses to develop
predictive models that guide decision-making processes.
For example, sales forecasting helps businesses anticipate demand, manage inventory efficiently,
and optimize production schedules. Similarly, financial forecasting enables organizations to plan
budgets, allocate resources, and set strategic goals based on historical data and market trends.
Human Resource Management
Statistics is also valuable in human resource management for workforce planning and organizational
improvement. By conducting employee surveys and analyzing workforce data, businesses can
identify areas where management needs to improve.
For instance, statistical analysis of employee resignation trends can help identify reasons for high
turnover rates. If data shows that employees are leaving due to workplace dissatisfaction,
management can take corrective measures such as improving working conditions, enhancing
employee engagement, and offering better incentives.
Additionally, statistical tools can be used to measure employee performance, training effectiveness,
and productivity levels, aiding in informed decision-making for human resource development.
TOPIC TWO: STATISTICAL MEASUREMENTS
MEASURES OF CENTRAL TENDENCY
These are statistical values which tend to occur at the centre of any well-ordered set of data.
Whenever these measures occur, they indicate the centre of that data. These measures are as
follows:
The arithmetic mean
The mode
The median
The geometric mean
The harmonic mean
1. The arithmetic mean
This is commonly known as the average or mean. It is obtained by summing the values given and
dividing the total by the number of observations, i.e.
$$\bar{x} = \frac{\sum X}{n}$$
Where X = the individual values
∑ = summation
n = number of observations
Example
The mean of 60, 80, 90, 120
$$\bar{x} = \frac{60 + 80 + 90 + 120}{4} = \frac{350}{4} = 87.5$$
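The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula; the variable names are chosen here and are not part of the notes.

```python
# Arithmetic mean of the four values used in the example.
values = [60, 80, 90, 120]

total = sum(values)          # sum of all observations
n = len(values)              # number of observations
mean = total / n

print(total, n, mean)        # 350 4 87.5
```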
The arithmetic mean is very useful because it represents the values of most observations in the
population.
The mean therefore describes the population quite well in terms of the magnitudes attained by most
of the members of the population
Computation of the mean from grouped data, i.e. data in classes
When calculating the mean of grouped data, we use the following formula:
$$\bar{x} = \frac{\sum fX}{\sum f}$$
Where: f = Frequency
X = Midpoint
Example:
Compute the mean amount of expenditure on food and other household items incurred by the Bahati
Estate families last month.
Expenditure (Shs '000s)   No. of Families (f)   Class Midpoint (x)   fx
5 – 10                    8                     7.5                  60
10 – 15                   9                     12.5                 112.5
15 – 20                   15                    17.5                 262.5
20 – 25                   11                    22.5                 247.5
25 – 30                   6                     27.5                 165
30 – 35                   3                     32.5                 97.5
35 – 40                   2                     37.5                 75
Total                     ∑f = 54                                    ∑fx = 1020
$$\bar{x} = \frac{\sum fX}{\sum f} = \frac{1020}{54} = 18.889$$
i.e. Shs 18,889
This means that on average each of the Bahati Estate families spent about Shs 18,889 on food and
other household items last month.
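The grouped-mean calculation can be checked with a short Python sketch. It assumes each class is represented by its midpoint, as in the table; the tuple layout and variable names are illustrative choices.

```python
# (lower limit, upper limit, number of families) for each expenditure class,
# in Shs '000s, copied from the Bahati Estate table.
classes = [
    (5, 10, 8), (10, 15, 9), (15, 20, 15), (20, 25, 11),
    (25, 30, 6), (30, 35, 3), (35, 40, 2),
]

total_f = sum(f for _, _, f in classes)
# Each class contributes frequency x midpoint to the sum of fx.
total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
grouped_mean = total_fx / total_f

print(total_f, total_fx, round(grouped_mean, 3))   # 54 1020.0 18.889
```

Because midpoints stand in for the individual values, the result is an estimate of the true mean.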
Characteristics of the Mean
The mean, also known as the arithmetic average, is one of the most commonly used measures of
central tendency. It provides a representative value for a given set of data. Below are the key
characteristics of the mean:
Applicability to Interval and Ratio Data
The mean can only be calculated for data measured on an interval or ratio scale. These scales
provide meaningful numerical values that allow for proper computation of the average.
Inclusion of All Values in Computation
When computing the mean, all values in the dataset are considered. This is particularly true for both
the arithmetic mean and the weighted mean. However, for grouped data, the mean is only an
estimate because individual data values are not directly used; instead, class midpoints are utilized.
The mean is highly sensitive to extreme values (outliers). Unusually large or small values can
significantly impact the mean, making it higher or lower than the typical data points.
Sum of Deviations from the Mean
One of the unique mathematical properties of the arithmetic mean is that the sum of the deviations of
all data points from the mean is always zero. This characteristic is fundamental in various statistical
applications and calculations.
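As a quick numerical check of this property, the sketch below reuses the four values from the earlier mean example (60, 80, 90, 120). It is an illustration only, not a proof.

```python
# Deviations of each value from the arithmetic mean sum to zero.
values = [60, 80, 90, 120]
mean = sum(values) / len(values)            # 87.5

deviations = [x - mean for x in values]
print(deviations)                           # [-27.5, -7.5, 2.5, 32.5]
print(sum(deviations))                      # 0.0
```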
Challenges with Open-Ended Distributions
It may not always be possible to compute the mean for open-ended distributions. An open-ended
distribution is one where the lowest class has no defined lower limit or the highest class lacks an
upper limit. Since these distributions do not provide a definitive class midpoint, accurate
computation of the mean becomes difficult.
The following statistical terms are commonly used in statistical calculations. They must therefore be
clearly understood.
Class limits
These are the numerical values which limit the extent of a given class, i.e. all the observations in a
given class are expected to fall within the interval bounded by the class limits, e.g. 5 and 10 are
the class limits of the first class in the table of the example above.
Class boundaries
These are statistical boundaries which separate one class from the next. They are usually
determined by adding the upper class limit of one class to the lower class limit of the next class and
dividing by 2, e.g. in the above table the class boundary between the classes 5 – 10 and 10 – 15 is
$\frac{10 + 10}{2} = 10$.
Class Mid points
These are very important values which mark the centre of a given class. They are obtained by adding
together the two limits of a given class and dividing the result by 2.
Class interval/width
This is the difference between an upper-class boundary and lower-class boundary. The value usually
measures the length of a given class.
2. The Mode
The mode is one of the key measures of central tendency in statistics. It is defined as the value in a
frequency distribution that appears most frequently. In cases where no single value appears most
frequently, we refer to the class with the highest frequency as the modal class.
Importance of the Mode in Business
The mode is widely used in business to make data-driven decisions. Firms often rely on this measure
to determine the most popular or frequently demanded products. For example, businesses may use
the mode to:
Identify the most frequently purchased footwear or clothing items.
Determine the most popular construction materials such as beams, wires, and iron sheets.
Analyze customer preferences for better inventory management.
Determining the Mode
Ungrouped Data
For ungrouped data, the mode is determined by:
Arranging the given values in ascending or descending order.
Identifying the value that appears most frequently.
Grouped Data
When dealing with grouped data, the mode can be determined using the modal class (the class with
the highest frequency). The mode can be calculated using the following formula:
$$\text{Mode} = L + \left[\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right] \times c$$
Where:
L = Lower boundary of the modal class
f1 = Frequency of the modal class
f0 = Frequency of the class before the modal class
f2 = Frequency of the class after the modal class
c = Class width
Example
IQ Range     Frequency (f)   Upper Class Bound   Cumulative Frequency (CF)
1 – 20       6               20                  6
21 – 40      18              40                  24
41 – 60      32 (f0)         60                  56
61 – 80      48 (f1)         80                  104
81 – 100     27 (f2)         100                 131
101 – 120    13              120                 144
121 – 140    2               140                 146
The modal class is 61 – 80, whose lower boundary is 60.5.
$$\text{Mode} = L + \left[\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right] \times c$$
$$\text{Mode} = 60.5 + \left[\frac{48 - 32}{2(48) - 32 - 27}\right] \times 20$$
Mode ≈ 69.15
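The grouped-mode formula can be sketched as a small Python function. The function and parameter names are illustrative; the inputs are the modal-class figures from the IQ table (L = 60.5, f0 = 32, f1 = 48, f2 = 27, c = 20).

```python
def grouped_mode(lower_boundary, f0, f1, f2, width):
    """Mode of grouped data: L + (f1 - f0) / (2*f1 - f0 - f2) * c."""
    return lower_boundary + (f1 - f0) / (2 * f1 - f0 - f2) * width

mode = grouped_mode(lower_boundary=60.5, f0=32, f1=48, f2=27, width=20)
print(round(mode, 2))   # 69.15
```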
3. The Median
The median is a statistical value that represents the middle value in a given set of data that has been
arranged in ascending order.
It is an important measure of central tendency because it divides the data into two equal halves,
ensuring that the number of observations below and above it is equal.
Example
Consider the following data set:
14, 17, 9, 8, 20, 32, 18, 14.5, 13
Step 1: Arrange the data in ascending order
8, 9, 13, 14, 14.5, 17, 18, 20, 32
Step 2: Identify the middle value
Since the dataset consists of 9 values, the middle value is the 5th observation in the ordered set:
Median = 14.5
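The two steps above can be sketched in Python, with the standard library's `statistics.median` used as a cross-check. The variable names are illustrative.

```python
import statistics

data = [14, 17, 9, 8, 20, 32, 18, 14.5, 13]

# Step 1: arrange the data in ascending order.
ordered = sorted(data)
# Step 2: with an odd number of values, take the middle one.
middle_index = len(ordered) // 2          # index 4 is the 5th observation
median = ordered[middle_index]

print(ordered)                            # [8, 9, 13, 14, 14.5, 17, 18, 20, 32]
print(median, statistics.median(data))    # 14.5 14.5
```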
Importance of the Median
The median divides the data into two equal halves.
It is not affected by extreme values (outliers), making it a reliable measure of central
tendency for skewed data.
It provides a more accurate representation of the center of a dataset compared to the mean
when data is asymmetrical.
Determining the Median for Grouped Data
When data is grouped into classes, the median can be calculated using the median formula:
$$\text{Median} = L + \left[\frac{\frac{n+1}{2} - \text{Cfbm}}{\text{Cfmc}}\right] \times C$$
Where:
L = Lower boundary of the median class
n = Total number of observations
Cfbm = Cumulative frequency before the median class
Cfmc = Frequency of the median class
C = Class width
Example
Referring to the table below, determine the median
IQ Range     No. of Residents (f)   Upper Class Boundary (UCB)   Cumulative Frequency
0 – 20       6                      20                           6
20 – 40      18                     40                           24
40 – 60      32                     60                           56
60 – 80      48                     80                           104
80 – 100     27                     100                          131
100 – 120    13                     120                          144
120 – 140    2                      140                          146
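$$\text{Median} = L + \left[\frac{\frac{n+1}{2} - \text{Cfbm}}{\text{Cfmc}}\right] \times C$$
The median lies at position (146 + 1) / 2 = 73.5, which falls in the class 60 – 80.
$$\text{Median} = 60 + \left[\frac{73.5 - 56}{48}\right] \times 20$$
Median = 60 + 7.29
Median = 67.29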
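The grouped-median formula can be sketched in Python as follows. The function and parameter names are illustrative; the inputs come from the table (L = 60, n = 146, Cfbm = 56, Cfmc = 48, C = 20), and the result evaluates to about 67.29.

```python
def grouped_median(lower_boundary, n, cf_before, f_median, width):
    """Median of grouped data: L + ((n + 1)/2 - Cfbm) / Cfmc * C."""
    position = (n + 1) / 2
    return lower_boundary + (position - cf_before) / f_median * width

median = grouped_median(lower_boundary=60, n=146, cf_before=56,
                        f_median=48, width=20)
print(round(median, 2))   # 67.29
```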
4. Geometric mean
This is a measure of central tendency normally used to measure industrial growth rates.
It is defined as the nth root of the product of n observations or values, i.e.
$$GM = \sqrt[n]{X_1 \times X_2 \times X_3 \times \dots \times X_n}$$
Example
In 1995 five firms registered the following economic growth rates: 26%, 32%, 41%, 18% and 36%.
Required
Calculate the GM for the above values.
$$GM = \sqrt[5]{X_1 \times X_2 \times X_3 \times X_4 \times X_5}$$
$$GM = \sqrt[5]{26 \times 32 \times 41 \times 18 \times 36}$$
$$GM = \sqrt[5]{22{,}104{,}576}$$
GM = 29.44%
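A minimal Python sketch of the same calculation, assuming Python 3.8 or later for `math.prod`; the variable names are illustrative.

```python
import math

rates = [26, 32, 41, 18, 36]          # growth rates in percent

product = math.prod(rates)            # product of the n values
gm = product ** (1 / len(rates))      # nth root of the product

print(product, round(gm, 2))          # 22104576 29.44
```

For long lists the product can become very large, so averaging the logarithms of the values and taking the exponential is a common alternative.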
5. Harmonic mean
This is a measure of central tendency which is used to determine the average growth rates for natural
economies. It is defined as the reciprocal of the average of the reciprocals of all the values given, i.e.
$$HM = \frac{1}{\frac{1}{n}\left(\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \dots + \frac{1}{x_n}\right)}$$
Example
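The economic growth rates of five countries were given as 20%, 15%, 25%, 18% and 5%.
Calculate the harmonic mean.
$$HM = \frac{1}{\frac{1}{n}\left(\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \frac{1}{x_4} + \frac{1}{x_5}\right)}$$
$$HM = \frac{1}{\frac{1}{5}\left(\frac{1}{20} + \frac{1}{15} + \frac{1}{25} + \frac{1}{18} + \frac{1}{5}\right)}$$
$$HM = \frac{1}{\frac{1}{5}(0.4122)} = \frac{1}{0.08244}$$
HM = 12.13%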
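The definition can be checked with a short Python sketch. For the five rates listed in the example (20, 15, 25, 18 and 5) the calculation gives about 12.13%; `statistics.harmonic_mean` from the standard library is used as a cross-check, and the variable names are illustrative.

```python
import statistics

rates = [20, 15, 25, 18, 5]                  # growth rates in percent

# Reciprocal of the average of the reciprocals.
mean_of_reciprocals = sum(1 / x for x in rates) / len(rates)
hm = 1 / mean_of_reciprocals

print(round(hm, 2))                               # 12.13
print(round(statistics.harmonic_mean(rates), 2))  # 12.13
```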
Merits and demerits of the measures of central tendency
1. The arithmetic mean (a.m)
Merits
It utilizes all the observations given
It is a very useful statistic in terms of applications. It has several applications in business
management, e.g. hypothesis testing, quality control, etc.
It is the best representative of a given set of data if such data was obtained from a normal
population
The a.m. can be determined accurately using mathematical formulas
Demerits of the a.m.
If the data is not drawn from a ‘normal’ population, then the a.m. may give a wrong
impression about the population
In some situations, the a.m. may give unrealistic values, especially when dealing with discrete
variables, e.g. when working out the average number of children in a number of families.
It may be found that the average is 4.4, which is unrealistic for human beings
2. The mode
Merits
It can be determined from incomplete data provided the observations with the highest
frequency are already known
The mode has several applications in business
The mode can be easily defined
It can be determined easily from a graph
Demerits
If the data is quite large and ungrouped, determination of the mode can be quite cumbersome
Use of the formula to calculate the mode is unfamiliar to most business people
The mode may sometimes be non-existent or there may be two modes for a given set of data.
In such a case therefore, a single mode may not exist
3. The median
Merits
It shows the centre of a given set of data
Knowledge of the determination of the median may be extended to determine the quartiles
The median can easily be defined
It can be obtained easily from the cumulative frequency curve
It can be used in determining the degree of skewness (see later)
Demerits
In some situations where the number of observations is even, the value of the median obtained is
the average of the two middle values and may not be an actual observation
The computation of the median using the formulas is not well understood by most
businessmen
In a business environment the median has very few applications
4. The geometric mean
Merits
It makes use of all the values given (except when x = 0 or negative)
It is the best measure for industrial growth rates
Demerits
The determination of the GM using logarithms is not a familiar process to all those
expected to use it, e.g. managers
If the data contains zeros or negative values, the GM ceases to exist
5. The harmonic mean and weighted mean
Merits – same as the arithmetic mean
Demerits – same as the arithmetic mean
MEASURES OF DISPERSION
Measures of dispersion are essential statistical tools that help determine how data values are spread
around the mean. They provide insights into whether data points are clustered close to the mean or
widely scattered.
When data points are closely dispersed around the mean, the measure of dispersion is small,
indicating that the mean is a reliable representation of the dataset.
When data points are widely spread, the measure of dispersion is large, suggesting that the mean
does not adequately represent the dataset.
The most frequently used measures of dispersion include:
Range
Absolute Mean Deviation
Standard Deviation
Semi-Interquartile and Quartile Deviation
10th and 90th Percentile Range
Variance
1. The Range
The range is the simplest measure of dispersion and is defined as the difference between the highest
and lowest values in a dataset:
Range = Maximum Value – Minimum Value
Characteristics of the Range:
i. It only considers two values from the dataset, making it less efficient in describing data
dispersion.
ii. A smaller range indicates that data points are closer to the mean, implying lower dispersion.
iii. A larger range suggests that data points are more widely spread, indicating higher dispersion.
Limitations of the Range:
i. It does not account for the distribution of other values within the dataset.
ii. Two datasets can have the same range but differ significantly in their overall dispersion.
iii. Due to its limitations, the range is not commonly used in business management for decision-
making.
2. The standard deviation
This is one of the most accurate measures of dispersion.
It has the following advantages;
i. It utilizes all the values given
ii. It makes use of both negative and positive values if they occur
iii. The standard deviation reflects an accurate impression of how much the sample data
varies from the mean. This is because its suitability can also be tested using other
statistical methods
Example
A sample comprises the following observations: 14, 18, 17, 16, 25, 31
Determine the standard deviation of this sample.
Mean, x̄ = 121 / 6 ≈ 20.1
Observation (x)   (x − x̄)   (x − x̄)²
14                −6.1       37.21
18                −2.1       4.41
17                −3.1       9.61
16                −4.1       16.81
25                4.9        24.01
31                10.9       118.81
Total 121                    210.86
∴ standard deviation, $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{210.86}{6}}$
= 5.93
Alternative method
x       x²
14      196
18      324
17      289
16      256
25      625
31      961
Total 121   2651
∴ standard deviation, $\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \sqrt{\frac{2651}{6} - \left(\frac{121}{6}\right)^2}$
= 5.93
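Both routes to the standard deviation can be sketched in Python. The sketch assumes the divisor is n (the population form), which is what reproduces the 5.93 obtained above; dividing by n − 1 instead would give a larger value. Variable names are illustrative.

```python
import math

sample = [14, 18, 17, 16, 25, 31]
n = len(sample)
mean = sum(sample) / n

# Method 1: square root of the mean squared deviation from the mean.
sd_deviations = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)

# Method 2: square root of (mean of squares minus square of the mean).
sum_x2 = sum(x * x for x in sample)
sd_squares = math.sqrt(sum_x2 / n - mean ** 2)

print(sum(sample), sum_x2)                            # 121 2651
print(round(sd_deviations, 2), round(sd_squares, 2))  # 5.93 5.93
```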
Example 2
The following table shows the part-time rate per hour of a given no. of laborers in the month of June
1997.
Calculate the standard deviation from the above table showing how the hourly payment was varying
from the respective mean
∴ standard deviation = 96.29
In business statistical work we usually encounter a set of grouped data. In order to determine the
standard deviation from such data, we use any of the three following methods
i. The long method
ii. The shorter method
iii. The coded method
The above methods are used in the following examples
Example 3.1
The quality controller in a given firm had an accurate record of all the iron bars produced in May
1997. The following data shows those records.
i. Using long method
Bar lengths (cm)   No. of bars (f)   Class midpoint (x)   fx           fx²
201 – 250          25                225.5                5,637.5      1,271,256.25
251 – 300          36                275.5                9,918        2,732,409
301 – 350          49                325.5                15,949.5     5,191,562.25
351 – 400          80                375.5                30,040       11,280,020
401 – 450          51                425.5                21,700.5     9,233,562.75
451 – 500          42                475.5                19,971       9,496,210.50
501 – 550          30                525.5                15,765       8,284,507.50
Total              313                                    118,981.5    47,489,528.25
∴ standard deviation, $\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2}$
$= \sqrt{\frac{47489528.25}{313} - \left(\frac{118981.5}{313}\right)^2}$
= 84.99 cm
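The column totals and the resulting standard deviation can be checked with a short Python sketch. It assumes the formula σ = √(∑fx²/∑f − (∑fx/∑f)²), which is the form suggested by the fx and fx² columns and which reproduces 84.99 cm; the variable names are illustrative.

```python
import math

# (class midpoint in cm, number of bars) from the iron-bar table.
bars = [(225.5, 25), (275.5, 36), (325.5, 49), (375.5, 80),
        (425.5, 51), (475.5, 42), (525.5, 30)]

sum_f = sum(f for _, f in bars)
sum_fx = sum(f * x for x, f in bars)
sum_fx2 = sum(f * x * x for x, f in bars)

sd = math.sqrt(sum_fx2 / sum_f - (sum_fx / sum_f) ** 2)

print(sum_f, sum_fx, sum_fx2)   # 313 118981.5 47489528.25
print(round(sd, 2))             # 84.99
```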
∴ Standard deviation, σ = 84.99 cm
iii. Using coded method
= 50 × 1.6997
= 84.99
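The coded method can be sketched in Python for the iron-bar data. The class width c = 50 comes from the working above; the assumed mean A = 375.5 (the midpoint of the class with the highest frequency) is an assumption made here for illustration, since any class midpoint would serve. Each midpoint is coded as u = (x − A) / c, the standard deviation of the coded values is found, and the result is multiplied by c.

```python
import math

# (class midpoint in cm, number of bars) from the iron-bar table.
bars = [(225.5, 25), (275.5, 36), (325.5, 49), (375.5, 80),
        (425.5, 51), (475.5, 42), (525.5, 30)]

A = 375.5   # assumed mean (hypothetical choice for this sketch)
c = 50      # class width

coded = [((x - A) / c, f) for x, f in bars]      # u = -3, -2, ..., 3
sum_f = sum(f for _, f in coded)
sum_fu = sum(f * u for u, f in coded)
sum_fu2 = sum(f * u * u for u, f in coded)

coded_sd = math.sqrt(sum_fu2 / sum_f - (sum_fu / sum_f) ** 2)
sd = c * coded_sd

print(sum_fu, sum_fu2)    # 29.0 907.0
print(round(sd, 2))       # 84.99
```

Coding keeps the intermediate numbers small, which is why it was favoured for hand calculation.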
3. The Variance
Example 1
The following table shows the levels of retirement benefits given to a group of workers in a given
establishment.
Required
= 29.5 + 6.63
= £36.13
The upper quartile (Q3) lies on position
= 287.25
= 12.475
= £12,475
ii. The top 10% is equivalent to the lower 90% of the retirees
The position corresponding to the lower 90%
= 0.9 x 383
= 344.7
∴ the benefits (value) corresponding to the minimum value for top 10%
= 69.5 + x 10
= 72.925
= £ 72925
iii. The lower 40% corresponds to position
$= \frac{40}{100}(382 + 1)$
= 153.20
∴ retirement benefits corresponding to its position
= 39.5 + x 10
= 39.5 + 4.88
= 44.38
= £ 44380
5. The 10th – 90th percentile range
This is a measure of dispersion which uses percentile. A percentile is a value which separates one
division from the other when a given data is divided into 100 equal divisions.
This measure of dispersion is very important when calculating the co-efficient of skewness (see
later)
Example
Using the above data for retirees, calculate the 10th – 90th percentile range. The 10th
percentile lies on position
= 19.5 + 7.66
= 27.16
The 90th percentile lies on position
= 344.7
∴ the value corresponding to the 90th percentile
= 69.5 + x 10
= 69.5 + 3.425
= 72.925
∴ the required value of the 10th – 90th percentile = 72.925 – 27.16 = 45.765
Definition:
A relative measure of dispersion is a statistical value which may be used to compare variations in 2
or more samples.
Relative measures of dispersion are usually expressed as decimals or percentages and do not
have any other units.
Example
The average distance covered by vehicles in a motor rally may be given as 2000 km with a standard
deviation of 5 km.
In another competition, a set of vehicles covered an average of 3000 km with a standard deviation of 10 km.
NB: The 2 standard deviations given above are referred to as absolute measures of dispersion. These
are actual deviations of the measurements from their respective mean
However, these are not very useful when comparing dispersions among samples.
Therefore, the following measures of dispersion are usually employed in order to assess the degree
of dispersion.
i. Coefficient of mean deviation
Example (see information above)
First group of cars: mean = 2000 km
Standard deviation = 5 km
∴ C.O.V = $\frac{5}{2000} \times 100$
= 0.25%
Second group of cars: mean = 3000 km
Standard deviation = 10 km
∴ C.O.V = $\frac{10}{3000} \times 100$
= 0.33%
Conclusion
Since the coefficient of variation is greater in the 2nd group than in the 1st group, we may conclude
that the distances covered in the 1st group are much closer to the mean than in the 2nd group.
Example 2
In a given firm located in the UK, the average salary of the employees is £3500 with a standard
deviation of £150.
The same firm has a local branch in Kenya in which the average salary is Kshs 8500 with a
standard deviation of Kshs 800.
Determine the coefficient of variation in the 2 firms and briefly comment on the degree of dispersion
of the salaries in the 2 firms.
First firm in the UK
C.O.V = $\frac{150}{3500} \times 100$
= 4.29%
Second firm in Kenya
C.O.V = $\frac{800}{8500} \times 100$
= 9.41%
In conclusion, since 4.29% < 9.41%, the salaries offered by the firm in the UK are much closer to
their mean than those offered by the local branch in Kenya.
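The coefficient of variation used in both examples is the standard deviation expressed as a percentage of the mean. A minimal Python sketch follows; the function name is chosen here for illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

rally_1 = coefficient_of_variation(5, 2000)
rally_2 = coefficient_of_variation(10, 3000)
uk_firm = coefficient_of_variation(150, 3500)
kenya_branch = coefficient_of_variation(800, 8500)

print(round(rally_1, 2), round(rally_2, 2))        # 0.25 0.33
print(round(uk_firm, 2), round(kenya_branch, 2))   # 4.29 9.41
```

Because the units cancel, the UK and Kenyan salaries can be compared even though they are quoted in different currencies.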
COMBINED MEAN AND STANDARD DEVIATION
Sometimes we may need to combine 2 or more samples say A and B. It is therefore essential to
know the new mean and the new standard deviation of the combination of the samples.
Combined mean
Let m be the combined mean
Let x1 be the mean of first sample
Let x2 be the mean of the second sample
Let n1 be the size of the 1st sample
Let n2 be the size of the 2nd sample
Let s1 be the standard deviation of the 1st sample
Let s2 be the standard deviation of the 2nd sample
Example
A sample of 40 electric batteries gives a mean life span of 600 hrs with a standard deviation of 20
hours.
Another sample of 50 electric batteries gives a mean lifespan of 520 hours with a standard deviation
of 30 hours.
If these two samples were combined and used in a given project simultaneously, determine the
combined mean for the larger sample and hence determine the combined or pooled standard
deviation.
Size       x̄               s
40 (n1)    600 hrs (x1)    20 hrs (s1)
50 (n2)    520 hrs (x2)    30 hrs (s2)
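A hedged Python sketch for the battery example follows. The combined mean is the size-weighted average of the two sample means. For the combined standard deviation, two conventions are in common use and the sketch shows both as assumptions: a simple pooled form that weights the two variances by sample size, and a form that also accounts for the gap between each sample mean and the combined mean. Variable names are illustrative.

```python
import math

n1, x1, s1 = 40, 600, 20    # first sample: size, mean (hrs), std dev (hrs)
n2, x2, s2 = 50, 520, 30    # second sample

# Combined mean: weighted average of the sample means.
m = (n1 * x1 + n2 * x2) / (n1 + n2)

# Convention A (assumption): pool the variances, weighted by sample size.
pooled_sd = math.sqrt((n1 * s1**2 + n2 * s2**2) / (n1 + n2))

# Convention B (assumption): also include each mean's distance from m.
combined_sd = math.sqrt(
    (n1 * (s1**2 + (x1 - m) ** 2) + n2 * (s2**2 + (x2 - m) ** 2)) / (n1 + n2)
)

print(round(m, 2))             # 555.56
print(round(pooled_sd, 2))     # 26.03
print(round(combined_sd, 2))   # 47.52
```

Convention B gives the standard deviation of the merged data itself, which is larger here because the two sample means are 80 hours apart.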
CORRELATION
SCATTER GRAPHS
- A scatter graph is a graph comprising points which have been plotted but are not
joined by line segments
- The pattern of the points reveals the type of relationship existing between the
variables
- The following sketch graphs will greatly assist in the interpretation of scatter graphs.
[Figure: scatter graph for perfect positive correlation, with the independent variable on the horizontal axis]
NB: The above pattern is referred to as perfect because the points may easily be represented by
a single straight line, e.g. when measuring the relationship between volume of sales and profits in a
company, the more the company sells the higher the profits.
Perfect negative correlation
[Figure: scatter graph of quantity sold (y axis) against price (x axis, marked at 10 and 20); the points fall along a single downward-sloping line]
This example considers the volume of sales in relation to price: the cheaper the goods, the bigger the sales.
No correlation
[Figure: scatter graph with y from 0 to 600 and x from 10 to 50; the points are spread evenly over the whole plot and show no pattern]
h) Spurious Correlations
- In some rare situations, plotting the data for x and y may show either positive or negative correlation, yet when the two variables are examined in real life there is no convincing evidence that such a relationship exists. The relationship therefore exists only in the figures and is referred to as spurious or nonsense correlation, e.g. when high pass rates of students show a strong relationship with increased accidents.
Correlation coefficient
- These are numerical measures of the correlation existing between the dependent and the independent variables.
- These are better measures of correlation than scatter graphs (diagrams).
- Correlation coefficients range between +1 and -1. A correlation coefficient of +1 implies perfect positive correlation, a value of -1 implies perfect negative correlation, and a value of 0 implies no correlation at all.
- The following chart will be found useful in interpreting correlation coefficients
There are two types of correlation coefficient normally used, namely:
Product Moment Coefficient (r)
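It gives an indication of the strength of the linear relationship between two variables.
$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$
Note that this formula can be rearranged into different forms, but the result is always the same.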
Example
The following data was observed and it is required to establish if there exists a relationship between
the two.
X 15 24 25 30 35 40 45 65 70 75
Y 60 45 50 35 42 46 28 20 22 15
Solution
Compute the product moment coefficient of correlation (r)
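X        Y       X²        Y²        XY
15       60      225       3,600     900
24       45      576       2,025     1,080
25       50      625       2,500     1,250
30       35      900       1,225     1,050
35       42      1,225     1,764     1,470
40       46      1,600     2,116     1,840
45       28      2,025     784       1,260
65       20      4,225     400       1,300
70       22      4,900     484       1,540
75       15      5,625     225       1,125
Total    424     363       21,926    15,123    12,815   (ΣX, ΣY, ΣX², ΣY², ΣXY)
With n = 10:
$r = \frac{10(12815) - (424)(363)}{\sqrt{\left[10(21926) - 424^2\right]\left[10(15123) - 363^2\right]}} = \frac{-25762}{\sqrt{39484 \times 19461}} \approx -0.93$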
The correlation coefficient thus indicates a strong negative linear association between the two
variables.
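As a cross-check on the hand calculation, the product moment coefficient can be computed directly from the raw X and Y values. This is a minimal sketch using the standard sums-based form of the formula; the function name `pearson_r` is illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient from paired observations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Data from the example above.
x = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]

r = pearson_r(x, y)
print(f"r = {r:.2f}")
```

A value close to -1 agrees with the conclusion of a strong negative linear association.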
Interpretation of r – Problems in interpreting r values
NOTE:
A high value of r (+0.9 or -0.9) only shows a strong association between the two variables. It does not imply that there is a causal relationship, i.e. that a change in one variable causes a change in the other. It is possible to find two variables which produce a high calculated r yet have no causal relationship. This is known as spurious or nonsense correlation, e.g. high pass rates in QT in Kenya and increased inflation in Asian countries.
Also note that a low correlation coefficient doesn't imply a lack of relationship between the variables, only the lack of a linear relationship, i.e. there could exist a curvilinear relation.
A further problem in interpretation arises from the fact that the r value here measures the relationship between a single independent variable and the dependent variable, whereas a particular variable may depend on several independent variables (e.g. crop yield may depend on fertilizer used, soil exhaustion, soil acidity level, season of the year, type of seed etc.), in which case multiple correlation should be used instead.
The Rank Correlation Coefficient (R)
Also known as the Spearman rank correlation coefficient, its purpose is to establish whether there is any form of association between two variables where the variables are arranged in ranked form.
$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$
Where d = difference between the pairs of ranked values
n = number of pairs of rankings
Example
A group of 8 accountancy students are tested in Quantitative Techniques and Law II. Their rankings
in the two tests were.
Student   Q. T. ranking   Law II ranking   d     d²
A         2               3                -1    1
B         7               6                1     1
C         6               4                2     4
D         1               2                -1    1
E         4               5                -1    1
F         3               1                2     4
G         5               8                -3    9
H         8               7                1     1
                                           Σd² = 22
$R = 1 - \frac{6 \times 22}{8(8^2 - 1)} = 1 - \frac{132}{504} = 0.74$
Thus, we conclude that there is reasonable agreement between the students' performances in the two types of tests.
NOTE: If the actual marks were given in this example, we would calculate r instead. Like r, R varies between +1 and -1.
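The rank correlation for the eight students can be checked with a short sketch. It assumes the standard untied formula $R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$; the function name is illustrative.

```python
def spearman_from_ranks(ranks_a, ranks_b):
    """Rank correlation coefficient for two lists of untied rankings."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Rankings of students A to H in the two tests.
qt_ranks = [2, 7, 6, 1, 4, 3, 5, 8]
law_ranks = [3, 6, 4, 2, 5, 1, 8, 7]

rank_corr = spearman_from_ranks(qt_ranks, law_ranks)
print(f"R = {rank_corr:.2f}")
```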
Tied Rankings
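A slight adjustment to the formula is made if some students tie and have the same ranking. The adjustment is
$R = 1 - \frac{6\left(\sum d^2 + \frac{t^3 - t}{12}\right)}{n^3 - n}$
Where t = number of tied rankings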
Example
Assume that in our previous example student E & F achieved equal marks in Q. T. and were given
joint 3rd place.
Solution
Student   Q. T. ranking   Law II ranking   d      d²
A         2               3                -1     1
B         7               6                1      1
C         6               4                2      4
D         1               2                -1     1
E         3½              5                -1½    2¼
F         3½              1                2½     6¼
G         5               8                -3     9
H         8               7                1      1
                                           Σd² = 25½
$R = 1 - \frac{6\left(25.5 + \frac{2^3 - 2}{12}\right)}{8^3 - 8} = 1 - \frac{156}{504} = 0.69$
NOTE: It is conventional to show shared rankings as above, i.e. E and F take up the 3rd and 4th ranks, which are shared between the two as 3½ each.
ii. Coefficient of Determination
This refers to the ratio of the explained variation to the total variation and is used to measure the strength of the linear relationship. The stronger the linear relationship, the closer the ratio will be to one.
$\text{Coefficient of determination} = \frac{\text{Explained variation}}{\text{Total variation}}$
Example (Rank Correlation Coefficient)
In a beauty competition, 2 assessors were asked to rank the 10 contestants using their professional assessment skills. The results obtained are shown in the table below.
Contestant   1st assessor   2nd assessor
A            6              5
B            1              3
C            3              4
D            7              6
E            8              7
F            2              1
G            4              8
H            5              2
J            10             9
K            9              10
REQUIRED
Calculate the rank correlation coefficient and hence comment briefly on the value obtained.
Contestant   1st assessor   2nd assessor   d     d²
A            6              5              1     1
B            1              3              -2    4
C            3              4              -1    1
D            7              6              1     1
E            8              7              1     1
F            2              1              1     1
G            4              8              -4    16
H            5              2              3     9
J            10             9              1     1
K            9              10             -1    1
                                           Σd² = 36
∴ The rank correlation coefficient R
$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 36}{10(10^2 - 1)} = 1 - \frac{216}{990} = 1 - 0.22 = 0.78$
Comment: since the rank correlation coefficient is 0.78, which is well above 0.5, there is a high positive correlation between the ranks awarded to the contestants by the two assessors.
Example
Contestant   1st assessor   2nd assessor   d      d²
A            1              2              -1     1
B            5 (5.5)        3              2.5    6.25
C            3              4              -1     1
D            2              1              1      1
E            4              5              -1     1
F            5 (5.5)        6.5            -1     1
G            7              6.5            0.5    0.25
H            8              8              0      0
                                           Σd² = 11.5
Required: Compute the rank correlation coefficient
$\therefore R = 1 - \frac{6 \times 11.5}{8(8^2 - 1)} = 1 - \frac{69}{504} = 1 - 0.14 = 0.86$
This implies high positive correlation
Example (Rank Correlation Coefficient)
Sometimes numerical data on quantifiable variables is given, and a rank correlation coefficient has to be worked out from it.
In such a situation, the rank correlation coefficient is determined after the given values have been converted into ranks. See the following example:
Candidate   Maths   Rank       Accounts   Rank       d      d²
P           92      1          67         5          -4     16
Q           82      3          88         1          2      4
R           60      5 (5.5)    58         7 (7.5)    -2     4
S           87      2          80         2          0      0
T           72      4          69         4          0      0
U           60      5 (5.5)    77         3          2.5    6.25
V           52      8          58         7 (7.5)    0.5    0.25
W           50      9          60         6          3      9
X           47      10         32         10         0      0
Y           59      7          54         9          -2     4
                                                     Σd² = 43.5
$\therefore \text{Rank correlation } R = 1 - \frac{6 \times 43.5}{10(10^2 - 1)} = 1 - \frac{261}{990} = 1 - 0.26 = 0.74$
(High positive correlation between mathematics marks and accounts marks)
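Converting marks to ranks by hand is error-prone when marks tie, so the step can be sketched in code. The helper names below are illustrative. The sketch assumes rank 1 goes to the highest mark, tied marks share the mean of the ranks they occupy, and (as in the worked example) the untied rank formula is applied without the tie adjustment.

```python
def average_ranks(marks):
    """Rank marks from highest (rank 1) to lowest; ties share the mean rank."""
    order = sorted(range(len(marks)), key=lambda i: -marks[i])
    ranks = [0.0] * len(marks)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next mark equals the current one.
        while end + 1 < len(order) and marks[order[end + 1]] == marks[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the occupied positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_correlation(marks_a, marks_b):
    """Rank correlation from raw marks, using the untied formula."""
    ranks_a, ranks_b = average_ranks(marks_a), average_ranks(marks_b)
    n = len(marks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Marks for candidates P to Y.
maths = [92, 82, 60, 87, 72, 60, 52, 50, 47, 59]
accounts = [67, 88, 58, 80, 69, 77, 58, 60, 32, 54]

maths_ranks = average_ranks(maths)
accounts_ranks = average_ranks(accounts)
coefficient = rank_correlation(maths, accounts)
print(maths_ranks)
print(accounts_ranks)
print(f"R = {coefficient:.2f}")
```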
Example
(Product moment correlation)
The following data was obtained during a social survey conducted in a given urban area regarding the annual income of given families and the corresponding expenditures.
Family   Income (x)   Expenditure (y)   xy        x²        y²
C        520          510               265200    270400    260100
D        610          500               305000    372100    250000
E        400          360               144000    160000    129600
F        320          290               92800     102400    84100
G        280          250               70000     78400     62500
H        410          380               155800    168100    144400
J        380          240               91200     144400    57600
K        300          270               81000     90000     72900
Total    4020         3550              1504400   1706600   1342900
Required
Calculate the product moment correlation coefficient and briefly comment on the value obtained.
The product moment correlation
$r = \frac{\sum xy - n\bar{x}\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}}$
Workings:
$\bar{x} = \frac{4020}{10} = 402, \quad \bar{y} = \frac{3550}{10} = 355$
$r = \frac{1504400 - 10(402)(355)}{\sqrt{\left(1706600 - 10(402)^2\right)\left(1342900 - 10(355)^2\right)}} = \frac{77300}{\sqrt{90560 \times 82650}}$
= 0.89
Comment: The value obtained 0.89 suggests that the correlation between annual income and annual
expenditure is high and positive. This implies that the more one earns the more one spends.
4.2 REGRESSION
- This is a concept which refers to the changes that occur in the dependent variable as a result of changes in the independent variable.
- Knowledge of regression is particularly useful in business statistics, where it is necessary to consider the corresponding changes in dependent variables whenever independent variables change.
- Most business activities involve a dependent variable and one or more independent variables. Knowledge of regression therefore enables a business statistician to predict or estimate the expected value of a dependent variable when given an independent variable. Consider the above example of annual incomes and annual expenditures: using regression techniques, one can estimate the expenditure of a given family if the annual income is known, and vice versa.
- The general equation used in simple regression analysis is as follows
y = a + bx
Where y = Dependent variable
a = Intercept on the y axis (constant)
b = Slope of the line
x = Independent variable
i. The determination of the regression equation such as given above is normally done by using a technique known as "the method of least squares".
Regression equation of y on x i.e. y = a + bx
The following set of equations, known as the normal equations, is used to determine the equation of the above regression line when given a set of data.
$\sum y = an + b\sum x$
$\sum xy = a\sum x + b\sum x^2$
Where Σy = sum of the y values
Σxy = sum of the products of x and y
Σx = sum of the x values
Σx² = sum of the squares of the x values
n = number of pairs of observations
a = the intercept on the y axis
b = slope (gradient) of the regression line of y on x
NB: The above regression line is normally used in one way only, i.e. it is used to estimate the y values when the x values are given.
Regression line of x on y i.e. x = a + by
The fact that a regression line can only be used in one way leads to what is known as the regression paradox.
This means that regression lines are not ordinary mathematical line graphs which may be used to estimate both x and y.
Therefore, one has to be careful when using regression lines: to estimate x from y, it is necessary to develop a separate equation of x on y before doing the estimation.
The following example illustrates how regression lines are used.
Example
An investment company advertised the sale of pieces of land at different prices. The following table shows the pieces of land, their acreage and their costs.
Piece of land   Acreage in hectares (x)   Cost £000 (y)   xy      x²
A               2.3                       230             529     5.29
B               1.7                       150             255     2.89
C               4.2                       450             1890    17.64
D               3.3                       310             1023    10.89
E               5.2                       550             2860    27.04
F               6.0                       590             3540    36
G               7.3                       740             5402    53.29
H               8.4                       850             7140    70.56
J               5.6                       530             2968    31.36
Total           Σx = 44.0                 Σy = 4400       Σxy = 25607   Σx² = 254.96
Required
Determine the regression equations and use them to
i. Estimate, from y on x, the cost of a piece of land with 4.5 hectares
ii. Estimate the expected acreage if the piece of land costs £900,000
$\sum y = an + b\sum x$
$\sum xy = a\sum x + b\sum x^2$
By substituting the appropriate values in the above equations we have
4400 = 9a + 44b ……... (i)
25607 = 44a + 254.96b ……... (ii)
By multiplying equation (i) by 44 and equation (ii) by 9 we have
193600 = 396a + 1936b ……... (iii)
230463 = 396a + 2294.64b ……... (iv)
By subtracting equation (iii) from equation (iv) we have
36863 = 358.64b
102.78 = b
By substituting for b in (i)
4400 = 9a + 44(102.78)
4400 – 4522.32 = 9a
–122.32 = 9a
–13.59 = a
Therefore, the equation of the regression line of y on x is
y = –13.59 + 102.78x
When the acreage is 4.5 hectares, the cost is
y = –13.59 + (102.78 × 4.5)
= 448.92
= £448,920
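The elimination above can be checked by solving the normal equations in code. The sketch below uses the closed-form solution of the two normal equations; the function name `least_squares` is illustrative. Because a regression line is used in one direction only, part ii is sketched by fitting a separate line of x on y (acreage on cost). That second fit is not worked in the example above, so treat its result as an illustration rather than the notes' answer.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line ys = a + b * xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Land example: acreage in hectares, cost in thousands of pounds.
acreage = [2.3, 1.7, 4.2, 3.3, 5.2, 6.0, 7.3, 8.4, 5.6]
cost = [230, 150, 450, 310, 550, 590, 740, 850, 530]

# Part i: regression of y (cost) on x (acreage).
a_yx, b_yx = least_squares(acreage, cost)
cost_at_4_5 = a_yx + b_yx * 4.5
print(f"y on x: y = {a_yx:.2f} + {b_yx:.2f}x")
print(f"Estimated cost of 4.5 hectares: about {cost_at_4_5:.1f} thousand pounds")

# Part ii: regression of x (acreage) on y (cost), fitted separately.
a_xy, b_xy = least_squares(cost, acreage)
acreage_at_900 = a_xy + b_xy * 900
print(f"x on y: x = {a_xy:.4f} + {b_xy:.5f}y")
print(f"Estimated acreage at a cost of 900 thousand pounds: {acreage_at_900:.2f} hectares")
```

The unrounded coefficients differ slightly from the hand working (which truncates b to 102.78), but the estimated cost agrees to the nearest thousand pounds.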
Note that where the regression equation is given by
y = a + bx
where a is the intercept on the y axis,
b is the slope of the line or regression coefficient, and
n is the sample size,
then
$\text{Intercept } a = \frac{\sum y - b\sum x}{n}$
$\text{Slope } b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$
Example
The calculations for our sample size n = 10 are given below. The linear regression model is
y = a + bx
Table
Distance x (miles)   Time y (mins)   xy      x²       y²
3.5                  16              56.0    12.25    256
2.4                  13              31.2    5.76     169
4.9                  19              93.1    24.01    361
4.2                  18              75.6    17.64    324
3.0                  12              36.0    9.0      144
1.3                  11              14.3    1.69     121
1.0                  8               8.0     1.0      64
3.0                  14              42.0    9.0      196
1.5                  9               13.5    2.25     81
4.1                  16              65.6    16.81    256
Σx = 28.9            Σy = 136        Σxy = 435.3   Σx² = 99.41   Σy² = 1972
$\text{Slope } b = \frac{10(435.3) - (28.9)(136)}{10(99.41) - (28.9)^2} = \frac{422.6}{158.89} = 2.66$
The linear model for multiple linear regression (which is the line of best fit) is of the type:
$y = \alpha + b_1x_1 + b_2x_2 + \dots + b_nx_n$
We assume that errors or residuals are negligible.
In order to choose between models, we examine the values of the multiple correlation coefficient r and the standard deviation of the residuals σ.
A model which describes the relationship between y and the x's well has a multiple correlation coefficient r close to ±1 and a small value of σ.
Example
Odino Chemicals Limited is aware that its power costs are semi-variable and that over the last six months these costs have shown the following relationship with a standard measure of output.
b = 0.342
$a = \frac{1}{n}\left(\sum y - b\sum x\right)$
r = 0.96
This shows a strong correlation between power cost and output. The multiple correlation when both output and time are considered at the same time is 0.976.
We observe that there has been very little increase in r, which means that inclusion of the time variable does not improve the correlation significantly.
The t value for the time variable is only 0.60, which is insignificant compared with a t value of 2.64 for the output variable.
In fact, if we work out the correlation between output and time, there will be a high correlation. Hence there is no need to take both variables: inclusion of time does improve the correlation coefficient, but only by a very small amount.
If we use linear regression analysis to find the linear relationship between output and time, i.e.
Month Output
1 12
2 18
3 19
4 20
5 24
6 30
The values of b and a turn out to be 3.11 and 9.6, i.e. the relationship is of the form
Output = 9.6 + 3.11 × month
From this equation, the forecast for the 7th month is
Output = 9.6 + 3.11 × 7
= 9.6 + 21.77
= 31.37 units
Using the equation Power costs = 2.29 + 0.34 × output,
Power costs = 2.29 + 0.34 × 31.37
= 2.29 + 10.67
= 12.96, i.e. £12,960
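The two-stage forecast (first output from time, then power cost from output) can be sketched as follows. The output trend is fitted from the six monthly figures above; the power cost coefficients 2.29 and 0.34 are taken as given from the text, since the underlying cost data is not reproduced here. The function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line ys = a + b * xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


months = [1, 2, 3, 4, 5, 6]
output = [12, 18, 19, 20, 24, 30]

# Stage 1: trend of output against month, then extrapolate to month 7.
a, b = fit_line(months, output)
output_month_7 = a + b * 7

# Stage 2: power cost from the forecast output (coefficients from the text).
power_cost = 2.29 + 0.34 * output_month_7

print(f"Output = {a:.1f} + {b:.2f} x month")
print(f"Forecast output for month 7: {output_month_7:.2f} units")
print(f"Forecast power cost: {power_cost:.2f} (thousand pounds)")
```

Because the code keeps b unrounded, its forecast is a few hundredths higher than the hand calculation, which rounds b to 3.11 first.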
Non-Linear Relationships
If the scatter diagram and the correlation coefficient do not indicate a linear relationship, then the relationship may be non-linear.
Two such relationships are of particular interest.
Both of these can be reduced to a linear model. Simple or multiple linear regression methods are then used to determine the values of the coefficients.
i. Exponential model
SKEWNESS
This is a concept which is commonly used in statistical decision making. It refers to the degree to which a given frequency curve deviates from the normal distribution.
There are 2 types of skewness, namely
i. Positive skewness
ii. Negative skewness
1. Positive Skewness
This is the tendency of a given frequency curve to lean towards the left, so that in a positively skewed distribution the long tail extends to the right.
In this distribution one should note the following
i. The mean is usually bigger than the mode and median
ii. The median always occurs between the mode and mean
iii. There are more observations below the mean than above the mean
This frequency distribution, as represented in the skewed distribution curve, is characteristic of the age distributions in developing countries.
[Figure: three frequency curves: a positively skewed curve (mode, median, mean in that order, with the long tail to the right), a normal distribution, and a negatively skewed curve (mean, median, mode in that order, with the long tail to the left)]
2. Negative Skewness
This is an asymmetrical curve in which the long tail extends to the left.
NB: This frequency curve is characteristic of the age distribution in developed countries.
i. The mode is usually bigger than the mean and median
ii. The median usually occurs between the mean and mode
iii. The number of observations above the mean is usually greater than the number below the mean (see the shaded region)
MEASURES OF SKEWNESS
These are numerical values which assist in evaluating the degree of deviation of a frequency distribution from the normal distribution.
The following are the commonly used measures of skewness.
1. Coefficient of skewness
$= \frac{\text{mean} - \text{mode}}{\text{standard deviation}}$
2. Coefficient of skewness
$= \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$
NB: These 2 coefficients above are also known as Pearsonian measures of skewness.
3. Quartile coefficient of skewness
$= \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}$
Where Q1 = 1st quartile
Q2 = 2nd quartile
Q3 = 3rd quartile
NB: The Pearsonian coefficients of skewness usually range between -3 and +3, which are the extreme values. A coefficient close to -3 indicates that a given frequency distribution is negatively skewed and that the amount of skewness is quite high.
Similarly, a coefficient close to +3 indicates that the frequency distribution is positively skewed and that its deviation from the normal distribution is quite high.
Example
The following information was obtained from an NGO which was giving small loans to some small-scale business enterprises in 1996. The loans are in thousands of Kshs.
Loans      Units (f)   Midpoint (x)   x - A = d   d/c = u   fu      fu²    UCB     cf
46 – 50    32          48             -15         -3        -96     288    50.5    32
51 – 55    62          53             -10         -2        -124    248    55.5    94
56 – 60    97          58             -5          -1        -97     97     60.5    191
61 – 65    120         63 (A)         0           0         0       0      65.5    311
66 – 70    92          68             5           +1        92      92     70.5    403
71 – 75    83          73             10          +2        166     332    75.5    486
76 – 80    52          78             15          +3        156     468    80.5    538
81 – 85    40          83             20          +4        160     640    85.5    578
86 – 90    21          88             25          +5        105     525    90.5    599
91 – 95    11          93             30          +6        66      396    95.5    610
Total      610                                              428     3086
Required
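Using the Pearsonian measure of skewness, calculate the coefficient of skewness and hence comment briefly on the nature of the distribution of the loans.
$\text{Mean} = A + c \times \frac{\sum fu}{\sum f} = 63 + 5 \times \frac{428}{610} = 66.51$
$\text{Standard deviation} = c \times \sqrt{\frac{\sum fu^2}{\sum f} - \left(\frac{\sum fu}{\sum f}\right)^2} = 5 \times \sqrt{\frac{3086}{610} - \left(\frac{428}{610}\right)^2} = 10.68$
$\text{Position of the median} = \frac{610 + 1}{2} = 305.5$
$\text{Median} = 60.5 + \frac{305.5 - 191}{120} \times 5$
$= 60.5 + \frac{114.5}{120} \times 5$
Median = 65.27
Therefore, the Pearsonian coefficient
$= \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(66.51 - 65.27)}{10.68}$
= 0.348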
Comment
The coefficient of skewness obtained suggests that the frequency distribution of the loans given was positively skewed, because the coefficient itself is positive. The skewness is not very high, however, which implies that the deviation of the frequency distribution from the normal distribution is small.
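The whole skewness calculation can be reproduced from the class limits and frequencies alone. This is a sketch under stated assumptions: classes are 5 units wide with boundaries half a unit below each lower limit, the median position is taken as (n + 1)/2 as in the working above, and the variable names are illustrative.

```python
from math import sqrt

# Lower class limits (46-50, 51-55, ...) and their frequencies.
lower_limits = [46, 51, 56, 61, 66, 71, 76, 81, 86, 91]
freqs = [32, 62, 97, 120, 92, 83, 52, 40, 21, 11]
width = 5

midpoints = [lo + 2 for lo in lower_limits]  # 48, 53, ..., 93
n = sum(freqs)

# Mean and standard deviation of the grouped data.
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / n
std_dev = sqrt(variance)

# Median by interpolation inside the class containing position (n + 1) / 2.
target = (n + 1) / 2
cumulative = 0
median = None
for lo, f in zip(lower_limits, freqs):
    if cumulative + f >= target:
        lower_boundary = lo - 0.5
        median = lower_boundary + (target - cumulative) / f * width
        break
    cumulative += f

# Pearsonian coefficient based on the median.
skewness = 3 * (mean - median) / std_dev

print(f"Mean = {mean:.2f}, standard deviation = {std_dev:.2f}")
print(f"Median = {median:.2f}")
print(f"Coefficient of skewness = {skewness:.3f}")
```

The unrounded result differs from the hand-calculated coefficient only in the third decimal place, because the hand working rounds the mean, median and standard deviation first.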
INDEX NUMBERS
An index number measures how much a variable changes over time. An index number is an attempt to summarize a whole mass of data into one figure. The single figure shows how one year differs from another year.
It is a statistical device used to measure the change in the level of prices, wages, output and other variables at given times, relative to their level at an earlier time which is taken as the base for comparison purposes.
We calculate the index number by finding the ratio of the current value to the value in a base year. The following are the classifications of index numbers:
(i) Price Indices: This type of index is the most frequently used. Price indices consider the prices of a commodity or a group of commodities and compare changes in prices from one period to another, and also compare differences in price from one place to another. For example, the familiar Consumer Price Index, which measures overall price changes of consumer commodities and services, is used to define the cost of living.
(ii) Quantity Indices: The major focus of consideration and comparison in these indices
are the quantities either of a single commodity or a group of commodities. For
example, the focus may be to understand the changes in the quantity of paddy
production in India over different time periods. For this purpose, a single
commodity’s quantity index will have to be constructed. Alternatively, the focus
may be to understand the changes in food grain production in India, in this case all
commodities which are categorized under food grains will be considered while
constructing the quantity index.
(iii) Value Indices: Value indices measure the combined effects of price and quantity changes. In many situations, either a price index or a quantity index alone may not be enough for the purpose of a comparison. For example, an index may be needed to compare the cost of living for a specific group of persons in a city or a region. Here, comparison of the expenditure of a typical family of the group is more relevant. Since this involves comparing expenditure, it is the value index which will have to be constructed. These indices are useful in production decisions, because they avoid the effects of inflation.
The following are the popular index number formulas:
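$\text{Simple aggregate price index} = \frac{\sum p_n}{\sum p_0} \times 100$
$\text{Simple aggregate quantity index} = \frac{\sum Q_n}{\sum Q_0} \times 100$
Where $p_n$ is the price of a commodity in the current year (the year for which the price index is to be calculated) and $p_0$ is the price of the same commodity in the base year (the year used for comparison purposes). $Q_n$ and $Q_0$ are the corresponding quantities.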
For comparison purposes if two series have different base years, it is difficult to compare them
directly. In such cases, it is necessary to change the base year of one of the series (or both) so that
both have the same base. It is also necessary to keep the index relevant to current conditions hence
the need to change the base from time to time.
Example;
Year          1985   1986   1987   1988   1989   1990   1991   1992
Price index   100    104    108    109    112    120    125    140
Changing the base year from 1985 to 1989:
Year                   Index (1985 = 100)   Index (1989 = 100)
1985                   100                  100/112 × 100 = 89.3
1986                   104                  104/112 × 100 = 92.9
1987                   108                  108/112 × 100 = 96.4
1988                   109                  109/112 × 100 = 97.3
1989 (new base year)   112                  112/112 × 100 = 100
1990                   120                  120/112 × 100 = 107.1
1991                   125                  125/112 × 100 = 111.6
1992                   140                  140/112 × 100 = 125.0
When changing the base year, it is advisable to update the weights used in the base year.
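Changing the base year amounts to dividing every index value by the value in the new base year and multiplying by 100. A minimal sketch, with an illustrative function name `rebase`:

```python
def rebase(series, new_base_year):
    """Re-express an index series so that new_base_year equals 100."""
    base_value = series[new_base_year]
    return {year: value / base_value * 100 for year, value in series.items()}


# Price index with 1985 = 100, from the example above.
index_1985_base = {
    1985: 100, 1986: 104, 1987: 108, 1988: 109,
    1989: 112, 1990: 120, 1991: 125, 1992: 140,
}

index_1989_base = rebase(index_1985_base, 1989)
for year, value in index_1989_base.items():
    print(year, round(value, 1))
```

Rebasing changes the level of every figure but not the ratio between any two years, so comparisons between years are unaffected.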
Chain Based Index Numbers
A chain-based index is one where the index is calculated every year using the previous year as the
base year. This type of index measures rate of change from year to year.
This method is suitable where weights are changing rapidly and items are constantly being brought
into the index and unwanted items taken out. It can be a price or quantity index
Year   Price index   Chain index (previous year = 100)   Fixed base index (1985 = 100)
1986   104           104/100 × 100 = 104.0               104/100 × 100 = 104
1987   108           108/104 × 100 = 103.8               108/100 × 100 = 108
1988   109           109/108 × 100 = 100.9               109/100 × 100 = 109
1989   112           112/109 × 100 = 102.8               112/100 × 100 = 112
1990   120           120/112 × 100 = 107.1               120/100 × 100 = 120
1991   125           125/120 × 100 = 104.2               125/100 × 100 = 125
1992   140           140/125 × 100 = 112.0               140/100 × 100 = 140
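The chain-based figures can be generated by dividing each year's index by the previous year's. A minimal sketch using the same price index series; the variable names are illustrative.

```python
years = [1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992]
price_index = [100, 104, 108, 109, 112, 120, 125, 140]

# Chain index: each year expressed as a percentage of the previous year.
chain_index = {
    years[i]: price_index[i] / price_index[i - 1] * 100
    for i in range(1, len(years))
}

for year, value in chain_index.items():
    print(year, round(value, 1))
```

Each chain index value shows the percentage change on the previous year directly: 103.8 for 1987 means prices rose by 3.8% between 1986 and 1987.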
AGGREGATE PRICE INDEX NUMBERS AND QUANTITY INDEX NUMBERS
Determine:
Example
Let us observe the following data of 1995 and 2000, and also required computation for
construction of
Solution:
This shows that prices for the group (sample commodities) have increased by 18.94% in 2000 as
compared to those prevailing in 1995. The quantity index according to Laspeyre’s formula is
computed as shown below:
This shows a 19.78% increase in aggregate quantity consumption for this group in 2000 as compared
to 1995.
Thus, according to the Paasche’s Index the price index reveals an increase of 20.18% in prices in
2000 as against 1995.
It shows a 21.03% increase in quantity consumption for this group in 2000 as compared to 1995.
1. Obtain an understanding of the underlying forces and structure that produced the observed
data.
2. Fit a model and proceed to forecasting, monitoring or even feedback and feedforward
control.
Time Series Analysis is used for many applications such as:
1. Economic Forecasting
2. Sales Forecasting
3. Budgetary Analysis
4. Stock Market Analysis
5. Yield Projections
6. Process and Quality Control
7. Inventory Studies
8. Workload Projections
9. Utility Studies
10. Census Analysis
Moving Average and Smoothing Techniques
Inherent in the collection of data taken over time is some form of random variation. For this reason, it is necessary to 'smooth' data collected over time. Smoothing removes random variation and reveals trends and cyclic components. There are two main groups of smoothing methods:
1. Averaging Methods
2. Exponential Smoothing Methods
1. Moving Average
Periodical data e.g. monthly sales may have random fluctuation every month despite a general
trend being evident. Moving average helps in smoothing away these random changes.
A moving average is the forecast for a period that takes the average of the previous periods.
Example:
The table below shows a company's monthly sales. Calculate the 3 and 6 monthly moving averages for the data.
Months      Sales
January     1200
February    1280
March       1310
April       1270
May         1190
June        1290
July        1410
August      1360
September   1430
October     1280
November    1410
December    1390
Solution.
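These are calculated as follows
$\text{April's forecast} = \frac{\text{Jan} + \text{Feb} + \text{March}}{3} = \frac{1200 + 1280 + 1310}{3} = 1263$
$\text{May's forecast} = \frac{\text{Feb} + \text{Mar} + \text{April}}{3} = \frac{1280 + 1310 + 1270}{3} = 1287$
And so on…
Similarly, for the 6 monthly moving average
$\text{July's forecast} = \frac{1200 + 1280 + 1310 + 1270 + 1190 + 1290}{6} = 1257$
And so on…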
Month       3 months moving average   6 months moving average
April       1263
May         1287
June        1257
July        1250                      1257
August      1297                      1292
September   1353                      1305
October     1400                      1325
November    1357                      1327
December    1373                      1363
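The table above can be regenerated with a short function. In this sketch each forecast is the average of the preceding `window` months, matching the way April's forecast uses January to March; the function name is illustrative.

```python
def moving_average_forecasts(values, window):
    """Forecast each period as the mean of the preceding `window` values."""
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values))
    ]


# Monthly sales, January to December.
sales = [1200, 1280, 1310, 1270, 1190, 1290,
         1410, 1360, 1430, 1280, 1410, 1390]

three_month = moving_average_forecasts(sales, 3)  # forecasts for April..December
six_month = moving_average_forecasts(sales, 6)    # forecasts for July..December

print([round(v) for v in three_month])
print([round(v) for v in six_month])
```

The 6 month series moves less from month to month than the 3 month series, which illustrates the greater smoothing effect of a longer averaging period.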
Note:
When plotting moving averages on graphs, the points are plotted at the midpoint of the period of the average, e.g. in our example the forecast for April (1263) is plotted at mid-February.
1) The greater the number of periods in the moving average, the greater the smoothing effect.
2) Different moving averages produce different forecasts.
3) The more random the data (with the underlying trend being constant), the more periods should be included in the moving average.
Limitations of moving averages.
1) All periods are weighted equally, disregarding the fact that more recent data is more relevant.
2) A moving average ignores data outside the period of the average, thus it doesn't fully utilise available data.
3) Where there is an underlying seasonal variation, forecasting with an unadjusted moving average can be misleading.
Exponential smoothing
Whereas in single moving averages the past observations are weighted equally, exponential smoothing assigns exponentially decreasing weights as the observations get older. In other words, recent observations are given relatively more weight in forecasting than older observations.
This method involves automatic weighting of past data with weights that decrease exponentially with time.
Example: Consider the following set of data consisting of 12 observations taken over time. The smoothing constant α is 0.1.
Time   Yt    Smoothed value (St)
1      71
2      70    70.9
3      69    70.71
4      68    70.439
5      64    69.7951
6      65    69.31559
7      72    69.58403
8      78    70.42563
9      75    70.88307
10     75    71.29476
11     75    71.66528
12     70    71.49875
Note:
The value α lies between 0 and 1.
The higher the α value, the more the forecast is sensitive to the current status.
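The smoothed values in the table are consistent with the recursion $S_t = \alpha Y_t + (1 - \alpha) S_{t-1}$ started from $S_1 = Y_1$. The notes do not state the recursion explicitly, so the sketch below treats it as an assumption inferred from the tabulated figures; the function name is illustrative.

```python
def exponential_smoothing(observations, alpha):
    """Return smoothed values, starting the recursion at the first observation."""
    smoothed = [observations[0]]
    for y in observations[1:]:
        # New value = alpha * latest observation + (1 - alpha) * previous smoothed value.
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


data = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
smoothed = exponential_smoothing(data, alpha=0.1)

for t, (y, s) in enumerate(zip(data, smoothed), start=1):
    print(t, y, round(s, 5))
```

With α = 0.1 the smoothed series reacts slowly: the jump from 65 to 78 in periods 6 to 8 moves the smoothed value by only about one unit.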
Characteristics of exponential smoothing
More weight is given to the most recent data.
All past data are incorporated unlike in moving averages.
Less data needs to be stored than for periodic moving averages.
d. Random residual variation (R) – These are non-recurring random variations, e.g. war, fire, coup etc.
For accurate forecasts these components (i.e. T, C, S and R) are quantified separately from the data. This is known as time series decomposition or time series analysis.
Additive Model
Time series value = T + S + C + R
This model is best suited where the component factors are independent, e.g. where the seasonal variation is unaffected by trend.
Multiplicative Model:
Time series value = T × S × C × R
This model is best applied where the components interact, e.g. where high trends increase seasonal variations. The multiplicative model is more commonly used in practice.
Of the four elements of time series the most important are trend and seasonal variation. The
following illustration shows how the trend (T) and seasonal variation (S) are separated out from a
time series and how the calculated T and S values are used to prepare forecast. The process of
separating out the trend and seasonal variation is known as deseasonalising the data.
There are two approaches to this process: one is based on regression through the actual data points
and the other calculates the regression line through moving average trend points. The method using
the actual data is demonstrated first followed by the moving average method.
1. Time series analysis: trend and seasonal variation using regression on the data
The following data will be used to illustrate how the trend and seasonal variation are calculated.
Example 1
Step 4
These then are the average variations expected from the trend for each of the quarters; for example, on average the first quarter of each year will be 56% of the value of the trend. Because the variations have been averaged, the amounts over 100% (Q3 in this example) should equal the amounts below 100% (Q1, Q2 and Q4 in this example). This can be checked by adding the averages and verifying that they total 400%, thus:
56% + 90% + 170% + 84% = 400%
On occasions, roundings in the calculations will make slight adjustments necessary to the average variations.
Step 5
Prepare final forecasts based on the trend line estimates from “trend estimates and percentages
variation table” (i.e. 30.58, 32.42, etc) and the averaged seasonal variations from the table above.
(i.e. 56%, 90%, 170% and 84%)
          X (quarters)   Y (sales)   Seasonally adjusted forecast
Year 1    2              32          29.18
          3              62          58.24
          4              29          30.32
Year 2    5              21          21.24
          6              42          35.80
          7              75          70.75
Year 3    9              23          25.37
          10             39          42.43
          11             77          83.27
Year 4    13             27          29.49
          14             39          49.05
Once the formulae above have been calculated, they can be used to forecast (extrapolate) future
sales. If it is required to estimate the sales for the next year (i.e. Quarters 17, 18, 19 and 20 in our
series) this is done as follows:
Quarter 17: trend estimate = 60.02, so forecast = 60.02 × 56% = 33.61
Quarter 19 = 108.29
Quarter 20 = 55.05
Notes:
a) Time series decomposition is not an adaptive forecasting system like moving averages
and exponential smoothing.
b) Forecasts produced by such an analysis should always be treated with caution.
Changing conditions and changing seasonal factors make long term forecasting a
difficult task.
c) The above illustration has been an example of a multiplicative model. That is, the
seasonal variations were expressed in percentage or proportionate terms. Similar steps
would have been necessary if the additive model had been used except that the
variations from the trend would have been the absolute values. For example, the first
And so on.
The absolute variations would have been averaged in the normal way to find the
average absolute variation, whether + or -, and these values would have been used to
make the final seasonally adjusted forecasts.
When the correlation coefficient is low the method of calculating the regression line
through the actual data points should not be used. This is because the regression line is
too sensitive to changes in the data values.
In such circumstances, calculating a regression line through the moving average trend
points is more robust and stable.
Example 1 is reworked below using this method and, because there are many similarities
to the earlier method, only the key stages are shown.
(20 + 32 + 62) / 3 = 38, which is entered opposite period 2
(32 + 62 + 29) / 3 = 41, and so on
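The two averages above can be reproduced with a short sketch. It assumes a 3-period simple moving average (inferred from the worked figures) and uses only the first four sales values shown in the text; the function name is illustrative.

```python
def moving_average(values, window=3):
    """Return the simple moving averages of `values` for the given window."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# First four quarterly sales figures used in the worked averages above.
sales = [20, 32, 62, 29]
averages = moving_average(sales)
print(averages)  # [38.0, 41.0]
```

Each average is entered opposite the middle period of its window, so 38 belongs to period 2 and 41 to period 3.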
The regression line y = a + bx of the moving average values is calculated in the normal manner and
results in the following:
y = 33.06 + 1.32x
The percentage variations are averaged as previously shown, resulting in the following values:
Q1 Q2 Q3 Q4
Average seasonal variation % 54 89 170 86
The trend line and the average seasonal variations are then used in a similar manner to that
previously described.
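For example, future sales for the next year (i.e. quarters 17, 18, 19 and 20) are extrapolated as
follows:
Quarter 17 = (33.06 + 1.32 × 17) × 0.54 = 29.97
Quarter 18 = (33.06 + 1.32 × 18) × 0.89 = 50.57
Quarter 19 = (33.06 + 1.32 × 19) × 1.70 = 98.84
Quarter 20 = (33.06 + 1.32 × 20) × 0.86 = 51.13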
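As a rough check, the extrapolation can be sketched in a few lines. The trend coefficients and seasonal percentages are the ones quoted above; the function name and the assumption that quarter 17 is a first quarter (so the factors repeat every four quarters) are illustrative.

```python
def seasonal_forecast(quarter, a=33.06, b=1.32,
                      seasonal=(0.54, 0.89, 1.70, 0.86)):
    """Multiplicative model: trend value times the seasonal factor."""
    trend = a + b * quarter
    # Quarters 1, 5, 9, ... use the Q1 factor, and so on.
    factor = seasonal[(quarter - 1) % 4]
    return trend * factor

forecasts = {q: round(seasonal_forecast(q), 2) for q in range(17, 21)}
for quarter, value in forecasts.items():
    print(quarter, value)
```

The result for quarter 20 may differ from the printed figure in the last decimal place, depending on whether the product is rounded or truncated.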
Forecast errors
Differences between actual results and predictions may arise from many reasons. They may arise
from random influences, normal sampling errors, choice of the wrong forecasting system or alpha
value or simply that the future conditions turn out to be radically different from the past. Whatever
the cause(s) management wish to know the extent of the forecast errors and various methods exist
to calculate these errors.
A commonly used technique, appropriate to time series, is to calculate the mean squared error of
the deviations between forecast and actual values and then choose the forecasting system and/or
parameters which give the lowest mean squared error, i.e. akin to the ‘least squares’
method of establishing a regression line.
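A minimal sketch of the mean squared error calculation follows. For illustration it uses the actual sales and seasonally adjusted forecasts for quarters 2, 3 and 4 from the earlier example; the function name is an assumption.

```python
def mean_squared_error(actual, forecast):
    """Average of the squared differences between actual and forecast."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

actual = [32, 62, 29]             # sales for quarters 2, 3 and 4
forecast = [29.18, 58.24, 30.32]  # seasonally adjusted forecasts
mse = mean_squared_error(actual, forecast)
print(round(mse, 2))
```

The same function could be applied to the forecasts from two competing systems (or two alpha values), the one with the lower result being preferred.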
Moving averages, exponential smoothing and decomposition methods tend to be used for short to
medium term forecasting. Longer term forecasting is usually less detailed and is normally
concerned with forecasting the main trends on a year to year basis. Any of the techniques of
regression analysis described in the preceding chapters could be used depending on the
assumptions about linearity or non-linearity, the number of independent variables and so on. The
least squares regression approach is often used for trend forecasting.
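Data have been kept of sales over the last seven years:
Year                     1    2    3    4    5    6    7
Sales (in ‘000 units)    14   17   15   23   18   22   27
It is required to forecast the sales for the 8th year.
Solution
With x = year and y = sales: n = 7, Σx = 28, Σy = 136, Σxy = 596 and Σx² = 140.
The normal equations are
136 = 7a + 28b
596 = 28a + 140b
Solving these gives
b = 1.86
a = 12
so the trend line is y = 12 + 1.86x and the forecast for year 8 is approximately 26.9 (‘000 units).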
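The least squares calculation can be checked with a short sketch that solves the two normal equations directly. The function and variable names are illustrative.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + bx fitted by least squares."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Solution of: sum_y = n*a + b*sum_x  and  sum_xy = a*sum_x + b*sum_x2
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

years = [1, 2, 3, 4, 5, 6, 7]
sales = [14, 17, 15, 23, 18, 22, 27]  # in '000 units
a, b = least_squares_line(years, sales)
forecast_year_8 = a + b * 8
print(round(a, 2), round(b, 2), round(forecast_year_8, 2))
```

Using the unrounded slope gives a year 8 forecast slightly below the one obtained with b rounded to 1.86.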
Random sampling refers to the sampling technique in which each and every item of the population is given an
equal chance of being included in the sample. Since selection of items in the sample depends entirely
on chance, this method is also called chance selection or representative sampling.
It is assumed that if the sample is chosen at random and if the size of the sample is sufficiently large,
it will represent all groups in the population.
Random sampling is of 2 types: sampling with replacement and sampling without replacement.
Sampling is said to be with replacement when, from a finite population, a sampling unit is drawn,
observed and then returned to the population before another unit is drawn. The population in this
case remains the same and a sampling unit might be selected more than once.
If, on the other hand, a sampling unit is chosen and not returned to the population after it has been
observed, the sampling is said to be without replacement.
Random samples may be selected with the help of the lottery method or a table of random numbers (such as
Tippett's table of random numbers, Fisher and Yates numbers or Kendall and Babington Smith
numbers).
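The difference between the two types can be illustrated with Python's standard `random` module, which stands in here for the lottery method or a random number table. The population of 20 numbered units and the sample size of 5 are hypothetical.

```python
import random

rng = random.Random(42)           # fixed seed so the run is repeatable
population = list(range(1, 21))   # 20 numbered units (hypothetical)

# With replacement: each unit is returned before the next draw,
# so the same unit may appear more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: a unit, once drawn, cannot be drawn again.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```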
Stratified sampling
In this case the population is divided into groups in such a way that units within each group are as
similar as possible, in a process called stratification. The groups are called strata. Simple random
samples from each of the strata are collected and combined into a single sample. This technique of
collecting a sample from a population is called stratified sampling. Stratification may be by age,
occupation, income group etc.
Systematic Sampling
This is a form of random sampling in which the population units are first arranged in ascending or
descending order. In systematic sampling a sample is then drawn according to some predetermined
rule. Suppose a population consists of 1000 units; then every 10th, 20th or 50th item is selected.
This method is very easy and economical. It also saves a lot of time.
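A minimal sketch of systematic selection from a population of 1000 units, taking every 10th item. The random starting point within the first interval is an assumption (the text only specifies the fixed interval), and the function name is illustrative.

```python
import random

def systematic_sample(units, interval, rng=random):
    """Pick a random start in the first interval, then every interval-th unit."""
    start = rng.randrange(interval)
    return units[start::interval]

population = list(range(1, 1001))  # 1000 ordered units
sample = systematic_sample(population, 10, random.Random(1))
print(len(sample), sample[:5])
```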
Multistage sampling
This is similar to stratified sampling except that the division is done on a geographical/location basis, e.g. a
country can be divided into provinces and a survey then carried out in 4 towns in each province. This
helps to cut travelling costs for a surveyor.
Cluster Sampling
This is where a few geographical regions, e.g. a location, town or village, are selected at random and,
say, every single household or shop in that area is interviewed. This again cuts costs.
Judgment Sampling
Here the interviewer selects whom to interview, believing that their view is more fundamental since
they might be directly affected, e.g. to find out the effects of public transport one may choose to interview
only people who don’t own cars and travel frequently to work.
Types of distribution
Population distribution
It refers to the distribution of the individual values of a population. Its mean is denoted by ‘µ’.
Sample distribution
It is the distribution of the individual values of a single sample. Its mean is generally written as ‘x̄’.
It is not usually the same as µ.
The series of sample means x̄1, x̄2, x̄3, ... is normally distributed or nearly so (according to the
central limit theorem). It can be described by its mean and its standard deviation. This standard
deviation is known as the standard error.
Standard error of the mean: Sx̄ = s / √n
Note: this formula is satisfactory for larger samples and a large population, i.e. n > 30 and n < 5% of
N.
- The word ‘error’ is used in place of ‘deviation’ to emphasize that variation among sample means is
due to sampling errors.
- The smaller the standard error the greater the precision of the sample value.
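The idea of the standard error can be illustrated by simulation: draw many samples, compute each sample mean, and compare the spread of those means with σ / √n. Everything in the sketch below (the uniform population, the sample size of 36, the number of repetitions) is a hypothetical setup, not data from the text.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical, deliberately non-normal population of 10,000 values.
population = [rng.uniform(0, 100) for _ in range(10_000)]
sigma = statistics.pstdev(population)

n = 36  # sample size (n > 30)
sample_means = [
    statistics.fmean(rng.choices(population, k=n))  # drawn with replacement
    for _ in range(2_000)
]

observed_se = statistics.stdev(sample_means)  # spread of the sample means
theoretical_se = sigma / n ** 0.5             # sigma / sqrt(n)
print(round(observed_se, 2), round(theoretical_se, 2))
```

The two printed figures should be close, which is the sense in which the standard deviation of the sample means "is" the standard error.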
6.3 STATISTICAL INFERENCE
It is the process of drawing conclusions about attributes of a population based upon information
contained in a sample (taken from the population).
It is divided into estimation of parameters and testing of hypotheses. Symbols for sample statistics and
population parameters are as follows.
                      Sample      Population
                      statistic   parameter
Arithmetic mean       x̄           µ
Standard deviation    s           σ
Number of items       n           N
Statistical estimation
(i) Unbiased: where the expected value of the statistic is equal to the population
parameter e.g. if the expected mean of a sample is equal to the population mean
(ii) Consistency: where an estimator yields values more closely approaching the
population parameter as the sample increases
(iii) Efficiency: where the estimator has smaller variance on repeated sampling.
(iv) Sufficiency: where an estimator uses all the information available in the data
concerning a parameter
Confidence Interval
The interval estimate or a ‘confidence interval’ consists of a range (an upper confidence limit and
lower confidence limit) within which we are confident that a population parameter lies and we assign
a probability that this interval contains the true population value
The confidence limits are the outer limits of a confidence interval. The confidence interval is the interval
between the confidence limits. The higher the confidence level, the wider the confidence interval.
For example
A normal distribution has the following characteristics
i. Mean ± 1.960 σ includes 95% of the population
ii. Mean ± 2.575 σ includes 99% of the population
1. Large Samples
These are samples with a sample size greater than 30 (i.e. n > 30).
(a) Estimation of population mean
Here we assume that if we take a large sample from a population then the mean of the population is
very close to the mean of the sample.
Steps to follow to estimate the population mean include:
i. Take a random sample of n items where (n > 30)
Note that the sample size is already n > 30 whereas s and x̄ are given; thus steps i), ii) and iv) are provided.
Here: x̄ = 6200 kgs
Sx̄ = s / √n = 200 / √64 = 25
Population mean µ = x̄ ± (z × Sx̄)
where x̄ = Sample mean
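A sketch of the resulting interval estimate follows, using x̄ = 6200 kg, s = 200 and n = 64 from the example. The confidence level for this example is not stated in the surviving text, so both the 95% (z = 1.96) and 99% (z = 2.575) multipliers quoted earlier are shown as assumptions; the function name is illustrative.

```python
from math import sqrt

def confidence_interval(mean, sd, n, z):
    """Large-sample interval: mean +/- z * (sd / sqrt(n))."""
    standard_error = sd / sqrt(n)
    margin = z * standard_error
    return mean - margin, mean + margin

low95, high95 = confidence_interval(6200, 200, 64, 1.96)   # assumed 95% level
low99, high99 = confidence_interval(6200, 200, 64, 2.575)  # assumed 99% level
print(low95, high95)
print(low99, high99)
```

As expected, the 99% interval is wider than the 95% interval.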
S = 9; v = n – 1 = 12 – 1 = 11;
µ = x̄ ± t × S/√n
At 95% confidence level
µ = 50 ± 2.201 × 9/√12
= 50 ± 5.72 grams
Therefore we can state with 95% confidence that the population mean is between 44.28 and 55.72
grams
At 99% confidence level
µ = 50 ± 3.106 × 9/√12
= 50 ± 8.07 grams
Therefore we can state with 99% confidence that the population mean is between 41.93 and 58.07
grams
Note: To use the t distribution tables it is important to find the degrees of freedom (v = n – 1). In the
example above v = 12 – 1 = 11
From the tables we find that at 95% confidence level against 11 and under 0.05, the value of t =
2.201
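The small-sample intervals can be verified with a short sketch. The t values for 11 degrees of freedom are hard-coded from standard t tables (2.201 at 95%, 3.106 at 99%) rather than computed; the names are illustrative.

```python
from math import sqrt

# Two-tailed t values for 11 degrees of freedom, taken from t tables.
T_TABLE_11_DF = {0.95: 2.201, 0.99: 3.106}

def t_margin(sd, n, t_value):
    """Half-width of the interval: t * (sd / sqrt(n))."""
    return t_value * sd / sqrt(n)

mean, sd, n = 50, 9, 12
margin95 = t_margin(sd, n, T_TABLE_11_DF[0.95])
margin99 = t_margin(sd, n, T_TABLE_11_DF[0.99])
print(round(mean - margin95, 2), round(mean + margin95, 2))
print(round(mean - margin99, 2), round(mean + margin99, 2))
```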
6.4 HYPOTHESIS TESTING
Definition
- A hypothesis is a claim or an opinion about an item or issue. Therefore it has to be tested
statistically in order to establish whether it is correct or not.
- Whenever testing a hypothesis, one must fully understand the 2 basic hypotheses to be tested,
namely
i. The null hypothesis (H0)
ii. The alternative hypothesis (H1)
A level of significance is a probability value which is used when conducting tests of hypothesis. It
is the probability of making an incorrect decision by rejecting the null hypothesis when it is in
fact true. Usually the probabilities used are very small, e.g. 1% or 5%.
[Figure: normal curve for a left-tail test. The critical region of 5% = 0.05 lies to the left of the critical value = -1.65; the area between the critical value and the mean (0) is 0.45.]
NB: If the standardized value of the mean is less than –1.65 we reject the null hypothesis (H 0) and
accept the alternative Hypothesis (H1) but if the standardized value of the mean is more than –1.65
we accept the null hypothesis and reject the alternative hypothesis
The above sketch graph and level of significance are applicable when the sample mean is less
than the population mean.
The following is used when sample mean > population mean
[Figure: normal curve for a right-tail test. The acceptance region lies to the left of the critical value +1.65 and the critical region of 5% = 0.05 lies to the right.]
NB: If the sample mean standardized value < 1.65, we accept the null hypothesis but reject the
alternative. If the sample mean value > 1.65 we reject the null hypothesis and accept the alternative
hypothesis
The above sketch is normally used when the sample mean given is greater than the population mean
[Figure: normal curve for a two-tailed test with tolerance limits at 15cm and 17 ½ cm. Reject the null hypothesis (accept the alternative hypothesis) in either tail beyond these limits.]
NB: The null hypothesis is rejected (and the alternative hypothesis accepted) if the standardized
value of the sample mean lies beyond the tolerance limits (15cm and 17 ½ cm).
ONE TAILED TEST
This is a test where the alternative hypothesis (H1) is only concerned with one of the tails of the
distribution, e.g. when testing a business complaint that the items supplied are shorter than
required.
E.g. a manufacturer of a given brand of bread may state that the average weight of the bread is 500
gms but if a consumer takes a sample and weighs each of the pieces of bread and happens to have a
mean of 450 gms he will definitely complain about the bread which is underweight. The statistical
analysis to be done will concentrate on the left tail of the normal distribution in which one will have
to establish whether 450 gms being less than 500g is statistically significant. Such a test therefore is
referred to as one tailed test.
On the other hand the test may concentrate on the right hand tail of the normal distribution. When this
happens the major complaint is likely to do with oversize items bought. The test is still known
as one tailed because the focus is on one end of the normal distribution.
Number of standard errors
                             Two tailed test   One tailed test
5% level of significance     1.96              1.65
1% level of significance     2.58              2.33
1. Normal test (Z test)
Tests a sample mean (x̄) against a population mean (µ) (where sample size n > 30 and
population variance σ² is known) and a sample proportion p (where np > 5 and nq > 5),
since in this case the normal distribution can be used to approximate the binomial distribution.
2. t test
Tests a sample mean (x̄) against a population mean, especially where the population
variance is unknown and n < 30.
3. Variance ratio test or f test
It is used to compare population variances and it is used with samples of any size drawn from
normal populations.
4. Chi squared test
It can be used to test the association between attributes or the goodness of fit of an observed
frequency distribution to a standard distribution
Example 1
A certain NGO carried out a survey in a certain community in order to establish the average age at which
the girls are married. The results of the survey indicated that the marriage age for the girls is 19 years.
In order to establish the validity of the mean marital age, a sample of 50 women was interviewed and
the average age at which they got married was found to be 16 years. However, the ages at
which they were married varied, with a standard deviation of 2.1 years.
The sample data indicates that the marital age is less than 19 years. Is this conclusion true or not?
Required
Conduct a statistical test to either support or reject the above conclusion drawn from the sample statistics, i.e.
that the marriage age is less than 19 years. Use a level of significance of 5%.
Solution
1. Null hypothesis
H0: μ (mean marital age) = 19 years
Alternative hypothesis H1: μ (mean marital age) < 19 years
2. The level of significance is 5%
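[Figure: normal curve with the rejection region to the left of the critical value - 1.65 and the acceptance region to the right.]
Z = (x̄ – µ) / Sx̄ where Sx̄ = s / √n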
The standard value Z must fall within the acceptance region for us to accept the null
hypothesis. Thus it must be > - 1.65, otherwise we accept the alternative hypothesis.
Z = (16 – 19) / (2.1/√50) = - 10.1
6. Since –10.1 < -1.65, we reject the null hypothesis but accept the alternative hypothesis at 5%
level of significance i.e. the marriage age in this community is significantly lower than 19
years
Example 2
A foreign company which manufactures electric bulbs has assured its customers that the lifespan of
the bulbs is 28 months with a standard deviation of 4 months.
Recently the company embarked on a quality improvement research for their product. After the
research using new technology, a sample of 70 bulbs was tested and they gave a mean lifespan of
30.2 months
Does this justify the research undertaken? Use 1% level of significance to conduct a statistical test in
order to establish the truth about the above question.
Testing procedure
1. Null hypothesis H0: µ = 28
Alternative hypothesis H1: µ > 28
2. The level of significance is 1% (one tailed test)
3. The test statistic is the sample mean lifespan, x̄ = 30.2
4. The critical value of the one tailed test at 1% level of significance is + 2.33
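[Figure: normal curve for a right-tail test. The area between the mean and the critical value 2.33 is 0.4900; the rejection region of 1% = 0.01 lies to the right of 2.33.]
Z = (x̄ – µ) / (σ/√n) = (30.2 – 28) / (4/√70) = 4.6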
6. Since 4.6 > 2.33, we reject the null hypothesis but accept the alternative hypothesis at 1%
level of significance, i.e. the new sample mean lifespan is significantly higher than
the population mean.
Therefore the research undertaken was worthwhile or justified.
Example 3
A construction firm has placed an order that they require a consignment of wires which have a mean
length of 10.5 meters with a standard deviation of 1.7 m
The company which produces the wires delivered 90 wires, which had a mean length of 9.2 m. The
construction company rejected the consignment on the grounds that the wires were different from the
order placed.
Required
Conduct a statistical test to indicate whether or not you support the action taken by the
construction company at 5% level of significance.
Solution
Null hypothesis µ = 10.5 m
Alternative hypothesis µ ≠ 10.5 m
Level of significance be 5%
[Figure: normal curve for a two-tailed test with rejection regions beyond the critical values - 1.96 and +1.96.]
Z = (x̄ – µ) / (σ/√n) = (9.2 – 10.5) / (1.7/√90) = - 7.25
Since - 7.25 < - 1.96, reject the null hypothesis but accept the alternative hypothesis at 5% level of
significance, i.e. the sample mean is statistically different from the mean length ordered by the
construction company. Therefore support the action taken by the construction company.
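The three worked examples above follow the same pattern, which can be sketched as below. The critical values are the ones from the table of standard errors (1.65, 2.33 and 1.96); the function names and the "left" / "right" / "two" labels are illustrative.

```python
from math import sqrt

def z_statistic(sample_mean, pop_mean, sd, n):
    """Standardized value of the sample mean."""
    return (sample_mean - pop_mean) / (sd / sqrt(n))

def reject_null(z, critical, tail):
    """Decision rule for a left-tail, right-tail or two-tailed test."""
    if tail == "left":
        return z < -critical
    if tail == "right":
        return z > critical
    if tail == "two":
        return abs(z) > critical
    raise ValueError("tail must be 'left', 'right' or 'two'")

z1 = z_statistic(16, 19, 2.1, 50)     # Example 1: marriage age
z2 = z_statistic(30.2, 28, 4, 70)     # Example 2: bulb lifespan
z3 = z_statistic(9.2, 10.5, 1.7, 90)  # Example 3: wire length

print(round(z1, 1), reject_null(z1, 1.65, "left"))
print(round(z2, 1), reject_null(z2, 2.33, "right"))
print(round(z3, 2), reject_null(z3, 1.96, "two"))
```

In all three cases the null hypothesis is rejected, in line with the conclusions reached above.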
T DISTRIBUTION (STUDENT’S T DISTRIBUTION) TESTS OF HYPOTHESIS (TEST
FOR SMALL SAMPLES N < 30)
For small samples (n < 30), the method used in hypothesis testing is essentially the same as the one for large
samples except that t values from the t distribution at a given number of degrees of freedom v are used instead of the z
score; the standard error statistic Se used is also different.
Note that v = n – 1 for a single sample and n1 + n2 – 2 where two samples are involved.
a) Test of hypothesis about the population mean
When the population standard deviation is not known, the t statistic is defined as
t = (x̄ – µ) / (S/√n)
which follows Student's t distribution with (n-1) d.f., where
x̄ = Sample mean
μ = Hypothesized population mean
n = sample size
and S is the standard deviation of the sample calculated by the formula
S = √( Σ(x – x̄)² / (n – 1) ) for n < 30
If the absolute value of the calculated t exceeds the table value of t at a specified level of significance, the null
hypothesis is rejected.
Example
Ten oil tins are taken at random from an automatic filling machine. The mean weight of the tins is
15.8 kg and the standard deviation is 0.5 kg. Does the sample mean differ significantly from the
intended weight of 16 kg? Use 5% level of significance.
Solution
t = (x̄ – µ) / (S/√n)
= (15.8 – 16) / (0.5/√10)
= -0.2 / 0.16
= -1.25
The table value of t for 9 d.f. at 5% level of significance is 2.26. The absolute value of the computed t is smaller
than the table value of t. Therefore the difference is insignificant and the null hypothesis is accepted.
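The oil tin example can be checked with a short sketch. The table value for 9 d.f. is hard-coded from t tables; without rounding the standard error to 0.16, t comes out at about -1.26, which leads to the same decision. The function name is illustrative.

```python
from math import sqrt

def t_statistic(sample_mean, hypothesised_mean, sd, n):
    """t = (sample mean - hypothesised mean) / (sd / sqrt(n))."""
    return (sample_mean - hypothesised_mean) / (sd / sqrt(n))

t = t_statistic(15.8, 16, 0.5, 10)
t_table = 2.262  # two-tailed table value for 9 d.f. at the 5% level

significant = abs(t) > t_table
print(round(t, 2), significant)
```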
CHI SQUARE HYPOTHESIS TESTS (NON-PARAMETRIC TEST) (χ2)
They include amongst others
i. Test for goodness of fit
ii. Test for independence of attributes
iii. Test of homogeneity
iv. Test for population variance
The Chi square test (χ2) is used when comparing an actual (observed) distribution with a
hypothesized, or expected, distribution.
Is the result consistent with the hypothesis that male and female births are equally probable at 5%
level of significance?
Solution
If male and female births are equally probable then the number of boys in a family conforms to a binomial
distribution with probability p = ½.
Therefore
H0: the observed number of boys conforms to a binomial distribution with p = ½
H1: the observations do not conform to a binomial distribution.
On the assumption that male and female births are equally probable, the probability of a male birth is
p = ½. The expected number of families can be calculated by the use of the binomial distribution. The
probability of x male births in a family of 5 is given by
P(x) = 5Cx p^x q^(5 – x) (for x = 0, 1, 2, 3, 4, 5)
= 5Cx (½)^5 (since p = q = ½)
To get the expected frequencies, multiply P(x) by the total number N = 320. The calculations are
shown below in the tables
Arranging the observed and expected frequencies in the following table and calculating χ2:
O      E      (O – E)²   (O – E)²/E
14     10     16         1.60
56     50     36         0.72
110    100    100        1.00
88     100    144        1.44
40     50     100        2.00
12     10     4          0.40
Σ(O – E)²/E = 7.16
χ2 = Σ(O – E)²/E
= 7.16
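The expected frequencies and the χ2 value can be reproduced with a short sketch. The observed frequencies are those in the table; the degrees of freedom (6 – 1 = 5) and the 5% critical value of 11.070 are standard table values assumed here, since the concluding step is not shown above.

```python
from math import comb

observed = [14, 56, 110, 88, 40, 12]
n_families = sum(observed)  # 320

# Binomial probabilities for x = 0..5 boys with p = q = 1/2,
# multiplied by the number of families to get expected frequencies.
expected = [n_families * comb(5, x) * 0.5 ** 5 for x in range(6)]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_5pct_5df = 11.070  # chi-square table value, 5 d.f., 5% level (assumed)
print(expected)
print(round(chi2, 2), chi2 > critical_5pct_5df)
```

Under these assumptions the calculated value is below the critical value, so the data would be consistent with equally probable male and female births.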
E = (R × C) / n
Where R = a row total, C = a column total and n = sample size
3. Based upon the observed values and corresponding expected frequencies the χ2
statistic is obtained using the formula
χ2 = Σ (O – E)² / E
4. The characteristics of this distribution are defined by the number of degrees of freedom
(d.f.) which is given by
d.f. = (r-1) (c-1),
where r is the number of rows and c is the number of columns. Corresponding to a chosen
level of significance, the critical value is found from the chi squared table.
5. The calculated value of χ2 is compared with the tabulated value of χ2 for (r-1) (c-1)
degrees of freedom at a certain level of significance. If the computed value of χ2 is
greater than the tabulated value, the null hypothesis of independence is rejected.
Otherwise, we accept it.
Example
In a sample of 200 people suffering from a particular disease, 100 were given a drug and the
others were not given any drug. The results are as follows
             Drug   No drug   Total
Cured        65     55        120
Not cured    35     45        80
Total        100    100       200
Test whether the drug is effective or not, at 5% level of significance.
Solution
Let us take the null hypothesis that the drug is not effective in curing the disease.
Applying the χ2 test
The expected cell frequencies are computed as follows
E11 = (120 × 100) / 200 = 60
E12 = (120 × 100) / 200 = 60
E21 = (80 × 100) / 200 = 40
E22 = (80 × 100) / 200 = 40
The table of expected frequencies is as follows
60     60     120
40     40     80
100    100    200
Arranging the observed frequencies with their corresponding expected frequencies in the following table we
get
O      E      (O – E)²   (O – E)²/E
65     60     25         0.417
55     60     25         0.417
35     40     25         0.625
45     40     25         0.625
Σ(O – E)²/E = 2.084
χ2 = Σ(O – E)²/E
= 2.084
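The table value of χ2 for (2-1) (2-1) = 1 d.f. at 5% level of significance is 3.841. The calculated value
of χ2 is less than the table value. The hypothesis is accepted. Hence the data do not show that the
drug is effective in curing the disease.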
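The whole test of independence for the drug example can be sketched as follows. The observed counts are those in the contingency table; the critical value 3.841 (1 d.f., 5% level) is a standard table value; variable names are illustrative.

```python
observed = [
    [65, 55],  # cured:     drug, no drug
    [35, 45],  # not cured: drug, no drug
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell: (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical = 3.841  # chi-square table value for 1 d.f. at the 5% level
print(expected, round(chi2, 3), df, chi2 > critical)
```

Without intermediate rounding the statistic is about 2.083, still below the table value, so the null hypothesis of independence is accepted.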
Test of homogeneity
It is concerned with the proposition that several populations are homogenous with respect to some
characteristic of interest e.g. one may be interested in knowing if raw material available from several
retailers are homogenous. A random sample is drawn from each of the population and the number in
each of sample falling into each category is determined. The sample data is displayed in a
contingency table
The analytical procedure is the same as that discussed for the test of independence
Example
A random sample of 400 persons was selected from each of three age groups and each person was
asked to specify which type of TV program he or she preferred. The results are shown in the following
table
Type of program
Age group A B C Total
χ2 =
The table value of χ2 for 4d.f. at 5% level of significance is 9.488
The calculated value of χ2 is greater than the table value. We reject the hypothesis and conclude that
the populations are not homogenous with respect to the type of TV programs preferred, thus the
different age groups vary in choice of TV programs.
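(a) t test
t = (x̄ – µ) / Sx̄ where Sx̄ = S / √n
at n – 1 d.f. and the chosen level of significance
(b) Chi-square test
χ2 = Σ (O – E)² / E
Where O = observed frequency
E = (row total × column total) / n = expected frequency
(c) F – test (variance test)
F = S1² / S2²
Here the bigger of the two values makes the numerator, i.e. the variance of the sample with the larger standard deviation.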
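A minimal sketch of the variance ratio follows. The two standard deviations are hypothetical figures chosen only to show that the larger variance always goes in the numerator; the function name is illustrative.

```python
def f_ratio(sd_a, sd_b):
    """Variance ratio with the larger variance as the numerator."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)

# Hypothetical sample standard deviations, not taken from the text.
f_value = f_ratio(4.0, 6.0)
print(f_value)  # 2.25
```

Because of this convention the ratio is never less than 1, whichever sample is listed first.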