Introduction To Statistics

The document provides a comprehensive introduction to statistics, defining it as a mathematical branch focused on data collection, analysis, and interpretation. It outlines the two main branches of statistics—descriptive and inferential—and discusses key characteristics, functions, limitations, and types of variables. Additionally, it covers statistical methods, data collection techniques, sampling methods, and applications of statistics in business, including quality control, inventory management, and forecasting.
Introduction to business statistics

Introduction to Statistics
Statistics is a branch of mathematics that deals with the collection, organization, presentation,
analysis, and interpretation of numerical data. It helps in making informed decisions across various
fields, including economics, business, medicine, and the social and physical sciences.
Definition and Meaning of Statistics
The term "statistics" has two distinct meanings:
Statistics in a Plural Sense
Refers to numerical descriptions of quantitative aspects of objects or events.
Represents numerical facts related to a collection of data.
Statistics in a Singular Sense
Refers to the scientific methods used in collecting, organizing, presenting, analyzing, and
interpreting data.
Aids in making effective decisions based on numerical evidence.

Statistics can be understood as either statistical data or statistical methods:


Statistical Data refers to numerical descriptions of facts or events.
Statistical Methods involve techniques and procedures used in handling numerical data.
Note: Not all numerical data qualifies as statistics. To be considered statistics, the data must exhibit
essential characteristics such as accuracy, reliability, and relevance.
Branches of Statistics
Statistics is broadly classified into two branches:
 Descriptive Statistics
Involves the collection, organization, summarization, and presentation of data.
Data is presented in various forms such as tables, graphs, diagrams, or numerical summaries.
The purpose is to display and communicate information for better understanding and decision-
making.
Example: Businesses use descriptive statistics when presenting annual reports and financial
statements.
 Inferential (Inductive) Statistics
Focuses on drawing conclusions from sample data and making generalizations about a population.

Mr. owino 1
Includes performing estimations, hypothesis testing, determining relationships between variables,
and making predictions.
Example: Researchers use inferential statistics to analyze survey results and predict future trends.

Characteristics of Statistics
Statistics is a scientific method used to collect, analyze, and interpret numerical data. The key
characteristics of statistics include:
 Statistics Means an Aggregate of Facts
Statistics refers to a collection of facts that can be analyzed.
A single fact cannot be statistically analyzed; multiple facts are required.
Example: The weights of 60 students in a class can be analyzed, but the weight of a single student is
not considered statistics.
 Statistics Are Affected by Multiple Causes
Statistical facts are influenced by various interacting factors.
The results observed are not due to a single cause but a combination of several factors.
 Statistics Are Numerically Expressed
Only numerical data can be analyzed statistically.
Example: A statement like "Price decreases with increasing production" is not statistics unless
expressed numerically.
 Statistics Are Enumerated or Estimated with Accuracy
Statistical data should be measured or estimated with an appropriate degree of accuracy.
The level of accuracy required depends on the purpose of the study.
 Statistics Are Collected Systematically
Data should be collected using planned and scientific methods.
Improper collection methods may result in misleading conclusions.
 Statistics Are Collected for a Pre-Determined Purpose
Statistical data must have a clear objective.
Data collected without a specific purpose may be irrelevant or unusable.
 Statistics Are Related to Each Other
Statistical facts should be arranged logically to facilitate comparison and analysis.

Only related data, when properly organized, can provide meaningful insights.
Functions of Statistics
Statistics serves various functions in data analysis and decision-making. The six major functions
include:
i. Presentation of Facts in a Precise and Definite Form
Statistics ensures clarity and avoids ambiguity in data representation.
ii. Simplification of Large Data Sets
Large amounts of data are condensed and organized for better understanding.

iii. Facilitation of Comparison


Statistical tools help in comparing different data sets.
Example: Comparing imports and exports of a country or between different countries.
iv. Formulation and Testing of Hypotheses
Statistics provides methods to analyze data and validate hypotheses.
Example: It helps in analyzing the relationship between economic factors such as production and
price.
v. Aiding in Policy Making and Planning
Governments and businesses use statistical data for planning and decision-making.
Example: Economic policies are developed based on statistical trends in production and demand.
vi. Studying Relationships Between Different Factors
Statistical methods help analyze correlations between variables.
Example: Studying the relationship between production levels and product pricing.
Limitations of Statistics
While statistics is a powerful tool for analysis, it has limitations, including:
a. Statistics applies only to data that can be quantitatively measured.
b. Non-measurable factors (e.g., emotions, opinions) cannot be analyzed using statistical
methods.
c. The accuracy of statistical conclusions depends on the quality of data collection methods.
TYPES OF VARIABLES
In statistics, variables are classified into two main categories: qualitative and quantitative variables.
Understanding these types of variables is essential in data collection, analysis, and interpretation.

1. Qualitative Variables
A qualitative variable (also known as a categorical variable) represents attributes or characteristics
that cannot be expressed numerically. These variables are descriptive in nature and categorize data
into distinct groups.
Examples:
Sex: Male or Female
State of Birth: Nairobi, Mombasa, Kisumu, etc.
Cause of Death: Heart disease, Cancer, Accident, etc.
Religion: Catholic, Protestant, Muslim, Hindu, etc.
Key Characteristics:
Qualitative data is often summarized using frequency counts or proportions.
It is typically displayed using bar charts, pie charts, and frequency tables.
2. Quantitative Variables
A quantitative variable is numerical and represents measurable quantities. These values can be
ordered, ranked, and used in mathematical computations.
Quantitative variables are further divided into:
a. Discrete Variables
A discrete variable assumes specific, countable values, usually whole numbers. It cannot take
fractional or decimal values.
Examples:
Number of children in a family (e.g., 1, 2, 3, etc.)
Number of rooms in a house
Number of patients in a hospital
b. Continuous Variables
A continuous variable can take any value within a given range. These values are obtained by
measurement and may include fractions or decimals.
Examples:
Height of individuals (e.g., 170.5 cm, 180.2 cm)
Weight of a person (e.g., 65.8 kg, 72.4 kg)
Age (e.g., 25.5 years, 30.2 years)
Body temperature (e.g., 36.5°C, 37.2°C)

Key Characteristics:
Continuous data can be represented using histograms, line graphs, and scatter plots.
Measurement precision depends on the measuring instrument used.
STATISTICAL METHODS
Introduction to Statistical Methods
Statistical methods are essential for analyzing numerical data, making informed decisions, and
understanding various phenomena. The following procedures are commonly adopted:
 Data Collection – Gather relevant information.
 Data Organization – Arrange data systematically.
 Data Presentation – Use diagrams, charts, and tables.
 Data Analysis – Compute averages, measure disparities.
 Data Interpretation – Understand patterns and trends.
 Policy Decision – Utilize findings for improvement.
I. Collection of Data
Statistics is centered around numerical data analysis. The first step in statistical methodology is data
collection, which can be categorized into:
a. Primary Data
Primary data is collected firsthand by an investigator for a specific purpose, making it original and
reliable. It is obtained through:
Census: Examines every item in the population, ensuring completeness and representativeness but
requiring significant resources.
Sample: A subset of the population that is cost-effective and time-efficient to study, but whose
results are subject to sampling error and may not be accepted as fully representative.
b. Secondary Data
Secondary data refers to information collected by other researchers or agencies. Examples include
government publications such as financial statistics and economic reports.
Advantages of Secondary Data:
Saves time, manpower, and resources.
Potential Issues with Secondary Data:
 May be outdated.
 Data collection and analysis methods may be unknown.

 Risk of bias due to different collection purposes.
 Errors in transcription or publication.
Considerations Before Using Secondary Data:
 Purpose and credibility of the publishing institution.
 Accuracy and relevance to the research problem.
 Completeness and consistency across sources.
 Homogeneity of data conditions.
Sampling Methods
Sampling allows researchers to study a portion of a population rather than the entire group. There
are two primary categories:
i. Probabilistic Methods
 Simple Random Sampling – Every item in the population (and hence every possible sample of a given size) has an equal chance of selection.
 Stratified Random Sampling – The population is divided into strata (e.g., age, gender) with
random sampling from each.
 Cluster Sampling – Groups or clusters (e.g., city blocks) are randomly selected, and all
elements within them are studied.
ii. Non-Probabilistic Methods
 Judgment Sampling – The researcher selects items based on expertise.
 Convenience Sampling – Items are chosen based on ease of access.
 Quota Sampling – Ensures the sample contains specific characteristics (e.g., political
polling).
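The three probabilistic methods listed above can be sketched in Python with the standard library alone. This is a minimal illustration under stated assumptions, not a survey-grade implementation: the `voters` population, its `gender` and `block` fields, and all function names are hypothetical and do not come from the notes.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n items so that every item has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Split the population into strata, then sample randomly inside each one."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


# Hypothetical population of 1,000 voters spread over 10 city blocks.
voters = [{"id": i, "gender": "F" if i % 2 else "M", "block": i % 10}
          for i in range(1000)]

srs = simple_random_sample(voters, 100, seed=1)
strat = stratified_sample(voters, lambda v: v["gender"], 50, seed=1)

blocks = {}
for v in voters:
    blocks.setdefault(v["block"], []).append(v)
clus = cluster_sample(blocks, 2, seed=1)

print(len(srs), len(strat), len(clus))
```

The stratified sample guarantees 50 voters of each gender, whereas the simple random sample only achieves that balance on average.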
Population vs. Sample
Population: The entire collection of items being studied.
Sample: A subset selected to represent the population.
Example:
To study high-school teachers' salaries, the population consists of all teachers' salaries, whereas a
sample could be 100 randomly chosen salaries.
In a political poll, the population includes all eligible voters, while a sample consists of 1,000
randomly selected individuals.
Misuse of Statistics
Statistical data itself is neutral, but misuse can occur due to:

 Using data for unintended purposes.
 Bias in data collection.
 Careless analysis leading to misleading conclusions.
NB: To ensure reliability, data should be collected, analyzed, and interpreted systematically,
keeping in mind the integrity of statistical principles.
Application of Statistics in Business
 Quality Control
In every industry, quality control plays a crucial role in ensuring that products meet the required
standards. Quality control departments are responsible for maintaining these standards, ensuring that
goods comply with customer expectations. In Kenya, the Kenya Bureau of Standards (KEBS) is a
national institution that inspects and certifies products on behalf of the government to guarantee
quality assurance.
To achieve this, KEBS and other regulatory bodies have developed quality control charts. These
charts serve as monitoring tools that help determine whether manufactured products meet the desired
standards. If a product deviates from the acceptable quality levels, corrective actions are taken to
maintain consistency in production and customer satisfaction.
 Economic Order Quantity (EOQ) and Inventory Management
Statistics plays a significant role in inventory management, particularly in determining the Economic
Order Quantity (EOQ). EOQ is the optimal quantity of stock that should be ordered to minimize
costs while ensuring that customer demand is met efficiently.
Ordering large quantities of stock may lead to excessive storage costs and capital being tied up in
inventory, which could have been used for other business investments. Conversely, ordering small
quantities reduces storage costs but may lead to stockouts, resulting in unsatisfied customers and lost
sales.
By applying statistical analysis, businesses can calculate the EOQ that balances storage costs and
customer demand, ensuring cost-effectiveness and operational efficiency.
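The notes do not state an EOQ formula. A common textbook choice is the classic (Wilson) model, EOQ = sqrt(2DS / H), where D is annual demand, S is the cost of placing one order and H is the cost of holding one unit for a year. The sketch below assumes that model; the demand and cost figures are hypothetical.

```python
import math


def economic_order_quantity(annual_demand, cost_per_order, holding_cost_per_unit):
    """Classic EOQ model (an assumption; the notes give no formula)."""
    return math.sqrt(2 * annual_demand * cost_per_order / holding_cost_per_unit)


def total_annual_cost(order_qty, annual_demand, cost_per_order, holding_cost_per_unit):
    """Ordering cost plus holding cost for a given order quantity."""
    ordering = (annual_demand / order_qty) * cost_per_order
    holding = (order_qty / 2) * holding_cost_per_unit
    return ordering + holding


# Hypothetical figures: 1,000 units a year, Shs 50 per order, Shs 4 per unit held.
D, S, H = 1000, 50, 4
eoq = economic_order_quantity(D, S, H)

for q in (eoq / 2, eoq, eoq * 2):
    print(round(q, 1), round(total_annual_cost(q, D, S, H), 2))
```

Ordering half or double the EOQ raises the total cost, which is the trade-off between storage costs and stockouts described above.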
 Forecasting in Business
Forecasting is an essential statistical application that enables business managers to predict future
trends and outcomes. Statistical methods, such as regression analysis, help in establishing
relationships between dependent and independent variables. This allows businesses to develop
predictive models that guide decision-making processes.

For example, sales forecasting helps businesses anticipate demand, manage inventory efficiently,
and optimize production schedules. Similarly, financial forecasting enables organizations to plan
budgets, allocate resources, and set strategic goals based on historical data and market trends.
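As a rough illustration of regression-based forecasting, the sketch below fits a straight line by ordinary least squares and extends it one period ahead. The monthly sales figures and the function name are hypothetical; a real forecast would also need checks on the quality of the fit.

```python
def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: month number (independent) and sales in Shs '000 (dependent).
months = [1, 2, 3, 4, 5, 6]
sales = [12, 14, 17, 19, 22, 24]

intercept, slope = fit_line(months, sales)
forecast_month_7 = intercept + slope * 7
print(round(intercept, 2), round(slope, 3), round(forecast_month_7, 2))
```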
 Human Resource Management
Statistics is also valuable in human resource management for workforce planning and organizational
improvement. By conducting employee surveys and analyzing workforce data, businesses can
identify areas where management needs to improve.
For instance, statistical analysis of employee resignation trends can help identify reasons for high
turnover rates. If data shows that employees are leaving due to workplace dissatisfaction,
management can take corrective measures such as improving working conditions, enhancing
employee engagement, and offering better incentives.
Additionally, statistical tools can be used to measure employee performance, training effectiveness,
and productivity levels, aiding in informed decision-making for human resource development.
TOPIC TWO: STATISTICAL MEASUREMENTS
MEASURES OF CENTRAL TENDENCY
These are statistical values which tend to occur at the centre of any well-ordered set of data.
Wherever these measures occur, they indicate the centre of that data. These measures are as
follows:
 The arithmetic mean
 The mode
 The median
 The geometric mean
 The harmonic mean
1. The arithmetic mean
This is commonly known as the average or mean. It is obtained by summing the values given and
dividing the total by the number of observations, i.e.

$$\bar{x} = \frac{\sum X}{n}$$

Where X = the individual values (observations)
∑ = summation
n = number of observations

Example
The mean of 60, 80, 90, 120:

$$\bar{x} = \frac{60 + 80 + 90 + 120}{4} = \frac{350}{4} = 87.5$$
The arithmetic mean is very useful because it represents the values of most observations in the
population.
The mean therefore describes the population quite well in terms of the magnitudes attained by most
of the members of the population
Computation of the mean from grouped Data i.e. in classes.
When calculating the mean of grouped data, we use the following formula:

$$\bar{x} = \frac{\sum fX}{\sum f}$$

Where: f = Frequency
X = Class midpoint
Example:
Compute the mean amount of expenditure on food and other household items incurred by the Bahati
Estate families last month.
Expenditure (Shs '000s)   No. of Families (f)   Class Midpoint (x)   fx
5 – 10                    8                     7.5                  60
10 – 15                   9                     12.5                 112.5
15 – 20                   15                    17.5                 262.5
20 – 25                   11                    22.5                 247.5
25 – 30                   6                     27.5                 165
30 – 35                   3                     32.5                 97.5
35 – 40                   2                     37.5                 75
Total                     ∑f = 54                                    ∑fx = 1020

$$\bar{x} = \frac{\sum fX}{\sum f} = \frac{1020}{54} = 18.889$$

i.e. Shs 18,889
This means that on average each of the Bahati Estate families spent about Shs 18,889 on food and
other household items last month.
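The grouped-data calculation can be checked with a short Python sketch. The class limits and frequencies are those of the Bahati Estate table; the function and variable names are illustrative only.

```python
# (lower limit, upper limit, frequency) for each expenditure class, in Shs '000s.
bahati_classes = [
    (5, 10, 8), (10, 15, 9), (15, 20, 15), (20, 25, 11),
    (25, 30, 6), (30, 35, 3), (35, 40, 2),
]


def grouped_mean(classes):
    """Estimate the mean from class midpoints weighted by frequency."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_fx / total_f


mean_expenditure = grouped_mean(bahati_classes)
print(round(mean_expenditure, 3))  # in Shs '000s
```

Because midpoints stand in for the individual values, the result is an estimate of the true mean rather than an exact figure.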
Characteristics of the Mean
The mean, also known as the arithmetic average, is one of the most commonly used measures of
central tendency. It provides a representative value for a given set of data. Below are the key
characteristics of the mean:
 Applicability to Interval and Ratio Data
The mean can only be calculated for data measured on an interval or ratio scale. These scales
provide meaningful numerical values that allow for proper computation of the average.
 Inclusion of All Values in Computation
When computing the mean, all values in the dataset are considered. This is particularly true for both
the arithmetic mean and the weighted mean. However, for grouped data, the mean is only an
estimate because individual data values are not directly used; instead, class midpoints are utilized.

 Uniqueness of the Mean


A given dataset has a single, unique mean. This ensures consistency and reliability when
summarizing data.
 Influence of Extreme Values

The mean is highly sensitive to extreme values (outliers). Unusually large or small values can
significantly impact the mean, making it higher or lower than the typical data points.
 Sum of Deviations from the Mean
One of the unique mathematical properties of the arithmetic mean is that the sum of the deviations of
all data points from the mean is always zero. This characteristic is fundamental in various statistical
applications and calculations.
 Challenges with Open-Ended Distributions
It may not always be possible to compute the mean for open-ended distributions. An open-ended
distribution is one where the lowest class has no defined lower limit or the highest class lacks an
upper limit. Since these distributions do not provide a definitive class midpoint, accurate
computation of the mean becomes difficult.
The following statistical terms are commonly used in statistical calculations. They must therefore be
clearly understood.
 Class limits
These are the numerical values which mark the extent of a given class, i.e. all the observations in a
given class are expected to fall within the interval bounded by the class limits, e.g. 5 and 10
are the class limits of the first class in the table of the example above.
 Class boundaries
These are the values which separate one class from the next. They are usually
determined by adding the upper limit of one class to the lower limit of the next class and dividing
by 2, e.g. in the above table the class boundary between the classes 5 – 10 and 10 – 15 is

$$\frac{10 + 10}{2} = 10$$

 Class midpoints
These are very important values which mark the centre of a given class. They are obtained by adding
together the two limits of a given class and dividing the result by 2.
 Class interval/width
This is the difference between an upper-class boundary and lower-class boundary. The value usually
measures the length of a given class.
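The four terms above can be computed mechanically from the class limits. The sketch below assumes equally spaced classes with a constant gap between one class's upper limit and the next class's lower limit; the function name and the sample limits are illustrative.

```python
def describe_classes(limits):
    """Return (lower boundary, upper boundary, midpoint, width) for each class.

    `limits` is a list of (lower limit, upper limit) pairs with at least two
    classes; the gap between consecutive classes is assumed to be constant.
    """
    gap = limits[1][0] - limits[0][1]
    rows = []
    for lower, upper in limits:
        lower_boundary = lower - gap / 2
        upper_boundary = upper + gap / 2
        midpoint = (lower + upper) / 2
        width = upper_boundary - lower_boundary
        rows.append((lower_boundary, upper_boundary, midpoint, width))
    return rows


# Classes whose limits touch (gap of 0), as in the expenditure table.
touching = describe_classes([(5, 10), (10, 15), (15, 20)])

# Classes with a gap of 1 between limits, e.g. 1 - 20, 21 - 40, ...
gapped = describe_classes([(1, 20), (21, 40), (41, 60), (61, 80)])

print(touching[0])
print(gapped[3])
```

With gapped limits such as 61 to 80, the boundaries move out by half the gap (60.5 and 80.5), which is why formulas that need a class boundary do not use the printed limit directly.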


2. The Mode
The mode is one of the key measures of central tendency in statistics. It is defined as the value in a
frequency distribution that appears most frequently. For grouped data, where individual values are
not available, the class with the highest frequency is referred to as the modal class.
Importance of the Mode in Business
The mode is widely used in business to make data-driven decisions. Firms often rely on this measure
to determine the most popular or frequently demanded products. For example, businesses may use
the mode to:
 Identify the most frequently purchased footwear or clothing items.
 Determine the most popular construction materials such as beams, wires, and iron sheets.
 Analyze customer preferences for better inventory management.
Determining the Mode
 Ungrouped Data
For ungrouped data, the mode is determined by:
 Arranging the given values in ascending or descending order.
 Identifying the value that appears most frequently.
 Grouped Data
When dealing with grouped data, the mode can be determined using the modal class (the class with
the highest frequency). The mode can be calculated using the following formula:
$$\text{Mode} = L + \left[\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right] \times c$$

Where:
L = Lower boundary of the modal class
f1 = Frequency of the modal class
f0 = Frequency of the class before the modal class
f2 = Frequency of the class after the modal class
c = Class width
Example
IQ Range     Frequency (f)   Upper Class Bound   Cumulative Frequency (CF)
1 – 20       6               20                  6
21 – 40      18              40                  24
41 – 60      32 (f0)         60                  56
61 – 80      48 (f1)         80                  104
81 – 100     27 (f2)         100                 131
101 – 120    13              120                 144
121 – 140    2               140                 146

The modal class is 61 – 80, whose lower boundary is L = 60.5, so:

$$\text{Mode} = 60.5 + \left[\frac{48 - 32}{2(48) - 32 - 27}\right] \times 20 = 60.5 + \frac{16}{37} \times 20$$

Mode = 69.15
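The grouped-mode formula can be sketched in Python as follows. The frequencies are those of the IQ table; the lower class boundaries (0.5, 20.5, ...) are derived from its class limits, and the function name is illustrative.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * c for the modal class."""
    i = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                  # class before, if any
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0     # class after, if any
    return lower_boundaries[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width


iq_lower_boundaries = [0.5, 20.5, 40.5, 60.5, 80.5, 100.5, 120.5]
iq_freqs = [6, 18, 32, 48, 27, 13, 2]

iq_mode = grouped_mode(iq_lower_boundaries, iq_freqs, width=20)
print(round(iq_mode, 2))
```

Treating a missing neighbouring class as frequency 0 is a design choice of this sketch, made so that the function still works when the modal class is the first or last one.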
3. The Median
The median is a statistical value that represents the middle value in a given set of data that has been
arranged in ascending order.
It is an important measure of central tendency because it divides the data into two equal halves,
ensuring that the number of observations below and above it is equal.
Example
Consider the following data set:
14, 17, 9, 8, 20, 32, 18, 14.5, 13
Step 1: Arrange the data in ascending order
8, 9, 13, 14, 14.5, 17, 18, 20, 32
Step 2: Identify the middle value
Since the dataset consists of 9 values, the middle value is the 5th observation in the ordered set:
Median = 14.5
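The two steps can be written as a small Python function. The sketch also covers the case of an even number of observations, where the usual convention is to average the two middle values; the extra value 40 added below is hypothetical.

```python
import statistics


def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)          # Step 1: arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # Step 2: odd n has a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [14, 17, 9, 8, 20, 32, 18, 14.5, 13]
median_odd = median(data)
median_even = median(data + [40])     # hypothetical tenth observation

print(median_odd, median_even)
print(statistics.median(data))        # the standard library agrees
```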
Importance of the Median
 The median divides the data into two equal halves.
 It is not affected by extreme values (outliers), making it a reliable measure of central
tendency for skewed data.

 It provides a more accurate representation of the center of a dataset compared to the mean
when data is asymmetrical.
Determining the Median for Grouped Data
When data is grouped into classes, the median can be calculated using the median formula:
$$\text{Median} = L + \left[\frac{\frac{n+1}{2} - \text{Cfbm}}{\text{Cfmc}}\right] \times C$$
Where:
L = Lower boundary of the median class
n = Total number of observations
Cfbm = Cumulative frequency before the median class
Cfmc = Frequency of the median class
C = Class width
Example
Referring to the table below, determine the median
IQ Range No. of Residents (f) UCB Cumulative Frequency
0 – 20 6 20 6
20 – 40 18 40 24
40 – 60 32 60 56
60 – 80 48 80 104
80 – 100 27 100 131
100 – 120 13 120 144
120 – 140 2 140 146

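Here n = 146, so (n + 1)/2 = 73.5. This position falls in the class 60 – 80 (cumulative frequency
56 before it and 104 at its end), which is therefore the median class.

$$\text{Median} = L + \left[\frac{\frac{n+1}{2} - \text{Cfbm}}{\text{Cfmc}}\right] \times C = 60 + \left[\frac{73.5 - 56}{48}\right] \times 20$$

Median = 60 + 7.29
Median = 67.29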
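A minimal Python sketch of the grouped-median formula follows. It keeps the (n + 1)/2 position used in these notes (some texts use n/2 instead), takes the frequencies from the table above, and uses an illustrative function name.

```python
def grouped_median(lower_boundaries, freqs, width):
    """Median = L + ((n + 1)/2 - Cfbm) / Cfmc * C, using the notes' convention."""
    n = sum(freqs)
    position = (n + 1) / 2
    cumulative = 0                      # cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, freqs):
        if cumulative + f >= position:  # first class whose CF reaches the position
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("no observations")


iq_class_lower = [0, 20, 40, 60, 80, 100, 120]
iq_residents = [6, 18, 32, 48, 27, 13, 2]

iq_median = grouped_median(iq_class_lower, iq_residents, width=20)
print(round(iq_median, 2))  # 60 + (73.5 - 56) / 48 * 20
```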
4. Geometric mean
This is a measure of central tendency normally used to measure industrial growth rates.
It is defined as the nth root of the product of 'n' observations or values, i.e.

$$GM = \sqrt[n]{X_1 \times X_2 \times X_3 \times \dots \times X_n}$$

Example
In 1995 five firms registered the following economic growth rates: 26%, 32%, 41%, 18% and 36%.
Required
Calculate the GM for the above values.

$$GM = \sqrt[5]{26 \times 32 \times 41 \times 18 \times 36} = \sqrt[5]{22{,}104{,}576}$$

GM = 29.44
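The same calculation can be sketched in Python. Averaging logarithms, as done below, is equivalent to taking the nth root of the product and avoids very large intermediate products; the function name is illustrative.

```python
import math


def geometric_mean(values):
    """nth root of the product, computed via the mean of the logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


growth_rates = [26, 32, 41, 18, 36]   # percentages from the example
gm = geometric_mean(growth_rates)
print(round(gm, 2))
```

The explicit check on non-positive values reflects the fact that the geometric mean is undefined when the data contain zeros or negative numbers.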
5. Harmonic mean
This is a measure of central tendency which is used to determine the average growth rates for natural
economies. It is defined as the reciprocal of the average of the reciprocals of all the values, i.e.

$$HM = \frac{1}{\frac{1}{n}\left(\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \dots + \frac{1}{x_n}\right)}$$
Example
The economic growth rates of five countries were given as 20%, 15%, 25%, 18% and 5%
Calculate the harmonic mean
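$$HM = \frac{1}{\frac{1}{5}\left(\frac{1}{20} + \frac{1}{15} + \frac{1}{25} + \frac{1}{18} + \frac{1}{5}\right)} = \frac{1}{\frac{1}{5}(0.4122)} = \frac{1}{0.08244}$$

HM = 12.13 %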
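A direct Python sketch of the definition is shown below. Evaluated for the five rates given in the example (20, 15, 25, 18 and 5), it returns approximately 12.13. The function name is illustrative.

```python
def harmonic_mean(values):
    """Reciprocal of the average of the reciprocals: n / sum(1/x)."""
    return len(values) / sum(1 / v for v in values)


rates = [20, 15, 25, 18, 5]            # percentages from the example
hm = harmonic_mean(rates)
print(round(hm, 2))
```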

Merits and demerits of the measures of central tendency
1. The arithmetic mean (a.m)
Merits
 It utilizes all the observations given
 It is a very useful statistic in terms of applications. It has several applications in business
management, e.g. hypothesis testing, quality control, etc.
 It is the best representative of a given set of data if such data was obtained from a normal
population
 The a.m. can be determined accurately using mathematical formulas
Demerits of the a.m.
 If the data is not drawn from a ‘normal’ population, then the a.m. may give a wrong
impression about the population
 In some situations, the a.m. may give unrealistic values especially when dealing with discrete
variables, e.g. when working out the average number of children in a number of families: the
average may come out as 4.4, which is not a realistic number of children.
2. The mode
Merits
 It can be determined from incomplete data provided the observations with the highest
frequency are already known
 The mode has several applications in business
 The mode can be easily defined
 It can be determined easily from a graph
Demerits
 If the data is quite large and ungrouped, determination of the mode can be quite cumbersome
 Use of the formula to calculate the mode is unfamiliar to most business people
 The mode may sometimes be non-existent or there may be two modes for a given set of data.
In such a case therefore, a single mode may not exist
3. The median
Merits
 It shows the centre of a given set of data

 Knowledge of the determination of the median may be extended to determine the quartiles
 The median can easily be defined
 It can be obtained easily from the cumulative frequency curve
 It can be used in determining the degree of skewness (see later)
Demerits
 In some situations where the number of observations is even, the median is the average of the two
middle values and may not correspond to any actual observation
 The computation of the median using the formulas is not well understood by most
business people
 In the business environment the median has very few applications
4. The geometric mean
Merits
 It makes use of all the values given (except when x = 0 or negative)
 It is the best measure for industrial growth rates
Demerits
 The determination of the GM using logarithms is not a familiar process to all those
expected to use it, e.g. managers
 If the data contains zeros or negative values, the GM ceases to exist
5. The harmonic mean and weighted mean
 Merits – same as the arithmetic mean
 Demerits – same as the arithmetic mean

MEASURES OF DISPERSION
Measures of dispersion are essential statistical tools that help determine how data values are spread
around the mean. They provide insights into whether data points are clustered close to the mean or
widely scattered.
When data points are closely dispersed around the mean, the measure of dispersion is small,
indicating that the mean is a reliable representation of the dataset.
When data points are widely spread, the measure of dispersion is large, suggesting that the mean
does not adequately represent the dataset.
The most frequently used measures of dispersion include:
 Range
 Mean Absolute Deviation
 Standard Deviation
 Semi-Interquartile Range (Quartile Deviation)
 10th and 90th Percentile Range
 Variance
1. The Range
The range is the simplest measure of dispersion and is defined as the difference between the highest
and lowest values in a dataset:
Range = Maximum Value – Minimum Value
Characteristics of the Range:
i. It only considers two values from the dataset, making it less efficient in describing data
dispersion.
ii. A smaller range indicates that data points are closer to the mean, implying lower dispersion.
iii. A larger range suggests that data points are more widely spread, indicating higher dispersion.
Limitations of the Range:
i. It does not account for the distribution of other values within the dataset.
ii. Two datasets can have the same range but differ significantly in their overall dispersion.
iii. Due to its limitations, the range is not commonly used in business management for decision-
making.
2. The standard deviation

This is one of the most accurate measures of dispersion.
It has the following advantages;
i. It utilizes all the values given
ii. It makes use of both negative and positive values if they occur
iii. The standard deviation reflects an accurate impression of how much the sample data
varies from the mean. This is because its suitability can also be tested using other
statistical methods
Example

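A sample comprises the following observations: 14, 18, 17, 16, 25, 31
Determine the standard deviation of this sample.
The deviations are measured from the mean x̄ = 121/6, taken as 20.1 in the table below.

Observation (x)   x − x̄    (x − x̄)²
14                -6.1     37.21
18                -2.1     4.41
17                -3.1     9.61
16                -4.1     16.81
25                4.9      24.01
31                10.9     118.81
Total 121                  210.86

∴ standard deviation,

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{\frac{210.86}{6}}$$

= 5.93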
Alternative method
x           x²
14          196
18          324
17          289
16          256
25          625
31          961
Total 121   2651

$$\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2} = \sqrt{\frac{2651}{6} - \left(\frac{121}{6}\right)^2}$$

= 5.93
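Both routes to the standard deviation can be sketched in Python. The result of 5.93 corresponds to dividing by n (the population form); the sketch also prints the n − 1 form from the standard library for comparison. Function names are illustrative.

```python
import math
import statistics


def sd_from_deviations(values):
    """sqrt( sum (x - mean)^2 / n )"""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def sd_from_squares(values):
    """Alternative method: sqrt( sum x^2 / n - (sum x / n)^2 )"""
    n = len(values)
    return math.sqrt(sum(x * x for x in values) / n - (sum(values) / n) ** 2)


observations = [14, 18, 17, 16, 25, 31]
sd_a = sd_from_deviations(observations)
sd_b = sd_from_squares(observations)

print(round(sd_a, 2), round(sd_b, 2))
print(round(statistics.pstdev(observations), 2))  # divides by n, same as above
print(round(statistics.stdev(observations), 2))   # divides by n - 1, slightly larger
```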
Example 2

The following table shows the part-time rate per hour of a given no. of laborers in the month of June
1997.

Rate per hr (x), Shs   No. of labourers (f)   fx     fx²
230                    7                      1610   370300
400                    6                      2400   960000
350                    2                      700    245000
450                    1                      450    202500
200                    8                      1600   320000
150                    11                     1650   247500
Total                  35                     8410   2345300

Calculate the standard deviation from the above table showing how the hourly payment was varying
from the respective mean

∴ standard deviation,

$$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2} = \sqrt{\frac{2345300}{35} - \left(\frac{8410}{35}\right)^2}$$

= 96.29

Example 3 – Grouped data

In business statistical work we usually encounter a set of grouped data. In order to determine the
standard deviation from such data, we use any of the three following methods
i. The long method
ii. The shorter method
iii. The coded method
The above methods are used in the following examples
Example 3.1

The quality controller in a given firm had an accurate record of all the iron bars produced in May
1997. The following data shows those records
i. Using long method
Bar lengths (cm)   No. of bars (f)   Class midpoint (x)   fx          fx²
201 – 250          25                225.5                5637.5      1271256.25
251 – 300          36                275.5                9918        2732409
301 – 350          49                325.5                15949.5     5191562.25
351 – 400          80                375.5                30040       11280020
401 – 450          51                425.5                21700.5     9233562.75
451 – 500          42                475.5                19971       9496210.50
501 – 550          30                525.5                15765       8284507.50
Total              313                                    118981.50   47489528.25

Calculate the standard deviation of the lengths of the bars

∴ standard deviation,

$$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2} = \sqrt{\frac{47489528.25}{313} - \left(\frac{118981.5}{313}\right)^2}$$

= 84.99 cm

ii. Using the shorter method


Bar lengths (cm)   No. of bars (f)   mid point (x)   x-A = d   fd      fd²
201 – 250          25                225.5           -150      -3750   562500
251 – 300          36                275.5           -100      -3600   360000
301 – 350          49                325.5           -50       -2450   122500
351 – 400          80                375.5 (A)       0         0       0
401 – 450          51                425.5           50        2550    127500
451 – 500          42                475.5           100       4200    420000
501 – 550          30                525.5           150       4500    675000
Total              313                                         1450    2267500

Calculate the standard deviation using the shorter method, where A = 375.5 is the assumed mean.

∴ Standard deviation,

$$\sigma = \sqrt{\frac{\sum fd^2}{\sum f} - \left(\frac{\sum fd}{\sum f}\right)^2} = \sqrt{\frac{2267500}{313} - \left(\frac{1450}{313}\right)^2}$$

= 84.99 cm
iii. Using coded method

Bar lengths (cm)   (f)   Mid-point (x)   x – A = d   d/c = u   fu    fu²
201 – 250          25    225.5           -150        -3        -75   225
251 – 300          36    275.5           -100        -2        -72   144
301 – 350          49    325.5           -50         -1        -49   49
351 – 400          80    375.5 (A)       0           0         0     0
401 – 450          51    425.5           50          1         51    51
451 – 500          42    475.5           100         2         84    168
501 – 550          30    525.5           150         3         90    270
Total              313                                         29    907

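c = 50, where c is an arbitrary non-zero number (here the class width). Try picking a different
figure, say 45; the answer should be the same.
Of the three methods, the coded method is the most preferable because it involves the smallest figures.
Standard deviation using the coded method:

$$\sigma = c \times \sqrt{\frac{\sum fu^2}{\sum f} - \left(\frac{\sum fu}{\sum f}\right)^2} = 50 \times \sqrt{\frac{907}{313} - \left(\frac{29}{313}\right)^2}$$

= 50 × 1.6997
= 84.99 cm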
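The three methods above can be checked against each other with a short script. The following is a minimal Python sketch, not part of the original notes: the function names (`sd_long`, `sd_shorter`, `sd_coded`) are illustrative assumptions, and the data are the mid-points and frequencies from the iron-bar tables.

```python
from math import sqrt

# Class mid-points and frequencies from the iron-bar tables above.
midpoints = [225.5, 275.5, 325.5, 375.5, 425.5, 475.5, 525.5]
freqs = [25, 36, 49, 80, 51, 42, 30]


def sd_long(xs, fs):
    """Long method: sqrt(sum(f*x^2)/sum(f) - (sum(f*x)/sum(f))^2)."""
    n = sum(fs)
    mean_x = sum(f * x for x, f in zip(xs, fs)) / n
    mean_x2 = sum(f * x * x for x, f in zip(xs, fs)) / n
    return sqrt(mean_x2 - mean_x ** 2)


def sd_shorter(xs, fs, assumed_mean):
    """Shorter method: same formula applied to d = x - A."""
    return sd_long([x - assumed_mean for x in xs], fs)


def sd_coded(xs, fs, assumed_mean, c):
    """Coded method: same formula applied to u = (x - A) / c, scaled back by c."""
    return c * sd_long([(x - assumed_mean) / c for x in xs], fs)


long_sd = sd_long(midpoints, freqs)
shorter_sd = sd_shorter(midpoints, freqs, assumed_mean=375.5)
coded_sd = sd_coded(midpoints, freqs, assumed_mean=375.5, c=50)
coded_sd_45 = sd_coded(midpoints, freqs, assumed_mean=375.5, c=45)

print(round(long_sd, 2), round(shorter_sd, 2), round(coded_sd, 2), round(coded_sd_45, 2))
```

All four results agree to rounding, which illustrates that neither the assumed mean A nor the choice of c changes the standard deviation.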
3. The Variance

The variance is the square of the standard deviation, i.e. variance = $\sigma^2$.


4. The semi-interquartile range
This is a measure of dispersion based on quartiles. A quartile is a value which lies at the
boundary between divisions when any given set of ordered data is divided into four equal divisions.
Each of these divisions carries 25% of all the observations.
The semi-interquartile range is a good measure of dispersion because it shows how the middle 50% of
the data are spread around the median.
Three quartiles are normally used, namely:
i. The lower quartile (first quartile, Q1), which bounds the lower 25% of the data
ii. The median (second quartile, Q2)
iii. The upper quartile (third quartile, Q3)
The semi-interquartile range,

$$\text{SIQR} = \frac{Q_3 - Q_1}{2}$$
Example 1

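The weights of 15 parcels recorded at the GPO were as follows (in kg):

16.2, 17, 20, 25 (Q1), 29, 32.2, 35.8, 36.8 (Q2), 40, 41, 42, 44 (Q3), 49, 52, 55

Required

Determine the semi-interquartile range for the above data

Solution

With n = 15, Q1 is the 4th value (25 kg) and Q3 is the 12th value (44 kg), so

$$\text{SIQR} = \frac{44 - 25}{2} = 9.5 \text{ kg}$$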
Example 2 (Grouped Data)

The following table shows the levels of retirement benefits given to a group of workers in a given
establishment.

Retirement benefits (£'000)   No. of retirees (f)   UCB    cf
20 – 29                       50                    29.5   50
30 – 39                       69                    39.5   119
40 – 49                       70                    49.5   189
50 – 59                       90                    59.5   279
60 – 69                       52                    69.5   331
70 – 79                       40                    79.5   371
80 – 89                       11                    89.5   382

Required

i. Determine the semi-interquartile range for the above data


ii. Determine the minimum value for the top ten percent. (10%)
iii. Determine the maximum value for the lower 40% of the retirees
Solution

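i. The lower quartile (Q1) lies on position

$$\frac{382 + 1}{4} = 95.75$$

∴ the value of Q1 $= 29.5 + \frac{95.75 - 50}{69} \times 10$

= 29.5 + 6.63
= 36.13 (i.e. £36,130)

The upper quartile (Q3) lies on position

$$\frac{3}{4}(382 + 1) = 287.25$$

∴ the value of Q3 $= 59.5 + \frac{287.25 - 279}{52} \times 10$

= 59.5 + 1.59
= 61.09

The semi-interquartile range $= \frac{Q_3 - Q_1}{2} = \frac{61.09 - 36.13}{2}$

= 12.48
= £12,480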
ii. The minimum value for the top 10% is the value below which the lower 90% of the retirees fall.
The position corresponding to the lower 90%

= 0.9 × 383

= 344.7
∴ the benefits (value) corresponding to the minimum value for the top 10%

$= 69.5 + \frac{344.7 - 331}{40} \times 10$
= 72.925
= £72,925
iii. The lower 40% corresponds to position

$$\frac{40}{100}(382 + 1) = 153.20$$

∴ retirement benefits corresponding to this position

$= 39.5 + \frac{153.2 - 119}{70} \times 10$
= 39.5 + 4.89
= 44.39
= £44,390
5. The 10th – 90th percentile range
This is a measure of dispersion which uses percentiles. A percentile is a value which separates one
division from the next when a given set of data is divided into 100 equal divisions.
This measure of dispersion is very important when calculating the coefficient of skewness (see
later)
Example

Using the above data for retirees, calculate the 10th – 90th percentile range.
The 10th percentile lies on position

$$\frac{10}{100}(382 + 1) = 0.1 \times 383 = 38.3$$

∴ the value corresponding to the 10th percentile

$= 19.5 + \frac{38.3}{50} \times 10$
= 19.5 + 7.66
= 27.16
The 90th percentile lies on position

$$\frac{90}{100}(382 + 1) = 344.7$$

∴ the value corresponding to the 90th percentile

$= 69.5 + \frac{344.7 - 331}{40} \times 10$
= 69.5 + 3.425
= 72.925
∴ the required value of the 10th – 90th percentile range = 72.925 – 27.16 = 45.765
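The quartile and percentile calculations for grouped data all follow one interpolation rule, which can be sketched in Python as below. This is an illustrative sketch rather than the author's own procedure: the function name `grouped_percentile` is an assumption, and it uses the position convention p/100 × (n + 1) adopted in the worked examples.

```python
# Upper class boundaries (UCB) and frequencies from the retirement-benefits table.
lower_boundary = 19.5  # lower boundary of the first class (20 - 29)
ucbs = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
freqs = [50, 69, 70, 90, 52, 40, 11]


def grouped_percentile(p, lower_boundary, ucbs, freqs):
    """Value at the p-th percentile (0 < p < 100) by linear interpolation.

    The position is taken as p/100 * (n + 1), as in the worked examples.
    """
    n = sum(freqs)
    position = p / 100 * (n + 1)
    cum_before = 0          # cumulative frequency below the current class
    lcb = lower_boundary    # lower boundary of the current class
    for ucb, f in zip(ucbs, freqs):
        if position <= cum_before + f:
            return lcb + (position - cum_before) / f * (ucb - lcb)
        cum_before += f
        lcb = ucb
    return ucbs[-1]  # position falls beyond the last class


q1 = grouped_percentile(25, lower_boundary, ucbs, freqs)
q3 = grouped_percentile(75, lower_boundary, ucbs, freqs)
p10 = grouped_percentile(10, lower_boundary, ucbs, freqs)
p90 = grouped_percentile(90, lower_boundary, ucbs, freqs)

semi_iqr = (q3 - q1) / 2
percentile_range = p90 - p10
print(round(q1, 2), round(q3, 2), round(semi_iqr, 2))
print(round(p10, 2), round(p90, 3), round(percentile_range, 3))
```

Because the script carries unrounded intermediate values, its results can differ in the last decimal place from hand calculations that round at each step.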

6. Relative measures of dispersion

Definition:

A relative measure of dispersion is a statistical value which may be used to compare variations in 2
or more samples.
Relative measures of dispersion are usually expressed as decimals or percentages and have no
other units
Example

The average distance covered by vehicles in a motor rally may be given as 2000 km with a standard
deviation of 5 km.
In another competition, a set of vehicles covered an average of 3000 km with a standard deviation of 10 km.
NB: The 2 standard deviations given above are referred to as absolute measures of dispersion. These
are actual deviations of the measurements from their respective means.
However, these are not very useful when comparing dispersions among samples.
Therefore, the following relative measures of dispersion are usually employed in order to compare
the degree of dispersion.
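i. Coefficient of mean deviation

$$= \frac{\text{Mean deviation}}{\text{Mean}}$$

ii. Coefficient of quartile deviation

$$= \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

Where Q1 = first quartile
Q3 = third quartile
iii. Coefficient of standard deviation

$$= \frac{\text{Standard deviation}}{\text{Mean}}$$

iv. Coefficient of variation

$$= \frac{\text{Standard deviation}}{\text{Mean}} \times 100$$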
Example (see information above)
First group of cars: mean = 2000 km
Standard deviation = 5 km

$$\therefore \text{C.O.V} = \frac{5}{2000} \times 100 = 0.25\%$$

Second group of cars: mean = 3000 km
Standard deviation = 10 km

$$\therefore \text{C.O.V} = \frac{10}{3000} \times 100 = 0.33\%$$
Conclusion

Since the coefficient of variation is greater in the 2nd group than in the 1st group, we may conclude
that the distances covered in the 1st group are much closer to their mean than those in the 2nd group.
Example 2

In a given firm located in the UK, the average salary of the employees is £3,500 with a standard
deviation of £150.
The same firm has a local branch in Kenya in which the average salary is Kshs 8,500 with a
standard deviation of Kshs 800.
Determine the coefficient of variation in the 2 firms and briefly comment on the degree of dispersion
of the salaries in the 2 firms.
First firm in the UK

$$\text{C.O.V} = \frac{150}{3500} \times 100 = 4.29\%$$

Second firm in Kenya

$$\text{C.O.V} = \frac{800}{8500} \times 100 = 9.4\%$$

In conclusion, since 4.29% < 9.4%, the salaries offered by the firm in the UK are much closer to
their mean than those offered by the local branch in Kenya.
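The two worked comparisons can be reproduced with a few lines of Python. This is only a sketch; the function name `coefficient_of_variation` and the variable names are illustrative assumptions.

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


# Motor rally example: distances in km.
rally_1 = coefficient_of_variation(mean=2000, sd=5)
rally_2 = coefficient_of_variation(mean=3000, sd=10)

# Salary example: the currencies differ, but the C.O.V. has no units.
uk = coefficient_of_variation(mean=3500, sd=150)
kenya = coefficient_of_variation(mean=8500, sd=800)

print(round(rally_1, 2), round(rally_2, 2), round(uk, 2), round(kenya, 1))
```

Since the coefficient of variation is unit-free, the salary comparison is valid even though one firm is measured in pounds and the other in Kenya shillings.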
COMBINED MEAN AND STANDARD DEVIATION
Sometimes we may need to combine 2 or more samples say A and B. It is therefore essential to
know the new mean and the new standard deviation of the combination of the samples.
Combined mean
Let m be the combined mean
Let x̄1 be the mean of the first sample
Let x̄2 be the mean of the second sample
Let n1 be the size of the 1st sample
Let n2 be the size of the 2nd sample
Let s1 be the standard deviation of the 1st sample
Let s2 be the standard deviation of the 2nd sample

$$m = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}$$
Example
A sample of 40 electric batteries gives a mean life span of 600 hours with a standard deviation of 20
hours.
Another sample of 50 electric batteries gives a mean life span of 520 hours with a standard deviation
of 30 hours.
If these two samples were combined and used in a given project simultaneously, determine the
combined mean for the larger sample and hence determine the combined (pooled) standard
deviation.

Size        Mean (x̄)         Standard deviation (s)
40 (n1)     600 hrs (x̄1)     20 hrs (s1)
50 (n2)     520 hrs (x̄2)     30 hrs (s2)

Combined mean

$$m = \frac{40(600) + 50(520)}{40 + 50} = \frac{50000}{90} = 555.56 \text{ hrs}$$

Combined standard deviation

$$s = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2 + n_1(m - \bar{x}_1)^2 + n_2(m - \bar{x}_2)^2}{n_1 + n_2}}$$

$$= \sqrt{\frac{40(20)^2 + 50(30)^2 + 40(555.56 - 600)^2 + 50(555.56 - 520)^2}{40 + 50}} = 47.52 \text{ hrs}$$
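The battery example can be worked through in Python as a check. The sketch below rests on an assumption the notes do not state: each sample standard deviation is treated as a population-form value (divisor n), so the combined variance is the weighted average of each sample's variance plus the squared distance of its mean from the combined mean. The function names are illustrative.

```python
from math import sqrt


def combined_mean(n1, mean1, n2, mean2):
    """Weighted mean of two sample means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


def combined_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined standard deviation, assuming sd1 and sd2 use divisor n."""
    m = combined_mean(n1, mean1, n2, mean2)
    total = (n1 * (sd1 ** 2 + (mean1 - m) ** 2)
             + n2 * (sd2 ** 2 + (mean2 - m) ** 2))
    return sqrt(total / (n1 + n2))


# Battery example: 40 batteries (600 hrs, sd 20) and 50 batteries (520 hrs, sd 30).
m = combined_mean(40, 600, 50, 520)
s = combined_sd(40, 600, 20, 50, 520, 30)
print(round(m, 2), round(s, 2))
```

The combined standard deviation is larger than either sample's own standard deviation because the two sample means are 80 hours apart, and that gap adds to the overall spread.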
CORRELATION

This is an important statistical concept which refers to the interrelationship or association between
variables.
The purpose of studying correlation is to establish whether a relationship exists, so that one can plan
and control the inputs (independent variables) and the output (dependent variables).
In business one may be interested in establishing whether there exists a relationship between the
i. Amount of fertilizer applied on a given farm and the resulting harvest
ii. Amount of experience one has and the corresponding performance
iii. Amount of money spent on advertisement and the expected income after sale of the
goods/services
There are two measures of the degree of correlation between two variables. These are
denoted by r and R.

(a) The coefficient of correlation, denoted by r, provides a measure of the strength of
association between two variables, one the dependent variable and the other the independent
variable. r can range between +1 and –1, for perfect positive correlation and perfect negative
correlation respectively, with zero indicating no linear relation. For perfect positive correlation,
y increases linearly as x increases.
(b) The rank correlation coefficient, denoted by R, is used to measure association between two sets
of ranked or ordered data. R can also vary from +1 (perfect positive rank correlation) to –1
(perfect negative rank correlation), with 0 or any number near zero representing no
correlation.

SCATTER GRAPHS
- A scatter graph is a graph which consists of points which have been plotted but are not
joined by line segments
- The pattern of the points reveals the type of relationship existing between the
variables
- The following sketch graphs will assist in the interpretation of scatter graphs.

Perfect positive correlation

[Scatter graph: dependent variable (y) against independent variable (x); the plotted points lie on a single rising straight line]

NB: The above pattern is referred to as perfect because the points can be represented by
a single straight line, e.g. when measuring the relationship between volume of sales and profits in a
company, the more the company sells the higher the profits.
Perfect negative correlation

[Scatter graph: quantity sold (y) against price (x); the plotted points lie on a single falling straight line]

This example considers the volume of sales in relation to the price: the cheaper the goods, the bigger the
sales.

High positive correlation

[Scatter graph: dependent variable (y) against independent variable (x); the plotted points cluster closely around a rising line]

High negative correlation

[Scatter graph: quantity sold (y) against price (x); the plotted points cluster closely around a falling line]

No correlation

[Scatter graph: y (0 to 600) against x (10 to 50); the plotted points are spread evenly with no visible pattern]
Spurious correlations
- In some rare situations, plotting the data for x and y may show either positive or negative
correlation, yet when the data are examined in real life there is no convincing evidence
that such a relationship exists. The relationship therefore exists only in theory and is
referred to as spurious or nonsense correlation, e.g. when high pass rates of students show
a high correlation with increased accidents.
Correlation coefficient
- These are numerical measures of the correlation existing between the dependent and the
independent variables
- These are better measures of correlation than scatter graphs (diagrams)
- The range for correlation coefficients lies between +1 and –1. A correlation coefficient
of +1 implies that there is perfect positive correlation. A value of –1 shows that there is
perfect negative correlation. A value of 0 implies no correlation at all
- The following chart will be found useful in interpreting correlation coefficients

 1.0            Perfect positive correlation
 0.5 to 1.0     High positive correlation
 0 to 0.5       Low positive correlation
 0              No correlation
-0.5 to 0       Low negative correlation
-1.0 to -0.5    High negative correlation
-1.0            Perfect negative correlation

Two types of correlation coefficient are normally used, namely the product moment coefficient (r)
and the rank correlation coefficient (R).
Product Moment Coefficient (r)
It gives an indication of the strength of the linear relationship between two variables.

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Note that this formula can be rearranged into different forms, but the result is always the
same.
Example
The following data was observed and it is required to establish if there exists a relationship between
the two.
X 15 24 25 30 35 40 45 65 70 75
Y 60 45 50 35 42 46 28 20 22 15
Solution
Compute the product moment coefficient of correlation (r)
X Y X2 Y2 XY
15 60 225 3,600 900
24 45 576 2,025 1,080
25 50 625 2,500 1,250
30 35 900 1,225 1,050
35 42 1,225 1,764 1,470
40 46 1,600 2,116 1,840
45 28 2,025 784 1,260
65 20 4,225 400 1,300
70 22 4,900 484 1,540
75 15 5,625 225 1,125

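From the table, n = 10, ΣX = 424, ΣY = 363, ΣX² = 21,926, ΣY² = 15,123 and ΣXY = 12,815.

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

$$r = \frac{10(12815) - (424)(363)}{\sqrt{\left[10(21926) - 424^2\right]\left[10(15123) - 363^2\right]}} = \frac{-25762}{\sqrt{39484 \times 19461}}$$

= –0.93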
The correlation coefficient thus indicates a strong negative linear association between the two
variables.
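The product moment calculation can be verified with a short Python sketch. The function name `pearson_r` is an illustrative assumption; the formula is the sums-based form of r, applied to the X and Y values of the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


x = [15, 24, 25, 30, 35, 40, 45, 65, 70, 75]
y = [60, 45, 50, 35, 42, 46, 28, 20, 22, 15]
r = pearson_r(x, y)
print(round(r, 2))
```

The result is negative and close to –1, in line with the strong negative linear association described above.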

Interpretation of r – Problems in interpreting r values
NOTE:
- A high value of r (+0.9 or –0.9) only shows a strong association between the two variables; it
does not imply that there is a causal relationship, i.e. that a change in one variable causes a change
in the other. It is possible to find two variables which produce a high calculated r yet do not have a
causal relationship. This is known as spurious or nonsense correlation, e.g. high pass rates in QT
in Kenya and increased inflation in Asian countries.
- Also note that a low correlation coefficient does not imply a lack of relationship between the
variables, only a lack of linear relationship, i.e. there could exist a curvilinear relationship.
- A further problem in interpretation arises from the fact that the r value here measures the
relationship between a single independent variable and the dependent variable, whereas a particular
variable may be dependent on several independent variables (e.g. crop yield may be dependent
on fertilizer used, soil exhaustion, soil acidity level, season of the year, type of seed etc.), in
which case multiple correlation should be used instead.
The Rank Correlation Coefficient (R)
Also known as the Spearman rank correlation coefficient, its purpose is to establish whether there is
any form of association between two variables where the variables are arranged in a ranked form.

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

Where d = difference between the pairs of ranked values
n = number of pairs of rankings
Example
A group of 8 accountancy students are tested in Quantitative Techniques and Law II. Their rankings
in the two tests were:

Student   Q. T. ranking   Law II ranking   d     d²
A         2               3                -1    1
B         7               6                1     1
C         6               4                2     4
D         1               2                -1    1
E         4               5                -1    1
F         3               1                2     4
G         5               8                -3    9
H         8               7                1     1
                                           Σd² = 22

d = Q. T. ranking – Law II ranking

$$R = 1 - \frac{6 \times 22}{8(8^2 - 1)} = 1 - \frac{132}{504}$$

= 0.74
Thus, we conclude that there is reasonable agreement between the students' performances in the two
types of tests.
NOTE: In this example, if we were given the actual marks we could compute r instead. Like r, R
varies between +1 and –1.
Tied Rankings
A slight adjustment to the formula is made if some students tie and have the same ranking. The
adjustment is

$$\frac{t^3 - t}{12}$$

where t = number of tied rankings. The adjusted formula becomes

$$R = 1 - \frac{6\left(\sum d^2 + \frac{t^3 - t}{12}\right)}{n(n^2 - 1)}$$
Example
Assume that in our previous example students E and F achieved equal marks in Q. T. and were given
joint 3rd place.
Solution
Student   Q. T. ranking   Law II ranking   d       d²
A         2               3                -1      1
B         7               6                1       1
C         6               4                2       4
D         1               2                -1      1
E         3½              5                -1½     2¼
F         3½              1                2½      6¼
G         5               8                -3      9
H         8               7                1       1
                                           Σd² = 25½

$$R = 1 - \frac{6\left(25.5 + \frac{2^3 - 2}{12}\right)}{8(8^2 - 1)} = 1 - \frac{6 \times 26}{504}$$

= 0.69
NOTE: It is conventional to show the shared rankings as above, i.e. E and F take up the 3rd and
4th ranks, which are shared between the two as 3½ each.
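Both the basic and the tie-adjusted rank correlation formulas can be sketched in one Python function. This is an illustrative sketch: the function name `rank_correlation` and the `tie_group_sizes` parameter are assumptions, and the adjustment term (t³ – t)/12 is added once for each group of t tied ranks.

```python
def rank_correlation(ranks_a, ranks_b, tie_group_sizes=()):
    """Rank correlation R = 1 - 6(sum d^2 + adjustment) / (n(n^2 - 1)).

    tie_group_sizes lists the size t of each group of tied ranks;
    each group adds (t^3 - t) / 12 to the sum of squared differences.
    """
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    adjustment = sum((t ** 3 - t) / 12 for t in tie_group_sizes)
    return 1 - 6 * (sum_d2 + adjustment) / (n * (n ** 2 - 1))


# Students A to H: rankings in Q. T. and Law II.
law = [3, 6, 4, 2, 5, 1, 8, 7]
qt = [2, 7, 6, 1, 4, 3, 5, 8]
qt_tied = [2, 7, 6, 1, 3.5, 3.5, 5, 8]  # E and F share 3rd and 4th place

r_plain = rank_correlation(qt, law)
r_tied = rank_correlation(qt_tied, law, tie_group_sizes=[2])
print(round(r_plain, 2), round(r_tied, 2))
```

With the rankings as tabulated, the untied case gives about 0.74 and the tied case about 0.69, so the tie lowers the measured agreement slightly.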
ii. Coefficient of Determination
This refers to the ratio of the explained variation to the total variation and is used to measure the
strength of the linear relationship. The stronger the linear relationship, the closer the ratio will be to
one. For simple linear regression this ratio equals r².

$$\text{Coefficient of determination} = \frac{\text{Explained variation}}{\text{Total variation}}$$
Example (Rank Correlation Coefficient)
In a beauty competition, 2 assessors were asked to rank the 10 contestants using their professional
assessment skills. The results obtained are shown in the table below.

Contestant   1st assessor   2nd assessor
A            6              5
B            1              3
C            3              4
D            7              6
E            8              7
F            2              1
G            4              8
H            5              2
J            10             9
K            9              10
REQUIRED
Calculate the rank correlation coefficient and hence comment briefly on the value obtained
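Contestant   1st assessor   2nd assessor   d     d²
A            6              5              1     1
B            1              3              -2    4
C            3              4              -1    1
D            7              6              1     1
E            8              7              1     1
F            2              1              1     1
G            4              8              -4    16
H            5              2              3     9
J            10             9              1     1
K            9              10             -1    1
                                           Σd² = 36
∴ The rank correlation coefficient R

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 36}{10(10^2 - 1)} = 1 - \frac{216}{990}$$

= 1 – 0.22
= 0.78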

Comment: since the correlation is 0.78 it implies that there is high positive correlation between the
ranks awarded to the contestants. 0.78 > 0 and 0.78 > 0.5

Example
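Contestant   1st assessor   2nd assessor   d      d²
A            1              2              -1     1
B            5 (5.5)        3              2.5    6.25
C            3              4              -1     1
D            2              1              1      1
E            4              5              -1     1
F            5 (5.5)        6.5            -1     1
G            7              6.5            0.5    0.25
H            8              8              0      0
                                           Σd² = 11.5
Contestants B and F tie for 5th and 6th place with the 1st assessor, so each takes the rank 5.5 shown in brackets.
Required: Compute the rank correlation coefficient

$$\therefore R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 11.5}{8(8^2 - 1)}$$

$= 1 - \frac{69}{504}$
= 1 – 0.14
= 0.86
This implies high positive correlation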
Example (Rank Correlation Coefficient)
Sometimes numerical data for quantifiable variables may be given, from which a rank
correlation coefficient is to be worked out.
In such a situation, the rank correlation coefficient is determined after the given values have
been converted into ranks. See the following example:

Candidates   Math   Rank      Accounts   Rank      d      d²
P            92     1         67         5         -4     16
Q            82     3         88         1         2      4
R            60     5 (5.5)   58         7 (7.5)   -2     4
S            87     2         80         2         0      0
T            72     4         69         4         0      0
U            60     5 (5.5)   77         3         2.5    6.25
V            52     8         58         7 (7.5)   0.5    0.25
W            50     9         60         6         3      9
X            47     10        32         10        0      0
Y            59     7         54         9         -2     4
                                                   Σd² = 43.5

$$\therefore \text{Rank correlation } R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 43.5}{10(10^2 - 1)} = 1 - \frac{261}{990}$$

= 0.74 (High positive correlation between mathematics marks and accounts marks)
Example
(Product moment correlation)
The following data was obtained during a social survey conducted in a given urban area regarding
the annual income of given families and the corresponding expenditures.

Family   (x) Annual income (£'000)   (y) Annual expenditure (£'000)   xy        x²        y²
A        420                         360                              151200    176400    129600
B        380                         390                              148200    144400    152100
C        520                         510                              265200    270400    260100
D        610                         500                              305000    372100    250000
E        400                         360                              144000    160000    129600
F        320                         290                              92800     102400    84100
G        280                         250                              70000     78400     62500
H        410                         380                              155800    168100    144400
J        380                         240                              91200     144400    57600
K        300                         270                              81000     90000     72900
Total    4020                        3550                             1504400   1706600   1342900
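Required
Calculate the product moment correlation coefficient and briefly comment on the value obtained.
The product moment correlation, using the form of the formula rearranged in terms of the means,

$$r = \frac{\sum xy - n\bar{x}\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}}$$

Workings:

$$\bar{x} = \frac{4020}{10} = 402, \qquad \bar{y} = \frac{3550}{10} = 355$$

$$r = \frac{1504400 - 10(402)(355)}{\sqrt{\left(1706600 - 10(402)^2\right)\left(1342900 - 10(355)^2\right)}} = \frac{77300}{\sqrt{90560 \times 82650}}$$

= 0.89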
Comment: The value obtained 0.89 suggests that the correlation between annual income and annual
expenditure is high and positive. This implies that the more one earns the more one spends.
4.2 REGRESSION
- This is a concept which refers to the changes which occur in the dependent variable as a
result of changes occurring in the independent variable.
- Knowledge of regression is particularly useful in business statistics, where it is
necessary to consider the corresponding changes in dependent variables whenever
independent variables change.
- It should be noted that most business activities involve a dependent variable and one or
more independent variables. Therefore, knowledge of regression enables a business
statistician to predict or estimate the expected value of a dependent variable when given
an independent variable, e.g. consider the above example of annual incomes and annual
expenditures. Using regression techniques one can determine the estimated
expenditure of a given family if the annual income is known, and vice versa.
- The general equation used in simple regression analysis is as follows
y = a + bx
Where y = Dependent variable
a = Intercept on the y axis (constant)
b = Slope of the line
x = Independent variable
i. The determination of the regression equation such as given above is normally done
by using a technique known as "the method of least squares".
Regression equation of y on x i.e. y = a + bx
[Scatter graph: plotted points of y against x with the line of best fit drawn through them]

The following set of equations, known as the normal equations, is used to determine the
equation of the above regression line when given a set of data.
Σy = an + bΣx
Σxy = aΣx + bΣx²
Where Σy = sum of the y values
Σxy = sum of the products of x and y
Σx = sum of the x values
Σx² = sum of the squares of the x values
n = number of pairs of observations
a = the intercept on the y axis
b = slope (gradient) of the regression line of y on x
NB: The above regression line is normally used in one way only, i.e. it is used to estimate the y
values when the x values are given.
Regression line of x on y i.e. x = a + by
The fact that regression lines can only be used in one way leads to what is known as the regression
paradox.
This means that regression lines are not ordinary mathematical line graphs which may be used to
estimate both x and y.
Therefore, one has to be careful when using regression lines: to estimate x from y, it is necessary to
develop a separate regression equation of x on y before doing the estimation.
The following example will illustrate how regression lines are used
Example
An investment company advertised the sale of pieces of land at different prices. The following table
shows the pieces of land, their acreage and costs.

Piece of land   (x) Acreage (hectares)   (y) Cost (£'000)   xy            x²
A               2.3                      230                529           5.29
B               1.7                      150                255           2.89
C               4.2                      450                1890          17.64
D               3.3                      310                1023          10.89
E               5.2                      550                2860          27.04
F               6.0                      590                3540          36
G               7.3                      740                5402          53.29
H               8.4                      850                7140          70.56
J               5.6                      530                2968          31.36
                Σx = 44.0                Σy = 4400          Σxy = 25607   Σx² = 254.96

Required
Determine the regression equations of
i. y on x and hence estimate the cost of a piece of land with 4.5 hectares
ii. Estimate the expected acreage if the piece of land costs £900,000
Σy = an + bΣx
Σxy = aΣx + bΣx²
By substituting of the appropriate values in the above equations we have
4400 = 9a + 44b ……... (i)
25607 = 44a + 254.96b ……...(ii)
By multiplying equation …. (i) by 44 and equation …… (ii) by 9 we have
193600 = 396a + 1936b ……... (iii)
230463 = 396a + 2294.64b ……...(iv)
By subtraction of equation …. (iii) from equation …… (iv) we have
36863 = 358.64b
102.78 = b
by substituting for b in ……... (i)
4400 = 9a + 44(102.78)
4400 – 4522.32 = 9a
–122.32 = 9a
-13.59 = a
Therefore, the equation of the regression line of y on x is
y = –13.59 + 102.78x
When the acreage is 4.5 hectares, the cost
(y) = –13.59 + (102.78 × 4.5)
= 448.92
= £448,920
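The elimination steps above amount to solving two simultaneous equations in a and b. A minimal Python sketch using Cramer's rule is given below; the function name `solve_normal_equations` is an illustrative assumption, and the inputs are the column totals from the land table.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_xy, sum_x2):
    """Solve  sum_y = a*n + b*sum_x  and  sum_xy = a*sum_x + b*sum_x2  for (a, b)."""
    det = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


# Totals from the land example (9 pieces of land).
a, b = solve_normal_equations(n=9, sum_x=44.0, sum_y=4400, sum_xy=25607, sum_x2=254.96)

# Estimated cost (in £'000) of a piece of land of 4.5 hectares.
cost_4_5 = a + b * 4.5
print(round(a, 2), round(b, 2), round(cost_4_5, 2))
```

Because the script does not round b before computing a, its coefficients differ slightly in the second decimal place from the hand calculation, but the estimated cost agrees to within a few pounds.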
Note that
where the regression equation is given by
y = a + bx
where a is the intercept on the y axis,
b is the slope of the line or regression coefficient and
n is the sample size,
then,

$$\text{Intercept } a = \frac{\sum y - b\sum x}{n}$$

$$\text{Slope } b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
Example
The calculations for our sample size n = 10 are given below. The linear regression model is
y = a + bx
Table
Distance x (miles)   Time y (mins)   xy            x²            y²
3.5                  16              56.0          12.25         256
2.4                  13              31.2          5.76          169
4.9                  19              93.1          24.01         361
4.2                  18              75.6          17.64         324
3.0                  12              36.0          9.0           144
1.3                  11              14.3          1.69          121
1.0                  8               8.0           1.0           64
3.0                  14              42.0          9.0           196
1.5                  9               13.5          2.25          81
4.1                  16              65.6          16.81         256
Σx = 28.9            Σy = 136        Σxy = 435.3   Σx² = 99.41   Σy² = 1972

$$\text{The slope } b = \frac{10 \times 435.3 - 28.9 \times 136}{10 \times 99.41 - 28.9^2} = \frac{422.6}{158.89}$$

= 2.66

$$\text{and the intercept } a = \frac{136 - 2.66 \times 28.9}{10}$$

= 5.91
We now insert these values in the linear model giving
y = 5.91 + 2.66x
or
Delivery time (mins) = 5.91 + 2.66 (delivery distance in miles)
The slope of the regression line is the estimated number of minutes per mile needed for a delivery.
The intercept is the estimated time to prepare for the journey and to deliver the goods, that is the
time needed for each journey other than the actual traveling time.
PREDICTION WITHIN THE RANGE OF SAMPLE DATA
We can use the linear regression model to predict the mean of the dependent variable for any given
value of the independent variable within the range of the sample data.
For example, if the sample model is given by
Time (min) = 5.91 + 2.66 (distance in miles)
then if the distance is 4.0 miles, our estimated mean time is
ŷ = 5.91 + 2.66 × 4.0 = 16.6 minutes
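The delivery example can be reproduced from the raw data with the closed-form slope and intercept formulas. The Python sketch below is illustrative only: `fit_line` and `predict` are assumed names, and the range check in `predict` is one possible way to enforce the heading's restriction to the range of the sample data.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def predict(x, a, b, x_min, x_max):
    """Estimate y at x, refusing to extrapolate outside the sampled range."""
    if not x_min <= x <= x_max:
        raise ValueError("x lies outside the range of the sample data")
    return a + b * x


distance = [3.5, 2.4, 4.9, 4.2, 3.0, 1.3, 1.0, 3.0, 1.5, 4.1]  # miles
time = [16, 13, 19, 18, 12, 11, 8, 14, 9, 16]                   # minutes

a, b = fit_line(distance, time)
estimate = predict(4.0, a, b, min(distance), max(distance))
print(round(a, 2), round(b, 2), round(estimate, 1))
```

A distance such as 10 miles would raise an error here, because the sample only covers deliveries of 1.0 to 4.9 miles and the fitted line says nothing reliable beyond that.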
Multiple Linear Regression Models
There are situations in which more than one factor influences the dependent variable.
Example
Cost of production per week in a large department depends on several factors:
i. Total number of hours worked
ii. Raw material used during the week
iii. Total number of items produced during the week
iv. Number of hours spent on repair and maintenance
It is sensible to use all the identified factors to predict department costs.
A scatter diagram will not give the relationship between the various factors and total costs.
The linear model for multiple linear regression (the line of best fit) is of the type
y = α + b1x1 + b2x2 + ………… + bnxn
We assume that errors or residuals are negligible.
In order to choose between models, we examine the values of the multiple correlation coefficient
r and the standard deviation of the residuals σ.
A model which describes well the relationship between y and the x's has a multiple correlation
coefficient r close to ±1 and a small value of σ.
Example
Odino Chemicals Limited is aware that its power costs are a semi-variable cost, and over the last six
months these costs have shown the following relationship with a standard measure of output.

Month   Output (standard units)   Total power costs (£'000)
1       12                        6.2
2       18                        8.0
3       19                        8.6
4       20                        10.4
5       24                        10.2
6       30                        12.4
Required
(a) Using the method of least squares, determine an appropriate linear relationship between
total power costs and output.
(b) If total power costs are related to both output and time (as measured by the number of the
month), the following least squares regression equation is obtained
Power costs = 4.42 + (0.82) output + (0.10) month
where the regression coefficients (i.e. 0.82 and 0.10) have t values 2.64 and 0.60
respectively and the coefficient of multiple correlation amounts to 0.976.
Compare the relative merits of this fitted relationship with the one you determined in (a).
Explain (without doing any further analysis) how you might use the data to forecast total
power costs in month seven.
Solution
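a)
Output (x)   Power costs (y)   x²           y²             xy
12           6.2               144          38.44          74.40
18           8.0               324          64.00          144.00
19           8.6               361          73.96          163.40
20           10.4              400          108.16         208.00
24           10.2              576          104.04         244.80
30           12.4              900          153.76         372.00
Σx = 123     Σy = 55.8         Σx² = 2705   Σy² = 542.36   Σxy = 1,206.60

$$b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{6 \times 1206.60 - 123 \times 55.8}{6 \times 2705 - 123^2} = \frac{376.2}{1101} = 0.342$$

$$a = \frac{1}{n}\left(\sum y - b\sum x\right) = \frac{1}{6}\left(55.8 - 0.342 \times 123\right) = 2.29$$

∴ (Power costs) = 2.29 + 0.342 (output)
b) For the linear regression calculated above, the coefficient of correlation r is

$$r = \frac{n\sum xy - \sum x\sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

$$= \frac{376.2}{\sqrt{1101 \times \left(6 \times 542.36 - 55.8^2\right)}} = \frac{376.2}{\sqrt{1101 \times 140.52}}$$

= 0.96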

This shows a strong correlation between power costs and output. The multiple correlation when both
output and time are considered at the same time is 0.976.
We observe that there has been very little increase in r, which means that inclusion of the time variable
does not improve the correlation significantly.
The t value for the time variable is only 0.60, which is insignificant compared with the t value of 2.64 for
the output variable.
In fact, if we work out the correlation between output and time, there will be a high correlation. Hence
there is no necessity of taking both variables. Inclusion of time does improve the correlation
coefficient, but by a very small amount.
To forecast, we can use linear regression analysis to find the linear relationship between output and
time, i.e.
Month   Output
1       12
2       18
3       19
4       20
5       24
6       30
The values of b and a turn out to be 3.11 and 9.6, i.e. the relationship is of the form
Output = 9.6 + 3.11 × month
From this equation, the forecast for the 7th month is
Output = 9.6 + 3.11 × 7
= 9.6 + 21.77
= 31.37 units
Using the equation, Power costs = 2.29 + 0.34 × output
= 2.29 + 0.34 × 31.37
= 2.29 + 10.67
= 12.96 i.e. £12,960
Non-Linear Relationships

If the scatter diagram and the correlation coefficient do not indicate a linear relationship, then the
relationship may be non-linear.
Two such relationships are of particular interest: the exponential model and the geometric model.

Both of these can be reduced to a linear model. Simple or multiple linear regression methods are then
used to determine the values of the coefficients.
i. Exponential model

$$y = ab^x$$

Take logs of both sides

log y = log a + log b^x
log y = log a + x log b
Let log y = Y and log a = A and log b = B

Thus, we get Y = A + Bx. This is a linear regression model.

ii. Geometric model

$$y = ax^b$$

Using the same technique as above

log y = log a + b log x
Y = A + bX
Where Y = log y
A = log a
X = log x
Using the linear regression technique (the method of least squares), it is possible to calculate the values
of a and b
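The log transformations above can be sketched in Python as follows. The notes give no data for these models, so the data here are hypothetical and generated exactly from y = 2 × 1.5^x and y = 3x², simply to show that the fitted coefficients are recovered. The function names are illustrative assumptions, and base-10 logs are used (any base would do).

```python
from math import log10


def fit_line(xs, ys):
    """Least squares intercept and slope for ys = intercept + slope * xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


def fit_exponential(xs, ys):
    """Fit y = a * b**x by regressing log y on x; returns (a, b)."""
    big_a, big_b = fit_line(xs, [log10(y) for y in ys])
    return 10 ** big_a, 10 ** big_b


def fit_geometric(xs, ys):
    """Fit y = a * x**b by regressing log y on log x; returns (a, b)."""
    big_a, b = fit_line([log10(x) for x in xs], [log10(y) for y in ys])
    return 10 ** big_a, b


# Hypothetical data, not from the notes.
xs = [1, 2, 3, 4, 5]
exp_ys = [2 * 1.5 ** x for x in xs]
geo_ys = [3 * x ** 2 for x in xs]

a_exp, b_exp = fit_exponential(xs, exp_ys)
a_geo, b_geo = fit_geometric(xs, geo_ys)
print(round(a_exp, 4), round(b_exp, 4), round(a_geo, 4), round(b_geo, 4))
```

Note that in the exponential model both A and B must be converted back with antilogs, whereas in the geometric model only A is converted because b is already the slope of the transformed line.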

SKEWNESS
This is a concept which is commonly used in statistical decision making. It refers to the degree to
which a given frequency curve deviates from the symmetrical normal distribution.
There are 2 types of skewness, namely
i. Positive skewness
ii. Negative skewness
1. Positive Skewness
This is the tendency of a given frequency curve to lean towards the left. In a positively skewed
distribution, the long tail extends to the right.
In this distribution one should note the following
i. The mean is usually bigger than the mode and median
ii. The median always occurs between the mode and mean
iii. There are more observations below the mean than above the mean
This frequency distribution, as represented in the positively skewed distribution curve, is characteristic
of the age distributions in developing countries
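[Figure: three frequency curves. Positively skewed frequency curve: mode, median and mean in that order from left to right, with the long tail to the right. Normal distribution: symmetrical. Negatively skewed frequency curve: mean, median and mode in that order from left to right, with the long tail to the left.]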

2. Negative Skewness
This is an asymmetrical curve in which the long tail extends to the left
NB: This frequency curve is characteristic of the age distribution in
developed countries
i. The mode is usually bigger than the mean and median
ii. The median usually occurs in between the mean and mode
iii. The number of observations above the mean is usually more than the number below the mean
(see the shaded region)
MEASURES OF SKEWNESS
These are numerical values which assist in evaluating the degree of deviation of a frequency
distribution from the normal distribution.
The following are the commonly used measures of skewness.
1. Coefficient of skewness

$$= \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$

2. Coefficient of skewness

$$= \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$

NB: These 2 coefficients above are also known as Pearsonian measures of skewness.
3. Quartile coefficient of skewness

$$= \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$

Where Q1 = 1st quartile
Q2 = 2nd quartile
Q3 = 3rd quartile
NB: The Pearsonian coefficients of skewness usually range between –3 and +3, which are the
extreme values. A negative coefficient indicates that a given frequency distribution is negatively
skewed, and a value close to –3 indicates that the amount of skewness is quite high.
Similarly, a positive coefficient indicates that the frequency distribution is positively skewed, and a
value close to +3 indicates that the amount of skewness, i.e. deviation from the normal distribution,
is quite high.
Example
The following information was obtained from an NGO which was giving small loans to some small-
scale business enterprises in 1996. The loans are in thousands of Kshs.

Loans      Units (f)   Midpoint (x)   x – A = d   d/c = u   fu     fu²    UCB    cf
46 – 50    32          48             -15         -3        -96    288    50.5   32
51 – 55    62          53             -10         -2        -124   248    55.5   94
56 – 60    97          58             -5          -1        -97    97     60.5   191
61 – 65    120         63 (A)         0           0         0      0      65.5   311
66 – 70    92          68             5           +1        92     92     70.5   403
71 – 75    83          73             10          +2        166    332    75.5   486
76 – 80    52          78             15          +3        156    468    80.5   538
81 – 85    40          83             20          +4        160    640    85.5   578
86 – 90    21          88             25          +5        105    525    90.5   599
91 – 95    11          93             30          +6        66     396    95.5   610
Total      610                                              428    3086
Required
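Using the Pearsonian measure of skewness, calculate the coefficient of skewness and hence
comment briefly on the nature of the distribution of the loans.

$$\text{Arithmetic mean} = \text{Assumed mean} + c \times \frac{\sum fu}{\sum f}$$

$= 63 + 5 \times \frac{428}{610}$
= 66.51

It is very important to note that the method of obtaining the arithmetic mean (or any other statistic)
by subtracting the assumed mean (A) from x and then dividing by c can be a bit confusing. If this is
the case, then just use the straightforward method of:

$$\text{Arithmetic mean} = \frac{\sum fx}{\sum f}$$

$$\text{The standard deviation} = c \times \sqrt{\frac{\sum fu^2}{\sum f} - \left(\frac{\sum fu}{\sum f}\right)^2}$$

$= 5 \times \sqrt{\frac{3086}{610} - \left(\frac{428}{610}\right)^2}$
= 10.68

The position of the median

$$= \frac{n + 1}{2} = \frac{610 + 1}{2} = 305.5$$

$$\text{Median} = 60.5 + \frac{305.5 - 191}{120} \times 5$$

$= 60.5 + \frac{114.5}{120} \times 5$
Median = 65.27
Therefore, the Pearsonian coefficient

$$= \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} = \frac{3(66.51 - 65.27)}{10.68}$$

= 0.348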
Comment
The coefficient of skewness obtained suggests that the frequency distribution of the loans given was
positively skewed
This is because the coefficient itself is positive. But the skewness is not very high implying the
degree of deviation of the frequency distribution from the normal distribution is small
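The whole loans calculation (coded mean, coded standard deviation, interpolated median and the Pearsonian coefficient 3(mean – median)/standard deviation) can be checked with the Python sketch below. The variable names are illustrative assumptions; the data are the mid-points and frequencies from the loans table.

```python
from math import sqrt

# Mid-points and frequencies from the loans table (loans in thousands of Kshs).
midpoints = [48, 53, 58, 63, 68, 73, 78, 83, 88, 93]
freqs = [32, 62, 97, 120, 92, 83, 52, 40, 21, 11]
assumed_mean, c = 63, 5  # A and the class width used for coding

n = sum(freqs)
us = [(x - assumed_mean) / c for x in midpoints]
sum_fu = sum(f * u for f, u in zip(freqs, us))
sum_fu2 = sum(f * u * u for f, u in zip(freqs, us))

mean = assumed_mean + c * sum_fu / n
sd = c * sqrt(sum_fu2 / n - (sum_fu / n) ** 2)

# Median by interpolation within the class containing position (n + 1) / 2.
position = (n + 1) / 2
cum_before = 0
median = None
for x, f in zip(midpoints, freqs):
    lower_boundary = x - c / 2  # e.g. 60.5 for the class 61 - 65
    if position <= cum_before + f:
        median = lower_boundary + (position - cum_before) / f * c
        break
    cum_before += f

skewness = 3 * (mean - median) / sd
print(round(mean, 2), round(sd, 2), round(median, 2), round(skewness, 2))
```

The coefficient comes out at roughly 0.35 when intermediate values are not rounded, which matches the small positive skewness described in the comment above.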
INDEX NUMBERS
An index number measures how much a variable changes over time. An index number is an attempt to summarize a whole mass of data into one figure. The single figure shows how one year differs from another year.

It is a statistical device used to measure the change in the level of prices, wages, output and other variables at given times, relative to their level at an earlier time which is taken as the base for comparison purposes.

We calculate the index number by finding the ratio of the current value to the base year value. The following are the classifications of index numbers:

(i) Price Indices: This type of index is the most frequently used. Price indices consider the prices of a commodity or a group of commodities and compare changes in prices from one period to another, or differences in price from one place to another. For example, the familiar Consumer Price Index, which measures overall price changes of consumer commodities and services, is used to define the cost of living.
(ii) Quantity Indices: The major focus of consideration and comparison in these indices
are the quantities either of a single commodity or a group of commodities. For
example, the focus may be to understand the changes in the quantity of paddy
production in India over different time periods. For this purpose, a single
commodity’s quantity index will have to be constructed. Alternatively, the focus
may be to understand the changes in food grain production in India, in this case all
commodities which are categorized under food grains will be considered while
constructing the quantity index.
(iii) Value Indices: Value indices measure the combined effects of price and quantity changes. In many situations neither a price index nor a quantity index is enough for the purpose of a comparison. For example, an index may be needed to compare the cost of living for a specific group of persons in a city or a region. Here a comparison of the expenditure of a typical family of the group is more relevant. Since this involves comparing expenditure, it is the value index which will have to be constructed. These indices are useful in production decisions, because they avoid the effects of inflation.
The following are the popular index number formulas:

(i) Laspeyre's index
(ii) Paasche's index
(iii) Fisher's index
(iv) Tornqvist index

A simple price index $= \frac{\sum P_n}{\sum P_0} \times 100$

A simple quantity index $= \frac{\sum Q_n}{\sum Q_0} \times 100$

Where $P_n$ is the price of a commodity in the current year (the year for which the price index is to be calculated) and $P_0$ is the price of the same commodity in the base year (the year used for comparison purposes).

Similarly, $Q_n$ and $Q_0$ are the quantities in the current year and in the base year respectively.

CHANGING THE BASE OF THE INDEX

For comparison purposes if two series have different base years, it is difficult to compare them
directly. In such cases, it is necessary to change the base year of one of the series (or both) so that
both have the same base. It is also necessary to keep the index relevant to current conditions hence
the need to change the base from time to time.

Example;
Year 1985 1986 1987 1988 1989 1990 1991 1992
Price index 100 104 108 109 112 120 125 140

Suppose we wish to change the base year to 1989

We recalculate each index by expressing it as a percentage of 1989

Year                    Previous index    Recalculated index
1985                         100          100/112 × 100 = 89.3
1986                         104          104/112 × 100 = 92.9
1987                         108          108/112 × 100 = 96.4
1988                         109          109/112 × 100 = 97.3
1989 (new base year)         112          112/112 × 100 = 100
1990                         120          120/112 × 100 = 107.1
1991                         125          125/112 × 100 = 111.6
1992                         140          140/112 × 100 = 125.0

When changing the base year, it is advisable to update the weights used in the base year.
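The rebasing step can be sketched in a few lines. The helper name `rebase` and the dictionary layout are illustrative assumptions; the figures are the ones from the example above.

```python
def rebase(index_by_year, new_base_year):
    """Express every index value as a percentage of the new base year's value."""
    base_value = index_by_year[new_base_year]
    return {year: round(value / base_value * 100, 1)
            for year, value in index_by_year.items()}

price_index = {1985: 100, 1986: 104, 1987: 108, 1988: 109,
               1989: 112, 1990: 120, 1991: 125, 1992: 140}

rebased = rebase(price_index, 1989)
for year, value in rebased.items():
    print(year, value)
```

Note that this only rescales the published series; it does not update the weights behind the index.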

Chain Based Index Numbers

A chain-based index is one where the index is calculated every year using the previous year as the
base year. This type of index measures rate of change from year to year.

This method is suitable where weights are changing rapidly and items are constantly being brought
into the index and unwanted items taken out. It can be a price or quantity index

Year    Previous index    Chain-based index         Fixed-base index
1985        100           100                       100 (1985 base year)
1986        104           104/100 × 100 = 104       104/100 × 100 = 104
1987        108           108/104 × 100 = 103.8     108/100 × 100 = 108
1988        109           109/108 × 100 = 100.9     109/100 × 100 = 109
1989        112           112/109 × 100 = 102.8     112/100 × 100 = 112
1990        120           120/112 × 100 = 107.1     120/100 × 100 = 120
1991        125           125/120 × 100 = 104.2     125/100 × 100 = 125
1992        140           140/125 × 100 = 112       140/100 × 100 = 140
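A chain-based series can be computed from the fixed-base series by dividing each value by the one before it. The sketch below uses an illustrative function name, `chain_index`, and assumes the first year is simply reported as 100.

```python
def chain_index(values):
    """Return each value as a percentage of the previous period's value."""
    chain = [100.0]  # the first period has no predecessor, so it is shown as 100
    for previous, current in zip(values, values[1:]):
        chain.append(round(current / previous * 100, 1))
    return chain

fixed_base = [100, 104, 108, 109, 112, 120, 125, 140]  # 1985 to 1992
print(chain_index(fixed_base))
```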

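AGGREGATE PRICE INDEX NUMBERS AND QUANTITY INDEX NUMBERS

LASPEYRE'S INDEX

Price index $= \frac{\sum P_n Q_0}{\sum P_0 Q_0} \times 100$

Quantity index $= \frac{\sum P_0 Q_n}{\sum P_0 Q_0} \times 100$

PAASCHE'S INDEX

Price index $= \frac{\sum P_n Q_n}{\sum P_0 Q_n} \times 100$

Quantity index $= \frac{\sum P_n Q_n}{\sum P_n Q_0} \times 100$

Value index $= \frac{\sum P_n Q_n}{\sum P_0 Q_0} \times 100$

Example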
Given below are the price and quantity data, with prices quoted in Ksh per kg and production in kgs.

              2002            2007                     Solution
Items       Po     Qo       Pn     Qn       PnQo     PoQo     PnQn     PoQn
Beef        15    500       20    600      10000     7500    12000     9000
Mutton      18    590       23    640      13570    10620    14720    11520
Chicken     22    450       24    500      10800     9900    12000    11000
Total       55   1540       67   1740      34370    28020    38720    31520

Determine:

(i) Laspeyre's price and quantity indexes
(ii) Paasche's price and quantity indexes
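The four indexes asked for can be obtained directly from the column totals. The following is a hedged sketch: the function name `aggregate_indexes` and the tuple layout (Po, Qo, Pn, Qn) are illustrative choices, and the formulas are the standard Laspeyre's (base-year weights) and Paasche's (current-year weights) definitions.

```python
def aggregate_indexes(items):
    """items: list of (p0, q0, pn, qn) tuples, one per commodity."""
    pnq0 = sum(pn * q0 for p0, q0, pn, qn in items)
    p0q0 = sum(p0 * q0 for p0, q0, pn, qn in items)
    pnqn = sum(pn * qn for p0, q0, pn, qn in items)
    p0qn = sum(p0 * qn for p0, q0, pn, qn in items)
    return {
        "laspeyres_price": pnq0 / p0q0 * 100,     # base-year quantities as weights
        "laspeyres_quantity": p0qn / p0q0 * 100,  # base-year prices as weights
        "paasche_price": pnqn / p0qn * 100,       # current-year quantities as weights
        "paasche_quantity": pnqn / pnq0 * 100,    # current-year prices as weights
    }

# Beef, mutton, chicken: (Po, Qo, Pn, Qn) for 2002 and 2007.
meat = [(15, 500, 20, 600), (18, 590, 23, 640), (22, 450, 24, 500)]

for name, value in aggregate_indexes(meat).items():
    print(f"{name}: {value:.2f}")
```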
The Fisher's index
The Fisher's index acts as a compromise between Laspeyre's index and Paasche's index. It is calculated as the geometric mean of the two indexes:

Fisher's index $= \sqrt{\text{Laspeyre's index} \times \text{Paasche's index}}$

Example

Let us observe the following data for 1995 and 2000, together with the computations required to construct (i) Laspeyre's and (ii) Paasche's indexes.

               1995 (Base Year)     2000 (Current Year)              Solution
Commodities    Prices    Qty        Prices    Qty        P1Qo     PoQo     P1Q1     PoQ1
               (Po)      (Qo)       (P1)      (Q1)
Wheat           800        6         950        8        5700     4800     7600     6400
Rice            600        3         800        4        2400     1800     3200     2400
Oilseeds        400        5         425        4        2125     2000     1700     1600
Sugar           250        2         300        2         600      500      600      500
Total          2050       16        2475       18       10825     9100    13100    10900

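Solution:

Laspeyre's price index $= \frac{\sum P_1 Q_0}{\sum P_0 Q_0} \times 100 = \frac{10825}{9100} \times 100 = 118.96$

This shows that prices for the group (sample commodities) have increased by 18.96% in 2000 as compared to those prevailing in 1995. The quantity index according to Laspeyre's formula is computed as shown below:

Laspeyre's quantity index $= \frac{\sum P_0 Q_1}{\sum P_0 Q_0} \times 100 = \frac{10900}{9100} \times 100 = 119.78$

This shows a 19.78% increase in aggregate quantity consumption for this group in 2000 as compared to 1995.

Paasche's price index $= \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \times 100 = \frac{13100}{10900} \times 100 = 120.18$

Thus, according to Paasche's index, prices increased by 20.18% in 2000 as against 1995.

Paasche's quantity index $= \frac{\sum P_1 Q_1}{\sum P_1 Q_0} \times 100 = \frac{13100}{10825} \times 100 = 121.02$

It shows a 21.02% increase in quantity consumption for this group in 2000 as compared to 1995.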
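Since the Fisher's index is the geometric mean of the Laspeyre's and Paasche's indexes, it can be obtained from the same column totals. This is a small sketch; the function name `fisher_index` is an assumption made for illustration.

```python
from math import sqrt

def fisher_index(laspeyres, paasche):
    """Geometric mean of a Laspeyre's index and a Paasche's index."""
    return sqrt(laspeyres * paasche)

# Column totals from the 1995/2000 table.
p1q0, p0q0, p1q1, p0q1 = 10825, 9100, 13100, 10900

laspeyres_price = p1q0 / p0q0 * 100
paasche_price = p1q1 / p0q1 * 100
fisher_price = fisher_index(laspeyres_price, paasche_price)

print(round(laspeyres_price, 2), round(paasche_price, 2), round(fisher_price, 2))
```

By construction the Fisher's value always lies between the other two indexes.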

TIME SERIES ANALYSIS


This is the mathematical or statistical analysis of past data arranged in a periodic sequence. A time series can also be defined as an ordered sequence of values of a variable at equally spaced time intervals. Time series occur frequently when looking at industrial data.

Applications:

The uses of time series models can be divided into two:

1. Obtain an understanding of the underlying forces and structure that produced the observed
data.
2. Fit a model and proceed to forecasting, monitoring or even feedback and feedforward
control.
Time Series Analysis is used for many applications such as:

1. Economic Forecasting
2. Sales Forecasting
3. Budgetary Analysis
4. Stock Market Analysis
5. Yield Projections

6. Process and Quality Control
7. Inventory Studies
8. Workload Projections
9. Utility Studies
10. Census Analysis
Moving Average and Smoothing Techniques

Inherent in any collection of data taken over time is some form of random variation. For this reason, it is necessary to 'smoothen' data collected over time. Smoothing reduces random variation and reveals trends and cyclic components.

There are two distinct groups of smoothing methods:

1. Averaging Methods
2. Exponential Smoothing Methods
1. Moving Average

Periodic data, e.g. monthly sales, may show random fluctuations every month even though a general trend is evident. A moving average helps in smoothing away these random changes.

A moving average forecast for a period is the average of the values in a fixed number of immediately preceding periods.

Example:
The table below represents company sales. Calculate the 3-monthly and 6-monthly moving averages for the data.

Months       Sales
January      1200
February     1280
March        1310
April        1270
May          1190
June         1290
July         1410
August       1360
September    1430
October      1280
November     1410
December     1390
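Solution.
These are calculated as follows:

April's forecast $= \frac{\text{Jan} + \text{Feb} + \text{Mar}}{3} = \frac{1200 + 1280 + 1310}{3} = 1263$

May's forecast $= \frac{\text{Feb} + \text{Mar} + \text{Apr}}{3} = \frac{1280 + 1310 + 1270}{3} = 1287$

And so on…
Similarly, for the 6-monthly moving average:

July's forecast $= \frac{\text{Jan} + \text{Feb} + \text{Mar} + \text{Apr} + \text{May} + \text{Jun}}{6} = \frac{1200 + 1280 + 1310 + 1270 + 1190 + 1290}{6} = 1257$

And so on…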
             3 months moving    6 months moving
             average            average
April            1263
May              1287
June             1257
July             1250               1257
August           1297               1292
September        1353               1305
October          1400               1325
November         1357               1327
December         1373               1363
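The table above can be reproduced with a short function. This is a minimal sketch with an illustrative name, `moving_average_forecasts`; each forecast is the plain average of the preceding `window` months.

```python
def moving_average_forecasts(values, window):
    """Forecast each period from the average of the `window` periods before it."""
    return [sum(values[i - window:i]) / window
            for i in range(window, len(values))]

sales = [1200, 1280, 1310, 1270, 1190, 1290, 1410, 1360, 1430, 1280, 1410, 1390]

three_month = [round(v) for v in moving_average_forecasts(sales, 3)]  # April to December
six_month = [round(v) for v in moving_average_forecasts(sales, 6)]    # July to December
print(three_month)
print(six_month)
```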

Note:

When plotting moving averages on graphs, the points are plotted at the midpoint of the period of the average, e.g. in our example the forecast for April (1263) is plotted on mid Feb.

Characteristics of moving average

1) The more the number of periods in the moving average, the greater the smoothing
effect.
2) Different moving averages produce different forecasts.

3) The more the randomness of data with underlying trend being constant then the
more the periods should be involved in the moving averages.
Limitations of moving averages.

1) All periods are weighted equally, disregarding the fact that more recent data is more relevant.

2) A moving average ignores data outside the period of the average, thus it doesn't fully utilise available data.
3) Where there is an underlying seasonal variation, forecasting with unadjusted moving
average can be misleading.

Exponential smoothing

Whereas in single moving averages the past observations are weighted equally, exponential smoothing assigns exponentially decreasing weights as the observations get older. In other words, recent observations are given relatively more weight in forecasting than older observations.

This is a weighted moving average technique; it is given by:

New forecast = Old forecast + α (Latest observation – Old forecast)

Where α = smoothing constant

This method involves automatic weighting of past data with weights that decrease exponentially with time.

Example: Consider the following set of data consisting of 12 observations taken over time. The smoothing constant α is 0.1, and the first observation (71) is used as the starting forecast.

Time    Yt    Smoothed value
1       71
2       70    70.9
3       69    70.71
4       68    70.439
5       64    69.7951
6       65    69.31559
7       72    69.58403
8       78    70.42563
9       75    70.88307
10      75    71.29476
11      75    71.66528
12      70    71.49875
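The smoothed column follows from repeatedly applying the update rule New forecast = Old forecast + α (Latest observation – Old forecast). The sketch below assumes, as the figures in the table imply, that the first observation is used as the starting value; the function name is illustrative.

```python
def exponential_smoothing(observations, alpha):
    """Apply new = old + alpha * (latest - old), starting from the first observation."""
    smoothed = [observations[0]]
    for y in observations[1:]:
        smoothed.append(smoothed[-1] + alpha * (y - smoothed[-1]))
    return smoothed

y = [71, 70, 69, 68, 64, 65, 72, 78, 75, 75, 75, 70]
smoothed = exponential_smoothing(y, alpha=0.1)
for t, (obs, s) in enumerate(zip(y, smoothed), start=1):
    print(t, obs, round(s, 5))
```

Rerunning with a larger `alpha` (for example 0.5) shows the smoothed series following the latest observations much more closely.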
Note:
- The value of α lies between 0 and 1.
- The higher the α value, the more sensitive the forecast is to the most recent observations.

Characteristics of exponential smoothing
- More weight is given to the most recent data.
- All past data are incorporated, unlike in moving averages.
- Less data needs to be stored than with periodic moving averages.

Decomposition/Components of Time Series

A time series has the following components.

a. A long-term trend (T) – the tendency of the whole series to rise or fall.
b. Seasonal variation (S) – short-term periodic fluctuations in values, e.g. in Kenya maize yield is high in November and low in March, or matatus have better business on Fridays and very low business on Sundays.
c. Cyclical variation (C) – medium-term changes caused by factors which apply for a while, then disappear, and come back again in a repetitive cycle, e.g. drought hits Kenya every 7 years.
Note that cyclical variation has a longer term than seasonal variation, e.g. seasonal variation may occur once every year while cyclical variation occurs once every several years.

d. Random residual variation (R) – non-recurring random variations, e.g. war, fire, coup etc.

For accurate forecasts these components (i.e. T, C, S and R) are quantified separately from the data. This is known as time series decomposition.

The separate elements are then combined to produce a forecast.

Time series models:

Additive Model

Time series value = T +S +C +R

Where S, C and R are expressed in absolute value.

This model is best suited where the component factors are independent e.g. where the seasonal
variation is unaffected by trend.

Multiplicative Model:

Time series value = T × S× C × R

Where S, C and R are expressed as percentages or proportions.

This model is best applied where characteristics interact e.g. where high trends increase seasonal
variations. Multiplicative model is more commonly used in practice.

Of the four elements of time series the most important are trend and seasonal variation. The
following illustration shows how the trend (T) and seasonal variation (S) are separated out from a
time series and how the calculated T and S values are used to prepare forecast. The process of
separating out the trend and seasonal variation is known as deseasonalising the data.

There are two approaches to this process: one is based on regression through the actual data points
and the other calculates the regression line through moving average trend points. The method using
the actual data is demonstrated first followed by the moving average method.

1. Time series analysis: trend and seasonal variation using regression on the data
The following data will be used to illustrate how the trend and seasonal variation are calculated.

Example 1

Sales of widgets in '000s

          Quarter 1   Quarter 2   Quarter 3   Quarter 4
Year 1       20          32          62          29
Year 2       21          42          75          31
Year 3       23          39          77          48
Year 4       27          39          92          53


Step 4

Average the percentage variations to find the average seasonal variations.

            Q1     Q2     Q3     Q4
            %      %      %      %
Year 1      65     99    181     80
Year 2      55    106    180     71
Year 3      51     83    157     94
Year 4      51     72    163     91
Total      222    360    681    336
Average     56     90    170     84

These then are the average variations expected from the trend for each of the quarters; for example, on average the first quarter of each year will be 56% of the value of the trend. Because the variations have been averaged, the amounts over 100% (Q3 in this example) should equal the amounts below 100% (Q1, Q2 and Q4 in this example). This can be checked by adding the averages and verifying that they total 400%, thus:

56% + 90% + 170% + 84% = 400%.

On occasions, roundings in the calculations will make slight adjustments necessary to the average
variations.

Step 5

Prepare final forecasts based on the trend line estimates from “trend estimates and percentages
variation table” (i.e. 30.58, 32.42, etc) and the averaged seasonal variations from the table above.
(i.e. 56%, 90%, 170% and 84%)

The seasonally adjusted forecast is calculated thus:

Seasonally adjusted forecast = Trend estimate × Seasonal variation%

Seasonally adjusted forecasts

             X             Y          Seasonally adjusted
          (quarters)    (sales)       forecast
Year 1        1            20            17.12
              2            32            29.18
              3            62            58.24
              4            29            30.32
Year 2        5            21            21.24
              6            42            35.80
              7            75            70.75
              8            31            36.51
Year 3        9            23            25.37
             10            39            42.43
             11            77            83.27
             12            48            42.69
Year 4       13            27            29.49
             14            39            49.05
             15            92            95.78
             16            53            48.87

The forecasts are compared with the actual data to get some idea of how good extrapolated
forecasts might be. With further analysis they enable us to quantify the residual variations.

Extrapolation using the trend and seasonal factors

Once the formulae above have been calculated, they can be used to forecast (extrapolate) future
sales. If it is required to estimate the sales for the next year (i.e. Quarters 17, 18, 19 and 20 in our
series) this is done as follows:

Quarter 17 Basic trend = 28.74 + 1.84 (17)

= 60.02

Seasonal adjustment for a first quarter = 56%

Adjusted forecast = 60.02 × 56%

= 33.61

A similar process produces the following figures:

Adjusted forecasts Quarter 18 = 55.67

19 = 108.29

20 = 55.05
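The extrapolation above is the trend line multiplied by the seasonal factor for the quarter concerned. A minimal sketch follows, with the trend coefficients (28.74 and 1.84) and the seasonal percentages (56%, 90%, 170%, 84%) taken from the illustration; the function name and default arguments are assumptions made for the example.

```python
def seasonal_forecast(quarter_number, intercept=28.74, slope=1.84,
                      seasonal=(0.56, 0.90, 1.70, 0.84)):
    """Multiplicative model: trend estimate times the seasonal proportion."""
    trend = intercept + slope * quarter_number
    factor = seasonal[(quarter_number - 1) % 4]  # quarters 1, 5, 9, ... are first quarters
    return trend * factor

for quarter in (17, 18, 19, 20):
    print(quarter, round(seasonal_forecast(quarter), 2))
```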

Notes:

a) Time series decomposition is not an adaptive forecasting system like moving averages
and exponential smoothing.
b) Forecasts produced by such an analysis should always be treated with caution.
Changing conditions and changing seasonal factors make long term forecasting a
difficult task.
c) The above illustration has been an example of a multiplicative model; that is, the seasonal variations were expressed in percentage or proportionate terms. Similar steps would have been necessary if the additive model had been used, except that the variations from the trend would have been absolute values. For example, the first two variations would have been

Q1: 20 – 30.58 = absolute variation = –10.58
Q2: 32 – 32.42 = absolute variation = –0.42

And so on.

The absolute variations would have been averaged in the normal way to find the average absolute variation, whether + or –, and these values would have been used to make the final seasonally adjusted forecasts.

2. Trend and seasonal variation using moving averages

When the correlation coefficient is low the method of calculating the regression line
through the actual data points should not be used. This is because the regression line is
too sensitive to changes in the data values.

In such circumstances, calculating a regression line through the moving average trend
points is more robust and stable.

Example 1 is reworked below using this method and, because there are many similarities
to the earlier method, only the key stages are shown.


x      y     3-point moving    Trend line    Actual/Trend
             average (1)       (2)           %
1      20                      34.38          58
2      32        38            35.70          90
3      62        41            37.02         167
4      29        37.3          38.34          76
5      21        30.7          39.66          53
6      42        46            40.98         102
7      75        49.3          42.30         177
8      31        43            43.62          71
9      23        31            44.94          51
10     39        46.3          46.26          84
11     77        54.7          47.58         162
12     48        50.7          48.90          98
13     27        38            50.22          54
14     39        52.7          51.54          76
15     92        61.3          52.86         174
16     53                      54.18          98
Trend estimates and percentage variations utilizing moving averages
The first 3-point moving average is calculated as follows:

$\frac{20 + 32 + 62}{3} = 38$, which is entered opposite period 2.

The next is calculated:

$\frac{32 + 62 + 29}{3} = 41$, and so on.

The regression line y = a + bx of the moving average values is calculated in the normal manner and results in the following:

y = 33.06 + 1.32x

This is used to calculate the trend line:

e.g. For Period 1: y = 33.06 + 1.32(1) = 34.38
For Period 2: y = 33.06 + 1.32(2) = 35.70

The percentage variations are averaged as previously shown, resulting in the following values:

Q1 Q2 Q3 Q4
Average seasonal variation % 54 89 170 86

The trend line and the average seasonal variations are then used in a similar manner to that
previously described.

For example, future sales for the next year (i.e. quarters 17, 18, 19 and 20) are extrapolated as follows:

Quarter 17

Forecast sales = (33.06 + 1.32(17)) × 0.54 = 29.97

A similar process produces the following figures:

Quarter 18 = 50.57

19 = 98.84

20 = 51.13

Forecast errors

Differences between actual results and predictions may arise for many reasons. They may arise from random influences, normal sampling errors, choice of the wrong forecasting system or alpha value, or simply because future conditions turn out to be radically different from the past. Whatever the cause(s), management will wish to know the extent of the forecast errors, and various methods exist to calculate these errors.

A commonly used technique, appropriate to time series, is to calculate the mean squared error of the deviations between forecast and actual values and then choose the forecasting system and/or parameters which give the lowest mean squared error, i.e. akin to the 'least squares' method of establishing a regression line.
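As an illustration of this idea, the sketch below compares the 3-monthly and 6-monthly moving average forecasts from the earlier sales example over the months where both exist (July to December). The function name `mean_squared_error` is an illustrative choice, and the forecasts are the rounded table values.

```python
def mean_squared_error(actual, forecast):
    """Average of the squared differences between actual and forecast values."""
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return sum(squared_errors) / len(squared_errors)

actual_sales = [1410, 1360, 1430, 1280, 1410, 1390]         # July to December
three_month_forecast = [1250, 1297, 1353, 1400, 1357, 1373]
six_month_forecast = [1257, 1292, 1305, 1325, 1327, 1363]

mse_three = mean_squared_error(actual_sales, three_month_forecast)
mse_six = mean_squared_error(actual_sales, six_month_forecast)
print(round(mse_three, 1), round(mse_six, 1))
```

On this short run of data the two values are close, so six months of comparison is thin evidence for preferring one system over the other.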

Longer- term forecasting

Moving averages, exponential smoothing and decomposition methods tend to be used for short to
medium term forecasting. Longer term forecasting is usually less detailed and is normally
concerned with forecasting the main trends on a year to year basis. Any of the techniques of
regression analysis described in the preceding chapters could be used depending on the
assumptions about linearity or non- linearity, the number of independent variables and so on. The
least squares regression approach is often used for trend forecasting.

Forecasting using least squares


Example 2

Data have been kept of sales over the last seven years.

Year                      1    2    3    4    5    6    7
Sales (in '000 units)    14   17   15   23   18   22   27

It is required to forecast the sales for the 8th year.
Solution

Years (x)    Sales (y)     xy      x²
1               14         14       1
2               17         34       4
3               15         45       9
4               23         92      16
5               18         90      25
6               22        132      36
7               27        189      49
Σx = 28      Σy = 136   Σxy = 596  Σx² = 140

The normal equations $\sum y = na + b\sum x$ and $\sum xy = a\sum x + b\sum x^2$ give:

136 = 7a + 28b

596 = 28a + 140b

∴ b = 1.86

And substituting in one of the equations we obtain

a = 12

∴ Regression line: y = 12 + 1.86x

Or, Sales (in '000s of units) = 12.00 + 1.86 (no. of years)

We use this expression for forecasting; for the 8th year, sales = 12 + 1.86(8)

= 26.88, i.e. 26,880 units
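The same regression line can be obtained by solving the normal equations in code. This is a minimal sketch using the usual closed-form solution; the function name `least_squares` is illustrative.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x fitted by least squares."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

years = [1, 2, 3, 4, 5, 6, 7]
sales = [14, 17, 15, 23, 18, 22, 27]

a, b = least_squares(years, sales)
forecast_year_8 = a + b * 8
print(round(a, 2), round(b, 2), round(forecast_year_8, 2))
```

With the slope left unrounded (about 1.857) the year 8 forecast comes to roughly 26.86 thousand units; the 26.88 in the worked solution results from rounding b to 1.86 before substituting.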

SAMPLING AND ESTIMATION

6.1 Sampling techniques


6.2 Central limit theorem
6.3 Sampling distribution of statistical parameters
6.4 Test of hypothesis
6.1 METHODS OF SAMPLING
a. Random (probability) sampling methods. These include:
i. Simple random sampling
ii. Stratified sampling
iii. Systematic sampling
iv. Multi-stage sampling
b. Non-random (non-probability) sampling methods. These consist of:
i. Judgment sampling
ii. Quota sampling
iii. Cluster sampling
Simple Random Sampling


This refers to the sampling technique in which each and every item of the population is given an
equal chance of being included in the sample. Since selection of items in the sample depends entirely
on chance, this method is also called chance selection or representative sampling.

It is assumed that if the sample is chosen at random and if the size of the sample is sufficiently large,
it will represent all groups in the population

Random sampling is of 2 types; sampling with replacement and sampling without replacement

Sampling is said to be with replacement when a sampling unit is drawn from a finite population, observed and then returned to the population before another unit is drawn. The population in this case remains the same and a sampling unit might be selected more than once.

If, on the other hand, a sampling unit is chosen and not returned to the population after it has been observed, the sampling is said to be without replacement.

Random samples may be selected with the help of the lottery method or a table of random numbers (such as Tippett's table of random numbers, Fisher and Yates numbers or Kendall and Babington Smith numbers).

Stratified sampling
In this case the population is divided into groups in such a way that units within each group are as similar as possible, in a process called stratification. The groups are called strata. Simple random samples from each of the strata are collected and combined into a single sample. This technique of collecting a sample from a population is called stratified sampling. Stratification may be by age, occupation, income group etc.

Systematic Sampling
In systematic sampling the units of the population are first arranged in some order (ascending or descending) and the sample is then drawn according to a predetermined rule. Suppose a population consists of 1000 units; then every 10th, 20th or 50th item is selected. This method is very easy and economical. It also saves a lot of time.

Multistage sampling


This is similar to stratified sampling except division is done on geographical/location basis, e.g. a
country can be divided into provinces and then survey is done in 4 towns in each province. This
helps to cut traveling costs for a surveyor.

Cluster Sampling
This is where a few geographical regions e.g. a location, town or village are selected at random and
say every single household or shop in that area is interviewed. This again cuts on costs.

Judgment Sampling

Here the interviewer selects whom to interview, believing that their view is more fundamental since they might be directly affected, e.g. to find out the effects of public transport one may choose to interview only people who don't own cars and travel frequently to work.
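Three of the random methods described above can be sketched with Python's standard `random` module. The function names, the `seed` parameter and the example strata are illustrative assumptions; the systematic version simply takes every `step`-th unit of an already ordered list.

```python
import random

def simple_random_sample(population, n, replacement=False, seed=None):
    """Draw n units at random, with or without replacement."""
    rng = random.Random(seed)
    if replacement:
        return [rng.choice(population) for _ in range(n)]  # a unit may repeat
    return rng.sample(population, n)                        # every unit is distinct

def systematic_sample(population, step, start=0):
    """Take every `step`-th unit from an ordered population."""
    return population[start::step]

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample from each stratum and combine them."""
    rng = random.Random(seed)
    sample = []
    for units in strata.values():
        sample.extend(rng.sample(units, n_per_stratum))
    return sample

population = list(range(1, 1001))  # 1000 numbered units
print(simple_random_sample(population, 5, seed=1))
print(systematic_sample(population, 50)[:4])
print(stratified_sample({"under_30": population[:400], "30_and_over": population[400:]}, 3, seed=1))
```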

6.2 THE CENTRAL LIMIT THEOREM


The theorem was introduced by De Moivre. According to it, if we select a large number of simple random samples of size n from any population and determine the mean of each sample, the distribution of these sample means will tend to be described by the normal probability distribution with mean µ and variance σ²/n. This is true even if the population itself is not normally distributed. In other words, the sampling distribution of sample means approaches a normal distribution irrespective of the distribution of the population from which the samples are taken, and the approximation to the normal distribution becomes increasingly close as the sample size increases.
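A small simulation can make the theorem concrete. The sketch below is an illustration under assumed settings: it draws repeated samples of size 50 from a strongly skewed (exponential) population with mean 1 and standard deviation 1, then checks that the sample means cluster around µ with a spread close to σ/√n. The function name, the seed and the choice of population are assumptions.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def sample_means(draw, n, repetitions, seed=0):
    """Mean of each of `repetitions` samples of size n, using draw(rng) for one value."""
    rng = random.Random(seed)
    return [mean(draw(rng) for _ in range(n)) for _ in range(repetitions)]

# Exponential population with rate 1: mean 1, standard deviation 1, not normal.
means = sample_means(lambda rng: rng.expovariate(1.0), n=50, repetitions=2000)

print("mean of sample means:", round(mean(means), 3))
print("standard deviation of sample means:", round(pstdev(means), 3))
print("sigma / sqrt(n):", round(1 / sqrt(50), 3))
```

Plotting a histogram of `means` would show the familiar bell shape even though the population itself is heavily skewed.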

Types of distribution
Population distribution
It refers to the distribution of the individual values of population. Its mean is denoted by ‘µ’

Sample distribution
It is the distribution of the individual values of a single sample. Its mean is generally written as $\bar{x}$. It is not usually the same as µ.


Distribution of Sample Means or sampling distribution


A sample of size n is taken from the parent population and mean of the sample is calculated. This is
repeated for a number of samples so that we have a distribution of sample means, which approaches
a normal distribution.

Standard errors of the mean

The series of sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$ is normally distributed or nearly so (according to the central limit theorem). It can be described by its mean and its standard deviation. This standard deviation is known as the standard error.

Standard error of the mean: $S_{\bar{x}} = \frac{s}{\sqrt{n}}$

Note: this formula is satisfactory for large samples drawn from a large population, i.e. n > 30 and n < 5% of N.
- The word 'error' is used in place of 'deviation' to emphasize that variation among sample means is due to sampling errors.
- The smaller the standard error, the greater the precision of the sample value.
6.3 STATISTICAL INFERENCE

It is the process of drawing conclusions about attributes of a population based upon information
contained in a sample (taken from the population).
It is divided into estimation of parameters and testing of hypotheses. The symbols for sample statistics and population parameters are as follows.

                      Sample statistic    Population parameter
Arithmetic mean             x̄                    µ
Standard deviation          s                    σ
Number of items             n                    N

Statistical estimation


It is the procedure of using a sample statistic to estimate a population parameter.

It is divided into point estimation (where an estimate of a population parameter is given by a single number) and interval estimation (where an estimate of a population parameter is given by a range in which the parameter may be considered to lie). For example, a bus meant to take a class of 100 students (population N) for a trip has a limit of 600kg on the maximum weight it can carry. The teacher realizes he has to find out the weight of the class, but without enough time to weigh everyone he picks 25 students at random (sample n = 25). These students are weighed and their average weight is recorded as 64kg ($\bar{x}$, the mean of the sample) with a standard deviation (s). Using this, the teacher intends to estimate the average weight of the whole class (µ, the population mean) from the sample statistics: the standard deviation (s) and the mean of the sample ($\bar{x}$).


Characteristic of a good estimator

(i) Unbiased: where the expected value of the statistic is equal to the population
parameter e.g. if the expected mean of a sample is equal to the population mean
(ii) Consistency: where an estimator yields values more closely approaching the
population parameter as the sample increases
(iii) Efficiency: where the estimator has smaller variance on repeated sampling.
(iv) Sufficiency: where an estimator uses all the information available in the data
concerning a parameter
Confidence Interval
The interval estimate or a ‘confidence interval’ consists of a range (an upper confidence limit and
lower confidence limit) within which we are confident that a population parameter lies and we assign
a probability that this interval contains the true population value
The confidence limits are the outer limits of a confidence interval, and the confidence interval is the range between the confidence limits. The higher the confidence level, the wider the confidence interval.
For example
A normal distribution has the following characteristic
i. Sample mean ± 1.960 σ includes 95% of the population
ii. Sample mean ± 2.575 σ includes 99% of the population
1. Large Samples


These are samples that contain a sample size greater than 30(i.e. n>30)
(a) Estimation of population mean
Here we assume that if we take a large sample from a population then the mean of the population is
very close to the mean of the sample
Steps to follow to estimate the population mean include:
i. Take a random sample of n items, where n > 30

ii. Compute the sample mean ($\bar{x}$) and standard deviation (s)

iii. Compute the standard error of the mean by using the following formula:

$S_{\bar{x}} = \frac{s}{\sqrt{n}}$

where $S_{\bar{x}}$ = standard error of the mean
s = standard deviation of the sample
n = sample size
iv. Choose a confidence level, e.g. 95% or 99%
v. Estimate the population mean as follows:

Population mean µ = $\bar{x}$ ± (appropriate number) × $S_{\bar{x}}$

The 'appropriate number' is the value corresponding to the chosen confidence level, e.g. 1.96 at the 95% confidence level. This number is usually denoted by Z and is obtained from the normal tables.
Example
The quality department of a wire manufacturing company periodically selects a sample of wire
specimens in order to test for breaking strength. Past experience has shown that the breaking
strengths of a certain type of wire are normally distributed with standard deviation of 200 kg. A
random sample of 64 specimens gave a mean of 6200 kgs. Find out the population mean at 95%
level of confidence
Solution

Population mean = $\bar{x} \pm 1.96\, S_{\bar{x}}$

Note that the sample size is already n > 30, while s and $\bar{x}$ are given; thus steps i), ii) and iv) are already provided.
Here: $\bar{x}$ = 6200 kgs

$S_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{200}{\sqrt{64}} = 25$

Population mean = 6200 ± 1.96(25)
= 6200 ± 49
= 6151 to 6249
At the 95% level of confidence, the population mean will be between 6151 and 6249 kgs.
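The steps for a large-sample interval reduce to a few lines of arithmetic. The following sketch uses an illustrative function name, `confidence_interval`, and takes the critical value (1.96 for 95%, 2.575 for 99%) as an input read from the normal tables rather than computing it.

```python
from math import sqrt

def confidence_interval(sample_mean, sd, n, critical_value):
    """Return (lower limit, upper limit) for the population mean."""
    standard_error = sd / sqrt(n)
    margin = critical_value * standard_error
    return sample_mean - margin, sample_mean + margin

# Wire breaking strength: mean 6200 kg, standard deviation 200 kg, n = 64.
lower_95, upper_95 = confidence_interval(6200, 200, 64, critical_value=1.96)
lower_99, upper_99 = confidence_interval(6200, 200, 64, critical_value=2.575)
print(round(lower_95), round(upper_95))
print(round(lower_99, 1), round(upper_99, 1))
```

Running both levels side by side shows the 99% interval is wider than the 95% one for the same data.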
2. SMALL SAMPLES
(a) Estimation of population mean
If the sample size is small (n < 30), the arithmetic means of small samples are not normally distributed. In such circumstances, Student's t distribution must be used to estimate the population mean.
In this case

Population mean µ = $\bar{x} \pm t \frac{s}{\sqrt{n}}$

$\bar{x}$ = sample mean
s = standard deviation of the sample = $\sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ for small samples
n = sample size
v = n – 1 degrees of freedom.
The value of t is obtained from Student's t distribution tables for the required confidence level.
Example
A random sample of 12 items is taken and is found to have a mean weight of 50 grams and a
standard deviation of 9 grams
What is the mean weight of population
a) with 95% confidence
b) with 99% confidence
Solution

8
Mr. owino
5
Introduction to business statistics

x̄ = 50; S = 9; n = 12; v = n – 1 = 12 – 1 = 11;

µ = x̄ ± t S/√n
At 95% confidence level (t = 2.201 for 11 degrees of freedom)

µ = 50 ± 2.201 × 9/√12

= 50 ± 5.72 grams

Therefore we can state with 95% confidence that the population mean is between 44.28 and 55.72
grams
At 99% confidence level (t = 3.106 for 11 degrees of freedom)

µ = 50 ± 3.106 × 9/√12
= 50 ± 8.07 grams
Therefore we can state with 99% confidence that the population mean is between 41.93 and 58.07
grams
Note: To use the t distribution tables it is important to find the degrees of freedom (v = n – 1). In the
example above v = 12 – 1 = 11
From the tables we find that against 11 degrees of freedom the value of t is 2.201 under 0.05 (95%
confidence level) and 3.106 under 0.01 (99% confidence level)
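The small-sample interval can be sketched in the same way. This is a minimal illustration, assuming the standard error S/√n used in the worked example; the t values are copied from the tables for 11 degrees of freedom rather than computed, and the names T_TABLE_V11 and t_interval are our own.

```python
from math import sqrt

# Two-sided t values for v = 11 degrees of freedom, copied from t tables
T_TABLE_V11 = {0.95: 2.201, 0.99: 3.106}


def t_interval(sample_mean, s, n, t_value):
    """Return (lower, upper) limits: sample_mean +/- t * s / sqrt(n)."""
    margin = t_value * s / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Example from the text: n = 12, mean 50 grams, standard deviation 9 grams
for level, t_value in T_TABLE_V11.items():
    low, high = t_interval(50, 9, 12, t_value)
    print(f"{level:.0%} confidence: {low:.2f} to {high:.2f} grams")
```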
6.4 HYPOTHESIS TESTING
Definition
- A hypothesis is a claim or an opinion about an item or issue. It therefore has to be tested
statistically in order to establish whether or not it is correct
- Whenever testing a hypothesis, one must fully understand the two basic hypotheses to be tested,
namely
i. The null hypothesis (H0)
ii. The alternative hypothesis (H1)


The null hypothesis


This is the hypothesis being tested, the belief about a certain characteristic. For example, the Kenya Bureau of
Standards (KBS) may walk into a sugar making company with the intention of confirming that the 2 kg
bags of sugar produced actually weigh 2 kg and not less. They conduct hypothesis testing with the null
hypothesis being H0: each bag weighs 2 kg. The testing will set out to confirm this or to refute it.
The alternative hypothesis
While formulating a null hypothesis, we also consider the fact that the belief might be found to be
untrue, in which case we will reject it. We therefore formulate an alternative hypothesis which
contradicts the null hypothesis; thus when we reject the null hypothesis we accept the alternative
hypothesis.
In our example the alternative hypothesis would be
H1: each bag does not weigh 2 kg

Acceptance and rejection regions


The possible values of a test statistic are divided into two sets: those consistent with the null hypothesis
(acceptance region) and those which lead to the rejection of the null hypothesis (rejection region or critical region).
The values which separate the rejection region from the acceptance region are called critical values
Type I and type II errors
While testing a hypothesis and deciding to either accept or reject the null hypothesis (H0), there are four
possible occurrences.
a) Acceptance of a true hypothesis (correct decision) – accepting the null hypothesis when it happens
to be true. Note that statistics does not give absolute information; its conclusion could be
wrong, only the probability of it being right is high.
b) Rejection of a false hypothesis (correct decision).
c) Rejection of a true hypothesis – (incorrect decision) – this is called type I error, with probability
= α.
d) Acceptance of a false hypothesis – (incorrect decision) – this is called type II error, with
probability = β.
Levels of significance


A level of significance is a probability value which is used when conducting tests of hypothesis. It is
the probability of making an incorrect decision by rejecting a true null hypothesis (a type I error) after
the statistical testing has been done. The probabilities used are usually very small, e.g. 1% or 5%

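[Figure: normal curve for the 1% level of significance, with area 0.4900 between the mean (0) and the critical value and a 1% provision for errors in the tail]

[Figure: normal curve for the 5% level of significance (left tail), with area 0.45 between the mean (0) and the critical value of –1.65 and a critical region of 5% = 0.05]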
NB: If the standardized value of the sample mean is less than –1.65, we reject the null hypothesis (H0) and
accept the alternative hypothesis (H1); if it is more than –1.65,
we accept the null hypothesis and reject the alternative hypothesis
The above sketch graph and level of significance are applicable when the sample mean is less
than the population mean
The following is used when the sample mean is greater than the population mean
[Figure: normal curve for the 5% level of significance (right tail), with the acceptance region to the left of the critical value Z = 1.65 and a critical (rejection) region of 5% = 0.05 to its right]

NB: If the standardized value of the sample mean is less than 1.65, we accept the null hypothesis and reject the
alternative. If it is greater than 1.65, we reject the null hypothesis and accept the alternative
hypothesis
The above sketch is normally used when the sample mean given is greater than the population mean

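[Figure: normal curve for a two tailed test at the 1% level of significance: accept the null hypothesis (reject the alternative) between –2.58 and +2.58, an area of 0.495 on each side of the mean; reject the null hypothesis (accept the alternative) in each tail, an area of 0.5% = 0.005 each]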
NB: For a two tailed test at the 1% level of significance, if the standardized value of the sample mean is
between –2.58 and +2.58, accept the null hypothesis; otherwise reject it and accept the alternative hypothesis


TWO TAILED TESTS


A two tailed test is normally used in statistical work (tests of significance) when deviations in either direction
matter, e.g. when a complaint lodged by a client is about a product not meeting certain specifications, i.e. the
item will generate a complaint if its measurements are below the lower tolerance limit or above the upper
tolerance limit

[Figure: normal curve with the region of acceptance for H0 between 15 cm and 17 ½ cm and a critical region in each tail]

NB: The null hypothesis is rejected (and the alternative hypothesis accepted) if the sample mean lies
beyond the tolerance limits (15 cm and 17 ½ cm).
ONE TAILED TEST
This is a test where the alternative hypothesis (H1) is only concerned with one of the tails of the
distribution, e.g. to test a business complaint about an item
being shorter than is required.
E.g. a manufacturer of a given brand of bread may state that the average weight of the bread is 500
g, but if a consumer takes a sample, weighs each of the loaves and finds a
mean of 450 g, he will definitely complain that the bread is underweight. The statistical
analysis will concentrate on the left tail of the normal distribution, in which one has
to establish whether 450 g being less than 500 g is statistically significant. Such a test is therefore
referred to as a one tailed test.


On the other hand, the test may concentrate on the right hand tail of the normal distribution. When this
happens, the major complaint is likely to do with oversize items bought. The test is still known
as one tailed because the focus is on one end of the normal distribution.
Number of standard errors
                            Two tailed test   One tailed test
5% level of significance    1.96              1.65
1% level of significance    2.58              2.33
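The table values above come from the standard normal distribution. As a rough check, they can be recomputed with Python's standard library; the helper name critical_z is our own. Note that the computed one tailed 5% value is about 1.645, which tables commonly round to 1.65 (or 1.64).

```python
from statistics import NormalDist


def critical_z(alpha, two_tailed):
    """Positive critical value of Z for significance level alpha."""
    # A two tailed test splits alpha equally between the two tails
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


for alpha in (0.05, 0.01):
    two = critical_z(alpha, two_tailed=True)
    one = critical_z(alpha, two_tailed=False)
    print(f"{alpha:.0%} level: two tailed {two:.3f}, one tailed {one:.3f}")
```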

HYPOTHESIS TESTING PROCEDURE


Whenever a business complaint comes up there is a recommended procedure for conducting a
statistical test. The purpose of such a test is to establish whether the null hypothesis or alternative
hypothesis is to be accepted.
The following are steps normally adopted
1. Statement of the null and alternative hypothesis
2. Statement of the level of significance to be used.
3. Statement about the test statistic i.e. what is to be tested e.g. the sample mean, sample
proportion, difference between sample means or sample proportions
4. Type of test whether two tailed or one tailed.
5. Statement on critical values using the appropriate level of significance


6. Standardizing the test statistic


7. Conclusion showing whether to accept or reject the null hypothesis
STANDARD HYPOTHESIS TESTS
In principle, we can test the significance of any statistic related to any probability distribution.
However, we will be interested in a few standard cases. The sample statistics (mean, proportion and
variance) are related to the normal, t, F and chi squared distributions
Thus
1. Normal test

Tests a sample mean (x̄) against a population mean (µ) (where sample size n > 30 and
population variance σ² is known), and a sample proportion p (where np > 5 and nq > 5,
since in this case the normal distribution can be used to approximate the binomial distribution)

2. t test

Tests a sample mean (x̄) against a population mean, especially where the population
variance is unknown and n < 30.
3. Variance ratio test or F test
It is used to compare population variances and can be used with samples of any size drawn from
normal populations.
4. Chi squared test
It can be used to test the association between attributes or the goodness of fit of an observed
frequency distribution to a standard distribution
Example 1
A certain NGO carried out a survey in a certain community in order to establish the average age at which
girls are married. The results of the survey indicated that the mean marriage age for girls is 19 years
In order to establish the validity of the mean marital age, a sample of 50 women was interviewed and
the average age at which they got married was 16 years, with a
standard deviation of 2.1 years
The sample data indicates that the marital age is less than 19 years. Is this conclusion true or not?
Required


Conduct a statistical test to either support or reject the above conclusion drawn from the sample statistics, i.e.
that the marriage age is less than 19 years. Use a level of significance of 5%
Solution
1. Null hypothesis
H0: μ (mean marital age) = 19 years
Alternative hypothesis H1: μ (mean marital age) < 19 years
2. The level of significance is 5%

3. The test statistic is the sample mean age, x̄ = 16 years


4. The critical value of the one tailed test (one tailed because the alternative hypothesis is a
"less than" inequality) at 5% level of significance is –1.65

[Figure: normal curve with the rejection region to the left of –1.65 and the acceptance region to its right]

5. The standardized value of the sample mean is

Z = (x̄ – µ)/S_x̄ where S_x̄ = S/√n

Where, x̄ = Sample mean
µ = Population mean
S = sample standard deviation
n = sample size
Z = standard value (as per computation)

The standard value Z must fall within the acceptance region for us to accept the null
hypothesis. Thus it must be > –1.65; otherwise we reject the null hypothesis and accept the alternative hypothesis.

Z = (16 – 19)/(2.1/√50) = –3/0.297 = –10.1
6. Since –10.1 < –1.65, we reject the null hypothesis and accept the alternative hypothesis at 5%
level of significance, i.e. the marriage age in this community is significantly lower than 19
years
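The arithmetic in steps 5 and 6 can be checked with a few lines of Python. This is only a sketch of the calculation in Example 1; the variable names are our own and the critical value –1.65 is the table value used in the text.

```python
from math import sqrt

# Example 1 figures: sample mean 16, hypothesised mean 19, S = 2.1, n = 50
sample_mean, mu, s, n = 16, 19, 2.1, 50
critical_value = -1.65              # one tailed (left tail) test at 5%

standard_error = s / sqrt(n)        # about 0.297
z = (sample_mean - mu) / standard_error
reject_h0 = z < critical_value      # left tail: reject if Z falls below -1.65

print(f"Z = {z:.1f}, reject H0: {reject_h0}")  # Z = -10.1, reject H0: True
```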
Example 2
A foreign company which manufactures electric bulbs has assured its customers that the lifespan of
the bulbs is 28 months with a standard deviation of 4 months
Recently the company embarked on quality improvement research for its product. After the
research, using new technology, a sample of 70 bulbs was tested and gave a mean lifespan of
30.2 months
Does this justify the research undertaken? Use 1% level of significance to conduct a statistical test in
order to answer the above question.
Testing procedure
1. Null hypothesis H0: µ = 28
Alternative hypothesis H1: µ > 28
2. The level of significance is 1% (one tailed test)
3. The test statistic is the sample mean lifespan, x̄ = 30.2 months
4. The critical value of the one tailed test at 1% level of significance is +2.33

[Figure: normal curve with area 0.4900 between the mean and the critical value 2.33 and a rejection region of 1% = 0.01 in the right tail]

5. The standardized value of the sample mean is

Z = (x̄ – µ)/S_x̄ = (30.2 – 28)/(4/√70) = 4.6
6. Since 4.6 > 2.33, we reject the null hypothesis and accept the alternative hypothesis at 1%
level of significance, i.e. the new sample mean lifespan is significantly higher than
the population mean
Therefore the research undertaken was worthwhile, or justified
Example 3
A construction firm has placed an order for a consignment of wires with a mean
length of 10.5 metres and a standard deviation of 1.7 m
The company which produces the wires delivered 90 wires, which had a mean length of 9.2 m. The
construction company rejected the consignment on the grounds that the wires were different from the
order placed.
Required
Conduct a statistical test to indicate whether or not you support the action taken by the
construction company at 5% level of significance.
Solution
Null hypothesis H0: µ = 10.5 m
Alternative hypothesis H1: µ ≠ 10.5 m
The level of significance is 5%

The test statistic is the sample mean, x̄ = 9.2 m

The critical value of the two tailed test at 5% level of significance is ±1.96.

[Figure: normal curve with critical values –1.96 and +1.96]

The standardized value of the test statistic is

Z = (x̄ – µ)/S_x̄ = (9.2 – 10.5)/(1.7/√90) = –7.25
Since –7.25 < –1.96, reject the null hypothesis and accept the alternative hypothesis at 5% level of
significance, i.e. the sample mean is significantly different from the mean length ordered by the
construction company. Therefore support the action taken by the construction company
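The same steps apply to left, right and two tailed tests, so they can be gathered into one small function. The sketch below is illustrative only: z_test and its tail argument are hypothetical names of our own, and the critical values are the table values quoted in Examples 2 and 3.

```python
from math import sqrt


def z_test(sample_mean, mu, sd, n, critical, tail):
    """Return (Z, reject_h0). tail is 'left', 'right' or 'two';
    critical is the positive table value for the chosen level."""
    z = (sample_mean - mu) / (sd / sqrt(n))
    if tail == "left":
        reject = z < -critical
    elif tail == "right":
        reject = z > critical
    else:  # two tailed: reject in either tail
        reject = abs(z) > critical
    return z, reject


# Example 2: bulbs, one tailed (right) test at 1%
z2, reject2 = z_test(30.2, 28, 4, 70, 2.33, "right")
# Example 3: wires, two tailed test at 5%
z3, reject3 = z_test(9.2, 10.5, 1.7, 90, 1.96, "two")

print(f"Example 2: Z = {z2:.2f}, reject H0: {reject2}")
print(f"Example 3: Z = {z3:.2f}, reject H0: {reject3}")
```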
T DISTRIBUTION (STUDENT’S T DISTRIBUTION) TESTS OF HYPOTHESIS (TEST
FOR SMALL SAMPLES N < 30)
For small samples (n < 30), the method used in hypothesis testing is exactly similar to the one for large
samples, except that t values from the t distribution at a given number of degrees of freedom v are used
instead of the z score; the standard error statistic used is also different.
Note that v = n – 1 for a single sample and n1 + n2 – 2 where two samples are involved.
a) Test of hypothesis about the population mean
When the population standard deviation is unknown and is estimated by the sample standard deviation (S),
the t statistic is defined as

t = (x̄ – μ)/S_x̄ where S_x̄ = S/√n

which follows Student's t distribution with (n – 1) d.f., where

x̄ = Sample mean
μ = Hypothesized population mean
n = sample size
and S is the standard deviation of the sample calculated by the formula

S = √(Σ(x – x̄)²/(n – 1)) for n < 30
If the absolute value of the calculated t exceeds the table value of t at a specified level of significance,
the null hypothesis is rejected.
Example
Ten oil tins are taken at random from an automatic filling machine. The mean weight of the tins is
15.8 kg and the standard deviation is 0.5 kg. Does the sample mean differ significantly from the
intended weight of 16 kg? Use 5% level of significance.
Solution

Given that n = 10; x̄ = 15.8; S = 0.50; μ = 16; v = 9

H0 : μ = 16
H1 : μ ≠ 16

t = (x̄ – μ)/(S/√n)

= (15.8 – 16)/(0.5/√10) = –0.2/0.16
= –1.25
The table value of t for 9 d.f. at 5% level of significance (two tailed) is 2.26. The absolute value of the
computed t (1.25) is smaller than the table value of t. Therefore, the difference is insignificant and the
null hypothesis is accepted.
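The oil tin example can be checked in the same way. In this sketch the table value of t (about 2.26 for 9 d.f.) is typed in rather than computed, and the variable names are our own. Without rounding the standard error, t works out to about –1.26 rather than –1.25; the conclusion is unchanged.

```python
from math import sqrt

# Oil tin example: n = 10, sample mean 15.8 kg, S = 0.5 kg, intended weight 16 kg
n, sample_mean, s, mu = 10, 15.8, 0.5, 16
t_table = 2.262                      # two tailed 5% value for 9 d.f., from tables

t = (sample_mean - mu) / (s / sqrt(n))
reject_h0 = abs(t) > t_table         # two tailed: compare the absolute value

print(f"t = {t:.2f}, reject H0: {reject_h0}")
```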
CHI SQUARE HYPOTHESIS TESTS (NON-PARAMETRIC TEST) (χ²)
They include amongst others
i. Test for goodness of fit
ii. Test for independence of attributes
iii. Test of homogeneity
iv. Test for population variance
The chi square test (χ²) is used when comparing an actual (observed) distribution with a
hypothesized, or expected, distribution.

It is given by χ² = Σ(O – E)²/E
Where O = Observed frequency
E = Expected frequency
The computed value of χ² is compared with the tabulated value of χ² for a given significance level and
degrees of freedom.
i. Test for goodness of fit
These tests are used when we want to determine whether an actual sample distribution matches a
known theoretical distribution
The null hypothesis usually states that the sample is drawn from the theoretical population
distribution and the alternate hypothesis usually states that it is not.
Example
Mr. Nguku carried out a survey of 320 families in Ateka district, each with 5 children, which
revealed the following distribution
No. of boys 5 4 3 2 1 0
No. of girls 0 1 2 3 4 5
No. of families 14 56 110 88 40 12

Is the result consistent with the hypothesis that male and female births are equally probable at 5%
level of significance?
Solution
If male and female births are equally probable, then the number of boys in a family conforms to a binomial
distribution with probability p = ½.
Therefore
H0: the observed number of boys conforms to a binomial distribution with p = ½
H1: the observations do not conform to a binomial distribution.
On the assumption that male and female births are equally probable, the probability of a male birth is
p = ½. The expected number of families can be calculated by the use of the binomial distribution. The
probability of x male births in a family of 5 is given by
P(x) = 5Cx p^x q^(5–x) (for x = 0, 1, 2, 3, 4, 5)
= 5Cx (½)^5 (since p = q = ½)

To get the expected frequencies, multiply P(x) by the total number N = 320. The calculations are
shown below in the tables

x    P(x)                  Expected frequency = N P(x)
0    5C0 (½)^5 = 1/32      320 × 1/32 = 10
1    5C1 (½)^5 = 5/32      320 × 5/32 = 50
2    5C2 (½)^5 = 10/32     320 × 10/32 = 100
3    5C3 (½)^5 = 10/32     320 × 10/32 = 100
4    5C4 (½)^5 = 5/32      320 × 5/32 = 50
5    5C5 (½)^5 = 1/32      320 × 1/32 = 10

Arranging observed and expected frequencies in the following table and calculating χ²
O      E      (O – E)²   (O – E)²/E
14     10     16         1.60
56     50     36         0.72
110    100    100        1.00
88     100    144        1.44
40     50     100        2.00
12     10     4          0.40
Σ(O – E)²/E = 7.16

χ² = Σ(O – E)²/E
= 7.16

The table value of χ² for v = 6 – 1 = 5 at 5% level of significance is 11.07. The computed value of χ² =
7.16 is less than the table value. Therefore, the hypothesis is accepted. Thus, the data are consistent with
male and female births being equally probable.
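The expected frequencies and the χ² statistic for this example can be recomputed with the sketch below. It assumes, as in the solution, a binomial distribution with 5 children per family and p = ½; the names observed and expected_families are our own, and the critical value 11.07 is the table value quoted above.

```python
from math import comb

# Observed number of families by number of boys (from the survey table)
observed = {5: 14, 4: 56, 3: 110, 2: 88, 1: 40, 0: 12}
n_families = sum(observed.values())       # 320
children, p = 5, 0.5


def expected_families(x):
    """Expected number of families with x boys under the binomial model."""
    return n_families * comb(children, x) * p**x * (1 - p)**(children - x)


chi_square = sum(
    (o - expected_families(x)) ** 2 / expected_families(x)
    for x, o in observed.items()
)
dof = len(observed) - 1                   # 6 classes - 1 = 5

print(f"chi-square = {chi_square:.2f} with {dof} d.f.")  # chi-square = 7.16 with 5 d.f.
print("reject H0" if chi_square > 11.07 else "accept H0")
```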
ii) Test of independence of attributes
This test discloses whether or not there is any association or relationship between two or more attributes.
The following steps are required to perform the test of hypothesis.
1. The null and alternative hypothesis are set as follows
H0: No association exists between the attributes
H1: an association exists between the attributes
2. Under H0, an expected frequency E corresponding to each cell in the contingency table
is found by using the formula

E = (R × C)/n
Where R = a row total, C = a column total and n = sample size
3. Based upon the observed values and corresponding expected frequencies, the χ²
statistic is obtained using the formula

χ² = Σ(O – E)²/E
4. The characteristics of this distribution are defined by the number of degrees of freedom
(d.f.), which is given by
d.f. = (r – 1)(c – 1),
where r is the number of rows and c is the number of columns. Corresponding to a chosen
level of significance, the critical value is found from the chi squared table
5. The calculated value of χ² is compared with the tabulated value of χ² for (r – 1)(c – 1)
degrees of freedom at a certain level of significance. If the computed value of χ² is
greater than the tabulated value, the null hypothesis of independence is rejected.
Otherwise, we accept it.
Example


In a sample of 200 people with a particular disease, 100 were given a drug and the
others were not given any drug. The results are as follows
             Drug   No drug   Total
Cured        65     55        120
Not cured    35     45        80
Total        100    100       200
Test whether the drug is effective or not, at 5% level of significance.
Solution
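Let us take the null hypothesis that the drug is not effective in curing the disease.
Applying the χ² test
The expected cell frequencies are computed as follows

E11 = (R1 × C1)/n = (120 × 100)/200 = 60

E12 = (R1 × C2)/n = (120 × 100)/200 = 60

E21 = (R2 × C1)/n = (80 × 100)/200 = 40

E22 = (R2 × C2)/n = (80 × 100)/200 = 40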
The table of expected frequencies is as follows
             Drug   No drug   Total
Cured        60     60        120
Not cured    40     40        80
Total        100    100       200

Arranging the observed frequencies with their corresponding expected frequencies in the following table we
get
O     E     (O – E)²   (O – E)²/E
65    60    25         0.417
55    60    25         0.417
35    40    25         0.625
45    40    25         0.625
Σ(O – E)²/E = 2.084

χ² = Σ(O – E)²/E
= 2.084

v = (r – 1)(c – 1) = (2 – 1)(2 – 1) = 1;

χ² tabulated (0.05) = 3.841

The calculated value of χ² is less than the table value, so the null hypothesis is accepted. Hence there is
no significant evidence at the 5% level that the drug is effective in curing the disease.
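The five steps of the test of independence can be sketched as a small Python function that works for a contingency table of any size. The function name chi_square_contingency is our own; the critical value 3.841 is the table value quoted in the example. Computed without rounding each cell, χ² is about 2.083 (2.084 when the rounded cell values are added).

```python
def chi_square_contingency(table):
    """Return (chi_square, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # E = (R x C) / n
            chi_square += (o - e) ** 2 / e

    dof = (len(table) - 1) * (len(table[0]) - 1)    # (r - 1)(c - 1)
    return chi_square, dof


# Drug example: rows are cured / not cured, columns are drug / no drug
chi_square, dof = chi_square_contingency([[65, 55], [35, 45]])
print(f"chi-square = {chi_square:.3f} with {dof} d.f.")
print("reject H0" if chi_square > 3.841 else "accept H0")
```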
iii) Test of homogeneity
It is concerned with the proposition that several populations are homogenous with respect to some
characteristic of interest, e.g. one may be interested in knowing if raw materials available from several
retailers are homogenous. A random sample is drawn from each of the populations and the number in
each sample falling into each category is determined. The sample data is displayed in a
contingency table
The analytical procedure is the same as that discussed for the test of independence
Example
A random sample of 400 persons was selected from three age groups and each person was
asked to specify which type of TV program he or she preferred. The results are shown in the following
table

                 Type of program
Age group        A      B      C      Total
Under 30         120    30     50     200
30 – 44          10     75     15     100
45 and above     10     30     60     100
Total            140    135    125    400
Test the hypothesis that the populations are homogenous with respect to the types of television
program they prefer, at 5% level of significance.
Solution
Let us take the hypothesis that the populations are homogenous with respect to the different types of
television programs they prefer
Applying the χ² test, each expected frequency is E = (row total × column total)/n, e.g. (200 × 140)/400 = 70
for the first cell
O      E       (O – E)²   (O – E)²/E
120    70.00   2500.00    35.7143
10     35.00   625.00     17.8571
10     35.00   625.00     17.8571
30     67.50   1406.25    20.8333
75     33.75   1701.56    50.4167
30     33.75   14.06      0.4167
50     62.50   156.25     2.5000
15     31.25   264.06     8.4500
60     31.25   826.56     26.4500
Σ(O – E)²/E = 180.4952

χ² = Σ(O – E)²/E = 180.4952
The table value of χ² for (3 – 1)(3 – 1) = 4 d.f. at 5% level of significance is 9.488
The calculated value of χ2 is greater than the table value. We reject the hypothesis and conclude that
the populations are not homogenous with respect to the type of TV programs preferred, thus the
different age groups vary in choice of TV programs.
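For a larger table such as this one, it helps to print the expected frequencies row by row before adding up χ². The sketch below does this for the TV program data; the dictionary layout and variable names are our own, and 9.488 is the table value quoted above.

```python
# Observed counts for programs A, B, C in each age group
observed = {
    "Under 30":     [120, 30, 50],
    "30 - 44":      [10, 75, 15],
    "45 and above": [10, 30, 60],
}

col_totals = [sum(col) for col in zip(*observed.values())]   # [140, 135, 125]
n = sum(col_totals)                                          # 400

chi_square = 0.0
for group, row in observed.items():
    row_total = sum(row)
    # E = (row total x column total) / n for each cell in this row
    expected = [row_total * c / n for c in col_totals]
    chi_square += sum((o - e) ** 2 / e for o, e in zip(row, expected))
    print(group, expected)

dof = (len(observed) - 1) * (len(col_totals) - 1)            # (3 - 1)(3 - 1) = 4
print(f"chi-square = {chi_square:.3f} with {dof} d.f.")
print("reject H0" if chi_square > 9.488 else "accept H0")
```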


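SUMMARY OF FORMULAE IN HYPOTHESIS TESTING
(a) Hypothesis testing of mean
For n > 30

Z = (x̄ – µ)/S_x̄ where S_x̄ = S/√n, at the α level of significance.

For n < 30

t = (x̄ – µ)/S_x̄ where S_x̄ = S/√n,
at n – 1 d.f. and the α level of significance
(b) Chi-square test

χ² = Σ(O – E)²/E
Where O = observed frequency

E = (row total × column total)/sample size = expected frequency
(c) F – test (variance test)

F = S1²/S2²
here the bigger of the two sample variances (that of the sample with the bigger standard deviation) makes the numerator.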


Prepared by Mr. Owino
