Statistical Analysis Principles Guide
Introduction
As part of its role in providing statistical handbooks on the methodologies of all kinds of statistical work,
including surveys and polls, data processing and validation, and quality measurement and control,
Statistics Centre - Abu Dhabi has issued this handbook on the principles of descriptive and inferential
statistical analysis. It acquaints users with the most important methods of data analysis and presentation,
the development of statistical measures, statistical estimation, hypotheses testing, and correlation and
linear regression between two variables.
According to the statistical theory, statistical analysis methods are various and intertwined, depending
on the number and types of variables and the types of their relations. This handbook examines the
fundamental principles of data descriptive analysis methods. Analysts can also resort to more detailed
statistical resources and theories in case they want to use other methods.
This handbook was issued in four chapters to cover all fundamental principles of the statistical analysis.
The first chapter tackles the presentation methods of various data, including individual or grouped data
over intervals, and methods of building relative and cumulative frequency tables. The second chapter
covers various statistical measures, including the central tendency measures of median, mean, and
mode, and the dispersion measures, such as variance, standard deviation, mean deviation, etc.
The third chapter addresses statistical inference, covering statistical estimation through point estimates
and confidence intervals. This chapter also elaborates on the formulation and testing of
simple statistical hypotheses about the population ratio or mean.
The fourth chapter explains correlation coefficients, namely Pearson's correlation
coefficient, which measures correlation between two continuous variables; Spearman's correlation
coefficient, which measures correlation between discrete variables; and partial correlation coefficients,
which measure correlation between different variables. This chapter also covers the development of a
simple linear regression model: calculating estimates of the regression coefficients, building the
statistical model and testing its accuracy and efficiency, and predicting values of the dependent variable
from given values of the independent variable.
Chapter I
Presentation of statistical data
1.1. Introduction
Statistics is used to collect, organize, summarize, present and analyze data in order to draw acceptable
results and make sound decisions based on this analysis. Its descriptive part, descriptive statistics,
comprises the methods used to organize and summarize information so that it can be understood.
Accordingly, this chapter covers data presentation as a way of organizing and providing data to users
so that they can understand it, compare relevant terms, and draw preliminary results easily.
The way data is presented is crucial in many fields, and some methods encourage the recipient to
interact more with the data. Data may be used in the business sector, economics, research, statistics, etc.
Therefore, many methods are available to present data in the way that best serves the purpose.
1.2.1. Tabulation
Tabulation includes many categorization methods, most importantly:
1. Frequency table
The first step in presenting statistical data is designing a frequency distribution table, which organizes,
summarizes and divides statistical data into two columns and a set of rows. The first column represents
the category (or interval) of the data, and the second column represents the frequency of the category or
characteristic, that is, the number of observations falling in each category.
Gender Frequency
Male 23
Female 26
Table 2: Frequency distribution of average wages of employees in an institution
2. Relative and percentage frequency distribution tables
The frequency distribution table can be used to form two other types of tables, the relative and the
percentage frequency distributions, each made up of two columns. The relative frequency distribution gives
the relative frequency, which is the frequency of an interval divided by the sum of frequencies; the
relative frequencies of all intervals sum to one. The percentage frequency distribution gives
the percentage frequency, which is obtained by multiplying the relative frequency by 100; the
percentage frequencies sum to 100, as shown in Table 4 below, using data stated in Table 2.
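As a minimal sketch of these two derived tables, the Python snippet below computes relative and percentage frequencies from the gender counts of the frequency table shown earlier (23 males, 26 females). The variable names are illustrative assumptions and not part of the handbook.

```python
# Frequency table from the gender example (category -> frequency).
frequencies = {"Male": 23, "Female": 26}

# Sum of all frequencies.
total = sum(frequencies.values())

# Relative frequency = frequency of the category / sum of frequencies.
relative = {category: f / total for category, f in frequencies.items()}

# Percentage frequency = relative frequency x 100.
percentage = {category: 100 * r for category, r in relative.items()}

for category in frequencies:
    print(f"{category:<8}{frequencies[category]:>4}"
          f"{relative[category]:>8.3f}{percentage[category]:>8.1f}%")
```

The relative frequencies add up to one and the percentage frequencies to 100, whatever the underlying counts are.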
Table 5: Exact interval boundaries and midpoints calculated from the data in Table 2:
4. Cumulative frequency table
We are often interested in the frequencies of values that are less than or equal to specific data values.
The cumulative frequency distribution is obtained by successively adding the frequencies of all previous
intervals to the frequency of the current interval. The cumulative frequency distribution table is made up
of two columns: the first gives the exact upper boundary of each interval, and the second gives the number
of observations below that boundary, as shown in Table 6.
Table 6: Cumulative frequency distribution of wages of employees as per data stated in Table 2
Less than 1499.5    0
1. Dot plot
A method used to present and summarize data using dots: the dots stacked vertically above each value
represent the frequency of that value. This tool is used in analysis to identify
the characteristics of the data distribution and any outliers or gaps in a data set.
Example: the following figure shows a dot plot of the marks of 30 students in a science course. The
most frequent mark is 70; starting from 65, the frequency rises to its peak at 70 and then falls towards
90, which is the least frequent mark.
Figure 1: Dot plot of the marks of students
X
X
X X
X X X
X X X X X
X X X X X X
X X X X X X
X X X X X X
Mark 65 70 75 80 85 90
2. Histogram
A method used to present and summarize continuous data and to identify the type and characteristics of
its probability distribution. The data range is divided into groups (intervals), and each group is
represented by a column whose width represents the interval length and whose height represents the
frequency of data values in that interval.
Example: the following graph shows the histogram of the ages of a population, where the column height
represents the number of individuals in thousands and the column width represents the age interval;
e.g., there are about 22,000 individuals aged 30 to 40 in the population.
Example: the following figure presents the stem and leaf plot of the marks of 25 students. The stem
holds the leading digits (hundreds and tens) and each leaf holds the final (units) digit. For example, the
lowest marks, 55, 55, 56 and 59, are placed on the first stem (05) with the leaves 5, 5, 6 and 9, while
the top mark in the group, 100, is written with three digits: the first two digits (10) form the stem and
the last digit (0) is the leaf.
Stem Leaves
05 5 5 6 9
06 2 3 5 6
07 2 5 6 8 9 9
08 1 5 7 7 9
09 2 3 5 6
10 0 0
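The construction of the plot above can be sketched in Python as follows. The list of marks is read back from the stems and leaves shown in the figure; the use of `divmod` and the variable names are illustrative assumptions.

```python
from collections import defaultdict

# The 25 marks, read back from the stem and leaf plot above.
marks = [55, 55, 56, 59, 62, 63, 65, 66, 72, 75, 76, 78, 79, 79,
         81, 85, 87, 87, 89, 92, 93, 95, 96, 100, 100]

# Split each mark into a stem (leading digits) and a leaf (units digit).
stems = defaultdict(list)
for mark in sorted(marks):
    stem, leaf = divmod(mark, 10)
    stems[stem].append(leaf)

print("Stem | Leaves")
for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:02d}   | {leaves}")
```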
4. Box and dot plot
A chart that shows the distribution and spread of data and is used to detect outliers or inconsistent data.
A box and dot plot starts with determining the first quartile, i.e. the data value below which 25% of the
data is found; the third quartile, i.e. the data value below which 75% of the data is found; the minimum
value, marking one end of the plot; and the maximum value, marking the other end.
The mid-quartile range is calculated by subtracting the first quartile from the third quartile and dividing
the result by 2.
The outlier boundaries are then calculated as follows:
Lower outlier boundary = first quartile − (mid-quartile range × 1.5)
Upper outlier boundary = third quartile + (mid-quartile range × 1.5)
Data falling beyond the upper and lower outlier boundaries are called inconsistent data or, sometimes,
outliers.
5. Polygon
Straight line segments that trace the scale of a phenomenon against the vertical and horizontal axes. A
polygon can be drawn using Excel, where the horizontal axis represents the interval midpoints of the
phenomenon under study and the vertical axis represents the interval frequency. Below is the polygon
for the data given in Table 5 above.
Figure 5: Polygon
Figure 6: Cumulative frequency curve
7. Column charts
This method makes it very easy to read data and to compare different values,
which in turn facilitates decisions based on observation. Numbers are
represented by columns whose length is proportional to the relevant value, so that the longest column
represents the highest value and vice versa. The variable to be represented
may have a single classification, with one column per category; this is called a column chart. The
following is a column chart showing the population distribution for each age group.
[Figure: column chart of population by age group; horizontal axis: Age]
The variable to be represented by columns may include more than one classification; each one is
represented by a separate column within an interval, e.g. the number of products (A, B, C, D) sold under
each strategy (1, 2, 3, 4, 5). The following is a column chart showing the sales of products under
different strategies.
8. Bar charts
A bar chart consists of horizontal bars arranged in a specific order; it is easy to read and less confusing
than a column chart. The following is a bar chart showing the number of students at a school in a given
year, from Grade 1 to Grade 6.
Figure 9: Number of students by grade
9. Pie chart
A pie chart is a circle divided into segments or sectors, used to show the relative importance (share) of
each group of a qualitative variable within the population. It is widely used to represent data because it
is very easy to read. The following is a pie chart showing the exports of a country by country of
destination.
Figure 10: Percentage distribution of exports by country
1.2.3. Images
One of the most common data representation methods, in which the user enjoys interacting with the
presented data. It is known for helping the user remember the represented data
for as long as possible, and people often prefer to receive data this way, since it relies mainly on
representing data visually while preserving its meaning.
Chapter II
Statistical measures
The previous chapter outlined the methods used to present and summarize statistical data through
frequency distribution tables or charts in order to reveal some characteristics of the study population.
However, such methods are not sufficient to describe data, so numerical measures are needed to
describe it. This chapter covers two types of statistical measures: measures of central
tendency and measures of dispersion. It discusses in detail the advantages and limitations of
these measures, which depend on the nature of the data and the purpose of using it.
2.1.1. Mean
The mean is a value around which a set of data gathers. It is one of the most important measures of
central tendency, the most widely used in statistics and practice, and is commonly used in comparisons
between different phenomena.
Mathematically, the simple mean is calculated by adding together all of the numbers in a data set and
then dividing the sum by the total count of numbers:
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Example 1: if the wages of 5 employees in a company are 250, 280, 320, 450 and 370 (USD), the
simple mean is calculated as follows:
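A minimal Python sketch of this calculation, using the five wages from Example 1 (the variable names are illustrative assumptions):

```python
# Wages of the 5 employees in Example 1 (USD).
wages = [250, 280, 320, 450, 370]

# Simple mean = sum of the values / number of values.
simple_mean = sum(wages) / len(wages)

print(sum(wages), len(wages), simple_mean)  # 1670 5 334.0
```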
Age (years)        5 - 6    7 - 8    9 - 10    11 - 12
No. of students    4        6        5         9
Solution: to simplify solving this problem, create the following table:
Age        Interval midpoint (x)    Frequency (f)    f·x
5 - 6      5.5                      4                22
7 - 8      7.5                      6                45
9 - 10     9.5                      5                47.5
11 - 12    11.5                     9                103.5
Sum                                 24               218
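The table above can be reproduced with a short Python sketch that computes the interval midpoints, the f·x products and the resulting grouped mean (sum of f·x divided by the sum of frequencies). The variable names are illustrative assumptions.

```python
# Age intervals and frequencies from the table above.
intervals = [(5, 6), (7, 8), (9, 10), (11, 12)]
frequencies = [4, 6, 5, 9]

# Midpoint of each interval.
midpoints = [(low + high) / 2 for low, high in intervals]

# Sum of frequency x midpoint, and total number of observations.
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
n = sum(frequencies)

# Grouped mean = sum(f * x) / sum(f).
grouped_mean = sum_fx / n

print(sum_fx, n, round(grouped_mean, 2))  # 218.0 24 9.08
```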
2.1.2. Median
The median is the value in a sorted data set such that half of the data values are lower
than it and the remaining half are higher; in other words, it is the
measure that separates the sorted data set into two equal halves.
Example 3: calculate the median for the following data: 52, 15, 102, 68 and 44.
The data is arranged in an ascending order and ranked as follows:
value 15 44 52 68 102
rank 1 2 3 4 5
Using the data above, the median rank is (5+1)/2 = 3. Accordingly, the median
is the observation with rank 3, which is 52.
Example 4: calculate the median of the following data: 52, 15, 102, 68, 44 and 72
The data is arranged in an ascending order and ranked as follows:
value 15 44 52 (60) 68 72 102
rank 1 2 3 (3.5) 4 5 6
Using the data above, the median rank is (6+1)/2 = 3.5, which lies between ranks 3 and 4, so the
median is the average of the two observations ranked 3 and 4:
Median = (52 + 68)/2 = 60.0
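Both cases (odd and even number of observations) can be sketched in one small Python function. The function name `median` is an illustrative assumption; Python's standard `statistics.median` gives the same results.

```python
def median(values):
    """Median of ungrouped data using the (n + 1) / 2 rank rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the rank (n + 1) / 2 is a whole number.
        return ordered[mid]
    # Even n: average the two observations around rank (n + 1) / 2.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([52, 15, 102, 68, 44]))      # 52
print(median([52, 15, 102, 68, 44, 72]))  # 60.0
```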
Find the cumulative frequency (f1) preceding the median interval, the cumulative frequency (f2) at the
end of the median interval, the length (L) of the median interval, and the exact lower boundary of the
median interval (A).
The median is calculated as follows:
$$ \text{Median} = A + \frac{n/2 - f_1}{f_2 - f_1} \times L $$
Create the cumulative distribution table (as shown in the previous chapter):
Cumulative frequency    Age (exact upper boundary)
0                       <= 4.5
4                       <= 6.5
10                      <= 8.5
15                      <= 10.5
24                      <= 12.5
Calculate the median rank as n/2 = 12. In the cumulative frequency column this value lies between 10 and
15, hence:
A = 8.5, f1 = 10, f2 = 15, L = 10.5 − 8.5 = 2
Apply the median formula:
$$ \text{Med} = 8.5 + \frac{12 - 10}{15 - 10} \times 2 = 9.3 \text{ years} $$
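A hedged Python sketch of the grouped-median interpolation follows. It takes the exact boundaries and cumulative frequencies from the table above and uses the interval length as the difference between the exact boundaries of the median interval (10.5 − 8.5 = 2). The function name and argument layout are illustrative assumptions.

```python
def grouped_median(boundaries, cumulative):
    """Interpolated median from exact boundaries and cumulative frequencies.

    boundaries[i] is an exact boundary and cumulative[i] is the number of
    observations at or below it (the first pair is the lower boundary of
    the first interval with a cumulative frequency of 0).
    """
    rank = cumulative[-1] / 2  # n / 2
    for i in range(1, len(cumulative)):
        if cumulative[i] >= rank:
            a = boundaries[i - 1]                      # exact lower boundary (A)
            length = boundaries[i] - boundaries[i - 1]  # interval length (L)
            f1, f2 = cumulative[i - 1], cumulative[i]
            return a + (rank - f1) / (f2 - f1) * length
    raise ValueError("cumulative frequencies never reach n / 2")


boundaries = [4.5, 6.5, 8.5, 10.5, 12.5]
cumulative = [0, 4, 10, 15, 24]
print(round(grouped_median(boundaries, cumulative), 1))  # 9.3
```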
2.1.2.3. Characteristics of the median
• Not affected by outliers and can be found for categorical data that can be sorted.
• The sum of absolute deviations is minimum when taken around the median.
2.1.3. Mode
The most frequent value in a data set, widely used with categorical data to identify the most common
pattern (level). A data set with one mode is called a unimodal data set, and a data set with more
than one mode is called a multimodal data set. A data set may also have no mode.
In the case of grouped data or frequency distribution tables, no specific value can be assumed to be
the most frequent, since values are merged into intervals. Instead, there is a modal interval, the one
that has the highest frequency.
$$ GM = \sqrt[n]{x_1 \times x_2 \times x_3 \times \dots \times x_n} $$
$$ GM = \sqrt[3]{1 \times 2 \times 4} = 2 $$
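The geometric mean of the three values above can be checked with a short Python sketch. The function name is an illustrative assumption, and the guard against non-positive values is a design choice of this sketch rather than something prescribed by the handbook.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.prod(values) ** (1 / len(values))


gm = geometric_mean([1, 2, 4])
print(round(gm, 6))  # 2.0
```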
2.1.4.1. Characteristics and limitations of the geometric mean
• Less affected by outliers than the arithmetic mean.
• Cannot be used with data including negative or zero values.
The example above is for ungrouped data; the same formula applies to grouped data, after
calculating the mean for grouped data.
2.2.1. Range
One of the simplest measures to define and calculate, it provides a quick idea of data dispersion and
is denoted by (R). The range of a set of data is the difference between its largest and smallest values:
R = maximum value − minimum value
Example 11: calculate the range of 54, 89, 65, 70, 95, 47
R = 95 - 47 = 48
Example 12: calculate the range of ages in the table below:
Find the interval boundaries and midpoints as shown in the previous chapter covering data tables.
Age                          6 - 15        16 - 25        26 - 35        36 - 45        46 - 55        56 - 65
Exact interval boundaries    5.5 - 15.5    15.5 - 25.5    25.5 - 35.5    35.5 - 45.5    45.5 - 55.5    55.5 - 65.5
Frequency (f)                10            16             14             6              9              5
Where Q1 is the value under which 25% of data points are found when they are sorted in increasing
order, and Q3 the value under which 75% of data points are found when sorted in increasing order.
Example 13: calculate the mid-quartile range of 53, 89, 65, 70, 95, 47, 74, 86
Arrange data in an ascending order: 47, 53, 65, 70, 74, 86, 89, 95
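Q1 of the data above is the 2nd value, since the quartile rank is the product of the number of values n
and the quartile percentage (25%, 50% or 75%): 8 × 25% = 2 for Q1 and 8 × 75% = 6 for Q3. Accordingly:
$$ Q_1 = X_{(2)} = 53, \quad Q_3 = X_{(6)} = 86, \quad Q = \frac{86 + 53}{2} = 69.5 $$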
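The rank rule used in Example 13 can be sketched in Python as below. It assumes, as in the example, that n multiplied by the quartile percentage is a whole number; other quartile conventions (for instance those used by statistical software) may give slightly different values. The helper name is an illustrative assumption.

```python
data = [53, 89, 65, 70, 95, 47, 74, 86]
ordered = sorted(data)
n = len(ordered)


def value_at_rank(rank):
    """Return the observation with the given 1-based rank."""
    return ordered[rank - 1]


# Quartile rank = n x quartile percentage (assumed to be a whole number).
q1 = value_at_rank(int(n * 0.25))  # rank 2
q3 = value_at_rank(int(n * 0.75))  # rank 6

# Mid-quartile range as used in Example 13.
q = (q1 + q3) / 2

print(q1, q3, q)  # 53 86 69.5
```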
2.2.1.7. Mid-quartile range for grouped data
The mid-quartile range for such data is calculated using the difference method. The first and third
quartiles are calculated with the formula previously explained for the median, with the slight
difference that n/2 is replaced by n/4 for Q1 and by 3n/4 for Q3:
$$ Q_1 = A_1 + \frac{n/4 - f_1}{f_2 - f_1} \times L $$
$$ Q_3 = A_3 + \frac{3n/4 - f_3}{f_4 - f_3} \times L $$
Where A1 is the exact upper boundary of the interval preceding the Q1 interval (that is, the exact lower
boundary of the Q1 interval), A3 is the exact upper boundary of the interval preceding the Q3 interval,
L is the interval length (upper interval boundary − lower interval boundary), f1 is the cumulative
frequency preceding the Q1 rank, f2 is the cumulative frequency following the Q1 rank, f3 is the
cumulative frequency preceding the Q3 rank, and f4 is the cumulative frequency following the Q3 rank.
Example 14: find the mid-quartile range (Q) for the ages in example 12.
Create the cumulative frequency distribution as shown below, then calculate Q using the formulas
above.
Cumulative frequency    Age (exact upper boundary)
0                       5.5
10                      15.5
26                      25.5
40                      35.5
46                      45.5
55                      55.5
60                      65.5
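$$ n = 60, \quad \frac{n}{4} = 15, \quad \frac{3n}{4} = 45, \quad L = 10 $$
$$ Q_1 = 15.5 + \frac{15 - 10}{26 - 10} \times 10 = 18.6 $$
$$ Q_3 = 35.5 + \frac{45 - 40}{46 - 40} \times 10 = 43.8 $$
$$ Q = \frac{18.6 + 43.8}{2} = 31.2 $$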
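The quartile calculations of Example 14 can be reproduced with a small Python sketch that interpolates any rank within the cumulative table. The function name `grouped_quantile` and its arguments are illustrative assumptions.

```python
# Exact boundaries and cumulative frequencies from the table above.
boundaries = [5.5, 15.5, 25.5, 35.5, 45.5, 55.5, 65.5]
cumulative = [0, 10, 26, 40, 46, 55, 60]


def grouped_quantile(rank, boundaries, cumulative):
    """Interpolate the value located at a given rank in grouped data."""
    for i in range(1, len(cumulative)):
        if cumulative[i] >= rank:
            a = boundaries[i - 1]                      # exact lower boundary
            length = boundaries[i] - boundaries[i - 1]  # interval length L
            before, after = cumulative[i - 1], cumulative[i]
            return a + (rank - before) / (after - before) * length
    raise ValueError("rank exceeds the total frequency")


n = cumulative[-1]
q1 = grouped_quantile(n / 4, boundaries, cumulative)
q3 = grouped_quantile(3 * n / 4, boundaries, cumulative)
q = (q1 + q3) / 2

print(round(q1, 1), round(q3, 1), round(q, 1))  # 18.6 43.8 31.2
```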
2.2.2. Mean deviation
The average of the absolute deviations of the data from the mean, denoted MD:
$$ MD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| $$
x      x - x̅    | x - x̅ |
5      -4       4
9      0        0
7      -2       2
14     5        5
11     2        2
8      -1       1
12     3        3
6      -3       3
Sum 72 0        20
Using the data of the table above (n = 8, x̅ = 72/8 = 9) and the ungrouped mean deviation formula:
MD = 20/8 = 2.5
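As a check on the table above, the following Python sketch recomputes the mean and the mean deviation for the eight listed values, dividing the sum of absolute deviations by the number of observations n = 8. The variable names are illustrative assumptions.

```python
# The eight observations listed in the table above.
data = [5, 9, 7, 14, 11, 8, 12, 6]

n = len(data)
mean = sum(data) / n

# Sum of absolute deviations from the mean, then divide by n.
abs_deviation_sum = sum(abs(x - mean) for x in data)
md = abs_deviation_sum / n

print(n, mean, abs_deviation_sum, md)  # 8 9.0 20.0 2.5
```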
Example 16: Calculate the mean deviation of ages in example 15.
After calculating the mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} f_i x_i = \frac{1860}{60} = 31$,
create the following table:
Using the data of the table above and the grouped data mean deviation formula:
MD = 760/60 = 12.7
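The totals quoted here (1860, 60 and 760) can be reproduced from the grouped age data with exact boundaries 5.5 to 65.5 and frequencies 10, 16, 14, 6, 9 and 5. The Python sketch below assumes that this is the table the example refers to; the variable names are illustrative.

```python
# Exact interval boundaries and frequencies of the grouped age data.
boundaries = [5.5, 15.5, 25.5, 35.5, 45.5, 55.5, 65.5]
frequencies = [10, 16, 14, 6, 9, 5]

# Interval midpoints: 10.5, 20.5, ..., 60.5.
midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Grouped mean deviation = sum(f * |x - mean|) / n.
abs_deviation_sum = sum(f * abs(x - mean) for f, x in zip(frequencies, midpoints))
md = abs_deviation_sum / n

print(n, mean, abs_deviation_sum, round(md, 1))  # 60 31.0 760.0 12.7
```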
Where x - x̅ are the deviations of the data from the mean and N is the number of data items.
The standard deviation is the square root of the variance and is denoted σ. As with the variance, a
larger value indicates greater dispersion, fluctuation and divergence of the data, and
a smaller value indicates the opposite. Using the variance, the standard deviation of a statistical
sample can be calculated using the following formula, where n is the sample size.
2.2.3.1. Standard deviation for ungrouped data
For a sample of size n taken from a population, the variance is denoted S² and is
calculated using the following formula:
Example 17: calculate the standard deviation of ages of 6 primary education students (x): 5, 8, 6, 9, 7,
10.
Create the table below after calculating the mean:
$$ \bar{x} = \frac{1}{n} \sum x = \frac{45}{6} = 7.5 $$
(x) (x - x̅ ) (x - x̅ ) 2
5 -2.5 6.25
8 0.5 0.25
6 -1.5 2.25
7 -0.5 0.25
9 1.5 2.25
10 2.5 6.25
Sum 0 17.5
Using the data of the table above and the ungrouped data variance formula:
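A hedged Python sketch of this step is given below. It assumes the sample variance divides the sum of squared deviations by n − 1, which is the usual convention for a sample and matches the (N − 1) divisor used in later formulas of this chapter; if the divisor n were intended instead, the variance would be 17.5/6. The variable names are illustrative.

```python
import math

# Ages of the 6 students in Example 17.
ages = [5, 8, 6, 9, 7, 10]

n = len(ages)
mean = sum(ages) / n

# Sum of squared deviations from the mean (last column of the table).
sum_squares = sum((x - mean) ** 2 for x in ages)

# Assumption: sample variance uses the divisor n - 1.
variance = sum_squares / (n - 1)
std_dev = math.sqrt(variance)

print(mean, sum_squares, variance, round(std_dev, 2))  # 7.5 17.5 3.5 1.87
```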
Example 18: calculate standard deviation for ages in example 10.
Create the table below after calculating the mean:
Using the data of the table above and the grouped data variance formula:
$$ Z_i = \frac{x_i - \bar{x}}{S} $$
Example 19: a student gets a grade of 86 in Accounting, where the mean is 77 and the standard deviation
is 11, and a grade of 96 in Economics, where the mean is 84 and the standard deviation is 17. In which
course did the student perform better? The standard value for each course is calculated using the formula
above:
Z (Accounting) = (86 − 77)/11 = 0.82
Z (Economics) = (96 − 84)/17 = 0.71
The results show that the student did better in Accounting than in Economics, although the Accounting
mark is lower.
2.3. Skewness
Skewness is the degree of asymmetry, or deviation from symmetry, of a distribution. If the data
distribution curve has a longer tail to the right of the central maximum than to the left, the distribution
is said to be skewed to the right, or positively skewed. If the contrary is true, the distribution is said
to be skewed to the left, or negatively skewed.
There are many methods to measure the skewness of a frequency distribution or a data set, such as the
following formulas:
$$ Sk = \frac{3(\bar{x} - M)}{S} $$
$$ Sk = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{S^3 (N - 1)} $$
This relative measure is negative if the distribution is skewed to the left and positive if it is skewed to
the right. The first measure ranges from (-3) for negative skew to (+3) for positive skew, and becomes zero
when the mean and the median (M) are equal, as in a normal distribution.
2.4. Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal
distribution, which has a symmetric curve centered around the mean. If a distribution is heavy-tailed
(more so than the normal distribution), it is said to be leptokurtic. If it is flat, it is said to be
platykurtic. If it is neither leptokurtic nor platykurtic, it is said to be mesokurtic. Kurtosis is not
related to the distribution mean: many distributions may share the same mean but differ in how peaked
or flat their curves are.
Since the kurtosis factor of the normal distribution is approximately 3, a distribution is platykurtic when
its kurtosis factor is less than 3 and leptokurtic when its kurtosis factor is more than
3. The kurtosis factor is calculated using the following formula:
$$ SK = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{S^4 (N - 1)} $$
Chapter III
Statistical estimation and hypotheses testing
The study of the characteristics of any statistical population depends on the nature of its members and
the method used to deal with them. When a census of the population is conducted, its characteristics are
studied by identifying statistical indicators of the distribution of the population in the light of the
values of its parameters, including the mean, median, standard deviation, etc. Alternatively, a
population can be studied by taking a sample of its members; its characteristics are then studied by
statistically estimating the parameters from the data of the selected sample. Each estimator is called a
statistic. However, the decision to accept and adopt estimators to study the characteristics of the
population from which a sample has been taken depends on an evaluation of such estimators, because
the estimate of a statistical indicator based on the sample is not equal to the parameter of the
population. The difference between the two is the estimation error arising from the selected sample.
Statistical estimation aims to find the best estimator for the parameters of a population. Testing
statistical hypotheses involves building methods that rely on the data under study to decide on a
hypothesis formulated before dealing with the sample data. However, estimation and testing are not
separate processes; they are interrelated, which is why both are presented here.
3.1. Estimation
Estimation is associated with a group of statistical problems that are dealt with using inference in order
to obtain perceptions, as accurate as possible, of one or more values of the population parameters.
Estimation either seeks a specific point estimate derived from the data of a population
sample, as close as possible to the real value of the parameter, or calculates
boundaries within which the real value of the parameter is expected to fall with a given probability.
The higher the probability, the greater the confidence that the true value of the parameter lies within
the confidence range.
Example 1: Suppose the target is to find the indicator of per capita expenditure in the Emirate of Abu
Dhabi. This requires comprehensive data for all members of the population of Abu Dhabi, who
would all be asked about their expenditure, which means conducting a comprehensive survey of all
members at high cost and over a long time. The resulting indicator, per capita expenditure, will
not be as accurate as required, due to the large volume of data collected, the large number of field teams,
and the many chances of error that will ultimately be reflected in the value of the population parameter.
considering per capita expenditure, θ̂, as an estimation of per capita expenditure of the emirate's
population, Ө.
The question is how to measure the accuracy of the estimated value of the indicator. The estimator
includes a certain amount of error from two main sources: the sampling error, and non-sampling errors,
which cannot be measured but can be minimized by adjusting data collection and processing
procedures.
The sampling error may be measured from the standard deviation of the data of a simple random
sample of n units, drawn from a population of N units, by calculating the so-called sampling error:
$$ SE = \frac{S}{\sqrt{n}} \sqrt{\frac{N - n}{N}} $$
Example 2: Suppose the population parameter to be estimated is the average household size. A sample of
5,000 households is taken and its data is obtained. The average household size is calculated
from the sample data; the estimated value is 6.4.
The standard deviation of the sample data is calculated as S = 2.121. The sampling error is calculated
by dividing the standard deviation by the square root of the sample size, where the factor (N − n)/N is
ignored, being too ...
Determining both lower and upper boundaries for confidence interval requires assuming that data are
normally distributed and/or the size of the sample is relatively large.
Accordingly, the confidence interval boundaries are determined by adding and subtracting the value w,
given by:
$$ w = z_{(1-\alpha/2)} \frac{S}{\sqrt{n}} $$
Where $z_{(1-\alpha/2)}$ is the standard normal distribution value for the previously set confidence
level. For example, if the confidence level is 95%, α = 5%, and the constant $z_{(1-\alpha/2)}$ from the
normal distribution is 1.96. If the confidence level is 90%, the value is 1.645, and if the confidence
level is 85%, the value is 1.44. The value w is the margin of error of the confidence interval.
Accordingly, the confidence interval is calculated by:
$$ \left[ \hat{\theta} - z_{(1-\alpha/2)} \frac{S}{\sqrt{n}}, \; \hat{\theta} + z_{(1-\alpha/2)} \frac{S}{\sqrt{n}} \right] $$
It can be said that the value of the population parameter is expected to lie within this interval with a
confidence level of (1 − α) × 100%.
Example 3: based on the example above, if the average household size is required at a confidence
level of 95%, the bound of error is calculated as follows:
$$ w = \frac{1.96 \times 2.121}{\sqrt{5000}} = 0.0588 $$
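The margin of error and the resulting confidence interval for the household-size example can be sketched in Python as follows. The sketch obtains the z value from the standard library's `statistics.NormalDist` rather than from a printed table, and ignores the finite population factor as the example does; the function name `margin_of_error` is an illustrative assumption.

```python
import math
from statistics import NormalDist


def margin_of_error(s, n, confidence=0.95):
    """Margin of error w = z * S / sqrt(n) for a given confidence level."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    return z * s / math.sqrt(n)


estimate, s, n = 6.4, 2.121, 5000
w = margin_of_error(s, n, confidence=0.95)
lower, upper = estimate - w, estimate + w

print(round(w, 4), round(lower, 3), round(upper, 3))  # 0.0588 6.341 6.459
```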
Example 4: A food company produces a kind of juice in bottles weighing 125 gm. The
production control manager takes a random sample of 36 bottles, measures the quantity of
carbohydrates in gm, and finds that the average amount of carbohydrates is 12 gm and the standard
deviation is 2.4 gm. If the production control department wants to estimate a 95% confidence interval
for the average amount of carbohydrates in the bottles, and the quantity of carbohydrates is normally
distributed, the margin of error is:
$$ w = \frac{1.96 \times 2.4}{\sqrt{36}} = 0.784 $$
so the confidence interval is [12 − 0.784, 12 + 0.784] = [11.216, 12.784] gm.
Note: If the population data is not normally distributed or the sample is small (fewer than 30
observations), the confidence interval is calculated on the assumption that the data follows Student's t
distribution with n-1 degrees of freedom. The t value is obtained from the relevant statistical table at
the predetermined confidence level (1 − α) and replaces z in the formula above:
$$ \left[ \hat{\theta} - t_{(1-\alpha/2)} \frac{S}{\sqrt{n}}, \; \hat{\theta} + t_{(1-\alpha/2)} \frac{S}{\sqrt{n}} \right] $$
When data items differ in importance within the same population, the indicator value cannot be
calculated directly from the data, i.e. by treating all data as having the same importance.
For example, if the observations $x_i$ of a population have weights $w_i$ respectively, the
weighted indicator θ, whether a mean or a percentage, is:
$$ \theta = \frac{\sum w_i x_i}{\sum w_i} $$
For example, suppose a sample of households is taken in a number of cities to identify the average
household expenditure. Not all households have the same weight or importance, so the average
household expenditure across all cities is not a simple average: the size or weight of the city where the
household is located plays an influential role in the size of expenditure. Accordingly, in this case the
average is calculated as follows:
$$ \bar{X} = \frac{\sum w_i \bar{x}_i}{\sum w_i} $$
Where:
$\bar{x}_i$ is the average observation in population group (i).
$w_i$ is the relative weight or importance of population group (i).
If the number and average size of households in a number of cities are as in the table below, how can
we calculate the overall weighted household’s average size?
$$ \bar{X} = \frac{\sum w_i \bar{x}_i}{\sum w_i} = \frac{52790}{9300} = 5.7 $$
The foregoing also applies to a percentage indicator. If the population consists of a number of partial
populations and the percentage of the variable differs in each part, then the relative weights or
importance of the data will differ, and the estimated percentage as a statistical indicator for the
population as a whole is:
$$ p = \frac{\sum w_i p_i}{\sum w_i} $$
Where p is the percentage or ratio at the overall population level and $p_i$ is the percentage or ratio of
the variable values in part (i) of the population.
Example 5: If the percentage of infection with a particular disease among population groups differs by
gender as in the following table, then the total percentage of infection with the disease at the population
level as a whole is calculated as follows:
Gender    No. of individuals (w)    Percentage of infection (p)    p × w
In this example, the total percentage of infection with the disease in the population is:
$$ p = \frac{\sum w_i p_i}{\sum w_i} = \frac{102360}{8000} = 12.8 $$
The method used to estimate the total τ, instead of the average or percentage, is based
on the totals in the different parts of the population (i):
$$ \tau = \sum \tau_i = \sum N_i \bar{x}_i $$
Example 6: A sample of economic establishments is taken from each city to estimate total revenues at
the population level. Since the relative importance, represented by the number of establishments
in each city, is not the same, relative weights are needed for each city.
Based on the table and using the formula above, total revenues are the sum of the last column in the
table:
$$ \tau = \sum \tau_i = \sum N_i \bar{x}_i = 150 \times 500 + 90 \times 500 + 60 \times 850 = 171000 $$
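The total in Example 6 can be checked with a short Python sketch. Because the source table is not reproduced here, the sketch assumes that the first factor of each product is the number of establishments in a city (N_i) and the second is the average revenue per establishment; the variable names are illustrative.

```python
# Assumed layout: (number of establishments N_i, average revenue per establishment).
cities = [(150, 500), (90, 500), (60, 850)]

# Total per city = N_i x average revenue; overall total = sum over cities.
city_totals = [count * average for count, average in cities]
tau = sum(city_totals)

print(city_totals, tau)  # [75000, 45000, 51000] 171000
```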
3.2. Hypotheses testing
We are (1 − α) confident that the confidence interval will contain the unknown true value of the
population parameter.
The confidence interval is calculated from the data of a random sample and used
for statistical inference about the true value of a population parameter. In practice, however, there is
often a prior claim about the value of the unknown parameter. The claim does not necessarily
state a specific value; it can be a mathematical relation, such as the parameter being less
than or greater than a specific value. In this case, the aim of statistical inference is more specific
than in the calculation of a confidence interval: it focuses on verifying the credibility of the claim and
thus on making a decision to accept or reject it.
Examining and judging the credibility of hypotheses is called hypotheses testing. There is a
relationship between calculating a confidence interval and hypotheses testing:
hypotheses testing gives information that is more directly usable in decision-making than the information
obtained from calculating confidence intervals. However, confidence intervals may be relied on in some
cases to decide whether a hypothesis is valid.
H0 : μ = μ0    Ha : μ ≠ μ0
H0 : μ ≤ μ0    Ha : μ > μ0
H0 : μ ≥ μ0    Ha : μ < μ0
Where μ0 is the value of the population parameter under the null hypothesis.
Note that the null hypothesis always includes the equality sign; accordingly, it can be written as follows:
H0 : μ = μ0
There are 4 possible outcomes of a decision made on the null hypothesis:
                H0 is correct    H0 is incorrect
Rejecting H0    α                (1 − β)
Accepting H0    (1 − α)          β
Possibilities:
1. P(reject H0 when H0 is correct) = α (the probability of a type I error).
2. P(accept H0 when H0 is correct) = (1 − α).
3. P(accept H0 when H0 is incorrect) = β (the probability of a type II error).
4. P(reject H0 when H0 is incorrect) = (1 − β).
The type I error is controllable: its probability, called the significance level (α), is set by the
researcher prior to the test. The most commonly used values are 0.05 and 0.01.
As a result, setting up a null and an alternative hypothesis mathematically requires
several conditions to be met:
• The unknown parameter to be tested shall first be identified; it may be a population
average, the difference between two averages, the probability of an event in a population,
the difference between two percentages, a population variance or the ratio of two variances.
• The value of the unknown parameter related to the claim to be tested shall be found.
• The relation between the parameter and the relevant value shall be determined in one of three
forms (>, < or ≠); this gives the alternative hypothesis, which represents the claim.
• The null hypothesis shall be formulated: it includes the same parts as the alternative hypothesis,
with the mathematical relation between the parameter and the relevant value reversed, so that it
represents the opposite of the claim.
Example 1: For each claim below, identify its inputs (the parameter, the relevant value, and the
mathematical relation), then formulate the null and the alternative hypotheses.
• The claim made by the director of an economic department that the average time required
for the maintenance of any machine is less than 12 hours.
• The claim made by a national battery factory that the average age of the batteries produced
by the factory is more than 1.5 years.
• The claim made by a researcher that the proportion of students receiving academic
warnings at Khalifa University is less than 0.30 of the total number of students.
• The claim made by an investor that the average profit generated by investment at Abu Dhabi
Securities Exchange is not equal to (differs from) 0.10.
Solution (1):
Parameter: average time needed for the maintenance of a machine (μ)
Relevant value: 12 hours
Mathematical relation: less than (<)
Hypotheses:
H0: μ ≥ 12
Ha: μ < 12
Solution (2):
Parameter: average battery age (μ)
Relevant value: 1.5 years
Mathematical relation: more than (>)
Hypotheses:
H0: μ ≤ 1.5
Ha: μ > 1.5
Solution (3):
Parameter: proportion of students receiving academic warnings (P)
Relevant value: 0.30
Mathematical relation: less than (<)
Hypotheses:
H0: P ≥ 0.3
Ha: P < 0.3
Solution (4):
Parameter: the percentage profit generated by stock investment (P)
Relevant value: 10%
Mathematical relation: not equal (≠)
Hypotheses:
H0: P = 0.1
Ha: P ≠ 0.1
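The rule used in these solutions (the alternative hypothesis carries the relation stated in the claim, and the null hypothesis carries the opposite relation including equality) can be sketched as a small helper. The function name and the dictionary are hypothetical and serve only to illustrate the rule.

```python
# Relation in the claim (alternative hypothesis) -> relation in the null hypothesis.
NULL_RELATION = {"<": "≥", ">": "≤", "≠": "="}

def formulate_hypotheses(symbol, relation, value):
    """Return the (H0, Ha) pair for a claim of the form `symbol relation value`."""
    if relation not in NULL_RELATION:
        raise ValueError("relation must be one of '<', '>' or '≠'")
    h0 = f"H0: {symbol} {NULL_RELATION[relation]} {value}"
    ha = f"Ha: {symbol} {relation} {value}"
    return h0, ha

# The four claims of Example 1.
for symbol, relation, value in [("μ", "<", 12), ("μ", ">", 1.5), ("P", "<", 0.3), ("P", "≠", 0.1)]:
    print(*formulate_hypotheses(symbol, relation, value), sep="    ")
```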
3.2.3. Types of hypothesis tests
There are two types of hypothesis tests, determined by the form of the alternative hypothesis, as
follows:
➢ Two-tailed test (if Ha: μ ≠ μ0): the rejection region is split between both ends of the curve.
[Figure: two-tailed test. Acceptance region for H0 with area (1 − α) in the centre of the curve.]
➢ One-tailed test: the entire rejection region (α) lies at one end of the curve, either the right or the left:
• If Ha: μ > μ0, the rejection region is at the right end of the curve, as shown below:
[Figure: right-tailed test. Acceptance region for H0 with area (1 − α); rejection region for H0 in the right tail.]
• If Ha: μ < μ0, the rejection region is at the left end of the curve, as shown below:
[Figure: left-tailed test. Acceptance region for H0 with area (1 − α); rejection region for H0 in the left tail.]
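The position of the rejection region for each type of test can be sketched in code. This is an illustrative example that assumes the test statistic follows the standard normal distribution (the t distribution is not covered here); the function name and the tail labels are hypothetical.

```python
from statistics import NormalDist

def z_test_decision(z, alpha, tail):
    """Return (critical value, reject H0?) for a z statistic.

    tail: "two" for Ha: mu != mu0, "right" for Ha: mu > mu0, "left" for Ha: mu < mu0.
    """
    std_normal = NormalDist()
    if tail == "two":
        critical = std_normal.inv_cdf(1 - alpha / 2)  # alpha is split between both tails
        reject = abs(z) > critical
    elif tail == "right":
        critical = std_normal.inv_cdf(1 - alpha)      # all of alpha in the right tail
        reject = z > critical
    elif tail == "left":
        critical = std_normal.inv_cdf(alpha)          # all of alpha in the left tail
        reject = z < critical
    else:
        raise ValueError("tail must be 'two', 'right' or 'left'")
    return critical, reject

# The same statistic can lead to different decisions under different alternatives.
print(z_test_decision(1.7, 0.05, "two"))
print(z_test_decision(1.7, 0.05, "right"))
```

With α = 0.05, a statistic of 1.7 falls in the acceptance region of the two-tailed test but in the rejection region of the right-tailed test, because the one-tailed test places the whole of α in a single tail.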
Ha: μ ≠ μ0    H0: μ = μ0
Ha: μ > μ0    H0: μ ≤ μ0
Ha: μ < μ0    H0: μ ≥ μ0
2. Identification of the significance level α, the sampling distribution, and the acceptance and rejection regions:
The sampling distribution is either the standard normal distribution or the (t) distribution with (n − 1)
degrees of freedom. Critical values, which define the acceptance and rejection regions, are taken from
statistical tables, as shown in the following figure:
[Figure: cases for selecting critical values, labelled "n > 30, unknown" and "n < 30, unknown".]
Chapter IV
Correlation and simple linear regression
4.1. Correlation
Correlation analysis describes the strength of the association between two or more variables;
correlation measures how far the values of the variables change together in a regular manner. The
correlation coefficient is a quantitative indicator of the degree to which one variable depends on one
or more other variables. It is important to know what correlation analysis can and cannot provide.
Correlation analysis does not by itself provide a way to predict the values of a variable, nor does it
give any indication of a causal relationship between variables; it can only determine whether the
degree of covariation is significant. The relationship between two phenomena or variables is called
correlation. Correlation may be direct: the two phenomena change in the same direction, so that if
one increases, the other tends to increase, and vice versa. Correlation may be inverse: the two
phenomena change in opposite directions, so that if one increases, the other tends to decrease,
and vice versa.
Note that the correlation coefficient is a numerical value between −1 and +1; it equals +1 or −1
only when the correlation is perfect.
[Figure: scatter plots illustrating negative correlation, strong negative correlation, and perfect negative correlation.]
The following table summarizes correlation types and relationship directions between two variables:
Non-linear correlation 0
Likewise, at the same levels of strength, the correlation is inverse if the correlation coefficient is negative.
The correlation coefficient between the values of two variables X and Y can be calculated using the
following formula:
\[ r_{xy} = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{\left(n \sum x^2 - (\sum x)^2\right)\left(n \sum y^2 - (\sum y)^2\right)}} \]
Where:
n is the number of paired observations.
∑xy is the sum of the products of the paired values of x and y.
∑x is the sum of the values of variable X.
∑y is the sum of the values of variable Y.
∑x² is the sum of the squared values of variable X.
∑y² is the sum of the squared values of variable Y.
Example 1: Approximate readings of the volume of production (x) and the volume of exports (y) over
several years are recorded as follows :
x      y      xy     x²     y²
3      2      6      9      4
4      2      8      16     4
2      2      4      4      4
2      1      2      4      1
2      1      2      4      1
2      1      2      4      1
Sum: ∑x = 15, ∑y = 9, ∑xy = 24, ∑x² = 41, ∑y² = 15
\[ r_{xy} = \frac{6(24) - (15)(9)}{\sqrt{\left((6 \times 41) - 15^2\right)\left((6 \times 15) - 9^2\right)}} = 0.65 \]
Since the correlation coefficient is 0.65, there is a moderate positive correlation between the volume
of production and the volume of exports. Statistical software makes it easy to calculate the correlation
coefficient; for example, SPSS may be used, as shown in the screenshot below, to calculate the
correlation coefficient between two variables.
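As an alternative to a statistical package, the calculation in Example 1 can be reproduced with a few lines of code. The sketch below implements the sums formula directly; the function name is hypothetical, and the x and y lists assume that the readings of Example 1 are the six pairs whose sums are ∑x = 15, ∑y = 9, ∑xy = 24, ∑x² = 41 and ∑y² = 15.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from the sums formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

production = [3, 4, 2, 2, 2, 2]  # x, assumed readings of Example 1
exports = [2, 2, 2, 1, 1, 1]     # y, assumed readings of Example 1
print(round(pearson_r(production, exports), 2))  # 0.65
```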
Multiple correlation
The multiple correlation coefficient describes the relationship between a dependent variable and a
number of independent variables. For example, it is used to identify the type of correlation between
the wheat yield of one dunum on the one hand, and the amount of rain, the amount of fertilizer, and
the temperature on the other. In this case, the coefficient measures the correlation between the
volume of production as the dependent variable and the set of independent variables on which it depends.
The coefficient of multiple correlation is calculated using the following formula:
\[ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}} \]
Where:
r₁₂² is the square of the simple correlation coefficient of variables 1 and 2.
r₁₃² is the square of the simple correlation coefficient of variables 1 and 3.
r₁₂ is the simple correlation coefficient of variables 1 and 2.
r₁₃ is the simple correlation coefficient of variables 1 and 3.
r₂₃ is the simple correlation coefficient of variables 2 and 3.
Example 2: A swimming coach wants to know the relationship between the time for the 100-meter
freestyle (dependent variable) and both stretching (independent variable 1) and cardiovascular and
respiratory reflexes (independent variable 2). The simple correlation coefficients between the
variables are:
Using the formula above, the coefficient of multiple correlation R21.23 = 0.88.
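For readers who want to check such a calculation by hand, the following sketch assumes the standard three-variable formula R1.23 = √((r12² + r13² − 2·r12·r13·r23) / (1 − r23²)), where variable 1 is the dependent variable. The function name and the input values are hypothetical; they are not the coefficients of Example 2.

```python
from math import sqrt

def multiple_correlation(r12, r13, r23):
    """Multiple correlation of variable 1 on variables 2 and 3 (standard formula, assumed)."""
    r_squared = (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)
    return sqrt(r_squared)

# Hypothetical simple correlation coefficients, for illustration only.
print(round(multiple_correlation(r12=0.6, r13=0.5, r23=0.3), 3))
```

When the two independent variables are uncorrelated (r23 = 0), the expression reduces to √(r12² + r13²), which is a quick sanity check on any implementation.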
\[ \rho_{y2.1} = \frac{r_{y2} - r_{y1}\, r_{12}}{\sqrt{\left(1 - r_{y1}^2\right)\left(1 - r_{12}^2\right)}} \]
Where:
r_{y2} is the simple correlation coefficient of variables y and 2.
r_{12} is the simple correlation coefficient of variables 1 and 2.
r_{y1} is the simple correlation coefficient of variables y and 1.
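The partial correlation formula can be sketched as a short function. It assumes the usual first-order form, in which the correlation between y and variable 2 is adjusted for variable 1; the function name and the input values are hypothetical.

```python
from math import sqrt

def partial_correlation(r_y2, r_y1, r_12):
    """Partial correlation of y and variable 2, holding variable 1 constant."""
    numerator = r_y2 - r_y1 * r_12
    denominator = sqrt((1 - r_y1 ** 2) * (1 - r_12 ** 2))
    return numerator / denominator

# Hypothetical simple correlation coefficients, for illustration only.
print(round(partial_correlation(r_y2=0.5, r_y1=0.6, r_12=0.3), 3))
```

If variable 1 is uncorrelated with both y and variable 2, the partial coefficient equals the simple coefficient r_y2, since there is nothing to adjust for.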
Example 3: An advertising agency wants to describe the relationship between the number of
respondents to advertisements (y), the size of the advertisement published in the newspaper (x1),
and the number of distributed newspapers (x2). The agency has obtained the following data on the
number of respondents in hundreds (y), the size of the advertisement in inches (x1), and the number
of distributed newspapers in thousands (x2).
Accordingly, the following results have been obtained:
Using the formula above, the partial correlation coefficient ρy2.1 is calculated as follows:
Many statistical software packages can also calculate the partial correlation coefficient. In the
SPSS package, after the necessary data are entered, the partial correlation coefficient of the
variables is calculated as follows:
\[ R = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
Example 4: The table below shows the grades of a group of students in a test done twice in a row on
the same students. Calculate Spearman's rank correlation coefficient of the grades of both tests.
Variable Y has two equal values (4, 4) that occupy ranks (2, 3); each is assigned the average rank
(2 + 3)/2 = 5/2 = 2.5.
X      Y      Rank of X    Rank of Y    d      d²
5      6      3            4            -1     1
9      7      5            5            0      0
2      3      1            1            0      0
Sum                                            3.5
\[ R = 1 - \frac{6(3.5)}{5(24)} = 0.825 \]
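The ranking step, including the average rank given to tied values, can be sketched as follows. The function names are hypothetical. In the sample data, the pairs (5, 6), (9, 7) and (2, 3) and the two tied Y values of 4 come from Example 4, while the X values 3 and 7 are hypothetical placeholders chosen so that the ranks reproduce the stated sum of squared differences, 3.5.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward, giving tied values their average rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1  # rank of the first occurrence of v
        count = ordered.count(v)      # number of tied values equal to v
        ranks.append(first + (count - 1) / 2)
    return ranks

def spearman_r(xs, ys):
    """Spearman's rank correlation using R = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [5, 9, 2, 3, 7]  # 3 and 7 are hypothetical placeholders
y = [6, 7, 3, 4, 4]
print(average_ranks(y))            # the two 4s each receive rank 2.5
print(round(spearman_r(x, y), 3))  # 0.825
```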
4.2. Regression
Regression analysis is used to find a mathematical formula that relates a dependent variable to one
or more independent variables. For example, regression analysis is used to study the factors that
increase demand for a product and to find a mathematical relationship (formula) for this dependence.
The formula not only helps us understand the nature of the relation and determine the factors that
influence it, but also allows us to predict the impact of a change in any independent variable on the
dependent variable.
Regression is widely used in many settings. An engineer studying the factors that raise the
temperature of the gases used in a process may need to know the real effect of many factors. Using
regression, the engineer can identify the influential factors, disregard the non-influential ones, and
predict the change in gas temperature that results from a specific change in any influential variable.
An HR manager may need to identify the factors that affect the performance of new employees, such as
age, GPA, university, etc. Using regression analysis, the HR manager can separate influential from
non-influential factors and obtain a mathematical relationship that shows how much these factors
influence performance and allows performance to be predicted.
As mentioned above, the correlation coefficient describes the relationship among phenomena. However,
to study phenomena we often need to understand the nature of the relationship, which may take the
form of a line or a curve. A regression curve is a graph representing the relationship between two
variables or among phenomena. Regression is used to estimate the value of the dependent variable
when the values of the independent variables are known.
Regression analysis is based on the relationship between two or more variables, one of which is
dependent and the others independent. Once the mathematical relationship between the two variables
is identified, the dependent variable can be estimated from the data of the independent variable.
If, for example, imports are affected by the national income, then by quantifying this relationship, imports
may be predicted once the expected national income is determined. Mathematically, if the value of
dependent variable Y depends on the amount of change in the value of independent variable X, then Y
is expressed as a function of X, which is called regression. The regression coefficient is the indicator
that shows how much the dependent variable changes for a one-unit change in the independent
variable.
Examples of such relationships arise in economics, agriculture, commerce, the behavioral sciences,
and many other fields.
Linear regression model
In simple regression analysis, the researcher studies the effect of the independent (predictor)
variable on the dependent (response) variable. The linear regression model is presented as a linear
equation in which the dependent variable is a function of the independent variable:
yi = β0 + β1xi + ei ,  i = 1, 2, 3, ... , n
Where:
yi is the value of the dependent variable y for observation i.
xi is the value of the independent variable x for observation i.
β1 is the regression coefficient (the slope of the regression line).
β0 is the intercept, the point where the regression line crosses the vertical axis.
ei is the random error, the difference between the true and estimated values of y.
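The least-squares estimates of the coefficients are:
\[ \hat{\beta}_1 = \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sum x^2 - n\,\bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
Where x̄ is the mean of the x values, ȳ is the mean of the y values, and the estimated value of the
dependent variable is:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
This estimate is called the regression equation of Y on X.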
Coefficient of determination R²
The true measure of how well the regression model represents the relationship is the coefficient of
determination. It is the square of the correlation coefficient (r²), is never negative, and describes
the strength of the correlation between any two variables. For example, if the correlation coefficient
is r = 0.8, then the regression relationship is positive and the strength of the relationship is
r² = 0.64; multiplying by 100 (r² × 100) expresses the strength of the correlation as a percentage,
here 64%.
Example 5: The table below shows the daily protein intake in gm and weight gain in kg for a sample of
10 individuals.
Protein intake 10 11 14 15 20 25 46 50 59 70
Weight gain 10 10 12 12 13 13 19 15 16 20
Solution:
Protein intake (x)    Weight gain (y)    xy      x²
10                    10                 100     100
11                    10                 110     121
14                    12                 168     196
15                    12                 180     225
20                    13                 260     400
25                    13                 325     625
46                    19                 874     2116
50                    15                 750     2500
59                    16                 944     3481
70                    20                 1400    4900
Required sums: ∑x = 320, ∑y = 140, ∑xy = 5111, ∑x² = 14664, x̄ = 32, ȳ = 14
\[ \hat{y} = 9.44 + 0.143x \]
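The fitted equation can be checked with a short script. The sketch below uses the usual least-squares formulas for the slope and intercept of a simple linear regression, which is an assumption about how the coefficients were obtained; the function name is hypothetical, and the data are the ten pairs of Example 5.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for the line y = b0 + b1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n  # y-bar minus slope times x-bar
    return intercept, slope

protein = [10, 11, 14, 15, 20, 25, 46, 50, 59, 70]  # x, in gm
weight = [10, 10, 12, 12, 13, 13, 19, 15, 16, 20]   # y, in kg
b0, b1 = fit_line(protein, weight)
print(round(b0, 2), round(b1, 3))  # 9.44 0.143
```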
Explanation of the equation
• The constant (β̂0 = 9.44) shows that with no protein intake, the expected weight gain is
9.44 kg.
• The regression coefficient (β̂1 = 0.143) shows that if protein intake increases by 1 gm,
weight gain increases by 0.143 kg (143 gm).
• Weight gain predicted for a protein intake of 50 gm: ŷ = 9.44 + 0.143(50) = 16.59 kg.
A straight line can be drawn through any two known points on it, so two points calculated from the
equation are enough to plot the regression line.
The regression line is then as follows:
In the example above, the coefficient of determination is 0.83 (0.83 × 100 = 83%). This means that
83% of the variation in the dependent variable (weight gain) is explained by the variation in the
independent variable (protein intake).
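The coefficient of determination of Example 5 can be verified by squaring the correlation coefficient of the same data. This is a minimal sketch; the function name is hypothetical, and the prediction at 50 gm simply evaluates the rounded equation ŷ = 9.44 + 0.143x.

```python
def coefficient_of_determination(xs, ys):
    """Square of the Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = (n * sum_xy - sum_x * sum_y) ** 2
    denominator = (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    return numerator / denominator

protein = [10, 11, 14, 15, 20, 25, 46, 50, 59, 70]  # x, in gm
weight = [10, 10, 12, 12, 13, 13, 19, 15, 16, 20]   # y, in kg
r_squared = coefficient_of_determination(protein, weight)
print(round(r_squared, 2))  # 0.83

# Predicted weight gain for 50 gm of protein, using the rounded equation.
predicted = 9.44 + 0.143 * 50
print(round(predicted, 2))  # 16.59
```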
Alternatively, a statistical package such as SPSS may be used to carry out a linear regression
analysis easily, as shown below:
STATISTICS CENTRE
www.scad.gov.ae