Definition of Terms
1. Statistics
It is the branch of mathematics concerned with the techniques by which
information is collected, organized, analysed, and interpreted.
2. Descriptive Statistics
It utilizes numerical and graphical methods to look for patterns in the data
set.
3. Inferential Statistics
It is one of the two categories of statistics and is concerned with the
treatment of data leading to predictions or inferences about a larger group of
data. It draws conclusions such as decisions, predictions, or generalizations
about the data set.
4. Mean
It is the sum of all items in a set of data divided by the number of items. It
is also known as arithmetic average.
5. Median
The median, represented by Md, is the value of the middle term when the
data are arranged in either ascending or descending order. Half of the terms are
located above the median, while the other half are below it. It is affected
by the number of items and not by the size of extreme values.
6. Mode
It is the most frequently occurring value in a given set of data. In a
distribution, the element or measure that is repeated the greatest number of
times is the mode. When the highest frequency corresponds to two elements or
two measures, the distribution is said to be bimodal. When the distribution has
more than two modes, it is said to be multimodal. It is also possible that a
mode may not exist at all.
7. Standard Deviation
It is the measure of variation of a set of data in terms of the amounts by
which the individual values differ from their mean. It is considered the most
stable measure of spread, and is usually preferred in experimental and
research studies where in-depth statistical analysis of data is involved. It is
affected by the value of each item in the data.
8. Nominal data
It is a type of data that is used to label variables without providing any
quantitative value.
9. Ordinal data
It is a categorical, statistical data type where the variables have natural,
ordered categories and the distances between the categories are not known. It
has a ranking.
10. Interval data
It is a type of data which is measured along a scale, in which each point is
placed at an equal distance (interval) from one another.
11. Ratio data
It is defined as a variable measurement scale that not only produces the
order of variables but also makes the difference between variables known
along with information on the value of true zero.
12. Simple random sampling
It is a procedure where a sample is selected in such a way that every
element is as likely to be selected as any other element from the population.
13. Systematic random sampling
It is a sampling procedure with a random start, after which every k-th
element of the population is selected.
14. Cluster sampling
It is a probability sampling technique where researchers divide the
population into multiple groups (clusters) for research.
15. Stratified random Sampling
It is specifically used when the population can naturally be classified
into groups or strata.
16. Independent Variable
It is a variable that is being manipulated in an experiment in order to
observe the effect on a dependent variable.
17. Dependent Variable
This is what you are measuring in the experiment and what is affected
during the experiment. The dependent variable responds to the independent
variable. It is called dependent because it "depends" on the independent
variable.
18. Sample
It is any subset of elements drawn by some appropriate method from a
defined population.
19. Null hypothesis
The hypothesis that there is no significant difference between specified
populations, any observed difference being due to sampling or experimental
error. It is represented using H0.
20. Alternative Hypothesis
It is the hypothesis that states a statistically significant relationship between two
variables. It is denoted using the symbol Ha or H1.
SAMPLE PROBLEMS FOR DESCRIPTIVE STATISTICS
MEAN OF UNGROUPED DATA
Formula: $\bar{x} = \frac{\sum x}{n}$
where: $\sum x$ = sum of the item values
n = number of items
Problem
Find the average sale of banana cue in a school canteen if the daily sales are
as follows:
Monday - Php 353.25
Tuesday - Php 220.75
Wednesday - Php 347.00
Thursday - Php 210.50
Friday - Php 193.50
Solution
$\bar{x} = \frac{\sum x}{n}$
$= \frac{353.25 + 220.75 + 347.00 + 210.50 + 193.50}{5}$
$= \frac{1325}{5}$
$\bar{x} = 265$, so the average daily sale is Php 265.00.
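The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the variable names `daily_sales` and `mean_sales` are illustrative and not part of the original problem.

```python
# Daily banana cue sales in pesos, Monday to Friday, copied from the problem.
daily_sales = [353.25, 220.75, 347.00, 210.50, 193.50]

# Mean = sum of the item values divided by the number of items.
mean_sales = sum(daily_sales) / len(daily_sales)

print(f"Mean daily sale: Php {mean_sales:.2f}")
```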
MEAN OF GROUPED DATA using the long method
Formula: $\bar{x} = \frac{\sum fx}{n}$
where: f = frequency of the class interval
x = midpoint of the class interval (presumed to be the mean of the values
grouped under this interval)
n = total number of items ($\sum f$)
Problem
Calculate the mean grade of 50 students in Statistics.
Class Interval    f     x (midpoint)    fx
90-94             7     92              644
85-89             13    87              1131
80-84             16    82              1312
75-79             8     77              616
70-74             6     72              432
                  n = 50                ∑fx = 4135
Solution:
$\bar{x} = \frac{\sum fx}{n}$
$= \frac{4135}{50}$
$\bar{x} = 82.7$ or 83, the mean grade
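The long method can be sketched in Python as follows. The tuple layout (lower limit, upper limit, frequency) and the names `classes`, `sum_fx`, and `grouped_mean` are assumptions made for this illustration.

```python
# (lower limit, upper limit, frequency) for each class interval in the table.
classes = [(90, 94, 7), (85, 89, 13), (80, 84, 16), (75, 79, 8), (70, 74, 6)]

# n is the total frequency; each midpoint x is the average of the class limits.
n = sum(f for _, _, f in classes)
sum_fx = sum(f * (low + high) / 2 for low, high, f in classes)

grouped_mean = sum_fx / n
print(n, sum_fx, round(grouped_mean, 1))
```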
MEDIAN FOR UNGROUPED DATA
In computing for the median, it is important to remember the following.
1. Arrange the data in an array in descending or ascending order.
2. Take note of the item or items in the middle position. If there is an odd number
of items, the middle item is the median. If there is an even number of items, the
median is taken as the arithmetic mean of the two values falling in the middle.
Problems
A. The numbers of books borrowed from the library during each day of the week
were 36, 31, 24, 45, and 50. What is the median?
Solution:
Arrange the numbers as 24, 31, 36, 45, and 50. Since there are 5
items, the middle item is 36. Thus, the median is 36.
B. The numbers of books borrowed from the library during another week from
Monday to Saturday were 36, 31, 24, 25, 50, and 47. What is the median?
Solution:
Arrange the numbers as 24, 25, 31, 36, 47, and 50. In this case, there
are two middle numbers: 31 and 36. The median is the average of the middle
numbers, that is,
$Md = \frac{31 + 36}{2} = 33.5$
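The two rules (odd and even number of items) can be expressed as one small function. This is a sketch for illustration; the function name `median` and the list names are not from the original text.

```python
def median(values):
    """Return the middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of items: the single middle item is the median.
        return ordered[mid]
    # Even number of items: average the two middle items.
    return (ordered[mid - 1] + ordered[mid]) / 2


week_a = [36, 31, 24, 45, 50]      # Problem A
week_b = [36, 31, 24, 25, 50, 47]  # Problem B
print(median(week_a), median(week_b))
```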
MEDIAN FOR GROUPED DATA
Formula: $Md = L + \left[ \frac{\frac{n}{2} - F}{f} \right] i$
where: L = exact lower limit of the median class
n = total number of items
F = "less than" cumulative frequency of the class interval preceding the
median class
f = frequency of the median class
i = size of the class interval
Problem
Scores    Frequency (f)    Exact Lower Limit (L)    Cumulative Frequency (F)    Cumulative Percent
95-99     5                94.5                     100                         100.0
90-94     11               89.5                     95                          95.0
85-89     17               84.5                     84                          84.0
80-84     25               79.5                     67                          67.0
75-79     20               74.5                     42                          42.0
70-74     12               69.5                     22                          22.0
65-69     7                64.5                     10                          10.0
60-64     3                59.5                     3                           3.0
i = 5     n = 100
Solution: n = 100; n/2 = 50; L = 79.5; F = 42; f = 25; i = 5
$Md = 79.5 + \left[ \frac{50 - 42}{25} \right] 5$
$= 79.5 + 1.6$
Md = 81.1
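The search for the median class and the substitution can be sketched in Python. The data layout (exact lower limit, frequency), listed from the lowest class upward, and the function name `grouped_median` are assumptions for this example.

```python
# (exact lower limit L, frequency f) per class, lowest class first.
classes = [(59.5, 3), (64.5, 7), (69.5, 12), (74.5, 20),
           (79.5, 25), (84.5, 17), (89.5, 11), (94.5, 5)]
class_size = 5  # i


def grouped_median(classes, class_size):
    n = sum(f for _, f in classes)
    cumulative = 0  # F: cumulative frequency of the classes below the current one
    for lower, f in classes:
        # The median class is the first class whose cumulative frequency reaches n/2.
        if cumulative + f >= n / 2:
            return lower + ((n / 2 - cumulative) / f) * class_size
        cumulative += f
    raise ValueError("no classes given")


md = grouped_median(classes, class_size)
print(round(md, 1))
```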
MODE FOR UNGROUPED DATA
In mode for ungrouped data, there is no calculation required, just counting,
and it can be determined for qualitative as well as quantitative data.
Example:
A. The sizes of 15 classes selected at random are:
40, 39, 42, 48, 45, 46, 42, 49, 43, 42, 41, 44, 38, 42, and 47
The mode is 42 because it is the value that occurs the greatest number
of times.
B. The sizes of 15 families in a barangay chosen at random are:
8, 7, 4, 6, 12, 6, 7, 6, 8, 10, 7, 8, 5, 3, 4
The modes are 6, 7, and 8, each occurring three times. The distribution is
multimodal.
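Counting can be delegated to `collections.Counter` from the Python standard library. The helper name `modes` below is illustrative; it returns every value tied for the highest frequency, so a data set in which every value occurs once would return all values rather than reporting that no mode exists.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


class_sizes = [40, 39, 42, 48, 45, 46, 42, 49, 43, 42, 41, 44, 38, 42, 47]
family_sizes = [8, 7, 4, 6, 12, 6, 7, 6, 8, 10, 7, 8, 5, 3, 4]

print(modes(class_sizes))   # one mode
print(modes(family_sizes))  # three modes (multimodal)
```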
MODE FOR GROUPED DATA
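In a grouped distribution, the class interval with the highest frequency is the
modal class. The midpoint of the modal class gives a rough estimate of the mode;
the formula below gives a more refined estimate.
Formula: $Mo = L_{mo} + \left[ \frac{d_1}{d_1 + d_2} \right] i$
where: Lmo = exact lower limit of the modal class
d1 = difference between the frequency of the modal class and the frequency of
the next lower class
d2 = difference between the frequency of the modal class and the frequency of
the next higher class
i = size of the class interval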
Example:
Consider the distribution of the weekly wages of the factory workers in KRNRD
Garments Factory. Where is the highest frequency in the distribution located? What
is the modal class in the distribution?
Weekly Wages (in Php)    No. of Workers
1,380 - 1,399            4
1,360 - 1,379            6
1,340 - 1,359            12
1,320 - 1,339            31    (modal class)
1,300 - 1,319            24
1,280 - 1,299            15
1,260 - 1,279            11
1,240 - 1,259            8
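Substituting the values, with Lmo = 1,319.5, d1 = 31 - 24 = 7, d2 = 31 - 12 = 19, and i = 20:
$Mo = L_{mo} + \left[ \frac{d_1}{d_1 + d_2} \right] i$
$= 1{,}319.5 + \left[ \frac{31 - 24}{(31 - 24) + (31 - 12)} \right] 20$
$= 1{,}319.5 + \left[ \frac{7}{7 + 19} \right] 20$
$= 1{,}319.5 + \frac{140}{26}$
$= 1{,}319.5 + 5.38$
Mo = 1,324.88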
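The same substitution can be sketched in Python. The list layout (exact lower limit, frequency), ordered from the lowest class upward, and the name `grouped_mode` are assumptions for this illustration; a missing neighbouring class is treated as having frequency 0.

```python
# (exact lower limit, number of workers) per class, lowest class first.
classes = [(1239.5, 8), (1259.5, 11), (1279.5, 15), (1299.5, 24),
           (1319.5, 31), (1339.5, 12), (1359.5, 6), (1379.5, 4)]
class_size = 20  # i


def grouped_mode(classes, class_size):
    freqs = [f for _, f in classes]
    k = freqs.index(max(freqs))                        # position of the modal class
    below = freqs[k - 1] if k > 0 else 0               # next lower class
    above = freqs[k + 1] if k + 1 < len(freqs) else 0  # next higher class
    d1 = freqs[k] - below
    d2 = freqs[k] - above
    return classes[k][0] + d1 / (d1 + d2) * class_size


mo = grouped_mode(classes, class_size)
print(round(mo, 2))
```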
STANDARD DEVIATION for Ungrouped Data
To find the standard deviation of ungrouped data, use the formula:
$s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n-1)}}$
where: s = standard deviation
$\sum x^2$ = sum of the squared values
n = number of items
$\sum x$ = summation of x
Example:
Calculate the standard deviation of the given scores in an Algebra quiz: 18,
20, 22, 15, 16, 12, 17, 21, 10, 19.
Step 1: Construct a table of values.
X      X²
18     324
20     400
22     484
15     225
16     256
12     144
17     289
21     441
10     100
19     361
∑X = 170    ∑X² = 3024
Step 2: Substitute in the formula.
$s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n-1)}}$
$= \sqrt{\frac{10(3024) - (170)^2}{10(10-1)}}$
$= \sqrt{\frac{30240 - 28900}{90}}$
$= \sqrt{\frac{1340}{90}}$
$= \sqrt{14.89}$
s = 3.86
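A short Python check of both steps is given below as a sketch. It evaluates the computational formula directly and then compares the result with `statistics.stdev` from the standard library, which also uses the n - 1 denominator. The variable names are illustrative.

```python
import math
import statistics

scores = [18, 20, 22, 15, 16, 12, 17, 21, 10, 19]

# Step 1: the column totals of the table.
n = len(scores)
sum_x = sum(scores)
sum_x2 = sum(x * x for x in scores)

# Step 2: substitute into s = sqrt((n*sum(x^2) - (sum x)^2) / (n(n-1))).
s = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(sum_x, sum_x2, round(s, 2))
print(round(statistics.stdev(scores), 2))  # library cross-check
```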
STANDARD DEVIATION of Grouped Data
To find the standard deviation of grouped data, use the formula:
$s = \sqrt{\frac{\sum f d^2}{n}}$
where: s = standard deviation
$\sum f d^2$ = sum of the products of each frequency and its squared deviation
n = number of items
Example
Using the data of The Arts and Craft Shop shown below, calculate the
standard deviation.
Amount in Pesos    f (days)    x (midpoint)    fx      d (deviation)    d²     fd²
172-180            3           176             528     25               625    1875
163-171            5           167             835     16               256    1280
154-162            9           158             1422    7                49     441
145-153            12          149             1788    -2               4      48
136-144            5           140             700     -11              121    605
127-135            4           131             524     -20              400    1600
118-126            2           122             244     -29              841    1682
                   n = 40                      ∑fx = 6041                      ∑fd² = 7531
Step 1: Prepare the frequency distribution with appropriate class intervals and write
the corresponding frequency (f).
Step 2: Get the midpoint (x) of each class interval.
Step 3: Multiply the frequency (f) by the midpoint (x) of each interval to get fx.
Step 4: Add fx of each interval to get ∑ fx.
Step 5: Compute the mean ($\bar{x}$).
$\bar{x} = \frac{\sum fx}{n} = \frac{6041}{40} = 151.03$ or 151
Step 6: Calculate the deviation (d) by subtracting the mean (x̄) from each midpoint
(x). Thus, d= x - x̄.
Step 7: Square the deviation (d) of each interval to get d².
Step 8: Multiply the frequency (f) by d². Find the sum of the products to get $\sum f d^2$.
Step 9: Calculate the standard deviation (s) by substituting the values in the formula.
$s = \sqrt{\frac{\sum f d^2}{n}}$
$= \sqrt{\frac{7531}{40}}$
$= \sqrt{188.275}$
s = 13.72
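Steps 2 to 9 can be traced in Python as sketched below. Following the worked example, the mean is rounded to the nearest whole number before the deviations are taken; the names `classes`, `rounded_mean`, and `sum_fd2` are assumptions for this illustration.

```python
import math

# (midpoint x, frequency f) for each class interval in the table.
classes = [(176, 3), (167, 5), (158, 9), (149, 12), (140, 5), (131, 4), (122, 2)]

n = sum(f for _, f in classes)
mean = sum(f * x for x, f in classes) / n   # Steps 3 to 5
rounded_mean = round(mean)                  # the example rounds 151.03 to 151

# Steps 6 to 8: deviation d = x - mean, then the sum of f * d^2.
sum_fd2 = sum(f * (x - rounded_mean) ** 2 for x, f in classes)

# Step 9: s = sqrt(sum(f * d^2) / n).
s = math.sqrt(sum_fd2 / n)
print(n, rounded_mean, sum_fd2, round(s, 2))
```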
SAMPLE PROBLEMS INFERENTIAL STATISTICS
T-TEST (Pooled Estimate)
Problem: The data are collected to determine the difference between the
supplies of laundry detergent and fabric conditioner in a laundry shop during
the past 6 months.
Months      Laundry Detergent (X1)    X1²     Fabric Conditioner (X2)    X2²
January     40                        1600    30                         900
February    50                        2500    25                         625
March       35                        1225    30                         900
April       20                        400     15                         225
May         15                        225     20                         400
June        45                        2025    50                         2500
            ∑X1 = 205                 ∑X1² = 7975    ∑X2 = 170           ∑X2² = 5550
I. Statement of the Problem
Is there a difference between the supplies of laundry detergent and
fabric conditioner in a laundry shop for the past 6 months?
Statement of Null Hypothesis
There is no difference between the supplies of laundry detergent and
fabric conditioner in a laundry shop for the past 6 months.
II. Level of Significance
α= 5%
III. Critical Value
df = n1 + n2 - 2
df = 6 + 6 - 2
df = 10
Tabular t-value = 2.228
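IV. Computation
Formula: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\left[ \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \right] \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$
where the variance of each sample is $S^2 = \frac{n \sum x^2 - (\sum x)^2}{n(n-1)}$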
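Solution for the variances:
$S_1^2 = \frac{n \sum x_1^2 - (\sum x_1)^2}{n(n-1)} = \frac{6(7975) - (205)^2}{6(6-1)} = \frac{47850 - 42025}{30} = \frac{5825}{30} = 194.17$
$S_2^2 = \frac{n \sum x_2^2 - (\sum x_2)^2}{n(n-1)} = \frac{6(5550) - (170)^2}{6(6-1)} = \frac{33300 - 28900}{30} = \frac{4400}{30} = 146.67$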
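Solution:
$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\left[ \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \right] \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$
$= \frac{34.2 - 28.3}{\sqrt{\left[ \frac{(5) 194.17 + (5) 146.67}{6 + 6 - 2} \right] \left( \frac{1}{6} + \frac{1}{6} \right)}}$
$= \frac{5.9}{\sqrt{\left[ \frac{970.85 + 733.35}{10} \right] (0.3333)}}$
$= \frac{5.9}{\sqrt{\left[ \frac{1704.2}{10} \right] (0.3333)}}$
$= \frac{5.9}{\sqrt{56.8067}} = \frac{5.9}{7.5370}$
t = 0.7828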
V. Interpretation
Since the computed t-value of 0.7828 is less than the tabular t-value of
2.228 at the 5% level of significance with df = 10, the null hypothesis is
not rejected. There is no significant difference between the supplies of
laundry detergent and fabric conditioner in the laundry shop for the past 6 months.
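The pooled computation can be sketched in Python using the same formulas. The function names `sample_variance` and `pooled_t` are illustrative. Because the code keeps the means at full precision instead of rounding them to 34.2 and 28.3, it gives t of about 0.774 rather than 0.7828; the comparison with the tabular value of 2.228 is unaffected.

```python
import math

detergent = [40, 50, 35, 20, 15, 45]    # X1
conditioner = [30, 25, 30, 15, 20, 50]  # X2


def sample_variance(xs):
    """S^2 = (n*sum(x^2) - (sum x)^2) / (n(n-1))."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))


def pooled_t(a, b):
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * sample_variance(a)
              + (n2 - 1) * sample_variance(b)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (sum(a) / n1 - sum(b) / n2) / standard_error


t = pooled_t(detergent, conditioner)
print(round(sample_variance(detergent), 2), round(sample_variance(conditioner), 2))
print(round(t, 3))
```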
T-test (Paired Estimate)
Problem: Cybernetic Company, one of the leading computer companies in the
Philippines, wants to know the difference between the number of computers
produced in 2019 and 2020.
2019 (in thousands)    2020 (in thousands)    d      d²
20                     22                     -2     4
15                     13                     2      4
40                     35                     5      25
12                     15                     -3     9
10                     17                     -7     49
25                     15                     10     100
                                              ∑d = 5    ∑d² = 191
I. Statement of the Problem
Is there a difference between the number of produced computers in 2019 and
2020 of Cybernetic Company?
Statement of Null Hypothesis
There is no difference between the number of produced computers in 2019
and 2020 of Cybernetic Company.
II. Level of Significance
α= 5%
III. Critical Value
df = n - 1
df = 6 - 1
df = 5
Tabular t-value = 2.571
IV. Computation
Formula: $\bar{d} = \frac{\sum d}{n}$
$S_d = \sqrt{\frac{n \sum d^2 - (\sum d)^2}{n(n-1)}}$
$t = \frac{\bar{d}}{S_d / \sqrt{n}}$
Solution: $\bar{d} = \frac{5}{6} = 0.8333$
$S_d = \sqrt{\frac{6(191) - (5)^2}{6(6-1)}} = \sqrt{\frac{1146 - 25}{30}} = \sqrt{\frac{1121}{30}} = \sqrt{37.3667} = 6.1128$
$t = \frac{0.8333}{6.1128 / \sqrt{6}} = \frac{0.8333}{2.4955} = 0.3339$
V. Interpretation
Since the computed t-value of 0.3339 is less than the tabular t-value of 2.571
at the 5% level of significance with df = 5, the null hypothesis is not rejected.
There is no significant difference between the number of computers produced in
2019 and 2020 by Cybernetic Company.
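The paired computation can be traced in Python from the raw production figures. This is a sketch under the same formulas for the mean difference, the standard deviation of the differences, and t; the variable names are illustrative.

```python
import math

produced_2019 = [20, 15, 40, 12, 10, 25]  # in thousands
produced_2020 = [22, 13, 35, 15, 17, 15]  # in thousands

# d = 2019 value minus 2020 value for each pair.
d = [a - b for a, b in zip(produced_2019, produced_2020)]
n = len(d)
sum_d = sum(d)
sum_d2 = sum(x * x for x in d)

d_bar = sum_d / n
s_d = math.sqrt((n * sum_d2 - sum_d ** 2) / (n * (n - 1)))
t = d_bar / (s_d / math.sqrt(n))

print(sum_d, sum_d2)
print(round(d_bar, 4), round(s_d, 4), round(t, 4))
```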
F-test ANOVA
Problem: The data below are the five-month sales of three branches of Oriental
Milk Tea in Oriental Mindoro.
Month       OMT 1 (X1)    X1²     OMT 2 (X2)    X2²      OMT 3 (X3)    X3²
January     30            900     40            1600     20            400
February    40            1600    60            3600     45            2025
March       35            1225    25            625      15            225
April       50            2500    45            2025     30            900
May         45            2025    55            3025     40            1600
t.j         200                   225                    150                    T = 575
m.j         5                     5                      5                      N = 15
∑X²                       8250                  10875                  5150     ∑∑x² = 24275
I. Statement of the Problem
Is there a difference in the sales of Oriental Milk Tea 1, Oriental Milk Tea 2, and
Oriental Milk Tea 3 in Oriental Mindoro for the first 5 months of their operations?
Statement of Null Hypothesis
There is no difference in the sales of Oriental Milk Tea 1, Oriental Milk Tea 2,
and Oriental Milk Tea 3 in Oriental Mindoro for the first 5 months of their
operations.
II. Level of Significance
α= 5%
III. Critical Value
df= [ k −1, N −k ] = (3-1), (15-3) = 2, 12
Tabular ƒ value = 3.88
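IV. Computation
Formula: $SST = \sum \sum x^2 - \frac{T^2}{N}$
$SS_{tr} = \sum \left[ \frac{(t_{.j})^2}{m_{.j}} \right] - \frac{T^2}{N}$
$SSE = SST - SS_{tr}$
$MS_{tr} = \frac{SS_{tr}}{k - 1}$
$MSE = \frac{SSE}{N - k}$
$ƒ = \frac{MS_{tr}}{MSE}$
where: T = grand total, N = total number of observations, t.j = total of group j,
m.j = number of observations in group j, k = number of groups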
Solution:
$SST = 24275 - \frac{(575)^2}{15} = 24275 - 22041.7 = 2233.3$
$SS_{tr} = \left[ \frac{(200)^2 + (225)^2 + (150)^2}{5} \right] - \frac{330625}{15} = 22625 - 22041.7 = 583.3$
SSE = 2233.3 - 583.3 = 1650
ANOVA TABLE
Source of Variation    Sum of Squares (SS)    Degrees of Freedom (DF)    Mean Square (MS)       F-value
Treatment              583.3                  (k-1) = 2                  583.3/2 = 291.65       291.65/137.5 = 2.1211
Error                  1650                   (N-k) = 12                 1650/12 = 137.5
Total                  2233.3                 (N-1) = 14
V. Interpretation
Since the computed ƒ-value of 2.1211 is less than the tabular ƒ-value of
3.88 at the 5% level of significance with df = 2, 12, the null hypothesis is not
rejected. There is no significant difference in the sales of the three branches of
Oriental Milk Tea in Oriental Mindoro for the first 5 months of their operations.
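The sums of squares and the ƒ-value can be reproduced with a short Python sketch. The dictionary `groups` and the other names are assumptions for this illustration. Because the code does not round the intermediate values, the last digit of ƒ may differ slightly from the hand computation.

```python
groups = {
    "OMT 1": [30, 40, 35, 50, 45],
    "OMT 2": [40, 60, 25, 45, 55],
    "OMT 3": [20, 45, 15, 30, 40],
}

values = [x for sales in groups.values() for x in sales]
N, k = len(values), len(groups)

correction = sum(values) ** 2 / N  # T^2 / N
sst = sum(x * x for x in values) - correction
sstr = sum(sum(s) ** 2 / len(s) for s in groups.values()) - correction
sse = sst - sstr

ms_tr = sstr / (k - 1)
mse = sse / (N - k)
f_value = ms_tr / mse

print(round(sst, 1), round(sstr, 1), round(sse, 1))
print(round(ms_tr, 2), round(mse, 2), round(f_value, 4))
```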
CHI-SQUARE
Problem: A public opinion poll surveyed a simple random sample of 1000 voters.
Respondents were classified by gender (male or female) and by voting preference
(Republican, Democrat, or Independent). Results are shown in the table below.
                Voting Preferences
                Rep     Dem     Ind     Row total
Male            200     150     50      400
Female          250     300     50      600
Column total    450     450     100     1000
I. Statement of the Problem
Is there a relationship between the gender of the 1000 voters and their
voting preference?
Statement of Null Hypothesis
There is no relationship between the gender of the 1000 voters and their
voting preference.
II. Level of significance
α = 5%
III. Critical Value
df = (r - 1)(c - 1) = (2 - 1)(3 - 1) = 2
Tabular χ² value = 5.991
IV. Computation
Formula: $E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$
$\chi^2 = \sum \frac{(O - E)^2}{E}$
where: O = the observed frequency
E = the expected frequency
∑ = the "sum of"
Observed Frequency
                Republican    Democrat    Independent    Row Total
Male            200           150         50             400
Female          250           300         50             600
Column Total    450           450         100            1000
Expected Frequency
                Republican    Democrat    Independent    Row Total
Male            180           180         40             400
Female          270           270         60             600
Column Total    450           450         100            1000
Chi-square contributions, (O - E)²/E
                Republican    Democrat    Independent    Row Total
Male            2.2222        5.0000      2.5000         9.7222
Female          1.4815        3.3333      1.6667         6.4815
Column Total    3.7037        8.3333      4.1667         16.2037
V. Interpretation
Since the computed chi-square (χ²) value of 16.2037 is greater than the
tabular chi-square (χ²) value of 5.991 at the 5% level of significance with
df = 2, the null hypothesis is rejected. There is a significant relationship
between the gender of the 1000 voters and their voting preference.
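The expected frequencies and the chi-square statistic can be computed from the observed table alone, as in the Python sketch below. The nested list `observed` follows the row order of the problem table (male first, then female); all names are illustrative.

```python
# Observed frequencies: rows are male and female; columns are Rep, Dem, Ind.
observed = [
    [200, 150, 50],
    [250, 300, 50],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
expected = []
for i, row in enumerate(observed):
    expected_row = []
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # E = row total x column total / grand total
        expected_row.append(e)
        chi_square += (o - e) ** 2 / e
    expected.append(expected_row)

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(expected)
print(round(chi_square, 4), df)
```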
Pearson’s R
Problem: Red Marasigan, an AB History student wants to examine the
relationship between the amount of time spent studying for an exam (X) in
hours and the score that a student makes on an exam (Y). Data are shown
below.
X (no. of hours in studying) Y (scores on exam)
3 90
1 80
2 85
4 95
1.5 83
2.5 87
4 95
2 85
5 97
3 90
I. Statement of the Problem
Is there a relationship between the number of hours in studying and the score
that a student makes on an exam?
Statement of Null Hypothesis
There is no relationship between the number of hours in studying and the score
that a student makes on an exam.
II. Level of significance
α= 5%
III. Critical Value
df= n-2
= 10 – 2
=8
Tabular r value = .632
IV. Computation
Formula:
$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$
Solution:
X       Y      X²       Y²      XY
3       90     9        8100    270
1       80     1        6400    80
2       85     4        7225    170
4       95     16       9025    380
1.5     83     2.25     6889    124.5
2.5     87     6.25     7569    217.5
4       95     16       9025    380
2       85     4        7225    170
5       97     25       9409    485
3       90     9        8100    270
∑X = 28    ∑Y = 887    ∑X² = 92.5    ∑Y² = 78967    ∑XY = 2547
Substitute in the formula:
$r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$
$= \frac{10(2547) - (28)(887)}{\sqrt{[10(92.5) - (28)^2][10(78967) - (887)^2]}}$
$= \frac{25470 - 24836}{\sqrt{[925 - 784][789670 - 786769]}}$
$= \frac{634}{\sqrt{(141)(2901)}}$
$= \frac{634}{\sqrt{409041}}$
$= \frac{634}{639.5631}$
r = 0.9913
V. Interpretation
1. Since the computed r-value of 0.9913 is greater than the tabular
r-value of .632 at the 5% level of significance with df = 8, the null
hypothesis is rejected. There is a relationship between the
number of hours in studying and the scores that a student
makes on an exam.
2. The positive r-value of 0.9913 indicates a direct relationship,
which means that as the number of hours in studying increases,
the score that a student makes on an exam also increases, and
as the number of hours in studying decreases, the score that a
student makes on an exam also decreases.
3. The r-value of 0.9913 indicates a very high relationship between the
number of hours in studying and the scores that a student makes on an
exam.
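Pearson's r can be computed from the X and Y columns of the solution table with a short Python sketch. The function name `pearson_r` and the list names are assumptions for this illustration.

```python
import math

hours = [3, 1, 2, 4, 1.5, 2.5, 4, 2, 5, 3]            # X column
scores = [90, 80, 85, 95, 83, 87, 95, 85, 97, 90]     # Y column of the solution table


def pearson_r(xs, ys):
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(hours, scores)
print(sum(hours), sum(scores), sum(x * y for x, y in zip(hours, scores)))
print(round(r, 4))
```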
Spearman’s Rho
Problem: Red Marasigan, an AB History student, wants to examine
the relationship between the amount of time spent studying for an exam (X) in
hours and the score that a student makes on an exam (Y). Data are shown
below.
X (no. of hours in studying) Y (scores on exam)
3 87
1 81
2 85
4 93
1.5 84
2.5 88
4 92
2 86
5 98
3 89
I. Statement of the Problem
Is there a relationship between the number of hours in studying and the score
that a student makes on an exam?
Statement of Null Hypothesis
There is no relationship between the number of hours in studying and the score
that a student makes on an exam.
II. Level of significance
α= 5%
III. Critical Value
df= n
= 10
Tabular ρ value = .564
IV. Computation
Formula:
$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$
Solution:
Rank each variable from highest (1) to lowest (10). Tied values share the
average of the ranks they occupy; for example, the two students who studied
4 hours occupy ranks 2 and 3, so each receives rank 2.5.
X       Rank    Y      Rank    d       d²
3       4.5     87     6       -1.5    2.25
1       10      81     10      0       0
2       7.5     85     8       -0.5    0.25
4       2.5     93     2       0.5     0.25
1.5     9       84     9       0       0
2.5     6       88     5       1       1
4       2.5     92     3       -0.5    0.25
2       7.5     86     7       0.5     0.25
5       1       98     1       0       0
3       4.5     89     4       0.5     0.25
                                       ∑d² = 4.5
Substitute:
$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$
$= 1 - \frac{6(4.5)}{10(10^2 - 1)}$
$= 1 - \frac{27}{990} = 1 - 0.0273 = 0.9727$
V. Interpretations
1. Since the computed rho value of 0.9727 is greater than the
tabular value of .564 at the 5% level of significance with df = 10,
the null hypothesis is rejected. There is a relationship
between the number of hours in studying and the scores that a
student makes on an exam.
2. Since the computed rho value is 0.9727, we can say that there
is a very high relationship between the number of hours in
studying and the scores that a student makes on an exam.
3. Since the computed rho value is positive, we can say that there
is a positive relationship between the number of hours in
studying and the scores that a student makes on an exam.
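The ranking and the rho computation can be sketched in Python. This example assumes the standard Spearman formula, rho = 1 - 6∑d² / (n(n² - 1)), with tied values given the average of their ranks; the helper names `average_ranks` and `spearman_rho` are illustrative.

```python
def average_ranks(values):
    """Rank from highest (1) to lowest; tied values get the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    # ordered.index(v) is the 0-based first position of v, so the tied block
    # occupies ranks index+1 .. index+count, whose mean is index + (count+1)/2.
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman_rho(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


hours = [3, 1, 2, 4, 1.5, 2.5, 4, 2, 5, 3]
scores = [87, 81, 85, 93, 84, 88, 92, 86, 98, 89]

print(average_ranks(hours))
print(average_ranks(scores))
rho = spearman_rho(hours, scores)
print(round(rho, 4))
```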
Regression Analysis
X1 X2 Y X1² X2² Y² X1X2 X1Y X2Y
3 2 11 9 4 121 6 33 22
1 1 8 1 1 64 1 8 8
4 3 9 16 9 81 12 36 27
2 2 10 4 4 100 4 20 20
5 2 12 25 4 144 10 60 24
8 32 64
78 81
70 30
14 56
45 74
∑ X 1 ∑ X 2 ∑ Y = ∑ X 21 ∑❑
= = =