Assignment On:
DATA PROCESSING OF A DATASET INCLUDING FREQUENCY
DISTRIBUTION, CENTRAL TENDENCY, DISPERSION, CORRELATION
& REGRESSION ANALYSIS
Course Code: EMBA 502
Course Name: Business Mathematics and Statistics
Course Instructor: Dr. M. Amir Hossain
Group members
S. Tanvir Ahmed 2019-3-91-025
Mirza Alamin 2019-1-91-016
Injamul Hossain 2019-1-91-007
Farhana Ahmed 2019-2-91-001
Shaikh Rayhan Hossain 2019-1-91-013
Source of Data: BASF Bangladesh Limited
(A German-based multinational chemical company)
Plant Location: Tejgaon, Dhaka
Plant Capacity: Maximum 60 MT per day
Type of Data: Daily production of admixture chemicals
Data Collection: Month of October 2019
SL No.  Production Date  Production Hour  Product No.  Product Name  Batch No.  Quantity (MT)
1 01-10-19 6 50424562 MRheobuild 623 DJ33119 25
2 02-10-19 3 50548519 MRheobuild 623 DJ33319 12
3 03-10-19 9 50191350 MPolyheed 8650 DJ33419 35
4 04-10-19 9 50191350 MPolyheed 8650 DJ33519 37
5 05-10-19 4 50548519 MRheobuild 623 DJ33619 14
6 06-10-19 8 50424561 MPolyheed 8632 DJ33719 33
7 07-10-19 9 50548519 MRheobuild 623 DJ33819 37
8 08-10-19 11 50424562 MRheobuild 623 DJ33919 45
9 09-10-19 8 50411740 MPolyheed 8320 DJ34019 32
10 10-10-19 14 50424562 MRheobuild 623 DJ34119 55
11 11-10-19 6 50191350 MPolyheed 8650 DJ34219 25
12 12-10-19 11 50411740 MPolyheed 8320 DJ34319 42
13 13-10-19 6 50424562 MRheobuild 623 DJ34419 22
14 14-10-19 10 50411740 MPolyheed 8320 DJ34519 38
15 15-10-19 14 50411740 MPolyheed 8320 DJ34619 55
16 16-10-19 10 50411740 MPolyheed 8320 DJ34719 38
17 17-10-19 5 51747280 MRheobuild 1100 DJ34819 18
18 18-10-19 7 50424562 MRheobuild 623 DJ34919 28
19 19-10-19 5 50626218 MPolyheed 8395 DJ35019 18
20 20-10-19 11 50424562 MRheobuild 623 DJ35119 45
21 21-10-19 11 56277085 MGlenium ACE 30JP DJ35219 42
22 22-10-19 7 50548519 MRheobuild 623 DJ35319 28
23 23-10-19 8 50544229 MGlenium SKY 8632 DJ35419 32
24 24-10-19 4 50411740 MPolyheed 8320 DJ35519 15
25 25-10-19 6 50411740 MPolyheed 8320 DJ35619 25
26 26-10-19 12 50548519 MRheobuild 623 DJ35719 48
27 27-10-19 9 50660161 MPolyheed 8396 DJ35819 36
28 28-10-19 14 50411740 MPolyheed 8320 DJ35919 55
29 29-10-19 9 50411740 MPolyheed 8320 DJ36019 35
30 30-10-19 9 50626218 MPolyheed 8395 DJ36119 37
31 31-10-19 11 50548519 MRheobuild 623 DJ35319 45
Variable Types:
A variable is a characteristic that can assume any of a set of prescribed values.
Example: Age, Height, Temperature etc.
There are two types of variables:
➢ Qualitative Variable (Attribute)
➢ Quantitative Variable
Qualitative Variable: The characteristic or variable being studied is non-numeric.
Example: Gender, Religion.
Quantitative Variable: The variable can be measured and reported numerically.
Example: Time, age, income, temperature
There are 2 types of Quantitative Variable:
1. Discrete Variable: Can only assume certain values, and there are usually “gaps”
between values.
Example: Number of production batches, number of employees
2. Continuous Variable: Can assume any value within a specific range.
Example: Temperature, weight, height
In this data set, production hour and production quantity are recorded as whole numbers,
so the data are treated as discrete.
Scale of Measurement:
Measurement means assigning numbers or other symbols to characteristics of
objects according to certain prescribed rules.
Ratio level: The interval level with an inherent zero starting point. Differences and ratios are
meaningful at this level of measurement.
Example: Material quantity. The production quantity (MT) in this data set is measured at the ratio level.
Frequency Distribution:
Data Presentation (Frequency Distribution)
a. Class mark (midpoint): The point that divides a class into two equal parts. It is the
average of the lower and upper class limits:
$\text{Class Midpoint} = \dfrac{\text{Lower Limit} + \text{Upper Limit}}{2}$
b. Class interval: For a frequency distribution having classes of the same size, the
class interval is the difference between the upper and lower limits of a class:
$\text{Class Interval} = \text{Upper Limit} - \text{Lower Limit}$
Product Quantity Range Tallies Frequency
10 - 20 lllll 5
20 - 30 llllll 6
30 - 40 lllllllllll 11
40 - 50 llllll 6
50 - 60 lll 3
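The tally above can be reproduced from the raw quantities. The following is a minimal sketch, not the method used to prepare the report: the function name `frequency_distribution` is illustrative, and it assumes each class includes its lower limit and excludes its upper limit. No observation in this data set falls exactly on a class limit, so that assumption does not affect the counts here.

```python
# Daily production quantities (MT) for October 2019, copied from the data table.
quantities = [25, 12, 35, 37, 14, 33, 37, 45, 32, 55, 25, 42, 22, 38, 55, 38,
              18, 28, 18, 45, 42, 28, 32, 15, 25, 48, 36, 55, 35, 37, 45]


def frequency_distribution(values, start=10, width=10, n_classes=5):
    """Count values per class (assumed: lower limit included, upper excluded)."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width
        if 0 <= index < n_classes:
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i])
            for i in range(n_classes)]


table = frequency_distribution(quantities)
for lower, upper, freq in table:
    # Print each class with a tally string and its frequency.
    print(f"{lower} - {upper}  {'l' * freq}  {freq}")
```

The frequencies add up to the 31 daily observations, which is a quick check that no value was left outside the classes.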
Histogram:
A histogram is a graph in which the classes are marked on the horizontal axis and the class
frequencies on the vertical axis. The class frequencies are represented by the heights of the bars
(for equal class intervals), and the bars are drawn adjacent to each other.
Product Quantity Range Frequency
10 - 20 5
20 - 30 6
30 - 40 11
40 - 50 6
50 - 60 3
[Figure: Histogram. Horizontal axis: Product Quantity Range; vertical axis: Frequency]
Frequency Polygon:
A frequency polygon consists of line segments connecting the points formed by plotting the
midpoint and the class frequency of each class. The polygon is then closed by joining it to the
X-axis at the lower limit of the first class and the upper limit of the last class.
Product Quantity Range Mid-Point Frequency
10 - 20 15 5
20 - 30 25 6
30 - 40 35 11
40 - 50 45 6
50 - 60 55 3
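As a small illustration of the definition above, the points to be joined can be listed from the class table. This is a sketch only: `polygon_points` is a hypothetical helper name, and it closes the polygon on the X-axis at the lower limit of the first class and the upper limit of the last class, as the definition states.

```python
# (lower limit, upper limit, frequency) for each class, from the table above.
classes = [(10, 20, 5), (20, 30, 6), (30, 40, 11), (40, 50, 6), (50, 60, 3)]


def polygon_points(classes):
    """Return the (x, frequency) points of the frequency polygon, in order."""
    points = [(classes[0][0], 0)]                        # start on the X-axis
    points += [((lo + hi) / 2, f) for lo, hi, f in classes]  # class midpoints
    points.append((classes[-1][1], 0))                   # end on the X-axis
    return points


points = polygon_points(classes)
for x, f in points:
    print(x, f)
```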
[Figure: Frequency Polygon. Horizontal axis: Mid Level of Production Range; vertical axis: Frequency]
Cumulative Frequency Curve:
A cumulative frequency curve (ogive) is a smooth curve obtained by joining the points formed by
plotting the upper limit (less-than type) or the lower limit (more-than type) of each class against
its cumulative frequency. It is used to determine how many, or what proportion, of the data values
lie below or above a certain value.
Product Quantity Range  Frequency  Cumulative Frequency (CF), less-than type
10 - 20 5 5
20 - 30 6 11
30 - 40 11 22
40 - 50 6 28
50 - 60 3 31
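The cumulative frequencies are running totals of the class frequencies. A minimal sketch using the standard-library `itertools.accumulate` is given here; the more-than-type column is not tabulated in this report and is included only to illustrate the other form mentioned in the definition.

```python
from itertools import accumulate

upper_limits = [20, 30, 40, 50, 60]
frequencies = [5, 6, 11, 6, 3]

# Less-than type: running total from the first class forward.
less_than_cf = list(accumulate(frequencies))

# More-than type: running total from the last class backward.
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

total = less_than_cf[-1]
for limit, cf in zip(upper_limits, less_than_cf):
    # Share of days with production below each upper class limit.
    print(f"less than {limit} MT: {cf} days ({cf / total:.0%})")
```

Reading the less-than column, 22 of the 31 days had production below 40 MT.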
[Figure: Cumulative Frequency Curve. Horizontal axis: Upper Limit of CI (production quantity); vertical axis: Cumulative Frequency]
Central Tendency:
Measure of Central Tendency:
• A measure of central tendency is a single value that summarizes a set of data
• It locates the center of the values
• It is also known as a measure of location or an average
Different Measures of Central Tendency:
• Arithmetic mean
• Median
• Mode
Product Quantity Range  Frequency (f)  Mid Value (x)  fx  Cumulative Frequency (fc)
10 - 20 5 15 75 5
20 - 30 6 25 150 11
30 - 40 11 35 385 22
40 - 50 6 45 270 28
50 - 60 3 55 165 31
Σf = 31 Σfx = 1045
Mean: $\bar{X} = \dfrac{\Sigma fx}{\Sigma f} = \dfrac{1045}{31} = 33.71$
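Median: $L + \left[\dfrac{\frac{N}{2} - f_c}{f_m}\right] \times C$
where L is the lower limit of the median class, N is the total frequency, f_c is the cumulative
frequency of the class before the median class, f_m is the frequency of the median class and C is
the class interval. Here N/2 = 15.5, so the median class is 30 - 40.
$= 30 + \left[\dfrac{\frac{31}{2} - 11}{11}\right] \times 10$
= 30 + [0.409] x 10
= 34.09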
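Mode: $L + \left(\dfrac{\Delta_1}{\Delta_1 + \Delta_2}\right) \times C$
where L is the lower limit of the modal class (30 - 40), Δ1 = 11 − 6 = 5 is the difference between
the frequency of the modal class and that of the class before it, and Δ2 = 11 − 6 = 5 is the
difference between the frequency of the modal class and that of the class after it.
$= 30 + \left(\dfrac{5}{5 + 5}\right) \times 10$
= 30 + (0.5) x 10
= 35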
The results show that Mode > Median > Mean.
So, the frequency distribution is negatively skewed.
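The three grouped-data formulas can be checked with a short script. This is a sketch under stated assumptions: the function names are illustrative, each class midpoint is taken as (lower limit + upper limit) / 2 (so the 50 - 60 class uses 55), and intermediate values are not rounded.

```python
# (lower limit, upper limit, frequency) for each class.
classes = [(10, 20, 5), (20, 30, 6), (30, 40, 11), (40, 50, 6), (50, 60, 3)]


def grouped_mean(classes):
    """Sum of f * midpoint divided by the total frequency."""
    n = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n


def grouped_median(classes):
    """L + ((N/2 - fc) / fm) * C, using the first class that reaches N/2."""
    n = sum(f for _, _, f in classes)
    cumulative = 0
    for lo, hi, f in classes:
        if cumulative + f >= n / 2:
            return lo + (n / 2 - cumulative) / f * (hi - lo)
        cumulative += f


def grouped_mode(classes):
    """L + (d1 / (d1 + d2)) * C, using the class with the highest frequency."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))
    before = freqs[i - 1] if i > 0 else 0
    after = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1, d2 = freqs[i] - before, freqs[i] - after
    lo, hi, _ = classes[i]
    return lo + d1 / (d1 + d2) * (hi - lo)


mean = grouped_mean(classes)
median = grouped_median(classes)
mode = grouped_mode(classes)
print(round(mean, 2), round(median, 2), round(mode, 2))
print("negatively skewed" if mode > median > mean else "not negatively skewed")
```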
Dispersion:
❖ It deals with the spread of the data
❖ A small value of a measure of dispersion indicates that the data are clustered closely around the center
❖ A large value of dispersion indicates that the measure of central tendency is less reliable as a summary of the data
❖ Measures of dispersion:
Absolute Measures: Range, Mean Deviation, Variance, Standard Deviation
Relative Measure: Co-efficient of Variation (CV)
Product Quantity Range  Frequency (f)  Mid Value (x)  fx  |x-x̅|  f|x-x̅|  (x-x̅)²  f(x-x̅)²
10 - 20 5 15 75 18.71 93.55 350.06 1750.30
20 - 30 6 25 150 8.71 52.26 75.86 455.16
30 - 40 11 35 385 1.29 14.19 1.66 18.26
40 - 50 6 45 270 11.29 67.74 127.46 764.76
50 - 60 3 55 165 21.29 63.87 453.26 1359.78
Σf = 31 Σfx = 1045 Σf|x-x̅| = 291.61 Σf(x-x̅)² = 4348.26
Here, $\bar{x} = \dfrac{\Sigma fx}{\Sigma f} = \dfrac{1045}{31} = 33.71$
1) Range = Upper limit of the highest class − Lower limit of the lowest class
= 60 − 10
= 50
So, the observations are spread over a range of 50 MT.
2) Mean Deviation $= \dfrac{\Sigma f|x-\bar{x}|}{\Sigma f} = \dfrac{291.61}{31} = 9.41$
So, on average, the observations deviate from the mean by 9.41 MT.
3) Variance: $S^2 = \dfrac{\Sigma f(x-\bar{x})^2}{\Sigma f} = \dfrac{4348.26}{31} = 140.27$
So, the arithmetic mean of the squared deviations of the observations from the mean is 140.27.
4) Standard Deviation $= \sqrt{S^2} = \sqrt{140.27} = 11.84$
So, the standard deviation of the observations is 11.84 MT.
5) Co-efficient of Variation, $CV = \dfrac{S}{\bar{x}} \times 100 = \dfrac{11.84}{33.71} \times 100 = 35.12\%$
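The dispersion measures can be recomputed from the class table as a check. This is a minimal sketch that assumes the midpoint of each class is (lower limit + upper limit) / 2 and that the variance is divided by Σf, as in the formulas above. It keeps full precision throughout, so the last digit may differ slightly from a hand calculation that rounds each step.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each class.
classes = [(10, 20, 5), (20, 30, 6), (30, 40, 11), (40, 50, 6), (50, 60, 3)]

n = sum(f for _, _, f in classes)
midpoints = [((lo + hi) / 2, f) for lo, hi, f in classes]
mean = sum(f * x for x, f in midpoints) / n

# Range: upper limit of the highest class minus lower limit of the lowest class.
data_range = classes[-1][1] - classes[0][0]

# Mean deviation: frequency-weighted average absolute deviation from the mean.
mean_deviation = sum(f * abs(x - mean) for x, f in midpoints) / n

# Variance and standard deviation (divisor is the total frequency).
variance = sum(f * (x - mean) ** 2 for x, f in midpoints) / n
std_dev = sqrt(variance)

# Coefficient of variation, in percent.
cv = std_dev / mean * 100

print(data_range, round(mean_deviation, 2), round(variance, 2),
      round(std_dev, 2), round(cv, 2))
```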
Here, x = production hour and y = product quantity (MT) for each of the 31 days.
x  y  x-x̅  y-y̅  (x-x̅)²  (y-y̅)²  (x-x̅)(y-y̅)
6 25 -2.58 -8.94 6.66 79.92 23.07
3 12 -5.58 -21.94 31.14 481.36 122.43
9 35 0.42 1.06 0.18 1.12 0.45
9 37 0.42 3.06 0.18 9.36 1.29
4 14 -4.58 -19.94 20.98 397.60 91.33
8 33 -0.58 -0.94 0.34 0.88 0.55
9 37 0.42 3.06 0.18 9.36 1.29
11 45 2.42 11.06 5.86 122.32 26.77
8 32 -0.58 -1.94 0.34 3.76 1.13
14 55 5.42 21.06 29.38 443.52 114.15
6 25 -2.58 -8.94 6.66 79.92 23.07
11 42 2.42 8.06 5.86 64.96 19.51
6 22 -2.58 -11.94 6.66 142.56 30.81
10 38 1.42 4.06 2.02 16.48 5.77
14 55 5.42 21.06 29.38 443.52 114.15
10 38 1.42 4.06 2.02 16.48 5.77
5 18 -3.58 -15.94 12.82 254.08 57.07
7 28 -1.58 -5.94 2.50 35.28 9.39
5 18 -3.58 -15.94 12.82 254.08 57.07
11 45 2.42 11.06 5.86 122.32 26.77
11 42 2.42 8.06 5.86 64.96 19.51
7 28 -1.58 -5.94 2.50 35.28 9.39
8 32 -0.58 -1.94 0.34 3.76 1.13
4 15 -4.58 -18.94 20.98 358.72 86.75
6 25 -2.58 -8.94 6.66 79.92 23.07
12 48 3.42 14.06 11.70 197.68 48.09
9 36 0.42 2.06 0.18 4.24 0.87
14 55 5.42 21.06 29.38 443.52 114.15
9 35 0.42 1.06 0.18 1.12 0.45
9 37 0.42 3.06 0.18 9.36 1.29
11 45 2.42 11.06 5.86 122.32 26.77
Σx = 266 Σy = 1052 Σ(x-x̅)² = 265.55 Σ(y-y̅)² = 4299.87 Σ(x-x̅)(y-y̅) = 1063.16
x̅ = 8.58 y̅ = 33.94
Co-efficient of Correlation:
The Coefficient of Correlation (r) is a measure of the strength of the relationship between
two variables.
It can range from -1.00 to 1.00
Co-efficient of Correlation, $r = \dfrac{\Sigma(x-\bar{x})(y-\bar{y})}{\sqrt{\Sigma(x-\bar{x})^2 \, \Sigma(y-\bar{y})^2}}$
$= \dfrac{1063.16}{\sqrt{265.55 \times 4299.87}}$
$= \dfrac{1063.16}{1068.56}$
= 0.99
By convention, 0.5 ≤ r ≤ 1 indicates a strong positive correlation. So, admixture
production time and production quantity have a strong positive correlation.
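The coefficient can also be computed directly from the 31 paired observations. This is a sketch for checking the hand calculation: `pearson_r` is an illustrative function name, and the sums are taken from the raw data without intermediate rounding, so they may differ slightly from hand-rounded table values.

```python
from math import sqrt

# Production hour (x) and product quantity in MT (y), in date order.
hours = [6, 3, 9, 9, 4, 8, 9, 11, 8, 14, 6, 11, 6, 10, 14, 10,
         5, 7, 5, 11, 11, 7, 8, 4, 6, 12, 9, 14, 9, 9, 11]
quantities = [25, 12, 35, 37, 14, 33, 37, 45, 32, 55, 25, 42, 22, 38, 55, 38,
              18, 28, 18, 45, 42, 28, 32, 15, 25, 48, 36, 55, 35, 37, 45]


def pearson_r(x, y):
    """Return (r, Sxx, Syy, Sxy) for paired samples x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy), sxx, syy, sxy


r, sxx, syy, sxy = pearson_r(hours, quantities)
print(round(sxx, 2), round(syy, 2), round(sxy, 2), round(r, 2))
```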
Coefficient of Determination:
The Coefficient of Determination (r²) is the proportion of the total variation in the dependent
variable y that is explained, or accounted for, by the variation in the independent variable
x.
The coefficient of determination is the square of the coefficient of correlation.
It can range from 0 to 1.00
So, Co-efficient of Determination, r² = (0.99)²
= 0.98
So, 98% of the variation in admixture production quantity can be explained by the variation in
production time.
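The two descriptions of r² given above (the square of r, and the share of variation in y explained by x) can be compared numerically. The sketch below assumes a least-squares line of y on x for the "explained" part; the variable names are illustrative. Because r is not rounded before squaring here, the result is slightly higher than the hand figure, which squares the rounded value 0.99.

```python
hours = [6, 3, 9, 9, 4, 8, 9, 11, 8, 14, 6, 11, 6, 10, 14, 10,
         5, 7, 5, 11, 11, 7, 8, 4, 6, 12, 9, 14, 9, 9, 11]
quantities = [25, 12, 35, 37, 14, 33, 37, 45, 32, 55, 25, 42, 22, 38, 55, 38,
              18, 28, 18, 45, 42, 28, 32, 15, 25, 48, 36, 55, 35, 37, 45]

n = len(hours)
mean_x, mean_y = sum(hours) / n, sum(quantities) / n
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in quantities)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, quantities))

# Definition 1: square of the correlation coefficient.
r_squared = sxy ** 2 / (sxx * syy)

# Definition 2: 1 - (unexplained variation / total variation),
# where the unexplained part is measured around the least-squares line.
b = sxy / sxx
a = mean_y - b * mean_x
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, quantities))
explained_share = 1 - unexplained / syy

print(round(r_squared, 4), round(explained_share, 4))
```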
Scatter Diagram:
➢ A plot of the paired observations of X and Y on a graph
➢ Graphically shows the relationship between two variables
➢ Common practice is to place the dependent variable on Y–axis and independent
variable on X–axis
Here, dependent variable is production quantity (Y) and independent variable is
production time (X)
[Figure: Scatter Diagram. Horizontal axis: Production Time; vertical axis: Production Quantity]
Regression Model:
➢ In regression analysis, an equation is developed to express the relationship
between the dependent and independent variables
➢ In simple linear regression, the equation is linear
Purpose: To determine the regression equation, which is used to predict the value of the
dependent variable (Y) based on the independent variable (X).
General form of linear regression model:
Y = a + bX
Where Y is the dependent variable, X is the independent variable, b is the regression
co-efficient (slope) and a is the intercept term.
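$b = \dfrac{\Sigma(x-\bar{x})(y-\bar{y})}{\Sigma(x-\bar{x})^2} = \dfrac{1063.16}{265.55} = 4.00$
$a = \bar{Y} - b\bar{X}$
= 33.94 − 4.00 × 8.58
= 33.94 − 34.32
= −0.38
So, the Linear Regression Model is Y = −0.38 + 4.00 X
That is, each additional production hour is associated with about 4 MT of additional output.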
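The regression coefficients can be recomputed from the raw data and then used for prediction. This is a minimal sketch: `fit_line` and `predict` are hypothetical helper names, and the 10-hour input is only an illustrative value. The script does not round intermediate results, so its intercept differs slightly from a hand calculation based on two-decimal figures.

```python
hours = [6, 3, 9, 9, 4, 8, 9, 11, 8, 14, 6, 11, 6, 10, 14, 10,
         5, 7, 5, 11, 11, 7, 8, 4, 6, 12, 9, 14, 9, 9, 11]
quantities = [25, 12, 35, 37, 14, 33, 37, 45, 32, 55, 25, 42, 22, 38, 55, 38,
              18, 28, 18, 45, 42, 28, 32, 15, 25, 48, 36, 55, 35, 37, 45]


def fit_line(x, y):
    """Least-squares intercept a and slope b for the model Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sxx = sum((p - mean_x) ** 2 for p in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, x_value):
    """Predicted production quantity (MT) for a given production time."""
    return a + b * x_value


a, b = fit_line(hours, quantities)
print(f"Y = {a:.2f} + {b:.2f} X")

# Illustrative use: expected quantity for a 10-hour production day.
print(round(predict(a, b, 10), 2))
```

Predictions are most trustworthy within the observed range of production time (3 to 14 hours).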