MAS291 Final Project
Group: 7
Members:
Nguyễn Thái Nguyên - QS170069; Lương Hoàng Duy - DE170114; Trần Đinh Khang - DS170082; Bùi Nữ Vân Nhi - DA160062
I. Dataset:
The dataset provided by the teacher records the weight (kg) and height (cm) of 30 female students.
[Table: example rows of the dataset]
The dataset contains only these two variables. In this project we conduct descriptive and
inferential statistical analysis of the given data and then fit a regression model to describe the
mathematical relation between the two variables.
II. Descriptive and Inferential Statistical Analysis:
1. Descriptive Analysis:
Using the Analysis tool provided by Microsoft Excel, a descriptive statistics table was
generated.
Statistic                   Weight (kg)   Height (cm)
Mean                        59.73333      165.6667
Standard Error              1.231701      1.132928
Median                      59.5          166
Mode                        58            170
Standard Deviation          6.746306      6.2053
Sample Variance             45.51264      38.50575
Kurtosis                    0.524973      -0.12615
Skewness                    0.837299      -0.10425
Range                       27            25
Minimum                     50            154
Maximum                     77            179
Sum                         1792          4970
Count                       30            30
Confidence Level (95.0%)    2.519112      2.317097
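Several rows of the table are linked by simple formulas (mean = sum / count, variance = standard deviation squared, standard error = standard deviation / sqrt(count), range = maximum - minimum). The following is a minimal Python sketch that recomputes those rows from the other table values as a consistency check. The dictionary layout and the function name `derived_statistics` are our own illustration, not part of the Excel output.

```python
import math

# Summary values copied from the Excel descriptive statistics table.
summary = {
    "weight": {"sum": 1792, "count": 30, "sd": 6.746306, "min": 50, "max": 77},
    "height": {"sum": 4970, "count": 30, "sd": 6.2053, "min": 154, "max": 179},
}


def derived_statistics(row):
    """Recompute mean, variance, standard error and range from one table column."""
    n = row["count"]
    return {
        "mean": row["sum"] / n,
        "variance": row["sd"] ** 2,
        "standard_error": row["sd"] / math.sqrt(n),
        "range": row["max"] - row["min"],
    }


for name, row in summary.items():
    print(name, derived_statistics(row))
```

The recomputed values agree with the table up to rounding, which suggests the table was copied correctly.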
a. Variable Weight (kg):
The average weight of the students is approximately 59.73 kg, with a standard deviation of
6.75 kg, indicating a moderate amount of variability in the weights. The mode is 58 kg, which is
the weight that appears most often in the dataset.
The skewness value of 0.84 indicates that the weight distribution is positively skewed. This
means that the tail of the distribution extends towards higher weights, suggesting that a few
students are considerably heavier than the majority.
The weights span a range of 27 kg, from a minimum of 50 kg to a maximum of 77 kg. The sample
variance, a measure of dispersion, is approximately 45.51 kg^2.
[Figure: Histogram of weight (kg). Series: Frequency and Cumulative %. Bins: 50, 55.4, 60.8, 66.2, 71.6, More]
The majority of students have weights between 55.4 kg and 71.6 kg, with the highest
frequency in the 55.4 kg to 60.8 kg bin. Most observations are concentrated at the lower
weights, while the higher weight bins have lower frequencies. This agrees with the positive
skewness of 0.84.
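The bin edges of the weight histogram (50, 55.4, 60.8, 66.2, 71.6) are consistent with splitting the range from 50 kg to 77 kg into five bins of equal width 5.4 kg. The sketch below reproduces the edges under that assumption. Whether Excel was actually configured this way is our assumption, and the function name `equal_width_bin_edges` is hypothetical.

```python
def equal_width_bin_edges(minimum, maximum, bins):
    """Return the lower edge of each of `bins` equal-width bins."""
    width = (maximum - minimum) / bins
    # Round to one decimal to match the precision shown on the chart axis.
    return [round(minimum + k * width, 1) for k in range(bins)]


# Weight: minimum 50 kg, maximum 77 kg, five bins (assumed).
print(equal_width_bin_edges(50, 77, 5))
# Height: minimum 154 cm, maximum 179 cm, five bins (assumed).
print(equal_width_bin_edges(154, 179, 5))
```

The same rule also reproduces the height bin edges (154, 159, 164, 169, 174), which supports the assumption.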
b. Variable Height (cm):
The average height of the students is approximately 165.67 cm, with a standard deviation of
6.21 cm, indicating a moderate level of variability around the mean. The height distribution is
nearly symmetrical, as shown by the small negative skewness value of -0.10.
The heights span a range of 25 cm, from the shortest height of 154 cm to the tallest of 179 cm.
The most common height among the students is 170 cm (the mode). The sample variance, a
measure of dispersion, is approximately 38.51 cm^2.
[Figure: Histogram of height (cm). Series: Frequency and Cumulative %. Bins: 154, 159, 164, 169, 174, More]
The most prevalent range of heights is 164 cm to 169 cm. The frequencies are lower in the
taller height ranges. However, the skewness of -0.10 is close to zero, so the distribution is
best described as approximately symmetric rather than clearly skewed.
2. Inferential Statistics:
Hypothesis Testing and Confidence Interval of the mean of a population
Population: All female students
Sample: This dataset
Analysis: Construct a 95% confidence interval (significance level alpha = 5%) for the average
height of all female students. We use the t-distribution because the population standard
deviation sigma is unknown.
C.I. for the average height of all female students
alpha                 5%
n                     30
sample mean x         165.67
sample stdev s        6.21
t(alpha/2, n-1)       2.05
lower bound           163.35
upper bound           167.98
C.I. for average height: 163.35 <= mean height <= 167.98
So after the analysis, we can conclude that a 95% confidence interval for the average height of
all female students based on this data is (163.35, 167.98) cm.
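The interval is obtained from x ± t(alpha/2, n-1) * s / sqrt(n). The following minimal sketch reproduces the calculation with the unrounded table values. The critical value 2.0452 is t(0.025, 29) read from a t-table (the report shows it rounded to 2.05), and the function name `t_confidence_interval` is our own.

```python
import math


def t_confidence_interval(mean, sd, n, t_critical):
    """Two-sided confidence interval for a mean when sigma is unknown."""
    margin = t_critical * sd / math.sqrt(n)
    return mean - margin, mean + margin


# t_critical = t(0.025, 29), taken from a t-table (assumed value 2.0452).
lower, upper = t_confidence_interval(mean=165.6667, sd=6.2053, n=30, t_critical=2.0452)
print(f"95% CI for mean height: ({lower:.2f}, {upper:.2f}) cm")
```

The margin of error (about 2.32 cm) matches the "Confidence Level (95.0%)" row of the descriptive statistics table.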
Research question for Hypothesis Testing: Is the average height of all female students 164 cm?
We test this claim at a significance level of 10% based on the data.
Hypotheses:
Null Hypothesis (H0): The average height of all female students is 164 cm.
Alternative Hypothesis (H1): The average height of all female students is not 164 cm.
alpha                 10%
n                     30
sample mean x         165.67
sample stdev s        6.21
mean0                 164
Test statistic t0     1.471115
t(alpha/2, n-1)       1.70
-t(alpha/2, n-1)      -1.70
Because t0 = 1.47 lies within the acceptance region (-1.70, 1.70), we fail to reject the null
hypothesis. There is not enough evidence at the 10% significance level to conclude that the
average height of all female students differs from 164 cm.
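The test statistic is t0 = (x - mean0) / (s / sqrt(n)). A short sketch of the decision rule follows, using the unrounded sample values from the descriptive table. The critical value 1.699 is t(0.05, 29) from a t-table, and the function name is hypothetical.

```python
import math


def one_sample_t_statistic(mean, mu0, sd, n):
    """t statistic for H0: population mean equals mu0 (sigma unknown)."""
    return (mean - mu0) / (sd / math.sqrt(n))


t0 = one_sample_t_statistic(mean=165.6667, mu0=164, sd=6.2053, n=30)
t_critical = 1.699  # t(0.05, 29) from a t-table, two-sided test at alpha = 10%
reject_h0 = abs(t0) > t_critical
print(f"t0 = {t0:.4f}, reject H0: {reject_h0}")
```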
Hypothesis Testing for population proportion P
Population: All female students
Sample: This dataset
Analysis: Test the claim that 60% of all female students weigh under 65 kg, at the 10%
significance level.
Hypotheses:
Null Hypothesis (H0): The percentage of all female students with weight under 65 kg is
equal to 60%.
Alternative Hypothesis (H1): The percentage of all female students with weight under 65 kg is
not equal to 60%.
alpha                                 10%
p hat                                 0.80
test statistic z0                     2.24
z(alpha/2) = right critical value     1.645
-z(alpha/2) = left critical value     -1.645
Z0 = 2.24 lies outside the acceptance region, so we reject the null hypothesis (H0). The data
provide evidence that the percentage of all female students with weight under 65 kg is
different from 60%.
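The test statistic is z0 = (p hat - p0) / sqrt(p0 * (1 - p0) / n). The sketch below recomputes it and obtains the critical value from the standard normal distribution with Python's `statistics.NormalDist`. For a two-sided test at alpha = 10% the critical value is about 1.645 (1.96 would correspond to alpha = 5%), and z0 exceeds both. The function name `one_proportion_z_test` is our own.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(p_hat, p0, n, alpha):
    """Two-sided z test for H0: population proportion equals p0."""
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z0 = (p_hat - p0) / standard_error
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z0)))
    return z0, z_critical, p_value


z0, z_critical, p_value = one_proportion_z_test(p_hat=0.80, p0=0.60, n=30, alpha=0.10)
print(f"z0 = {z0:.2f}, critical value = {z_critical:.3f}, p-value = {p_value:.4f}")
print("reject H0:", abs(z0) > z_critical)
```

The normal approximation is reasonable here because n * p0 = 18 and n * (1 - p0) = 12 are both well above 5.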
III. Constructing Simple Linear Regression and Analyzing Result:
For Linear Regression analysis, we set the variable height(cm) as “X” and variable weight(kg) as
“Y”. Using Regression Analysis from Analysis Tool in Excel, a summary output was generated:
Regression Statistics
Multiple R            0.875052
R Square              0.765716
Adjusted R Square     0.757349
Standard Error        3.323203
Observations          30
Multiple R measures the strength of the linear relationship between the predictor variable
(height) and the response variable (weight). With a single predictor it equals the absolute
value of the correlation coefficient, so the direction of the relationship is given by the sign of
the slope. Here Multiple R is approximately 0.88 and the slope is positive, which indicates a
strong positive correlation between height and weight.
The R Square value is approximately 0.77, indicating that around 77% of the variability in
weight is explained by height in the regression model.
The adjusted R Square takes into account the number of predictor variables and the sample size.
Its value of approximately 0.76 is only slightly below R Square, as expected for a model with
one predictor and 30 observations.
The standard error, approximately 3.32 kg, estimates the typical distance between the observed
weights and the values predicted by the regression line.
The analysis is based on 30 observations.
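R Square and adjusted R Square follow directly from Multiple R, the sample size n and the number of predictors k: R Square = R^2 and adjusted R Square = 1 - (1 - R^2)(n - 1)/(n - k - 1). A minimal sketch that reproduces the two table values is given below (the function name is our own).

```python
def r_squared_summary(multiple_r, n, predictors=1):
    """Return (R Square, adjusted R Square) for a linear regression."""
    r2 = multiple_r ** 2
    adjusted = 1 - (1 - r2) * (n - 1) / (n - predictors - 1)
    return r2, adjusted


r2, adjusted_r2 = r_squared_summary(multiple_r=0.875052, n=30)
print(f"R Square = {r2:.6f}, Adjusted R Square = {adjusted_r2:.6f}")
```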
ANOVA table:
An ANOVA (analysis of variance) was performed to assess the significance of the regression
model. The results show that the regression model is highly significant (p < 0.001), indicating
a statistically significant linear relationship between the predictor variable (height) and the
response variable (weight).
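For a regression with one predictor, the F statistic of the ANOVA can be recovered from R Square as F = (R^2 / k) / ((1 - R^2) / (n - k - 1)). The sketch below derives it from the Regression Statistics values. The resulting F is our own derivation, not a number copied from the ANOVA output, and the critical value of about 13.5 for F(0.001; 1, 28) is an approximate F-table value that we assume here.

```python
def f_statistic_from_r2(r2, n, predictors=1):
    """Overall F statistic of a linear regression, derived from R Square."""
    df_regression = predictors
    df_residual = n - predictors - 1
    return (r2 / df_regression) / ((1 - r2) / df_residual)


f0 = f_statistic_from_r2(r2=0.765716, n=30)
f_critical = 13.5  # approximate F(0.001; 1, 28) from an F-table (assumption)
print(f"F = {f0:.2f}, significant at 0.1%: {f0 > f_critical}")
```

An F value far above the critical value is consistent with the reported p < 0.001.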
Further examination of the coefficients shows that the estimated intercept is approximately
-97.87 and is statistically significant. It has no direct physical interpretation, because a
height of 0 cm lies far outside the observed range. The coefficient of the predictor variable
(X Variable 1, height) is approximately 0.95 and is also statistically significant. This means
that for every 1 cm increase in height, the predicted weight increases by approximately 0.95 kg.
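In simple linear regression the coefficients can also be obtained from summary statistics: slope = r * s_y / s_x and intercept = mean_y - slope * mean_x. The sketch below applies these formulas to the values reported earlier as a cross-check of the Excel output. The function name `least_squares_line` is hypothetical, and r is taken as positive because the relationship is positive.

```python
def least_squares_line(r, mean_x, sd_x, mean_y, sd_y):
    """Slope and intercept of the least squares line from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# x = height (cm), y = weight (kg); values from the descriptive and regression tables.
slope, intercept = least_squares_line(
    r=0.875052, mean_x=165.6667, sd_x=6.2053, mean_y=59.73333, sd_y=6.746306
)
print(f"slope = {slope:.4f}, intercept = {intercept:.3f}")
```

Both results agree with the Excel coefficients (about 0.95 and -97.87) up to rounding.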
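From the regression analysis above, we obtain a formula for predicting Y (weight, kg) from X
(height, cm): Y = 0.951343*X - 97.87254
The formula is intended for predicting weight from height. Predicting height from weight would
require a separate regression of X on Y.
Example: A student in the class has a height of 170 cm. What is her predicted weight?
Answer: Her predicted weight is Y = 0.951343*170 - 97.87254 = 63.86 (kg)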
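The prediction in the example can be reproduced with a few lines of Python. This is only a sketch of the fitted formula. The constant and function names are our own.

```python
SLOPE = 0.951343       # kg per cm, coefficient of X Variable 1 (height)
INTERCEPT = -97.87254  # kg


def predict_weight(height_cm):
    """Predicted weight (kg) for a given height (cm) from the fitted line."""
    return SLOPE * height_cm + INTERCEPT


print(f"Predicted weight at 170 cm: {predict_weight(170):.2f} kg")
```

The formula should only be applied to heights within the observed range of 154 cm to 179 cm, since the model was fitted on that range.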
We then used the formula to check whether the regression model predicts the weights well
given the heights. The line fit plot is shown below.
[Figure: X Variable 1 Line Fit Plot. Horizontal axis: X Variable 1 (150 to 185); vertical axis: Y (0 to 100); series: Y and Predicted Y]
As we can see, the predicted values are close to the observed values. This means that the
regression model describes the relation between the weights and heights of the 30 examined
female students well.
IV. Conclusion:
In conclusion, this data analysis project focused on examining the heights and weights of 30
female students. The project utilized descriptive statistics to summarize and analyze the data,
inferential statistics to draw conclusions and make predictions, and visual representations to
effectively communicate the findings.
The descriptive analysis revealed that the average height of the female students was
approximately 165.67 cm, with a standard deviation of 6.21 cm, indicating a moderate level of
variability. The weight data showed an average weight of approximately 59.73 kg, with a
standard deviation of 6.75 kg. The height distribution was close to symmetric, while the weight
distribution was moderately positively skewed.
Inferential statistics were used to estimate the average height of all female students and to
test claims about the average height and the proportion of students weighing under 65 kg. The
regression analysis indicated a strong positive correlation between height and weight.
Visual representations, including graphs, tables, and frequency distributions, were created to
enhance the communication of the findings. These visuals aided in effectively presenting the
data, allowing for easier interpretation and comprehension.
Overall, this project successfully analyzed the heights and weights of the female students,
providing valuable insights into their distributions, relationships, and characteristics. The
findings can inform further research, interventions, or decision-making processes related to
height and weight management among female students.