Data Analysis
Descriptive statistics can also be used to summarize data visually before
quantitative methods of analysis are applied. Some important ways of
representing descriptive statistics are as follows:
Frequency Distribution Tables: These can be either simple or
grouped frequency distribution tables. They are used to show the distribution
of values or classes along with the corresponding frequencies. Such tables
are very useful in making charts as well as catching patterns in data.
Graphs and Charts: Graphs and charts represent data in a purely visual
format. They can be used to display percentages, distributions, and
frequencies. Scatter plots, bar graphs, and pie charts are some of the graphs
used in descriptive statistics.
Descriptive Statistics vs Inferential Statistics
Inferential and descriptive statistics are both used to analyze data.
Descriptive statistics helps to describe the data quantitatively while
inferential statistics uses these parameters to make inferences about the
population. The differences between descriptive statistics and inferential
statistics are given below.
(Table comparing Descriptive Statistics and Inferential Statistics: entries not recovered.)
(Questionnaire item lists:)
Causes of drug/alcohol abuse (items 3-5): 3 Adolescence; 4 Stress; 5 Lack of parental guidance
Sources of drugs/alcohol: 1 Fellow students; 2 Chemist; 3 Family members; 4 Teachers; 5 Workers
Substances abused: 1 Tobacco/shisha; 2 Alcohol (beer, changaa); 3 Kuber; 4 Bhang; 5 Cocaine/heroin
Using a Likert scale where 1 is Disagree, 2 is Undecided/not sure, and 3 is
Agree, indicate your opinion on how often students take drugs and alcohol in
secondary schools.
Table 4: How often students take drugs/alcohol
S/N 1 2 3
How often D U A
1 Once in a day
2 Twice in a day
3 Not at all
4 Sometimes
5 Many times/frequently
DATA ANALYSIS
S/N Cause                            D     U     A
1   Peer pressure/peer influence  f  54    11    75
                                  %  38.6  7.8   53.6
2   Availability of drugs         f  60    15    65
                                  %  42.9  10.7  46.4
3   Adolescence                   f  41    19    80
                                  %  29.3  13.8  57.1
4   Stress                        f  72    22    46
                                  %  51.4  15.7  32.9
5   Lack of parental guidance     f  55    16    69
                                  %  39.3  11.4  49.3
KEY:
D- Disagree
U-Undecided
A- Agree
The results show that 53.6% of the respondents agreed that peer pressure
was a cause of drug and alcohol abuse, 38.6% disagreed, and 7.8% were
not sure. Thus, peer pressure needs to be controlled in schools.
The study also indicated that 46.4% of the respondents agreed that the
availability of drugs was a cause of drug and alcohol abuse, 42.9%
disagreed, and 10.7% were not sure. Hence, there is a need to control
students' access to drugs and alcohol.
(Tables showing responses on the sources of drugs/alcohol and on how often students take drugs/alcohol appeared here; their frequency and percentage values were not recovered.)
Analysis of Variance (ANOVA) is a statistical method used to determine
whether the means of different groups are different or similar. It compares
the variances across the means (or averages) of different groups, and it is
used in a range of scenarios to determine whether there is any difference
between the means of those groups.
ANOVA terminology
Dependent variable: This is the item being measured that is theorized to
be affected by the independent variables.
Independent variable/s: These are the items being measured that may
have an effect on the dependent variable.
A null hypothesis (H0): This states that there is no difference between the
groups or means. Depending on the result of the ANOVA test, the null
hypothesis will either be rejected or not rejected.
An alternative hypothesis (H1): This states that there is a difference
between the groups or means.
Factors and levels: In ANOVA terminology, an independent variable is
called a factor which affects the dependent variable. Level denotes the
different values of the independent variable that are used in an experiment.
Null Hypothesis, H0: μ1= μ2 = μ3= ... = μk
Alternative Hypothesis, H1: The means are not equal
Decision Rule: If the test statistic exceeds the critical value, reject the null
hypothesis and conclude that the means of at least two groups are
significantly different.
In our example, we can use one-way ANOVA to compare the effectiveness of
the three different teaching methods (lecture, workshop, and online learning)
on student exam scores. The teaching method is the independent variable
with three groups, and the exam score is the dependent variable.
Null Hypothesis (H₀): The mean exam scores of students across the
three teaching methods are equal (no difference in means).
Alternative Hypothesis (H₁): At least one group’s mean significantly
differs.
The one-way ANOVA test will tell us if the variation in student exam scores
can be attributed to the differences in teaching methods or if it’s likely due
to random chance.
One-way ANOVA is effective when analyzing the impact of a single factor
across multiple groups, making it simpler to interpret. However, it does not
account for the possibility of interaction between multiple independent
variables; in that case, two-way ANOVA becomes necessary.
2. Two-way ANOVA
Two-way ANOVA is used when there are two independent variables, each
with two or more groups. The objective is to analyze how both independent
variables influence the dependent variable.
Let’s assume you are interested in the relationship between teaching
methods and study techniques and how they jointly affect student
performance. The two-way ANOVA is suitable for this scenario. Here we test
three hypotheses:
The main effect of factor 1 (teaching method): Does the teaching
method influence student exam scores?
The main effect of factor 2 (study technique): Does the study
technique affect exam scores?
Interaction effect: Does the effectiveness of the teaching method
depend on the study technique used?
For example, two-way ANOVA could reveal that students using the lecture
method perform better in group study, and those using online learning might
perform better in individual study. Understanding these interactions gives a
deeper insight into how different factors together impact outcomes.
What is an ANOVA Test?
ANOVA stands for Analysis of Variance, a statistical test used to compare the
means of three or more groups. It analyzes the variance within the group
and between groups. The primary objective is to assess whether the
observed variance between group means is larger than the variance within
the groups. If the between-group variance is significantly larger, it
suggests that the differences between the group means are meaningful.
Mathematically, ANOVA breaks down the total variability in the data into two
components:
Within-Group Variability: Variability caused by differences within
individual groups, reflecting random fluctuations.
Between-Group Variability: Variability caused by differences
between the means of the different groups.
The test produces an F-statistic, the ratio of between-group to within-group
variability. If the F-statistic is sufficiently large, it
indicates that at least one of the group means is significantly different from
the others.
To understand this better, consider a scenario where you are asked to assess
a student’s performance (exam scores) based on three teaching methods:
lecture, interactive workshop, and online learning. ANOVA can help us assess
whether the teaching method statistically impacts the student’s exam
performance.
The steps to perform the one-way ANOVA test are given below:
Step 6: Calculate the degrees of freedom of error.
Step 7: Determine the MSB and the MSE.
Step 8: Find the F test statistic.
Step 9: Using the F table for the specified level of significance, α,
find the critical value. This is given by F(α, df1, df2).
Step 10: If F > F(α, df1, df2), then reject the null hypothesis.
EXAMPLE ONE
Teaching Methods
Lecture Workshop Online Learning
80 55 70
85 34 65
78 43 74
83 54 77
Exam scores of four students under each teaching method.
The first step in the process is defining the hypothesis. State the null and
alternative hypotheses:
Null Hypothesis (H₀): The means of exam scores for students across
the three teaching methods are equal.
Alternative Hypothesis (H₁): At least one teaching method has a
different mean exam score.
H0: μ1 = μ2 = … = μk
Before performing ANOVA, ensure that the assumptions are met: normality,
independence, and homogeneity of variances. For simplicity, let's assume all
the assumptions are met.
The F-statistic in one-way ANOVA is the ratio of the mean square between
the groups to the mean square within the groups.
1. Calculate the mean for each group and the overall mean.
To calculate the mean for each teaching method (Ai), divide the sum of the
exam scores for each group by the number of students in that group:
Ai = (Σj xij) / ni
Next, calculate the overall mean (G) by dividing the sum of all the scores
by the total number of students:
G = (Σi Σj xij) / N
Then calculate the sum of squares for each group, i.e. the sum of the squared
deviations of each score from its group mean:
SSi = Σj (xij − Ai)²
After computing, fill this table with the values for easy access.
Summary of students' performance by teaching method:
Lecture: A1 = 81.5; Workshop: A2 = 46.5; Online Learning: A3 = 71.5; overall mean G = 66.5.
Using the equation below, calculate the sum of squares between the groups:
SSB = Σi ni (Ai − G)²
In the equation, ni is the number of students in group i, Ai is the group
mean, and G is the overall mean. Make use of the values from the summary
table for the calculation.
Next, calculate the sum of squares within the groups. It is the summation of
the sum of squares (SSi) for each group:
SSW = Σi SSi
Verify the calculation by checking if the total sum of squares is the addition
of the sum of squares between the groups and the sum of squares within the
group. After verifying, move on to calculating mean squares.
With the values calculated in the previous steps, compute the mean squares:
MSB = SSB / (k − 1) and MSE = SSW / (N − k)
where k is the number of groups and N is the total number of students.
The F-statistic is the ratio of the mean square between the groups to the
mean square within the groups:
F = MSB / MSE
The computed value of the F-statistic is 28.747.
Finally, the p-value is computed using the F-statistic, degree of freedom df,
and F-distribution table.
The p-value is 0.000123, and we would reject the null hypothesis to conclude
that the teaching method significantly affects exam scores.
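The arithmetic above can be reproduced in a few lines of pure Python (no libraries needed); the data are the exam scores from the table in this example.

```python
# One-way ANOVA by hand for the three teaching methods in Example One.
groups = {
    "lecture":  [80, 85, 78, 83],
    "workshop": [55, 34, 43, 54],
    "online":   [70, 65, 74, 77],
}

k = len(groups)                              # number of groups
n = sum(len(g) for g in groups.values())     # total number of observations
grand = sum(x for g in groups.values() for x in g) / n   # overall mean G

# Between-group and within-group sums of squares
ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g)

msb = ssb / (k - 1)   # mean square between groups
mse = ssw / (n - k)   # mean square within groups (error)
f_stat = msb / mse

print(round(f_stat, 3))  # 28.747, matching the value in the text
```

The p-value is then read off an F-distribution table (or computed with software) for F = 28.747 with df1 = 2 and df2 = 9.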
EXAMPLE TWO
The students in each group are randomly assigned to use one of the three
exam prep programs for the next three weeks to prepare for an exam. At the
end of the three weeks, all of the students take the same exam.
To perform a one-way ANOVA on this data, we use the Statology One-Way
ANOVA Calculator.
From the output table we see that the F test statistic is 2.358 and the
corresponding p-value is 0.11385.
Since this p-value is not less than 0.05, we fail to reject the null hypothesis: there is not sufficient evidence to conclude that the mean exam scores differ across the three prep programs.
EXAMPLE THREE
We combine all of this variation into a single statistic, called the F statistic
because it uses the F-distribution. We do this by dividing the variation
between samples by the variation within each sample. This is typically
handled by software; however, there is some value in seeing one such
calculation worked out.
1. Calculate the sample means for each of our samples as well as the
mean for all of the sample data.
2. Calculate the sum of squares of error. Here within each sample, we
square the deviation of each data value from the sample mean. The
sum of all of the squared deviations is the sum of squares of error,
abbreviated SSE.
3. Calculate the sum of squares of treatment. We square the deviation of
   each sample mean from the overall mean and multiply each squared
   deviation by the number of data values in that sample. The sum of all of
   these products is the sum of squares of treatment,
   abbreviated SST.
4. Calculate the degrees of freedom. The overall number of degrees of
freedom is one less than the total number of data points in our sample,
or n - 1. The number of degrees of freedom of treatment is one less
than the number of samples used, or m - 1. The number of degrees of
freedom of error is the total number of data points, minus the number
of samples, or n - m.
5. Calculate the mean square of error. This is denoted MSE = SSE/(n - m).
6. Calculate the mean square of treatment. This is denoted MST =
   SST/(m - 1).
7. Calculate the F statistic. This is the ratio of the two mean squares that
we calculated. So F = MST/MSE.
Software does all of this quite easily, but it is good to know what is
happening behind the scenes. In what follows we work out an example of
ANOVA following the steps as listed above.
Suppose we have four independent populations that satisfy the conditions for
single factor ANOVA. We wish to test the null hypothesis H0: μ1 = μ2 = μ3 =
μ4. For purposes of this example, we will use a sample of size three from
each of the populations being studied. The data from our samples is:
Sample from population #1: 12, 9, 12. This has a sample mean of 11.
Sample from population #2: 7, 10, 13. This has a sample mean of 10.
Sample from population #3: 5, 8, 11. This has a sample mean of 8.
Sample from population #4: 5, 8, 8. This has a sample mean of 7.
We now calculate the sum of the squared deviations from each sample
mean. This is called the sum of squares of error.
For the sample from population #1: (12 − 11)² + (9 − 11)² + (12 − 11)² = 6
For the sample from population #2: (7 − 10)² + (10 − 10)² + (13 − 10)² = 18
For the sample from population #3: (5 − 8)² + (8 − 8)² + (11 − 8)² = 18
For the sample from population #4: (5 − 7)² + (8 − 7)² + (8 − 7)² = 6
Sum of Squares of Treatment
Adding the four values above gives SSE = 6 + 18 + 18 + 6 = 48. The overall
mean of all 12 values is 9, and each sample has three values, so the sum of
squares of treatment is SST = 3[(11 − 9)² + (10 − 9)² + (8 − 9)² + (7 − 9)²]
= 3 × 10 = 30.
Degrees of Freedom
Before proceeding to the next step, we need the degrees of freedom. There
are 12 data values and four samples. Thus the number of degrees of
freedom of treatment is 4 – 1 = 3. The number of degrees of freedom of
error is 12 – 4 = 8.
Mean Squares
MST = SST/(m − 1) = 30/3 = 10, and MSE = SSE/(n − m) = 48/8 = 6.
The F-statistic
The final step of this is to divide the mean square for treatment by the mean
square for error. This is the F-statistic from the data. Thus, for our example F
= 10/6 = 5/3 = 1.667.
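The same worked example can be stepped through in pure Python, mirroring steps 1-7 above:

```python
# Example Three worked out in code: four samples of size three.
samples = [[12, 9, 12], [7, 10, 13], [5, 8, 11], [5, 8, 8]]

m = len(samples)                              # number of samples (4)
n = sum(len(s) for s in samples)              # total data points (12)
means = [sum(s) / len(s) for s in samples]    # sample means: 11, 10, 8, 7
grand = sum(x for s in samples for x in s) / n   # overall mean: 9

# Sum of squares of error: squared deviations from each sample mean
sse = sum((x - mean) ** 2 for s, mean in zip(samples, means) for x in s)

# Sum of squares of treatment: squared deviations of sample means from the
# overall mean, each weighted by that sample's size
sst = sum(len(s) * (mean - grand) ** 2 for s, mean in zip(samples, means))

mse = sse / (n - m)   # 48 / 8 = 6
mst = sst / (m - 1)   # 30 / 3 = 10
f_stat = mst / mse

print(round(f_stat, 3))  # 1.667
```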
EXAMPLE FOUR
Steps in One-Way ANOVA
One-way ANOVA (Analysis of Variance) is a statistical test used to compare
the means of three or more samples to determine if there are significant
differences among them. It is based on the assumption that the samples are
drawn from normally distributed populations with equal variances.
Steps in One-way ANOVA:
1. Specify the null and alternative hypotheses. The null
hypothesis is usually that there is no difference among the means of
the samples, while the alternative hypothesis is that there is at least
one difference among the means of the samples.
2. Select three or more samples from the populations and calculate
the sample means and sizes.
3. Calculate the overall mean of all the samples combined.
4. Calculate the sum of squares within groups (SSW), the sum
of squares between groups (SSB) and the degrees of freedom
for both SSW and SSB.
5. Calculate the F statistic as the ratio of the Mean SSB to the Mean
SSW.
6. Determine the critical value of the F statistic based on the test's
significance level (alpha) and the degrees of freedom for the
numerator and denominator. The degrees of freedom for the
numerator is the number of groups minus 1, while the degrees of
freedom for the denominator is the total sample size minus the
number of groups.
7. Compare the calculated F statistic to the critical value to
determine whether to reject or fail to reject the null hypothesis. If the
calculated F statistic exceeds the critical value, the null hypothesis is
rejected, and the alternative hypothesis is accepted.
Conditions for One-way ANOVA:
To conduct a valid one-way ANOVA, the following conditions must be met:
1. The samples must be drawn randomly from the populations.
2. Each observation in each sample must be independent of the others.
3. The population distributions must approximate a normal
distribution.
4. The population variances must be equal.
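In practice, conditions 3 and 4 can be checked with standard tests. The sketch below uses scipy (assumed to be installed): the Shapiro-Wilk test for normality within each group and Levene's test for equality of variances. The sample values are hypothetical.

```python
# Checking ANOVA conditions on three hypothetical samples.
from scipy.stats import shapiro, levene

groups = [
    [80, 85, 78, 83],   # hypothetical sample 1
    [55, 34, 43, 54],   # hypothetical sample 2
    [70, 65, 74, 77],   # hypothetical sample 3
]

# Normality per group: a large p-value gives no evidence against normality
normality_ps = [shapiro(g).pvalue for g in groups]

# Homogeneity of variances: a large p-value is consistent with equal variances
levene_p = levene(*groups).pvalue

print(normality_ps, levene_p)
```

If either check clearly fails, a transformation of the data or a non-parametric alternative (such as the Kruskal-Wallis test) may be more appropriate.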
Typical Null and Alternate Hypotheses in One-way ANOVA:
The null hypothesis in a one-way ANOVA is that there is no difference among
the means of the samples. This can be expressed as:
H0: μ1 = μ2 = … = μk
Where μ1, μ2, …, μk are the means of the samples.
The alternate hypothesis is the opposite of the null hypothesis: there is at
least one difference among the means of the samples. This can be
expressed as:
H1: μi ≠ μj for at least one pair i ≠ j
Calculating SSW and SSB:
Sum of Square Within (SSW):
The Sum of Squares Within (SSW) measures the variation within the groups.
It is calculated as the sum of the squared differences between each
individual observation and its group mean.
The formula for calculating SSW is:
SSW = Σ(Xi − M)²
Where Xi is an individual observation, and M is the mean of the group that
observation belongs to.
The formula for calculating the Sum of Squares Between (also known as SSB)
is:
SSB = Σ ni(Mi − M)²
Where Mi is the mean of the ith group, M is the grand mean (the mean of all
observations), and ni is the number of observations in the ith group.
To calculate SSB, you would first need to find the mean of each group and
the grand mean. Then, for each group, you would subtract the grand mean
from the group mean and square the result. Finally, you would multiply this
squared difference by the number of observations in the group and sum the
results for all groups to get the SSB.
Calculating F Statistic:
The F statistic in a one-way ANOVA (analysis of variance) is a measure of
how much variation in the data can be attributed to the different groups
(also known as "treatments") compared to the variation within the groups. It
is calculated as the ratio of the Mean SSB (mean sum of squares between
groups) to the Mean SSW (mean sum of squares within groups).
The Mean SSB is a measure of the variation between the group means, and
is calculated as:
Mean SSB = SSB / (k − 1)
Where SSB is the sum of squares between groups, and k is the number of
groups.
The Mean SSW is a measure of the variation within each group, and is
calculated as:
Mean SSW = SSW / (n − k)
Where SSW is the sum of squares within groups, and n is the total number of
observations.
The F statistic is then calculated as:
F = Mean SSB / Mean SSW = [SSB / (k − 1)] / [SSW / (n − k)]
A high F value indicates that there is a significant difference between the
group means, while a low F value indicates that the group means are not
significantly different. The F statistic is used to test the null hypothesis that
there is no significant difference between the group means. If the calculated
F value is greater than the critical F value, the null hypothesis is rejected,
and it is concluded that there is a significant difference between the group
means.
Calculating Critical Values:
The critical values for the F statistic in a one-way ANOVA depend on the
significance level of the test and the degrees of freedom for the numerator
and denominator. The degrees of freedom for the numerator are the number
of groups minus 1 or (k-1), while the degrees of freedom for the denominator
are the total sample size minus the number of groups or (n-k). Using these
two values (significance level and degrees of freedom), you can find out the
value of the critical F statistic using an F Distribution Table.
In addition, you can use statistical software to find out the critical value. In
Excel, you can use =F.INV.RT(probability, deg_freedom1, deg_freedom2)
for the right tail value.
Remember: One-way ANOVA is always a one-tail (right tail) test.
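The same right-tail critical value that Excel returns with =F.INV.RT can be obtained with scipy (assumed to be installed); the degrees of freedom below are illustrative (k = 3 groups, n = 12 observations).

```python
# Right-tail critical value of the F distribution, equivalent to Excel's
# =F.INV.RT(alpha, deg_freedom1, deg_freedom2).
from scipy.stats import f

alpha = 0.05
df_num, df_den = 2, 9          # (k - 1) and (n - k)
critical = f.ppf(1 - alpha, df_num, df_den)
print(round(critical, 3))      # about 4.26
```

If the calculated F statistic exceeds this critical value, the null hypothesis is rejected.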
2. CORRELATION
A correlation is a statistical calculation which describes the nature of the
relationship between two variables (i.e., strong and negative, weak and
positive, statistically significant).
An important thing to remember when using correlations is that a correlation
does not explain causation. A correlation merely indicates that a relationship
or pattern exists, but
it does not mean that one variable is the cause of the other.
For example, you might see a strong positive correlation between
participation in group discussions and students' grades during coursework;
however, the correlation will not tell you whether that participation is the
reason the students' grades were higher.
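Pearson's correlation coefficient r, the usual measure behind such statements, can be computed directly. The sketch below is pure Python, and the participation and grade values are hypothetical.

```python
# Pearson's r from scratch: covariance divided by the product of the
# standard deviations (here as sums of squared deviations).
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

participation = [1, 2, 3, 4, 5]   # hypothetical discussion participation
grades        = [2, 4, 5, 4, 5]   # hypothetical grades
print(round(pearson_r(participation, grades), 3))  # 0.775
```

An r near +1 indicates a strong positive relationship, near −1 a strong negative one, and near 0 little linear relationship; none of these values, by itself, establishes causation.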
3. REGRESSION
Regression is an extension of correlation and is used to determine whether
one variable is a predictor of another variable. A regression can be used to
determine how strong the relationship is between your intervention and your
outcome variables. More importantly, a regression will tell you whether a
variable (e.g., participation in your program) is a statistically significant
predictor of the outcome variable (e.g., 1st class, 2nd class upper, etc.). A
variable can have a positive or negative influence, and the strength of the
effect can be weak or strong. For example, a regression would help you
determine if the length of participation (number of weeks) within the
semester program is actually a predictor of students’ grades at the end of
the semester. Like correlations, causation cannot be inferred from
regression, only prediction.
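A minimal sketch of simple linear regression (ordinary least squares) in pure Python; the weeks-of-participation and grade values are hypothetical.

```python
# Ordinary least squares for one predictor: slope = Sxy / Sxx,
# intercept = mean(y) - slope * mean(x).
def ols_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

weeks  = [1, 2, 3, 4, 5]   # hypothetical weeks of participation
grades = [2, 4, 5, 4, 5]   # hypothetical end-of-semester grades
slope, intercept = ols_fit(weeks, grades)
print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
```

The slope estimates how much the outcome changes per additional week of participation; whether it is a statistically significant predictor would then be judged with a t-test on the slope.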
Other than the above, quantitative data analysis may also be done using
parametric and non-parametric statistics.
Parametric methods make assumptions about the population distribution,
typically that it is normal. Non-parametric methods, on the other hand, are
statistical techniques for which we do not have to make any assumption of
normality about the population we are studying; indeed, these methods do
not depend on the form of the population distribution at all.
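For example, two groups can be compared parametrically with an independent-samples t-test or non-parametrically with the Mann-Whitney U test. The sketch below uses scipy (assumed to be installed) on hypothetical samples.

```python
# Parametric vs non-parametric comparison of two hypothetical samples.
from scipy.stats import ttest_ind, mannwhitneyu

a = [80, 85, 78, 83, 79]
b = [70, 65, 74, 77, 72]

t_stat, t_p = ttest_ind(a, b)      # assumes normal populations
u_stat, u_p = mannwhitneyu(a, b)   # makes no normality assumption
print(t_p, u_p)
```

When the normality assumption is doubtful, the non-parametric test is the safer choice, at the cost of some statistical power.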