Data Analysis and Techniques
By
Abrha Gebregergs (Assistant Professor)
Planning for Analysis
Type of Data
Type of Formatting
Type of Analysis
Three Steps to Data Analysis
Analyze Results
Communicate Findings
Use Findings for Improvement
Two Kinds of Data
Qualitative
Quantitative
Qualitative Data
Narratives, logs, experience
Focus groups
Interviews
Open-ended survey items
Diaries and journals
Notes from observations
Quantitative Data
Data that is numerical, counted, or compared on a
scale
Demographic data
Answers to closed-ended survey items
Attendance data
Scores on standardized instruments
Qualitative Data Analysis
Qualitative data analysis involves the systematic examination
and interpretation of non-numerical data, such as text,
images, audio, or video.
Principles to consider when conducting qualitative data
analysis:
1. Start by immersing yourself in the data to gain a deep
understanding of its content and context.
2. Identify codes and themes: Develop a coding framework
to categorize and label different aspects of the data. Use
inductive coding to allow new themes to emerge from the
data, but also consider deductive coding if you have
pre-existing theories or concepts to guide your analysis.
3. Maintain rigor and transparency: Ensure transparency in
your data analysis process by documenting your decisions,
methods, and interpretations.
4. Seek patterns and connections: Look for patterns,
5. Organize the data into smaller units and then into
categories (based on major points). Qualitative data analysis
typically follows an inductive approach, where themes,
patterns, or categories emerge from the data itself rather
than being pre-determined.
6. Interpret and contextualize the data: Move beyond
description and aim for deeper interpretation of the data.
Consider the broader context, relevant theories, and
existing literature to provide meaningful explanations and
insights. Be open to alternative interpretations and mul-
7. Triangulation: Use multiple data sources, methods, or
researchers to validate and strengthen your findings.
Triangulation helps ensure the reliability and validity of
your analysis by incorporating different viewpoints and
sources of evidence.
8. Iterative analysis: Qualitative data analysis is often an
iterative process. It involves going back and forth between
the data, codes, themes, and interpretations. Continuously
refine and revise your analysis as new insights emerge or
as you gain a deeper understanding of
9. Analyze negative cases: Examine instances or data points
that do not fit the expected patterns or themes identified
in the analysis, and reflect the perspectives they represent.
Negative cases can provide valuable insights and challenge
assumptions, leading to a more comprehensive understanding
of the research topic.
10. Synthesize the patterns into a grounded theory. Grounded
theory aims to generate theories or explanations directly
from the data, rather than starting with pre-existing theories or
Steps in Qualitative Data Analysis
1. Familiarization with the data.
2. Data organization.
3. Coding.
4. Category development: Once a substantial number of
codes have been assigned, begin grouping similar codes
into categories.
5. Theme identification: Analyze the categories and look
for patterns or themes that cut across the data. Themes
are central concepts or ideas that capture the main
findings or insights. They may emerge from the categories
or be identified through a process of constant comparison
and reflection.
6. Data exploration and interpretation: Engage in a deeper
exploration and interpretation of the data. Look for
connections, relationships, or contradictions within and be-
7. Triangulation: Consider using triangulation to enhance
the credibility and validity of the analysis. Triangulation
involves comparing and contrasting different sources of
data, methods, or perspectives to ensure consistency and
reliability in the findings.
8. Synthesis and reporting: Finally, synthesize the findings
into a coherent and compelling narrative. Present the
results in a clear and organized manner, using quotes or
citations from the data to support the themes and
interpretations, then develop ground
Grounded Theory Analysis Strategies
Grounded theory:
A qualitative research methodology that aims to develop
theories or concepts grounded in the data itself.
A process of constructing various data
An inductive process of collecting, analyzing, and comparing
data systematically.
Theory is grounded in data to explain the phenomena.
The main purpose is to develop theory through understanding
concepts that are related by means of statements of
relationships.
Iterative and flexible process: Grounded theory involves
moving back and forth between data collection and analysis.
An interactional method of theory building by comparing
and analyzing the data.
Saturation: Saturation is an important concept in
grounded theory. It refers to the point at which new data
Three steps in the grounded theory analytic process:
1. Open coding:
Break the data into small parts, compare them for
similarities and differences, and explain the meanings of
the data by focusing on “who, when, where, what, how
much, why” (ask questions to get a clear story).
2. Axial coding:
After open coding, make connections (sort) between
categories and confirm or disconfirm your hypotheses.
3. Selective coding:
Select the core category (matching the hypotheses) and
explain the minor category (against the hypotheses) with
additional supporting data.
Interpretation Issues in Qualitative Data
Analysis
A. Triangulating Data
Use multiple methods and data sources to support the
strength of interpretations and conclusions
Ex: semi-structured interviews, consent form,
grounded theory
B. Audits
Questions to examine the data for interpretations and
conclusions
1. Is sampling appropriate to ground the findings?
2. Are coding strategies applied correctly?
3. Is the category process appropriate?
4. Do the results link hypotheses? (examine literature
review)
5. Are the negative cases explained? (minority’s voice)
Quantitative Data Analysis
You should choose a level of analysis that is appropriate
for your research question
You should choose the type of statistical analysis
appropriate for the variables you have:
Nominal/Categorical, Ordinal, or Continuous
Quantitative data analysis
Key points
Data must be analysed to produce information
Computer software analysis is normally used for this
process
Data should be carefully prepared for analysis
Researchers need to know how to select and use different
charting and statistical techniques
Quantitative data analysis
Main concerns
Preparing, inputting and checking data
Choosing the most appropriate statistics to describe
the data
Choosing the most appropriate statistics to examine
data relationships and trends
Preparing, inputting and checking data
Main considerations
Type of data (scale of measurement)
Data format for input to analysis software
Impact of data coding on subsequent analyses
Case weighting
Methods for error checking
Scales of Measurement
Data
  Qualitative
    Numerical: Nominal, Ordinal
    Nonnumerical: Nominal, Ordinal
  Quantitative
    Numerical: Interval, Ratio
Nominal (categorical)
• Data are labels or names used to identify an attribute of
the element.
• A nonnumeric label or numeric code may be used.
Ex: Gender: Male, Female; Eye color: Blue, Brown;
Marital status: Single, Married, Divorced; Ethnicity:
African, Asian.
Ordinal
• The data have the properties of nominal data and the
order or rank of the data is meaningful.
Ex: Educational attainment: Elementary school,
High school, Bachelor's degree, Master's degree, Doctorate.
Likert scale ratings: Strongly disagree, Disagree,
Neutral, Agree, Strongly agree.
Satisfaction levels: Very dissatisfied, Dissatisfied,
Neutral, Satisfied, Very satisfied.
Socioeconomic status: Low income, Middle income,
High income
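Because ordinal categories have a meaningful order but no fixed spacing between them, a rank-based summary such as the median is a natural fit. The sketch below is only an illustration: the responses are hypothetical, and mapping each label to its position in the scale is one common convention, not something prescribed by these slides.

```python
from statistics import median_low

# The Likert scale from the slide, listed from lowest to highest rank.
LIKERT_ORDER = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

# Hypothetical survey responses (assumed for illustration only).
responses = ["Agree", "Neutral", "Strongly agree", "Disagree", "Agree", "Agree", "Neutral"]

# Map each label to its rank so that the order can be used in calculations.
rank = {label: position for position, label in enumerate(LIKERT_ORDER)}
ranks = sorted(rank[r] for r in responses)

# median_low always returns an actual data value, so it maps back to a label.
median_label = LIKERT_ORDER[median_low(ranks)]
print("Median response:", median_label)
```

A mean of the ranks would assume equal spacing between categories, which ordinal data do not guarantee.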
Interval
• Interval data is a type of quantitative data that has a
consistent and equal interval between values. It represents
measurements on a continuous scale where the difference
between any two values is meaningful and consistent.
• Interval data are always numeric.
Ex: the time of day read from a clock:
12:00 PM, 1:00 PM, 2:00 PM, 3:00 PM
In this example, the interval between each recorded time
is consistent and equal (1 hour), but there is no true zero
point on the clock. The value of 12:00 AM could be
considered as a reference point, but it does not indicate the
Ratio
• The data have all the properties of interval data and the
ratio of two values is meaningful.
• Variables such as distance, height, weight, and time use
the ratio scale.
• This scale must contain a zero value that indicates that
nothing exists for the variable at the zero point.
• Ex: The load applied during tensile testing can be considered
ratio data, as it has a true zero point (no load) and equal
intervals between values. The strain or elongation of the
material can also be considered ratio data, as it represents
a continuous variable with a true zero point (no elongation)
and equal intervals between values.
Quantitative Levels of Analysis
• Univariate - simplest form, describe a case in
terms of a single variable.
• Bivariate - subgroup comparisons, describe a case
in terms of two variables simultaneously.
• Multivariate - analysis of two or more variables
simultaneously.
Statistical Data Analysis techniques
Statistics is the science of collecting, organizing,
summarising, analysing, and drawing conclusions from data.
Descriptive statistics: includes collecting, organizing,
summarising, analysing, and presenting data.
Ex: Mean, Median, Mode
Inferential statistics: includes making inferences,
hypothesis testing, determining relationships, and making
predictions.
Ex: Regression analysis, t-test, Chi², ANOVA
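The three descriptive statistics named above (mean, median, mode) can be computed with Python's standard `statistics` module. This is a minimal sketch; the readings are hypothetical values chosen for illustration and do not come from the slides.

```python
import statistics

# Hypothetical breaking-force readings in newtons (assumed data).
readings = [1480, 1500, 1500, 1520, 1550]

mean_value = statistics.mean(readings)      # arithmetic average
median_value = statistics.median(readings)  # middle value of the sorted data
mode_value = statistics.mode(readings)      # most frequent value

print("mean:", mean_value, "median:", median_value, "mode:", mode_value)
```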
Variables
Quantitative: Discrete, Continuous
Qualitative: Ordinal, Categorical
Parametric vs. non-parametric tests
Parametric: decision-making method where the
distribution of the sampling statistic is known
Ex: t-tests, analysis of variance (ANOVA), and linear
regression.
Non-parametric: decision-making method which
does not require knowledge of the distribution of
the sampling statistic
Ex: chi-square test
t-Test
A t-test is a statistical test used to determine if
there is a significant difference between the
means of two groups.
The t-test assumes that the data are normally distributed
and measured on a continuous scale, such as interval or
ratio data.
Only two variables are required.
Two-tailed test:
used when the alternative hypothesis is not specific about
the direction of the difference or effect.
It is interested in determining if there is a significant
difference between the two groups, regardless of whether one
group has a higher or lower mean than the other.
One-tailed test:
used when the alternative hypothesis specifies the direction
of the difference or effect. Ex: Value1 is greater than
Value2.
It is interested in determining if there is a significant differ-
Example
1. Let's say you work in a manufacturing company that
produces steel cables. You want to compare the tensile
strength of two different types of steel cables, Cable A
and Cable B, to determine if there is a significant
difference in their strength. To ensure a fair comparison,
you randomly select 15 samples of Cable A and 15
samples of Cable B. Each sample is carefully prepared
and subjected to a standardized tensile test to measure
the maximum force it can withstand before breaking.
Test result:
For Cable A: Mean (μA) = 1500 Newtons
Standard Deviation (σA) = 100 Newtons
For Cable B:
Mean (μB) = 1600 Newtons
Standard Deviation (σB) = 120 Newtons
Solution: (two tailed test)
Step 1: State the hypotheses. The null hypothesis (H0) assumes
that there is no significant difference between the mean
tensile strength of Cable A and Cable B. The alternative
hypothesis (Ha) is that the two means differ.
Step 2: Calculate the mean and standard deviation for
each group.
For Cable A: Mean (μA) = 1500 Newtons, Standard Deviation
(σA) = 100 Newtons
For Cable B: Mean (μB) = 1600 Newtons, Standard Deviation
(σB) = 120 Newtons
Step 3: Calculate the pooled standard deviation (sp) using the
formula:
sp = sqrt(((nA - 1) * σA^2 + (nB - 1) * σB^2) / (nA + nB - 2))
nA = 15 (sample size for Cable A)
nB = 15 (sample size for Cable B)
σA = 100 Newtons (standard deviation for Cable A)
σB = 120 Newtons (standard deviation for Cable B)
sp = sqrt(((15 - 1) * 100^2 + (15 - 1) * 120^2) / (15 + 15 - 2))
sp = sqrt((14 * 10000 + 14 * 14400) / 28)
sp = sqrt(341600 / 28) = sqrt(12200)
sp = 110.45
Step 4: Calculate the t-value using the formula:
t = (μA - μB) / (sp * sqrt(1/nA + 1/nB))
t = (1500 - 1600) / (110.45 * sqrt(1/15 + 1/15))
t = -100 / (110.45 * sqrt(0.0667 + 0.0667))
t = -100 / (110.45 * sqrt(0.1333))
t = -100 / (110.45 * 0.3651)
t = -100 / 40.33
t = -2.48
Step 5: Determine the degrees of freedom (df) using the
formula:
df = nA + nB - 2
df = 15 + 15 - 2
df = 28
Step 6: Determine the critical t-value associated with an
assumed significance level of 0.05 and the degrees of freedom,
using a t-distribution table or statistical software.
For a two-tailed test at a significance level of 0.05 with 28
degrees of freedom, the critical t-value is approximately ±2.048.
Since the calculated t-value of -2.48 is less than the critical
t-value of -2.048, we reject the null hypothesis and conclude
that there is a significant difference in the tensile strength
between Cable A and Cable B.
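The pooled two-sample t-test steps can be checked with a short script. This is a minimal sketch using only the summary statistics from the example; the function name `pooled_t_test` is an illustrative choice, and the critical value is the table value quoted in the example rather than something the script computes.

```python
import math

def pooled_t_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (pooled SD, t statistic, degrees of freedom) for two independent samples."""
    df = n_a + n_b - 2
    # Pooled standard deviation: weighted average of the two sample variances.
    sp = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df)
    # t statistic: difference in means divided by its standard error.
    t = (mean_a - mean_b) / (sp * math.sqrt(1 / n_a + 1 / n_b))
    return sp, t, df

# Cable A and Cable B summary statistics from the example.
sp, t, df = pooled_t_test(1500, 100, 15, 1600, 120, 15)

T_CRITICAL = 2.048  # two-tailed, alpha = 0.05, df = 28 (t-table value from the example)
reject_h0 = abs(t) > T_CRITICAL

print(f"sp = {sp:.2f}, t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```

Comparing `abs(t)` with the positive critical value is equivalent to checking both tails of the two-tailed test.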
chi² test
Used to test whether qualitative variables are
associated
Used for categorical data
Requirement
Data should be in the form of frequencies (counts)
If the calculated chi² test statistic is larger than the critical
value from the chi-square distribution table or if the p-value
associated with the test statistic is smaller than the chosen
significance level (α), we reject the null hypothesis and con-
EX: Determine if there is a relationship between the type of
material used in a mechanical component and its failure
mode. Data on 100 mechanical components is taken and the
type of material used (steel or aluminum) and the failure
mode (brittle or ductile) is recorded.
Data        Brittle   Ductile
Steel       30        20
Aluminum    10        40
Step 1: Set up hypotheses:
- Null hypothesis (H0): There is no relationship between the
type of material and failure mode.
- Alternative hypothesis (Ha): There is a relationship between
the type of material and failure mode.
Step 2: Calculate expected frequencies: We need to calculate
the expected frequencies under the assumption of
independence. The expected frequency for each cell can be
calculated as (row total * column total) / grand total.
            Brittle            Ductile            Total
Steel       50*40/100 = 20     50*60/100 = 30     50
Aluminum    50*40/100 = 20     50*60/100 = 30     50
Total       40                 60                 100
Step 3: Calculate the test statistic (χ²): The test statistic is
calculated as the sum of the squared differences between the
observed and expected frequencies, divided by the expected
frequencies.
χ² = Σ [(O - E)² / E], where O = observed frequency and E = expected frequency
χ² = [(30-20)²/20] + [(20-30)²/30] + [(10-20)²/20] + [(40-30)²/30]
χ² = 5 + 3.33 + 5 + 3.33
χ² ≈ 16.67
Step 4: Determine the critical value and p-value: The critical
value or the p-value is obtained from the chi-square
distribution table or using statistical software. Let's assume
we are using a significance level (α) of 0.05 with 1 degree of
freedom (df = (rows-1) * (columns-1) = (2-1) * (2-1) = 1).
From the chi-square distribution table, the critical value for α =
0.05 and df = 1 is approximately 3.84.
Step 5: Make a decision: Since the calculated χ² value exceeds
the critical value, we reject the null hypothesis and conclude
that there is a significant relationship between the type
of material and failure mode.
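The same chi-square calculation can be sketched in a few lines of plain Python. The code below takes the observed table from the example, derives the expected frequencies from the row and column totals, and compares the statistic with the table value of 3.84 quoted above; variable names are illustrative.

```python
# Observed counts from the example: rows are materials, columns are failure modes.
observed = [
    [30, 20],  # steel: brittle, ductile
    [10, 40],  # aluminum: brittle, ductile
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected count under independence: (row total * column total) / grand total.
        e = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
CHI2_CRITICAL = 3.84  # alpha = 0.05, df = 1 (table value from the example)
reject_h0 = chi2 > CHI2_CRITICAL

print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {reject_h0}")
```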
Correlation
Method to study the magnitude of the association and the
functional relationship between two or more variables.
Positive Linear Correlation: the variable on the x-axis
increases as the variable on the y-axis increases.
Negative Linear Correlation: the variable on the x-axis
decreases as the variable on the y-axis increases.
The correlation coefficient (r) is a statistic that tells you
the strength and direction of that relationship. It is
expressed as a number between -1 and 1. The sign gives the
direction, and the magnitude indicates the strength of the
relationship: r = 0 means there is no linear correlation, and
r = ±1 means a perfect linear relationship.
r = Σ((Xi - X_mean) * (Yi - Y_mean)) / sqrt(Σ((Xi - X_mean)²) * Σ((Yi - Y_mean)²))
Correlation
Denotes the strength of the relationship between variables
Ex: Let's calculate the correlation coefficient between two
variables, X and Y, using a sample dataset:
X: [2, 4, 6, 8, 10]
Y: [5, 10, 15, 20, 25]
Solution:
r = Σ((Xi - X_mean) * (Yi - Y_mean)) / sqrt(Σ((Xi - X_mean)²) *
Σ((Yi - Y_mean)²))
X_mean = (2 + 4 + 6 + 8 + 10) / 5 = 6
Y_mean = (5 + 10 + 15 + 20 + 25) / 5 = 15
Now, let's calculate the numerator and denominator of the
correlation coefficient formula:
Numerator:
Σ((Xi - X_mean) * (Yi - Y_mean)) = (2-6)*(5-15) + (4-6)*(10-15)
+ (6-6)*(15-15) + (8-6)*(20-15) + (10-6)*(25-15)
= 40 + 10 + 0 + 10 + 40 = 100
Denominator:
sqrt(Σ((Xi - X_mean)²) * Σ((Yi - Y_mean)²))
= sqrt(((2-6)² + (4-6)² + (6-6)² + (8-6)² + (10-6)²) * ((5-15)² + (10-15)² + (15-15)² + (20-15)² + (25-15)²))
= sqrt(40 * 250) = sqrt(10000) = 100
Now, let's calculate the correlation coefficient:
r = 100 / 100 = 1
The correlation coefficient between X and Y is 1. This value
indicates a perfect positive linear relationship between the two
variables: every point lies on the line Y = 2.5X, so Y increases
whenever X increases.
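The correlation formula translates directly into code. The following is a minimal sketch that applies it to the X and Y values of the example; the helper name `pearson_r` is an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator: sum of products of paired deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

X = [2, 4, 6, 8, 10]
Y = [5, 10, 15, 20, 25]
r = pearson_r(X, Y)
print("r =", r)
```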
Regression
Method that describes a mathematical relationship
between a dependent variable and one or more independent
variables
Simple linear regression and multiple regression are
appropriate for continuous response variables (e.g. weight)
Logistic regression is applicable for a binary response,
such as alive/dead
EX1: The sales of a new car depend on the advertising
expenditure. Data is collected on the advertising expenditure (in
thousands of dollars) and the corresponding sales (in
thousands of units) for 10 different time periods.
Here is the dataset:
Advertising Expenditure (X): [5, 7, 3, 2, 9, 4, 6, 8, 1, 5]
Sales (Y): [15, 20, 10, 8, 25, 12, 18, 22, 6, 14]
Estimate the sales for an advertising expenditure of 6,000
dollars.
Solution:
Step 1: Data Preparation: The data is already organized.
Step 2: Model Selection: We will use simple linear regression
since we have one independent variable (advertising
expenditure) and one dependent variable (sales).
Step 3: Model Estimation: To estimate the regression
coefficients (β0 and β1), we need to use a method like ordinary
least squares (OLS). The formula to estimate the coefficients is:
β1 = Σ((Xi - X_mean) * (Yi - Y_mean)) / Σ((Xi - X_mean)²)
β0 = Y_mean - β1 * X_mean
Using the provided data, we can calculate the coefficients as
follows:
X_mean = (5 + 7 + 3 + 2 + 9 + 4 + 6 + 8 + 1 + 5) / 10 = 5
Y_mean = (15 + 20 + 10 + 8 + 25 + 12 + 18 + 22 + 6 + 14) / 10 = 15
Σ((Xi - X_mean) * (Yi - Y_mean)) = (5-5)*(15-15) + (7-5)*(20-15)
+ ... + (5-5)*(14-15) = 144
Σ((Xi - X_mean)²) = (5-5)² + (7-5)² + ... + (5-5)² = 60
β1 = 144 / 60 = 2.4
β0 = 15 - 2.4 * 5 = 3
The estimated regression equation is: Y = β0 + β1X
Y = 3 + 2.4X
An advertising expenditure of 6,000 dollars corresponds to X = 6,
so the estimated sales are Y = 3 + 2.4 * 6 = 17.4 thousand units
(17,400 units).
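The least-squares estimates for EX1 can be reproduced with a short script. This is a minimal sketch of the two coefficient formulas given above, applied to the advertising data; the function name `fit_simple_ols` is an illustrative choice.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx                     # beta1
    intercept = y_mean - slope * x_mean   # beta0
    return intercept, slope

advertising = [5, 7, 3, 2, 9, 4, 6, 8, 1, 5]    # thousands of dollars
sales = [15, 20, 10, 8, 25, 12, 18, 22, 6, 14]  # thousands of units

b0, b1 = fit_simple_ols(advertising, sales)
predicted = b0 + b1 * 6  # 6,000 dollars is X = 6 in the units of the data

print(f"Y = {b0:.1f} + {b1:.1f}X, predicted sales at X = 6: {predicted:.1f} thousand units")
```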
EX2: You are assigned to predict the fuel efficiency of a car based
on its weight and engine size. You have collected data on 12
different car models and recorded their weight (in tons), engine size
(in liters), and fuel efficiency (in miles per gallon).
Here is the dataset:
Weight (X1): [1.5, 1.8, 1.6, 2.0, 1.7, 1.9, 2.2, 1.4, 1.8, 2.1, 2.3, 1.6]
Engine Size (X2): [1.4, 1.6, 1.5, 1.8, 1.6, 1.7, 2.0, 1.3, 1.6, 2.0, 2.2, 1.5]
Fuel Efficiency (Y): [30, 28, 29, 25, 27, 26, 24, 31, 28, 23, 22, 29]
Estimate the fuel efficiency for a car that weighs 1.9 tons and has an
engine size of 1.7 liters.
Solution
Step 1: Data Preparation
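Step 2: Model Selection: use multiple linear regression since we
have two independent variables (weight and engine size) and
one dependent variable (fuel efficiency).
Step 3: Model Estimation: To estimate the regression
coefficients (β0, β1, and β2), we need to use a method like ordinary
least squares (OLS). Because weight and engine size are correlated,
the two slopes must be estimated jointly rather than from two
separate simple regressions. With
S11 = Σ((Xi1 - X1_mean)²), S22 = Σ((Xi2 - X2_mean)²),
S12 = Σ((Xi1 - X1_mean) * (Xi2 - X2_mean)),
S1y = Σ((Xi1 - X1_mean) * (Yi - Y_mean)),
S2y = Σ((Xi2 - X2_mean) * (Yi - Y_mean)),
the formulas to estimate the coefficients are:
β1 = (S22 * S1y - S12 * S2y) / (S11 * S22 - S12²)
β2 = (S11 * S2y - S12 * S1y) / (S11 * S22 - S12²)
β0 = Y_mean - β1 * X1_mean - β2 * X2_mean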
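The slides stop before computing the EX2 estimate, so the following is a hedged sketch of how the fit could be carried out. It solves the two-predictor least-squares normal equations jointly (both slopes at once, which matters because weight and engine size are correlated) and then evaluates the fitted equation at 1.9 tons and 1.7 liters. The helper names are illustrative, and in practice a statistics package would normally be used instead.

```python
def centered(values):
    """Return the mean and the list of deviations from the mean."""
    mean = sum(values) / len(values)
    return mean, [v - mean for v in values]

def fit_two_predictor_ols(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the 2x2 normal equations."""
    m1, d1 = centered(x1)
    m2, d2 = centered(x2)
    my, dy = centered(y)
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 ** 2  # zero only if the predictors are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

weight = [1.5, 1.8, 1.6, 2.0, 1.7, 1.9, 2.2, 1.4, 1.8, 2.1, 2.3, 1.6]  # tons
engine = [1.4, 1.6, 1.5, 1.8, 1.6, 1.7, 2.0, 1.3, 1.6, 2.0, 2.2, 1.5]  # liters
mpg = [30, 28, 29, 25, 27, 26, 24, 31, 28, 23, 22, 29]                 # miles per gallon

b0, b1, b2 = fit_two_predictor_ols(weight, engine, mpg)
prediction = b0 + b1 * 1.9 + b2 * 1.7

print(f"Y = {b0:.2f} + ({b1:.2f})*X1 + ({b2:.2f})*X2")
print(f"Predicted fuel efficiency: {prediction:.2f} mpg")
```

With only 12 strongly correlated observations, the individual slopes are unstable, so the prediction is more trustworthy than the interpretation of either coefficient on its own.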
Analysis of Variance (ANOVA)
ANOVA is used to uncover the main and interaction effects
of categorical independent variables (called
"factors") on an interval dependent variable.
Data must be experimental
If you do not have access to statistical software, an
ANOVA can be computed by hand
With many experimental designs, the sample sizes
must be equal for the various factor level
combinations
A regression analysis will accomplish the same
goal as an ANOVA.
ANOVA example 1
Mean micronutrient intake from the school lunch by school
                      S1 (a), n=25   S2 (b), n=25   S3 (c), n=25   P-value (d)
Calcium (mg)  Mean    117.8          158.7          206.5          0.000
              SD (e)  62.4           70.5           86.2
Iron (mg)     Mean    2.0            2.0            2.0            0.854
              SD      0.6            0.6            0.6
Folate (μg)   Mean    26.6           38.7           42.6           0.000
              SD      13.1           14.5           15.1
Zinc (mg)     Mean    1.9            1.5            1.3            0.055
              SD      1.0            1.2            0.4
(a) School 1 (most deprived; 40% subsidized lunches).
(b) School 2 (medium deprived; <10% subsidized).
(c) School 3 (least deprived; no subsidization, private school).
(d) ANOVA; significant differences (P<0.05) are highlighted in bold in the original table.
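To illustrate how an ANOVA can be computed by hand, the sketch below derives a one-way F statistic for the calcium row from its reported means, standard deviations, and group sizes. It assumes a one-way ANOVA with the reported SDs treated as sample standard deviations, which the table does not spell out, and because the published values are rounded the result is only approximate.

```python
def one_way_anova_f(means, sds, ns):
    """One-way ANOVA F statistic from group means, sample SDs, and group sizes."""
    k = len(means)
    n_total = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total
    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    # Within-group sum of squares: recovered from each group's sample variance.
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Calcium intake (mg) for schools S1, S2, S3 as reported in the table.
f_stat, df_between, df_within = one_way_anova_f(
    means=[117.8, 158.7, 206.5],
    sds=[62.4, 70.5, 86.2],
    ns=[25, 25, 25],
)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F relative to the F-distribution critical value for these degrees of freedom points to a difference between school means, in line with the small P-value reported for calcium.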
ANOVA example2
How to select appropriate statistical test
Type of variables
Quantitative (blood pres.)
Qualitative (gender)
Type of research question
Association
Comparison
Risk factor
Data structure
Independent
Paired
matched
Looking for Risk Factor
Types of variables                          Test
Dependent       Independent (several)
categorical     categorical                 chi²
quantitative    categorical                 ANOVA
quantitative    quantitative                Linear regression