
Data Analysis and Techniques

By

Abrha Gebregergs(Assistant Professor)


Planning for Analysis

• Type of Data
• Type of Formatting
• Type of Analysis

Three Steps to Data Analysis

• Analyze Results

• Communicate Findings

• Use Findings for Improvement

Two Kinds of Data

• Qualitative

• Quantitative

Qualitative Data

Narratives, logs, experience

Focus groups

Interviews

Open-ended survey items

Diaries and journals

Notes from observations


Quantitative Data
Data that is numerical, counted, or compared on a scale
Demographic data

Answers to closed-ended survey items

Attendance data

Scores on standardized instruments


Qualitative Data Analysis

Qualitative data analysis involves the systematic examination
and interpretation of non-numerical data, such as text,
images, audio, or video.

Principles to consider when conducting qualitative data analysis:

1. Start by immersing yourself in the data to gain a deep
understanding of its content and context.

2. Identify codes and themes: Develop a coding framework
to categorize and label different aspects of the data.
Use inductive coding to allow new themes to emerge from
the data, but also consider deductive coding if you have
pre-existing theories or concepts to guide your analysis.

3. Maintain rigor and transparency: Ensure transparency
in your data analysis process by documenting your
decisions, methods, and interpretations.

4.Seek patterns and connections: Look for patterns,


5. Organize the data into smaller units and then into
categories (based on major points). Qualitative data analysis
typically follows an inductive approach, where themes,
patterns, or categories emerge from the data itself
rather than being pre-determined.

6. Interpret and contextualize the data: Move beyond de-


scription and aim for deeper interpretation of the data.
Consider the broader context, relevant theories, and ex-
isting literature to provide meaningful explanations and
insights. Be open to alternative interpretations and mul-
7. Triangulation: Use multiple data sources, methods, or
researchers to validate and strengthen your findings.
Triangulation helps ensure the reliability and validity of
your analysis by incorporating different viewpoints and
sources of evidence.

8. Iterative analysis: Qualitative data analysis is often an


iterative process. It involves going back and forth be-
tween the data, codes, themes, and interpretations.
Continuously refine and revise your analysis as new in-
sights emerge or as you gain a deeper understanding of
9. Analyze negative cases to reflect their perspectives.
Analyzing negative cases involves examining instances or
data points that do not fit the expected patterns or
themes identified in the analysis. Negative cases can
provide valuable insights and challenge assumptions, leading
to a more comprehensive understanding of the research
topic.

10. Synthesize the patterns into the grounded theory. It


aims to generate theories or explanations directly from
the data, rather than starting with pre-existing theories or
Steps in Qualitative Data Analysis

1. Familiarization with the data.

2. Data organization.

3. Coding.

4. Category development: Once a substantial number of
codes have been assigned, begin grouping similar codes
into categories.

5. Theme identification: Analyze the categories and look
for patterns or themes that cut across the data. Themes
are central concepts or ideas that capture the main
findings or insights. They may emerge from the categories or
be identified through a process of constant comparison
and reflection.

6. Data exploration and interpretation: Engage in a deeper


exploration and interpretation of the data. Look for con-
nections, relationships, or contradictions within and be-
7. Triangulation: Consider using triangulation to
enhance the credibility and validity of the analysis.
Triangulation involves comparing and contrasting different
sources of data, methods, or perspectives to ensure
consistency and reliability in the findings.

8. Synthesis and reporting: Finally, synthesize the find-


ings into a coherent and compelling narrative.
Present the results in a clear and organized manner,
using quotes or citations from the data to support the
themes and interpretations then develop ground
Grounded Theory Analysis Strategies
Grounded theory:
• A qualitative research methodology that aims to develop
  theories or concepts grounded in the data itself.
• A process of constructing various data.
• An inductive process of collecting, analyzing, and
  comparing data systematically.
• Theory is grounded in data to explain the phenomena.
• The main purpose is to develop theory through understanding
  concepts that are related by means of statements of
  relationships.
• Iterative and flexible process: grounded theory involves
  moving back and forth between data collection and analysis.
• An interactional method of theory building by comparing
  and analyzing the data.


Saturation: Saturation is an important concept in

grounded theory. It refers to the point at which new data


Three steps in the grounded theory analytic process:

1. Open coding:
Break the data into small parts, compare them for
similarities and differences, and explain the meanings of the
data by focusing on “who, when, where, what, how
much, why” (ask questions to get a clear story).

2. Axial coding:
After open coding, make connections (sort) between
categories and confirm or disconfirm your hypotheses.

3. Selective coding:
Select the core category (matching the hypotheses) and
explain the minor categories (against the hypotheses) with
additional supporting data.
Interpretation Issues in Qualitative Data Analysis

A. Triangulating Data
Use multiple methods and data sources to support the
strength of interpretations and conclusions.

Ex) semi-structured interviews, consent form,
grounded theory

B. Audits
Questions for examining the data behind interpretations
and conclusions:

1. Is the sampling appropriate to ground the findings?

2. Are the coding strategies applied correctly?

3. Is the categorization process appropriate?

4. Do the results link to the hypotheses? (examine the
literature review)

5. Are the negative cases explained? (minority’s voice)


Quantitative Data Analysis

• You should choose a level of analysis that is
  appropriate for your research question.

• You should choose the type of statistical analysis
  appropriate for the variables you have:

  o Nominal (categorical), ordinal, or continuous
Quantitative data analysis
Key points
• Data must be analysed to produce information
• Computer software analysis is normally used for this process
• Data should be carefully prepared for analysis
• Researchers need to know how to select and use
  different charting and statistical techniques

Quantitative data analysis
Main concerns
• Preparing, inputting and checking data
• Choosing the most appropriate statistics to describe the data
• Choosing the most appropriate statistics to examine
  data relationships and trends



Preparing, inputting and checking data
Main considerations
• Type of data (scale of measurement)

• Data format for input to analysis software

• Impact of data coding on subsequent analyses

• Case weighting

• Methods for error checking


Scales of Measurement

Data
  Qualitative
    Numerical: Nominal, Ordinal
    Nonnumerical: Nominal, Ordinal
  Quantitative
    Numerical: Interval, Ratio

Nominal (categorical)
• Data are labels or names used to identify an attribute of
  the element.
• A nonnumeric label or numeric code may be used.
Ex: Gender: Male, Female
    Eye color: Blue, Brown
    Marital status: Single, Married, Divorced
    Ethnicity: African, Asian

Ordinal
• The data have the properties of nominal data and the
  order or rank of the data is meaningful.
Ex: Educational attainment: Elementary school, High school,
    Bachelor's degree, Master's degree, Doctorate
    Likert scale ratings: Strongly disagree, Disagree,
    Neutral, Agree, Strongly agree
    Satisfaction levels: Very dissatisfied, Dissatisfied,
    Neutral, Satisfied, Very satisfied
    Socioeconomic status: Low income, Middle income,
    High income

Interval
• Interval data is a type of quantitative data that has a
  consistent and equal interval between values. It represents
  measurements on a continuous scale where the difference
  between any two values is meaningful and consistent.
• Interval data are always numeric.
Ex: time of day read from a clock:
12:00 PM, 1:00 PM, 2:00 PM, 3:00 PM
In this example, the interval between each recorded time
is consistent and equal (1 hour), but there is no true zero
point on the clock. The value of 12:00 AM could be con-
sidered as a reference point, but it does not indicate the

Ratio
• The data have all the properties of interval data and the
  ratio of two values is meaningful.
• Variables such as distance, height, weight, and elapsed time
  (duration) use the ratio scale.
• This scale must contain a zero value that indicates that
  nothing exists for the variable at the zero point.
• The load applied during tensile testing can be considered
  as ratio data, as it has a true zero point (no load) and equal
  intervals between values. The strain or elongation of the
  material can also be considered as ratio data, as it
  represents a continuous variable with a true zero point
  (no elongation) and equal intervals between values.
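
As a minimal sketch of how the measurement scale limits the statistics that make sense, the Python snippet below codes a set of Likert responses as ranks and reports the median and the mode. The responses are hypothetical, invented only for illustration. A mean is deliberately not computed, because the distance between ordinal categories is not defined.

```python
from collections import Counter
from statistics import median_low

# Ordered categories of the Likert scale (lowest to highest).
LIKERT = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

# Hypothetical survey responses, invented for illustration only.
responses = ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree"]

# Code each label as its rank; the numbers carry order, not distance.
ranks = [LIKERT.index(r) for r in responses]

# median_low always returns an observed rank, so it maps back to a label.
median_label = LIKERT[median_low(ranks)]
mode_label = Counter(responses).most_common(1)[0][0]

print("median:", median_label)
print("mode:", mode_label)
```

For nominal data only the mode would be meaningful; for interval and ratio data the mean and standard deviation become available as well.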
Quantitative Levels of Analysis

• Univariate - simplest form; describes a case in
  terms of a single variable.
• Bivariate - subgroup comparisons; describes a case
  in terms of two variables simultaneously.
• Multivariate - analysis of more than two variables
  simultaneously.
Statistical Data Analysis techniques

Statistics is the science of collecting, organizing,
summarising, analysing, and drawing conclusions from data.

Descriptive statistics includes collecting, organizing,
summarising, analysing, and presenting data.
Ex: Mean, Median, Mode

Inferential statistics includes making inferences,
hypothesis testing, determining relationships, and
making predictions.
Ex: Regression analysis, t-test, chi² test, ANOVA
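
The three descriptive statistics named above can be computed with Python's standard `statistics` module. This is only an illustrative sketch: the attendance counts are hypothetical and do not come from any dataset in these slides.

```python
from statistics import mean, median, mode

# Hypothetical attendance counts for five sessions (illustrative only).
attendance = [12, 15, 15, 18, 20]

avg = mean(attendance)    # arithmetic average
mid = median(attendance)  # middle value of the sorted data
top = mode(attendance)    # most frequent value

print("mean:", avg)
print("median:", mid)
print("mode:", top)
```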

Variables

Quantitative
• Discrete
• Continuous

Qualitative
• Ordinal
• Categorical

Parametric vs. non-parametric tests
• Parametric: a decision-making method in which the
  distribution of the sampling statistic is known.
  Ex: t-tests, analysis of variance (ANOVA), and linear
  regression.
• Non-parametric: a decision-making method that does not
  require knowledge of the distribution of the sampling
  statistic.
  Ex: chi-square test

t-Test
• A t-test is a statistical test used to determine whether
  there is a significant difference between the means of
  two groups.
• The t-test assumes that the data are normally distributed
  and measured on a continuous scale, such as interval or
  ratio data.
• Only two variables are required: one categorical variable
  that defines the two groups and one continuous outcome
  variable.

Two-tailed test:
• Used when the alternative hypothesis is not specific about
  the direction of the difference or effect.
• It is interested in determining if there is a significant
  difference between the two groups, regardless of whether one
  group has a higher or lower mean than the other.

One-tailed test:
• Used when the alternative hypothesis specifies the
  direction of the difference or effect. Ex: Value1 is greater
  than Value2.
 It is interested in determining if there is a significant differ-
Example
1. Let's say you work in a manufacturing company that
produces steel cables. You want to compare the tensile
strength of two different types of steel cables, Cable A
and Cable B, to determine if there is a significant dif-
ference in their strength. To ensure a fair comparison,
you randomly select 15 samples of Cable A and 15
samples of Cable B. Each sample is carefully prepared
and subjected to a standardized tensile test to mea-
sure the maximum force it can withstand before break-
ing.
Test result:
For Cable A: Mean (μA) = 1500 Newtons
Standard Deviation (σA) = 100 Newtons
For Cable B:
Mean (μB) = 1600 Newtons
Standard Deviation (σB) = 120 Newtons
Solution: (two-tailed test)
Step 1: State the hypotheses. The null hypothesis (H0) assumes
that there is no significant difference between the mean
tensile strength of Cable A and Cable B. The alternative
hypothesis (Ha) is that the two means differ.
Step 2: Calculate the mean and standard deviation for
each group.
For Cable A: Mean (μA) = 1500 Newtons, Standard Deviation (σA) = 100 Newtons
For Cable B: Mean (μB) = 1600 Newtons, Standard Deviation (σB) = 120 Newtons
Step 3: Calculate the pooled standard deviation (sp) using the
formula:

sp = sqrt(((nA - 1) * σA^2 + (nB - 1) * σB^2) / (nA + nB - 2))

nA = 15 (sample size for Cable A)

nB = 15 (sample size for Cable B)

σA = 100 Newtons (standard deviation for Cable A)

σB = 120 Newtons (standard deviation for Cable B)

sp = sqrt(((15 - 1) * 100^2 + (15 - 1) * 120^2) / (15 + 15 - 2))

sp = sqrt((14 * 10000 + 14 * 14400) / 28)

sp = sqrt(341600 / 28) = sqrt(12200)

sp = 110.45
Step 4: Calculate the t-value using the formula:
t = (μA - μB) / (sp * sqrt(1/nA + 1/nB))
t = (1500 - 1600) / (110.45 * sqrt(1/15 + 1/15))
t = -100 / (110.45 * sqrt(0.0667 + 0.0667))
t = -100 / (110.45 * sqrt(0.1333))
t = -100 / (110.45 * 0.3651)
t = -100 / 40.33
t = -2.48
Step 5: Determine the degrees of freedom (df) using the
formula:
df = nA + nB - 2
df = 15 + 15 - 2
df = 28
Step 6: Determine the critical t-value associated with an
assumed significance level of 0.05 and the degrees of freedom,
using a t-distribution table or statistical software.

For a significance level of 0.05 (two-tailed) and 28 degrees
of freedom, the critical t-value is approximately ± 2.048.

Since the calculated t-value of -2.48 is less than the
critical t-value of -2.048 (that is, |t| = 2.48 > 2.048), we
reject the null hypothesis and conclude that there is a
significant difference in the tensile strength between
Cable A and Cable B.
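
The pooled two-sample t-test on the cable summary statistics can be checked with a few lines of Python. This is a minimal sketch that assumes equal population variances (the pooled form used in the formulas) and takes the critical value 2.048 from the text rather than computing it; the function name `pooled_t` is introduced here for illustration.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (pooled SD, t statistic, degrees of freedom) for two independent samples."""
    df = n_a + n_b - 2
    # Pooled standard deviation: variances weighted by their degrees of freedom.
    sp = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df)
    # t statistic: difference in means divided by its standard error.
    t = (mean_a - mean_b) / (sp * math.sqrt(1 / n_a + 1 / n_b))
    return sp, t, df

sp, t, df = pooled_t(1500, 100, 15, 1600, 120, 15)

T_CRITICAL = 2.048  # two-tailed, alpha = 0.05, df = 28 (value quoted in the text)
reject_h0 = abs(t) > T_CRITICAL

print(f"sp = {sp:.2f}, t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```

Run as written, this prints sp = 110.45, t = -2.48, df = 28, reject H0: True.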
chi² test
• Used to test whether there is an association between
  qualitative variables
• Used for categorical data

Requirement
• Data should be in the form of frequencies (counts)

If the calculated chi² test statistic is larger than the critical

value from the chi-square distribution table or if the p-value


associated with the test statistic is smaller than the chosen
significance level (α), we reject the null hypothesis and con-
EX: Determine if there is a relationship between the type of
material used in a mechanical component and its failure
mode. Data on 100 mechanical components are collected, and the
type of material used (steel or aluminum) and the failure
mode (brittle or ductile) are recorded.

Data        Brittle   Ductile
Steel       30        20
Aluminum    10        40

Step 1: Set up hypotheses:
- Null hypothesis (H0): There is no relationship between the
  type of material and failure mode.
- Alternative hypothesis (Ha): There is a relationship between
  the type of material and failure mode.
Step 2: Calculate expected frequencies: We need to calculate
the expected frequencies under the assumption of
independence. The expected frequency for each cell can be
calculated as (row total * column total) / grand total.

            Brittle          Ductile          Total
Steel       50*40/100=20     50*60/100=30     50
Aluminum    50*40/100=20     50*60/100=30     50
Total       40               60               100

Step 3: Calculate the test statistic (χ²): The test statistic is
calculated as the sum of the squared differences between the
observed and expected frequencies, divided by the expected
frequencies.
χ² = Σ [(O - E)² / E], O = observed frequency, E = expected frequency
χ² = [(30-20)²/20] + [(20-30)²/30] + [(10-20)²/20] + [(40-30)²/30]
   = 5 + 3.33 + 5 + 3.33
χ² = 16.67
Step 4: Determine the critical value and p-value: The critical
value or the p-value is obtained from the chi-square
distribution table or using statistical software. Let's assume
we are using a significance level (α) of 0.05 with 1 degree of
freedom (df = (rows-1) * (columns-1) = (2-1) * (2-1) = 1).

From the chi-square distribution table, the critical value for
α = 0.05 and df = 1 is approximately 3.84.

Step 5: Make a decision: the calculated χ² value (16.67)
exceeds the critical value (3.84), so we reject the null
hypothesis and conclude that there is a significant
relationship between the type of material and failure mode.
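
The chi-square calculation for the material and failure-mode table can be reproduced in plain Python. This sketch computes the expected counts from the row and column totals and sums (O - E)² / E over the four cells; the critical value 3.84 is taken from the text, not computed.

```python
# Observed counts from the example: rows = material, columns = failure mode.
observed = {
    "Steel": {"Brittle": 30, "Ductile": 20},
    "Aluminum": {"Brittle": 10, "Ductile": 40},
}

row_totals = {r: sum(cols.values()) for r, cols in observed.items()}
col_totals = {}
for cols in observed.values():
    for c, count in cols.items():
        col_totals[c] = col_totals.get(c, 0) + count
grand_total = sum(row_totals.values())

chi2 = 0.0
for r, cols in observed.items():
    for c, o in cols.items():
        e = row_totals[r] * col_totals[c] / grand_total  # expected count under independence
        chi2 += (o - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)
CHI2_CRITICAL = 3.84  # alpha = 0.05, df = 1 (value quoted in the text)

print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {chi2 > CHI2_CRITICAL}")
```

Each of the four cells contributes either 100/20 = 5 or 100/30 ≈ 3.33, so the statistic evaluates to about 16.67.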
Correlation
• A method to study the magnitude of the association and the
  functional relationship between two or more variables.
• Positive linear correlation: the variable on the y-axis
  increases as the variable on the x-axis increases.
• Negative linear correlation: the variable on the y-axis
  decreases as the variable on the x-axis increases.
• The correlation coefficient (r) is a statistic that tells you
  the strength and direction of that relationship. It is a
  number between -1 and 1. The sign gives the direction and the
  absolute value gives the strength of the relationship:
  r = 0 means there is no linear correlation.

r = Σ((Xi - X_mean) * (Yi - Y_mean)) / sqrt(Σ((Xi - X_mean)²) * Σ((Yi - Y_mean)²))
Correlation
Denote strength of relationship between variables
Ex: Let's calculate the correlation coefficient between two
variables, X and Y, using a sample dataset:
X: [2, 4, 6, 8, 10]
Y: [5, 10, 15, 20, 25]
Solution:
r = Σ((Xi - X_mean) * (Yi - Y_mean)) / sqrt(Σ((Xi - X_mean)²) *
Σ((Yi - Y_mean)²))
X_mean = (2 + 4 + 6 + 8 + 10) / 5 = 6
Y_mean = (5 + 10 + 15 + 20 + 25) / 5 = 15
Now, let's calculate the numerator and denominator of the
correlation coefficient formula:
Numerator:
Σ((Xi - X_mean) * (Yi - Y_mean)) = (2-6)*(5-15) + (4-6)*(10-15)
+ (6-6)*(15-15) + (8-6)*(20-15) + (10-6)*(25-15) = 100
Denominator:

sqrt(Σ((Xi - X_mean)²) * Σ((Yi - Y_mean)²))
= sqrt(((2-6)² + (4-6)² + (6-6)² + (8-6)² + (10-6)²) * ((5-15)² + (10-15)² + (15-15)² + (20-15)² + (25-15)²))
= sqrt(40 * 250) = sqrt(10000) = 100

Now, let's calculate the correlation coefficient:

r = 100 / 100=1

The correlation coefficient between X and Y is 1. This value
indicates a perfect positive linear relationship between the
two variables: every point lies exactly on the line Y = 2.5X,
so as X increases, Y increases proportionally.
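
The same correlation calculation can be written as a short Python function. This is a sketch of the formula used in the solution, with the helper name `pearson_r` chosen here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

X = [2, 4, 6, 8, 10]
Y = [5, 10, 15, 20, 25]
r = pearson_r(X, Y)

print(f"r = {r:.2f}")
```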
Regression

• A method that describes a mathematical relationship
  between a dependent variable and one or more independent
  variables.
• Simple linear regression and multiple regression are
  appropriate for continuous response variables (like weight).
• Logistic regression is applicable for a binary response
  like alive/dead.
EX1: The sales of a new car is based on the advertising expen-
diture. Data is collected on the advertising expenditure (in
thousands of dollars) and the corresponding sales (in thou-
sands of units) for 10 different time periods.
Here is the dataset:
Advertising Expenditure (X): [5, 7, 3, 2, 9, 4, 6, 8, 1, 5]
Sales (Y): [15, 20, 10, 8, 25, 12, 18, 22, 6, 14]
Estimate the sales for an advertising expenditure of 6,000 dollars.
Solution:
Step 1: Data Preparation: The data is already organized.
Step 2: Model Selection: We will use simple linear regression
since we have one independent variable (advertising expendi-
ture) and one dependent variable (sales).
Step 3: Model Estimation: To estimate the regression
coefficients (β0 and β1), we use ordinary least squares (OLS).
The formulas to estimate the coefficients are:
β1 = Σ((Xi - X_mean) * (Yi - Y_mean)) / Σ((Xi - X_mean)²)
β0 = Y_mean - β1 * X_mean
Using the provided data, we can calculate the coefficients as
follows:
X_mean = (5 + 7 + 3 + 2 + 9 + 4 + 6 + 8 + 1 + 5) / 10 = 5.0
Y_mean = (15 + 20 + 10 + 8 + 25 + 12 + 18 + 22 + 6 + 14) / 10 = 15
Σ((Xi - X_mean) * (Yi - Y_mean)) = (5-5)*(15-15) + (7-5)*(20-15) + ... + (5-5)*(14-15) = 144
Σ((Xi - X_mean)²) = (5-5)² + (7-5)² + ... + (5-5)² = 60
β1 = 144 / 60 = 2.4
β0 = 15 - 2.4 * 5 = 3
The estimated regression equation is: Y = β0 + β1X
Y = 3 + 2.4X
For an advertising expenditure of 6,000 dollars (X = 6):
Y = 3 + 2.4 * 6 = 17.4, so the estimated sales are 17,400 units.
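
The simple linear regression for the advertising data can be verified with a small Python sketch of the two OLS formulas. The function name `fit_line` is introduced here for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for y = b0 + b1 * x."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = y_mean - b1 * x_mean    # intercept
    return b0, b1

X = [5, 7, 3, 2, 9, 4, 6, 8, 1, 5]             # advertising, thousands of dollars
Y = [15, 20, 10, 8, 25, 12, 18, 22, 6, 14]     # sales, thousands of units

b0, b1 = fit_line(X, Y)
predicted = b0 + b1 * 6  # advertising expenditure of 6,000 dollars

print(f"Y = {b0:.1f} + {b1:.1f}X, predicted sales = {predicted:.1f} thousand units")
```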
EX2: You are assigned to predict the fuel efficiency of a car based
on its weight and engine size. You have collected data on 12
different car models and recorded their weight (in tons), engine size
(in liters), and fuel efficiency (in miles per gallon).
Here is the dataset:
Weight (X1): [1.5, 1.8, 1.6, 2.0, 1.7, 1.9, 2.2, 1.4, 1.8, 2.1, 2.3, 1.6]
Engine Size (X2): [1.4, 1.6, 1.5, 1.8, 1.6, 1.7, 2.0, 1.3, 1.6, 2.0, 2.2, 1.5]
Fuel Efficiency (Y): [30, 28, 29, 25, 27, 26, 24, 31, 28, 23, 22, 29]
Estimate the fuel efficiency for a car that weighs 1.9 tons and has an
engine size of 1.7 liters.
Solution
Step 1: Data Preparation
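Step 2: Model Selection: use multiple linear regression since we
have two independent variables (weight and engine size) and
one dependent variable (fuel efficiency).
Step 3: Model Estimation: To estimate the regression
coefficients (β0, β1, and β2), we need to use a method like ordinary
least squares (OLS). Because the two predictors are correlated, the
slopes cannot be estimated one predictor at a time; they are solved
jointly from the normal equations. With the centred sums
S11 = Σ((Xi1 - X1_mean)²)
S22 = Σ((Xi2 - X2_mean)²)
S12 = Σ((Xi1 - X1_mean) * (Xi2 - X2_mean))
S1y = Σ((Xi1 - X1_mean) * (Yi - Y_mean))
S2y = Σ((Xi2 - X2_mean) * (Yi - Y_mean))
the coefficients are:
β1 = (S22 * S1y - S12 * S2y) / (S11 * S22 - S12²)
β2 = (S11 * S2y - S12 * S1y) / (S11 * S22 - S12²)
β0 = Y_mean - β1 * X1_mean - β2 * X2_mean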
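
The slides stop before giving a numerical answer for this example. As a hedged sketch, the Python code below fits the two-predictor model by solving the normal equations jointly on centred data, which is the standard OLS solution for two predictors. The helper names `centered_sum` and `fit_two_predictors` are introduced here for illustration and are not part of the original material.

```python
def centered_sum(us, vs):
    """Sum of products of deviations from the means of two sequences."""
    u_mean = sum(us) / len(us)
    v_mean = sum(vs) / len(vs)
    return sum((u - u_mean) * (v - v_mean) for u, v in zip(us, vs))

def fit_two_predictors(x1, x2, y):
    """OLS coefficients (b0, b1, b2) for y = b0 + b1*x1 + b2*x2."""
    s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
    det = s11 * s22 - s12 ** 2  # zero if x1 and x2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = sum(y) / len(y) - b1 * sum(x1) / len(x1) - b2 * sum(x2) / len(x2)
    return b0, b1, b2

weight = [1.5, 1.8, 1.6, 2.0, 1.7, 1.9, 2.2, 1.4, 1.8, 2.1, 2.3, 1.6]  # tons
engine = [1.4, 1.6, 1.5, 1.8, 1.6, 1.7, 2.0, 1.3, 1.6, 2.0, 2.2, 1.5]  # liters
mpg = [30, 28, 29, 25, 27, 26, 24, 31, 28, 23, 22, 29]                 # miles per gallon

b0, b1, b2 = fit_two_predictors(weight, engine, mpg)
predicted = b0 + b1 * 1.9 + b2 * 1.7  # car of 1.9 tons with a 1.7 liter engine

print(f"Y = {b0:.2f} + ({b1:.2f})*X1 + ({b2:.2f})*X2, predicted = {predicted:.1f} mpg")
```

With this data the prediction comes out at roughly 26.6 miles per gallon. Weight and engine size are very strongly correlated here, so the individual slopes should be interpreted with caution even though the prediction itself is stable.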
Analysis of Variance (ANOVA)
• ANOVA is used to uncover the main and interaction effects
  of categorical independent variables (called "factors")
  on an interval dependent variable.
• Data must be experimental.
• If you do not have access to statistical software, an
  ANOVA can be computed by hand.
• With many experimental designs, the sample sizes
  must be equal for the various factor level combinations.
• A regression analysis will accomplish the same
  goal as an ANOVA.
ANOVA example 1
Mean micronutrient intake from the school lunch by school

                       S1 (a), n=25   S2 (b), n=25   S3 (c), n=25   P-value (d)
Calcium (mg)   Mean    117.8          158.7          206.5          0.000
               SD      62.4           70.5           86.2
Iron (mg)      Mean    2.0            2.0            2.0            0.854
               SD      0.6            0.6            0.6
Folate (μg)    Mean    26.6           38.7           42.6           0.000
               SD      13.1           14.5           15.1
Zinc (mg)      Mean    1.9            1.5            1.3            0.055
               SD      1.0            1.2            0.4

a School 1 (most deprived; 40% subsidized lunches).
b School 2 (medium deprived; <10% subsidized).
c School 3 (least deprived; no subsidization, private school).
d ANOVA; significant differences (P<0.05) were highlighted in bold in the original table.
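
To show where a P-value like the one in the calcium row comes from, the sketch below computes a one-way ANOVA F statistic from the reported group means, standard deviations, and sample sizes. It assumes the published P-values are from a one-way ANOVA across the three schools, as footnote d suggests; because the inputs are rounded summary values, the result is approximate. The function name `one_way_anova_f` is introduced here for illustration.

```python
def one_way_anova_f(means, sds, ns):
    """F statistic and degrees of freedom from per-group means, SDs, and sizes."""
    k = len(means)
    n_total = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / n_total
    # Between-group variation: how far each group mean is from the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    # Within-group variation: pooled from the group variances.
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Calcium (mg) row of the table: schools S1, S2, S3.
f_stat, df_b, df_w = one_way_anova_f(
    means=[117.8, 158.7, 206.5],
    sds=[62.4, 70.5, 86.2],
    ns=[25, 25, 25],
)

print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The resulting F of about 9.07 on (2, 72) degrees of freedom is far above the 5% critical value, which is consistent with the very small P-value reported for calcium.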

How to select appropriate statistical test
• Type of variables
  o Quantitative (blood pressure)
  o Qualitative (gender)
• Type of research question
  o Association
  o Comparison
  o Risk factor
• Data structure
  o Independent
  o Paired
  o Matched

Looking for Risk Factor

Types of variables                          Test
Dependent       Several independent
categorical     categorical                 chi²
quantitative    categorical                 ANOVA
quantitative    quantitative                Linear regression
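
The risk-factor table can be read as a simple lookup from variable types to a suggested test. The Python sketch below encodes only the three rows listed; the names `TEST_BY_VARIABLE_TYPES` and `suggest_test` are hypothetical, and combinations that the table does not list are reported as not covered rather than guessed.

```python
# (dependent variable type, independent variable type) -> suggested test
TEST_BY_VARIABLE_TYPES = {
    ("categorical", "categorical"): "chi-square test",
    ("quantitative", "categorical"): "ANOVA",
    ("quantitative", "quantitative"): "linear regression",
}

def suggest_test(dependent, independent):
    """Look up the test suggested by the table for a (dependent, independent) pair."""
    key = (dependent.lower(), independent.lower())
    return TEST_BY_VARIABLE_TYPES.get(key, "not covered by the table")

print(suggest_test("quantitative", "categorical"))
print(suggest_test("categorical", "quantitative"))
```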
