Module name: Epidemiology and Biostatistics
Module code: 04104
Total number of sessions = 15
Facilitator: Dr Meshack Simon, MBBS
SESSION 4
ANALYSIS OF HEALTH DATA
OBJECTIVES
• Define data analysis
• Identify data to be analyzed
• Identify methods for data analysis
• Describe data entry process
• Explain data cleaning
• Explain data summarization
• List tools for data analysis
• Analyze data
Data analysis
Data analysis, also known as data analytics, is the
process of inspecting, cleansing, transforming,
and modeling data with the goal of discovering
useful information, suggesting conclusions, and
supporting decision-making.
Methods of data analysis
(Types of Data Analysis)
There are two main statistical methods used in data
analysis:
Descriptive Analysis
– Is a statistical approach used to describe data and
identify patterns and relationships.
– It involves collecting, summarizing and presenting
data using tables, graphs and summary measures
(descriptive statistics).
– Its summary measures are of two types, i.e. measures
of central tendency and measures of variability.
Types of Data Analysis(2)
Inferential Analysis
– Drawing conclusions and/or making decisions
concerning a population based only on sample data.
– With inferential statistics, you take data from samples
and make generalizations about a population.
– A basic tool in inferential analysis is probability.
Measures of central tendency
There are three commonly used measures of central
tendency
– Mean
– Median
– Mode
Mean
• The arithmetic average; it is the most popular
measure of central tendency.
• It is obtained by dividing the sum of the values of all
observations in a series (ƩX) by the number of items
(N) constituting the series: Mean = ƩX / N.
Measures of central tendency(2)
Median
• Median is the middle value when the numbers in a
set of data are arranged in ascending or descending
order.
• If the number of observations is even, the median is
the average of the two middle values.
Mode
• Mode is the value or score that occurs most
frequently in a set of data.
• It corresponds to the highest point of the frequency
distribution curve.
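The three measures above can be computed with Python's standard `statistics` module. The sketch below is only an illustration: the list `ages` is made-up data, not taken from any study.

```python
import statistics

# Hypothetical ages (in years) of nine patients; illustrative values only.
ages = [23, 25, 25, 30, 31, 34, 25, 40, 37]

# Mean = ƩX / N: the sum of all observations divided by their number.
mean_age = sum(ages) / len(ages)

# Median: the middle value once the observations are sorted.
median_age = statistics.median(ages)

# Mode: the value that occurs most frequently.
mode_age = statistics.mode(ages)

print("Mean:", mean_age)      # Mean: 30.0
print("Median:", median_age)  # Median: 30
print("Mode:", mode_age)      # Mode: 25
```

In this example the mean and median agree, while the mode is lower because 25 occurs three times.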
Measures of Dispersion/Variability
There are three commonly used measures of
dispersion/Variability
– Range
– Variance
– Standard deviation
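As a minimal sketch, the three measures of dispersion can be computed as follows. The list `values` is made-up data. The slides do not say whether the population or the sample formula is intended, so both variances are shown; that choice is an assumption of this example.

```python
import statistics

# Hypothetical observations; illustrative values only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: difference between the largest and smallest observation.
data_range = max(values) - min(values)

# Population variance: mean squared deviation from the mean (divides by N).
pop_variance = statistics.pvariance(values)

# Sample variance: divides by N - 1 instead of N.
sample_variance = statistics.variance(values)

# Standard deviation: square root of the variance (population version here).
pop_std_dev = statistics.pstdev(values)

print("Range:", data_range)                   # Range: 7
print("Population variance:", pop_variance)
print("Sample variance:", sample_variance)
print("Population standard deviation:", pop_std_dev)
```

Here the mean is 5, the population variance is 4 and the population standard deviation is 2; the sample variance is slightly larger because it divides by N - 1.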
Data entry
Data entry refers to entering information into a
computer or data-recording system using an
electronic or mechanical device.
It is the process of converting data to an electronic form.
Data coding
Coding – the process of translating information
gathered from questionnaires or other sources into
something that can be analyzed
Involves assigning a value to the information given;
often each value is given a label
Coding can make data more consistent:
Example: Question = Sex
Answers = Male, Female, M, or F
Coding will avoid such inconsistencies
Examples of coding
S/N  Variable name  Operational definition                       Coding
1.   SEX            Sex: male, female                            male=0, female=1, missing=9
2.   SUPER          Is your position supervisory or              non-supervisory=0, supervisory=1, missing=9
                    non-supervisory?
3.   DIVISION       Name of Division where you work?             Planning=1, Traffic=2, Engineering=3,
                                                                 Enforcement=4, missing=9
4.   JOB CLASS      What is your job classification?             Management=1, Technical=2, Administrative=3,
                    Management, Technical, Administrative,       Clerical=4, missing=9
                    Clerical
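The sketch below shows how inconsistent answers to the Sex question could be translated into the codes from the SEX row of the coding table. The function name `code_sex` and the sample answers are hypothetical, and treating blank or unrecognized answers as missing (9) is an assumption of this example.

```python
# Codes from the SEX row of the coding table: male=0, female=1, missing=9.
SEX_CODES = {"male": 0, "m": 0, "female": 1, "f": 1}
MISSING = 9


def code_sex(answer):
    """Return the numeric code for a free-text answer to the Sex question."""
    if answer is None:
        return MISSING
    # Ignore case and surrounding spaces so "Male", "male " and "M" agree.
    # Assumption: anything unrecognized (including blanks) is coded as missing.
    return SEX_CODES.get(answer.strip().lower(), MISSING)


# Hypothetical raw questionnaire answers.
raw_answers = ["Male", "F", "female", "M", "", None]
coded = [code_sex(a) for a in raw_answers]
print(coded)  # [0, 1, 1, 0, 9, 9]
```

After coding, the four different spellings collapse into two values, which is what makes the variable analyzable.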
Data cleaning
Data cleaning refers to improving the quality of
data by removing errors and resolving
inconsistencies
It is the process of preventing and correcting
errors in the data
Common tasks include record matching,
identifying inaccurate data, assessing the overall
quality of existing data, deduplication, correcting
mislabeled or misspelled values, and column segmentation
Data cleaning(2)
One of the first steps in analyzing data is to
“clean” it of any obvious data entry errors:
– Outliers? (really high or low numbers)
• Example: Age = 110 (really 10 or 11?)
– Value entered that doesn’t exist for variable?
• Example: 2 entered where 0=male, 1=female
– Missing values?
• Did the person not give an answer? Was answer accidentally
not entered into the database?
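The three checks above (outliers, values that do not exist for the variable, and missing values) can be sketched in a few lines. The records, the function name `find_problems`, and the plausible age range of 0 to 100 are all assumptions made for illustration.

```python
# Hypothetical records; None marks an answer that was never entered.
records = [
    {"id": 1, "age": 34, "sex": 0},
    {"id": 2, "age": 110, "sex": 1},   # implausibly high age
    {"id": 3, "age": 27, "sex": 2},    # 2 is not a valid code for sex
    {"id": 4, "age": None, "sex": 1},  # age is missing
]

VALID_SEX_CODES = {0, 1}
AGE_RANGE = (0, 100)  # assumed plausible range; adjust to the study population


def find_problems(record):
    """Return a list of data-entry problems found in one record."""
    problems = []
    age = record["age"]
    if age is None:
        problems.append("missing age")
    elif not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
        problems.append("age out of range")
    sex = record["sex"]
    if sex is None:
        problems.append("missing sex")
    elif sex not in VALID_SEX_CODES:
        problems.append("invalid sex code")
    return problems


# Keep only the records that have at least one problem.
flagged = {r["id"]: find_problems(r) for r in records if find_problems(r)}
for record_id, problems in flagged.items():
    print(record_id, problems)
```

Flagged records should be checked against the original questionnaire where possible rather than corrected by guessing.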
Data Summarization
Summarizing is defined as taking a lot of
information and creating a condensed version that
covers the main points
When data are in their original form, as collected,
they are called raw data
Data Summarization(2)
We want to be able to visualize the characteristics
of a data set; hence we construct graphical
representations of the data
The method of data summarization depends on the type
of data you have
A simple way of summarizing data is to use tables
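As an illustration of summarizing raw data in a table, the sketch below builds a frequency table for a coded variable. It reuses the DIVISION codes from the coding examples; the list `responses` is made-up data.

```python
from collections import Counter

# Labels for the DIVISION codes used in the coding examples.
division_labels = {
    1: "Planning",
    2: "Traffic",
    3: "Engineering",
    4: "Enforcement",
    9: "Missing",
}

# Hypothetical coded responses (raw data); illustrative values only.
responses = [1, 2, 2, 3, 4, 2, 1, 9, 3, 2]

# Count how many times each code occurs.
counts = Counter(responses)
total = len(responses)

# Print a simple frequency table with percentages.
print(f"{'Division':<12}{'Frequency':>10}{'Percent':>9}")
for code, label in division_labels.items():
    n = counts.get(code, 0)
    print(f"{label:<12}{n:>10}{100 * n / total:>8.1f}%")
```

Ten raw values are condensed into five rows, which makes it easy to see that Traffic is the most common division in this made-up sample.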
Statistical tools for Data Analysis
Epi Info is a free set of software tools for public
health practitioners and researchers across the
globe. It is a widely used tool.
SPSS (Statistical Package for the Social Sciences)
Microsoft Excel
MATLAB (Matrix Laboratory)
R (R Foundation for Statistical Computing)
SAS (Statistical Analysis System)