Introduction to SPSS and Data Analysis
• Statistics is a field of study concerned with the collection, organization and analysis of data to make more effective decisions.
DESCRIPTIVE STATISTICS
Methods of organizing, summarizing and presenting data in an informative way.
INFERENTIAL STATISTICS
Drawing conclusions about a population based on sample results.
Population?
For example, based on sample survey results reported in USA Today, only 46% of high school students can solve problems involving fractions, decimals and percentages.
This is an inference about the population (all high school students) based on sample data.
Basic Concepts
Population
A population is a collection of all individuals or objects of interest.
Sample
A subset or portion of a population is called a sample.
• Often there is limited time, staff and money to gather data from the
entire population.
• So researchers use samples
Parameter
Any value (like Mean, SD) calculated from the population.
Example
Average heights of all students of DUHS
Average price of all statistics books in a book fair
Statistic
Any value (like Mean, SD) calculated from the sample.
Examples
Average heights of any 30 students selected from DUHS
Average price of any 20 statistics books selected from a book fair
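The difference between a parameter and a statistic can be sketched in Python. The population below is simulated purely for illustration (the heights, the seed and the population size of 1,000 are assumptions, not DUHS data).

```python
import random
from statistics import mean

# Hypothetical population: simulated heights (cm) of 1,000 students.
random.seed(42)
population = [random.gauss(165, 8) for _ in range(1000)]

# Parameter: a value calculated from the whole population.
parameter = mean(population)

# Statistic: the same value calculated from a sample of 30 students.
sample = random.sample(population, 30)
statistic = mean(sample)

print(f"Parameter (population mean): {parameter:.1f}")
print(f"Statistic (sample mean, n=30): {statistic:.1f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean.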
• A variable is something whose value can vary
• For example, age, gender and blood type are variables.
Age Gender Blood Type   (variables)
32 Male A               (data)
24 Male B
40 Female A
Types of Variable
Categorical Variables :
• They can only be classified into categories. For example:
• Gender
- Male
- Female
• Socio economic status
- upper class
- middle class
- lower class
• Type of disease
- Diarrhea
- Fever
Nominal Categorical Variables
• The ordering of categories is completely arbitrary (No ordering)
Examples
- Gender: Male, Female
- Religion: Islam, Christianity
- Hair Colour : Black, Brown etc
Ordinal Categorical Variables
• There is a natural order among the categories, for example
• Personal Health Status
- Very Good
- Good
- Fair
- Poor
• Faculty Position
- Professor
- Associate Professor
- Assistant Professor
Continuous variables
• Anything that can be measured
• Values come from a continuous range of possible values
Examples
- Blood Pressure(mmHg) of the Subject
- Body mass index(kg/m2) of the Subject
- Length of stay of a patient in hospital
Discrete variables
• Anything that can be expressed in numbers
• Can assume only whole (integer) numbers
Examples
- Number of clinical visits
- Number of adverse events
- Number of angina attacks
- Number of households in a community, etc
Distinguish Between Qualitative and
Quantitative Variables
• Colour of eyes
• The number of leaves on Neem tree.
• The length of time on a phone call.
• Religion of people of a country
• Colour of hair of 100 children
Distinguish Between Discrete and
Continuous Variables
• The number of light bulbs that burn out in a room
• The age of a child
• The numbers of flowers on a tree
• The daily temperature recorded at the weather bureau
• Number of houses in a city
• Life time of an energy saver
Data Collection
Survey/Questionnaires Records
Experimentation
• The first step in describing a set of data is to organize it into a table called a frequency distribution
A grouping of data into non-overlapping classes showing the
number of observations in each class.
Frequency Distribution for qualitative
Variables
• The following table shows the frequency distribution of education status for a group of 25 people surveyed
Degrees Frequency
None 2
Bachelor 11
Master 7
Doctorate 5
Construct frequency tables for
• Gender
• Minority
ID Gender Minority
1 m No
2 m No
3 f No
4 f No
5 m No
6 m No
7 m No
8 f No
9 f No
10 f No
11 f No
12 m Yes
13 m Yes
14 f Yes
15 m No
16 m No
17 m No
18 m No
19 m No
20 f No
Frequency table for Gender
Gender frequency
Female 8
Male 12
Total 20
Frequency table for Minority
Minority frequency
No 17
Yes 3
Total 20
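The two frequency tables above can be reproduced with a short Python sketch. The rows are copied from the ID / Gender / Minority table; the variable names are illustrative.

```python
from collections import Counter

# (ID, Gender, Minority) rows copied from the table above.
rows = [
    (1, "m", "No"), (2, "m", "No"), (3, "f", "No"), (4, "f", "No"),
    (5, "m", "No"), (6, "m", "No"), (7, "m", "No"), (8, "f", "No"),
    (9, "f", "No"), (10, "f", "No"), (11, "f", "No"), (12, "m", "Yes"),
    (13, "m", "Yes"), (14, "f", "Yes"), (15, "m", "No"), (16, "m", "No"),
    (17, "m", "No"), (18, "m", "No"), (19, "m", "No"), (20, "f", "No"),
]

# Count how often each category occurs.
gender_freq = Counter(row[1] for row in rows)
minority_freq = Counter(row[2] for row in rows)

print("Female", gender_freq["f"], "Male", gender_freq["m"])
print("No", minority_freq["No"], "Yes", minority_freq["Yes"])
```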
Relative frequency
•It is the proportion of cases in any category/class to total cases
Degree Frequency Relative Frequency (%)
None 2 =2/25*100=8
Bachelor 11 =11/25*100=44
Master 7 =7/25*100=28
Doctorate 5 =5/25*100=20
Total 25 =25/25*100=100
Cumulative Relative frequency
• It is the proportion of cases in a particular category and all preceding categories
Degree Frequency Relative frequency Cumulative relative frequency
None 2 8% 8%
Bachelor 11 44% 52%
Master 7 28% 80%
Doctorate 5 20% 100%
Total 25 100%
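The relative and cumulative relative frequencies for the degree data can be computed as in the following Python sketch (the variable names are illustrative).

```python
from itertools import accumulate

# Frequencies from the education-status table (25 people).
freq = {"None": 2, "Bachelor": 11, "Master": 7, "Doctorate": 5}
total = sum(freq.values())

# Relative frequency (%) = category frequency / total * 100.
relative = {degree: count * 100 / total for degree, count in freq.items()}

# Cumulative relative frequency = running total of the relative frequencies.
cumulative = dict(zip(freq, accumulate(relative.values())))

for degree in freq:
    print(degree, freq[degree], f"{relative[degree]:.0f}%", f"{cumulative[degree]:.0f}%")
```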
The commonly used graphic forms are:
• Bar Graph
• Pie Chart
Bar Graph: A graph in which classes are reported on the horizontal axis and class frequencies on the vertical axis. The taller the bar, the greater the value it represents.
y – axis: Frequency or frequency percentage
x – axis: Class/Category
Gender frequency
Female 8
Male 12
Total 20
Single bar graph
Used to convey discrete values of each category
shown on opposite axis
Used when more than one discrete value is to be represented for each category
It is a preliminary data analysis tool. It is used to show segments of a total.
Usually not used because results cannot be read properly, which can lead to misleading conclusions
Pie Chart: A graph in which a circle is divided into sectors. Each sector represents a category of data.
Degree Frequency Percent(%)
None 2 8
Bachelor 11 44
Master 7 28
Doctorate 5 20
Total 25 100
• A researcher wishes to prepare a report showing the number of hours
per week students spend studying. He selects a random sample of 30
students and determines the number of hours each student studied last
week.
15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8, 13.5,
20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4, 18.3, 29.8,
17.1, 18.9, 10.3, 26.1, 15.7, 14.0, 17.8, 33.8, 23.2,
12.9, 27.1, 16.6.
• Organize the data into a frequency distribution.
Frequency Distribution for Quantitative variables
Study Hours Frequency Percent CP
10-15 8 27 27
15-20 11 37 63
20-25 7 23 87
25-30 3 10 97
30-35 1 3 100
Total 30 100
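The grouping can be checked with a short Python sketch. It assumes each class includes its upper limit (for example, 15.0 is counted in 10-15), since that is the convention that reproduces the frequencies in the table; the slides do not state the rule explicitly.

```python
# Study hours of the 30 sampled students.
hours = [
    15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8, 13.5,
    20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4, 18.3, 29.8,
    17.1, 18.9, 10.3, 26.1, 15.7, 14.0, 17.8, 33.8, 23.2,
    12.9, 27.1, 16.6,
]
edges = [10, 15, 20, 25, 30, 35]

# Assumption: a value belongs to a class if lower < value <= upper.
counts = [
    sum(1 for h in hours if lower < h <= upper)
    for lower, upper in zip(edges, edges[1:])
]

n = len(hours)
percent = [round(c * 100 / n) for c in counts]

# Cumulative percent (CP) is based on the running total of the counts.
running = 0
cum_percent = []
for c in counts:
    running += c
    cum_percent.append(round(running * 100 / n))

for (lower, upper), c, p, cp in zip(zip(edges, edges[1:]), counts, percent, cum_percent):
    print(f"{lower}-{upper}", c, p, cp)
```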
The commonly used graphic form is:
• Histograms
A histogram shows the shape of a distribution. In a histogram the classes are marked on the horizontal axis and the class frequencies on the vertical axis.
Common Shapes of frequency distribution
Measures of Central Tendency
• A measure of central tendency is a measure which indicates where the
middle of the data is.
• The three most commonly used measures of central tendency are:
- Mean
- Median
- Mode
Mean:
For a given set of n observations x_1, x_2, x_3, …, x_n, the mean is the sum of these numbers divided by n, and is denoted by
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• The following are the weight losses of 10 individuals who entered in a 5
week weight-control program:
• 9, 7, 10, 11, 10, 11, 4, 8, 10, 9
$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{n} = \frac{9 + 7 + 10 + \dots + 9}{10} = \frac{89}{10} = 8.9$
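The same calculation as a minimal Python sketch (variable names are illustrative):

```python
# Weight losses of the 10 individuals.
weight_loss = [9, 7, 10, 11, 10, 11, 4, 8, 10, 9]

n = len(weight_loss)
total = sum(weight_loss)   # sum of all observations
x_bar = total / n          # mean = sum divided by n

print(total, n, x_bar)
```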
Median:
•It is the middle most value of the data set.
•It divides the data in such a way that half the observations
are less than that number and half the observations are
greater than that number.
• It is denoted by $\tilde{x}$
Example:
• Find the median of the following three numbers: 2, 8, 5
• To find the median, first arrange the data in ascending or descending order
• The arranged data is 2, 5, 8
• Now, median = middle most observation = 5
• If n (number of observations) is odd, median = $\frac{n+1}{2}$th ordered observation.
Example:
Data: 1, 7, 6, 2, 5 n=5
Ordered: 1, 2, 5, 6, 7
The median is the $\frac{n+1}{2}$th $= \frac{5+1}{2}$th $=$ 3rd observation
So, median = 5
• If n (number of observations) is even, median = mean of the $\frac{n}{2}$th observation and the $\left(\frac{n}{2} + 1\right)$th observation.
EX2.
Data: 4, 6, 2, 7, 5, 8 n=6
Ordered: 2, 4, 5, 6, 7, 8
$\frac{n}{2} = \frac{6}{2} =$ 3rd observation
and $\frac{n}{2} + 1 = (3 + 1)$th observation $=$ 4th observation
So, median $= \frac{5 + 6}{2} = 5.5$
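The odd and even rules can be combined into one small Python function. This is a sketch of the rules stated above; the function name is illustrative.

```python
def median(values):
    """Return the median using the odd/even rules for n observations."""
    ordered = sorted(values)          # arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2-th ordered observation (index mid, 0-based)
        return ordered[mid]
    # n even: mean of the (n/2)-th and (n/2 + 1)-th ordered observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 8, 5]))
print(median([1, 7, 6, 2, 5]))
print(median([4, 6, 2, 7, 5, 8]))
```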
Determine the median of following two dataset:
1. {1, 2, 3, 4, 5}
2. {2, 3, 4, 5, 6, 7, 8,9}
The Mode:
The value which occurs most frequently in the data set.
If all values are different there is no mode.
Sometimes, there is more than one mode.
EX.
Data: 4, 5, 2, 2, 6, 8 n=6
So mode = 2. Here mean = 4.5 and median = 4.5
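A minimal Python sketch of the mode that follows the three rules above: it returns an empty list when all values are different and several values when there is more than one mode. The function name and the second and third test lists are illustrative.

```python
from collections import Counter


def modes(values):
    """Return the list of most frequent values, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                     # all values different: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([4, 5, 2, 2, 6, 8]))      # data from the example above
print(modes([1, 2, 3]))               # hypothetical data, all different
print(modes([1, 1, 2, 2, 3]))         # hypothetical data, two modes
```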
For example, we have the following monthly incomes:
30000, 45000, 45000, 45000, 200000
Here median = 45,000, mode = 45,000 and
$\text{mean} = \frac{30000 + 45000 + 45000 + 45000 + 200000}{5} = 73000$
• We can see mean is inflated to 73000 because of one extreme observation.
• So, median is a better choice
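The effect of the extreme observation can be seen in a short Python sketch. It uses the income data above; dropping the largest value is only an illustration of the outlier's influence.

```python
from statistics import mean, median

incomes = [30000, 45000, 45000, 45000, 200000]

mean_all = mean(incomes)
median_all = median(incomes)

# Illustration: remove the extreme value and recompute both measures.
without_outlier = [x for x in incomes if x != 200000]
mean_trimmed = mean(without_outlier)
median_trimmed = median(without_outlier)

print("With outlier:   ", mean_all, median_all)
print("Without outlier:", mean_trimmed, median_trimmed)
```

The mean changes substantially when the extreme value is removed, while the median stays the same.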
Mean
Uses all the values of the data set, so it is the most sensitive to variations in the data.
The mean is affected by extremely high or low values, called outliers, and may not be the appropriate average to use in these situations.
Median
The median is affected less than the mean by extremely high or
extremely low values and is therefore a valuable measure of central
tendency when such values occur.
• It is less sensitive to variations in the data
Measure of Dispersion
[Figure: three data sets, each with Mean = 7, showing different amounts of spread]
Example:
A study was conducted in a pharmaceutical company to determine whether the use of omeperazole and rabeperazole (anti-inflammatory enzymes) has an influence on the density of the drug. The densities in grams/cm3 of the drugs are as follows:
Omeperazole 0.55 0.32 0.36 0.37 0.39 0.43 0.43 0.47 0.52 0.53
Rabeperazole 0.26 0.26 0.43 0.23 0.47 0.51 0.52 0.55 0.59 0.55
The obtained measure of central tendency of the previous example are as
follow:
Mean Median Mode
Omeperazole 0.437 0.430 0.430
Rabeperazole 0.437 0.490 0.260
Both drugs have the same mean, i.e. 0.437, but the two drugs still differ.
There is more variation in the values of Rabeperazole.
Therefore, measures of central tendency do not give a complete description of the data.
• Measure of dispersion indicates the amount of variability in the data
set.
- Range
- Variance
- Standard Deviation
Range = Maximum Value – Minimum Value
Let’s calculate the range of previous example
Minimum Maximum Range
Omeperazole 0.32 0.55 0.23
Rabeperazole 0.23 0.59 0.36
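The mean, median and range of the two drugs can be computed together, as in the following Python sketch (data copied from the example; names are illustrative).

```python
from statistics import mean, median

densities = {
    "Omeperazole": [0.55, 0.32, 0.36, 0.37, 0.39, 0.43, 0.43, 0.47, 0.52, 0.53],
    "Rabeperazole": [0.26, 0.26, 0.43, 0.23, 0.47, 0.51, 0.52, 0.55, 0.59, 0.55],
}

summary = {}
for drug, values in densities.items():
    summary[drug] = {
        "mean": round(mean(values), 3),
        "median": round(median(values), 3),
        # Range = maximum value - minimum value
        "range": round(max(values) - min(values), 2),
    }

for drug, stats in summary.items():
    print(drug, stats)
```

Both drugs have the same mean, but the range shows that the second set of values is more spread out.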
• The variation in the densities of drugs using two contents is given below
Obs. $x_i$ $x_i - \bar{x}$ $(x_i - \bar{x})^2$
1 0.28 -0.129 0.016641
2 0.32 -0.089 0.007921
3 0.36 -0.049 0.002401
4 0.37 -0.039 0.001521
5 0.38 -0.029 0.000841
6 0.43 0.021 0.000441
7 0.43 0.021 0.000441
8 0.47 0.061 0.003721
9 0.52 0.111 0.012321
10 0.53 0.121 0.014641
Total 4.09 0 0.06089
Because it is expressed in squared units, the variance is not considered a good measure of dispersion. To avoid this, the square root of the variance is taken, which is called the standard deviation:
$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
• The standard deviation of the densities of drugs using two contents
is given below
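A minimal Python sketch of the calculation, using the ten observations from the deviation table above and the n - 1 formula for s. The comparison with `statistics.stdev` is only a cross-check.

```python
from math import sqrt
from statistics import stdev

# The ten observations x_i from the deviation table.
x = [0.28, 0.32, 0.36, 0.37, 0.38, 0.43, 0.43, 0.47, 0.52, 0.53]

n = len(x)
x_bar = sum(x) / n                               # mean
sum_sq = sum((xi - x_bar) ** 2 for xi in x)      # sum of squared deviations
variance = sum_sq / (n - 1)                      # sample variance s^2
s = sqrt(variance)                               # standard deviation s

print(f"mean = {x_bar:.3f}")
print(f"sum of squared deviations = {sum_sq:.5f}")
print(f"variance = {variance:.6f}")
print(f"standard deviation = {s:.4f}")

# Cross-check against the standard library (also uses n - 1).
assert abs(s - stdev(x)) < 1e-12
```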
Measure of Shape: Skewness
• A frequency distribution can assume many shapes. The three most commonly observed shapes are:
- symmetrical
- positively skewed and
- negatively skewed.
• Measure of skewness describes the shape of data
For a symmetrical distribution, the mean will equal the median, and the
skewness coefficient will be zero.
mean = median = mode
If the distribution is skewed to the right, the mean will be greater
than the median, and the coefficient will be positive.
mean > median > mode
If the distribution is skewed to the left, the mean will be less than the median, and the coefficient will be negative.
mode > median > mean
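The sign of the skewness coefficient can be illustrated with a short Python sketch. The slides do not say which coefficient is meant; this sketch assumes one common definition, the moment coefficient (third central moment divided by the second central moment raised to the power 1.5). The symmetric and left-skewed data sets are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness (an assumed definition, see lead-in)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5


right_skewed = [30000, 45000, 45000, 45000, 200000]  # income example
symmetric = [1, 2, 3, 4, 5]                          # hypothetical
left_skewed = [-x for x in right_skewed]             # mirror image, hypothetical

print(skewness(right_skewed) > 0)   # positive for a right-skewed data set
print(skewness(symmetric) == 0)     # zero for a symmetrical data set
print(skewness(left_skewed) < 0)    # negative for a left-skewed data set
```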
• SPSS is one of the most popular statistical packages; it can perform highly complex data manipulation and analysis with simple instructions
• Start → All Programs → SPSS Inc → SPSS 19.0 → SPSS 19.0
• The default window will have the data editor
• There are two sheets in the window:
1. Data view 2. Variable view
• Data Editor
- For defining, entering, editing, and displaying data. The saved file has the extension “.sav”.
Output Viewer
• Displays output. The saved file has the extension “.spv”.
• How would you put the following information into SPSS?
Patient ID   Gender   Age   Height   Smoking Status
1 2 50 5.4 2
2 2 45 5.1 2
3 1 43 5.6 1
4 2 35 6 1
5 1 29 5.9 2
6 2 32 5.6 2
7 2 36 5.8 2
8 2 55 5 2
9 1 49 5.4 1
10 1 43 5.11 1
Gender: 1= Male ,2 = Female
Smoking Status:1= Yes, 2 = No
Practice 1
• To save the data file you created, simply click ‘File’ and then click ‘Save As.’
• Sort the data by the ‘Height’ of students in descending order.
• Click ‘Data’ and then click Sort Cases
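Outside SPSS, the same sort can be sketched in Python to show what the result should look like. The rows are the ten patients from the table above; the tuple layout is an assumption of this sketch.

```python
# (Patient ID, Gender, Age, Height, Smoking Status) from the practice table.
patients = [
    (1, 2, 50, 5.4, 2), (2, 2, 45, 5.1, 2), (3, 1, 43, 5.6, 1),
    (4, 2, 35, 6, 1), (5, 1, 29, 5.9, 2), (6, 2, 32, 5.6, 2),
    (7, 2, 36, 5.8, 2), (8, 2, 55, 5, 2), (9, 1, 49, 5.4, 1),
    (10, 1, 43, 5.11, 1),
]

# Sort by Height (4th column) in descending order.
# Rows with equal heights keep their original relative order.
by_height = sorted(patients, key=lambda row: row[3], reverse=True)

sorted_ids = [row[0] for row in by_height]
print(sorted_ids)
```

Heights are compared here as plain numbers, so 5.11 sorts between 5.1 and 5.4.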
Opening the sample data
• Open ‘Employee data.sav’ from the SPSS Samples folder
• Go to “File,” “Open,” and click “Data”
• Go to the “Program Files,” “SPSSInc,” “SPSS16,” and “Samples” folder.
• Open the “Employee Data.sav” file
• Recoding into the same variable
• Recoding into different variables
• It is always recommended to recode into
different variables and not to alter the original
variable
• Click on Transform > Recode > Into different variables.
• Select the variable you want to recode (Educational Level).
• Start by giving the new variable a new name (educat)
• Click on Change
• Click on Old and New Values
• Use “Range” (fourth option down) to recode as follows. Remember
to click on “Add” after entering each recode.
- 8 to 12 = 1
- 13 to 16 = 2
- 17 to 21 = 3
• Click Continue
• And then OK.
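The recoding rule itself can be sketched in Python. The `educ` values below are hypothetical, and treating values outside the three ranges as missing (`None`) is an assumption of this sketch. The original list is left unchanged, which mirrors the recommendation to recode into a different variable.

```python
def recode_education(years):
    """Recode years of education into the three categories defined above."""
    if 8 <= years <= 12:
        return 1
    if 13 <= years <= 16:
        return 2
    if 17 <= years <= 21:
        return 3
    return None  # assumption: values outside all ranges are left missing


# Hypothetical values of the original variable (years of education).
educ = [8, 12, 15, 16, 19, 21, 7]

# New variable; the original 'educ' list is not altered.
educat = [recode_education(y) for y in educ]

print(educ)
print(educat)
```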
Basic Analysis with SPSS
• Frequencies
- This analysis produces frequency tables showing frequency counts
and percentages of the values of individual variables.
• Descriptive
- This analysis shows the maximum, minimum, mean, range, standard deviation etc. of the variables
Frequencies
• Click ‘Analyze,’ ‘Descriptive statistics,’ then click ‘Frequencies’
• Click gender and put it into the variable box.
• Click ‘Charts.’
• Then click ‘Bar charts’ and click ‘Continue.’
• Finally Click OK in the Frequencies box.
Descriptives
• Click ‘Analyze,’ ‘Descriptive statistics,’ then click ‘Descriptives…’
• Click ‘Current Salary’ and ‘Beginning Salary,’ and put them into the variable box.
• Click Options
• The options allow you to request other descriptive statistics besides the mean and standard deviation.
• Click ‘variance’, ‘Minimum’, ’Maximum’ and ‘Range’
• Finally click ‘Continue’
• Finally Click OK in the Descriptives box. You will be able to see the
result of the analysis.
Select File → Open → Data
Choose Excel as the file type
Select the file you want to import
Then click Open
Key in values and labels for each variable
Run frequencies for each variable
Check the outputs to see whether any variables have wrong values.
Check missing values against the physical surveys if you use paper surveys, and make sure they are truly missing.
Sometimes you need to recode string variables into numeric variables
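The idea of running a frequency table to spot wrong values can be sketched in Python. The responses below are hypothetical; the valid codes follow the Gender coding used earlier (1 = Male, 2 = Female).

```python
from collections import Counter

valid_codes = {1: "Male", 2: "Female"}

# Hypothetical data entry for Gender; None marks a missing value.
gender = [1, 2, 2, 1, 22, 2, None, 1, 3, 2]

# Frequency table of every value that was keyed in.
freq = Counter(gender)

# Any non-missing value that is not a valid code is a wrong entry.
wrong_values = sorted(v for v in freq if v is not None and v not in valid_codes)
n_missing = freq[None]

print("Wrong entries:", wrong_values)
print("Missing values:", n_missing)
```

Wrong entries found this way should be checked against the original survey forms before being corrected.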
Recode variables
1. Select Transform → Recode into Different Variables
2. Select the variable that you want to transform (e.g. Q20): we want 1 = Yes and 0 = No
3. Click the arrow button to put your variable into the right window
4. Under Output Variable, type a name and label for the new variable, then click Change
5. Click Old and New Values
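The Yes/No recode can be sketched in Python for comparison. The `q20` answers and the new name `q20_num` are hypothetical; mapping anything other than Yes or No to `None` is an assumption of this sketch.

```python
# Target coding from step 2: 1 = Yes and 0 = No.
coding = {"yes": 1, "no": 0}

# Hypothetical string answers to Q20, including inconsistent typing.
q20 = ["Yes", "No", "yes", " No ", "YES", "", "N/A"]

# New numeric variable; the original answers are not altered.
# Answers that are neither Yes nor No become None (treated as missing).
q20_num = [coding.get(answer.strip().lower()) for answer in q20]

print(q20_num)
```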