Dr. Urooj A Siddiqui
Data – Raw Facts, especially numerical facts,
collected together for reference or
information.
Data is collected on some particular
variable/s
Data analysis is processing of data to derive
useful information
Information – Knowledge communicated concerning some
particular fact
The created knowledge helps in APPLICATION /
DECISION MAKING
Categorical: Qualitative
Continuous: Quantitative
Data
   Categorical: Nominal, Ordinal
   Continuous: Interval, Ratio
Variable: Any phenomenon which takes at least two
different values/observations
Data: The set of values/observations collected on a
variable
Nominal
Ordinal
Interval
Ratio
1. Data Preparation / Initial Operations
   Editing / Cleaning
   Coding
   Classification
   Tabulation
   Graphical Representation
2. Summarizing Data / Data Analysis Operations
   Tables / Crosstab
   Graph / Figure
   Statistical Analysis
   1. Descriptive Methods
      Frequency, %age, Ratio, Mean, Median, Standard Deviation (Variance)
   2. Inferential Methods
      Comparison (t/z-test/Anova)
      Association (chi square test)
      Correlation (r)
      Prediction / Regression (y = ax + b)
Editing / Data Cleaning
examining the collected raw data to detect any errors
and omit/correct them if possible
Coding
assigning numerals to answers so that responses can
be put into a limited number of categories
Classification
Grouping of data on some basis (a large volume of raw
data is reduced into homogeneous groups)
I. Attribute – on the basis of demographic attributes,
e.g. gender, rural/urban, day scholar/hosteller
II. Class Interval – on the basis of some numeric range,
e.g. 0-10, 10-20 etc.
I. Tabulation
is the process of displaying raw data in tabular
form and summarising it for further analysis
orderly arranging data in columns and rows
Tabulation is essential because
It conserves space and reduces statements
It facilitates the process of summation of
items, comparison, detection of errors and
omissions
Basis for various statistical computations
Name     Gender  Caste   Age  Mob. No.    Edu level  Yrs in school  IQ  Pain level  Temp of locality (deg C)
Ram      M       Hindu   60   9450366367  NIL        0              16  Mild-0      -4
Akbar    M       Muslim  65   8004896712  HS         16             14  Mod-1       20
Sita     F       Hindu   309  9934876545  Int.       19             0   Mild-0      15
Shalini  F       Hindu   90   2542543598  HS         8              16  Mild-0      0
Mehnaj   F       Sikh    38   9458098734  UG         21             13  Severe-2    0
Ravi     M       Hindu   48   9412890112  PG         23             20  Mod-1       -1
Hari     M       Hindu   45   8796654398  Prim       12             10  Mod-1       30
Name  Gender  Caste  Age  Mob. No.    Edu level  Yrs in sch.  IQ  Pain level  Temp of locality (deg C)
7     1       1      60   9450366367  1          0            16  0           4
2     1       2      65   8004896712  1          16           14  2           20
5     2       1      35   9934876545  2          19           0   0           15
4     2       1      90   2542543598  1          8            16  0           0
3     2       3      38   9458098734  3          21           13  3           0
6     1       1      48   9412890112  4          23           20  2           -1
1     1       1      45   8796654398  0          12           10  2           30
Nominal and ordinal data are called qualitative. Interval and ratio data are called quantitative.
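The coding step can be sketched in a few lines of Python. This is only an illustration: the code books below (M = 1, F = 2; Hindu = 1, Muslim = 2, Sikh = 3) are read off the coded table, and the names GENDER_CODES, CASTE_CODES and code_record are hypothetical.

```python
# Hypothetical code books: each answer category is assigned a numeral.
GENDER_CODES = {"M": 1, "F": 2}
CASTE_CODES = {"Hindu": 1, "Muslim": 2, "Sikh": 3}


def code_record(record):
    """Return a copy of the record with categorical answers replaced by numerals."""
    coded = dict(record)
    coded["gender"] = GENDER_CODES[record["gender"]]
    coded["caste"] = CASTE_CODES[record["caste"]]
    return coded


# Three rows taken from the raw table (only some columns are kept here).
raw = [
    {"name": "Ram", "gender": "M", "caste": "Hindu", "age": 60},
    {"name": "Akbar", "gender": "M", "caste": "Muslim", "age": 65},
    {"name": "Mehnaj", "gender": "F", "caste": "Sikh", "age": 38},
]

coded = [code_record(r) for r in raw]
for row in coded:
    print(row)
```

Numeric columns such as age pass through unchanged; only the categorical answers are coded.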
Single / Multi Variable Table – one or more variables (no interaction)

Roll No.  Age (yr)
1         22
2         24
3         23
4         26
5         19
6         25
.         .
.         .
60        22

Single Variable Freq. Table
Age Group (years)  Freq.
Below 20           2
20-22              28
22-24              16
24-26              10
Above 26           4
Total              60

**Multiple Variable Table – as presented in above slide
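A frequency table like the one above can be built by classifying each age into a class interval and counting. The sketch below uses only the six ages listed for roll numbers 1 to 6, and it assumes each interval includes its lower limit and excludes its upper limit (so 22 falls in 22-24 and 26 falls in "Above 26"); the slides do not state this convention.

```python
from collections import Counter


def age_group(age):
    """Classify an age into a class interval (lower limit inclusive, an assumption)."""
    if age < 20:
        return "Below 20"
    if age < 22:
        return "20-22"
    if age < 24:
        return "22-24"
    if age < 26:
        return "24-26"
    return "Above 26"


ages = [22, 24, 23, 26, 19, 25]  # roll numbers 1 to 6 only
freq = Counter(age_group(a) for a in ages)

for group in ["Below 20", "20-22", "22-24", "24-26", "Above 26"]:
    print(group, freq[group])
```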
Crosstabs – interaction of two or more
variables
Two Variable Interaction – Crosstab
             Gender
Age Group    Male  Female  Total
Below 20     1     1       2
20-22        18    10      28
22-24        9     7       16
24-26        7     3       10
Above 26     3     1       4
Total        38    22      60
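A crosstab is a count of how often each pair of categories occurs together. A minimal sketch with a handful of made-up (age group, gender) records, not the class data, might look like this:

```python
from collections import Counter

# Hypothetical records: one (age group, gender) pair per student.
records = [
    ("20-22", "Male"),
    ("20-22", "Female"),
    ("22-24", "Male"),
    ("20-22", "Male"),
    ("24-26", "Female"),
]

cells = Counter(records)                               # cell counts
row_totals = Counter(group for group, _ in records)    # totals per age group
col_totals = Counter(gender for _, gender in records)  # totals per gender

for group in sorted(row_totals):
    print(group, cells[(group, "Male")], cells[(group, "Female")], row_totals[group])
print("Total", col_totals["Male"], col_totals["Female"], len(records))
```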
Graphical Representation of Data
Pie Chart
Bar Graph
Histogram
Line Graph
Scatter Plot
Scatter Plot & Correlation
Pie Charts
It is used to represent %ages, distribution of 1
variable at various levels
[Figure: pie chart of Sales (in mn) by quarter (1st Qtr to 4th Qtr), with slices of 8.2 (58%), 3.2 (23%), 1.4 (10%) and 1.2 (8%)]
Bar Chart
It is used to represent 1 variable at various levels
Levels can be year/ groups etc.
[Figure: bar chart of Sales by year, 2018 to 2021; bars labelled with the values 2.5, 3.5, 4.3 and 4.5]
Bar Chart
[Figure: clustered bar chart for 2018, 2019 and 2020, with four bars per year (1st, 2nd, 3rd, 4th)]
Histogram
To show the distribution of a quantitative variable
[Figure: histogram; x-axis: Class Interval/Variable Unit (10 to 50), y-axis: Frequency (0 to 12); shown beside the Roll No. / Age (yr) list (1 22, 2 24, 3 23, 4 26, 5 19, 6 25, ..., 60 22)]
Line Diagram
To show change in variable in a particular time
period / on some reference range
[Figure: line graph of Stock Price (₹ 5.60 to ₹ 7.40) over the Last 10 Days]
Line Diagram
May also be used to compare 2 or more variables
along the range
[Figure: line graph comparing three series (Adani, Tata, Reliance) over periods 1 to 8; y-axis 0 to 14]
Scatter Plot
It is used to express relationships between two
variables
[Figure: scatter plot; x-axis: Adv Budget (in 10 Lacs), y-axis: Sales (in Crore)]
Scatter Plot
to express relationships between two variables
Scatter Plot
Trend Lines - Correlation
Income / day   No. of families
0-500          20
500-1000       30
1000-1500      50
1500-2000      70
2000-2500      40
2500-3000      30
3000-3500      10
[Figure: No. of families (0 to 80) plotted against Income (0 to 4000)]
       age (xi)   x̄ - xi   (x̄ - xi) sqr.
A      21         2         4
B      22         1         1
C      23         0         0
D      24         -1        1
E      25         -2        4
Sum               0         10
Mean x̄ = 23
Avg Sq (variance) = 10 / 5 = 2, n = 5
SD (root of variance) s = 1.41
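The same arithmetic can be checked in Python. Like the table, this sketch divides the sum of squared deviations by n; the sample formula that divides by n - 1 (statistics.variance) would give 2.5 for these five ages.

```python
from statistics import mean, pvariance

ages = [21, 22, 23, 24, 25]           # students A to E

x_bar = mean(ages)                    # 23
deviations = [x_bar - x for x in ages]
squared = [d ** 2 for d in deviations]

variance = sum(squared) / len(ages)   # average squared deviation, divides by n
sd = variance ** 0.5                  # square root of the variance

print(x_bar, sum(deviations), sum(squared), variance, round(sd, 2))
print(pvariance(ages))                # library check, also divides by n
```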
Roll No.  Age (yr)
1         22
2         24
3         23
4         26
5         19
6         22
.         .
.         .
60        22

Age Group (years)  Freq.  Probability
Below 20           2      2/60
20-22              28     28/60
22-24              16     16/60
24-26              10     10/60
Above 26           4      4/60
Total              60

Mean 23 (years)
(x̄ – sample – known)
(µ – population – unknown)
SD 2 (years)
(s – sample – known)
(σ – population – unknown)
A distribution of the relative frequencies of observations is
known as a probability distribution
Z – Normal Distribution/Test - Mean (µ), SD (σ)
To compare means (1 or 2 means)
t – Distribution/Test - Mean (x̄), SD (s)
To compare means (1 or 2 means)
Chi Square Distribution / Test
To compare sample SD with population SD
F Test
To compare two sample variances
A freq. distribution with bell shape curve and
some known properties
Parameters - Mean (µ), SD (sigma)
Known properties
About 68% of values are within µ ± 1 SD
About 95% of values are within µ ± 2 SD
About 99.7% of values are within µ ± 3 SD
95% CI = µ ± 2.SD (range)
Lower limit µ - 2.SD
Upper limit µ + 2.SD
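[Figure: normal curve for the class ages centred on the mean 23, with µ ± 1 SD = 21 to 25, µ ± 2 SD = 19 to 27 and µ ± 3 SD = 17 to 29]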
Example of our case
95% CI = µ ± 2 × SD
Lower limit = µ - 2 × SD, Upper limit = µ + 2 × SD
LL = 23 - 2 × 2 = 19, UL = 23 + 2 × 2 = 27
95% CI Range = 19-27 years
95% of the students in the class are in the range
of 19-27 yrs
We are 95% confident that if we randomly select
a student from the class his/her age will be
within this range (19-27 yrs)
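As a quick check of the 68 / 95 / 99.7 rule and the 19-27 range, Python's statistics.NormalDist can be used. The sketch assumes the ages follow a normal distribution with mean 23 and SD 2, as in the example; coverage is a hypothetical helper name.

```python
from statistics import NormalDist

ages = NormalDist(mu=23, sigma=2)   # assumed normal model for the class ages


def coverage(k):
    """Share of values lying within mean ± k SD under the normal model."""
    return ages.cdf(ages.mean + k * ages.stdev) - ages.cdf(ages.mean - k * ages.stdev)


lower = ages.mean - 2 * ages.stdev   # 19
upper = ages.mean + 2 * ages.stdev   # 27

print(lower, upper)
for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
```

The exact share within ± 2 SD is a little over 95%; the multiplier 2 is a rounded form of 1.96.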
Reverse is Hypothesis Testing
If mean and SD of any population is known and if
some value is given can we determine whether it
belongs to this population or distribution ?
[Figure: normal curve with the z-score axis marked at 0, ±0.5, ±1 and ±1.5]
Finding Probability
Calculate z score (test statistic) of the observed
value or hypothesized value with the formula
z = (x - µ) / σ
Determine p value associated with particular z
score at selected significance level (5%)
P value can be seen in the tables of the particular
test
When Population SD is KNOWN: z = (x̄ - µ) / (σ / √n)
When Population SD is UNKNOWN: t = (x̄ - µ) / (s / √n)
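A z score and its p value can also be computed directly instead of being read from a table. The sketch below assumes a single observed value of 28 years (a made-up figure) tested against the class distribution with mean 23 and SD 2, and uses a two-tailed p value; z_score and two_tailed_p are hypothetical helper names.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Distance of an observed value from the mean, in SD units."""
    return (x - mu) / sigma


def two_tailed_p(z):
    """Probability of a z score at least this extreme in either tail."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


z = z_score(28, mu=23, sigma=2)   # hypothetical observed age of 28
p = two_tailed_p(z)

print(z, round(p, 4))
```

For a sample mean rather than a single value, sigma would be replaced by the standard error σ / √n.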
Two types of Hypothesis, Null - H0, Alternate - Ha
P Value Method
   Determine p value
   Compare with selected alpha level (0.05)
   p ≤ 0.05 – Reject Null
   p > 0.05 – Fail to Reject null / accept null
   This method is generally employed by data analysis
   software – Excel, SPSS
Table Value Method
   Calculate test statistic value – TSCal
   Determine Critical value of test statistic at
   selected significance level – TSTab
   If TSCal ≥ TSTab – Reject Null
   If TSCal < TSTab – Fail to Reject null / accept null
   This method is generally employed when manual
   testing is done
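The two decision rules can be written as two small functions. This is a sketch for a two-tailed z test at the 5% level; the function names are hypothetical, and the table value is computed with NormalDist.inv_cdf rather than looked up.

```python
from statistics import NormalDist

ALPHA = 0.05  # selected significance level


def decide_by_p_value(p, alpha=ALPHA):
    """P value method: reject the null when p is at or below alpha."""
    return "reject null" if p <= alpha else "fail to reject null"


def decide_by_table_value(ts_cal, ts_tab):
    """Table value method: reject the null when |TSCal| reaches TSTab (two-tailed)."""
    return "reject null" if abs(ts_cal) >= ts_tab else "fail to reject null"


# Critical z value for a two-tailed test at 5%, about 1.96.
z_tab = NormalDist().inv_cdf(1 - ALPHA / 2)

print(round(z_tab, 2))
print(decide_by_table_value(2.5, z_tab))
print(decide_by_p_value(0.20))
```

Both methods lead to the same decision for the same test; they differ only in whether the comparison is made on the probability scale or on the test-statistic scale.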
RN  Gender (G)  Caste (C)  Age (A)  Mob. No.    No. of Classes (N)  Marks Obtained (M)  Specialization Opted (S)
1   1           1          22       9450366367  87                  72                  HR-3
2   1           2          24       8004896712  65                  68                  HR-3
3   2           1          26       9934876545  48                  56                  Fin.-2
4   2           1          21       2542543598  95                  83                  Mktg.-1
5   2           3          22       9458098734  65                  58                  Fin.-2
6   1           1          23       9412890112  74                  65                  Mktg.-1
• Mean & Variance (SD) – e.g. A, N, M – sample statistics – x̄, s
• Correlation – e.g. N-M, A-N, A-M – r
• Association between Gender and Specialization Opted (G and S) – chi square
Note: a sample characteristic is called a Statistic; a population characteristic is called a Parameter
Assume a population – N, µ, σ
Now assume we take many samples of size n and
calculate mean for each sample
x1, x2, x3, x4, x5, x6, . . . . . . . . x100
Can we make a freq. distribution of these values
and draw a curve?
Now when we draw a distribution of these values
we will have an average (x̄) and SD (s)
This average is called the mean of means and is
considered the mean of the population (µ)
The SD of this distribution of sample means is
calculated as σ / √n, which is called the Standard Error
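This idea can be tried out with a small simulation. Everything in the sketch is assumed for illustration: a synthetic population of 10,000 ages drawn around 23 with SD 2, samples of size n = 25 drawn with replacement, and 2,000 repeated samples.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic population (assumed, not data from the slides).
population = [rng.gauss(23, 2) for _ in range(10_000)]
mu = mean(population)
sigma = pstdev(population)

n = 25  # size of each sample
sample_means = [mean(rng.choices(population, k=n)) for _ in range(2_000)]

mean_of_means = mean(sample_means)   # should sit close to mu
sd_of_means = pstdev(sample_means)   # should sit close to sigma / sqrt(n)
standard_error = sigma / n ** 0.5

print(round(mu, 2), round(mean_of_means, 2))
print(round(standard_error, 3), round(sd_of_means, 3))
```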
Sample mean & their difference - z / t
Sample correlation statistic– z / t (derived from r)
Variance (SD²) – F
Association – Chi Sqr.
Central Limit Theorem
If we collect many samples and draw the
distribution of their means, the mean of this distribution is
the population mean and its SD is σ / √n (the standard error)
We use CLT in Hypothesis Testing
z - when σ is Known and sample size is ≥ 30
t - when σ is Unknown and sample size < 30
In sample estimation t test is employed
Example - H0 & H1
H0 – There is no difference b/w mean of two groups
H1 – There is a significant difference b/w mean of two groups
H0 – There is no difference b/w mean marks of males &
females
H1 – There is a significant difference b/w mean marks of males & females
Hypothesis Testing steps
Set Null Value (µ1=µ2, µ1-µ2=0) – Make Null Distribution –
Calculate z /t sample test statistic – compare with table
value/set p value – reject/accept null
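These steps can be sketched for two small groups. The marks below are invented for illustration, the pooled-variance (equal variance) form of the two-sample t statistic is assumed, and the table value 2.306 is the two-tailed 5% critical t for 8 degrees of freedom.

```python
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic assuming both groups share a common variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    standard_error = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


males = [60, 62, 64, 66, 68]     # hypothetical marks
females = [70, 72, 74, 76, 78]   # hypothetical marks

t_cal = pooled_t(males, females)
T_TAB = 2.306                    # t table, df = 5 + 5 - 2 = 8, alpha = 0.05, two-tailed

decision = "reject null" if abs(t_cal) >= T_TAB else "fail to reject null"
print(t_cal, decision)
```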
F Test
Used to compare the variances of two samples
Employed in ANOVA – analysis of variance
When there are more than two groups and their
means are to be compared
Example
Comparison of marks among three streams of
students arts, commerce and science
H0 – There is no difference among mean marks of three groups
H1 – There is a significant difference among mean marks of three
groups
Set Null Value (µ1=µ2=µ3) – Make Null Distribution – Calculate F
test statistic – compare with table value/p value – reject/accept
null
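A one-way ANOVA F statistic is the ratio of the variation between group means to the variation within groups. The sketch below uses invented marks for the three streams; 5.14 is the 5% table value of F for (2, 6) degrees of freedom.

```python
from statistics import mean


def one_way_f(groups):
    """F statistic: mean square between groups over mean square within groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


arts = [50, 55, 60]       # hypothetical marks
commerce = [60, 65, 70]   # hypothetical marks
science = [70, 75, 80]    # hypothetical marks

f_cal = one_way_f([arts, commerce, science])
F_TAB = 5.14              # F table, df = (2, 6), alpha = 0.05

decision = "reject null" if f_cal >= F_TAB else "fail to reject null"
print(f_cal, decision)
```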
Test of Independence
It is used to determine association between two
categorical variables (nominal & ordinal)
Example
Gender (M/F) and Opted Specialization (Mktg./Fin./HR)
Questions like ‘is any specialisation preferred by
females?’ are answered
H0 – There is no association b/w gender and opted specialisation
H1 – There is a significant association b/w gender & opted
specialisation
Here, mean is not calculated instead frequency of categories
is taken into consideration
Actual Frequency and Expected Frequency
Cross tabs are used to calculate actual & expected freq
Two Variable Interaction – Crosstab
                       Total   Gender
Opted Specialization   (60)    Male (40)   Female (20)
Mktg.                  30      20          8
Fin.                   15      10          2
HR                     15      10          10
Total                  60      40          20
Hypothesis Testing steps
Set Null Value (actual freq. = expected freq.) – Make Null
Distribution – Calculate chi sqr. sample test statistic –
compare with table value/set p value – reject/accept null
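The chi square calculation can be sketched as follows. Each expected frequency is taken as row total × column total / grand total. The observed counts here are an assumption made for illustration: they keep the totals from the crosstab (30, 15, 15 by specialization; 40 males, 20 females) with an assumed split of 22/8, 13/2 and 5/10. The table value 5.991 is the 5% critical chi square for 2 degrees of freedom.

```python
def chi_square(observed):
    """Return the chi square statistic and the table of expected frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return stat, expected


# Rows: Mktg., Fin., HR. Columns: Male, Female. Counts are assumed for illustration.
observed = [[22, 8], [13, 2], [5, 10]]

chi_cal, expected = chi_square(observed)
CHI_TAB = 5.991   # chi square table, df = (3 - 1) * (2 - 1) = 2, alpha = 0.05

decision = "reject null" if chi_cal >= CHI_TAB else "fail to reject null"
print(expected)
print(round(chi_cal, 2), decision)
```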
Set Null and Alternate Hypothesis – H0 H1
Select the null value
Null – status quo, no difference, no effect
Status quo – no change
No difference – 0 difference
No relationship – 0 effect / 0 correlation
No association – 0 relationship (b/w nominal variab.)
It is assumed that H0 is true in population
Draw Null Distribution – find range of expected values
if null is true (µ ± 2.SE)
Take observed value from sample and compare with
expected null values
If observed value is among expected null range –
accept null
If observed value is different from null range – reject
null
1. Univariate/Bi-variate
   Mean/Variance Estimation
   Z test
   T test
   Chi Square
   F Test
   Correlation
2. Multi-variate
   Correlation
   Regression
   Discriminant
   Cluster Analysis etc.
Regression analysis
1 dependent variable/DV (continuous)
many independent variables/IV (continuous)
Y = a.x1 + b.x2 + c.x3 + ... (one coefficient per independent variable, up to xn)
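For the simplest case of one independent variable (y = ax + b), the coefficients can be estimated by least squares. The sketch below uses made-up points and a hypothetical helper fit_line; with several independent variables the same idea applies but needs matrix methods or a statistics package.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b = y_bar - a * x_bar
    return a, b


xs = [1, 2, 3, 4]   # hypothetical independent variable
ys = [3, 5, 7, 9]   # hypothetical dependent variable

a, b = fit_line(xs, ys)
print(a, b)
print(a * 5 + b)    # prediction for x = 5
```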
Discriminant analysis
1 dependent variable (categorical)
many independent variables (continuous)
Z (yes/no) = a.x1 + b.x2 + c.x3 + ... (one coefficient per independent variable, up to xn)
Cluster analysis
No DV/IV
Used to group respondents/customers into
various clusters
Employed in market segmentation
Factor analysis
No DV/IV
Used to group variables into clusters of
more condensed variables (factors)