FDSA Unit - 2
AD3491- FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS /II CSBS/III-SEM/KG-KITE
UNIT- II
2.WHAT IS STATISTICS?
3. TYPES OF DATA
4.LEVELS OF MEASUREMENT
5.TYPES OF VARIABLES
2.WHAT IS STATISTICS?
Statistics is the branch of mathematics concerned with collecting, analyzing, interpreting, and presenting
numerical data.
Statistics serves as the backbone of data science, providing the essential tools and techniques for extracting
meaningful insights from data.
Types of Statistics
1. Descriptive Statistics
• Descriptive statistics are tools that help us simplify and organize large chunks of data, making
vast amounts of information easier to understand.
Use cases of Descriptive Statistics
Measures of Central Tendency
• Point Estimation: Provides a single value estimate of a population parameter (e.g., sample
mean as an estimate of population mean).
• Interval Estimation: Provides a range of values within which the population parameter is
expected to lie (e.g., confidence intervals).
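As a rough sketch of the two kinds of estimation, the snippet below computes a point estimate (the sample mean) and an approximate 95% confidence interval. The scores are made up for illustration, and the interval assumes a normal approximation with the critical value taken from the standard normal curve; for a sample this small, a t-based interval would usually be preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample (illustrative values only).
sample = [12, 19, 17, 15, 14, 18, 16, 13]

# Point estimate: the sample mean is a single-value estimate of the population mean.
point_estimate = mean(sample)

# Interval estimate: mean +/- z * standard error (normal approximation, an assumption).
z_crit = NormalDist().inv_cdf(0.975)           # roughly 1.96 for 95% confidence
std_error = stdev(sample) / sqrt(len(sample))  # stdev() divides by n - 1
ci_low = point_estimate - z_crit * std_error
ci_high = point_estimate + z_crit * std_error

print(f"point estimate: {point_estimate}")
print(f"approx. 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```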
Hypothesis Testing
• Null Hypothesis (H0): A statement of no effect or no difference, which researchers aim to
test against.
• Alternative Hypothesis (H1): A statement indicating the presence of an effect or
difference.
• p-value: The probability of observing test results at least as extreme as those actually obtained, assuming the null hypothesis is true.
• Significance Level (α): The threshold for rejecting the null hypothesis, commonly set at
0.05.
Regression Analysis
• Simple Linear Regression: Analysing the relationship between two continuous variables.
• Multiple Regression: Examining the relationship between one dependent variable and
multiple independent variables.
Three Types of Data
Data
A collection of actual observations or scores in a survey or an experiment.
A collection of numbers, characters, images, and other items that can be arranged in some manner to
form meaningful information.
In statistics, data are a collection of numbers that is first studied, then analysed and presented
in a way that yields meaningful insight.
For example, 12, 19, 17, and 15 are data.
1. Qualitative Data
A set of observations where any single observation is a word, letter, or numerical code that represents a class or
category.
Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1) that represent a class
or category.
Qualitative data is also called categorical data because it sorts observations into categories.
Examples of qualitative data include the gender of people and their family names in a sample of population data.
Types:
• Nominal Data
• Ordinal Data
1. Nominal Data :
Nominal data is a type of data that consists of categories or names that cannot be ordered or ranked.
Examples of nominal data :
gender (Male or female),
race (White, Black, Asian)
religion (Hinduism, Christianity, Islam, Judaism), and
blood type (A, B, AB, O).
Ordinal Data
Ordinal data is a type of data that consists of categories that can be ordered or ranked.
Examples of ordinal data include education level (Elementary, Middle, High School, College), job position
count.
Quantitative data is the type of data that represents numerical values.
This data type is used to represent measurements such as height, weight, and length.
These data types have values that can be easily counted as whole numbers.
Levels of Measurement
Level of Measurement
Specifies the extent to which a number (or word or letter) actually represents
some attribute and, therefore, has implications for the appropriateness of various
arithmetic operations and statistical procedures
Nominal Measurement
Words, letters, or numerical codes of qualitative data that reflect differences in
Interval/Ratio Measurement
Amounts or counts of quantitative data reflect differences in degree based on
equal intervals and a true zero
Types Of Variables
1. Variable - A characteristic or property that can take on different values.
2. Constant - A characteristic or property that can take on only one value
Experiment - A study in which the investigator decides who receives the special
treatment
Independent Variable - The treatment manipulated by the investigator in an
experiment
Dependent Variable - A variable that is believed to have been influenced by the
independent variable
form which helps us summarize the large dataset into a concise table.
A frequency distribution table has two columns: one lists the data, either as
ranges (class intervals) or as individual values, and the other
shows the frequency of each interval or individual value.
Types of Frequency Distribution
1. Grouped Frequency Distribution
1. Frequency distribution for grouped data :
Observations are divided between different intervals known as class
intervals and then their frequencies are counted for each class interval.
This Frequency Distribution is used mostly when the data set is very large.
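A minimal sketch of how grouped frequencies can be tallied. The scores and the class width of 10 are assumptions chosen for illustration; they are not taken from any table in these notes.

```python
from collections import Counter

# Hypothetical scores (illustrative only).
scores = [131, 145, 152, 138, 160, 149, 155, 141, 158, 136, 147, 153]
width = 10  # assumed class interval width

# Map each score to the lower limit of its class interval, then count per class.
counts = Counter((score // width) * width for score in scores)

# Build (class label, frequency) rows, listed here from the lowest class upward.
grouped = [(f"{low}-{low + width - 1}", counts[low]) for low in sorted(counts)]

for label, freq in grouped:
    print(f"{label:>9}  {freq}")
```

Every score falls into exactly one class, so the frequencies add up to the number of scores.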
GUIDELINES
Lists seven rules for producing a well-constructed frequency distribution.
The first three rules are essential and should not be violated.
The last four rules are optional and can be modified or ignored as circumstances warrant
Gaps between Classes
Unit of Measurement
• The smallest possible difference
between scores
• The size of the gap should always
equal one unit of measurement
Real Limits of Class Intervals
Real limits - Located at the midpoint of the gap between
2. Frequency distribution for Ungrouped data :
All distinct observations are mentioned and counted individually.
This Frequency Distribution is often used when the given dataset is small.
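For a small data set, counting each distinct observation can be sketched as follows. The observations are hypothetical.

```python
from collections import Counter

# Hypothetical small data set (illustrative only).
observations = [3, 1, 2, 3, 3, 2, 4, 1, 3, 2]

# Count how often each distinct observation occurs.
frequency = Counter(observations)

for value in sorted(frequency):
    print(value, frequency[value])
```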
OUTLIERS
Outlier - A very extreme score
A GPA of 0.06, an IQ of 170, summer wages of $62,000
Check for Accuracy
Whenever you encounter an outrageously extreme value, such as a GPA of 0.06, attempt to
verify its accuracy
For instance, was a respectable GPA of 3.06 recorded erroneously as 0.06? If the outlier
survives an accuracy check, it should be treated as a legitimate score
Might Exclude from Summaries
You might choose to segregate (but not to suppress!) an outlier from any summary of the
data
Might Enhance Understanding
Insofar as a valid outlier can be viewed as the product of special circumstances, it might
help you to understand the data
3. Relative Frequency Distribution
A frequency distribution showing the frequency of each class as a fraction of the
interval or class.
It is useful for comparing different data sets or for analysing the distribution of
each proportion by 100; that is, move the decimal point two places to the right.
For example, multiply .06 (the proportion for the class 130–139) by 100 to obtain
6 percent.
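The conversion from frequencies to proportions and percentages can be sketched as below. The class labels and frequencies are assumed values for illustration only.

```python
# Hypothetical class frequencies (illustrative only).
frequencies = {"130-139": 3, "140-149": 8, "150-159": 7, "160-169": 2}
total = sum(frequencies.values())

# Proportion = class frequency / total frequency.
proportions = {label: freq / total for label, freq in frequencies.items()}

# Percent = proportion * 100 (move the decimal point two places to the right).
percents = {label: round(p * 100, 1) for label, p in proportions.items()}

for label in frequencies:
    print(f"{label}  {proportions[label]:.2f}  {percents[label]}%")
```

The proportions always sum to 1 (and the percentages to 100), apart from rounding.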
4. Cumulative Frequency Distribution
A frequency distribution showing the total number of observations in each class and all lower-
ranked classes
Cumulative frequency is defined as the sum of all the frequencies in the previous values or intervals
up to the current one.
Constructing Cumulative Frequency Distributions
To convert a frequency distribution into a cumulative frequency distribution, add to the frequency of
each class the sum of the frequencies of all classes ranked below it. This gives the cumulative
frequency for that class.
There are two types of cumulative frequency distributions:
• Less than Type: For each interval, we sum its frequency and the frequencies of all intervals before it.
• More than Type: For each interval, we sum its frequency and the frequencies of all intervals after it.
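A short sketch of both running totals, using assumed class frequencies. Each total here includes the current class, which is one common convention; the last line reproduces the 40/53 cumulative percentage quoted in these notes.

```python
from itertools import accumulate

# Hypothetical class frequencies, listed from the lowest class to the highest.
labels = ["130-139", "140-149", "150-159", "160-169"]
frequencies = [3, 8, 7, 2]
total = sum(frequencies)

# Less-than type: each class plus all lower-ranked classes.
less_than = list(accumulate(frequencies))

# More-than type: each class plus all higher-ranked classes.
more_than = list(accumulate(reversed(frequencies)))[::-1]

# Cumulative percent = cumulative frequency / total frequency * 100.
cumulative_percent = [round(100 * c / total, 1) for c in less_than]

for row in zip(labels, frequencies, less_than, more_than, cumulative_percent):
    print(*row)

# Example from the notes: cumulative frequency 40 out of a total of 53.
print(round(100 * 40 / 53))
```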
Cumulative Percentage
• Cumulative percent = (Cumulative Frequency / Total Frequency) × 100
• For example, to obtain the cumulative percentage of 75% for the class 170–179, divide its cumulative
frequency of 40 by the total frequency of 53 for the entire distribution (40/53 ≈ 0.75).
Percentile Rank of an Observation
• Percentage of scores in the entire distribution with equal or smaller values than that score
DISTRIBUTION BY SUMMING UP THE VALUES OF CURRENT INTERVAL AND ALL THE
PREVIOUS INTERVALS.
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
When, among a set of observations, any single observation is a word, letter,
these frequencies as shown in Table 2.7 for the Facebook profile survey.
This frequency distribution reveals that Yes replies are approximately twice as
prevalent as No replies.
When, however, qualitative data have an ordinal level of measurement because
observations can be ordered from least to most, that order should be preserved
in the frequency table, as illustrated in Table 2.8, in which military ranks are
listed in descending order from general to lieutenant
Ordered Qualitative Data
It’s totally arbitrary whether Yes is listed above or below No in Table 2.7.
When, however, qualitative data have an ordinal level of measurement because observations
can be ordered from least to most, that order should be preserved in the
frequency table, as illustrated in Table 2.8, in which military ranks are listed in
descending order from general to lieutenant.
Relative and Cumulative Distributions for Qualitative Data
Frequency distributions for qualitative variables can always be converted
has an approximate percentile rank of 63 among officers since 62.5 (or 63) is
the cumulative percent for this class.
If measurement is only nominal because observations cannot be ordered, as in
INTERPRETING DISTRIBUTIONS CONSTRUCTED BY OTHERS
When inspecting a distribution for the first time, train yourself to look at the entire table, not
just the distribution.
Read the title, column headings, and any footnotes. Where do the data come from? Is a source
cited? Next, focus on the form of the frequency distribution. Is it well constructed?
For quantitative data, does the total number of classes seem to avoid either over- or under-
summarizing the data?
After these preliminaries, inspect the content of the frequency distribution. What is the
approximate range? Does it seem reasonable?
As best you can, disregard the inevitable irregularities that accompany a frequency distribution
and focus on its overall appearance or shape.
Do the frequencies arrange themselves around a single peak (high point) or several peaks? Is
the distribution fairly balanced around its peak?
When interpreting distributions, including distributions constructed by someone else, keep an
open mind. Follow the previous suggestions but also pursue any questions stimulated by your
inspection of the entire table.
GRAPHS
Graphical representation is a way of presenting data in pictorial
form.
It helps a reader to understand the large set of data very easily as it gives us
GRAPHS FOR QUANTITATIVE DATA
1. Line Graphs
A line graph is used to show how the value of a particular variable changes with
time
2. Histograms
A bar-type graph for quantitative data. The common boundaries between
values in that range. That is, it counts how many values of the data lie in a
particular range.
Features of histograms.
Equal units along the horizontal axis (the X axis, or abscissa) reflect the various
frequency.
The intersection of the two axes defines the origin at which both numerical
scales equal 0.
Numerical scales always increase from left to right along the horizontal axis and
3. Frequency Polygon
A line graph for quantitative data that also emphasizes the continuity of
continuous variables.
A frequency polygon is a graph that is constructed by joining the midpoint of
the intervals.
The height of the interval or the bin represents the frequency of the values that
4. Stem and Leaf Plot
A device for sorting quantitative data on the basis of leading and trailing
digits.
This is a type of plot in which each value is split into a “leaf”(in most cases, it is
An important characteristic of a frequency distribution is its shape.
FREQUENCY CURVE /TYPICAL SHAPES
within a dataset
Normal Curve
Any distribution that approximates the normal shape
uninterrupted gestation periods of human foetuses, scores on standardized tests,
and even the popping times of individual kernels in a batch of popcorn
Bimodal :
Any distribution that approximates the bimodal shape
coexistence of two different types of observations in the same distribution.
For instance, the distribution of the ages of residents in a neighbourhood
consisting largely of either new parents or their infants has a bimodal shape.
Positively Skewed Distribution:
A distribution that includes a few extreme observations in the positive direction (to
the right of the majority of observations).
Negatively Skewed Distribution :
A distribution that includes a few extreme observations in the negative direction (to
the left of the majority of observations).
GRAPH FOR QUALITATIVE (NOMINAL) DATA
Bar Graph
A bar-type graph for qualitative data.
Gaps between adjacent bars emphasize the discontinuous nature of the data.
A bar graph also can be used with quantitative data to emphasize the
MISLEADING GRAPHS
Graphs can be constructed in an unscrupulous manner to support a particular point
of view.
Indeed, this type of statistical fraud gives credibility to popular sayings, including
Gaps between adjacent bars emphasize the discontinuous nature of the data
AVERAGES
Measures of Central Tendency
Numbers or words that attempt to describe, most generally, the middle or typical
value for a distribution
1. MODE - The value of the most frequent score
More Than One Mode – Bimodal
Describes any distribution with two obvious peaks
multimodal
2. MEDIAN - The middle value when observations are ordered from least to
most
3. MEAN
The mean is found by adding all scores and then dividing by the number of scores.
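The three averages can be checked quickly with Python's standard library. The scores below are hypothetical.

```python
from statistics import mean, median, multimode

# Hypothetical scores, already ordered from least to most.
scores = [2, 3, 3, 4, 5, 7, 11]

modes = multimode(scores)   # most frequent value(s); a list, since there can be more than one
middle = median(scores)     # middle value of the ordered scores
average = mean(scores)      # sum of all scores divided by the number of scores

print(modes, middle, average)
```

Here the single large score of 11 pulls the mean (5) above the median (4), while the mode (3) simply reports the most frequent score.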
Sample or Population
Population - A complete set of scores.
Sample - A subset of scores
Sample Mean (X̄) - The balance point for a sample, found by
dividing the sum of the values of all scores in the sample by the
number of scores in the sample.
DESCRIBING VARIABILITY
RANGE
Shortcomings of Range :
1. First, since its value depends on only two scores—the largest and the smallest
2. Value of the range tends to increase with increases in the total number of
scores.
3. Size of the range tends to vary with the size of the group
VARIANCE
Variance - The mean of all squared deviation scores.
STANDARD DEVIATION
A Small Minority of Scores Deviate More Than Two Standard Deviations
For most frequency distributions, a small minority (often as small as 5 percent) of
all scores deviate more than two standard deviations on either side of the mean.
Standard Deviation: A Measure of Distance
Mean - measure of position
Standard deviation - measure of distance (on either side of the mean of the
distribution).
Value of Standard Deviation Cannot Be Negative
SUM OF SQUARES (SS)
Sum of Squares (SS) - The sum of squared deviation scores.
Sum of Squares Formulas for Sample
Standard Deviation for Population σ
Population Standard Deviation (σ) - A rough measure of the average amount by
which scores in the population deviate on either side of their population mean.
Standard Deviation for Sample (s)
Sample Standard Deviation (s) - A rough measure of the average amount by which
scores in the sample deviate on either side of their sample mean.
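The sketch below ties together the sum of squares, the variance, and the two standard deviations. The scores are hypothetical. It follows the usual convention of dividing SS by N for a population and by n − 1 (the degrees of freedom) for a sample, and cross-checks the results against the standard library.

```python
from math import sqrt
from statistics import pstdev, stdev

# Hypothetical scores (illustrative only).
scores = [7, 3, 1, 0, 4]
n = len(scores)
mean_score = sum(scores) / n

# Sum of squares (SS): the sum of squared deviations from the mean.
ss = sum((x - mean_score) ** 2 for x in scores)

# Population formulas: divide SS by N.
pop_variance = ss / n
pop_sd = sqrt(pop_variance)

# Sample formulas: divide SS by n - 1.
sample_variance = ss / (n - 1)
sample_sd = sqrt(sample_variance)

print(ss, pop_variance, round(pop_sd, 3), sample_variance, round(sample_sd, 3))
```

Because the square root of a non-negative number is taken, neither standard deviation can be negative.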
DEGREES OF FREEDOM (DF)
Degrees of Freedom (df ) : The number of values free to vary, given one or more
mathematical restrictions.
INTERQUARTILE RANGE (IQR)
Interquartile Range (IQR) - The range for the middle 50 percent of the scores.
The IQR equals the distance between the third quartile (or 75th percentile) and
the first quartile (or 25th percentile), that is, after the highest quarter (or top 25
percent) and the lowest quarter (or bottom 25 percent) have been trimmed from
the original set of scores.
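A small sketch of the IQR and of its resistance to extreme scores. The data are hypothetical, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`; other quartile conventions can give slightly different values.

```python
from statistics import quantiles

def iqr(scores):
    """Distance between the third and first quartiles."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return q3 - q1

# Hypothetical scores, and the same scores with one extreme value swapped in.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

print(iqr(scores), max(scores) - min(scores))
print(iqr(with_outlier), max(with_outlier) - min(with_outlier))
```

The range jumps from 8 to 89 when the extreme score is introduced, while the IQR stays the same.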
Not Sensitive to Extreme Scores
MEASURES OF VARIABILITY FOR QUALITATIVE AND
RANKED DATA
Measures of variability are virtually non-existent for qualitative or nominal data.
It is probably adequate to note merely whether scores are evenly divided among the
various classes (maximum variability), unevenly divided among the various classes
(intermediate variability), or concentrated mostly in one class (minimum variability).
For example, if the ethnic composition of the residents of a city is about evenly divided
among several groups, the variability with respect to ethnic groups is maximum; there is
considerable heterogeneity.
At the other extreme, if almost all the residents are concentrated in a single ethnic group,
ranked), then it’s appropriate to describe variability by identifying extreme scores (or
ranks).
For instance, the active membership of an officers’ club might include no one with a rank
below first lieutenant or above brigadier general
NORMAL DISTRIBUTION & Z-SCORE
Normal Curve : A theoretical curve noted for its symmetrical bell-shaped form.
Finding a Proportion for the Shaded Area
To find this new proportion, we cannot rely on the vertical scale in Figure 5.2, because it
describes as proportions the areas in the rectangular bars of histograms, not the areas in
the various curved sectors of the normal curve.
Properties of the Normal Curve
Obtained from a mathematical equation, the normal curve is a theoretical curve defined
half.
Being bell shaped, the normal curve peaks above a point midway along the horizontal
spread and then tapers off gradually in either direction from the peak (without actually
touching the horizontal axis, since, in theory, the tails of a normal curve extend
infinitely far).
The values of the mean, median (or 50th percentile), and mode, located at a point
midway along the horizontal spread, are the same for the normal curve.
Different Normal Curves
Z SCORES
z Score : A unit-free, standardized score that indicates how many standard
deviations a score is above or below the mean of its distribution.
2. a number indicating the size of its deviation from the mean in standard
deviation units.
A z score of 2.00 always signifies that the original score is exactly two standard
deviations above its mean.
A z score of 0 signifies that the original score coincides with the mean.
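The definition translates directly into a one-line function. The mean of 69 and standard deviation of 3 are the height values used in these notes; the score of 75 is an extra, hypothetical value added to show a z score of 2.00.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_66 = z_score(66, mean=69, sd=3)   # one standard deviation below the mean
z_69 = z_score(69, mean=69, sd=3)   # coincides with the mean
z_75 = z_score(75, mean=69, sd=3)   # hypothetical score, two standard deviations above

print(z_66, z_69, z_75)
```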
Replace X with 66 (the maximum permissible height)
μ with 69 (the mean height), and
σ with 3 (the standard deviation of heights) and solve for z as follows:
z = (X − μ) / σ = (66 − 69) / 3 = −1.00
STANDARD NORMAL CURVE
Standard Normal Curve : The tabled normal curve for z scores, with a mean of 0 and a
standard deviation of 1.
Standard normal curve always has a mean of 0 and a standard
deviation of 1.
To verify that the mean of a standard normal distribution equals 0, replace X in the z
score formula with μ, the mean of any (nonstandard) normal distribution,
and then solve for z:
z = (μ − μ) / σ = 0
SOLVING NORMAL CURVE PROBLEMS
There are two main types of normal curve problems.
In the first type of problem, we use a known score (or scores) to find an
unknown proportion. For instance, we use the known score of 66 inches to find
the unknown proportion of eligible FBI applicants.
In the second type of problem, the procedure is reversed. Now we use a known
FINDING PROPORTIONS
Example: Finding Proportions for One Score
Suppose we wish to find the proportion of all FBI applicants who are shorter than exactly 66 inches, given that
the distribution of heights approximates a normal curve with a mean of 69 inches and a
standard deviation of 3 inches.
1. Sketch a normal curve and shade in the target area, as in the left part of Figure 5.6. Being
less than the mean of 69, 66 is located to the left of the mean. Furthermore, since the unknown
proportion represents those applicants who are shorter than 66 inches, the shaded target sector
is located to the left of 66.
2. Plan your solution according to the normal table. Decide precisely how you will find the
value of the target area. In the present case, the answer will be obtained from column C′ of the
standard normal table, since the target area coincides with the type of area identified with
column C′, that is, the area in the lower tail beyond a negative z.
3. Convert X to z. Express 66 as a z score: z = (66 − 69) / 3 = −1.00
4. Find the target area. Refer to the standard normal table, using the bottom
legend, as the z score is negative.
The arrows in Table 5.1 show how to read the table. Look up column A’ to 1.00
(representing a z score of –1.00), and note the corresponding proportion of .1587
in column C’: This is the answer, as suggested in the right part of Figure 5.6.
It can be concluded that only .1587 (or .16) of all of the FBI applicants will be
shorter than 66 inches.
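The table lookup can be cross-checked numerically. This sketch uses `statistics.NormalDist` from Python's standard library in place of the printed normal table, with the mean of 69 and standard deviation of 3 given in the example.

```python
from statistics import NormalDist

# Heights of applicants: normal curve with mean 69 and standard deviation 3.
heights = NormalDist(mu=69, sigma=3)

# Convert X = 66 to a z score.
z = (66 - heights.mean) / heights.stdev

# Area in the lower tail beyond z, read from the standard normal curve.
proportion_shorter = NormalDist().cdf(z)

print(z, round(proportion_shorter, 4))
```

The result rounds to .1587, matching the proportion read from the table.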
Finding Proportions between Two Scores
the gestation periods for human fetuses approximate a normal curve with a mean of 270 days (9
1. Sketch a normal curve and shade in the target area, as in the top panel of Figure 5.7.
Satisfy yourself that, in fact, the shaded area represents just those gestation periods between
245 and 255 days.
2. Plan your solution according to the normal table.
This type of problem requires more effort to solve because the value of the target area cannot be
read directly from Table A. As suggested in the bottom two panels of Figure 5.7, the basic idea is
to identify the target area with the difference between two overlapping areas whose values can be
read from column C′ of Table A. The larger area (less than 255 days) contains two sectors: the
target area (between 245 and 255 days) and a remainder (less than 245 days). The smaller area
contains only the remainder (less than 245 days). Subtracting the smaller area (less than 245 days)
from the larger area (less than 255 days), therefore, eliminates the common remainder (less than
245 days), leaving only the target area (between 245 and 255 days).
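The subtraction of two overlapping lower-tail areas can be sketched as below. Because the standard deviation of the gestation periods is not shown in these notes, the sketch is applied instead to the height distribution from the earlier example (mean 69, standard deviation 3), with a hypothetical interval of 66 to 72 inches.

```python
from statistics import NormalDist

def proportion_between(low, high, mean, sd):
    """Area below `high` minus area below `low` on a normal curve."""
    dist = NormalDist(mu=mean, sigma=sd)
    larger_area = dist.cdf(high)    # target area plus the remainder
    smaller_area = dist.cdf(low)    # the remainder only
    return larger_area - smaller_area

# Hypothetical interval on the height distribution (z scores of -1 and +1).
p = proportion_between(66, 72, mean=69, sd=3)
print(round(p, 4))
```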
SCATTERPLOTS
Scatterplot - A graph containing a cluster of dots that represents all pairs of
scores.
• A dot cluster that has a slope from the lower left to the upper right, as in panel
A of Figure 6.2, reflects a positive relationship.
• A dot cluster that has a slope from the upper left to the lower right, as in panel
B of Figure 6.2, reflects a negative relationship.
• A dot cluster that lacks any apparent slope, as in panel C of Figure 6.2,
reflects little or no relationship.
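The direction of a dot cluster can also be summarized numerically with the correlation coefficient r, which these notes refer to later. The sketch below computes r from sums of squares and the sum of cross-products; all three data sets are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)

# Hypothetical pairs of scores.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]    # slopes from lower left to upper right
falling = [6, 4, 5, 4, 2]   # slopes from upper left to lower right
flat = [3, 5, 2, 5, 3]      # no apparent slope

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
r_flat = pearson_r(xs, flat)
print(round(r_rising, 2), round(r_falling, 2), round(r_flat, 2))
```

A positive r goes with a positive relationship, a negative r with a negative relationship, and an r near 0 with little or no linear relationship.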
Strong or Weak Relationship
Linear Relationship - A relationship that can be described best with a straight
line.
Curvilinear Relationship - A relationship that can be described best with a
curved line.
REGRESSION
REGRESSION LINE
The regression line is a straight line rather than a curved line because of the
of 15.20.
Placement of Line
If all five dots had defined a single straight line, placement of the regression line
LEAST SQUARES REGRESSION LINE
The placement of the regression line minimizes not the total predictive error but
the total squared predictive error, that is, the total for all squared predictive
errors. When located in this fashion, the regression line is often referred to as
the least squares regression line.
Ȳ and X̄ refer to the sample means for all Y and X scores, respectively,
and b is defined by the preceding expression.
Least Squares Regression Equation - The equation that minimizes the total of
all squared prediction errors for known Y scores in the original correlation
analysis.
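A minimal sketch of fitting a least squares line Y′ = bX + a. The five pairs below are an assumption: the data table is not reproduced in these notes, so the values were chosen to agree with the figures that are quoted (a mean Y of 12 and a prediction of 7.20 for a Y score of 6). The slope is computed as SPxy / SSx, which is algebraically the same as r·√(SSy / SSx), and the intercept as a = Ȳ − bX̄.

```python
# Assumed (X, Y) pairs for five friends; not taken from a table in these notes.
x = [13, 9, 7, 5, 1]
y = [14, 18, 12, 10, 6]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum of squares for X and sum of cross-products for X and Y.
ss_x = sum((xi - mean_x) ** 2 for xi in x)
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sp_xy / ss_x           # slope
a = mean_y - b * mean_x    # intercept, from the two sample means

def predict(xi):
    """Predicted Y' for a given X from the least squares equation."""
    return b * xi + a

print(round(b, 2), round(a, 2), round(predict(1), 2))
```

With these assumed pairs the equation is Y′ = .80X + 6.40, and the regression line passes through the point (X̄, Ȳ).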
STANDARD ERROR OF ESTIMATE, sy|x
As expected, when predictions are based on a nonexistent relationship, the sum of
squares for predictive errors equals SSy, the sum of squares of Y scores about Ȳ,
and there is no reduction in predictive error.
INTERPRETATION OF r2
The squared correlation coefficient, r2, provides us with not only a key
interpretation of the correlation coefficient but also a measure of predictive
accuracy that supplements the standard error of estimate, sy|x.
Goal is to show the relationship between r2 and predictive accuracy, we will
scores (cards sent), for each of the five friends. Lacking information about the
relationship between X and Y scores, we could not construct a regression
equation and use it to generate a customized prediction, Y′, for each friend
mount a primitive predictive effort by always predicting the mean, Y, for each
2. Predictive Errors
Panel A of Figure 7.5 shows the predictive errors for all five friends when the mean
for all five friends, Ȳ, of 12 (shown as the mean line) is always
used to predict each of their five Y scores.
predictive errors for all five friends when a series of different Y′ values,
obtained from the least squares equation (shown as the least squares line), is
used to predict each of their five Y scores
panel A of Figure 7.5 shows the error for John when the mean for all five
to predict the same Y score of 6. This Y’ value of 7.20 is obtained from the least
squares equation
Positive and negative errors indicate that Y scores are either above or below
their corresponding predicted scores.
Overall, as expected, errors are smaller when customized predictions of Y′ from
the least squares equation can be used (because X scores are known) than when
only the repetitive prediction of Y can be used (because X scores are ignored.)
As with most statistical phenomena, there are exceptions: The predictive error
for Doris is slightly larger when the least squares equation is used.
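The comparison of the two kinds of predictive error can be sketched as follows. The five (X, Y) pairs are assumed values, chosen to agree with the mean of 12 and the prediction of 7.20 quoted above; which pair belongs to which friend is likewise an assumption. The first pair reproduces the pattern described for Doris: a slightly larger error under the least squares equation.

```python
# Assumed (X, Y) pairs for five friends; not taken from a table in these notes.
x = [13, 9, 7, 5, 1]
y = [14, 18, 12, 10, 6]
n = len(y)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# Errors when the mean is always predicted (X scores ignored).
errors_mean = [yi - mean_y for yi in y]
# Errors when customized predictions Y' come from the least squares equation.
errors_ls = [yi - (b * xi + a) for xi, yi in zip(x, y)]

ss_y = sum(e ** 2 for e in errors_mean)         # total variability, SSy
ss_y_given_x = sum(e ** 2 for e in errors_ls)   # residual variability, SSy|x

# Proportion of total variability that is predicted by the least squares equation.
r_squared = (ss_y - ss_y_given_x) / ss_y

print(ss_y, round(ss_y_given_x, 1), round(r_squared, 2))
```

For these assumed scores, the squared errors drop from 80 to 28.8, so r2 is .64.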
Error Variability (Sum of Squares)
Proportion of Predicted Variability
To obtain an SS measure of the actual gain in accuracy due to the least squares
predictions, subtract the residual variability from the total variability, that is,
subtract SSy|x from SSy , to obtain
MULTIPLE REGRESSION EQUATIONS
Multiple Regression Equation : A least squares equation that contains more
than one predictor or X variable.
REGRESSION TOWARD THE MEAN
Regression Toward the Mean : A tendency for scores, particularly extreme
scores, to shrink toward the mean.
For example, because of regression toward the mean, we would expect that
students who made the top five scores on the first statistics exam would not
make the top five scores on the second statistics exam. Although all five
students might score above the mean on the second exam, some of their scores
would regress back toward the mean.
Most likely, the top five scores on the first exam reflect two components.
One relatively permanent component reflects the fact that these students are
superior because of good study habits, a strong aptitude for quantitative reasoning,
and so forth.
The other relatively transitory component reflects the fact that, on the day of the
exam, at least some of these students were very lucky because all sorts of little
chance factors, such as restful sleep, a pleasant commute to campus, etc., worked
in their favor.
On the second test, even though the scores of these five students
continue to reflect an above-average permanent component, some of
their scores will suffer because of less good luck or even bad luck. The
net effect is that the scores of at least some of the original five top
students will drop below the top five scores—that is, regress back
toward the mean—on the second exam
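The two-component explanation can be illustrated with a small simulation. Everything in it is an assumption made for illustration: 1000 students, a permanent component drawn from a normal curve with mean 70 and standard deviation 10, and an independent chance component with standard deviation 10 on each exam.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n_students = 1000

# Relatively permanent component (study habits, aptitude).
skill = [rng.gauss(70, 10) for _ in range(n_students)]

# Each exam score = permanent component + transitory chance component.
exam1 = [s + rng.gauss(0, 10) for s in skill]
exam2 = [s + rng.gauss(0, 10) for s in skill]

# The five students with the top scores on the first exam.
top5 = sorted(range(n_students), key=lambda i: exam1[i], reverse=True)[:5]

top5_exam1 = sum(exam1[i] for i in top5) / 5
top5_exam2 = sum(exam2[i] for i in top5) / 5
overall_mean = sum(exam2) / n_students

print(round(top5_exam1, 1), round(top5_exam2, 1), round(overall_mean, 1))
```

In runs like this, the top five students usually remain above the overall mean on the second exam, but their average drops because the good luck of the first exam is not repeated.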
There is good news for those students who made the five lowest scores
on the first exam. Although all five students might score below the
mean on the second exam, some of their scores probably will regress
up toward the mean. On the second exam, some of them will not be as
unlucky. The net effect is that the scores of at least some of the
original five lowest scoring students will move above the bottom five
Regression Fallacy : Occurs whenever regression toward the mean is
interpreted as a real, rather than a chance, effect.
In an Israeli Air Force study of pilot training, some trainees were praised after very
good landings, while others were reprimanded after very bad landings.
On their next landings, praised trainees did more poorly and reprimanded
trainees did better. It was concluded, therefore, that praise hinders but a
reprimand helps performance! A valid conclusion considers regression toward the
mean.
It’s reasonable to assume that, in addition to skill, chance plays a role in landings.
Some trainees who made very good landings were lucky, while some who made
very bad landings were unlucky.
Therefore, there would be a tendency, attributable to chance, that good landings
would be followed by less good landings and poor landings would be followed
by less poor landings, even if trainees had not been praised after very good