FDSA Unit - 2

Unit II focuses on descriptive analytics, covering topics such as frequency distributions, outliers, measures of central tendency, and types of data. It explains the importance of statistics in various fields and differentiates between descriptive and inferential statistics. The unit also details types of data, levels of measurement, and methods for constructing frequency distributions for both qualitative and quantitative data.


UNIT- II

UNIT II DESCRIPTIVE ANALYTICS

Frequency distributions – Outliers –interpreting distributions – graphs – averages -


describing variability – interquartile range – variability for qualitative and ranked data
- Normal distributions – z scores –correlation – scatter plots – regression – regression
line – least squares regression line – standard error of estimate – interpretation of r2 –
multiple regression equations – regression toward the mean.

UNIT- II

1.WHY STUDY STATISTICS?

2.WHAT IS STATISTICS?

3. TYPES OF DATA

4.LEVELS OF MEASUREMENT

5.TYPES OF VARIABLES

UNIT- II

1.WHY STUDY STATISTICS?


Statistics is used for data mining, speech recognition, vision and image analysis, data
compression, artificial intelligence, and network and traffic modeling.

2.WHAT IS STATISTICS?
 Branch of mathematics that is responsible for collecting, analyzing, interpreting, and presenting
numerical data
 statistics serves as the backbone, providing the essential tools and techniques for extracting
meaningful insights from data.

Types of Statistics
1. Descriptive Statistics
• Descriptive statistics are tools that help us simplify and organize large chunks of data, making
vast amounts of information easier to understand.
Use cases of Descriptive Statistics
 Measures of Central Tendency

• Mean: The average of all data points.


• Mode: The most frequently occurring value in a dataset.
• Median: The middle value that separates the higher half from the lower half of the data.
 Graphical Representation

• Histograms: Bar graphs representing the frequency distribution of a dataset.


• Pie Charts: Circular charts divided into sectors representing relative frequencies.
• Box Plots: Graphical depiction of data through their quartiles.
 Measures of Dispersion

• Range: The difference between the maximum and minimum values.


• Variance: The measure of how data points differ from the mean.
• Standard Deviation: The square root of the variance, representing the average distance from the mean.
2. Inferential Statistics
• Inferential statistics are techniques that allow us to make generalizations and predictions
about a population based on a sample of data.
• Any group of data that includes all the data you are interested in is known as
population.

Use cases of Inferential Statistics


 Estimation

• Point Estimation: Provides a single value estimate of a population parameter (e.g., sample
mean as an estimate of population mean).
• Interval Estimation: Provides a range of values within which the population parameter is
expected to lie (e.g., confidence intervals).
6
Hypothesis Testing
• Null Hypothesis (H0): A statement of no effect or no difference, which researchers aim to
test against.
• Alternative Hypothesis (H1): A statement indicating the presence of an effect or
difference.
• p-value: The probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true.
• Significance Level (α): The threshold for rejecting the null hypothesis, commonly set at
0.05.
Regression Analysis
• Simple Linear Regression: Analysing the relationship between two continuous variables.
• Multiple Regression: Examining the relationship between one dependent variable and
multiple independent variables

7
8
Three Types of Data
Data
A collection of actual observations or scores in a survey or an experiment.
A collection of numbers, characters, images, and other items that can be arranged in some manner to form meaningful information.
In statistics, data are a collection of numbers that are first studied, then analysed and presented in a way that yields meaningful insight.
For example, 12, 19, 17, and 15 are data.

9
1. Qualitative Data
A set of observations where any single observation is a word, letter, or numerical code that represents a class or category.
Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1) that represent a class or category.
Qualitative data are also called categorical data because they sort observations into categories.
Qualitative data include attributes such as a person's gender or family name in a sample or population.

Types:
• Nominal Data
• Ordinal Data
1. Nominal Data :
Nominal data is a type of data that consists of categories or names that cannot be ordered or ranked.
Examples of nominal data:
gender (Male or Female),
race (White, Black, Asian),
religion (Hinduism, Christianity, Islam, Judaism), and
blood type (A, B, AB, O).

2. Ordinal Data
Ordinal data is a type of data that consists of categories that can be ordered or ranked.
Examples of ordinal data include education level (Elementary, Middle, High School, College), job position (Manager, Supervisor, Employee), etc.


2. Quantitative Data (Numerical Data)
 A set of observations where any single observation is a number that represents an amount or a

count.
 Quantitative Data is the type of the data that represents the numerical value of the data.

 They are also called the Numerical Data.

 This data type is used to represent the height, weight, length and other things of the data

Quantitative data is further classified into two categories that are,


• Discrete Data
• Continuous Data
1. Discrete Data
Discrete data is a type of data in statistics that takes only distinct, separate values.
These values can be counted as whole numbers.
Examples of discrete data are:
• Number of students in a class
• Marks of the students in a class test
• Number of members in a family, etc.


2. Continuous Data
• Continuous data is quantitative data that can take any value within a continuous range.
• A variable in the data set can have any value between the limits of the range.
• Examples of the continuous data types are,
• Temperature Range
• Salary range of Workers in a Factory, etc.
3. Ranked Data
A set of observations where any single observation is a number that indicates relative standing

12
13
Levels of Measurement
Level of Measurement
 Specifies the extent to which a number (or word or letter) actually represents

some attribute and, therefore, has implications for the appropriateness of various
arithmetic operations and statistical procedures
Nominal Measurement
 Words, letters, or numerical codes of qualitative data that reflect differences in

kind based on classification.


Ordinal Measurement
 Relative standing of ranked data that reflects differences in degree based on order

Interval/Ratio Measurement
Amounts or counts of quantitative data reflect differences in degree based on equal intervals and a true zero.
15
Types Of Variables
1. Variable - A characteristic or property that can take on different values.
2. Constant - A characteristic or property that can take on only one value

Discrete Variable : A variable that consists of isolated numbers separated by


gaps.
Continuous Variable : A variable that consists of numbers whose values, at least
in theory, have no restrictions
Approximate Numbers : Numbers that are rounded off, as is always the case
with values for continuous variables

16
Experiment - A study in which the investigator decides who receives the special
treatment
Independent Variable - The treatment manipulated by the investigator in an
experiment
Dependent Variable - A variable that is believed to have been influenced by the
independent variable

Observational Study - A study that focuses on detecting relationships between


variables not manipulated by the investigator.
Confounding variable - An uncontrolled variable that compromises the interpretation of a study.
FREQUENCY DISTRIBUTION -TABLE

Frequency Distribution for quantitative data


Frequency:
 How often something happens

 A collection of observations produced by sorting observations into classes and

showing their frequency (f) of occurrence in each class


 A frequency distribution table is a way to organize and present data in a tabular

form, which helps us summarize a large dataset into a concise table.
A frequency distribution table has two columns: one holds the data, either as class intervals or as individual values, and the other shows the frequency of each interval or value.
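As a rough sketch of this construction (the scores and the class-interval width of 10 below are assumed for illustration, not taken from the slides), the table can be built in Python:

# Sketch: building a grouped frequency distribution table from assumed scores.
scores = [91, 85, 84, 79, 78, 78, 86, 69, 93, 87, 76, 65, 84, 73, 70, 81, 88, 68, 74, 66]

low, high, width = 60, 100, 10                       # assumed class-interval width of 10
intervals = [(start, start + width - 1) for start in range(low, high, width)]

freq = {iv: 0 for iv in intervals}                    # frequency (f) for each class
for score in scores:
    for lo, hi in intervals:
        if lo <= score <= hi:
            freq[(lo, hi)] += 1
            break

for (lo, hi), f in sorted(freq.items(), reverse=True):
    print(f"{lo}-{hi}: {f}")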

18
19
Types of Frequency Distribution
1. Grouped Frequency Distribution

2. Ungrouped Frequency Distribution

3. Relative Frequency Distribution

4. Cumulative Frequency Distribution

20
1. Frequency distribution for grouped data:
Observations are divided into different intervals, known as class intervals, and their frequencies are counted for each class interval.
This frequency distribution is used mostly when the data set is very large.

21
GUIDELINES
 Lists seven rules for producing a well-constructed frequency distribution.

 The first three rules are essential and should not be violated.

 The last four rules are optional and can be modified or ignored as circumstances warrant

22
Gaps between Classes
Unit of Measurement
• The smallest possible difference
between scores
• The size of the gap should always
equal one unit of measurement

23
24
Real Limits of Class Intervals
 Real limits - Located at the midpoint of the gap between

adjacent tabled boundaries


 one-half of one unit of measurement below the lower tabled

boundary and one-half of one unit of measurement above the


upper tabled boundary
 For example:
 Find Real limits for 140–149
139.5 (140 minus one-half of the unit of measurement of 1)
140-0.5 =139.5
149.5 (149 plus one-half of the unit of measurement of 1)
149 +0.5 =149.5
The actual width of the class interval would be 10 (149.5 − 139.5 = 10).
25
Constructing Frequency Distribution

26
27
28
2. Frequency distribution for Ungrouped data :
 All distinct observations are mentioned and counted individually.

 This Frequency Distribution is often used when the given dataset is small.

29
OUTLIERS
Outlier - A very extreme score
 A GPA of 0.06, an IQ of 170, summer wages of $62,000
Check for Accuracy
 Whenever you encounter an outrageously extreme value, such as a GPA of 0.06, attempt to
verify its accuracy
 For instance, was a respectable GPA of 3.06 recorded erroneously as 0.06? If the outlier
survives an accuracy check, it should be treated as a legitimate score
Might Exclude from Summaries
 You might choose to segregate (but not to suppress!) an outlier from any summary of the
data
Might Enhance Understanding
 Insofar as a valid outlier can be viewed as the product of special circumstances, it might
help you to understand the data
30
31
3. Relative Frequency Distribution
 A frequency distribution showing the frequency of each class as a fraction of the

total frequency for the entire distribution.


 This distribution displays the proportion or percentage of observations in each

interval or class.
 It is useful for comparing different data sets or for analysing the distribution of

data within a set.

Constructing Relative Frequency Distributions
To convert a frequency distribution into a relative frequency distribution, divide the frequency for each class by the total frequency for the entire distribution.
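A minimal sketch of this conversion, with illustrative class frequencies (assumed, not the slide's actual counts):

# Sketch: converting class frequencies into relative frequencies (proportions).
freq = {"170-179": 7, "160-169": 14, "150-159": 16, "140-149": 12, "130-139": 4}
total = sum(freq.values())                       # total frequency for the entire distribution
relative = {cls: f / total for cls, f in freq.items()}
print(relative)                                  # each value is f / total, between 0 and 1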
Percentages or Proportions
 Some people prefer to deal with percentages rather than proportions because

percentages usually lack decimal points.


 A proportion always varies between 0 and 1

 A percentage always varies between 0 percent and 100 percent.

 To convert the relative frequencies from proportions to percentages, multiply

each proportion by 100; that is, move the decimal point two places to the right.
For example, multiply .06 (the proportion for the class 130–139) by 100 to obtain
6 percent.

33
34
35
36
4. Cumulative Frequency Distribution
 A frequency distribution showing the total number of observations in each class and all lower-
ranked classes
 Cumulative frequency is defined as the sum of all the frequencies in the previous values or intervals
up to the current one.
Constructing Cumulative Frequency Distributions
 To convert a frequency distribution into a cumulative frequency distribution, add to the frequency of
each class the sum of the frequencies of all classes ranked below it. This gives the cumulative
frequency for that class.
There are two types of cumulative frequency distributions:
• Less than Type: We sum all the frequencies before the current interval.
• More than Type: We sum all the frequencies after the current interval.
Cumulative Percentage
• Cumulative Percentage = Cumulative Frequency / Total Frequency; for example, 40/53 = 75%.
• To obtain this cumulative percentage (75%), the cumulative frequency of 40 for the class 170–179 is divided by the total frequency of 53 for the entire distribution.
Percentile Rank of an Observation
• The percentage of scores in the entire distribution with values equal to or smaller than that score.
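A small sketch of the "less than" cumulative frequencies and cumulative percentages (the class counts are assumed for illustration, not the slide's actual survey data):

# Sketch: cumulative ("less than" type) frequencies and cumulative percentages.
classes = ["130-139", "140-149", "150-159", "160-169", "170-179"]   # lowest to highest
freq    = [4, 12, 16, 14, 7]                                        # assumed counts
total   = sum(freq)

running, cum = 0, []
for f in freq:
    running += f                      # add the frequencies of all lower-ranked classes
    cum.append(running)

for cls, f, c in zip(classes, freq, cum):
    # cumulative % of the class approximates the percentile rank of its upper boundary
    print(cls, f, c, round(100 * c / total, 1))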
38
The cumulative frequency distribution is obtained by summing the values of the current interval and all the previous intervals.

39
40
41
42
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
 When, among a set of observations, any single observation is a word, letter,

or numerical code, the data are qualitative


 Determine the frequency with which observations occupy each class, and report

these frequencies as shown in Table 2.7 for the Facebook profile survey.
 This frequency distribution reveals that Yes replies are approximately twice as

prevalent as No replies.
 When, however, qualitative data have an ordinal level of measurement because

observations can be ordered from least to most, that order should be preserved
in the frequency table, as illustrated in Table 2.8, in which military ranks are
listed in descending order from general to lieutenant

43
Ordered Qualitative Data
It is entirely arbitrary whether Yes is listed above or below No in Table 2.7.
When qualitative data have an ordinal level of measurement because observations can be ordered from least to most, that order should be preserved in the frequency table, as illustrated in Table 2.8, in which military ranks are listed in descending order from general to lieutenant.

44
Relative and Cumulative Distributions for Qualitative Data
 Frequency distributions for qualitative variables can always be converted

into relative frequency distributions, as illustrated in Table 2.8.


 Furthermore, if measurement is ordinal because observations can be

ordered from least to most, cumulative frequencies (and cumulative


percentages) can be used.
 As illustrated in Table 2.8, it’s appropriate to claim, for example, that a captain

has an approximate percentile rank of 63 among officers since 62.5 (or 63) is
the cumulative percent for this class.
 If measurement is only nominal because observations cannot be ordered, as in

Table 2.7, a cumulative frequency distribution is meaningless

45
46
47
INTERPRETING DISTRIBUTIONS CONSTRUCTED BY OTHERS
 When inspecting a distribution for the first time, train yourself to look at the entire table, not
just the distribution.
 Read the title, column headings, and any footnotes. Where do the data come from? Is a source
cited? Next, focus on the form of the frequency distribution. Is it well constructed?
 For quantitative data, does the total number of classes seem to avoid either over- or under-
summarizing the data?
 After these preliminaries, inspect the content of the frequency distribution. What is the
approximate range? Does it seem reasonable?
 As best you can, disregard the inevitable irregularities that accompany a frequency distribution
and focus on its overall appearance or shape.
 Do the frequencies arrange themselves around a single peak (high point) or several peaks? Is
the distribution fairly balanced around its peak?
 When interpreting distributions, including distributions constructed by someone else, keep an
open mind. Follow the previous suggestions but also pursue any questions stimulated by your inspection of the entire table.
GRAPHS
Graphical representation is a way of presenting data in pictorial form.
It helps a reader understand a large set of data easily by showing data patterns in visual form.

 There are two ways of representing data


• Tables
• Pictorial Representation through graphs.

49
GRAPHS FOR QUANTITATIVE DATA

1. Line Graphs
 A line graph is used to show how the value of a particular variable changes with

time

50
2. Histograms
 A bar-type graph for quantitative data. The common boundaries between

adjacent bars emphasize the continuity of the data, as with continuous


variables.
 The data is organized into intervals and the bars represent the frequency of the

values in that range. That is, it counts how many values of the data lie in a
particular range.

51
Features of histograms.
 Equal units along the horizontal axis (the X axis, or abscissa) reflect the various

class intervals of the frequency distribution.


 Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in

frequency.
 The intersection of the two axes defines the origin at which both numerical

scales equal 0.
 Numerical scales always increase from left to right along the horizontal axis and

from bottom to top along the vertical axis.


 The body of the histogram consists of a series of bars whose heights reflect the

frequencies for the various classes.

52
3. Frequency Polygon
 A line graph for quantitative data that also emphasizes the continuity of

continuous variables.
 A frequency polygon is a graph that is constructed by joining the midpoint of

the intervals.
 The height of the interval or the bin represents the frequency of the values that

lie in that interval.


 Frequency polygons may be constructed directly from frequency distributions

53
54
4. Stem and Leaf Plot
 A device for sorting quantitative data on the basis of leading and trailing

digits.
 This is a type of plot in which each value is split into a “leaf”(in most cases, it is

the last digit) and “stem”(the other remaining digits).


 For example: the number 42 is split into leaf (2) and stem (4).

55
56
57
FREQUENCY CURVE / TYPICAL SHAPES
An important characteristic of a frequency distribution is its shape.

 A frequency distribution curve, also known as a frequency curve, is a graphical

representation of a data set’s frequency distribution.


 It is used to visualize the distribution and frequency of values or observations

within a dataset

59
Normal Curve
 Any distribution that approximates the normal shape
 uninterrupted gestation periods of human foetuses, scores on standardized tests,
and even the popping times of individual kernels in a batch of popcorn
Bimodal :
 Any distribution that approximates the bimodal shape
 coexistence of two different types of observations in the same distribution.
 For instance, the distribution of the ages of residents in a neighbourhood
consisting largely of either new parents or their infants has a bimodal shape.
Positively Skewed Distribution:
A distribution that includes a few extreme observations in the positive direction (to
the right of the majority of observations).
Negatively Skewed Distribution:
A distribution that includes a few extreme observations in the negative direction (to the left of the majority of observations).
GRAPH FOR QUALITATIVE (NOMINAL) DATA
Bar Graph
 A bar-type graph for qualitative data.

 Gaps between adjacent bars emphasize the discontinuous nature of the data.

 A bar graph also can be used with quantitative data to emphasize the

discontinuous nature of a discrete variable, such as the number of children in a


family.

61
MISLEADING GRAPHS
 Graphs can be constructed in an unscrupulous manner to support a particular point

of view.
 Indeed, this type of statistical fraud gives credibility to popular sayings, including

“Numbers don’t lie, but statisticians do”


“There are three kinds of lies—lies, damned lies, and statistics.”
For example, to imply that comparatively many students responded Yes to the Facebook profile question, an unscrupulous person might resort to various tricks:
 The width of the Yes bar is more than three times that of the No bar, thus violating

the custom that bars be equal in width.


 The lower end of the frequency scale is omitted, thus violating the custom that the

entire scale be reproduced, beginning with zero.
The height of the vertical axis is several times the width of the horizontal axis, thus violating the custom.
63
AVERAGES
Measures of Central Tendency
Numbers or words that attempt to describe, most generally, the middle or typical
value for a distribution
1. MODE - The value of the most frequent score

65
More Than One Mode – Bimodal
 Describes any distribution with two obvious peaks

 Distributions with more than two peaks are referred to as

multimodal

66
2. MEDIAN - The middle value when observations are ordered from least to
most

67
68
3. MEAN
The mean is found by adding all scores and then dividing by the number of scores.

Mean = 112/20 =5.60 Years

Mean = 672/11 =61.09
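A minimal sketch of the three averages using Python's statistics module (the scores are assumed for illustration, not the slide's data):

# Sketch: mode, median, and mean for a small set of assumed scores.
import statistics

scores = [2, 3, 3, 4, 5, 5, 5, 6, 7, 8]
print(statistics.mode(scores))     # most frequent score
print(statistics.median(scores))   # middle value of the ordered scores
print(statistics.mean(scores))     # sum of all scores divided by the number of scores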

69
Sample or Population
Population - A complete set of scores.
Sample - A subset of scores
Sample Mean (X̄) - The balance point for a sample, found by dividing the sum of the values of all scores in the sample by the number of scores in the sample.

Sample Size (n) - The total number of scores in the sample


Population Mean (μ) - The balance point for a population, found by dividing the
sum for all scores in the population by the number of scores in the population.
Population Size (N) - The total number of scores in the population
70
AVERAGE
1. If Distribution Is Not Skewed
 When a distribution of scores is not too skewed, the values of the mode, median, and
mean are similar, and any of them can be used to describe the central tendency of the
distribution
2. If Distribution Is Skewed
 Interpreting Differences between Mean and Median Ideally, when a distribution is
skewed, report both the mean and the median.
3. Special Status of the Mean - the mean is the single most preferred average
for quantitative data.
4. Using the Word Average
 an average can refer to the mode, median, or mean—or even to some more exotic
average, such as the geometric mean or the harmonic mean.
Conventional usage prescribes that average usually signifies mean, and this connotation is often reinforced by the context. For instance, grade point average is virtually synonymous with mean grade point average.
AVERAGES FOR QUALITATIVE AND RANKED DATA
1. Mode Always Appropriate for Qualitative Data
2. Median Sometimes Appropriate
 The median can be used whenever it is possible to order qualitative data from least to
most because the level of measurement is ordinal
3. Inappropriate Averages
 It would not be appropriate to report a median for unordered qualitative data with
nominal measurement
4. Averages for Ranked Data
 When the data consist of a series of ranks, with its ordinal level of measurement, the
median rank always can be obtained

72
DESCRIBING VARIABILITY

73
RANGE

Range: The difference between the largest and smallest scores.

Shortcomings of Range :
1. First, since its value depends on only two scores—the largest and the smallest

2. Value of the range tends to increase with increases in the total number of
scores.
3. Size of the range tends to vary with the size of the group

74
VARIANCE
Variance - The mean of all squared deviation scores.

STANDARD DEVIATION

Standard Deviation : A rough measure of the average (or standard) amount by


which scores deviate on either side of their mean.

Majority of Scores within One Standard Deviation


For most frequency distributions, a majority (often as many as 68 percent) of all
scores are within one standard deviation on either side of the mean.

75
A Small Minority of Scores Deviate More Than Two Standard Deviations
For most frequency distributions, a small minority (often as small as 5 percent) of
all scores deviate more than two standard deviations on either side of the mean.

76
Standard Deviation: A Measure of Distance
 Mean - measure of position
 Standard deviation - measure of distance (on either side of the mean of the
distribution).
Value of Standard Deviation Cannot Be Negative

77
SUM OF SQUARES (SS)
Sum of Squares (SS) - The sum of squared deviation scores.

Sum of Squares Formulas for Population

78
79
Sum of Squares Formulas for Sample
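The formula slides themselves are not reproduced above; based on the definition of SS just given, the standard definitional and computational forms are (a reconstruction, not copied from the slides):

\text{Population: } SS = \sum (X - \mu)^2 = \sum X^2 - \frac{(\sum X)^2}{N}
\text{Sample: } SS = \sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n}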

80
81
Standard Deviation for Population σ
Population Standard Deviation (σ) - A rough measure of the average amount by
which scores in the population deviate on either side of their population mean.

82
Standard Deviation for Sample (s)
Sample Standard Deviation (s) A rough measure of the average amount by which
scores in the sample deviate on either side of their sample mean
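The defining formulas, σ = √(SS/N) for a population and s = √(SS/(n − 1)) for a sample, can be sketched in Python as follows (the scores are assumed for illustration):

# Sketch: population vs. sample standard deviation from the sum of squares (SS).
import math

scores = [7, 3, 1, 0, 4]
n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)   # sum of squared deviations from the mean

sigma = math.sqrt(ss / n)         # population standard deviation (divide SS by N)
s     = math.sqrt(ss / (n - 1))   # sample standard deviation (divide SS by n - 1)
print(sigma, s)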

83
84
85
86
DEGREES OF FREEDOM (DF)
Degrees of Freedom (df ) : The number of values free to vary, given one or more
mathematical restrictions.

87
INTERQUARTILE RANGE (IQR)

 Interquartile Range (IQR) The range for the middle 50 per cent of the scores.
 The IQR equals the distance between the third quartile (or 75th percentile) and

the first quartile (or 25th percentile), that is, after the highest quarter (or top 25
percent) and the lowest quarter (or bottom 25 percent) have been trimmed from
the original set of scores.
 Not Sensitive to Extreme Scores

88
89
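A short sketch of the IQR computation (the scores, including the extreme value 99, are assumed for illustration):

# Sketch: interquartile range; the extreme score 99 barely affects the result.
import numpy as np

scores = np.array([3, 5, 6, 7, 8, 9, 10, 11, 12, 99])
q1, q3 = np.percentile(scores, [25, 75])    # first and third quartiles
print(q1, q3, q3 - q1)                      # IQR = Q3 - Q1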
MEASURES OF VARIABILITY FOR QUALITATIVE AND
RANKED DATA
 Measures of variability are virtually non-existent for qualitative or nominal data.
 It is probably adequate to note merely whether scores are evenly divided among the

various classes (maximum variability), unevenly divided among the various classes
(intermediate variability), or concentrated mostly in one class (minimum variability).
 For example, if the ethnic composition of the residents of a city is about evenly divided

among several groups, the variability with respect to ethnic groups is maximum; there is
considerable heterogeneity.
 At the other extreme, if almost all the residents are concentrated in a single ethnic group,

the variability will be minimum; there is little heterogeneity.


Ordered Qualitative and Ranked Data
 If qualitative data can be ordered because measurement is ordinal (or if the data are

ranked), then it’s appropriate to describe variability by identifying extreme scores (or
ranks).
For instance, the active membership of an officers’ club might include no one with a rank below first lieutenant or above brigadier general.
NORMAL DISTRIBUTION & Z-SCORE
Normal Curve : A theoretical curve noted for its symmetrical bell-shaped form.

91
92
Finding a Proportion for the Shaded Area
To find this new proportion, we cannot rely on the vertical scale in Figure 5.2, because it
describes as proportions the areas in the rectangular bars of histograms, not the areas in
the various curved sectors of the normal curve.
Properties of the Normal Curve
 Obtained from a mathematical equation, the normal curve is a theoretical curve defined

for a continuous variable, noted for its symmetrical bell-shaped form


 Because the normal curve is symmetrical, its lower half is the mirror image of its upper

half.
 Being bell shaped, the normal curve peaks above a point midway along the horizontal

spread and then tapers off gradually in either direction from the peak (without actually
touching the horizontal axis, since, in theory, the tails of a normal curve extend
infinitely far).
The values of the mean, median (or 50th percentile), and mode, located at a point midway along the horizontal spread, are the same for the normal curve.
Different Normal Curves

94
Z SCORES
z Score : A unit-free, standardized score that indicates how many standard
deviations a score is above or below the mean of its distribution.

A z score consists of two parts:


1. a positive or negative sign indicating whether it’s above or below the mean

2. a number indicating the size of its deviation from the mean in standard
deviation units.
 A z score of 2.00 always signifies that the original score is exactly two standard

deviations above its mean.


 Similarly, a z score of –1.27 signifies that the original score is exactly 1.27

standard deviations below its mean.

 A z score of 0 signifies that the original score coincides with the mean.
Replace X with 66 (the maximum permissible height)
μ with 69 (the mean height), and
σ with 3 (the standard deviation of heights) and solve for z as follows:
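The worked equation on the original slide is an image and is not reproduced above; from the substitutions just described it works out to:

z = \frac{X - \mu}{\sigma} = \frac{66 - 69}{3} = \frac{-3}{3} = -1.00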

cutoff height is exactly one standard deviation below the mean

96
STANDARD NORMAL CURVE
Standard Normal Curve : The tabled normal curve for z scores, with a mean of 0 and a
standard deviation of 1.
 Standard normal curve always has a mean of 0 and a standard
deviation of 1.
To verify that the mean of the standard normal distribution equals 0, replace X in the z score formula with μ, the mean of any (nonstandard) normal distribution, and then solve for z: z = (μ − μ)/σ = 0.
To verify that the standard deviation of the standard normal distribution equals 1, replace X in the z score formula with μ + 1σ, the value corresponding to one standard deviation above the mean for any (nonstandard) normal distribution, and then solve for z: z = (μ + 1σ − μ)/σ = σ/σ = 1.

97
98
99
SOLVING NORMAL CURVE PROBLEMS
 two main types of normal curve problems.

 In the first type of problem, we use a known score (or scores) to find an

unknown proportion. For instance, we use the known score of 66 inches to find
the unknown proportion of eligible FBI applicants.
 In the second type of problem, the procedure is reversed. Now we use a known

proportion to find an unknown score (or scores).

100
FINDING PROPORTIONS
Example: Finding Proportions for One Score
 to find the proportion of all FBI applicants who are shorter than exactly 66 inches, given that
the distribution of heights approximates a normal curve with a mean of 69 inches and a
standard deviation of 3 inches.
1. Sketch a normal curve and shade in the target area, as in the left part of Figure 5.6. Being
less than the mean of 69, 66 is located to the left of the mean. Furthermore, since the unknown
proportion represents those applicants who are shorter than 66 inches, the shaded target sector
is located to the left of 66.
2. Plan your solution according to the normal table. Decide precisely how you will find the
value of the target area. In the present case, the answer will be obtained from column C′ of the
standard normal table, since the target area coincides with the type of area identified with
column C′, that is, the area in the lower tail beyond a negative z.
3. Convert X to z. Express 66 as a z score: z = (66 − 69)/3 = −1.00.
4. Find the target area. Refer to the standard normal table, using the bottom
legend, as the z score is negative.
The arrows in Table 5.1 show how to read the table. Look up column A’ to 1.00
(representing a z score of –1.00), and note the corresponding proportion of .1587
in column C’: This is the answer, as suggested in the right part of Figure 5.6.
It can be concluded that only .1587 (or .16) of all of the FBI applicants will be
shorter than 66 inches.
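As a quick cross-check of the tabled value (a sketch; availability of scipy is assumed):

# Sketch: the area below z = -1.00 under the standard normal curve.
from scipy.stats import norm

z = (66 - 69) / 3
print(norm.cdf(z))     # about 0.1587, matching column C' of the table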

102
103
104
Finding Proportions between Two Scores
 the gestation periods for human fetuses approximate a normal curve with a mean of 270 days (9

months) and a standard deviation of 15 days.


 What proportion of gestation periods will be between 245 and 255 days?

1. Sketch a normal curve and shade in the target area, as in the top panel of Figure 5.7.
Satisfy yourself that, in fact, the shaded area represents just those gestation periods between
245 and 255 days.
2. Plan your solution according to the normal table.
This type of problem requires more effort to solve because the value of the target area cannot be
read directly from Table A. As suggested in the bottom two panels of Figure 5.7, the basic idea is
to identify the target area with the difference between two overlapping areas whose values can be
read from column C′ of Table A. The larger area (less than 255 days) contains two sectors: the
target area (between 245 and 255 days) and a remainder (less than 245 days). The smaller area
contains only the remainder (less than 245 days). Subtracting the smaller area (less than 245 days)
from the larger area (less than 255 days), therefore, eliminates the common remainder (less than
245 days), leaving only the target area (between 245 and 255 days).

3. Convert X to z by expressing 255 as z = (255 − 270)/15 = −1.00 and 245 as z = (245 − 270)/15 = −1.67.

4. Find the target area.
Look up column A′ to a negative z score of –1.00 (remember, you must
imagine the negative sign), and note the corresponding proportion of .1587
in column C′. Likewise, look up column A′ to a z score of –1.67, and note the
corresponding proportion of .0475 in column C′.
Subtract the smaller proportion from the larger proportion to obtain the
answer, .1112.
Thus, only .11, or 11 percent, of all gestation periods will be between 245
and 255 days.
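The same result can be cross-checked with the normal CDF (a sketch; availability of scipy is assumed):

# Sketch: proportion of gestation periods between 245 and 255 days.
from scipy.stats import norm

mu, sigma = 270, 15
p = norm.cdf((255 - mu) / sigma) - norm.cdf((245 - mu) / sigma)
print(p)               # about 0.11 (0.1587 - 0.0475)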
107
108
DESCRIBING RELATIONSHIPS: CORRELATION
Positive Relationship Occurs insofar as pairs of scores tend to occupy similar
relative positions (high with high and low with low) in their respective
distributions
Negative Relationship Occurs insofar as pairs of scores tend to occupy
dissimilar relative positions (high with low and vice versa) in their respective
distributions.

109
110
SCATTERPLOTS
Scatterplot A graph containing a cluster of dots that represents all pairs of
scores.

111
• A dot cluster that has a slope from the lower left to the upper right, as in panel
A of Figure 6.2, reflects a positive relationship.
• a dot cluster that has a slope from the upper left to the lower right, as in panel
B of Figure 6.2, reflects a negative relationship
• a dot cluster that lacks any apparent slope, as in panel C of Figure 6.2,
reflects little or no relationship

112
Strong or Weak Relationship

113
Linear Relationship A relationship that can be described best with a straight
line.
Curvilinear Relationship A relationship that can be described best with a
curved line.
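A brief sketch that computes a correlation coefficient and draws the corresponding scatterplot (the paired scores are assumed for illustration, not the slide's data):

# Sketch: Pearson correlation and a scatterplot for assumed paired scores.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 3, 5, 7, 9])       # e.g., cards sent
y = np.array([2, 4, 5, 8, 9])       # e.g., cards received

r = np.corrcoef(x, y)[0, 1]         # Pearson correlation coefficient
print(r)                            # close to +1: a strong positive, roughly linear relationship

plt.scatter(x, y)                   # the dot cluster; an upward slope signals a positive relationship
plt.xlabel("X"); plt.ylabel("Y")
plt.show()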

114
REGRESSION
REGRESSION LINE
 The regression line is a straight line rather than a curved line because of the

linear relationship between cards sent and cards received.
The regression line guides the string of arrows, beginning at 11, toward the predicted value of 15.20.
Placement of Line
 If all five dots had defined a single straight line, placement of the regression line

would have been simple


Predictive Errors
Solid dots reflect the actual number of cards received, and open dots, always
located along the regression line, reflect the predicted number of cards received.

115
116
LEAST SQUARES REGRESSION LINE
 the placement of the regression line minimizes not the total predictive error but
the total squared predictive error, that is, the total for all squared predictive
errors. When located in this fashion, the regression line is often referred to as
the least squares regression line.

Y´ - predicted value (the predicted number of cards that will be received


by any new friend, such as Emma);
X - represents the known value (the known number of cards sent by any
new friend)
b and a - represent numbers calculated from the original correlation analysis.
r - represents the correlation between X and Y (cards sent and
received by the five friends);
SSy - represents the sum of squares for all Y scores (the cards
received by the five friends);
SSx - represents the sum of squares for all X scores (the cards
sent by the five friends).

118
Ȳ and X̄ refer to the sample means for all Y and X scores, respectively,
and b is defined by the preceding expression.
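The formula images are not reproduced above; using the symbols defined here, the usual least squares expressions take the form (a reconstruction consistent with these definitions, not copied from the slides):

Y' = bX + a, \qquad b = r \sqrt{\frac{SS_y}{SS_x}}, \qquad a = \bar{Y} - b\bar{X}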

119
120
Least Squares Regression Equation The equation that minimizes the total of
all squared prediction errors for known Y scores in the original correlation
analysis.

121
STANDARD ERROR OF ESTIMATE, sy|x

SSy|x - represents the sum of the squares for predictive errors,


Y − Y′, and the degrees of freedom term in the denominator, n −
2, reflects the loss of two degrees of freedom because any
straight line, including the regression line, can be made to
coincide with two data points.
The symbol sy|x is read as “s sub y given x.”
122
SSy is the sum of the squares for Y scores (cards received by the five friends),
that is,

r is the correlation coefficient (cards sent and


received)
Standard Error of Estimate (sy|x) - A rough measure of the average amount of predictive error.
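Putting the pieces described above together, the formula has the form (a reconstruction from the surrounding definitions, not copied from the slides):

s_{y|x} = \sqrt{\frac{SS_{y|x}}{n - 2}}, \qquad SS_{y|x} = (1 - r^2)\, SS_y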
124
Substituting a value of 1 for r, we obtain

As expected, when predictions are based on perfect relationships, the sum


of squares for predictive errors equals zero, and there is no predictive
error.

125
As expected, when predictions are based on a nonexistent relationship, the sum of squares for predictive errors equals SSy, the sum of squares of the Y scores about Ȳ, and there is no reduction in predictive error.

126
INTERPRETATION OF r²
The squared correlation coefficient, r², provides us with not only a key interpretation of the correlation coefficient but also a measure of predictive accuracy that supplements the standard error of estimate, sy|x.
 Goal is to show the relationship between r2 and predictive accuracy, we will

initially concentrate on two kinds of predictive errors—those due to the


repetitive prediction of the mean and those due to the regression equation.
1. Repetitive Prediction of the Mean
 Pretend that we know the Y scores (cards received), but not the corresponding X

scores (cards sent), for each of the five friends. Lacking information about the
relationship between X and Y scores, we could not construct a regression
equation and use it to generate a customized prediction, Y′, for each friend.
We could, however, mount a primitive predictive effort by always predicting the mean, Ȳ, for each of the five friends’ Y scores.


Most important for our purposes, using the repetitive prediction of Ȳ for each of the Y scores of all five friends will supply us with a frame of reference against which to evaluate our customary predictive effort based on the correlation between cards sent (X) and cards received (Y).
Any predictive effort that capitalizes on an existing correlation between X and Y should be able to generate a smaller error variability, and, conversely, more accurate predictions of Y, than a primitive effort based only on the repetitive prediction of Ȳ.

128
2. Predictive Errors
The figure shows the predictive errors for all five friends when the mean for all five friends, Ȳ, of 12 (shown as the mean line) is always used to predict each of their five Y scores.

129
130
 predictive errors for all five friends when a series of different Y′ values,
obtained from the least squares equation (shown as the least squares line), is
used to predict each of their five Y scores.
Panel A of Figure 7.5 shows the error for John when the mean for all five friends, Ȳ, of 12 is used to predict his Y score of 6. Shown as a broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6) indicates that Ȳ overestimates John’s Y score by 6 cards.
 Panel B shows a smaller error of −1.20 for John when a Y′ value of 7.20 is used

to predict the same Y score of 6. This Y’ value of 7.20 is obtained from the least
squares equation

131
 Positive and negative errors indicate that Y scores are either above or below
their corresponding predicted scores.
Overall, as expected, errors are smaller when customized predictions of Y′ from the least squares equation can be used (because X scores are known) than when only the repetitive prediction of Ȳ can be used (because X scores are ignored). As with most statistical phenomena, there are exceptions: the predictive error for Doris is slightly larger when the least squares equation is used.
Error Variability (Sum of Squares)

132
Proportion of Predicted Variability
To obtain an SS measure of the actual gain in accuracy due to the least squares
predictions, subtract the residual variability from the total variability, that is,
subtract SSy|x from SSy , to obtain

To express this difference, 51.2, as a gain in accuracy relative to the original error variability for the repetitive prediction of Ȳ, divide the above difference by SSy, that is,
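The equation images are not reproduced above. Working backward from the stated values of 51.2 and .64, the example is consistent with SSy = 80 and SSy|x = 28.8 (an inference from the stated numbers, not copied from the slides):

SS_y - SS_{y|x} = 80 - 28.8 = 51.2, \qquad r^2 = \frac{SS_y - SS_{y|x}}{SS_y} = \frac{51.2}{80} = .64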

.64 or 64 percent, represents the proportion or percent gain in predictive


accuracy when the repetitive prediction of Y is replaced by a series of
133
customized Y′ predictions based on the least squares equation.
In other words, .64 or 64 percent represents the proportion or percent of the total variability of SSy that is predictable from its relationship with the X variable.
Squared Correlation Coefficient (r²)
The proportion of the total variability in one variable that is predictable from its
relationship with the other variable

• r2 Does Not Apply to Individual Scores


• Small Values of r2
• r2 Doesn’t Ensure Cause-Effect

134
MULTIPLE REGRESSION EQUATIONS
Multiple Regression Equation : A least squares equation that contains more
than one predictor or X variable.
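The equation image is not reproduced above; with two predictors, for example, a multiple regression equation takes the general form (a sketch of the general form, not the slide's specific example):

Y' = a + b_1 X_1 + b_2 X_2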

135
REGRESSION TOWARD THE MEAN
Regression Toward the Mean : A tendency for scores, particularly extreme
scores, to shrink toward the mean.
For example, because of regression toward the mean, we would expect that
students who made the top five scores on the first statistics exam would not
make the top five scores on the second statistics exam. Although all five
students might score above the mean on the second exam, some of their scores
would regress back toward the mean.
Most likely, the top five scores on the first exam reflect two components.
One relatively permanent component reflects the fact that these students are
superior because of good study habits, a strong aptitude for quantitative reasoning,
and so forth.
The other relatively transitory component reflects the fact that, on the day of the
exam, at least some of these students were very lucky because all sorts of little
chance factors, such as restful sleep, a pleasant commute to campus, etc., worked in their favor.
On the second test, even though the scores of these five students
continue to reflect an above-average permanent component, some of
their scores will suffer because of less good luck or even bad luck. The
net effect is that the scores of at least some of the original five top
students will drop below the top five scores—that is, regress back
toward the mean—on the second exam
There is good news for those students who made the five lowest scores
on the first exam. Although all five students might score below the
mean on the second exam, some of their scores probably will regress
up toward the mean. On the second exam, some of them will not be as
unlucky. The net effect is that the scores of at least some of the
original five lowest scoring students will move above the bottom five scores—that is, regress up toward the mean—on the second exam.


Appears in Many Distributions
It appears for the subset of best (or worst) performing stocks on the New York Stock Exchange across any period, such as a week, month, or year.

138
Regression Fallacy : Occurs whenever regression toward the mean is
interpreted as a real, rather than a chance, effect.
Israeli Air Force study of pilot training , Some trainees were praised after very
good landings, while others were reprimanded after very bad landings.
On their next landings, praised trainees did more poorly and reprimanded
trainees did better. It was concluded, therefore, that praise hinders but a
reprimand helps performance! A valid conclusion considers regression toward the
mean.
It’s reasonable to assume that, in addition to skill, chance plays a role in landings.
Some trainees who made very good landings were lucky, while some who made
very bad landings were unlucky.
Therefore, there would be a tendency, attributable to chance, that good landings
would be followed by less good landings and poor landings would be followed
by less poor landings—even if trainees had not been praised after very good landings or reprimanded after very bad landings.


Avoiding the Regression Fallacy
The regression fallacy can be avoided by splitting the subset of extreme
observations into two groups.
In the previous example, one group of trainees would continue to be praised
after very good landings and reprimanded after very poor landings.
A second group of trainees would receive no feedback whatsoever after very
good and very bad landings.
In effect, the second group would serve as a control for regression toward the
mean, since any shift toward the mean on their second landings would be due to
chance.
Most important, any observed difference between the two groups (that survives
a statistical analysis described in Part 2) would be viewed as a real difference
not attributable to the regression effect.
140
END

141
142
