BA 216 Lecture 3 Notes

The document covers concepts of central tendency and dispersion in statistics, including measures like mean, median, mode, range, variance, and standard deviation. It emphasizes the importance of representative sampling for valid statistical inferences and warns against anecdotal evidence. Additionally, it introduces the use of Box-and-Whisker plots for visualizing data distributions.

Uploaded by

Harrison Lim


BA 216

Calculating and comparing measures for Central Tendency & Dispersion;

Box-and-Whisker plots

Note*: Use ?(data set) to answer your homework questions (second attempt)
Don’t forget to correct your previous answers on the multiple choice questions

Lecture Overview:
1. Finishing R Demo 1
2. Introduction to Populations & Sampling (and notation for population parameters vs.
point estimates)
3. Describing numerical data distributions
● Central Tendency (mean, median, mode)
● Dispersion (range, IQR, variance, standard deviation)
● Outliers
4. Examining both centrality and dispersion with Box-and-whisker plots

Homework:
1. Note: Don’t forget HW1 is due this Friday (9/10).
2. HW2 should be posted next Monday (9/13) and will be due the following Monday (9/20).

Introduction to RStudio setup and operation

● The console
● The environment
● R Scripts (and why to use them vs. the console)
● Plots
● Libraries

Basic R Operations and Concepts

● Commenting with #
● Arithmetic
● Order of operations
● Variables: Assignment, Names, Vectors, Functions, Help
● Libraries & ggplot

Sampling is natural – we do it all the time

● Let’s say, you’ve decided to make soup.
● You taste a spoonful of soup, and decide the spoonful you tasted isn't salty
enough
○ This is an exploratory analysis, where you describe a sample (spoonful)
from the whole pot of soup

● You think to yourself, probably your entire soup needs salt.

○ An inference about the whole population of soup spoonfuls

For reliable statistics, we need representative samples

● For your inference to be valid, the spoonful you tasted (the sample) needs to be
representative of the entire pot (the population).
○ If your spoonful comes only from the surface and the salt is collected at
the bottom of the pot, what you tasted is probably not representative of the
whole pot.
○ If you first stir the soup thoroughly before you taste, your spoonful will
more likely be representative of the whole pot.

● We need a properly representative sample so we can compute inferential
statistics and generalize the findings to a population. You gotta stir your soup!

Samples vs. populations

● Each research question refers to a target population.
● Often, it is too expensive (or will take too long, or is impossible) to collect
data for every case in a population. Instead, a sample is taken.
● A sample represents a subset of the cases and is usually only a small fraction
of the population.
● The shorthand statistical notation is usually different for populations vs.
samples (see the population vs. sample size to the right)

Point estimates vs. Population parameters

● POPULATION PARAMETERS: the characteristics of the whole population (like
the population mean, population standard deviation, or population proportion)
○ NOTE! The “true” population parameter is effectively “hidden” from
statisticians and analysts.
○ Question: Why would this be?

● POINT ESTIMATES (also called SAMPLE STATISTICS): the things you
calculate from your sample (like sample mean, sample standard deviation,
sample proportion) are considered point estimates for the population parameters.
● Often, a goal of statistics is to go from describing a sample, to making educated
guesses about the characteristics of the whole population (i.e. “inferences”).
○ Put another way: we hope to use the sample’s point estimates
to make inferences about the ‘hidden’ population parameters.

Shorthand statistical notation for Population parameters vs. Point
estimates/sample statistics

Examples: Populations and samples

Consider the following three research questions:

1. What is the average mercury content in swordfish in the Atlantic Ocean?
2. Over the last 5 years, what is the average time to complete a degree for Duke
undergrads?
3. Does a new drug reduce the number of deaths in patients with severe heart
disease?

For each, what is the target population? What is an individual case/observation?

“Over the last 5 years, what is the average time to complete a degree for Duke
undergrads?”
A: This question is only relevant to students who complete their degree; the
average cannot be computed using a student who never finished her degree. Thus, only
Duke undergrads who graduated in the last five years represent cases in the population
under consideration. Each such student is an individual case.

“Does a new drug reduce the number of deaths in patients with severe heart disease?”

A: A person with severe heart disease represents a case. The population includes all
people with severe heart disease.

Anecdotal data – beware!

1. What is the average mercury content in swordfish in the Atlantic Ocean?
● Someone might say -- “Well, a man on the news got mercury poisoning from
eating swordfish, so the average mercury concentration in swordfish must be
dangerously high.”

2. Over the last 5 years, what is the average time to complete a degree for Duke
undergrads?
● “I met two students who took more than 7 years to graduate from Duke, so it
must take longer to graduate at Duke than at many other colleges.”

3. Does a new drug reduce the number of deaths in patients with severe heart disease?
● “My friend's dad had a heart attack and died after they gave him a new heart
disease drug, so the drug must not work.”

Question: What’s wrong here?

Sure, each conclusion is based on “data.”

However, there are two big problems.

● First, the “data” only represent one or two cases.
● Second, and more importantly, it is unclear whether these individual cases are
representative of the population. These might represent unusual or even
extraordinary cases.

Say no to anecdotal evidence, say yes to proper sampling
● Instead of looking at the most unusual cases, which are often the ones most easily
remembered as anecdotes, we should examine a sample of cases that accurately
represents the target population.
● As researchers, we want to pull a sample of people (or dogs, or cars, or whatever
other population we’re interested in) that have a similar set of characteristics to
our target population.
● This is called a representative sample.

Example 1: Non-representative samples can muddy your results

● We (as consumers) can easily access ratings for products, sellers, and
companies through websites. These ratings are based only on those people who
go out of their way to provide a rating.
○ Q: If 50% of online reviews for a product are negative, do you think this
definitely means that 50% of buyers are dissatisfied with the product?

○ A: From our own anecdotal experiences, we believe people tend to rant
more about products that fell below expectations than rave about those
that perform as expected. For this reason, we suspect there is a negative
bias in product ratings on sites like Amazon. However, since our
experiences may not be representative, we also keep an open mind.

Example 2: Non-representative samples can muddy your results

● Sometimes, a sample “chooses itself.”
● Example: suppose a family doctor located in Santa Monica, CA wants to do
some research on the frequency with which various ailments occur among the
patients who happen to visit her office over a period of time.
● Her “sample members” chose themselves by contacting her clinic for care.

● Q: What are the concerns?

○ A: Can’t assume the sample will be typical (i.e. representative) of patients
living in that area.
○ A: Very useful for their office, less useful for talking about patients in the
US, or even in California, or even in LA!

Random sampling is incredibly important for a representative sample
● In research or analysis, the researcher starts off with the population in mind.
● He or she then selects a sample that they believe will represent it.
● In order for a sample to be representative, the members of the sample must be chosen
at RANDOM from the population. Each member of the population should have an
equal chance of being chosen.
● This is not always easy.
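
The "equal chance of being chosen" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 1,000 numbered people, the sample size of 50, and the seed are all made up for the example and do not come from the lecture.

```python
import random

# Hypothetical population: ID numbers for 1,000 people (illustrative only).
population = list(range(1, 1001))

# A seeded generator so the same draw can be reproduced (seed is arbitrary).
rng = random.Random(216)

# random.sample draws WITHOUT replacement, and every member of the
# population has the same chance of ending up in the sample.
sample = rng.sample(population, k=50)

print(len(sample))         # number of cases drawn
print(sorted(sample)[:5])  # a peek at the first few selected IDs
```

Letting the computer pick the cases takes the researcher's own preferences out of the selection step.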

But, representative samples can be tricky to get…

Example – problems with non-random sampling
● If you interviewed people in downtown LA about their political views by
standing on a street corner and talking to passersby, you would be unlikely to get a
representative sample.
● Question - why might that be?
○ You are most likely to approach (and be successful with) people who are
not in a huge hurry, people without headphones on, or people who don’t
look super angry.
○ These people may very well differ in their political opinions from those
you avoided (or who wouldn’t talk to you).

● Despite your best efforts, you’ve introduced Statistical Systematic Bias into
your sample.

Random sampling is KEY for reducing systematic bias
● If someone was permitted to pick and choose exactly which people/observations
were included in the sample, it is entirely possible that the sample could be
skewed to that person's interests, subconscious biases, laziness, or a whole host
of other issues.
● This introduces STATISTICAL SYSTEMATIC BIAS into a sample.
○ Statistical Systematic Bias is the difference between the sample value
that you calculate from your problematic sample, and the true population
value.
■ This bias most commonly arises from two problems: (1) measurements
being taken on a nonrepresentative sample, and/or (2) incorrect
measurements being taken
○ Usually “systematic bias” is unintentional and accidental! But no less of an
issue…
● It’s preferable to use mechanical or computerized methods of selecting a
random sample. We’ll cover what this looks like in future classes.
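
As a rough illustration of systematic bias, the sketch below returns to the soup analogy from earlier in these notes. Everything in it is an assumption made for the example: the pot is modeled as 10,000 hypothetical "spoonfuls" whose saltiness increases with depth, and the numbers are invented. It compares a surface-only sample with a random sample from the whole pot, using the definition above (sample value minus true population value).

```python
import random
import statistics

rng = random.Random(3)  # arbitrary seed for reproducibility

# Hypothetical pot: index 0 is the surface, index 9999 is the bottom.
# Saltiness rises with depth because the salt has settled (made-up scale).
pot = [0.5 + 1.0 * i / 9999 + rng.gauss(0, 0.05) for i in range(10_000)]
population_mean = statistics.mean(pot)

# Non-representative sample: 100 spoonfuls from the top 10% of the pot only.
surface_sample = rng.sample(pot[:1000], k=100)

# Random sample: 100 spoonfuls drawn from the whole ("stirred") pot.
random_sample = rng.sample(pot, k=100)

# Difference between each sample mean and the true population mean.
surface_bias = statistics.mean(surface_sample) - population_mean
random_error = statistics.mean(random_sample) - population_mean

print(f"population mean:        {population_mean:.3f}")
print(f"surface sample is off by {surface_bias:+.3f}")
print(f"random sample is off by  {random_error:+.3f}")
```

The surface-only sample misses in the same direction every time it is repeated, which is what makes the bias systematic; the random sample's error is small and can fall on either side.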

Summary: sampling & statistical systematic bias

● We’ve covered the differences between samples and populations.
○ This was important before we cover the mathematics behind mean &
standard deviation (and others) as the statistical notation (and sometimes
the math!!) changes based on if you are working with a sample or a
population

● We’ve also introduced the idea of statistical bias in research. This most
commonly comes from a non-random sample, but we’ll learn about other ways
that statistical bias can sneak into research studies and statistics.

● In future lectures, we’ll also be covering important concepts related to sampling
and bias, including: the difference between observational studies and
experiments; different sampling methods including stratified sampling; the key
concepts of randomized experiments and how they reduce statistical bias.

Central Tendency (mean, median, mode)

Two key ways to describe a dataset are Central Tendency & Dispersion (we’ll be
adding shape/modality, skewness, and outliers over the next two lectures)
● There are two ultra-important questions to answer when describing a dataset.
○ Central Tendency: where is the ‘middle’ of the dataset?
○ Dispersion: how spread out, how ‘wide’ is the dataset?

● Measures of Central Tendency

○ Categorical data: Mode
○ Numerical data: Mean & median, rarely mode

● Measures of Dispersion
○ Categorical data: Range
○ Numerical data: Standard deviation, Interquartile Range (IQR), rarely
range

Central Tendency for Categorical Data -- Mode

● A mode is represented by a prominent peak in the distribution.
● A definition of mode sometimes taught in math classes is the value with the most
occurrences in the data set.
● However, for many real-world numerical data sets, it is common to have no
observations with the same value, making this definition impractical in data
analysis.
● The mode is the most useful measure of central tendency for categorical data, and is
most commonly used for ordinal categorical data.

Central Tendency for Numerical Data -- Median

Central Tendency for Numerical Data -- Mean (x̄ and µ)

Key (if obvious) concept: the sample mean (x̄) provides a window to the
true, hidden population mean (µ)
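
A small Python sketch of the three measures of central tendency, using the standard-library `statistics` module. The commute times and transport categories are hypothetical values invented for the example; the single large value (95) is included to show how an unusual observation pulls the mean but not the median.

```python
import statistics

# Hypothetical numerical sample: commute times in minutes (made-up values).
commute_minutes = [12, 15, 15, 18, 20, 22, 25, 30, 95]

sample_mean = statistics.mean(commute_minutes)      # x-bar: sum / count
sample_median = statistics.median(commute_minutes)  # middle value once sorted

# Hypothetical categorical sample: the mode is the most frequent category.
transport = ["car", "bus", "car", "bike", "car", "bus"]
transport_mode = statistics.mode(transport)

print(sample_mean)     # pulled upward by the 95-minute commute
print(sample_median)   # unaffected by how extreme the largest value is
print(transport_mode)
```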

Shorthand statistical notation for population vs. sample mean

Dispersion (range, standard deviation, IQR)

Overview of dispersion – 3 options

● Range - the highest number minus the lowest number
○ Simpler than variance or standard deviation
○ This is the only option for categorical data, but also can be used for
numerical data if a super simple statistic is needed.

● Variance & Standard deviation

○ Mathematically and analytically paired with mean
○ Can be used with numerical data

● Quartiles & Interquartile range (IQR)

○ Mathematically and analytically paired with median
○ Can be used with numerical data

DEVIATION is just another way to say “distance from mean”, which we use to calculate
variance (and standard deviation)

The Variance (s²) of a sample is roughly the average squared deviation from the mean,
across all the observations in the dataset

Why is squared deviation used in the numerator when calculating variance (s²)?

Now, we use variance (s²) to find standard deviation (s)

Summary: variance (s²) vs. standard deviation (s)

Summary of process for finding standard deviation (s):

Deviation → Variance → Standard Deviation

● The variance is the average squared distance from the mean.

● The standard deviation is the square root of the variance.

○ The standard deviation is useful when considering how far the data are
distributed from the mean.
○ We nearly always use standard deviation, because…

● The standard deviation represents the typical deviation of observations from the
mean.
○ Usually about 70% of the data will be within one standard deviation of the
mean and about 95% will be within two standard deviations.
○ However, these percentages are not strict rules.
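
The Deviation → Variance → Standard Deviation process can be sketched step by step in Python. The five scores are hypothetical. The slide formula is not reproduced in these notes, so the sketch assumes the usual sample formula, which divides the summed squared deviations by n − 1 rather than n (this is why the variance is only "roughly" an average).

```python
import math
import statistics

# Hypothetical sample of five scores (made-up values).
scores = [4, 8, 6, 5, 7]
n = len(scores)
x_bar = sum(scores) / n  # sample mean

# Step 1: deviation = distance of each observation from the mean.
deviations = [x - x_bar for x in scores]

# Step 2: variance = sum of squared deviations divided by (n - 1).
# Squaring stops negative and positive deviations from cancelling out.
squared_deviations = [d ** 2 for d in deviations]
variance = sum(squared_deviations) / (n - 1)

# Step 3: standard deviation = square root of the variance,
# which puts the result back in the original units.
std_dev = math.sqrt(variance)

print(deviations)
print(variance)
print(std_dev)

# Cross-check against the standard library's sample formulas.
print(statistics.variance(scores), statistics.stdev(scores))
```

Note that the raw deviations always sum to zero, so they cannot be averaged directly; that is one practical reason for squaring them first.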

Summary: sample statistics (x̄) vs. population parameters (µ)

Numerical Variables:
● When working with averages/means, we calculate a sample statistic x̄ from our
sample, and (if certain conditions are met) use that as the point estimate for the
(“real”, but unknown) population parameter µ.
○ Population parameter: The “true population mean” is denoted with the
Greek symbol µ, pronounced “mew”
○ Sample Statistic/point estimate: The “sample mean” is denoted with x̄,
pronounced “x-bar”.
○ Sample standard deviation – s
○ Population standard deviation - σ

Interquartile range (IQR) is another way to measure dispersion, and is conceptually
related to median
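
A brief Python sketch of quartiles and the IQR, shown next to the range for comparison. The data are hypothetical commute times. One assumption to flag: quartiles can be computed by several slightly different conventions, and `statistics.quantiles` uses its default ("exclusive") method here, so other software may report slightly different Q1 and Q3 values for the same data.

```python
import statistics

# Hypothetical sample: commute times in minutes, with one unusually long commute.
commute_minutes = [12, 15, 15, 18, 20, 22, 25, 30, 95]

# Range: highest value minus lowest value.
data_range = max(commute_minutes) - min(commute_minutes)

# Quartiles split the sorted data into four parts; Q2 is the median.
q1, q2, q3 = statistics.quantiles(commute_minutes, n=4)

# IQR: the width of the middle half of the data.
iqr = q3 - q1

print(f"range = {data_range}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}")
```

Because the IQR only looks at the middle half of the data, the single 95-minute commute stretches the range but has little effect on the IQR.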
