

BA 216 Lecture 3 Notes

The document covers concepts of central tendency and dispersion in statistics, including measures like mean, median, mode, range, variance, and standard deviation. It emphasizes the importance of representative sampling for valid statistical inferences and warns against anecdotal evidence. Additionally, it introduces the use of Box-and-Whisker plots for visualizing data distributions.

Uploaded by

Harrison Lim
Copyright © All Rights Reserved

BA 216

Calculating and comparing measures for Central Tendency & Dispersion; Box-and-Whisker plots

Note*: Use ?(data set) to answer your homework questions (second attempt)
Don’t forget to correct your previous answers on the multiple choice questions

Lecture Overview:
1. Finishing R Demo 1
2. Introduction to Populations & Sampling (and notation for population parameters vs.
point estimates)
3. Describing numerical data distributions
● Central Tendency (mean, median, mode)
● Dispersion (range, IQR, variance, standard deviation)
● Outliers
4. Examining both centrality and dispersion with Box-and-whisker plots

Homework:
1. Note: Don’t forget HW1 is due this Friday (9/10).
2. HW2 should be posted next Monday (9/13) and will be due the following Monday (9/20).

Introduction to RStudio setup and operation


● The console
● The environment
● R Scripts (and why to use them vs. the console)
● Plots
● Libraries

Basic R Operations and Concepts


● Commenting with #
● Arithmetic
● Order of operations
● Variables: Assignment, Names, Vectors, Functions, Help
● Libraries & ggplot

Sampling is natural – we do it all the time


● Let’s say, you’ve decided to make soup.
● You taste a spoonful of soup, and decide the spoonful you tasted isn't salty
enough
○ This is an exploratory analysis, where you describe a sample (spoonful)
from the whole pot of soup

● You think to yourself that your entire soup probably needs salt.
○ This is an inference about the whole population of soup spoonfuls

For reliable statistics, we need representative samples


● For your inference to be valid, the spoonful you tasted (the sample) needs to be
representative of the entire pot (the population).
○ If your spoonful comes only from the surface and the salt is collected at
the bottom of the pot, what you tasted is probably not representative of the
whole pot.
○ If you first stir the soup thoroughly before you taste, your spoonful will
more likely be representative of the whole pot.

● We need a properly representative sample so we can compute inferential
statistics and generalize the findings to a population. You gotta stir your soup!

Samples vs. populations


● Each research question refers to a target population.
● Often, it is too expensive (or will take too long, or is impossible) to collect
data for every case in a population. Instead, a sample is taken.
● A sample represents a subset of the cases and is usually only a small fraction
of the population.
● The shorthand statistical notation is usually different for populations vs.
samples (see the population vs. sample size to the right)

Point estimates vs. Population parameters

● POPULATION PARAMETERS: the characteristics of the whole population (like
the population mean, population standard deviation, or population proportion)
○ NOTE! The “true” population parameter is effectively “hidden” from
statisticians and analysts.
○ Question: Why would this be?

● POINT ESTIMATES (also called SAMPLE STATISTICS): the things you
calculate from your sample (like sample mean, sample standard deviation,
sample proportion) are considered point estimates for the population parameters.
● Often, a goal of statistics is to go from describing a sample to making educated
guesses about the characteristics of the whole population (i.e. “inferences”).
○ Put another way, we hope to use the sample’s point estimates
to make inferences about the ‘hidden’ population parameters.
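The idea of a point estimate standing in for a hidden parameter can be sketched with a short simulation. This is only an illustration in Python (not the course's R), and the population below is made up: in real work you would never get to see `mu`, which is exactly why the sample's `x_bar` matters.

```python
import random
import statistics

random.seed(216)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 made-up degree completion times (in years).
population = [random.gauss(4.2, 0.6) for _ in range(10_000)]

# Population parameter. Normally this is hidden from the analyst.
mu = statistics.mean(population)

# Draw a simple random sample of 100 cases and compute the point estimate.
sample = random.sample(population, 100)
x_bar = statistics.mean(sample)

print(f"population mean (mu): {mu:.3f}")
print(f"sample mean (x-bar):  {x_bar:.3f}")
print(f"estimation error:     {x_bar - mu:+.3f}")
```

The sample mean will not equal the population mean exactly, but with a random sample it should land close to it.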

Shorthand statistical notation for Population parameters vs. Point estimates/sample statistics

Examples: Populations and samples

Consider the following three research questions:


1. What is the average mercury content in swordfish in the Atlantic Ocean?
2. Over the last 5 years, what is the average time to complete a degree for Duke
undergrads?
3. Does a new drug reduce the number of deaths in patients with severe heart
disease?

For each, what is the target population? What is an individual case/observation?

“Over the last 5 years, what is the average time to complete a degree for Duke
undergrads?”
A: This question is only relevant to students who complete their degree; the
average cannot be computed using a student who never finished her degree. Thus, only
Duke undergrads who graduated in the last five years represent cases in the population
under consideration. Each such student is an individual case.

“Does a new drug reduce the number of deaths in patients with severe heart disease?”

A: A person with severe heart disease represents a case. The population includes all
people with severe heart disease.

Anecdotal data – beware!


1. What is the average mercury content in swordfish in the Atlantic Ocean?
● Someone might say -- “Well, a man on the news got mercury poisoning from
eating swordfish, so the average mercury concentration in swordfish must be
dangerously high.”

2. Over the last 5 years, what is the average time to complete a degree for Duke
undergrads?
● “I met two students who took more than 7 years to graduate from Duke, so it
must take longer to graduate at Duke than at many other colleges.”

3. Does a new drug reduce the number of deaths in patients with severe heart disease?
● “My friend's dad had a heart attack and died after they gave him a new heart
disease drug, so the drug must not work.”

Question: What’s wrong here?

Sure, each conclusion is based on “data.”

However, there are two big problems.


● First, the “data” only represent one or two cases.
● Second, and more importantly, it is unclear whether these individual cases are
representative of the population. These might represent unusual or even
extraordinary cases.

Say no to anecdotal evidence, say yes to proper sampling
● Instead of looking at the most unusual cases, which are often the ones most easily
remembered as anecdotes, we should examine a sample of cases that accurately
represents the target population.
● As researchers, we want to pull a sample of people (or dogs, or cars, or whatever
other population we’re interested in) that have a similar set of characteristics to
our target population.
● This is called a representative sample.

Example 1: Non-representative samples can muddy your results


● We (as consumers) can easily access ratings for products, sellers, and
companies through websites. These ratings are based only on those people who
go out of their way to provide a rating.
○ Q: If 50% of online reviews for a product are negative, do you think this
definitely means that 50% of buyers are dissatisfied with the product?

○ A: From our own anecdotal experiences, we believe people tend to rant
more about products that fell below expectations than rave about those
that perform as expected. For this reason, we suspect there is a negative
bias in product ratings on sites like Amazon. However, since our
experiences may not be representative, we also keep an open mind.

Example 2: Non-representative samples can muddy your results


● Sometimes, a sample “chooses itself.”
● Example: suppose a family doctor located in Santa Monica, CA wants to do
some research on the frequency with which various ailments occur among the
patients who happen to visit her office over a period of time.
● Her “sample members” chose themselves by contacting her clinic for care.

● Q: What are the concerns?
○ A: Can’t assume the sample will be typical (i.e. representative) of patients
living in that area.
○ A: Very useful for her office, less useful for talking about patients in the
US, or even in California, or even in LA!

Random sampling is incredibly important for a representative sample
● In research or analysis, the researcher starts off with the population in mind.
● He or she then selects a sample that they believe will represent it.
● In order for a sample to be representative, the members of the sample must be chosen
at RANDOM from the population. Each member of the population should have an
equal chance of being chosen.
● This is not always easy.

But, representative samples can be tricky to get…


Example – problems with non-random sampling
● If you wanted to interview people in downtown LA about their political views by
standing on a street corner and talking to people, you would be unlikely to get a
representative sample.
● Question - why might that be?
○ You are most likely to approach (and be successful with) people who are
not in a huge hurry, people without headphones on, or people who don’t
look super angry.
○ These people may very well differ in their political opinions from those
you avoided (or who wouldn’t talk to you)

● Despite your best efforts, you’ve introduced Statistical Systematic Bias into
your sample.

Random sampling is KEY for reducing systematic bias
● If someone were permitted to pick and choose exactly which people/observations
were included in the sample, it is entirely possible that the sample could be
skewed to that person's interests, subconscious biases, laziness, or a whole host
of other issues.
● This introduces STATISTICAL SYSTEMATIC BIAS into a sample.
○ Statistical Systematic Bias is the difference between the sample value
that you calculate from your problematic sample, and the true population
value.
■ This is most commonly due to two problems: (1) measurements
being taken on a nonrepresentative sample, and/or (2) incorrect
measurements being taken
○ Usually “systematic bias” is unintentional and accidental! But no less of an
issue…
● It’s preferable to use mechanical or computerized methods of selecting a
random sample. We’ll cover what this looks like in future classes.
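To make the definition of systematic bias concrete, here is a rough sketch of the street-corner example in Python (not the course's R). Every number is invented for illustration: the population, the link between being in a hurry and political opinion, and the helper `support_rate` are all assumptions, not real data.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population of 1,000 pedestrians: (in_a_hurry, supports_measure)
population = (
    [(True, True)] * 180 + [(True, False)] * 420      # 600 hurried, 30% support
    + [(False, True)] * 280 + [(False, False)] * 120  # 400 unhurried, 70% support
)

def support_rate(people):
    """Share of the people in the list who support the measure."""
    return sum(supports for _, supports in people) / len(people)

true_rate = support_rate(population)

# Street-corner sample: only people who are not in a hurry stop to talk.
convenience_sample = [p for p in population if not p[0]]

# Simple random sample: a computer picks 200 people, each equally likely.
random_sample = random.sample(population, 200)

# Systematic bias = value from the problematic sample minus the true value.
bias = support_rate(convenience_sample) - true_rate

print(f"true population rate:    {true_rate:.2f}")
print(f"street-corner sample:    {support_rate(convenience_sample):.2f}")
print(f"random sample:           {support_rate(random_sample):.2f}")
print(f"bias of street-corner:   {bias:+.2f}")
```

In this made-up setup the street-corner sample overstates support by 24 percentage points no matter how many unhurried people you talk to, while the random sample stays near the true rate.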

Summary: sampling & statistical systematic bias


● We’ve covered the differences between samples and populations.
○ This is important to establish before we cover the mathematics behind mean &
standard deviation (and others), as the statistical notation (and sometimes
the math!!) changes based on whether you are working with a sample or a
population

● We’ve also introduced the idea of statistical bias in research. This most
commonly comes from a non-random sample, but we’ll learn about other ways
that statistical bias can sneak into research studies and statistics.

● In future lectures, we’ll also be covering important concepts related to sampling
and bias, including: the difference between observational studies and
experiments; different sampling methods including stratified sampling; the key
concepts of randomized experiments and how they reduce statistical bias.

Central Tendency (mean, median, mode)


Two key ways to describe a dataset are Central Tendency & Dispersion (we’ll be
adding shape/modality, skewness, and outliers over the next two lectures)
● There are two ultra-important questions to answer when describing a dataset.
○ Central Tendency: where is the ‘middle’ of the dataset?
○ Dispersion: how spread out, how ‘wide’ is the dataset?

● Measures of Central Tendency
○ Categorical data: Mode
○ Numerical data: Mean & median, rarely mode

● Measures of Dispersion
○ Categorical data: Range
○ Numerical data: Standard deviation, Interquartile Range (IQR), rarely
range

Central Tendency for Categorical Data -- Mode


● A mode is represented by a prominent peak in the distribution.
● A definition of mode sometimes taught in math classes is the value with the most
occurrences in the data set.
● However, for many real-world numerical data sets, it is common to have no
observations with the same value, making this definition impractical in data
analysis.
● Most useful measure of central tendency for categorical data, and most
commonly used for ordinal categorical data.
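A quick sketch of both points above, in Python rather than the course's R, with made-up data: the "most occurrences" definition works well for categories, but for numerical data with no repeated values every observation ties for first place.

```python
from collections import Counter
import statistics

# Hypothetical survey responses (ordinal categorical data).
responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

counts = Counter(responses)          # tally of each category
mode = statistics.mode(responses)    # the most frequent category

print(counts.most_common())
print(mode)

# Hypothetical numerical data with no repeated values.
heights = [61.2, 64.8, 66.1, 68.5, 70.3]

# multimode returns every value tied for most frequent: here, all of them.
print(statistics.multimode(heights))
```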

Central Tendency for Numerical Data -- Median

Central Tendency for Numerical Data -- Mean (x̄ and µ)


Key (if obvious) concept: the sample mean (x̄) provides a window to the
true, hidden population mean (µ)
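As a small illustration of the two numerical measures (in Python, not the course's R, using a made-up sample): the mean adds everything up and divides by the number of observations, while the median is the middle value of the sorted data.

```python
import statistics

# Hypothetical sample (made up for illustration); 40 is an unusually large value.
data = [2, 3, 3, 5, 7, 10, 40]

x_bar = sum(data) / len(data)      # sample mean: total divided by n
median = statistics.median(data)   # middle value of the sorted data

print(x_bar)   # 10.0
print(median)  # 5
```

Here the single large value pulls the mean up to 10, while the median stays at 5.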

Shorthand statistical notation for population vs. sample mean
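For reference, the two means are computed the same way and only the symbols change. The size symbols n (sample) and N (population) are the conventional ones and are assumed here.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\quad\text{(sample mean, a point estimate)}
\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\quad\text{(population mean, a parameter)}
\]
```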


Dispersion (range, standard deviation, IQR)

Overview of dispersion – 3 options


● Range - the highest number minus the lowest number
○ Simpler than the other measures of dispersion
○ This is the only option for categorical data, but also can be used for
numerical data if a super simple statistic is needed.

● Variance & Standard deviation
○ Mathematically and analytically paired with mean
○ Can be used with numerical data

● Quartiles & Interquartile range (IQR)
○ Mathematically and analytically paired with median
○ Can be used with numerical data

DEVIATION is just another way to say “distance from mean”, which we use to calculate
variance (and standard deviation)
The Variance (s²) of a sample is roughly the average squared deviation from the mean, across
all the observations in the dataset

Why is squared deviation used in the numerator when calculating variance (s²)?

Now, we use variance (s²) to find standard deviation (s)

Summary: variance (s²) vs. standard deviation (s)
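The standard textbook definitions are given here as a reference, on the assumption that they match what the slides showed. The variance is only "roughly" an average because the sum is divided by n − 1 rather than n.

```latex
\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\qquad
s = \sqrt{s^2}
\]
```

One standard reason for squaring: the raw deviations always cancel out, so their plain average would be zero for every dataset.

```latex
\[
\sum_{i=1}^{n} (x_i - \bar{x}) = 0
\]
```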

Summary of process for finding standard deviation (s):


Deviation → Variance → Standard Deviation

● The variance is the average squared distance from the mean.

● The standard deviation is the square root of the variance.
○ The standard deviation is useful when considering how far the data are
distributed from the mean.
○ We nearly always use standard deviation, because…

● The standard deviation represents the typical deviation of observations from the
mean.
○ Usually about 70% of the data will be within one standard deviation of the
mean and about 95% will be within two standard deviations.
○ However, these percentages are not strict rules.
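The Deviation → Variance → Standard Deviation process can be walked through by hand on a tiny sample. This is a sketch in Python (not the course's R) with made-up data, using the usual n − 1 divisor for a sample.

```python
import math
import statistics

# Hypothetical sample of 8 observations (made up for illustration).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Step 1: deviation = distance of each observation from the mean.
x_bar = statistics.mean(data)
deviations = [x - x_bar for x in data]

# Step 2: sample variance = sum of squared deviations divided by n - 1.
s2 = sum(d ** 2 for d in deviations) / (n - 1)

# Step 3: standard deviation = square root of the variance.
s = math.sqrt(s2)

# Share of observations within one and two standard deviations of the mean.
within_1 = sum(abs(d) <= s for d in deviations) / n
within_2 = sum(abs(d) <= 2 * s for d in deviations) / n

print(f"mean={x_bar}, variance={s2:.3f}, sd={s:.3f}")
print(f"within 1 sd: {within_1:.0%}, within 2 sd: {within_2:.0%}")
```

For this sample the shares come out to 75% and 100%, close to but not exactly the usual rule of thumb. The built-in `statistics.variance` and `statistics.stdev` give the same results as steps 2 and 3, while `statistics.pvariance` divides by N and is meant for a full population.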

Summary: sample statistics (x̄) vs. population parameters (µ)


Numerical Variables:
● When working with averages/means, we calculate a sample statistic x̄ from our
sample, and (if certain conditions are met) use that as the point estimate for the
(“real”, but unknown) population parameter µ.
○ Population parameter: The “true population mean” is denoted with the
Greek symbol µ, pronounced “mew”
○ Sample Statistic/point estimate: The “sample mean” is denoted with x̄,
pronounced “x-bar”.
○ Sample standard deviation: s
○ Population standard deviation: σ

Interquartile range (IQR) is another way to measure dispersion, and is conceptually
related to median
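A minimal sketch of the IQR in Python (not the course's R), with made-up data. The quartiles cut the sorted data into four equal parts, the second quartile is the median, and the IQR is the distance between the first and third quartiles. Software packages use slightly different quartile conventions, so the `method="inclusive"` choice below is an assumption, and other tools may report slightly different quartiles on other datasets.

```python
import statistics

# Hypothetical sample of 9 observations (made up for illustration).
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]

# Three cut points that split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                        # width of the middle 50% of the data
data_range = max(data) - min(data)   # width of all of the data

print(f"Q1={q1}, median={q2}, Q3={q3}")
print(f"IQR={iqr}, range={data_range}")
```

For this sample the quartiles are 4, 7, and 10, so the IQR is 6 while the range is 14.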
