Data Types

The document introduces data types, categorizing them into Quantitative (Continuous and Discrete) and Categorical (Ordinal and Nominal). It explains how to analyze these data types, focusing on measures of center (mean, median, mode) and the importance of understanding data types for effective data analysis. Additionally, it discusses notation used in statistics and provides examples of random variables and their outcomes.

Uploaded by

chala
Copyright
© © All Rights Reserved

Data Types

In this video, two data types are introduced: Quantitative and Categorical.

Quantitative data takes on numeric values that allow us to perform
mathematical operations (like the number of dogs).

Categorical is used to label a group or set of items (like dog breeds - Collies,
Labs, Poodles, etc.).

Categorical Ordinal vs. Categorical Nominal


We can divide categorical data further into two types: Ordinal and Nominal.

Categorical Ordinal data take on a ranked ordering (like a ranked interaction
on a scale from Very Poor to Very Good with the dogs).

Categorical Nominal data do not have an order or ranking (like the breeds of
the dog).

Continuous vs. Discrete


We can think of quantitative data as being either continuous or discrete.

Continuous data can be split into smaller and smaller units, and still a smaller
unit exists. An example of this is the age of the dog - we can measure the
units of the age in years, months, days, hours, seconds, but there are still
smaller units that could be associated with the age.

Discrete data only takes on countable values. The number of dogs we
interact with is an example of a discrete data type.

The table below summarizes our data types. To expand on the information in
the table, you can look through the text that follows.
Data Types

Quantitative:    Continuous                     Discrete
                 Height, Age, Income            Pages in a Book, Trees in Yard,
                                                Dogs at a Coffee Shop

Categorical:     Ordinal                        Nominal
                 Letter Grade, Survey Rating    Gender, Marital Status,
                                                Breakfast Items

Below is a little more detail of the information shared in the above table.

Another Look
To break down our data types, there are two main blocks:

Quantitative and Categorical

Quantitative can be further divided into Continuous or Discrete.

Categorical data can be divided into Ordinal or Nominal.

By now you should be able to say which of these four buckets a given piece of
real-world data falls into: Discrete, Continuous, Nominal, or Ordinal.
In the next sections, we will work through the numeric summaries that relate
specifically to quantitative variables.

Quantitative vs. Categorical


Some of these can be a bit tricky - notice even though zip codes are a
number, they aren’t really a quantitative variable. If we add two zip codes
together, we do not obtain any useful information from this new value.
Therefore, this is a categorical variable.

Height, Age, the Number of Pages in a Book, and Annual Income all take
on values that we can add, subtract and perform other operations with to gain
useful insight. Hence, these are quantitative.

Gender, Letter Grade, Breakfast Type, Marital Status, and Zip Code can
be thought of as labels for a group of items or individuals. Hence, these
are categorical.
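
As a rough illustration of this distinction, the sketch below averages a quantitative variable and counts a categorical one. The sample values and variable names are made up for illustration, and storing zip codes as text is one common convention rather than a rule from this lesson.

```python
from collections import Counter

# Hypothetical sample values, made up for illustration.
heights_cm = [150.0, 162.5, 171.0]        # quantitative: arithmetic is meaningful
zip_codes = ["94103", "10001", "94103"]   # categorical: stored as text labels

# Averaging heights gives a useful number.
mean_height = sum(heights_cm) / len(heights_cm)

# Zip codes are labels, so we count how often each occurs instead of adding them.
zip_counts = Counter(zip_codes)

print(round(mean_height, 2))
print(zip_counts.most_common())
```

Keeping zip codes as strings also makes it harder to add them together by accident.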

Continuous vs. Discrete


To consider if we have continuous or discrete data, we should see if we can
split our data into smaller and smaller units. Consider time - we could
measure an event in years, months, days, hours, minutes, or seconds, and
even at seconds we know there are smaller units we could measure time in.
Therefore, we know this data type is continuous. Height, age,
and income are all examples of continuous data. Alternatively, the number
of pages in a book, dogs I count outside a coffee shop, or trees in a
yard are discrete data. We would not want to split our dogs in half.

Ordinal vs. Nominal


In looking at categorical variables, we found Gender, Marital Status, Zip
Code, and your Breakfast items are nominal variables, where there is no
rank ordering associated with this type of data. Whether you ate cereal, toast,
eggs, or only coffee for breakfast, there is no rank ordering associated with
your breakfast.

Alternatively, Letter Grade and Survey Ratings have a rank ordering
associated with them, so they are ordinal data. If you receive an A, this is
higher than an A-. An A- is ranked higher than a B+, and so on. Ordinal variables
frequently occur on rating scales from very poor to very good. In many cases, we
turn these ordinal variables into numbers, as we can more easily analyze them,
but more on this later!

Final Words
In this section, we looked at the different data types we might work with in the
world around us. When we work with data in the real world, it might not be
very clean - sometimes there are typos or missing values. When this is the
case, simply having some expertise regarding the data and knowing the data
type can assist in our ability to ‘clean’ this data. Understanding data types can
also assist in our ability to build visuals to best explain the data. But more on
this very soon!

Introduction to Summary Statistics


In the next lessons, you will learn how to use statistics to describe quantitative data.
You'll gain insight into the process of how data is collected and how to answer questions
using your data. Throughout this lesson, you will learn to be critical of the analysis that
happens "under the hood" and understand what the numbers actually mean.

As an example of the analysis we do here at Udacity, we look at how long students take
to complete one of our courses or programs. We try to provide an estimate of the
number of hours or months that students will spend. One way to start is to report the
average amount of time it takes to complete a course. But that doesn't tell the whole
story because there will be differences in time spent depending on what students knew
before beginning the course.

The shortest time might be just a few weeks and the longest might be a couple of years.
What proportion of students finishes within two months and what proportion takes
longer than eight months?
Using a variety of measures gives a fuller picture. Measures of center give you an
idea of the average student. Measures of spread give you an idea of how students
differ. Visuals provide a more complete picture of how long it takes any student to
complete a course or program.

Note: at 1:13 in the video, the number in the equation is 45, not 48.

Analyzing Quantitative Data

Four Aspects for Quantitative Data


There are four main aspects to analyzing Quantitative data.

1. Measures of Center
2. Measures of Spread
3. The Shape of the data
4. Outliers

Analyzing Categorical Data


Though not discussed in the video, analyzing categorical data has fewer parts
to consider. Categorical data is analyzed usually by looking at the counts or
proportion of individuals that fall into each group. For example, if we were
looking at the breeds of the dogs, we would care about how many dogs are of
each breed, or what proportion of dogs are of each breed type.
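
A minimal sketch of this kind of summary, assuming a small made-up list of breed labels (the data and variable names are hypothetical):

```python
from collections import Counter

# Hypothetical observations of dog breeds (made-up data).
breeds = ["Collie", "Lab", "Poodle", "Lab", "Lab", "Collie"]

# Count how many dogs fall into each breed.
counts = Counter(breeds)

# Convert each count into a proportion of all dogs observed.
total = len(breeds)
proportions = {breed: count / total for breed, count in counts.items()}

for breed in sorted(counts):
    print(breed, counts[breed], round(proportions[breed], 2))
```

The proportions always add up to 1, which is a quick sanity check on the counts.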

Measures of Center
There are three measures of center:

1. Mean
2. Median
3. Mode

The Mean
In this video, we focused on the calculation of the mean. The mean is often
called the average or the expected value in mathematics. We calculate the
mean by adding all of our values together and dividing by the number of
values in our dataset.

The remaining measures of the median and mode will be discussed in detail
in the upcoming quizzes and videos.
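
The calculation of the mean described above can be sketched in a few lines. The function name and the sample data are illustrative assumptions, not part of the lesson.

```python
def mean(values):
    """Add all values together and divide by how many there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


# Hypothetical counts of dogs seen on five days.
dogs_seen = [5, 8, 3, 12, 2]
print(mean(dogs_seen))  # 6.0
```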

The Median
The median splits our data so that 50% of our values are lower and 50% are
higher. We found in this video that how we calculate the median depends on if
we have an even number of observations or an odd number of observations.

Median for Odd Values


If we have an odd number of observations, the median is simply the number
in the direct middle. For example, if we have 7 observations, the median is
the fourth value when our numbers are ordered from smallest to largest. If we
have 9 observations, the median is the fifth value.

Median for Even Values


If we have an even number of observations, the median is the average of
the two values in the middle. For example, if we have 8 observations, we
average the fourth and fifth values together when our numbers are ordered
from smallest to largest.

In order to compute the median, we MUST sort our values first.
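
The odd and even cases above can be sketched as one small function. This is an illustrative implementation with made-up data. Note that it sorts the values first, as required.

```python
def median(values):
    """Return the middle value (odd count) or the average of the two middle values (even count)."""
    if not values:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)          # the values MUST be sorted first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # e.g. the 4th of 7 values
    return (ordered[mid - 1] + ordered[mid]) / 2   # e.g. the 4th and 5th of 8 values


print(median([7, 1, 5, 3, 9, 11, 13]))      # 7
print(median([8, 2, 4, 6, 10, 12, 14, 16])) # 9.0
```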

Whether we use the mean or median to describe a dataset is largely
dependent on the shape of our dataset and if there are any outliers. We will
talk about this in just a bit!

The Mode
The mode is the most frequently observed value in our dataset.

There might be multiple modes for a particular dataset or no mode at all.

No Mode
If all observations in our dataset are observed with the same frequency, there
is no mode. If we have the dataset:

1, 1, 2, 2, 3, 3, 4, 4

There is no mode because all observations occur the same number of times.

Many Modes
If two (or more) numbers share the maximum frequency, then there is more than
one mode. If we have the dataset:

1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9

There are two modes, 3 and 6, because these values share the maximum
frequency of 3 times, while all other values only appear once.
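
Both cases can be checked with a short sketch. The function name is hypothetical, and the "no mode" rule follows the convention used in this lesson (all distinct values tied). Other references treat that case differently.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes, or an empty list when there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Lesson convention: if several distinct values all occur equally often, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 1, 2, 2, 3, 3, 4, 4]))                    # []
print(modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9]))     # [3, 6]
```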

Notation
Notation is a common language used to communicate mathematical
ideas. Think of notation as a universal language used by academic and
industry professionals to convey mathematical ideas. In the next videos,
you might see things that seem confusing. Use the quizzes to assist with your
understanding of the concepts.

You likely already know some notation. Plus, minus, multiply, division, and
equal signs all have mathematical symbols that you are likely familiar with.
Each of these symbols replaces an idea for how numbers interact with one
another. In the coming concepts, you will be introduced to some additional
ideas related to notation. Though you will not need to use notation to complete
the project, it does have the following properties:

1. Understanding how to correctly use notation makes you seem really
smart. Knowing how to read and write in notation is like learning a new language,
one that is used to convey ideas associated with mathematics.
2. It allows you to read documentation and apply an idea to your own
problem. Notation is used to convey how problems are solved all the time. One really
popular mathematical algorithm that is used to solve some of the world's most difficult
problems is known as Gradient Boosting. The way that it solves problems is explained
here. If you really want to understand how this algorithm works,
you need to be able to read and understand notation.
3. It makes ideas that are hard to say in words easier to convey. Sometimes we just
don't have the right words to say. For those situations, I prefer to use notation to convey
the message. Similar to the way an emoji or meme might convey a feeling better than
words, notation can convey an idea better than words. Usually, those ideas are
related to mathematics, but I am not here to stifle your creativity.

Example to Introduce Notation


There is a lot going on in this video - here is a recap of the big ideas.

Rows and Columns


If you aren't familiar with spreadsheets, this will be covered in detail in future
lessons. Spreadsheets are a common way to hold data. They are composed
of rows and columns. Rows run horizontally, while columns run vertically.
Each column in a spreadsheet commonly holds a specific variable, while
each row is commonly called an instance or individual.

The example used in the video is shown below.


Date       Day of Week    Time Spent On Site (X)    Buy (Y)
June 15    Thursday       5                         No
June 15    Thursday       10                        Yes
June 16    Friday         20                        Yes

This is a row:

Date       Day of Week    Time Spent On Site (X)    Buy (Y)
June 15    Thursday       5                         No

This is a column:

Time Spent On Site (X)
5
10
20

Before Collecting Data
Before collecting data, we usually start with a question, or multiple
questions, that we would like to answer. The purpose of data is to help
us in answering these questions.

Random Variables
A random variable is a placeholder for the possible values of some process
(mostly... the term 'some process' is a bit ambiguous). As was stated before,
notation is useful in that it helps us take complex ideas and simplify (often to a
single letter or single symbol). We see random variables represented by
capital letters (X, Y, or Z are common ways to represent a random variable).

We might have the random variable X, which is a holder for the possible
values of the amount of time someone spends on our site. Or the random
variable Y, which is a holder for the possible values of whether or not an
individual purchases a product.

X is 'a holder' of the values that could possibly occur for the amount of time
spent on our website. Any number from 0 to infinity really.

Example Dataset
An example of the data we might have collected in the previous video is
shown here:

Date       Day of Week    Time Spent On Site (X)    Buy (Y)
June 15    Thursday       5                         No
June 15    Thursday       10                        Yes
June 16    Friday         20                        Yes


Capital vs. Lower Case Letters
Random variables are represented by capital letters. Once we observe an outcome of
these random variables, we notate it as a lower case of the same letter.

Example 1
For example, the amount of time someone spends on our site is a random
variable (we are not sure what the outcome will be for any particular visitor), and we
would notate this with X. Then when the first person visits the website, if they spend 5
minutes, we have now observed this outcome of our random variable. We would notate
any outcome as a lowercase letter with a subscript associated with the order that we
observed the outcome.

If 5 individuals visit our website, the first spends 10 minutes, the second spends 20
minutes, the third spends 45 minutes, the fourth spends 12 minutes, and the fifth spends 8
minutes, we can notate this problem in the following way:

X is the amount of time an individual spends on the website.

$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$.

The capital X is associated with this idea of a random variable, while the observations
of the random variable take on lowercase x values.

Example 2
Taking this one step further, we could ask:

What is the probability someone spends more than 20 minutes on our website?

In notation, we would write:

P(X > 20)?

Here P stands for probability, while the parentheses encompass the statement for
which we would like to find the probability. Since X represents the amount of time spent
on the website, this notation represents the probability the amount of time on the
website is greater than 20.
We could find this in the above example by noticing that only one of the 5 observations
exceeds 20. So, we would say there is a 1 (the 45) in 5 or 20% chance that an
individual spends more than 20 minutes on our website (based on this dataset).

Example 3
If we asked: What is the probability of an individual spending 20 or more minutes
on our website? We could notate this as:

P(X ≥ 20)?

We could then find this by noticing there are two out of the five individuals that spent 20
or more minutes on the website. So this probability is 2 out of 5 or 40%.
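
The two probabilities from Examples 2 and 3 can be reproduced from the five observations with a short sketch. The helper name `proportion` is a hypothetical choice. It simply counts the observations that satisfy a condition and divides by the number of observations.

```python
# The five observed times (in minutes) from Example 1.
times = [10, 20, 45, 12, 8]


def proportion(values, condition):
    """Fraction of observations for which condition(value) is true."""
    return sum(1 for v in values if condition(v)) / len(values)


p_more_than_20 = proportion(times, lambda x: x > 20)    # P(X > 20)
p_at_least_20 = proportion(times, lambda x: x >= 20)    # P(X >= 20)

print(p_more_than_20)  # 0.2
print(p_at_least_20)   # 0.4
```

The only difference between the two results is whether the observation equal to 20 is counted.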

Suppose we have the following table:

Years Experience    Department    Part/Full-Time
5                   IT            Part-Time
10                  Finance       Full-Time
8                   HR            Full-Time
1                   Finance       Part-Time

Suppose we have the following labels:

$X$ = Years Experience
$Y$ = Department
$Z$ = Part/Full-Time

Match each of the following notations to its corresponding value:

A. $x_1$
B. $y_2$
C. $z_3$
D. $n$

Quiz Question
Use the information above to match each notation label (A-D) to its
corresponding value.

Possible values: Department, Full Time, Part Time, 4, 5, Years Experience, 16, Finance

Notation    Value
A.
B.
C.
D.

(Each letter refers to the notation with the same letter above.)

Notation for Calculating the Mean


We know that the mean is calculated as the sum of all our values divided by
the number of values in our dataset.

In our current notation, adding all of our values together can be extremely
tedious. If we want to add 3 values of some random variable together, we
would use the notation:

$x_1 + x_2 + x_3$

If we want to add 6 values together, we would use the notation:

$x_1 + x_2 + x_3 + x_4 + x_5 + x_6$

To extend this to add one hundred, one thousand, or one million values would
be ridiculous! How can we make this easier to communicate?!

Aggregations
An aggregation is a way to turn multiple numbers into fewer numbers
(commonly one number).

Summation is a common aggregation. The notation used to sum our values
is a Greek symbol called sigma, $\Sigma$.

Example 1
Imagine we are looking at the amount of time individuals spend on our
website. We collect data from nine individuals:

$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$, $x_6 = 12$,
$x_7 = 3$, $x_8 = 68$, $x_9 = 5$

If we want to sum the first three values together in our previous notation, we
write:

$x_1 + x_2 + x_3$

In our new notation, we can write:

$\sum_{i=1}^{3} x_i$

Notice, our notation starts at the first observation ($i = 1$) and ends at 3
(the number at the top of our summation).

So all of the following are equal to one another:

$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 10 + 20 + 45 = 75$

Example 2
Now, imagine we want to sum the last three values together.

$x_7 + x_8 + x_9$

In our new notation, we can write:

$\sum_{i=7}^{9} x_i$

Notice, our notation starts at the seventh observation ($i = 7$) and ends at
9 (the number at the top of our summation).
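
For readers who think in code, the start and end indices of a summation map onto a slice of a list. The sketch below uses the nine values from Example 1. The function name `summation` is a hypothetical helper, and it converts the 1-based indices of the notation to Python's 0-based indexing.

```python
# The nine observed times from Example 1 (x_1 through x_9).
x = [10, 20, 45, 12, 8, 12, 3, 68, 5]


def summation(values, start, end):
    """Sum x_start + ... + x_end, using the 1-based indices of the notation."""
    return sum(values[start - 1:end])


print(summation(x, 1, 3))  # 75
print(summation(x, 7, 9))  # 76
```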

Other Aggregations
The $\Sigma$ sign is used for aggregating using summation, and summing is one
of the most common ways to aggregate. However, we might need to aggregate in
other ways. If we wanted to multiply all of our values together, we would use
the product sign $\Pi$, the capital Greek letter pi. The way we aggregate
continuous values is with something known as integration (a common technique in
calculus), which uses the symbol $\int$, which is just a long s. We will
not be using integrals or products for quizzes in this class, but you may see
them in the future!

Final Steps for Calculating the Mean


To finalize our calculation of the mean, we introduce n as the total number of
values in our dataset. We can use this notation both at the top of our
summation and for the value that we divide by when calculating the
mean:

$\frac{1}{n}\sum_{i=1}^{n} x_i$

Instead of writing out all of the above, we commonly write $\bar{x}$ to represent
the mean of a dataset. As in the first video, we could use any
variable. Therefore, we might also write $\bar{y}$, or any other letter.

We also could index using any other letter, not just $i$. We could just as easily
use $j$, $k$, or $m$ to index each of our data values. The quizzes on the next
concept will help reinforce this idea.
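
As a worked instance of this notation, here is the mean of the nine website times from the summation example ($x_1 = 10$ through $x_9 = 5$), so $n = 9$. Rounding to two decimals is a presentation choice made here.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
        = \frac{1}{9}\,(10 + 20 + 45 + 12 + 8 + 12 + 3 + 68 + 5)
        = \frac{183}{9}
        \approx 20.33
```

Replacing the index $i$ with $j$ or $k$ would not change the result, since the index only labels which observation is being added.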

Notice
At second 0:12, this should say
$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$.
The $x_i$ is missing from the summation shown in the video.

Notation Recap
Notation is an essential tool for communicating mathematical ideas. We have
introduced the fundamentals of notation in this lesson that will allow you to
read, write, and communicate with others using your new skills!

Notation and Random Variables


As a quick recap, capital letters signify random variables. When we look
at individual instances of a particular random variable, we identify these
as lowercase letters with subscripts attached to each specific
observation.

For example, we might have X be the amount of time an individual spends on
our website. Our first visitor arrives and spends 10 minutes on our website,
and we would say $x_1$ is 10 minutes.

We might imagine the random variables as columns in our dataset, while a
particular value would be notated with the lowercase letters.

Notation                           English                                         Example

$X$                                A random variable                               Time spent on website

$x_1$                              First observed value of the random variable X   15 mins

$\sum_{i=1}^{n} x_i$               Sum values beginning at the first observation   5 + 2 + ... + 3
                                   and ending at the last

$\frac{1}{n}\sum_{i=1}^{n} x_i$    Sum values beginning at the first observation   (5 + 2 + 3)/3
                                   and ending at the last and divide by the
                                   number of observations (the mean)

$\bar{x}$                          Exactly the same as the above - the mean of     (5 + 2 + 3)/3
                                   our data.

Notation for the Mean


We took our notation even further by introducing the notation for
summation, $\sum$. Using this we were able to calculate the mean as:

$\frac{1}{n}\sum_{i=1}^{n} x_i$

In the next section, you will see this notation used to assist in your
understanding of calculating various measures of spread. Notation can take
time to fully grasp. Understanding notation not only helps in conveying
mathematical ideas but also in writing computer programs - if you decide you
want to learn that too! Soon you will analyze data using spreadsheets. When
that happens, many of these operations will be hidden by the functions you
will be using. But until we get to spreadsheets, it is important to understand
how mathematical ideas are commonly communicated. This isn't easy, but
you can do it!

Lesson Recap
This lesson covered some of the foundational statistical topics needed to use
statistics in practice. You can now:

• Evaluate data types and variable types
• Analyze measures of center
• Implement notation

What are Measures of Spread?

• Evaluate measures of spread
    • Range
    • Interquartile Range (IQR)
    • Standard Deviation
    • Variance
• Analyze outliers
• Evaluate descriptive and inferential statistics

Throughout this lesson, you will learn how to calculate these, as well as why
we would use one measure of spread over another.

Histograms
Histograms are super useful for understanding the different aspects of data
and they are the most common visual used for quantitative data. In the
upcoming concepts, you will see histograms used all the time to help you
understand the four aspects we outlined earlier regarding a quantitative
variable:

• center
• spread
• shape
• outliers

How are Histograms constructed?


First, we need to bin our data. Each bin represents a range of values in a
dataset. The number of values that fall in the range of each bin determines the
height of each histogram bar. As shown in the video above, changing the
range of our bins can result in slightly different visuals. However, there is no
right or wrong answer in choosing how to bin, and in most cases, the software
you use will choose the appropriate bins for you.
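
To make the binning step concrete, here is a minimal sketch that counts how many values fall into equal-width bins. The function name, the sample data, and the choice of half-open bins (left edge included, right edge excluded) are assumptions made for illustration. Plotting libraries may use slightly different edge rules.

```python
def bin_counts(values, start, width, num_bins):
    """Count how many values fall into each bin [left, left + width)."""
    counts = [0] * num_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < num_bins:      # ignore values outside the binned range
            counts[index] += 1
    return counts


# Hypothetical numbers of dogs seen on ten different days.
dogs_seen = [3, 7, 8, 12, 13, 13, 14, 18, 21, 24]

# Bins of width 5: 0-4, 5-9, 10-14, 15-19, 20-24
print(bin_counts(dogs_seen, start=0, width=5, num_bins=5))    # [1, 2, 4, 1, 2]

# Bins of width 10: 0-9, 10-19, 20-29
print(bin_counts(dogs_seen, start=0, width=10, num_bins=3))   # [3, 5, 2]
```

Each count is the height of one histogram bar, and the two calls show how the same data looks different when the bin width changes.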

Weekdays vs. Weekends


The two histograms below illustrate the number of dogs Josh saw on weekdays versus
weekends. The measures of center for both histograms (mean, median, mode) are basically
the same and centered about the highest bin for both histograms, 13.

Visually, the difference between the histograms is the range or spread of dogs Josh sees during
each time period. In the upcoming lessons, we will discuss the most common ways to measure
the spread of our data.

Five Number Summary

Calculating the 5 Number Summary


The five-number summary consists of 5 values:

1. Minimum: The smallest number in the dataset.
2. $Q_1$: The value such that 25% of the data fall below.
3. $Q_2$: The value such that 50% of the data fall below.
4. $Q_3$: The value such that 75% of the data fall below.
5. Maximum: The largest value in the dataset.

In the above video, we saw that calculating each of these values was
essentially just finding the median of a bunch of different datasets. Because
we are essentially calculating a bunch of medians, the calculation depends on
whether we have an odd or even number of values.

Range
The range is then calculated as the difference between the maximum and
the minimum.

IQR
The interquartile range is calculated as the difference
between $Q_3$ and $Q_1$.
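
The "bunch of medians" idea can be sketched as follows. This assumes one common convention: $Q_1$ is the median of the lower half and $Q_3$ the median of the upper half, with the overall median left out of both halves when the number of values is odd. Other software may use a different quartile rule and give slightly different answers. The function names and data are hypothetical.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using the median-of-halves convention."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd count, the middle value is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical dataset with 7 values.
minimum, q1, q2, q3, maximum = five_number_summary([1, 3, 5, 8, 10, 12, 15])
print(minimum, q1, q2, q3, maximum)   # 1 3 8 12 15
print(maximum - minimum)              # range: 14
print(q3 - q1)                        # IQR: 9
```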

In the upcoming sections, you will practice this with Katie and on your own.
Looking back at the histograms Josh created for the number of dogs he
recorded seeing on weekdays and weekends, we can use the histograms to
mark the values of the 5 number summary and create a box plot.

• Box plots are useful for quickly comparing the spread of two data sets across
some key metrics, like quartiles, maximum, and minimum.

How do we create the box plot?


1. The beginning of the line to the left of the box and the end of the line to the
right of the box represent the minimum and maximum values in a dataset.
2. The visual distance between these markings is an indication of the range of
the values.
3. The box itself represents the IQR. The box begins at the Q1 value, ends at the
Q3 value, and Q2, or the median, is represented by a line within the box.
4. From both the histograms and box plots, we can see that the number of
dogs seen on weekends varies much more than on weekdays.
5. However, instead of depending on a visual of the 5 number summary to
compare our data, in the next lesson, we will learn about using a single
value to compare the two distribution spreads - standard deviation.

Standard Deviation and Variance



The standard deviation is one of the most common measures for talking
about the spread of data. Informally, it can be thought of as the average
distance of each observation from the mean. Formally, it is the square root of
the average squared distance from the mean.

In the above video, we saw this as how far individuals were from the average
distance from work. The example distances shown in the video are a sample from
the full dataset: the mean of just those 4 numbers is 38.5, while the mean of 18
shown later in the video is the mean of the full dataset, which is not shown.
In the next video, you will see exactly how this is calculated.

Standard Deviation Calculation


Note: at 2:00 the 4 in $(14-10)^2 = 4 = 16$ should be squared. So it should be
$(14-10)^2 = 4^2 = 16$.

Example: Calculating the Standard Deviation


The dataset for the example is 10, 14, 10, 6.

1. First, calculate the mean:

$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{n} = \frac{40}{4} = 10$

2. Next, calculate the distance of each observation from the mean and square
the value:

$(x_i - \bar{x})^2$:
$(10 - 10)^2 = 0^2 = 0$
$(14 - 10)^2 = 4^2 = 16$
$(10 - 10)^2 = 0^2 = 0$
$(6 - 10)^2 = (-4)^2 = 16$

3. Then calculate the variance, the average squared difference of each
observation from the mean:

$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{4}(0 + 16 + 0 + 16) = \frac{32}{4} = 8$

4. Finally, calculate the standard deviation, the square root of the variance:

$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{8} \approx 2.83$

The standard deviation is, on average, how far each point in our dataset is from the
mean.
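
The four steps of this example translate almost line for line into code. This sketch divides by n, matching the formula used in this lesson. The variable names are illustrative.

```python
import math

data = [10, 14, 10, 6]
n = len(data)

# Step 1: the mean.
mean = sum(data) / n

# Step 2: squared distance of each observation from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: the variance is the average of those squared distances.
variance = sum(squared_diffs) / n

# Step 4: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, round(std_dev, 2))  # 10.0 8.0 2.83
```

Python's standard library offers `statistics.pvariance` and `statistics.pstdev`, which compute the same divide-by-n quantities.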

Other Measures of Spread

5 Number Summary
In the previous sections, we have seen how to calculate the values associated
with the five-number summary (min, $Q_1$, $Q_2$, $Q_3$, max), as well
as the measures of spread associated with these values (range and IQR).

For datasets that are not symmetric, the five-number summary and a
corresponding box plot are a great way to get started with understanding the
spread of your data. Although I still prefer a histogram in most cases, box
plots can make it easier to compare two or more groups. You will see this in
the quizzes towards the end of this lesson.

Variance and Standard Deviation
Two additional measures of spread that are used all the time are
the variance and standard deviation. At first glance, the variance and
standard deviation can seem overwhelming. If you do not understand the
expressions below, don't panic! In this section, I just want to give you an
overview of what the next sections will cover. We will walk through each of
these parts thoroughly in the next few sections, but the big picture goal is to
generally understand the following:

1. How the mean, variance, and standard deviation are calculated.
2. Why the measures of variance and standard deviation make sense to capture
the spread of our data.
3. Fields where you might see these values used.
4. Why we might use the standard deviation or variance as opposed to the
values associated with the 5 number summary for a particular dataset.

Calculation
We calculate the variance in the following way:

$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$

The variance is the average squared difference of each observation from
the mean.

To calculate the variance of a set of 10 values in a spreadsheet application,
with our 10 data points in column A, we would create a new column B by
typing in something like =A1-AVERAGE(A$1:A$10) and copying this down
for all 10 rows. This would find us the difference between each data point and
the mean of all the data. Then we create a new column C holding the
square of these differences, using the formula =B1^2 in cell C1, and copying
that down for all rows. Then in the cell below this new column, cell C11, type
in =SUM(C1:C10). This adds up all these values in column C. Finally, in cell
C12, we divide this sum by the number of data points we have, in this case,
ten: =C11/10. This cell C12 now contains the variance for our 10 data points.
More detailed guidance on using spreadsheets like this may be included in a
future lesson in your program.

The standard deviation is the square root of the variance. Therefore, the
formula for the standard deviation is the following:

$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

In the same spreadsheet as above, to find the standard deviation of our same
set of 10 data values, we would use another cell like C13 to take the square
root of our variance measure, by typing in =SQRT(C12).

The standard deviation is a measurement that has the same units as our
original data, while the units of the variance are the square of the units in our
original data. For example, if the units in our original data were dollars, then
units of the standard deviation would also be dollars, while the units of the
variance would be dollars squared.

Again, this section is designed as background knowledge for the
following sections. If it doesn't make sense on this first pass, do not worry.
You will be guided in future sections in performing these calculations, and
building your intuition, as you work through an example using the salary data.
Then we will provide context about why these calculations are important, and
where you might see them!

Standard deviation is a common metric used to compare the spread of two
datasets. The benefits of using a single metric instead of the 5 number
summary are:

• It simplifies the amount of information needed to give a measure of spread
• It is useful for inferential statistics

Important Final Points



1. The variance is used to compare the spread of two different groups. A set of data with
higher variance is more spread out than a dataset with lower variance. Be careful
though, there might just be an outlier (or outliers) that is increasing the variance when
most of the data are actually very close.
2. When comparing the spread between two datasets, the units of each must be the same.
3. When data are related to money or the economy, higher variance (or standard
deviation) is associated with higher risk.
4. The standard deviation is used more often in practice than the variance because it
shares the units of the original dataset.

Use in the World


The standard deviation is associated with risk in finance, assists in
determining the significance of drugs in medical studies, and measures the
error of our results for predicting anything from the amount of rainfall we can
expect tomorrow to your predicted commute time tomorrow.

These applications are beyond the scope of this lesson as they pertain to
specific fields, but know that understanding the spread of a particular set of
data is extremely important to many areas. In this lesson, you mastered the
calculation of the most common measures of spread.

Investment Data
Consider we have two investment opportunities:

Returns

               Year 1   Year 2   Year 3   Year 4   Year 5   Year 6
Investment 1     5%       5%       5%       5%       5%       5%
Investment 2    12%      -2%      10%       0%       7%       3%

The returns for 6 consecutive years for each investment are shown above.
Use this information to answer the questions below.

Investment Data
In the previous two questions, you should have found that these investments have the
same mean! That is, regardless of which investment opportunity you choose, you are
expected to earn the same amount. So how are they different? Let's look at some
additional questions to see if we can find some differences.

The same data as above is provided again (to minimize scrolling).

Returns

               Year 1   Year 2   Year 3   Year 4   Year 5   Year 6
Investment 1     5%       5%       5%       5%       5%       5%
Investment 2    12%      -2%      10%       0%       7%       3%

The returns for 6 consecutive years for each investment are shown above. Use this
information to answer the questions below.
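
One way to explore the difference between the two investments is to compute a measure of center and a measure of spread for each. The sketch below uses Python's standard `statistics` module with the returns from the table (in percent); `pstdev` divides by n, matching the formula used in this lesson. The variable names are only illustrative.

```python
import statistics

# Yearly returns in percent, taken from the table above.
investment_1 = [5, 5, 5, 5, 5, 5]
investment_2 = [12, -2, 10, 0, 7, 3]

for name, returns in [("Investment 1", investment_1),
                      ("Investment 2", investment_2)]:
    mean = statistics.mean(returns)
    # Population standard deviation (divides by n, not n - 1).
    spread = statistics.pstdev(returns)
    print(f"{name}: mean = {mean}%, standard deviation = {spread:.2f}%")
```

Both investments share the same mean return, but only the second has any spread around that mean.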

Measures of Center and Spread Summary


Recap

Variable Types
We have covered a lot up to this point! We started with identifying data types
as either categorical or quantitative. We then learned we could identify
quantitative variables as either continuous or discrete. We also found we
could identify categorical variables as either ordinal or nominal.

Categorical Variables
When analyzing categorical variables, we commonly just look at the count or
percent of a group that falls into each level of a category. For example, if we
had two levels of a dog category, lab and not lab, we might say 32% of
the dogs were labs (percent), or we might say 32 of the 100 dogs I saw were
labs (count).
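
As a small sketch of this kind of summary, the snippet below counts each level and converts the counts to percents. The list of observations is fabricated to match the 32-of-100 example; it is not real data.

```python
from collections import Counter

# Hypothetical observations: 32 labs among 100 dogs, as in the example above.
dogs = ["lab"] * 32 + ["not lab"] * 68

# Count of dogs at each level of the category.
counts = Counter(dogs)

# Percent of dogs at each level.
percents = {level: 100 * count / len(dogs) for level, count in counts.items()}

for level in counts:
    print(f"{level}: count = {counts[level]}, percent = {percents[level]}%")
```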

However, the 4 aspects associated with describing quantitative variables are
not used to describe categorical variables.

Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:

1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers

We looked at calculating measures of Center

1. Means
2. Medians
3. Modes

We also looked at calculating measures of Spread

1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance

Calculating Variance
We saw that we could calculate the variance as:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

You will also see:

$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The reason for this is beyond the scope of what we have covered thus far, but
you can find an explanation here.
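
To see how the two formulas differ in practice, here is a short sketch using Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1 (the version usually called the sample variance). The data values are made up for illustration.

```python
import statistics

# Hypothetical data: eight made-up values with a mean of 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Divides the sum of squared differences by n.
population_variance = statistics.pvariance(data)

# Divides the same sum by n - 1, so it is always a little larger.
sample_variance = statistics.variance(data)

print(f"divide by n:     {population_variance}")
print(f"divide by n - 1: {sample_variance:.3f}")
```

The gap between the two shrinks as the number of data points grows.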

You can commonly find answers to your questions with a quick Google
search. Now is a great time to get started with this
practice! This answer should make more sense at the completion of this
lesson.

Standard Deviation vs. Variance


The standard deviation is the square root of the variance. In practice, you
usually use the standard deviation rather than the variance. This is because
the standard deviation shares the same units with our original
data, while the variance has squared units.

Histograms
We learned how to build a histogram in this video, as this is the most popular visual for
quantitative data.

Shape
From a histogram, we can quickly identify the shape of our data, which influences
all of the measures we learned in the previous concepts. We learned that the
distribution of our data is frequently associated with one of the three shapes:

1. Right-skewed

2. Left-skewed

3. Symmetric (frequently normally distributed)

Summary

Shape                 Mean vs. Median            Real-World Applications
Symmetric (Normal)    Mean equals Median         Height, Weight, Errors, Precipitation
Right-skewed          Mean greater than Median   Amount of drug remaining in a bloodstream,
                                                 Time between phone calls at a call center,
                                                 Time until light bulb dies
Left-skewed           Mean less than Median      Grades as a percentage in many universities,
                                                 Age of death, Asset price changes

The mode of a distribution is essentially the tallest bar in a histogram. There may be
multiple modes depending on the number of peaks in our histogram.
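
The relationships in the summary table can be checked on small examples. The sketch below uses three made-up datasets (not real measurements) and Python's standard `statistics` module; `multimode` returns every value tied for most frequent, which covers the case of multiple modes.

```python
import statistics

# Hypothetical datasets chosen to show each shape.
symmetric = [1, 2, 3, 3, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]          # long tail to the right
left_skewed = [20, 70, 80, 85, 85, 90, 90, 90]    # long tail to the left

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    modes = statistics.multimode(data)
    print(f"{name}: mean = {mean}, median = {median}, mode(s) = {modes}")
```

In each skewed case the mean is pulled toward the long tail, while the median stays near the bulk of the data.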

The Shape For Data In The World


When working with data, building a quick plot lets you quickly see the shape
of your data.

Distribution Shape   Types of Data
Bell Shaped          Heights, Weight, Scores
Left Skewed          GPA, Age of Death, Price
Right Skewed         Distribution of Wealth, Athletic Abilities

Shape and Outliers



Outliers
We learned that outliers are points that fall very far from the rest of our data points.
Outliers influence measures like the mean and standard deviation much more than
measures associated with the five-number summary.

Identifying Outliers
There are a number of different techniques for identifying outliers. A full paper on this
topic is provided here. In general, I usually just look at a picture
and see if something looks suspicious!

Working With Outliers



Common Techniques
When outliers are present, we should consider the following points.

1. Note that they exist and their impact on summary statistics.

2. If an outlier is a typo, remove or fix it.

3. Understand why they exist, and their impact on the questions we are trying to answer
about our data.

4. Report the 5 number summary values, which are often a better indication than measures
like the mean and standard deviation when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.
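
To illustrate why the 5 number summary can be a better description when outliers are present, the sketch below compares summaries of a small dataset with and without one extreme value. The salary figures (in thousands of dollars) are invented for this example, and `five_number_summary` is a hypothetical helper; quartiles can be computed in several slightly different ways, and this uses the "inclusive" method from Python's `statistics.quantiles`.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

# Hypothetical salaries in thousands of dollars.
salaries = [40, 42, 45, 47, 50, 52, 55]
with_outlier = salaries + [400]   # one extreme value added

for name, data in [("without outlier", salaries),
                   ("with outlier", with_outlier)]:
    print(name)
    print(f"  mean = {statistics.mean(data):.1f}, "
          f"std dev = {statistics.pstdev(data):.1f}")
    print(f"  five number summary = {five_number_summary(data)}")
```

The mean and standard deviation change dramatically when the outlier is added, while the median and quartiles barely move; only the maximum reveals the extreme value.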

Outliers Advice
Below are my guidelines for working with any column (random variable) in
your dataset.

1. Plot your data to identify if you have outliers.

2. Handle outliers accordingly via the previous methods.

3. If no outliers and your data follow a normal distribution - use the mean and
standard deviation to describe your dataset, and report that the data are
normally distributed.
Quiz: Shape and Outliers (Comparing Distributions)

Image Summary
In the image below, we have three box plots. Each box plot is for a different Iris
flower: setosa, versicolor, or virginica. On the y-axis, we are given the
sepal length. Notice that virginica has an outlier towards the bottom of the plot.
Therefore, the minimum is not given by the bottom line here; rather, it is provided by
this point.

Box Plots of Sepal length for 3 Iris Flower Species


Quick Refresher: The measures of center and spread we can determine from a box
plot are as follows. Let's use setosa for these examples.

The median is the centerline inside the box and is 5.

The IQR is the distance between the first and third quartiles, which are the edges of the
box. They are about 4.8 for the first quartile and 5.2 for the third.
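
As a quick worked example, the IQR can be computed from the approximate quartile values read off the plot. The fences below use the common 1.5 × IQR convention for flagging outliers; that convention is an assumption here, since the lesson does not say how the plot's outlier was determined.

```python
# Approximate quartiles read off the setosa box plot described above.
q1 = 4.8
q3 = 5.2

# Interquartile range: the width of the box.
iqr = q3 - q1

# Assumption: the common 1.5 * IQR rule for marking points as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR is about {iqr:.1f}")
print(f"Points below {lower_fence:.1f} or above {upper_fence:.1f} would be flagged")
```
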
Descriptive Statistics Summary

Recap

Variable Types
We have covered a lot up to this point! We started with identifying data types as
either categorical or quantitative. We then learned we could identify quantitative
variables as either continuous or discrete. We also found we could identify
categorical variables as either ordinal or nominal.

Categorical Variables
When analyzing categorical variables, we commonly just look at the count or percent of
a group that falls into each level of a category. For example, if we had two levels of a
dog category, lab and not lab, we might say 32% of the dogs were labs (percent),
or we might say 32 of the 100 dogs I saw were labs (count).

However, the 4 aspects associated with describing quantitative variables are not used
to describe categorical variables.

Quantitative Variables
Then we learned there are four main aspects used to describe quantitative variables:

1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers

Measures of Center
We looked at calculating measures of Center

1. Means
2. Medians
3. Modes

Measures of Spread
We also looked at calculating measures of Spread

1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance

Shape
We learned that the distribution of our data is frequently associated with one of the
three shapes:

1. Right-skewed

2. Left-skewed

3. Symmetric (frequently normally distributed)


Depending on the shape associated with our dataset, certain measures of center or
spread may be better for summarizing our dataset.

When we have data that follows a normal distribution, we can completely understand
our dataset using the mean and standard deviation.

However, if our dataset is skewed, the 5 number summary (and measures of center
associated with it) might be better to summarize our dataset.

Outliers
We learned that outliers have a larger influence on measures like the mean than on
measures like the median. We learned that we should work with outliers on a
case-by-case basis. Common techniques include:

1. At least note that they exist and their impact on summary statistics.

2. If an outlier is a typo, remove or fix it.

3. Understand why they exist, and their impact on the questions we are trying to answer
about our data.

4. Report the 5 number summary values, which are often a better indication than measures
like the mean and standard deviation when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.

Histograms and Box Plots


We also looked at histograms and box plots to visualize our quantitative data.
Identifying outliers and the shape associated with the distribution of our data are easier
when using a visual as opposed to using summary statistics.
Descriptive vs. Inferential Statistics

In this section, we learned about how Inferential Statistics differs from Descriptive
Statistics.

Descriptive Statistics
Descriptive statistics is about describing our collected data.

Inferential Statistics
Inferential statistics is about using our collected data to draw conclusions
about a larger population.

We looked at specific examples that allowed us to identify the following:

1. Population - our entire group of interest
2. Parameter - numeric summary about a population
3. Sample - a subset of the population
4. Statistic - numeric summary about a sample

Descriptive vs. Inferential Statistics


In this section, we learned about how Inferential Statistics differs
from Descriptive Statistics.
Descriptive Statistics
Descriptive statistics is about describing our collected data using the
measures discussed throughout this lesson: measures of center, measures of
spread, the shape of our distribution, and outliers. We can also use plots of
our data to gain a better understanding.

Inferential Statistics
Inferential statistics is about using our collected data to draw
conclusions about a larger population. Performing inferential statistics well
requires that we take a sample that accurately represents our population of
interest.

A common way to collect data is via a survey. However, surveys may be
extremely biased depending on the types of questions that are asked, and the
way the questions are asked. This is a topic you should think about when
tackling the first project.

We looked at specific examples that allowed us to identify the following:

1. Population - our entire group of interest
2. Parameter - numeric summary about a population
3. Sample - a subset of the population
4. Statistic - numeric summary about a sample
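
The four terms above can be made concrete with a small simulation. Everything in this sketch is hypothetical: the "population" is 10,000 randomly generated heights in centimeters, and the sample size of 100 is an arbitrary choice.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Population: our entire group of interest (simulated heights in cm).
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameter: a numeric summary about the population.
parameter = statistics.mean(population)

# Sample: a subset of the population, chosen at random.
sample = random.sample(population, 100)

# Statistic: a numeric summary about the sample.
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

In inferential statistics we usually only have the sample, and we use the statistic to draw a conclusion about the unknown parameter.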

Looking Ahead
Though we will not be diving deep into inferential statistics within this course,
you are now aware of the difference between these two branches of statistics.
If you have ever conducted a hypothesis test or built a confidence interval,
you have performed inferential statistics. The way we perform inferential
statistics is changing as technology evolves. Many career paths
involving Machine Learning and Artificial Intelligence are aimed at using
collected data to draw conclusions about entire populations at an individual
level. It is an exciting time to be a part of this space, and you are now well on
your way to joining the other practitioners!

Lesson Review
Congratulations on completing this lesson on descriptive statistics. You
learned some foundational metrics for understanding data, including how to:

• Evaluate measures of spread
   ◦ Range
   ◦ Interquartile Range (IQR)
   ◦ Standard Deviation
   ◦ Variance
• Analyze outliers
• Evaluate descriptive and inferential statistics
