Data Types
In this video, two data types are introduced: Quantitative and Categorical.
Categorical data are used to label a group or set of items (like dog breeds:
Collies, Labs, Poodles, etc.).
Nominal categorical data do not have an order or ranking (like dog
breeds).
Continuous data can be split into smaller and smaller units, and still a smaller
unit exists. An example of this is the age of the dog - we can measure the
units of the age in years, months, days, hours, seconds, but there are still
smaller units that could be associated with the age.
Quantitative: Continuous, Discrete
Below is a little more detail of the information shared in the above table:
Another Look
To break down our data types, there are two main blocks:
You should have now mastered what types of data in the world around us fall
into each of these four buckets: Discrete, Continuous, Nominal, and Ordinal.
In the next sections, we will work through the numeric summaries that relate
specifically to quantitative variables.
Height, Age, the Number of Pages in a Book, and Annual Income all take
on values that we can add, subtract and perform other operations with to gain
useful insight. Hence, these are quantitative.
Gender, Letter Grade, Breakfast Type, Marital Status, and Zip Code can
be thought of as labels for a group of items or individuals. Hence, these
are categorical.
Final Words
In this section, we looked at the different data types we might work with in the
world around us. When we work with data in the real world, it might not be
very clean - sometimes there are typos or missing values. When this is the
case, simply having some expertise regarding the data and knowing the data
type can assist in our ability to ‘clean’ this data. Understanding data types can
also assist in our ability to build visuals to best explain the data. But more on
this very soon!
As an example of the analysis we do here at Udacity, we look at how long students take
to complete one of our courses or programs. We try to provide an estimate of the
number of hours or months that students will spend. One way to start is to report the
average amount of time it takes to complete a course. But that doesn't tell the whole
story because there will be differences in time spent depending on what students knew
before beginning the course.
The shortest time might be just a few weeks and the longest might be a couple of years.
What proportion of students finishes within two months and what proportion takes
longer than eight months?
Using a variety of measures gives a fuller picture. Measures of center give you an
idea of the average student. Measures of spread give you an idea of how students
differ. Visuals provide a more complete picture of how long it takes any student to
complete a course or program.
Note: at 1:13 in the video, the number in the equation is 45, not 48.
1. Measures of Center
2. Measures of Spread
3. The Shape of the data.
4. Outliers
Measures of Center
There are three measures of center:
1. Mean
2. Median
3. Mode
The Mean
In this video, we focused on the calculation of the mean. The mean is often
called the average or the expected value in mathematics. We calculate the
mean by adding all of our values together and dividing by the number of
values in our dataset.
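As a minimal sketch of this calculation in Python (the function name `mean` and the sample values are illustrative assumptions, not something from the video):

```python
def mean(values):
    """Add all values together and divide by the number of values."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


# Hypothetical sample: number of dogs seen on each of five days.
dogs_seen = [5, 3, 8, 5, 4]
print(mean(dogs_seen))  # 5.0
```

The sum here is 25 and there are 5 values, so the mean is 25 / 5 = 5.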
The remaining measures of the median and mode will be discussed in detail
in the upcoming quizzes and videos.
The Median
The median splits our data so that 50% of our values are lower and 50% are
higher. We found in this video that how we calculate the median depends on
whether we have an even number of observations or an odd number of observations.
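The odd and even cases can be sketched in Python as follows. This assumes the usual convention that, for an even number of observations, the median is the average of the two middle values; the function name and the sample values are illustrative.

```python
def median(values):
    """Return the middle value of the sorted data.

    Odd count: the single middle value.
    Even count: the average of the two middle values.
    """
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median requires at least one value")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2


print(median([5, 8, 3, 2, 1]))     # odd count  -> 3
print(median([5, 8, 3, 2, 1, 9]))  # even count -> 4.0
```

Note that the data must be sorted first; the middle position only means something once the values are in order.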
No Mode
If all observations in our dataset are observed with the same frequency, there
is no mode. If we have the dataset:
1, 1, 2, 2, 3, 3, 4, 4
There is no mode because all observations occur the same number of times.
Many Modes
If two (or more) values share the maximum frequency, then there is more than
one mode. If we have the dataset:
1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9
There are two modes, 3 and 6, because these values share the maximum
frequency of 3 occurrences, while all other values appear only once.
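Both situations (no mode and many modes) can be checked with a short Python sketch. The function name `modes` is an illustrative assumption, and the rule that a tie among all values means "no mode" follows the convention described above.

```python
from collections import Counter


def modes(values):
    """Return the most frequent values, or [] when every value ties."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # If all distinct values occur equally often, we report no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 1, 2, 2, 3, 3, 4, 4]))                 # []
print(modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9]))  # [3, 6]
```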
Notation
Notation is a common language used to communicate mathematical
ideas. Think of notation as a universal language used by academic and
industry professionals to convey mathematical ideas. In the next videos,
you might see things that seem confusing. Use the quizzes to assist with your
understanding of the concepts.
You likely already know some notation. Plus, minus, multiplication, division, and
equal signs are all mathematical symbols that you are likely familiar with.
Each of these symbols replaces an idea for how numbers interact with one
another. In the coming concepts, you will be introduced to some additional
ideas related to notation. Though you will not need to use notation to complete
the project, it does have the following properties:
June 15 Thursday 5 No
This is a row:
June 15 Thursday 5 No
This is a column:
10
20
Before Collecting Data
Before collecting data, we usually start with a question, or multiple
questions, that we would like to answer. The purpose of data is to help
us in answering these questions.
Random Variables
A random variable is a placeholder for the possible values of some process
(mostly... the term 'some process' is a bit ambiguous). As was stated before,
notation is useful in that it helps us take complex ideas and simplify (often to a
single letter or single symbol). We see random variables represented by
capital letters (X, Y, or Z are common ways to represent a random variable).
We might have the random variable X, which is a holder for the possible
values of the amount of time someone spends on our site. Or the random
variable Y, which is a holder for the possible values of whether or not an
individual purchases a product.
X is 'a holder' of the values that could possibly occur for the amount of time
spent on our website: any number from 0 to infinity, really.
Example Dataset
An example of the data we might have collected in the previous video is
shown here:
June 15 Thursday 5 No
Example 1
For example, the amount of time someone spends on our site is a random
variable (we are not sure what the outcome will be for any particular visitor), and we
would notate this with X. Then when the first person visits the website, if they spend 5
minutes, we have now observed this outcome of our random variable. We would notate
any outcome as a lowercase letter with a subscript associated with the order that we
observed the outcome.
If 5 individuals visit our website, and the first spends 10 minutes, the second spends
20 minutes, the third spends 45 minutes, the fourth spends 12 minutes, and the fifth
spends 8 minutes, we can notate this problem in the following way:
$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$
The capital X is associated with this idea of a random variable, while the observations
of the random variable take on lowercase x values.
Example 2
Taking this one step further, we could ask:
What is the probability someone spends more than 20 minutes on our website? We
could notate this as:
$P(X > 20)$
Here P stands for probability, while the parentheses encompass the statement for
which we would like to find the probability. Since X represents the amount of time spent
on the website, this notation represents the probability the amount of time on the
website is greater than 20.
We could find this in the above example by noticing that only one of the 5 observations
exceeds 20. So, we would say there is a 1 (the 45) in 5 or 20% chance that an
individual spends more than 20 minutes on our website (based on this dataset).
Example 3
If we asked: What is the probability of an individual spending 20 or more minutes
on our website? We could notate this as:
$P(X \geq 20)$
We could then find this by noticing there are two out of the five individuals that spent 20
or more minutes on the website. So this probability is 2 out of 5 or 40%.
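The two probabilities from Examples 2 and 3 can be reproduced with a short Python sketch. The helper name `proportion` is an illustrative assumption; the observations are the five visit times from Example 1.

```python
# Observed outcomes x_1..x_5 of the random variable X (minutes on site).
observations = [10, 20, 45, 12, 8]


def proportion(data, condition):
    """Fraction of observations for which condition(x) is true."""
    return sum(1 for x in data if condition(x)) / len(data)


p_more_than_20 = proportion(observations, lambda x: x > 20)   # P(X > 20)
p_at_least_20 = proportion(observations, lambda x: x >= 20)   # P(X >= 20)

print(p_more_than_20)  # 0.2
print(p_at_least_20)   # 0.4
```

The only difference between the two is whether the observation equal to 20 is counted, which is why the strict and non-strict inequalities give different answers on this dataset.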
Consider we have the following table:
5 IT Part-Time
10 Finance Full-Time
8 HR Full-Time
1 Finance Part-Time
$Y$ = Department
$Z$ = Part/Full-Time
B. $y_2$
C. $z_3$
D. $n$
Quiz Question
Use the information above to match the correct notation label to its
corresponding value.
Possible values: Department, Full Time, Part Time, 4, 5, Years Experience, 16,
Finance
Notation labels to match: A, B, C, D (each letter refers to the corresponding
notation above).
In our current notation, adding all of our values together can be extremely
tedious. If we want to add 3 values of some random variable together, we
would use the notation:
$x_1 + x_2 + x_3$
If we want to add 6 values together, we would use the notation:
$x_1 + x_2 + x_3 + x_4 + x_5 + x_6$
To extend this to add one hundred, one thousand, or one million values would
be ridiculous! How can we make this easier to communicate?!
Aggregations
An aggregation is a way to turn multiple numbers into fewer numbers
(commonly one number).
Example 1
Imagine we are looking at the amount of time individuals spend on our
website. We collect data from nine individuals:
If we want to sum the first three values together in our previous notation, we
write:
$x_1 + x_2 + x_3$
In our new notation, we can write:
$\sum_{i=1}^{3} x_i$.
Notice, our notation starts at the first observation ($i = 1$) and ends at 3
(the number at the top of our summation).
Example 2
Now, imagine we want to sum the last three values together.
$x_7 + x_8 + x_9$
In our new notation, we can write:
$\sum_{i=7}^{9} x_i$.
Notice, our notation starts at the seventh observation ($i = 7$) and ends at
9 (the number at the top of our summation).
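To see how the bounds of the summation map onto a computation, here is a small Python sketch. The nine data values and the helper name `summation` are hypothetical, chosen only for illustration.

```python
# Hypothetical minutes on site for nine visitors, x_1 through x_9.
x = [15, 10, 5, 20, 30, 25, 10, 45, 5]


def summation(data, start, end):
    """Sum x_start through x_end using the 1-based indices of the notation."""
    return sum(data[i - 1] for i in range(start, end + 1))


first_three = summation(x, 1, 3)  # i = 1 to 3: 15 + 10 + 5
last_three = summation(x, 7, 9)   # i = 7 to 9: 10 + 45 + 5

print(first_three)  # 30
print(last_three)   # 60
```

The `i - 1` is there because Python lists start counting at 0, while the notation starts counting observations at 1.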
Other Aggregations
The $\Sigma$ sign is used for aggregating by summation, and summing is one of
the most common ways to aggregate. However, we might need to aggregate in
other ways. If we wanted to multiply all of our values together, we would use a
product sign $\Pi$, the capital Greek letter pi. The way we aggregate continuous
values is with something known as integration (a common technique in
calculus), which uses the symbol $\int$, which is just a long s. We will
not be using integrals or products for quizzes in this class, but you may see
them in the future!
$\frac{1}{n}\sum_{i=1}^{n} x_i$
Instead of writing out all of the above, we commonly write $\bar{x}$ to represent
the mean of a dataset. As in the first video, though, we could use any
variable. Therefore, we might also write $\bar{y}$, or any other letter.
We could also index using any other letter, not just $i$. We could just as easily
use $j$, $k$, or $m$ to index each of our data values. The quizzes on the next
concept will help reinforce this idea.
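As a worked instance of this notation, here is the mean of the five visit times from Example 1 (10, 20, 45, 12, and 8 minutes), written out step by step:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
        = \frac{1}{5}\,(10 + 20 + 45 + 12 + 8)
        = \frac{95}{5}
        = 19
```

Swapping the index letter, say to $j$, would change nothing about the result.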
Notice
At second 0:12, this should say $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$.
The $x_i$ is missing here in front of the summation.
Notation Recap
Notation is an essential tool for communicating mathematical ideas. We have
introduced the fundamentals of notation in this lesson that will allow you to
read, write, and communicate with others using your new skills!
$X$: a random variable (for example, time spent on website)
$\frac{1}{n}\sum_{i=1}^{n} x_i$: the mean of a dataset
In the next section, you will see this notation used to assist in your
understanding of calculating various measures of spread. Notation can take
time to fully grasp. Understanding notation not only helps in conveying
mathematical ideas but also in writing computer programs - if you decide you
want to learn that too! Soon you will analyze data using spreadsheets. When
that happens, many of these operations will be hidden by the functions you
will be using. But until we get to spreadsheets, it is important to understand
how mathematical ideas are commonly communicated. This isn't easy, but
you can do it!
Lesson Recap
This lesson covered some of the foundational statistical topics needed to use
statistics in practice. You can now:
Throughout this lesson, you will learn how to calculate these, as well as why
we would use one measure of spread over another.
Histograms
Histograms are super useful for understanding the different aspects of data
and they are the most common visual used for quantitative data. In the
upcoming concepts, you will see histograms used all the time to help you
understand the four aspects we outlined earlier regarding a quantitative
variable:
center
spread
shape
outliers
Visually, the difference between the histograms is the range or spread of dogs Josh sees during
each time period. In the upcoming lessons, we will discuss the most common ways to measure
the spread of our data.
In the above video, we saw that calculating each of these values was
essentially just finding the median of a bunch of different datasets. Because
we are essentially calculating a bunch of medians, the calculation depends on
whether we have an odd or even number of values.
Range
The range is then calculated as the difference between the maximum and
the minimum.
IQR
The interquartile range is calculated as the difference
between $Q_3$ and $Q_1$.
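The range and IQR can be sketched in Python as below. This assumes one common convention for the quartiles: Q1 is the median of the lower half of the sorted data and Q3 is the median of the upper half, with the middle value left out of both halves when the count is odd. Other conventions exist, and the function name and data are illustrative.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves approach."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]
    # With an odd count, the middle value itself is left out of both halves.
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


# Hypothetical data with an odd number of values.
minimum, q1, q2, q3, maximum = five_number_summary([1, 5, 10, 3, 8, 12, 4])
print(maximum - minimum)  # range -> 11
print(q3 - q1)            # IQR   -> 7
```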
In the upcoming sections, you will practice this with Katie and on your own.
Looking back at the histograms Josh created for the number of dogs he
recorded seeing on weekdays and weekends, we can use the histograms to
mark the values of the 5 number summary and create a box plot.
Box plots are useful for quickly comparing the spread of two data sets across
some key metrics, like quartiles, maximum, and minimum.
In the above video, we saw this as how far individuals were from the average
distance from work. (The example distances shown are examples from the full
data set; the mean of just those 4 numbers is 38.5. The mean of 18 shown
later in the video is the mean of the full data set, which is not shown in the
video.) In the next video, you will see exactly how this is calculated.
1. First, calculate the mean:
$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{n} = \frac{40}{4} = 10$
2. Next, calculate the distance of each observation from the mean and square
the value:
$(x_i - \bar{x})^2 =$
$(10 - 10)^2 = 0^2 = 0$
$(14 - 10)^2 = 4^2 = 16$
$(10 - 10)^2 = 0^2 = 0$
$(6 - 10)^2 = (-4)^2 = 16$
3. Then calculate the variance, the average squared difference of each
observation from the mean:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{4}(0 + 16 + 0 + 16) = \frac{32}{4} = 8$
4. Finally, calculate the standard deviation, the square root of the variance:
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{8} \approx 2.83$
The standard deviation is, on average, how far each point in our dataset is from the
mean.
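The same steps can be checked with a few lines of Python. This is a minimal sketch using the 1/n form of the variance shown above; the function names are illustrative.

```python
import math


def variance(values):
    """Average squared difference of each observation from the mean (1/n form)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


data = [10, 14, 10, 6]  # the four observations from the worked example
print(variance(data))                      # 8.0
print(round(standard_deviation(data), 2))  # 2.83
```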
5 Number Summary
In the previous sections, we have seen how to calculate the values associated
with the five-number summary (min, $Q_1$, $Q_2$, $Q_3$, max), as well
as the measures of spread associated with these values (range and IQR).
For datasets that are not symmetric, the five-number summary and a
corresponding box plot are a great way to get started with understanding the
spread of your data. Although I still prefer a histogram in most cases, box
plots can make it easier to compare two or more groups. You will see this in
the quizzes towards the end of this lesson.
Variance and Standard Deviation
Two additional measures of spread that are used all the time are
the variance and standard deviation. At first glance, the variance and
standard deviation can seem overwhelming. If you do not understand the
expressions below, don't panic! In this section, I just want to give you an
overview of what the next sections will cover. We will walk through each of
these parts thoroughly in the next few sections, but the big picture goal is to
generally understand the following:
Calculation
We calculate the variance in the following way:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The variance is the average squared difference of each observation from
the mean.
The standard deviation is the square root of the variance. Therefore, the
formula for the standard deviation is the following:
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
In the same spreadsheet as above, to find the standard deviation of our same
set of 10 data values, we would use another cell like C13 to take the square
root of our variance measure, by typing in =sqrt(C12).
The standard deviation is a measurement that has the same units as our
original data, while the units of the variance are the square of the units in our
original data. For example, if the units in our original data were dollars, then
units of the standard deviation would also be dollars, while the units of the
variance would be dollars squared.
These applications are beyond the scope of this lesson as they pertain to
specific fields, but know that understanding the spread of a particular set of
data is extremely important to many areas. In this lesson, you mastered the
calculation of the most common measures of spread.
Investment Data
Consider we have two investment opportunities:
Returns
Investment 1 5% 5% 5% 5% 5% 5%
The returns for 6 consecutive years for each investment are shown above.
Use this information to answer the questions below.
Investment Data
In the previous two questions, you should have found that these investments have the
same mean! That is, regardless of which investment opportunity you choose, you are
expected to earn the same amount. So how are they different? Let's look at some
additional questions to see if we can find some differences.
Returns
Investment 1 5% 5% 5% 5% 5% 5%
The returns for 6 consecutive years for each investment are shown above. Use this
information to answer the questions below.
Variable Types
We have covered a lot up to this point! We started with identifying data types
as either categorical or quantitative. We then learned we could identify
quantitative variables as either continuous or discrete. We also found we
could identify categorical variables as either ordinal or nominal.
Categorical Variables
When analyzing categorical variables, we commonly just look at the count or
percent of a group that falls into each level of a category. For example,
suppose we had two levels of a dog category: lab and not lab. We might say 32%
of the dogs were labs (percent), or we might say 32 of the 100 dogs I saw were
labs (count).
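Counts and percents for a categorical variable can be sketched in Python like this. The list of sightings is hypothetical, built to match the 32-out-of-100 example.

```python
from collections import Counter

# Hypothetical sightings: 32 labs out of 100 dogs, as in the example above.
dogs = ["lab"] * 32 + ["not lab"] * 68

counts = Counter(dogs)
percents = {level: 100 * count / len(dogs) for level, count in counts.items()}

print(counts["lab"])    # 32
print(percents["lab"])  # 32.0
```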
Quantitative Variables
Then we learned there are four main aspects used to
describe quantitative variables:
1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers
We looked at calculating measures of Center
1. Means
2. Medians
3. Modes
We also looked at calculating measures of Spread
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance
Calculating Variance
We saw that we could calculate the variance as:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
You will also see:
$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The reason for this is beyond the scope of what we have covered thus far, but
you can find an explanation here.
You can commonly find answers to your questions with a quick Google
search. Now is a great time to get started with this
practice! This answer should make more sense at the completion of this
lesson.
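To see the numeric difference between the two formulas, here is a short sketch using Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The data reuse the four values 10, 14, 10, 6 from the earlier worked example.

```python
import statistics

data = [10, 14, 10, 6]  # four observations with mean 10

# Divides the summed squared differences (32) by n = 4.
population_variance = statistics.pvariance(data)
# Divides the same sum by n - 1 = 3, so the result is larger.
sample_variance = statistics.variance(data)

print(float(population_variance))  # 8.0
print(round(sample_variance, 2))   # 10.67
```

With only four values the gap is noticeable; as n grows, dividing by n or by n - 1 makes less and less difference.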
Histograms
We learned how to build a histogram in this video, as this is the most popular visual for
quantitative data.
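The counting step behind a histogram can be sketched in Python as below. This assumes equal-width bins that include their left edge and exclude their right edge; the function name, bin width, and data are all hypothetical.

```python
def histogram_counts(values, bin_width, start=0):
    """Count how many values fall in each bin [left_edge, left_edge + bin_width)."""
    counts = {}
    for value in values:
        index = (value - start) // bin_width
        left_edge = start + index * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical number of dogs seen per day over ten days.
dogs_per_day = [1, 3, 4, 4, 5, 7, 8, 8, 9, 12]
counts = histogram_counts(dogs_per_day, bin_width=5)

# Draw one '#' per observation in each bin as a rough text histogram.
for left_edge, count in counts.items():
    print(f"{left_edge} to {left_edge + 5}: {'#' * count}")
```

Each bar's height is simply the number of observations in that bin, which is why the choice of bin width changes how the shape looks.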
Shape
From a histogram, we can quickly identify the shape of our data, which influences
all of the measures we learned in the previous concepts. We learned that the
distribution of our data is frequently associated with one of the three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric (frequently normally distributed)
Summary
Shape | Mean vs. Median | Real-World Applications
Outliers
We learned that outliers are points that fall very far from the rest of our data points.
This influences measures like the mean and standard deviation much more than
measures associated with the five-number summary.
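A quick way to see this is to add one extreme value to a small dataset and compare how far the mean and the median move. The numbers below are hypothetical.

```python
from statistics import mean, median

# Hypothetical data: the second list adds one extreme value.
without_outlier = [10, 12, 14, 16, 18]
with_outlier = without_outlier + [100]

# The mean moves from 14 to 170 / 6 (about 28.3).
mean_shift = mean(with_outlier) - mean(without_outlier)
# The median moves from 14 to 15.
median_shift = median(with_outlier) - median(without_outlier)

print(mean_shift > median_shift)  # True
```

A single point pulls the mean up by more than 14 units here, while the median shifts by only 1.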
Identifying Outliers
There are a number of different techniques for identifying outliers. A full paper on this
topic is provided here. In general, I usually just look at a picture
and see if something looks suspicious!
Common Techniques
When outliers are present we should consider the following points.
3. Understanding why they exist, and the impact on questions we are trying to answer
about our data.
4. Reporting the 5 number summary values is often a better indication than measures
like the mean and standard deviation when we have outliers.
Outliers Advice
Below are my guidelines for working with any column (random variable) in
your dataset.
3. If no outliers and your data follow a normal distribution - use the mean and
standard deviation to describe your dataset, and report that the data are
normally distributed.
Quiz: Shape and Outliers (Comparing Distributions)
Image Summary
In the below image, we have three box plots. Each box plot is for a different Iris
flower: setosa, versicolor, or virginica. On the y-axis, we are given the
sepal length. Notice that virginica has an outlier towards the bottom of the plot.
Therefore, the minimum is not given by the bottom line here; rather, it is provided by
this point.
The IQR is the space between the first and third quartiles, which are the edges of the
box. They are about 4.8 for the first quartile and 5.2 for the third.
Descriptive Statistics Summary
Recap
Variable Types
We have covered a lot up to this point! We started with identifying data types as
either categorical or quantitative. We then learned we could identify quantitative
variables as either continuous or discrete. We also found we could identify
categorical variables as either ordinal or nominal.
Categorical Variables
When analyzing categorical variables, we commonly just look at the count or percent of
a group that falls into each level of a category. For example, suppose we had two
levels of a dog category: lab and not lab. We might say 32% of the dogs were labs
(percent), or we might say 32 of the 100 dogs I saw were labs (count).
However, the 4 aspects associated with describing quantitative variables are not used
to describe categorical variables.
Quantitative Variables
Then we learned there are four main aspects used to describe quantitative variables:
1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers
Measures of Center
We looked at calculating measures of Center
1. Means
2. Medians
3. Modes
Measures of Spread
We also looked at calculating measures of Spread
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance
Shape
We learned that the distribution of our data is frequently associated with one of the
three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric (frequently normally distributed)
When we have data that follows a normal distribution, we can completely understand
our dataset using the mean and standard deviation.
However, if our dataset is skewed, the 5 number summary (and measures of center
associated with it) might be better to summarize our dataset.
Outliers
We learned that outliers have a larger influence on measures like the mean than on
measures like the median. We learned that we should work with outliers on a situation
by situation basis. Common techniques include:
3. Understand why they exist, and the impact on questions we are trying to answer
about our data.
4. Reporting the 5 number summary values is often a better indication than measures
like the mean and standard deviation when we have outliers.
In this section, we learned about how Inferential Statistics differs from Descriptive
Statistics.
Descriptive Statistics
Descriptive statistics is about describing our collected data.
Inferential Statistics
Inferential statistics is about using our collected data to draw conclusions
about a larger population.
Inferential Statistics
Inferential statistics is about using our collected data to draw
conclusions about a larger population. Performing inferential statistics well
requires that we take a sample that accurately represents our population of
interest.
Looking Ahead
Though we will not be diving deep into inferential statistics within this course,
you are now aware of the difference between these two branches of statistics.
If you have ever conducted a hypothesis test or built a confidence interval,
you have performed inferential statistics. The way we perform inferential
statistics is changing as technology evolves. Many career paths
involving Machine Learning and Artificial Intelligence are aimed at using
collected data to draw conclusions about entire populations at an individual
level. It is an exciting time to be a part of this space, and you are now well on
your way to joining the other practitioners!
Lesson Review
Congratulations on completing this lesson on descriptive statistics. You
learned some foundational metrics for understanding data, including how to: