DS502/MA543
Welcome!
https://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon
DS502/MA543: Course Introduction
Welcome to
Statistical Methods for Data Science!
Prof. Randy Paffenroth
[email protected]
Worcester Polytechnic Institute
What are
“Statistical Methods for Data Science”?
Given a large mass of data, we can by judicious selection construct
perfectly plausible unassailable theories—all of which, some of which, or
none of which may be right.
- Paul Arnold Srere
Objectives for today
●
Discuss the course mechanics (syllabus, book,
grading, etc.)
●
Start to get inspired about statistical methods in data
science!
Basic course information
●
Course number: DS502/MA543
●
Course name: Statistical Methods for Data Science
●
When/where:
– Tuesdays from 6:00pm-8:50pm - HL218
Teaching Assistant/Grader
Office hours:
2-4pm on Wednesdays, AK123
Instructor information
●
Randy Paffenroth (a.k.a. “Dr. Paffenroth”, “Prof. Paffenroth”, or “Randy”)
●
Office location: AK124
Office hours: 4-5pm on Mondays and 1-2pm on Fridays
Other times are available by appointment, and walk-ins are always
welcome if I am around and not otherwise indisposed.
Also, I plan to come a half hour early to every class.
●
Best ways to contact me:
– WPI email: [email protected]
– Office phone: (508) 831-6562
●
I should be able to turn around email questions relatively quickly,
9am-5pm, Monday-Friday. My availability at night and on weekends is more
limited, and I check my email far less frequently then, but you
should feel free to try to contact me.
But, who am I really?
●
I have two bachelor’s degrees, one in Math and one in CS, from
Boston University.
●
I have a Ph.D. in applied mathematics from University of Maryland.
●
Before coming to WPI I was a Program Director at a small
company (50 people), and before that I worked at the California
Institute of Technology and another small company (3 people!).
High level course goals and learning objectives
By the end of the class you should be able to:
●
Use tools such as Linear Regression, Logistic Regression, and Trees for
making predictions from data. “Hey boss, let’s bet the farm on advertising our
widget on television on Thursday nights.”
●
Explain the pros and cons of various approaches. “I choose linear regression
since our spending and sales are well represented by quantitative variables, and the
fit looks roughly linear.”
●
Avoid common pitfalls such as over-fitting and data snooping. “Don’t worry, I
was trained in the WPI Data Science program. You can trust me!”
●
Given a prediction generated from such a method, be able to assess the
validity of the prediction. “I am pretty sure that things will work out as predicted.
The variance of the prediction is small on the testing data and the model fit looks
good.”
●
Diagnose what can go wrong with a prediction. “Except, when the Patriots are
playing on Thursday night football, when I believe that we are having issues with
outliers messing up the linear regression.”
Math background for the course
●
The recommended background for the course is
statistics at the level of MA 2611 and MA 2612 and
linear algebra at the level of MA 2071.
Math background for the course
●
You will also need to know some probability and
statistics
– Random variables (what they represent, etc.)
– Descriptive statistics (mean, variance, etc.)
– Hypothesis testing
– Estimation and prediction
– etc.
●
We will spend some time going over these!
Math background for the course
●
In particular, you will need to know some linear
algebra:
– Vectors (that they can represent points in space, column vs. row, etc.)
– Matrices (transposes, that they don’t commute, etc.)
– Inner products
– How to solve linear systems
– etc.
●
We will spend some time going over these as well!
Let me mention...
●
There is a new class
being offered this
year.
●
DS517
●
It is meant to be a
preparation class for
DS502/MA543.
Computing background for the course
●
You will need to be able to get your hands dirty playing with,
processing, and plotting data using the R computer language!
– The textbook uses R,
– the homework uses R,
– and that will be the officially supported language for the course,
and all lecture examples will be in R.
●
Now, with that being said, this is not intended to be a
programming course (i.e., your code will not be graded), but
actually working with data will be extremely important (i.e., the
results of the code will be graded)!
R
●
R itself can be found at:
– http://cran.r-project.org
●
I also highly recommend the RStudio front end. It makes
developing R code much easier. It can be found at:
– http://www.rstudio.com
●
Note, RStudio requires that you have R itself already installed
(so you have to access both of the web pages above).
●
Good place to start:
– Learning R: A Step-by-Step Function Guide to Data Analysis, by
Richard Cotton, O'Reilly Media, September 2013
– Available for free from the library.
Textbook
●
An Introduction to
Statistical Learning
– Gareth James, Daniela Witten,
Trevor Hastie, Robert
Tibshirani
●
If you have access to the
WPI library then a PDF of the
book can be downloaded for
free from Springer. Just
search for the title at the WPI
library web page and then
click on the ebook version.
Recommended texts
●
Other texts that would be useful for the course are:
– Linear Algebra and Its Applications, by David Lay. This has been used as the textbook
for MA2071 (one of the requirements for the course).
– Applied Statistics for Engineers and Scientists, by Joseph Petruccelli, Balgobin
Nandram, and Minghui Chen. This has been the textbook for MA2611 and MA2612
(the other requirement for the course).
– The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor
Hastie, Robert Tibshirani, and Jerome Friedman. This is the “big brother” of our
textbook, and a great resource that covers a lot of interesting material.
– Learning From Data, by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien
Lin. This book is used in the Caltech “Learning from Data” course and does a great job
covering things like cross-validation and VC dimension.
– Learning R: A Step-by-Step Function Guide to Data Analysis, by Richard Cotton,
O'Reilly Media, September 2013.
Course activities
●
Lectures: The lectures and in-class discussions are an important part of
the course.
– The base lecture notes will be posted on the class web page before each class and any
annotations made during class will be posted afterwards.
– The lectures will be video recorded and posted as well.
●
Home-works: Doing the home-work is how you get experience solving
interesting problems.
– The home-work assignments are to be done in teams of up to two; see the syllabus for details of
the collaboration policy.
●
Class project: A team-based exercise that simulates how problems
really get solved in research and business settings.
– The class project is an important way to get experience solving real-world problems.
●
Exams: There will be a midterm and final exam.
– The final exam will be non-cumulative (i.e., the midterm exam will cover roughly the first half of the
course and the final exam will cover roughly the second half of the course).
Course requirements and grading standards
Home-works (5 assignments)   20%
Midterm exam                 20%
Final project                30%
Final exam                   30%
The midterm exam and final exam will be in class, cumulative, and open note, but no collaboration
will be allowed and the exams will be graded based upon demonstrated understanding of key
concepts. For each exam, you are allowed to bring in up to two (2) 8 ½ by 11 sheets of paper
(either printed or handwritten) with whatever notes you want for the exam. The homework
problems will be performed in groups of at most two and will be graded for demonstrated
understanding of key concepts and quality of presentation. You can choose your own teammate,
but team changes will need to be approved by Prof. Paffenroth. The final project will be performed
in groups of 3-5 and will be graded based upon the quality and completeness of a final
presentation and final report.
I reserve the right to curve the final grades (either up or down) based upon the aggregate
performance of the class.
Advice on doing the home-works
●
Make sure your home-works are clearly written and
stand alone.
– Do not refer to your code for plots or answers to questions.
– Make sure that everything appears in your homework writeup!
●
Make sure it is clear where each part of each
question is answered.
– Don't make the grader guess which part of your writeup
answers which part of each question!
Advice on doing the home-works
●
To say it again
– THE HOMEWORK REPORT MUST
BE STAND ALONE!
– Neither the TA, nor I, should have to
run any code to grade your
homework.
– If we do, then you will not get full
credit.
Important dates
                       Assigned       Due
Homework 1             September 4    September 18
Homework 2             September 18   October 2
Homework 3             October 2      October 23
Homework 4             October 23     November 6
Homework 5             November 6     November 20
Midterm exam                          October 9
Project proposal due                  October 30
Project reports due                   November 27
Final exam                            December 4
Collaboration and Academic Honesty Policy
●
Collaboration is prohibited on the exams. Collaboration
is encouraged on homeworks and the final project.
Homeworks will be conducted in teams of one or two.
You will also be allowed to select your own teams of 3-5
for the final project. On homeworks you may discuss
problems across teams, but each homework team is
responsible for generating solutions and writing up
results on their own from scratch. On the final project,
each of the teams will be using their own data sets, but
the same collaboration policy applies. All violations of
the collaboration policy will be handled in accordance
with the WPI Academic Honesty Policy.
Collaboration and Academic Honesty Policy
●
As examples, each of the following would be a violation of the collaboration
policy (this list is not exhaustive):
– Two different homework teams share a solution to any assigned problem.
– One homework or project team allows another homework or project team to copy any part of a solution to
an assigned problem.
– Any code or plots are shared between homework or project teams.
●
As examples, each of the following would not be a violation of the
collaboration policy:
– Students within a team sharing solutions and code for a problem.
– Students from different teams discussing an assignment at the level of goals, where ideas for solutions
can be found in the book or notes, what parts are more challenging, or how one might approach the
problem.
●
Of course, you can ask Prof. Paffenroth or the TA any questions you like, show
them code, etc.
●
If there is any doubt as to what is allowed and what is not allowed, please just
ask!
Student responsibilities and course policies
●
Accommodation for Special Needs or Disabilities
– If you need course adaptations or accommodations because of a
disability, or if you have medical information to share with me, please
make an appointment with me as soon as possible. Students with
disabilities who believe that they may need accommodations in this
class are encouraged to contact the Office of Disability Services as
soon as possible, if they have not already done so, to ensure that such
accommodations are implemented in a timely fashion. This office is
located in the West St. House (157 West St.), (508) 831-4908.
●
Accommodation for Religious Observance
– Students requiring accommodation for religious observance must make
alternate arrangements with Prof. Paffenroth at least one week before
the date in question.
Student responsibilities and course policies
●
Personal Emergencies
– In the event of a medical or family emergency, please contact Prof.
Paffenroth to work out appropriate accommodations.
●
Make-up Exam Policy
– Make-up exams will only be allowed in the event of a documented
emergency or religious observance. The exam dates are listed on the
syllabus and you are responsible for avoiding conflicts with the exams.
Student responsibilities and course policies
●
Late Assignment Policy
– In general, late assignments will either not be accepted or, at best, be
heavily penalized (50% of possible points). If an emergency arises or
you know in advance about a conflict please let Prof. Paffenroth know
as soon as possible.
Dr. Paffenroth's keys to success!
●
Read the book!
– Reading the appropriate chapter before each class will make you much
better prepared for the lectures.
– You are responsible for the book and the lectures... and it is not my
intention to cover every part of the book.
●
Do the homework!
– The home-works are perhaps the most important part of the learning
experience in the class.
– You need to get “your hands dirty”!
●
Attend the lectures and ask questions!
– There are no dumb questions, and I can guarantee you that any
question you want to ask will help your classmates as much as you.
– I will also post the annotated slides after each class on canvas.wpi.edu
Course topics
●
Linear Regression: A workhorse of statistical learning.
●
Classification: Categorical variables (“red”, “blue”, “tall”, “short”, etc.)
●
Resampling: How do you live in a world with finite (though perhaps
lots of) data?
●
Model Selection and Regularization: How do you avoid the hobgoblin
of statistical learning... over-fitting!?
●
Dimension Reduction: One of my favorite parts of statistical learning!
●
Non-linear Methods: My second favorite part of statistical learning,
especially when combined with dimension reduction.
●
Tree Methods: Another workhorse of statistical learning.
●
Support Vector Machines: A famous algorithm used for classification
problems (as well as in many other interesting areas).
●
Unsupervised learning: This is where things get kind of crazy but, in
my opinion, particularly beautiful.
Course syllabus available online, along with
a lot of other stuff…
http://canvas.wpi.edu
Student responsibilities and course policies
Fairness!
http://www.whaleoil.co.nz/wp-
content/uploads/2015/09/cat-
begging.jpg
What is Data Science?
[Venn diagram: Computer Science Department, Mathematical Sciences Department, and
School of Business, with Data Science at the center and a "?" in each pairwise overlap]
●
Based upon Drew Conway's Data Science Venn Diagram
●
http://en.wikipedia.org/wiki/Data_science
●
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Which is most important?
[Venn diagram: Computer Science Department, Mathematical Sciences Department, and
School of Business, with Data Science at the center and a "?" in each pairwise overlap]
http://en.wikipedia.org/wiki/View_of_the_World_from_9th_Avenue
Big Data
●
http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
So, there is one thing that I really want to stress.
●
People often talk about the three V's of Big Data
(Volume, Velocity, and Variety). Statistics is
very important to Data Science since it supports
the 4th V!
●
Veracity:
●
OK, you have made a prediction. Do you bet the farm (or your job)
on it?
●
Or, maybe you do an *experiment* to see if the predictions you
are making are correct?
●
Is one experiment enough to bet the farm?
●
Is a million?
●
How do you think critically about data, and all the things that go
along with it?
Statistical rigor is necessary to justify the inferential leap from data
to knowledge, and many difficulties arise in attempting to bring
statistical principles to bear on massive data. Overlooking this
foundation may yield results that are not useful at best, or harmful
at worst. In any discussion of massive data and inference, it is
essential to be aware that it is quite possible to turn data into
something resembling knowledge when actually it is not.
Moreover, it can be quite difficult to know that this has happened.
- Frontiers in Massive Data Analysis, National Research Council of the National Academies
What do Big Data and Data Science
have to do with each other?
●
Are Big Data and Data Science the same thing?
●
I wouldn't say so...
●
Data Science can be done on small data sets.
●
And not everything done using Big Data would
necessarily be called Data Science.
●
But there certainly is a substantial overlap!
[Venn diagram: two overlapping circles labeled "Big Data" and "Data Science"]
But... what does getting "knowledge"
from data really mean? Are we
searching for causality?
“Causation: The relation between
mosquitoes and mosquito bites.
Easily understood by both parties
but never satisfactorily defined by
philosophers and scientists.”
- http://freshspectrum.com/causation/ Michael Scriven, Evaluation
Thesaurus, 1991
“Most strikingly, society will need
to shed some of its obsession for
causality in exchange for simple
correlations: not knowing why but
only what.”
- Big Data: A Revolution That Will Transform How We Live, Work, and
Think, Viktor Mayer-Schönberger and Kenneth Cukier.
Can you even be certain?
●
For real-world problems, I
claim that you will never be
certain of any inferences from
data.
●
I mean, what happens to your
carefully thought out marketing plan
for some rocking slacks when the
Martians land?
●
What is unacceptable is when
the data you actually have
does not support the
conclusion you report.
"Amazing Stories 1927 08" by Frank R. Paul -
http://dc-mrg.english.ucsb.edu/WarnerTeach/E192/assignments.html.
Transferred from en.wikipedia by User:Hyju. The 1500 by 2000
pixel photograph of this magazine cover was created by Karl
Bunker, en:User:RedSpruce, in May 2006. Licensed under
Public Domain via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Amazing_Stories_1927_08.jpg#mediaviewer/File:Amazing_Stories_1927_08.jpg
It can be easy to fool yourself!
Human beings are really good at pattern detection...
Perhaps a bit too good!
http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)
It can be easy to fool yourself!
"Martian face viking" by Viking 1, NASA - Viking 1 Orbiter, image
F035A72 (Viking CD-ROM Volume 10)
http://photojournal.jpl.nasa.gov/catalog/PIA01141
raw .imq data -
ftp://pdsimage2.wr.usgs.gov/data/.cdroms2/viking_orbiter/vo_1010/f035axx/f035a72.imq
raw data directly converted to .gif -
http://www.solarviews.com/cap/face/035a72.htm. Licensed under
Public Domain via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:Martian_face_viking.jpg#mediaviewer/File:Martian_face_viking.jpg
Math warmup
http://cdn.gmotors.co.uk/news/wp-content/uploads/2016/09/photo-1421218108559-eb1ff78357f5-620x349.jpg
Did I mention... :-)
●
There is a new class
being offered this
year.
●
DS517
●
It is meant to be a
preparation class for
DS502/MA543.
Whole course in one slide!
The Bias-Variance Trade-off
What is the relationship of interest?
Our estimate of that relationship...
Note, the function of interest will often have dials...
Think of it like a machine...
[Figure labels: Cat, Dog, Dog, Cat, ????]
Some nomenclature...
●
Input variables
●
Examples: TV, radio, and newspaper spending
●
Names: Predictors, input, independent variables, or
features
●
Output variables
●
Example: sales
●
Names: Response, output, or dependent variable
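As a rough sketch of this nomenclature in R: the data below are simulated and the coefficients are made up purely for illustration (this is not the textbook's advertising data, and the data frame name `ads` is hypothetical). In the model formula, the response goes on the left of the `~` and the predictors go on the right.

```r
set.seed(7)
n <- 200

# Simulated, made-up advertising budgets: the inputs / predictors / features
ads <- data.frame(
  TV        = runif(n, 0, 300),
  radio     = runif(n, 0, 50),
  newspaper = runif(n, 0, 100)
)

# Made-up response / output / dependent variable:
# here sales depends on TV and radio (not newspaper) plus random noise
ads$sales <- 3 + 0.05 * ads$TV + 0.2 * ads$radio + rnorm(n, sd = 1)

# response ~ predictors
fit <- lm(sales ~ TV + radio + newspaper, data = ads)
print(coef(fit))
```

Because the data were generated with known coefficients, the fitted values can be compared against them, which is a luxury you never have with real data.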
The essence of "supervised"
learning...
[Figure labels: Dog, Cat, ????]
Ok, so what is "unsupervised"
learning?
Classification vs. regression
Probability density functions and
continuous random variables
Did I mention... :-)
●
There is a new class
being offered this
year.
●
DS517
●
It is meant to be a
preparation class for
DS502/MA543.
Bourbaki's Dangerous Bend...
Nicolas Bourbaki… the most
famous mathematician who
never existed!
https://en.wikipedia.org/wiki/Nicolas_Bourbaki
What is the “most common” PDF?
Central Limit Theorem
https://en.wikipedia.org/wiki/Central_limit_theorem
We actually have central limit theorems
●
There is actually a whole family of central limit
theorems
– They all depend on how “fat” the tails of a distribution are
– This comes up in some very important problems...
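A small simulation can make the classical central limit theorem concrete. This is only a sketch under assumed settings (an exponential distribution with rate 1, samples of size 50, and 10000 repetitions, all chosen arbitrarily here). Even though the exponential distribution is skewed, the sample means should pile up in a roughly Gaussian shape around the true mean of 1, with standard deviation near 1/sqrt(n).

```r
set.seed(1)
n    <- 50      # size of each sample (assumed for illustration)
reps <- 10000   # number of repeated samples (assumed for illustration)

# Draw many samples from a skewed distribution and record each sample mean
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The classical CLT suggests these are close to 1 and 1/sqrt(n), respectively
print(mean(sample_means))
print(sd(sample_means))
print(1 / sqrt(n))

# Compare the histogram of sample means with the matching Gaussian density
hist(sample_means, breaks = 50, freq = FALSE,
     main = "Means of exponential samples", xlab = "sample mean")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE)
```

The exponential distribution has thin enough tails for the classical theorem to apply; a very fat-tailed distribution would need one of the other members of the family.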
How to compute the maximum likelihood
Gaussian for any (finite) data-set
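For reference, the standard result for i.i.d. data $x_1, \dots, x_n$ modeled by a univariate Gaussian is given below. The notation is an assumption, since the slide's own derivation is not reproduced in this text.

```latex
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \hat{\mu}\right)^2
```

Note that the maximum likelihood variance divides by $n$, whereas R's `var()` divides by $n-1$.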
Bayes classifier
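In the notation of the course textbook (an assumption here, as the slide body is not reproduced in this text), the Bayes classifier assigns a test observation $x_0$ to the class with the largest conditional probability:

```latex
\hat{y}(x_0) = \arg\max_{j} \; \Pr\left(Y = j \mid X = x_0\right)
```

In practice the conditional probabilities are unknown and must be estimated from data, so the Bayes classifier serves as an idealized benchmark.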
But, how do you actually compute
this for regression?
Suppose your job was on the line...
how do you start?
Model Memorize
Parametric models.
Math! Some warm up...
Let's prove:
Back to MSE... irreducible errors
Bias-variance trade-off!
Variance Bias
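Written out, the standard decomposition of the expected test MSE at a point $x_0$ reads as follows. This assumes the usual model $y = f(x) + \epsilon$ with $E[\epsilon] = 0$ and noise independent of the fitted $\hat{f}$, and it uses the textbook's notation rather than anything recovered from the slide itself.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\epsilon)
```

The last term is the irreducible error: no choice of $\hat{f}$ can remove it.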
Did I mention... :-)
●
There is a new class
being offered this
year.
●
DS517
●
It is meant to be a
preparation class for
DS502/MA543.
The worst sin... overfitting!
Let's see an example
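One possible example, offered as a sketch and not necessarily the one shown in lecture: fit polynomials of increasing degree to a small noisy training set and compare the training error with the error on fresh test data. The "true" function, the noise level, the sample sizes, and the degrees below are all assumptions chosen for illustration.

```r
set.seed(42)

# Hypothetical "true" relationship and a noisy data generator
f <- function(x) sin(2 * pi * x)
make_data <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.3))
}

train <- make_data(30)     # small training set
test  <- make_data(1000)   # fresh data the models never saw

# Mean squared error of a fitted model on a data set
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

# Fit polynomials of increasing flexibility to the same training data
degrees <- c(1, 3, 9, 15)
results <- t(sapply(degrees, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = train)
  c(degree = deg, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(results)
```

The training MSE can only go down as the degree grows, since each model contains the previous ones. The test MSE is the number to watch: when it rises while the training MSE keeps falling, the model is fitting noise.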