What is Data Science?
Girl Develop It! Meetup
Renée M. P. Teate, March 2015
Let’s start with: “What is Data?”
http://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcS9dKu3_Tzi-sWW-yAqee5y0EhuvoIZNSya_rAKnuBBd0JYxPX7pw
http://www.freefoto.com/images/1351/06/1351_06_2---Books--Shakespeare-and-Company-Bookstore--The-Latin-Quarter--Paris_web.jpg
http://fc01.deviantart.net/fs71/i/2012/326/3/4/cute_dog_by_thomasmeadows345-d5lsah9.jpg
http://upload.wikimedia.org/wikipedia/commons/9/96/Bill_Nye,_Barack_Obama_and_Neil_deGrasse_Tyson_selfie_2014.jpg
https://c2.staticflickr.com/4/3273/3017878633_65beb1c7d6.jpg
https://c1.staticflickr.com/1/2/1349370_0703fce74c.jpg
http://upload.wikimedia.org/wikipedia/commons/e/e4/Green_Bank_100m_diameter_Radio_Telescope.jpg
Around 100 hours of video are uploaded to YouTube every minute;
it would take about 15 years to watch every video uploaded in one day.
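The YouTube figure can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check, and it assumes nonstop viewing (24 hours a day, 365 days a year); the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the YouTube figure quoted above.
HOURS_UPLOADED_PER_MINUTE = 100  # figure quoted on the slide
MINUTES_PER_DAY = 60 * 24

hours_uploaded_per_day = HOURS_UPLOADED_PER_MINUTE * MINUTES_PER_DAY

# Assumption: nonstop viewing, 24 hours a day, 365 days a year.
years_to_watch = hours_uploaded_per_day / (24 * 365)

print(f"{hours_uploaded_per_day:,} hours uploaded per day")
print(f"{years_to_watch:.1f} years of nonstop viewing")
```

Under these assumptions the result is 144,000 hours per day, or roughly 16 years of viewing, which is in the same ballpark as the "about 15 years" quoted.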
AT&T is thought to hold the world’s largest volume of data in one
unique database – its phone records database is 312 terabytes in size,
and contains almost 2 trillion rows.
Every minute we send 204,000,000 emails, generate 1,800,000 Facebook
likes, send 278,000 Tweets, and upload 200,000 photos to Facebook.
570 new websites spring into existence every minute of every day.
http://smartdatacollective.com/bernardmarr/277731/big-data-25-facts-everyone-needs-know
http://pixabay.com/static/uploads/photo/2014/03/13/01/12/datacenter-286386_640.jpg
https://c2.staticflickr.com/2/1296/533233247_b6baa30fdb_z.jpg?zz=1
https://c1.staticflickr.com/3/2300/2596366618_2d6cb01735.jpg
http://upload.wikimedia.org/wikipedia/commons/9/90/Kencf0618FacebookNetwork.jpg
http://upload.wikimedia.org/wikipedia/commons/b/bf/USDA_Hardiness_zone_map.jpg
http://upload.wikimedia.org/wikipedia/commons/1/1c/CMS_Higgs-event.jpg
Databases You Use
Pretty much every website you interact with
• Social Media
• Banking
• File Sharing
• Search Engines
• Online Shopping
• Course Registration/Canvas
• Travel
• Etc.
You broadcast/generate data everywhere you go
• Cell phones
• Purchases
• Driving (GPS)
• Streaming music
• Email
• Posting status updates
• Attending events
• Etc.
https://www.google.com/maps/@38.8905569,-77.1721577,13z/data=!5m1!1e1
http://upload.wikimedia.org/wikipedia/commons/6/69/Netflix_logo.svg
How is data collected about you used to help you?
https://c2.staticflickr.com/4/3324/3507973704_563846fe14_z.jpg?zz=1
Who builds these systems?
Data Scientist
Computer Scientist
• Data collection systems
• Machine Learning Algorithms
• Interface Design
• Design/Manage/Query Databases
• Data Aggregation
• Data Mining
Mathematician
• Statistical Models
• Evaluation Metrics
• Predictive Analytics
• Data Visualizations
Business Person
• Domain Expertise
• Knowing what questions to ask
• Interpreting results for business decisions
• Presenting outcomes
Examples – not a complete definition, and not all
simultaneously necessary skills
Data Science Venn Diagram by Drew Conway
http://static.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w
From “Doing Data Science” by Cathy O’Neil & Rachel Schutt
http://semanticommunity.info/@api/deki/files/27057/Figure1-4.png?size=bestfit&width=484&height=541&revision=1
http://www.becomingadatascientist.com/wp-content/uploads/2014/06/DS_profile.png
You don't need to be a “unicorn”, but you do need to know something
about all of these areas and become an expert in some.
Some other names for “Data Scientist”
• Statistician
• Data Mining Specialist
• Biostatistician
• Social Science Researcher
• Big Data Analyst
• Spatial/GIS Analyst
• Natural Language Programmer
• Neuroscientist
• Computational Physicist
• Data Visualization Designer
• Pythonista
• Financial Analyst
• Recommendation System Engineer
• Information Architect
• Artificial Intelligence Researcher
Data Science jobs pay an average of $118,000 per year.
It is estimated that by 2018, the US could have a
shortage of 140,000+ people with advanced
analytical skills and need 1.5M managers/analysts
who can make decisions based on data analysis.
“Extraction of Knowledge”
Also known as “knowledge discovery”
Goes beyond queries
Data Mining
Business Understanding
Data Understanding
Data Preparation
Modeling
Clustering
Classification
Regression
Evaluation
From “Data Science for Business” by Provost & Fawcett
Images from ODU ECE 607 Lecture Slides by Prof. Jiang Li
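As a rough illustration of the three modeling tasks named above (regression, classification, clustering), here is a minimal Python sketch using only the standard library. The toy numbers, labels, and function names are all made up for illustration; they do not come from “Data Science for Business” or the lecture slides, and real projects would typically use a library rather than hand-written versions like these.

```python
from statistics import mean

# Regression: predict a number (least-squares line through toy points).
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

# Classification: predict a label (1-nearest-neighbor on one feature).
def nearest_neighbor_label(labeled_points, query):
    """Return the label of the (value, label) pair whose value is closest to query."""
    return min(labeled_points, key=lambda point: abs(point[0] - query))[1]

# Clustering: find groups without labels (1-D k-means with k = 2).
def two_means(values, iterations=10):
    """Split values into two clusters; assumes at least two distinct values."""
    low_center, high_center = min(values), max(values)
    for _ in range(iterations):
        # Assign each value to its nearest center, then move the centers.
        low = [v for v in values if abs(v - low_center) <= abs(v - high_center)]
        high = [v for v in values if abs(v - low_center) > abs(v - high_center)]
        low_center, high_center = mean(low), mean(high)
    return sorted(low), sorted(high)

# Hypothetical toy data, invented purely for this example.
hours_studied = [1, 2, 3, 4]
test_scores = [52, 54, 56, 58]
sized_items = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
measurements = [1, 2, 3, 10, 11, 12]

line = fit_line(hours_studied, test_scores)
label = nearest_neighbor_label(sized_items, 7.2)
clusters = two_means(measurements)

print("regression (slope, intercept):", line)
print("classification of 7.2:", label)
print("clusters:", clusters)
```

The difference to notice is what each task is given and what it returns: regression and classification learn from examples with known answers (a number or a label), while clustering receives no answers and only groups similar values.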
Video clip: Interview with Neha Kothari, LinkedIn Data Scientist
http://youtu.be/8dxKe5cGHdA?t=17s
Examples
Galaxy Classification using Convolutional Neural Networks
http://benanne.github.io/2014/04/05/galaxy-zoo.html
Choosing Facebook Audience for Content Promotion using Random Forests
http://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/
Predicting Wine Quality with Principal Component Analysis
http://fastml.com/predicting-wine-quality/
Readmission Risk Score to decide which patients to give additional follow-up help at Mt. Sinai hospital
http://www.technologyreview.com/news/518916/a-hospital-takes-its-own-big-data-medicine/
http://xkcd.com/1425/
How to get started
Topics to learn about
Programming
• Any language is good to start with. Gain core understanding.
• Python or R data analysis experience a plus
• Database design, SQL
Math
• Calculus
• Linear Algebra
• Statistics
• Advanced: Optimization / Linear Programming
Research and Analysis
• Science involving data collection and interpretation
• Working with “messy” real-life data
• Business Analytics
• Data Mining
Others
• Business / Communication
• Graphic Design
Read, read, read
Doing Data Science by Cathy O’Neil* & Rachel Schutt
Data Science for Business by Foster Provost & Tom Fawcett
Data Smart by John Foreman* (uses Excel)
I review other books as I read them:
http://www.becomingadatascientist.com/learning/
Blogs & News Feeds (FlowingData.com is a good one to start with)
Twitter – look for curated lists of people to follow
https://twitter.com/BecomingDataSci/lists/women-in-data-science/members
*on Twitter and willing to chat!
Free Online Courses
Python Fundamentals – Codecademy http://www.codecademy.com/tracks/python
Machine Learning – Coursera / Stanford https://www.coursera.org/course/ml
Data Analyst Nanodegree – Udacity https://www.udacity.com/course/nd002
(includes Hadoop mini-course)
Applied Data Mining and Statistical Learning – Penn State
https://onlinecourses.science.psu.edu/stat857/
Pretty comprehensive list here: http://www.kdnuggets.com/education/online.html
TED talks on Data http://www.ted.com/search?q=data
Susan Etlinger* http://www.ted.com/talks/susan_etlinger_what_do_we_do_with_all_this_big_data
“Need to spend more time on critical thinking skills…[because we have
the] potential to make bad decisions far more quickly, efficiently, and with
far greater impact than we did in the past.”
“…we need to be clear about …the methodologies that we use, …because if I
don't know what …questions you asked, I don't know what questions you
didn't ask.”
Explore
Volunteer to Analyze Data (DataKind)
Play with public data sets
http://101.datascience.community/2014/10/17/data-sources-for-cool-data-science-projects-part-1-guest-post/
https://www.opensciencedatacloud.org/publicdata/
http://catalog.data.gov/dataset
https://archive.ics.uci.edu/ml/datasets.html?format=&task=clu&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table
Data Science Competitions
(Kaggle also has “knowledge competitions” for learning)
Questions?
Renee Teate
[email protected], @becomingdatasci
http://www.becomingadatascientist.com