Data Academy - Data Science Basics
The 3 Vs of Big Data
Volume
• Petabytes
• Records
• Transactions
• Tables, files
Velocity
• Batch
• Real time
• Streaming
Variety
• Structured
• Unstructured
• Semi-structured
PhD Thesis
MIT Media Arts & Sciences
Hilary Mason & Chris Wiggins (2010)
1. Obtain: pointing and clicking does not scale.
2. Scrub: the world is a messy place
3. Explore: You can see a lot by looking
4. Models: always bad, sometimes ugly
5. Interpret: “The purpose of computing is insight, not numbers.” (Hamming)
"Data science is clearly a blend of the hackers’ arts; statistics & machine learning;
expertise in mathematics & the domain of the data for the analysis to be
interpretable. It requires creative decisions & open-mindedness in a scientific
context."
A Taxonomy of Data Science
“Dataists”
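The five steps of the taxonomy can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the inline CSV, the column names `hours` and `score`, and every function name are hypothetical and do not come from the talk. The model is a plain least squares line, chosen for brevity.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would come from an API or a scraper.
RAW = """hours,score
1,52
2,55
3,
4,61
five,70
6,67
"""

def obtain(text):
    """Obtain: read records programmatically instead of pointing and clicking."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Scrub: keep only rows whose two fields parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append((float(row["hours"]), float(row["score"])))
        except (TypeError, ValueError):
            continue  # drop missing or malformed values
    return clean

def explore(pairs):
    """Explore: look at simple summaries before modeling."""
    xs, ys = zip(*pairs)
    return {
        "n": len(pairs),
        "mean_hours": statistics.mean(xs),
        "mean_score": statistics.mean(ys),
    }

def model(pairs):
    """Model: least squares fit of score = slope * hours + intercept."""
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def interpret(slope, intercept):
    """Interpret: turn the numbers into a statement a person can act on."""
    return (f"Each extra hour is associated with about {slope:.1f} more points "
            f"(baseline {intercept:.1f}).")

pairs = scrub(obtain(RAW))
summary = explore(pairs)
slope, intercept = model(pairs)
print(summary)
print(interpret(slope, intercept))
```

Keeping each step as its own function makes it easy to swap the data source or the model without touching the rest.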
Mike Loukides (2010)
"Data science enables the creation of data products."
“You might wake up one fine day and realize that your life... actually adds up to expertise
in some domain you’d never identified with at all.”
The Calculus of Grit
ribbonfarm
Drew Conway on Academia (2011)
"With respect to how academics have been impacted by data science, I think the
impact has mostly flowed in the other direction. One major component of data
science is the ability to extract insight from data using tools from math, statistics
and computer science. Most of this is informed by the work of academics, and not
the other way around."
"As so much more data gets pushed into the open, I believe basic data hacking skills
— scraping, cleaning, and visualization — will be prerequisites to any academic
research project."
Data science is a pipeline between academic disciplines
O'Reilly Radar
Jeff Hammerbacher (2009)
"... on any given day, a team member could author a multistage
processing pipeline in Python, design a hypothesis test, perform a
regression analysis over data samples with R, design and implement
an algorithm for some data-intensive product or service
in Hadoop, or communicate the results of our analyses
to other members of the organization."
"We need to tell people that Statisticians are the ones who make sense of the data deluge
occurring in science, engineering, and medicine; that Statistics provides methods for data
analysis in all fields, from art history to zoology; that it is exciting to be a Statistician in the
21st century because of the many challenges brought about by the data explosion in all of
these fields."
DJ Patil (@DJ44), 4:37 AM - 5 Mar 2016
Kirk Borne (2016)
“Fake data scientists are often experts in one particular discipline and insist
that their discipline is the one and only true data science. That belief
misses the point that data science refers to the application of the full
arsenal of scientific tools and techniques (mathematical, computational,
visual, analytic, statistical, experimental, problem definition, model-
building and validation, etc.) to derive discoveries, insights, and value from
data collections.”
https://twitter.com/josh_wills/status/198093512149958656
COMMERCE
DATA SERVICE
Machine Learning
Substantive Expertise
The Data Science Venn Diagram
Zero Intelligence Agents
Drew Conway (2010)
“Science” vs. “Scientist”
The state or fact of knowing; knowledge or cognizance of something
specified or implied.
vs.
Just take a few minutes to rank your skills and tell us how you view yourself. In exchange, we'll tell you more and describe how you
fit in! Advice provided is for entertainment value only!
Just in case you were wondering, we will **NEVER** publish or provide to any third party unaggregated responses or identifying
data.
Get Started
http://survey.datacommunitydc.org/
[Figure: the four types of data scientist (Businessperson, Engineer, Creative, Researcher), with the presenter's own position marked "(Me)".]
The Variety of Data Scientists
Data Businesspeople:
Businessperson, leader, entrepreneur
Data Creatives:
Artist, Jack of all Trades, Hacker
Data Developers:
Engineer, Programmer
Data Researchers:
Scientist, Researcher, Statistician
What tools do data scientists use?
Suggestions?
Business Logic and Spreadsheet
Computation
Price: $139.99
Google Docs
Mathematical and Scientific
Computation
Python
SciPy
matplotlib
SAS: The Power to Know
Price: Don't even ask
WordPress
Django
Databases
Oracle
MySQL
PostgreSQL
Big Data and Distributed
Computation
Cassandra
MongoDB
Infrastructure and Computing
Resources
Amazon Web Services
But even with these tools,
you still need brains!
Hypothesis Driven Development
Practicing Hypothesis-Driven Development is thinking about the development of
new ideas, products and services – even organizational change – as a series of
experiments to determine whether an expected outcome will be achieved. The
process is iterated upon until a desirable outcome is obtained or the idea is
determined to be not viable.
Barry O’Reilly
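One way to make "a series of experiments" concrete is a permutation test on a before/after comparison. The sketch below is an assumption-laden illustration, not a method from the talk: the measurements are invented, and `permutation_test` is a hypothetical helper name. It asks how often a random relabeling of the pooled data produces a difference in means at least as large as the observed one.

```python
import random
import statistics

def permutation_test(control, variant, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns (observed_difference, p_value). The +1 terms keep the
    estimated p-value strictly above zero.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = statistics.mean(variant) - statistics.mean(control)
    pooled = list(control) + list(variant)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = (statistics.mean(pooled[n_control:])
                - statistics.mean(pooled[:n_control]))
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_permutations + 1)

# Hypothetical measurements: the hypothesis is that the change raises the metric.
control = [12, 14, 11, 13, 12, 15, 13, 12]
variant = [16, 18, 17, 15, 19, 17, 16, 18]

observed, p_value = permutation_test(control, variant)
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the expected outcome was achieved; a large one suggests iterating on the idea or dropping it.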
Introduction:
In this workshop we show an example data science workflow, from initial data ingestion through cleaning and modeling to clustering. In this example we scrape the news feed of NIST, the National Institute of
Standards and Technology. It comprises multiple research centers, which include:
You can also use this guide to scrape other data from a webpage: http://docs.python-guide.org/en/latest/scenarios/scrape/
https://github.com/StarCYing/open_data_day_dc
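The ingestion step might look roughly like the sketch below, which uses only the Python standard library. It is not the workshop's code: `LinkCollector`, `fetch`, and the sample HTML are hypothetical, and the markup of the real NIST feed is not assumed here. The parser is run on an inline snippet so the example works offline.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect (title, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Cleaning step: collapse runs of whitespace in the title.
            title = " ".join("".join(self._text).split())
            if title:
                self.links.append((title, self._href))
            self._href = None

def fetch(url):
    """Download a page as text (not called here; it needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

# Hypothetical markup standing in for a downloaded news page.
SAMPLE_HTML = """
<ul class="news">
  <li><a href="/news/1">  New measurement
      standard released </a></li>
  <li><a href="/news/2">Workshop on <b>data</b> quality</a></li>
  <li><a>no link here</a></li>
</ul>
"""

for title, href in extract_links(SAMPLE_HTML):
    print(href, title)
```

For a live page, `extract_links(fetch(url))` would chain the two steps, with `url` supplied by the reader.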
What is a data product?
Ideas?
A data product is a product that is based on
the combination of data and algorithms.
Hilary Mason
A data application acquires its value from
the data itself, and creates more data as
a result. It's not just an application with
data; it's a data product.
Mike Loukides
Data products are self-adapting, broadly
applicable economic engines that derive
their value from data and generate more
data by influencing human behavior or by
making inferences or predictions upon new
data.
Benjamin Bengfort
What are some examples?
[Screenshots of example data products. Legible text: "...your professional network"; "Agony Price Depart Length".]
[Figure: Network Operations Center/NMS]
[Figure: data science pipeline. Legible stage labels: Data Ingestion; Data Munging and Wrangling; Computation and Analyses. Bullet groups as extracted, one per slide:]
• Means, Source, Question, Size, Velocity
• Warehouse, Extract, Transform, Filter, Aggregation
• Training, Hypothesis, Design, Method, Time
• Supervised, Unsupervised, Regression, Classification, Clustering, Etc...
• Crucial, Active Learning, Error Detection, Mashups, Value
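Clustering, listed among the unsupervised methods in the pipeline, can be illustrated with a minimal k-means sketch. It makes several assumptions that are not from the slides: the points are invented, the initial centroids are simply the first k points (real implementations usually choose them more carefully), and `kmeans` and `assign` are hypothetical names.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def kmeans(points, k, iterations=20):
    """Plain k-means: alternate assignment and centroid update."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

No labels are supplied to the algorithm; the grouping comes only from distances between points, which is what distinguishes it from the supervised methods in the same list.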
Benjamin Bengfort
PhD Candidate at the University of
Maryland; Data Scientist at District
Data Labs.
Twitter: twitter.com/bbengfort
LinkedIn: linkedin.com/in/bbengfort
Github: github.com/bbengfort
Email: [email protected]
StackExchange
O'Reilly