DATA
SCIENCE
Francis A
AGENDA
Introduction
• Data Science
• Big Data
• Data Analytics
Primary goals
Areas of growth
DATA SCIENCE 3
INTRODUCTION
Data science is an interdisciplinary
academic field that uses statistics,
scientific computing, scientific methods,
processes, algorithms and systems to
extract or extrapolate knowledge and
insights from noisy, structured, and
unstructured data.
BIG DATA 4
INTRODUCTION
Big data is data that arrives in greater variety, in larger
volumes, and with more velocity. These are known as the
three Vs (Volume, Velocity, Variety). Put simply, big data
means larger, more complex data sets, especially from
new data sources. These data sets are so voluminous
that traditional data processing software simply can't
manage them. But these massive volumes of data can be
used to address business problems you couldn't have
tackled before.
CAN YOU THINK
OF…
• Can you think of running a
query on a 20,980,000 GB file?
• What if we get a new data set
like this every day?
• What if we need to execute
complex queries on this data set
every day?
• Does anybody really deal with
this type of data set?
• Is it possible to store and
analyze this data?
• Yes. Google deals with more
than 20 PB of data every day.
YES… THAT’S
TRUE.
IT’S POSSIBLE!
• Google processes 20 PB a
day (2008)
• The Wayback Machine has 3
PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of
user data + 15 TB/day
(4/2009)
• eBay has 6.5 PB of user
data + 50 TB/day (5/2009)
• CERN's Large Hadron
Collider (LHC) generates
about 15 PB per year
BIG DATA IS NOT JUST ABOUT SIZE
• Volume
Data volumes are becoming unmanageable.
• Variety
Data complexity is growing: more types of data are captured than
ever before.
• Velocity
Some data arrives so rapidly that it must either be processed
instantly or lost. This has become a whole subfield called
"stream processing"
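The velocity point is what stream processing addresses: each record is handled as it arrives, and only a small, bounded window of recent data is held in memory. A toy sketch in Python (the readings, window size, and function name are illustrative, not from any particular streaming framework):

```python
from collections import deque

def rolling_average(stream, window=3):
    """Consume events one at a time, keeping only the last `window` in memory."""
    recent = deque(maxlen=window)
    for value in stream:
        recent.append(value)          # the oldest value falls out automatically
        yield sum(recent) / len(recent)

# Readings arrive one at a time; the full stream is never stored.
for avg in rolling_average(iter([10, 20, 30, 40])):
    print(avg)  # 10.0, 15.0, 20.0, 30.0
```

The generator never materializes the whole stream, which is the essence of "process instantly or lose it".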
ANY DATA THAT IS DIFFICULT TO…
• Capture
• Store
• Search
• Share
• Transfer
• Analyze
• Visualize
WHAT CAN BE DONE WITH BIG DATA
• Social media brand value analytics
• Product sentiment analysis
• Customer buying preference predictions
• Video analytics
• Fraud detection
• Aggregation and statistics
• Data warehousing and OLAP
• Indexing, searching, and querying
• Keyword-based search
• Pattern matching (XML/RDF)
• Knowledge discovery and data mining
• Statistical modeling
OK...
ANALYSIS OF THIS BIG DATA CAN GIVE
US AWESOME INSIGHTS.
BUT THE DATA SETS ARE HUGE, COMPLEX,
AND DIFFICULT TO PROCESS.
WHAT IS THE SOLUTION?
HANDLING BIG DATA: IS THERE A BETTER WAY?
• Until 1985, there was no way to connect multiple computers;
all systems were centralized.
• So multi-core systems or supercomputers were the only
options for big data problems.
• After 1985, we had powerful microprocessors and high-speed
computer networks (LANs, WANs), which led to distributed
systems.
METHODS
• Parallel computing
• Distributed computing
• MapReduce
PARALLEL
Put multiple CPUs in a machine (100?).
Write code that calculates 200 parallel counts and finally sums
them up. But you need a supercomputer.

DISTRIBUTED
We want to cut the data into small pieces & place them on
different machines. Divide the overall problem into small tasks
& run these small tasks locally. Finally, collate the results
from the local machines.

MAP REDUCE
Processing data using special map() and reduce() functions.
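The map()/reduce() idea can be sketched in plain Python. This is a single-machine toy (real MapReduce shards the chunks across machines); the function names and sample chunks are illustrative:

```python
from collections import defaultdict

def map_phase(chunk):
    """map(): emit a (word, 1) pair for every word in one chunk of input."""
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """reduce(): collate the emitted pairs, summing the counts per key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each chunk stands in for data living on a different machine.
chunks = ["big data is big", "data science uses data"]
pairs = (pair for chunk in chunks for pair in map_phase(chunk))
print(reduce_phase(pairs))  # {'big': 2, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```

The map step runs independently on each chunk, which is exactly what lets the work spread across machines; only the small (word, count) pairs need to travel to the reducer.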
I FOUND A SOLUTION:
DISTRIBUTED COMPUTING, OR
MAP-REDUCE.
BUT IT LOOKS LIKE THIS DATA
STORAGE & PARALLEL
PROCESSING IS COMPLICATED.
WHAT IS THE SOLUTION?
DATA ANALYTICS
INTRODUCTION
Data analysis is a process of inspecting, cleansing,
transforming and modeling data with the goal of
discovering useful information, informing conclusions, and
supporting decision-making. Data analysis has multiple
facets and approaches, encompassing diverse techniques
under a variety of names, and is used in different
business, science, and social science domains.
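The four activities in this definition (inspecting, cleansing, transforming, modeling) can be shown end to end on a tiny example. The data and the deliberately simple "model" (a mean) are purely illustrative:

```python
raw = ["23", "19", "", "31", "n/a", "27"]      # inspecting: raw survey ages, some invalid

clean = [int(x) for x in raw if x.isdigit()]   # cleansing: drop blank/non-numeric entries
ordered = sorted(clean)                        # transforming: put in analysis-friendly order
mean_age = sum(ordered) / len(ordered)         # modeling: a simple summary statistic

print(clean, mean_age)  # [23, 19, 31, 27] 25.0
```

Real analyses use richer tooling, but every pipeline repeats this same inspect-cleanse-transform-model shape.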
DATA ANALYTICS 16
HADOOP
Hadoop is a collection of tools with many components.
HDFS and MapReduce are its two core components.
HDFS: Hadoop Distributed File System
• Makes it easy to store data on commodity
hardware.
• Built to expect hardware failures.
• Intended for large files & batch inserts.
MapReduce
• For parallel processing
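As a concrete example of the MapReduce component: Hadoop's Streaming interface (one of several ways to submit jobs) runs any program that reads lines on stdin and writes tab-separated key/value pairs on stdout, with Hadoop sorting the map output by key between the phases. A word-count sketch in that style (the shuffle/sort is simulated here with `sorted()`; the exact job submission depends on your cluster):

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit 'word<TAB>1' for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Reduce phase: sum the counts for each run of identical keys."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in group)}"

# sorted() stands in for the shuffle/sort Hadoop performs between the phases.
for line in reducer(sorted(mapper(["big data", "big ideas"]))):
    print(line)
```

On a real cluster the mapper and reducer would be two separate scripts reading stdin, and HDFS would supply the input splits.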
SO WHAT IS HADOOP?
• Hadoop is not big data.
• Hadoop is not a database.
• Hadoop is a platform/framework:
• It allows the user to quickly write
and test distributed systems.
• It automatically and efficiently
distributes the data and
work across machines.
WHY IS HADOOP USEFUL?
• Scalable: It can reliably store and process petabytes.
• Economical: It distributes the data and processing
across clusters of commonly available computers.
• Efficient: By distributing the data, it can process it in
parallel on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of
data and automatically redeploys computing tasks
after failures.
• And Hadoop is free.
OK, I CAN USE THE HADOOP FRAMEWORK...
BUT I DON'T KNOW JAVA.
HOW DO I WRITE MAP-REDUCE
PROGRAMS?
MAP REDUCE MADE EASY
Hive:
• Hive is for data analysts with strong SQL skills, providing an SQL-
like interface and a relational data model.
• Hive uses a language called HiveQL, which is very similar to SQL.
• Hive translates queries into a series of MapReduce jobs.
Pig:
• Pig is a high-level platform for processing big data on Hadoop
clusters.
• Pig consists of a data flow language, called Pig Latin, which supports
writing queries on large datasets, and an execution environment that
runs programs from a console.
• Pig Latin programs consist of a series of dataset transformations
that are converted, under the covers, into a series of MapReduce jobs.
Mahout:
• Mahout is an open-source library that facilitates building scalable
machine-learning applications.
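To make "queries become MapReduce jobs" concrete: a GROUP BY such as `SELECT city, COUNT(*) FROM users GROUP BY city` compiles, roughly, into a map that emits the grouping key and a reduce that sums per key. This is an illustrative sketch of the idea, not Hive's actual query planner, and the table and rows are made up:

```python
from collections import Counter

# Hypothetical users(name, city) table.
rows = [("alice", "NY"), ("bob", "CA"), ("carol", "NY")]

mapped = [(city, 1) for _name, city in rows]  # map: emit (grouping key, 1)
reduced = Counter()                           # reduce: sum the 1s per key
for city, n in mapped:
    reduced[city] += n

print(dict(reduced))  # {'NY': 2, 'CA': 1}
```

Hive, Pig, and similar tools let analysts write the one-line declarative query and have this map/reduce plumbing generated for them.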
LIFECYCLE OF BIG DATA
BUSINESS ANALYTICS
1. Discovery (Definition, Domain,
Resources)
2. Preparation (Sandbox, ETL &
ELT)
3. Model Planning (Research)
4. Model Building (Training data
& Test data)
5. Operationalize (Small scale)
6. Communicate Results
(Objectives)
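Step 4's division into training and test data can be sketched as follows. The 80/20 ratio is a common convention rather than something the lifecycle mandates, and the records are stand-ins:

```python
import random

records = list(range(10))        # stand-in for the prepared records from step 2
random.seed(0)                   # fixed seed so the split is reproducible
random.shuffle(records)          # shuffle before splitting to avoid ordering bias

cut = int(0.8 * len(records))    # 80% train / 20% test
train, test = records[:cut], records[cut:]
print(len(train), len(test))  # 8 2
```

The model is then fit on `train` only, and `test` is held back to estimate how it performs on data it has never seen.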
PRIMARY GOAL
ROLE OF A DATA SCIENTIST
• A data scientist is a person who looks at
data and makes out the trends in it.
• Performs descriptive, discovery, predictive, and
prescriptive analytics on the data.
• Finds the hidden story in the data, draws
insights, and takes suitable actions/decisions.
• Works with application developers to find
suitable data for analysis.
• Makes the plan for doing analytics for specific results.
• Designs effective data mining architectures.
• Makes reports.
WHO IS A DATA SCIENTIST
PERFORMANCE
AREAS OF GROWTH
“SUCCESS AHEAD”
-AF
THANK YOU