
DATA SCIENCE
Francis A
AGENDA
Introduction
• Data Science
• Big Data
• Data Analytics

Primary goals
Areas of growth


INTRODUCTION
Data science is an interdisciplinary
academic field that uses statistics,
scientific computing, scientific methods,
processes, algorithms and systems to
extract or extrapolate knowledge and
insights from noisy, structured, and
unstructured data.

INTRODUCTION
The definition of big data is data that contains
greater variety, arriving in increasing volumes and
with more velocity. This is also known as the three
Vs (volume, velocity, variety). Put simply, big data
is larger, more complex data sets, especially from
new data sources. These data sets are so
voluminous that traditional data processing
software just can’t manage them. But these
massive volumes of data can be used to address
business problems you wouldn’t have been able to
tackle before.

CAN YOU THINK OF…
• Can you think of running a query on a 20,980,000 GB file?
• What if we get a new data set like this every day?
• What if we need to execute complex queries on this data set every day?
• Does anybody really deal with this type of data set?
• Is it possible to store and analyze this data?
• Yes, Google deals with more than 20 PB of data every day.

YES… THAT'S TRUE

IT'S POSSIBLE!!

• Google processes 20 PB a day (2008)
• The Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN's Large Hadron Collider (LHC) generates about 15 PB of data per year

BIG DATA IS NOT JUST ABOUT SIZE

• Volume
Data volumes are becoming unmanageable.

• Variety
Data complexity is growing; more types of data are being captured than previously.

• Velocity
Some data arrives so rapidly that it must either be processed instantly or be lost. This is a whole subfield called "stream processing".

ANY DATA THAT IS DIFFICULT TO…

• Capture
• Store
• Search
• Share
• Transfer
• Analyze
• Create visualizations

WHAT CAN BE DONE WITH BIG DATA

• Social media brand value analytics
• Product sentiment analysis
• Customer buying preference predictions
• Video analytics
• Fraud detection
• Aggregation and statistics
• Data warehouse and OLAP
• Indexing, searching, and querying
• Keyword-based search
• Pattern matching (XML/RDF)
• Knowledge discovery and data mining
• Statistical modeling

OK…

ANALYSIS OF THIS BIG DATA CAN GIVE US AWESOME INSIGHTS.

BUT THE DATASETS ARE HUGE, COMPLEX, AND DIFFICULT TO PROCESS.

WHAT IS THE SOLUTION?


HANDLING BIG DATA: IS THERE A BETTER WAY?

• Until 1985, there was no way to connect multiple computers; all systems were centralized systems.
• So multi-core systems or supercomputers were the only options for big data problems.
• After 1985, we had powerful microprocessors and high-speed computer networks (LANs, WANs), which led to distributed systems.

METHODS

• Parallel computing
• Distributed computing
• MapReduce

PARALLEL COMPUTING
• Put multiple CPUs in a machine (100?).
• Write code that will calculate 200 parallel counts and finally sum them up.
• But you need a supercomputer.

DISTRIBUTED COMPUTING
• We want to cut the data into small pieces & place them on different machines.
• Divide the overall problem into small tasks & run these small tasks locally.
• Finally collate the results from the local machines.

MAP REDUCE
• Processing data using special map() and reduce() functions.
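To make the map()/reduce() idea concrete, here is a toy word-count sketch in plain Python (not Hadoop): the data is cut into chunks, each chunk is counted by a separate worker process standing in for a machine, and the partial results are collated at the end. The sample chunks are invented for illustration.

```python
# Toy sketch of "divide the data, process locally, collate the results"
# on a single machine; multiprocessing workers stand in for cluster nodes.
from multiprocessing import Pool

def count_words(chunk):
    # "map" step: each worker counts words in its own piece of the data.
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(total, partial):
    # "reduce" step: collate the partial counts into one overall result.
    for word, n in partial.items():
        total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    # Pretend each chunk lives on a different machine.
    chunks = ["big data is big", "data science uses big data"]
    with Pool(processes=2) as pool:
        partials = pool.map(count_words, chunks)   # "map" phase, run in parallel
    result = {}
    for partial in partials:
        result = merge(result, partial)            # "reduce" phase
    print(result)   # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```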

I FOUND A SOLUTION: DISTRIBUTED COMPUTING, OR MAP-REDUCE.
BUT IT LOOKS LIKE THIS DATA STORAGE & PARALLEL PROCESSING IS COMPLICATED.
WHAT IS THE SOLUTION?

DATA ANALYTICS

INTRODUCTION
Data analysis is a process of inspecting, cleansing,
transforming and modeling data with the goal of
discovering useful information, informing conclusions, and
supporting decision-making. Data analysis has multiple
facets and approaches, encompassing diverse techniques
under a variety of names, and is used in different
business, science, and social science domains.
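A minimal sketch of those steps (inspect, cleanse, transform, inform a decision) in Python with pandas; the sales.csv file and its region/revenue columns are hypothetical placeholders.

```python
# Minimal data-analysis sketch with pandas; sales.csv and its
# "region"/"revenue" columns are hypothetical.
import pandas as pd

df = pd.read_csv("sales.csv")                      # load the raw data
print(df.describe())                               # inspect basic statistics
df = df.dropna(subset=["revenue"])                 # cleanse: drop rows missing revenue
by_region = df.groupby("region")["revenue"].sum()  # transform/aggregate
print(by_region.sort_values(ascending=False))      # inform a decision: strongest regions first
```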

HADOOP
Hadoop is a collection of tools; it has many components.
HDFS and MapReduce are the two core components of Hadoop.

HDFS: Hadoop Distributed File System
• Makes it easy to store data on commodity hardware.
• Built to expect hardware failures.
• Intended for large files & batch inserts.

MapReduce
• For parallel processing.
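For illustration only: Hadoop Streaming (a standard Hadoop facility) lets map and reduce tasks be written as plain scripts that read stdin and write stdout, so they do not have to be Java. A rough Python word-count sketch, with hypothetical file names mapper.py and reducer.py:

```python
# --- mapper.py (sketch of a Hadoop Streaming mapper) ---
# Reads raw text lines from stdin and emits tab-separated "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# --- reducer.py (sketch of a Hadoop Streaming reducer) ---
# Hadoop sorts the mapper output by key, so counts for a word arrive together.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(n)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

The two scripts would then be submitted through Hadoop's streaming jar together with HDFS input and output paths; the exact command line depends on the installation.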

SO WHAT IS HADOOP?

• Hadoop is not big data.
• Hadoop is not a database.
• Hadoop is a platform/framework that allows the user to quickly write and test distributed systems, and that efficiently and automatically distributes the data and work across machines.

WHY IS HADOOP USEFUL?

• Scalable: It can reliably store and process petabytes.
• Economical: It distributes the data and processing across clusters of commonly available computers.
• Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
• And Hadoop is free.

OK, I CAN USE THE HADOOP FRAMEWORK…
BUT I DON'T KNOW JAVA.
HOW DO I WRITE MAP-REDUCE PROGRAMS?

MAP REDUCE MADE EASY

Hive:
• Hive is for data analysts with strong SQL skills, providing an SQL-like interface and a relational data model.
• Hive uses a language called HiveQL, which is very similar to SQL.
• Hive translates queries into a series of MapReduce jobs (see the sketch after this slide).

Pig:
• Pig is a high-level platform for processing big data on Hadoop clusters.
• Pig consists of a data flow language, called Pig Latin, which supports writing queries on large datasets, and an execution environment that runs programs from a console.
• Pig Latin programs consist of a series of dataset transformations that are converted, under the covers, into a series of MapReduce jobs.

Mahout:
• Mahout is an open-source machine-learning library that facilitates building scalable machine-learning applications.
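A rough sketch of the Hive workflow from Python, assuming a configured Hive CLI and a hypothetical page_views table; Hive compiles the query into MapReduce jobs behind the scenes.

```python
# Sketch: run a HiveQL query from Python by calling the Hive command-line client.
# The page_views table and its columns are hypothetical examples.
import subprocess

query = """
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country
ORDER BY views DESC
LIMIT 10;
"""

# 'hive -e' runs a single HiveQL statement; Hive turns it into MapReduce jobs.
subprocess.run(["hive", "-e", query], check=True)
```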

LIFECYCLE OF BIG DATA BUSINESS ANALYTICS

1. Discovery (Definition, Domain, Resource)
2. Preparation (Sandbox, ETL & ELT)
3. Model Planning (Research)
4. Model Building (Training data & Test data; a sketch follows this list)
5. Operationalize (Small scale)
6. Communicate Results (Objectives)
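A minimal sketch of phase 4 (Model Building): hold out a test set, train a model, and evaluate it on unseen data. Here scikit-learn and synthetic data stand in for whatever the earlier phases produced.

```python
# Phase 4 sketch: split into training and test sets, fit a model, evaluate it.
# Synthetic data stands in for the prepared dataset from earlier phases.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```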
PRIMARY GOAL

ROLE OF A DATA SCIENTIST

• A data scientist should be a person who looks at data and identifies the trends in it.
• Performs descriptive, discovery, predictive, and prescriptive analytics on the data.
• Finds the hidden story in the data, derives insights, and takes suitable actions/decisions.
• Works with application developers to find the suitable data for analysis.
• Makes the plan for doing analytics for specific results.
• Builds effective data mining architectures.
• Makes reports.

WHO IS A DATA SCIENTIST?




PERFORMANCE

AREAS OF GROWTH

“SUCCESS AHEAD”
-AF
THANK YOU
