DATA
SCIENCE
Francis A
AGENDA
Introduction
• Data Science
• Big Data
• Data Analytics
Primary goals
Areas of growth
DATA SCIENCE 3
INTRODUCTION
Data science is an interdisciplinary
academic field that uses statistics,
scientific computing, scientific methods,
processes, algorithms and systems to
extract or extrapolate knowledge and
insights from noisy, structured, and
unstructured data.
BIG DATA 4
INTRODUCTION
Big data is data that arrives in greater variety, in larger
volumes, and with more velocity. These are known as the
three Vs (Volume, Velocity, Variety). Put simply, big data
means larger, more complex data sets, especially from
new data sources. These data sets are so voluminous
that traditional data processing software simply can't
manage them. But these massive volumes of data can be
used to address business problems you couldn't have
tackled before.
CAN YOU THINK
OF…
• Can you think of running a
query on a 20,980,000 GB file?
• What if we get a new data set
like this every day?
• What if we need to execute
complex queries on this data set
every day?
• Does anybody really deal with
this type of data set?
• Is it possible to store and
analyze this data?
• Yes. Google deals with more
than 20 PB of data every day.
YES… THAT’S
TRUE.
IT’S POSSIBLE!
• Google processes 20 PB a
day (2008)
• The Wayback Machine has 3
PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of
user data + 15 TB/day
(4/2009)
• eBay has 6.5 PB of user
data + 50 TB/day (5/2009)
• CERN's Large Hadron
Collider (LHC) generates
about 15 PB per year
BIG DATA IS NOT JUST ABOUT SIZE
• Volume
Data volumes are becoming unmanageable.
• Variety
Data complexity is growing: more types of data are captured than
ever before.
• Velocity
Some data arrives so rapidly that it must either be processed
instantly or lost. This has become a whole subfield called
"stream processing"
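The velocity point is what stream processing addresses: each record is handled as it arrives, and only a small, bounded window of recent data is held in memory. A toy sketch in Python (the readings, window size, and function name are illustrative, not from any particular streaming framework):

```python
from collections import deque

def rolling_average(stream, window=3):
    """Consume events one at a time, keeping only the last `window` in memory."""
    recent = deque(maxlen=window)
    for value in stream:
        recent.append(value)          # the oldest value falls out automatically
        yield sum(recent) / len(recent)

# Readings arrive one at a time; the full stream is never stored.
for avg in rolling_average(iter([10, 20, 30, 40])):
    print(avg)  # 10.0, 15.0, 20.0, 30.0
```

The generator never materializes the whole stream, which is the essence of "process instantly or lose it".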
ANY DATA THAT IS DIFFICULT TO…
• Capture
• Store
• Search
• Share
• Transfer
• Analyze
• Visualize
WHAT CAN BE DONE WITH BIG DATA
• Social media brand value analytics
• Product sentiment analysis
• Customer buying preference predictions
• Video analytics
• Fraud detection
• Aggregation and statistics
• Data warehousing and OLAP
• Indexing, searching, and querying
• Keyword-based search
• Pattern matching (XML/RDF)
• Knowledge discovery and data mining
• Statistical modeling
OK...
ANALYSIS OF THIS BIG DATA CAN GIVE
US AWESOME INSIGHTS.
BUT THE DATA SETS ARE HUGE, COMPLEX,
AND DIFFICULT TO PROCESS.
WHAT IS THE SOLUTION?
HANDLING BIG DATA: IS THERE A BETTER WAY?
• Until 1985, there was no way to connect multiple computers;
all systems were centralized.
• So multi-core systems or supercomputers were the only
options for big data problems.
• After 1985, we had powerful microprocessors and high-speed
computer networks (LANs, WANs), which led to distributed
systems.
METHODS
• Parallel computing
• Distributed computing
• MapReduce
PARALLEL
Put multiple CPUs in a machine (100?).
Write code that calculates 200 parallel counts and finally sums
them up. But you need a supercomputer.

DISTRIBUTED
We want to cut the data into small pieces & place them on
different machines. Divide the overall problem into small tasks
& run these small tasks locally. Finally, collate the results
from the local machines.

MAP REDUCE
Processing data using special map() and reduce() functions.
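The map()/reduce() idea can be sketched in plain Python. This is a single-machine toy (real MapReduce shards the chunks across machines); the function names and sample chunks are illustrative:

```python
from collections import defaultdict

def map_phase(chunk):
    """map(): emit a (word, 1) pair for every word in one chunk of input."""
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """reduce(): collate the emitted pairs, summing the counts per key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each chunk stands in for data living on a different machine.
chunks = ["big data is big", "data science uses data"]
pairs = (pair for chunk in chunks for pair in map_phase(chunk))
print(reduce_phase(pairs))  # {'big': 2, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```

The map step runs independently on each chunk, which is exactly what lets the work spread across machines; only the small (word, count) pairs need to travel to the reducer.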
I FOUND A SOLUTION:
DISTRIBUTED COMPUTING, OR
MAP-REDUCE.
BUT IT LOOKS LIKE THIS DATA
STORAGE & PARALLEL
PROCESSING IS COMPLICATED.
WHAT IS THE SOLUTION?
DATA ANALYTICS
INTRODUCTION
Data analysis is a process of inspecting, cleansing,
transforming and modeling data with the goal of
discovering useful information, informing conclusions, and
supporting decision-making. Data analysis has multiple
facets and approaches, encompassing diverse techniques
under a variety of names, and is used in different
business, science, and social science domains.
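The four activities in this definition (inspecting, cleansing, transforming, modeling) can be shown end to end on a tiny example. The data and the deliberately simple "model" (a mean) are purely illustrative:

```python
raw = ["23", "19", "", "31", "n/a", "27"]      # inspecting: raw survey ages, some invalid

clean = [int(x) for x in raw if x.isdigit()]   # cleansing: drop blank/non-numeric entries
ordered = sorted(clean)                        # transforming: put in analysis-friendly order
mean_age = sum(ordered) / len(ordered)         # modeling: a simple summary statistic

print(clean, mean_age)  # [23, 19, 31, 27] 25.0
```

Real analyses use richer tooling, but every pipeline repeats this same inspect-cleanse-transform-model shape.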
DATA ANALYTICS 16
HADOOP
Hadoop is a collection of tools with many components.
HDFS and MapReduce are its two core components.
HDFS: Hadoop Distributed File System
• Makes it easy to store data on commodity
hardware.
• Built to expect hardware failures.
• Intended for large files & batch inserts.
MapReduce
• For parallel processing
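As a concrete example of the MapReduce component: Hadoop's Streaming interface (one of several ways to submit jobs) runs any program that reads lines on stdin and writes tab-separated key/value pairs on stdout, with Hadoop sorting the map output by key between the phases. A word-count sketch in that style (the shuffle/sort is simulated here with `sorted()`; the exact job submission depends on your cluster):

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit 'word<TAB>1' for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Reduce phase: sum the counts for each run of identical keys."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in group)}"

# sorted() stands in for the shuffle/sort Hadoop performs between the phases.
for line in reducer(sorted(mapper(["big data", "big ideas"]))):
    print(line)
```

On a real cluster the mapper and reducer would be two separate scripts reading stdin, and HDFS would supply the input splits.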
SO WHAT IS HADOOP?
• Hadoop is not big data.
• Hadoop is not a database.
• Hadoop is a platform/framework:
• It allows the user to quickly write
and test distributed systems.
• It automatically and efficiently
distributes the data and
work across machines.
WHY IS HADOOP USEFUL?
• Scalable: It can reliably store and process petabytes.
• Economical: It distributes the data and processing
across clusters of commonly available computers.
• Efficient: By distributing the data, it can process it in
parallel on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of
data and automatically redeploys computing tasks
after failures.
• And Hadoop is free.
OK, I CAN USE THE HADOOP FRAMEWORK...
BUT I DON'T KNOW JAVA.
HOW DO I WRITE MAP-REDUCE
PROGRAMS?
MAP REDUCE MADE EASY
Hive:
• Hive is for data analysts with strong SQL skills, providing an SQL-
like interface and a relational data model.
• Hive uses a language called HiveQL, which is very similar to SQL.
• Hive translates queries into a series of MapReduce jobs.
Pig:
• Pig is a high-level platform for processing big data on Hadoop
clusters.
• Pig consists of a data flow language, called Pig Latin, which supports
writing queries on large datasets, and an execution environment that
runs programs from a console.
• Pig Latin programs consist of a series of dataset transformations
that are converted, under the covers, into a series of MapReduce jobs.
Mahout:
• Mahout is an open-source library that facilitates building scalable
machine-learning applications.
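To make "queries become MapReduce jobs" concrete: a GROUP BY such as `SELECT city, COUNT(*) FROM users GROUP BY city` compiles, roughly, into a map that emits the grouping key and a reduce that sums per key. This is an illustrative sketch of the idea, not Hive's actual query planner, and the table and rows are made up:

```python
from collections import Counter

# Hypothetical users(name, city) table.
rows = [("alice", "NY"), ("bob", "CA"), ("carol", "NY")]

mapped = [(city, 1) for _name, city in rows]  # map: emit (grouping key, 1)
reduced = Counter()                           # reduce: sum the 1s per key
for city, n in mapped:
    reduced[city] += n

print(dict(reduced))  # {'NY': 2, 'CA': 1}
```

Hive, Pig, and similar tools let analysts write the one-line declarative query and have this map/reduce plumbing generated for them.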
LIFECYCLE OF BIG DATA
BUSINESS ANALYTICS
1. Discovery (Definition, Domain,
Resources)
2. Preparation (Sandbox, ETL &
ELT)
3. Model Planning (Research)
4. Model Building (Training data
& Test data)
5. Operationalize (Small scale)
6. Communicate Results
(Objectives)
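Step 4's division into training and test data can be sketched as follows. The 80/20 ratio is a common convention rather than something the lifecycle mandates, and the records are stand-ins:

```python
import random

records = list(range(10))        # stand-in for the prepared records from step 2
random.seed(0)                   # fixed seed so the split is reproducible
random.shuffle(records)          # shuffle before splitting to avoid ordering bias

cut = int(0.8 * len(records))    # 80% train / 20% test
train, test = records[:cut], records[cut:]
print(len(train), len(test))  # 8 2
```

The model is then fit on `train` only, and `test` is held back to estimate how it performs on data it has never seen.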
PRIMARY GOAL
ROLE OF A DATA SCIENTIST
• A data scientist is a person who looks at
data and makes out the trends in it.
• Performs descriptive, discovery, predictive, and
prescriptive analytics on the data.
• Finds the hidden story in the data, draws
insights, and takes suitable actions/decisions.
• Works with application developers to find
suitable data for analysis.
• Makes the plan for doing analytics for specific results.
• Designs effective data mining architectures.
• Makes reports.
WHO IS A DATA SCIENTIST
PERFORMANCE
AREAS OF GROWTH
“SUCCESS AHEAD”
-AF
THANK YOU