Chapter 2
Data Science
Contents:
Overview of Data Science
Data Types and Their Representation
Data Value Chain
Basic Concepts of Big Data
Overview of Data Science
• Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and insights
from structured, semi-structured and unstructured data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.
What are data and information?
• Data can be defined as a representation of facts, concepts, or instructions in a
formalized manner.
• It can be described as unprocessed facts and figures.
• It is represented with the help of:
• alphabets (A-Z, a-z),
• digits (0-9) or
• special characters (+, -, /, *, <,>, =, etc.)
• Information is the processed data on which decisions and actions are based.
• It is interpreted data, created from organized, structured, and processed data
in a particular context.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people
or machines to increase its usefulness and add value for a particular
purpose.
• Data processing consists of the following basic steps - input,
processing, and output.
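To make the cycle concrete, here is a minimal Python sketch (the exam-score data and the pass/fail rule are invented for the example, not taken from the slides): raw facts go in, a processing step interprets them, and information comes out.

def process(scores):
    """Processing step: turn raw facts (scores) into information (pass/fail)."""
    return {name: ("pass" if score >= 50 else "fail") for name, score in scores.items()}

raw_data = {"Abebe": 78, "Sara": 45, "Chaltu": 90}   # input: unprocessed facts (hypothetical)
information = process(raw_data)                       # processing
print(information)                                    # output: interpreted data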
Data types and their representation
1. Data types from a computer programming perspective
• Integers (int): used to store whole numbers, mathematically known as integers.
• Booleans (bool): used to represent a value restricted to one of two states: true or false.
• Characters (char): used to store a single character.
• Floating-point numbers (float): used to store real numbers.
• Alphanumeric strings (string): used to store a combination of characters and numbers.
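The snippet below is an illustrative Python sketch of these types; the variable names and values are invented for the example, and since Python has no separate char type, a one-character string stands in for it.

count = 42                 # integer (int): whole number
is_valid = True            # boolean (bool): true or false
grade = "A"                # character: a one-character string (Python has no char type)
price = 19.99              # floating-point number (float): real number
student_id = "ETS0421/11"  # alphanumeric string: letters, digits, and symbols (hypothetical value)
print(type(count), type(is_valid), type(grade), type(price), type(student_id))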
2. Data types from a data analytics perspective
Structured Data
• Has a pre-defined data model.
• Straightforward to analyze.
• Placed in tabular format.
• Example: Excel files or SQL databases.
Unstructured Data
• Has no pre-defined data model.
• May contain data such as dates, numbers, and facts.
• Difficult to understand using traditional programs.
• Example: audio and video files.
Semi-structured Data
• Contains tags or other markers to separate semantic elements.
• Known as a self-describing structure.
• Example: JSON and XML.
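As a small illustration of semi-structured data, the sketch below parses a hypothetical JSON record with Python's standard json module; the keys act as the self-describing tags mentioned above.

import json

# Hypothetical semi-structured record: the keys (tags) describe the data itself.
record = '{"name": "Abebe", "age": 21, "courses": ["Math", "Physics"]}'

parsed = json.loads(record)            # parse the JSON text into a Python dict
print(parsed["name"], parsed["courses"])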
Metadata
• Metadata is not a separate data type, but it is one of the most important elements
for Big Data analysis and solutions.
• It is often described as data about data.
• In a set of photographs, for example, metadata could describe when and where the
photos were taken.
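A minimal sketch of what such metadata might look like in Python; all field names and values below are hypothetical, for illustration only.

# Hypothetical metadata describing a photograph ("data about data").
photo_metadata = {
    "file_name": "IMG_0042.jpg",
    "date_taken": "2020-03-15 14:02:11",
    "location": "Addis Ababa",
    "camera": "Canon EOS 1300D",
    "resolution": "5184x3456",
}
# The photo itself is the data; this dictionary only describes it.
print(photo_metadata["date_taken"], photo_metadata["location"])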
Data Value Chain
• The Data Value Chain is introduced to describe the information flow
within a big data system.
• It describes the full data lifecycle from collection to analysis and usage.
• The Big Data Value Chain identifies the following key high-level
activities: data acquisition, data analysis, data curation, data storage, and data usage.
Basic concepts of big data
• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
• According to IBM, big data is characterized by the 3Vs and more:
• Volume (amount of data): dealing with large scales of data within
data processing (e.g. Global Supply Chains, Global Financial
Analysis, Large Hadron Collider).
• Velocity (speed of data): dealing with streams of high frequency of
incoming real-time data (e.g. Sensors, Pervasive Environments, Electronic
Trading, Internet of Things).
• Variety (range of data types/sources): dealing with data using differing
syntactic formats (e.g. Spreadsheets, XML, DBMS), schemas, and meanings
(e.g. Enterprise Data Integration).
• Veracity: can we trust the data? How accurate is it? etc.
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Cluster computing refers to connecting many computers on a network so that they
perform like a single entity.
• Because of the qualities of big data, individual computers are often inadequate for handling
the data at most stages.
• Computer clusters are a better fit for the high storage and computational needs of
big data.
• Big data clustering software combines the resources of many smaller machines, seeking to
provide a number of benefits:
For example, suppose you have a big file containing more than 500 MB of data and you need to count
the number of words, but your computer has only 100 MB of memory. How can you handle it?
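One single-machine answer, sketched below, is to stream the file in small chunks instead of loading it all at once; cluster software generalizes the same idea by splitting the work across many machines. The file name and chunk size are assumptions for the example.

from collections import Counter

def count_words(path, chunk_size=1024 * 1024):
    """Count word frequencies in a file larger than memory by reading
    it one ~1 MB chunk at a time instead of loading it all at once."""
    counts = Counter()
    leftover = ""                        # a word that was cut at a chunk boundary
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            # keep the (possibly incomplete) last word for the next chunk
            if words and not chunk[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ""
            counts.update(words)
    if leftover:
        counts.update([leftover])
    return counts

# total_words = sum(count_words("big_file.txt").values())  # hypothetical file name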
• Resource Pooling: Combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important.
• Processing large datasets requires large amounts of all three of these
resources.
• High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
• Cluster membership and resource allocation can be handled by software
like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
• Hadoop is an open-source framework intended to make interaction with big
data easier.
• It is a framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
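As a sketch of the "simple programming models" idea, the word-count mapper and reducer below are written in the style of Hadoop Streaming, where Hadoop pipes input splits to the mapper's stdin and grouped, sorted key/value pairs to the reducer's stdin; the job-submission commands themselves are omitted here.

import sys

def mapper():
    """Emit one tab-separated (word, 1) pair per word read from stdin."""
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    """Sum the counts for each word; input arrives grouped and sorted by word."""
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

# In practice, mapper() and reducer() would live in two separate scripts that
# Hadoop Streaming runs on every node of the cluster.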
The four key characteristics of Hadoop are:
• Economical: Its systems are highly economical as ordinary computers can be used for data
processing.
• Reliable: It is reliable as it stores copies of the data on different machines and is resistant
to hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in
scaling up the framework.
• Flexible: It is flexible, so you can store as much structured and unstructured data as you
need and decide how to use it later.
• Hadoop has an ecosystem that has evolved from its four core
components: data management, data access, data processing, and data
storage.
Hadoop ecosystem
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
• The first stage of Big Data processing is Ingest.
• The data is ingested or transferred to Hadoop from various sources
such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers
event data.
2. Processing the data in storage
• The second stage is Processing. In this stage, the data is stored and
processed.
• The data is stored in the distributed file system HDFS and in the NoSQL
distributed database HBase; Spark and MapReduce perform the data
processing.
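The sketch below shows the same word-count idea expressed as a Spark job over data stored in HDFS; the HDFS path and application name are hypothetical, and it assumes a reachable cluster with the pyspark package installed.

from pyspark import SparkContext

# Classic word count over a file stored in HDFS (path below is hypothetical).
sc = SparkContext(appName="WordCount")

counts = (
    sc.textFile("hdfs:///user/student/big_file.txt")   # read the file from HDFS
      .flatMap(lambda line: line.split())              # split each line into words
      .map(lambda word: (word, 1))                     # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)                 # sum the counts per word
)

for word, count in counts.take(10):                    # show a small sample of results
    print(word, count)

sc.stop()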
3. Computing and analyzing data
• The third stage is to Analyze. Here, the data is analyzed by processing
frameworks such as Pig, Hive, and Impala.
• Pig converts the data using map and reduce steps and then analyzes it.
• Hive is also based on map and reduce programming and is most
suitable for structured data.
4. Visualizing the results
• The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
Thank you!!!