
Big Data

Facts and Figures
• Every minute on Facebook, 510,000 comments are posted, 293,000
statuses are updated, and 136,000 photos are uploaded.
• Instagram users upload 46,740 posts every minute!
• Since 2013, the number of Facebook posts shared each minute has
increased 22%, from 2.5 million to 3 million posts per minute in 2016.
This number has increased more than 300 percent, from around
650,000 posts per minute in 2011!
• YouTube usage more than tripled from 2014-2016 with users
uploading 400 hours of new video each minute of every day! In 2017,
users were watching 4,146,600 videos every minute.
Facts and Figures
• In total, 2.7 Zettabytes of data exist in our digital universe. So, how much is this? "A terabyte is equal to 1,024 gigabytes. A petabyte is equal to 1,024 terabytes. An exabyte is equal to 1,024 petabytes. A zettabyte is equal to 1,024 exabytes." (See the worked numbers right after this list.)
• 149,513 emails are sent every minute.
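To put the 1,024× ladder into concrete numbers, here is a minimal sketch in plain Java (illustrative only, no Hadoop dependency) that computes roughly how many gigabytes 2.7 zettabytes is:

public class DataUnits {
    public static void main(String[] args) {
        // GB -> TB -> PB -> EB -> ZB is four steps of 1,024 each.
        double gbPerZettabyte = Math.pow(1024, 4);
        double digitalUniverseZb = 2.7; // figure quoted above

        System.out.printf("1 ZB   = %.0f GB%n", gbPerZettabyte);
        System.out.printf("2.7 ZB = %.2e GB (roughly 3 trillion GB)%n",
                digitalUniverseZb * gbPerZettabyte);
    }
}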
By the year 2020
• 1.7 megabytes of new information will be created every second, per
person.
• Around 1/3 of all data will be processed through the cloud.
Introduction to Big Data
Big Data is everywhere. There is no place where Big Data does not exist!
The constant creation of data through social media, business applications, telecom and various other domains is leading to the formation of Big Data.
In other words, Big Data gets generated in multi-terabyte quantities. It changes fast and comes in a variety of forms that are difficult to manage and process using RDBMS or other traditional technologies.
Introduction to Big Data
Big Data solutions provide the tools, methodologies, and technologies used to capture, store, search and analyze data in seconds, to find relationships and insights for innovation and competitive gain that were previously unavailable.
About 80% of the data generated today is unstructured and cannot be handled by our traditional technologies. Earlier, the amount of data generated was not that high.
Big Data Use Cases
• Netflix Uses Big Data to Improve Customer Experience
• Sentiment analysis
• Customer Churn analysis
• Predictive analysis
• Burberry, a prominent British fashion brand, is also using big data and AI to
boost performance, sales and customer satisfaction.
• The Discover Weekly feature of Spotify is an excellent example. Each week,
Spotify offers every user a personalized playlist with music recommendations
based on their listening and browsing history.
History of Big Data
Roger Magoulas, in 2005, coined the term ‘Big Data’.

In the same year, the development of Hadoop started. It is an open-source framework that can process both structured and unstructured data. It was built on the ideas of Google's MapReduce and crafted at Yahoo!.

Today, the term Big Data pertains to the study and applications of data sets too complex for traditional data-processing software to handle. The field faces challenges in data capture, storage, analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources.
Big Data Technologies
Top big data technologies are divided into 4 fields which are classified as
follows:
• Data Storage
• Data Mining
• Data Analytics
• Data Visualization
Big Data Technologies
• Data Storage: Hadoop, MongoDB, RainStor, Hunk
• Data Mining: Presto, RapidMiner, ElasticSearch
• Data Analytics: Kafka, Splunk, KNIME, Spark, R Language, Blockchain
• Data Visualization: Tableau, Plotly
Emerging Big Data Technologies
• TensorFlow
• Beam
• Docker
• Airflow
• Kubernetes
Types of Big Data
Big Data could be found in three forms:
• Structured: Organized data format with a fixed schema. Ex: RDBMS
• Semi-Structured: Partially organized data which does not have a fixed format. Ex: XML, JSON
• Unstructured: Unorganized data with an unknown schema. Ex: Audio, video files etc.
Big Data Processing
Characteristics of Big Data
Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value derived from it. Also, whether particular data can actually be considered Big Data or not depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
Characteristics of Big Data
Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.
Variety
Characteristics of Big Data
Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
Velocity
Characteristics of Big Data
Value – When we talk about value, we're referring to the worth of the data being extracted. Having endless amounts of data is one thing, but unless it can be turned into value it is useless. While there is a clear link between data and insights, this does not always mean there is value in Big Data.
• If you're going to invest in the infrastructure required to collect and interpret data on a system-wide scale, it's important to ensure that the insights that are generated are based on accurate data and lead to measurable improvements at the end of the day.
Characteristics of Big Data
Veracity – It is the extended definition for Big Data, which refers to the data quality and the data value. The quality of captured data can vary greatly, affecting accurate analysis.
• Veracity is the quality or trustworthiness of the data. Just how accurate is all this data? For example, think about all the Twitter posts with hashtags, abbreviations, typos, etc., and the reliability and accuracy of all that content. Loads and loads of data are of no use if they are not accurate or trustworthy.
Veracity
Problems With Big Data
• Storing exponentially growing huge datasets – a Distributed File System could be the solution.
• Processing data having complex structure – data could be structured, unstructured or semi-structured.
• Processing data faster – the size of hard disks has increased, but transfer speed has not increased at that rate.
Hadoop is the Solution
Hadoop
• It is the technology to store massive datasets on a cluster of cheap machines in a distributed manner. Not only this, it also provides Big Data analytics through a distributed computing framework.
• Hadoop is a framework that allows you to first store Big Data in a distributed environment, so that you can process it in parallel.
• Hadoop provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.
Hadoop
Scalable – New nodes can be added as needed
Robust – Fault-tolerance mechanism
Cost Effective – Commodity computers
Flexible – Any kind of data, from any number of sources
Hadoop
It is open-source software developed as a project by the Apache Software Foundation.
Doug Cutting created Hadoop. In 2008, Yahoo gave Hadoop to the Apache Software Foundation.
Since then, two versions of Hadoop have come: version 1.0 in 2011 and version 2.0.6 in 2013.
Hadoop comes in various flavors like Cloudera, IBM BigInsights, MapR and Hortonworks.
Hadoop
• Major components in Hadoop:
• HDFS (Storage) – Hadoop Distributed File System
• YARN (Processing) – Yet Another Resource Negotiator
• MapReduce – Data processing layer of Hadoop (a minimal word-count sketch follows this list)
There are a few other components as well, like Hive, Apache Pig, Apache Spark, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper and Apache Oozie.
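To make the MapReduce processing layer concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the input and output paths are assumed to be passed as command-line arguments, and cluster settings are assumed to come from the usual Hadoop configuration files.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in a line of input.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as a combiner pre-aggregates counts on each mapper's output, which cuts the amount of data shuffled across the network.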
Usage of Hadoop components
• Install and work with a Hadoop installation using Hortonworks and the Ambari UI
• Manage Big Data on a cluster with HDFS and MapReduce
• Analyze data on Hadoop with Pig and Spark
• Store and query your data with Sqoop, Hive, MySQL, HBase, Cassandra, MongoDB, Drill, Phoenix, and Presto
• Manage the cluster with YARN, Mesos, ZooKeeper, Oozie, Zeppelin and Hue
• Handle streaming data with Kafka, Flume, Spark Streaming, Flink, and Storm
Hadoop Components

2 components of Hadoop and their daemons:
• HDFS (Storage): NameNode, DataNode
• YARN / MR (Processing): Resource Manager, Node Manager
Hadoop Components
2 components of Hadoop (Master – Slave):
1. HDFS (Storage): NameNode (Master), DataNode (Slave)
2. YARN / MR (Processing): Resource Manager (Master), Node Manager (Slave)
FSImage stands for File System Image. It is a file that stores the state of the Hadoop Distributed File System (HDFS) at a specific moment in time.

What does FSImage contain?
• Metadata: Information about the stored files, such as their size, path, owner, group, permissions, and block size
• Directory structure: The structure of the HDFS directory tree
• Data block location: Details about which data blocks make up each file
• Block storage: Information about which blocks are stored on which node

How is FSImage used?
• The NameNode loads FSImage when it starts up
• EditLogs, a transaction log, records the changes made to the HDFS file system since the last FSImage was created
• When the NameNode restarts, the EditLogs are applied to FSImage to get the most recent snapshot of the file system
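As a rough illustration of the kind of per-file metadata the NameNode serves (reconstructed at startup from FSImage plus EditLogs), the minimal sketch below asks for a file's status through the standard FileSystem API; the path /user/demo/sample.txt is hypothetical and a reachable cluster is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InspectMetadata {
  public static void main(String[] args) throws Exception {
    // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path used only for illustration.
    Path file = new Path("/user/demo/sample.txt");

    // The answer comes from the NameNode's in-memory namespace,
    // which is what FSImage + EditLogs reconstruct at startup.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("path        : " + status.getPath());
    System.out.println("size (bytes): " + status.getLen());
    System.out.println("owner/group : " + status.getOwner() + "/" + status.getGroup());
    System.out.println("permissions : " + status.getPermission());
    System.out.println("block size  : " + status.getBlockSize());
    System.out.println("replication : " + status.getReplication());

    fs.close();
  }
}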
Hadoop Architecture
The master is a high-end machine, whereas the slaves are inexpensive computers. Big Data files get divided into a number of blocks. Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes. The metadata is stored on the master.
Hadoop Architecture
NameNode: The NameNode performs the following functions –
• The NameNode daemon runs on the master machine.
• It is responsible for maintaining, monitoring and managing the DataNodes.
• It records the metadata of the files, like the location of blocks, file size, permissions, hierarchy, etc.
• The NameNode captures all the changes to the metadata, like deletion, creation and renaming of files, in edit logs.
• It regularly receives heartbeats and block reports from the DataNodes.
Hadoop Architecture
DataNode: The various functions of the DataNode are as follows –
• The DataNode runs on the slave machine.
• It stores the actual business data.
• It serves read-write requests from the user.
• The DataNode does the groundwork of creating, replicating and deleting blocks on the command of the NameNode.
• Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting the health of HDFS.
Hadoop Architecture
We replicate the data blocks present on the DataNodes, and the default replication factor is 3. Since we are using commodity hardware, whose failure rate is pretty high, HDFS will still have a copy of the lost data blocks if one of the DataNodes fails. You can also configure the replication factor based on your requirements.
Hadoop Architecture
• Data will be stored and processed only on the slave machines
• A file will be split into blocks (128 MB by default, which is configurable)
• Each block is replicated 3 times (the replication factor is configurable), as shown in the sketch after this list
• Each replica of a particular block is written to a separate slave machine, for fault tolerance
• When you read the data, any one of the replicas of a block will be used.
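A minimal sketch of how these two knobs can be set from a client, assuming the standard configuration keys dfs.blocksize and dfs.replication; the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Client-side defaults for files created through this configuration.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
    conf.setInt("dfs.replication", 3);                 // 3 replicas per block

    FileSystem fs = FileSystem.get(conf);

    // Replication can also be changed later for an existing file
    // (hypothetical path, for illustration only).
    Path existing = new Path("/user/demo/big-file.dat");
    fs.setReplication(existing, (short) 2);

    fs.close();
  }
}

Because block size and replication are per-file properties recorded by the NameNode, changing them here affects newly written files (or, via setReplication, an existing file) without touching the cluster-wide defaults in hdfs-site.xml.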
Diagram: A large file (100 TB) is divided into smaller blocks of 128 MB; the blocks (B1, B2, B3) are stored distributedly over the Hadoop cluster of master and slave nodes, and each block is replicated for fault tolerance.
1. Balancing Fault Tolerance and Network Overhead

Fault Tolerance: Storing data across two racks ensures that even if one rack fails, the data remains
accessible from the other rack. This satisfies the redundancy requirement for fault tolerance.

Network Overhead: Writing to three racks would require inter-rack data transfer for all three replicas,
increasing network traffic significantly. Sticking to two racks minimizes this overhead while still maintaining
adequate fault tolerance.

2. Optimal Replica Placement Strategy

In a typical 3-replica scenario:

One replica is stored on the same node where the client writes the data (local node).

A second replica is placed in a different rack to ensure fault tolerance in case an entire rack fails.

The third replica is stored on a different node in the same rack as the second replica to reduce inter-rack
traffic while still distributing the data for fault tolerance.
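To observe where the replicas of a file's blocks actually landed, a minimal sketch (hypothetical path, reachable cluster assumed) can ask the NameNode for the block locations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical file, used only for illustration.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));

    // One BlockLocation per block; each lists the DataNodes holding a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.printf("block %d (offset %d, length %d) on hosts: %s%n",
          i, blocks[i].getOffset(), blocks[i].getLength(),
          String.join(", ", blocks[i].getHosts()));
    }

    fs.close();
  }
}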
HDFS
HDFS is the primary storage system of Hadoop. It is a distributed file system that runs on commodity hardware.
• The default Big Data storage layer for Apache Hadoop is HDFS.
• You can interact directly with HDFS through shell-like commands (or programmatically, as sketched below).
• HDFS is optimized to process entire batches of a dataset quickly rather than processing records one by one.
• Distributed data will be processed in parallel.
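A minimal sketch of the programmatic route (the shell route would use commands such as hdfs dfs -put); the path is hypothetical and cluster settings are assumed to come from the usual configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write: the client streams data; HDFS splits it into blocks and replicates them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the client fetches the blocks from whichever DataNodes hold replicas.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }

    fs.close();
  }
}

The write path goes through the NameNode only for metadata; the actual bytes stream directly to the DataNodes in a replication pipeline.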
HDFS
Tasks of HDFS DataNode
• HDFS comprises 3 important components: NameNode, DataNode and Secondary NameNode.
• HDFS operates on a Master-Slave architecture model where the NameNode acts as the master node, keeping track of the storage cluster, and the DataNodes act as slave nodes running on the various systems within a Hadoop cluster.
• The DataNode performs operations like block replica creation, deletion, and replication according to the instructions of the NameNode.
• The DataNode manages the data storage of the system.
