
Big Data

Facts and Figures
• Every minute on Facebook, 510,000 comments are posted, 293,000
statuses are updated, and 136,000 photos are uploaded.
• Instagram users upload 46,740 posts every minute!
• Since 2013, the number of Facebook posts shared each minute has
increased 22%, from 2.5 million to 3 million posts per minute in 2016.
This number has increased more than 300 percent, from around
650,000 posts per minute in 2011!
• YouTube usage more than tripled from 2014-2016 with users
uploading 400 hours of new video each minute of every day! In 2017,
users were watching 4,146,600 videos every minute.
Facts and Figures
• In total, 2.7 Zettabytes of data exist in our digital universe. So, how much is this? "A terabyte is equal to 1,024 gigabytes. A petabyte is equal to 1,024 terabytes. An exabyte is equal to 1,024 petabytes. A zettabyte is equal to 1,024 exabytes." (See the worked numbers right after this list.)
• 149,513 emails are sent every minute.
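To put the 1,024× ladder into concrete numbers, here is a minimal sketch in plain Java (illustrative only, no Hadoop dependency) that computes roughly how many gigabytes 2.7 zettabytes is:

public class DataUnits {
    public static void main(String[] args) {
        // GB -> TB -> PB -> EB -> ZB is four steps of 1,024 each.
        double gbPerZettabyte = Math.pow(1024, 4);
        double digitalUniverseZb = 2.7; // figure quoted above

        System.out.printf("1 ZB   = %.0f GB%n", gbPerZettabyte);
        System.out.printf("2.7 ZB = %.2e GB (roughly 3 trillion GB)%n",
                digitalUniverseZb * gbPerZettabyte);
    }
}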
By the year 2020
• 1.7 megabytes of new information will be created every second, per
person.
• Around 1/3 of all data will be processed through the cloud.
Introduction to Big Data
Big Data is everywhere. There is no place where Big Data does not exist!
The constant creation of data through social media, business applications, telecom and various other domains is leading to the formation of Big Data.
In other words, Big Data gets generated in multi-terabyte quantities. It changes fast and comes in a variety of forms that are difficult to manage and process using RDBMS or other traditional technologies.
Introduction to Big Data
Big Data solutions provide the tools, methodologies, and technologies used to capture, store, search and analyze data in seconds, to find relationships and insights for innovation and competitive gain that were previously unavailable.
About 80% of the data generated today is unstructured and cannot be handled by our traditional technologies. Earlier, the amount of data generated was not that high.
Big Data Use Cases
• Netflix Uses Big Data to Improve Customer Experience
• Sentiment analysis
• Customer Churn analysis
• Predictive analysis
• Burberry, a prominent British fashion brand, is also using big data and AI to
boost performance, sales and customer satisfaction.
• The Discover Weekly feature of Spotify is an excellent example. Each week,
Spotify offers every user a personalized playlist with music recommendations
based on their listening and browsing history.
History of Big Data
Roger Magoulas, in 2005, coined the term ‘Big Data’.

In the same year, the development of Hadoop started. It is an open-source framework that can process both structured and unstructured data. It was built on the ideas of Google's MapReduce and crafted at Yahoo!.

Today, the term Big Data pertains to the study and applications of data sets too complex for traditional data-processing software to handle. The field faces challenges in data capture, storage, analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources.
Big Data Technologies
Top big data technologies are divided into 4 fields which are classified as
follows:
• Data Storage
• Data Mining
• Data Analytics
• Data Visualization
Big Data Technologies
• Data Storage: Hadoop, MongoDB, RainStor, Hunk
• Data Mining: Presto, RapidMiner, ElasticSearch
• Data Analytics: Kafka, Splunk, KNIME, Spark, R Language, Blockchain
• Data Visualization: Tableau, Plotly
Emerging Big Data Technologies
• TensorFlow
• Beam
• Docker
• Airflow
• Kubernetes
Types of Big Data
Big Data could be found in three forms:
• Structured: Organized data format with a fixed schema. Ex: RDBMS
• Semi-Structured: Partially organized data which does not have a fixed format. Ex: XML, JSON
• Unstructured: Unorganized data with an unknown schema. Ex: Audio, video files etc.
Big Data Processing
Characteristics of Big Data
Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value derived from it. Also, whether particular data can actually be considered Big Data or not depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
Characteristics of Big Data
Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.
Variety
Characteristics of Big Data
Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
Velocity
Characteristics of Big Data
Value – When we talk about value, we're referring to the worth of the data being extracted. Having endless amounts of data is one thing, but unless it can be turned into value it is useless. While there is a clear link between data and insights, this does not always mean there is value in Big Data.
• If you're going to invest in the infrastructure required to collect and interpret data on a system-wide scale, it's important to ensure that the insights that are generated are based on accurate data and lead to measurable improvements at the end of the day.
Characteristics of Big Data
Veracity – It is the extended definition for Big Data, which refers to the data quality and the data value. The quality of captured data can vary greatly, affecting accurate analysis.
• Veracity is the quality or trustworthiness of the data. Just how accurate is all this data? For example, think about all the Twitter posts with hashtags, abbreviations, typos, etc., and the reliability and accuracy of all that content. Loads and loads of data are of no use if they are not accurate or trustworthy.
Veracity
Problems With Big Data
• Storing exponentially growing huge datasets – a Distributed File System could be the solution.
• Processing data having complex structure – data could be structured, unstructured or semi-structured.
• Processing data faster – the size of hard disks has increased, but transfer speed has not increased at that rate.
Hadoop is the Solution
Hadoop
• It is the technology to store massive datasets on a cluster of cheap machines in a distributed manner. Not only this, it also provides Big Data analytics through a distributed computing framework.
• Hadoop is a framework that allows you to first store Big Data in a distributed environment, so that you can process it in parallel.
• Hadoop provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data.
Hadoop
Scalable – New nodes can be added as needed
Robust – Fault-tolerance mechanism
Cost Effective – Commodity computers
Flexible – Any kind of data, from any number of sources
Hadoop
It is open-source software developed as a project by the Apache Software Foundation.
Doug Cutting created Hadoop. In 2008, Yahoo gave Hadoop to the Apache Software Foundation.
Since then, two versions of Hadoop have come: version 1.0 in 2011 and version 2.0.6 in 2013.
Hadoop comes in various flavors like Cloudera, IBM BigInsights, MapR and Hortonworks.
Hadoop
• Major components in Hadoop:
• HDFS (Storage) – Hadoop Distributed File System
• YARN (Processing) – Yet Another Resource Negotiator
• MapReduce – Data processing layer of Hadoop (a minimal word-count sketch follows this list)
There are a few other components as well, like Hive, Apache Pig, Apache Spark, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper and Apache Oozie.
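To make the MapReduce processing layer concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the input and output paths are assumed to be passed as command-line arguments, and cluster settings are assumed to come from the usual Hadoop configuration files.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in a line of input.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as a combiner pre-aggregates counts on each mapper's output, which cuts the amount of data shuffled across the network.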
Usage of Hadoop components
• Install and work with a Hadoop installation using Hortonworks and the Ambari UI
• Manage Big Data on a cluster with HDFS and MapReduce
• Analyze data on Hadoop with Pig and Spark
• Store and query your data with Sqoop, Hive, MySQL, HBase, Cassandra, MongoDB, Drill, Phoenix, and Presto
• Manage the cluster with YARN, Mesos, ZooKeeper, Oozie, Zeppelin and Hue
• Handle streaming data with Kafka, Flume, Spark Streaming, Flink, and Storm
Hadoop Components

2 components of Hadoop and their daemons:
• HDFS (Storage): NameNode, DataNode
• YARN / MR (Processing): Resource Manager, Node Manager
Hadoop Components
2 components of Hadoop (Master – Slave):
1. HDFS (Storage): NameNode (Master), DataNode (Slave)
2. YARN / MR (Processing): Resource Manager (Master), Node Manager (Slave)
FSImage stands for File System Image. It is a file that stores the state of the Hadoop Distributed File System (HDFS) at a specific moment in time.

What does FSImage contain?
• Metadata: Information about the stored files, such as their size, path, owner, group, permissions, and block size
• Directory structure: The structure of the HDFS directory tree
• Data block location: Details about which data blocks make up each file
• Block storage: Information about which blocks are stored on which node

How is FSImage used?
• The NameNode loads FSImage when it starts up
• EditLogs, a transaction log, records the changes made to the HDFS file system since the last FSImage was created
• When the NameNode restarts, the EditLogs are applied to FSImage to get the most recent snapshot of the file system
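As a rough illustration of the kind of per-file metadata the NameNode serves (reconstructed at startup from FSImage plus EditLogs), the minimal sketch below asks for a file's status through the standard FileSystem API; the path /user/demo/sample.txt is hypothetical and a reachable cluster is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InspectMetadata {
  public static void main(String[] args) throws Exception {
    // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path used only for illustration.
    Path file = new Path("/user/demo/sample.txt");

    // The answer comes from the NameNode's in-memory namespace,
    // which is what FSImage + EditLogs reconstruct at startup.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("path        : " + status.getPath());
    System.out.println("size (bytes): " + status.getLen());
    System.out.println("owner/group : " + status.getOwner() + "/" + status.getGroup());
    System.out.println("permissions : " + status.getPermission());
    System.out.println("block size  : " + status.getBlockSize());
    System.out.println("replication : " + status.getReplication());

    fs.close();
  }
}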
Hadoop Architecture
The master is a high-end machine, whereas the slaves are inexpensive computers. Big Data files get divided into a number of blocks. Hadoop stores these blocks in a distributed fashion on the cluster of slave nodes. The metadata is stored on the master.
Hadoop Architecture
NameNode: The NameNode performs the following functions –
• The NameNode daemon runs on the master machine.
• It is responsible for maintaining, monitoring and managing the DataNodes.
• It records the metadata of the files, like the location of blocks, file size, permissions, hierarchy, etc.
• The NameNode captures all the changes to the metadata, like deletion, creation and renaming of files, in edit logs.
• It regularly receives heartbeats and block reports from the DataNodes.
Hadoop Architecture
DataNode: The various functions of the DataNode are as follows –
• The DataNode runs on the slave machine.
• It stores the actual business data.
• It serves read-write requests from the user.
• The DataNode does the groundwork of creating, replicating and deleting blocks on the command of the NameNode.
• Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting the health of HDFS.
Hadoop Architecture
We replicate the data blocks present on the DataNodes, and the default replication factor is 3. Since we are using commodity hardware, whose failure rate is pretty high, HDFS will still have a copy of the lost data blocks if one of the DataNodes fails. You can also configure the replication factor based on your requirements.
Hadoop Architecture
• Data will be stored and processed only on the slave machines
• A file will be split into blocks (128 MB by default, which is configurable)
• Each block is replicated 3 times (the replication factor is configurable), as shown in the sketch after this list
• Each replica of a particular block is written to a separate slave machine, for fault tolerance
• When you read the data, any one of the replicas of a block will be used.
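A minimal sketch of how these two knobs can be set from a client, assuming the standard configuration keys dfs.blocksize and dfs.replication; the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Client-side defaults for files created through this configuration.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
    conf.setInt("dfs.replication", 3);                 // 3 replicas per block

    FileSystem fs = FileSystem.get(conf);

    // Replication can also be changed later for an existing file
    // (hypothetical path, for illustration only).
    Path existing = new Path("/user/demo/big-file.dat");
    fs.setReplication(existing, (short) 2);

    fs.close();
  }
}

Because block size and replication are per-file properties recorded by the NameNode, changing them here affects newly written files (or, via setReplication, an existing file) without touching the cluster-wide defaults in hdfs-site.xml.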
Diagram: A large file (100 TB) is divided into smaller blocks of 128 MB; the blocks (B1, B2, B3) are stored distributedly over the Hadoop cluster of master and slave nodes, and each block is replicated for fault tolerance.
1. Balancing Fault Tolerance and Network Overhead

Fault Tolerance: Storing data across two racks ensures that even if one rack fails, the data remains
accessible from the other rack. This satisfies the redundancy requirement for fault tolerance.

Network Overhead: Writing to three racks would require inter-rack data transfer for all three replicas,
increasing network traffic significantly. Sticking to two racks minimizes this overhead while still maintaining
adequate fault tolerance.

2. Optimal Replica Placement Strategy

In a typical 3-replica scenario:

One replica is stored on the same node where the client writes the data (local node).

A second replica is placed in a different rack to ensure fault tolerance in case an entire rack fails.

The third replica is stored on a different node in the same rack as the second replica to reduce inter-rack
traffic while still distributing the data for fault tolerance.
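To observe where the replicas of a file's blocks actually landed, a minimal sketch (hypothetical path, reachable cluster assumed) can ask the NameNode for the block locations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical file, used only for illustration.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));

    // One BlockLocation per block; each lists the DataNodes holding a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.printf("block %d (offset %d, length %d) on hosts: %s%n",
          i, blocks[i].getOffset(), blocks[i].getLength(),
          String.join(", ", blocks[i].getHosts()));
    }

    fs.close();
  }
}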
HDFS
HDFS is the primary storage system of Hadoop. It is a distributed file system that runs on commodity hardware.
• The default Big Data storage layer for Apache Hadoop is HDFS.
• You can interact directly with HDFS through shell-like commands (or programmatically, as sketched below).
• HDFS is optimized to process entire batches of a dataset quickly rather than processing records one by one.
• Distributed data will be processed in parallel.
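A minimal sketch of the programmatic route (the shell route would use commands such as hdfs dfs -put); the path is hypothetical and cluster settings are assumed to come from the usual configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write: the client streams data; HDFS splits it into blocks and replicates them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the client fetches the blocks from whichever DataNodes hold replicas.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }

    fs.close();
  }
}

The write path goes through the NameNode only for metadata; the actual bytes stream directly to the DataNodes in a replication pipeline.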
HDFS
Tasks of HDFS DataNode
• HDFS comprises 3 important components: NameNode, DataNode and Secondary NameNode.
• HDFS operates on a Master-Slave architecture model where the NameNode acts as the master node, keeping track of the storage cluster, and the DataNodes act as slave nodes running on the various systems within a Hadoop cluster.
• The DataNode performs operations like block replica creation, deletion, and replication according to the instructions of the NameNode.
• The DataNode manages the data storage of the system.
