Unit-III

Chapter-I
Understanding MapReduce Fundamentals and HBase :-
1) MapReduce Framework:-
MapReduce is one of the three core components of Hadoop. The first component, the
Hadoop Distributed File System (HDFS), is responsible for storing files, while the second
component, MapReduce, is responsible for processing them.
MapReduce mainly has two tasks, which are divided phase-wise: Map is used in the first
phase and Reduce in the next.

Suppose there is a word file containing some text. Let us name this file sample.txt. Note
that we use Hadoop to deal with huge files, but for the sake of easy explanation here,
we are taking a small text file as an example. So, let's assume that this sample.txt file
contains a few lines of text. The content of the file is as follows:

Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Hence, the above 8 lines are the content of the file. Let’s assume that while storing this file
in Hadoop, HDFS broke this file into four parts and named each part as first.txt, second.txt,
third.txt, and fourth.txt. So, you can easily see that the above file will be divided into four
equal parts and each part will contain 2 lines. First two lines will be in the file first.txt, next
two lines in second.txt, next two in third.txt and the last two lines will be stored in
fourth.txt. All these files will be stored in Data Nodes and the Name Node will contain the
metadata about them. All this is the task of HDFS. Now, suppose a user wants to process
this file. This is where Map-Reduce comes into the picture. Suppose this user wants to run
a query on this sample.txt. So, instead of bringing sample.txt to the local computer, we
will send the query to the data. To keep track of our request, we use the Job Tracker (a
master service). The Job Tracker traps our request and keeps track of it. Now suppose that
the user wants to run his query on sample.txt and wants the output in a result.output file.
Let the file containing the query be query.jar. The user will then run a command like:
$ hadoop jar query.jar DriverCode sample.txt result.output

1. query.jar: the jar file containing the query code that processes the input file.
2. sample.txt: the input file.
3. result.output: the directory in which the output of the processing will be received.
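The DriverCode named in the command above bundles the user's map and reduce logic. Below is a minimal sketch, assuming the standard Hadoop Java MapReduce API, of a word-count Mapper and Reducer that could process sample.txt; the class names are illustrative only, not taken from this document.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in a line of the input split
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts received for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}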

2) Exploring the features of MapReduce :-


Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a large
number of nodes in a cluster. This allows it to handle massive datasets, making it suitable
for Big Data applications.

Fault Tolerance
MapReduce incorporates built-in fault tolerance to ensure the reliable processing of data. It
automatically detects and handles node failures, rerunning tasks on available nodes as
needed.

Data Locality
MapReduce takes advantage of data locality by processing data on the same node where it
is stored, minimizing data movement across the network and improving overall
performance.

Simplicity
The MapReduce programming model abstracts away many complexities associated with
distributed computing, allowing developers to focus on their data processing logic rather
than low-level details.

Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make storing and
processing extensive data sets very economical.

Parallel Programming
A job is divided into independent tasks that can be executed simultaneously. Because these
distributed tasks are processed in parallel by multiple processors across the cluster, each
node handles only a portion of the work and the overall program completes faster.

3) Working of MapReduce:-
MapReduce is a framework in which we can write applications to process huge amounts
of data in parallel, on large clusters of commodity hardware, in a
reliable manner.
Different Phases of MapReduce:-
MapReduce model has three major and one optional phase.
• Mapping
• Shuffling and Sorting
• Reducing
• Combining (optional)
Map-Reduce is a processing framework used to process data over a large
number of machines. Hadoop uses Map-Reduce to process the data
distributed in a Hadoop cluster. Map-Reduce is not like other conventional
processing frameworks such as Hibernate, the JDK, or .NET. Those frameworks are
designed to be used with a traditional system where the data is stored at a single
location, such as a Network File System or an Oracle database. When we process
big data, however, the data is located on multiple commodity machines with the
help of HDFS.

Let’s take an example where you have a file of 10TB in size to process on
Hadoop. The 10TB of data is first distributed across multiple nodes on
Hadoop with HDFS. Now we have to process it, and for that we have the Map-
Reduce framework. To process this data with Map-Reduce, we write a Driver
code, which configures and submits the processing as a Job; a minimal driver
sketch follows.
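As a rough illustration, and assuming the Mapper and Reducer sketched earlier, a Driver for the query.jar example might look like the following. The class name DriverCode comes from the command shown earlier; everything else is an assumption, not the document's own code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver code: configures the Job and submits it to the cluster
public class DriverCode {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(DriverCode.class);
        job.setMapperClass(WordCountMapper.class);      // Mapping phase
        job.setCombinerClass(WordCountReducer.class);   // optional Combining phase
        job.setReducerClass(WordCountReducer.class);    // Reducing phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. sample.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. result.output
        // Shuffling and Sorting happen automatically between the map and reduce phases
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}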
4) Techniques to optimize MapReduce Job:-
Beyond the code written for the main application, the performance and reliability of
MapReduce jobs can be optimized by using some techniques. These MapReduce
optimization techniques can be organized into the following categories:
1) Hardware or network topology
2) Synchronization
3) File system
4)a) Hardware/Network Topology:-
MapReduce makes it possible to run MapReduce tasks on inexpensive
clusters of commodity computers. These computers can be connected through standard
networks. The performance and fault tolerance required for Big Data operations are also
influenced by the physical location of servers. Usually, the data center arranges the
hardware in racks. The performance offered by hardware systems that are located in the
same rack where the data is stored will be higher than the performance of hardware
systems that are located in a different rack than the one containing the data. The reason
for the low performance of the hardware that is located away from the data is the
requirement to move the data and/or application code.

4)b)Synchronization :-
The completion of map processing enables the reduce function to combine the various
outputs to provide the final result. However, performance would degrade if the mapping
results simply remained on the nodes where the processing began. To improve
performance, the results are copied from the mapping nodes to the reducing nodes, which
start their processing tasks immediately. In addition, a hashing function sends all the
values marked with the same key to the same reducer node (mirrored in the partitioner
sketch below), so the overall efficiency of the system increases. The best possible
outcomes are obtained by writing the outputs of the reduce operations directly to the file
system. This entire process is synchronized to provide efficiency to the MapReduce
programming model.
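To make the key-to-reducer routing concrete, the sketch below mirrors the behaviour of Hadoop's default HashPartitioner: hashing the key and taking it modulo the number of reduce tasks guarantees that all values with the same key reach the same reducer. The class name is illustrative.

import org.apache.hadoop.mapreduce.Partitioner;

// Hash-based routing: records with equal keys always map to the same partition (reducer)
public class HashKeyPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask the sign bit so the result is a valid, non-negative partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}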

4)c)File System :-
A distributed file system is used to support the implementation of the MapReduce
operation.
Distributed file systems differ from local file systems in that local file systems have far
less capacity for storing and organizing data.
The Big Data world has immense information to be processed and requires data to be
distributed among a number of systems or nodes in a cluster so that the data can be
handled efficiently.
The distribution model followed in implementing the MapReduce Programming approach
is to use the master and slaves model. All the metadata and access rights, along with the
block mapping and file locations, are stored with the master. On the other hand, the data
on which the application code will run is kept with the slaves. The master node receives all
the requests, which it forwards to the appropriate slaves to perform the required
actions.

CHAPTER-II
Understanding Big Data Technology Foundations :-
Exploring the Big Data Stack:-
Big Data Stack refers to the collection of technologies and tools used to handle and analyze
large volumes of data. These stacks typically include hardware, software, and
programming languages that work together to store, process, and analyze data.
Some standard components of a Big Data Stack include distributed file systems like
Hadoop, data processing frameworks like Apache Spark, and databases like Apache
Cassandra or MongoDB. These technologies allow organizations to efficiently manage and
derive insights from massive amounts of data, enabling them to make data-driven
decisions and gain a competitive advantage in today’s data-driven world.

Types of Layers:-
1) Data Source Layer:-
The Data Source Layer is the first layer in the Big Data stack, responsible for collecting and
ingesting data from various sources. This layer is also known as the Data Ingestion Layer or
Data Acquisition Layer.
Key Characteristics:
1. Data Collection: Gathering data from various sources, such as social media, sensors,
databases, files, and applications.
2. Data Ingestion: Ingesting collected data into the Big Data system, often in real-time or
near-real-time.
3. Data Variety: Handling diverse data formats, structures, and velocities.

Common Data Sources:


1. Social Media: Twitter, Facebook, Instagram, etc.
2. Sensors: IoT devices, industrial sensors, environmental sensors, etc.
3. Databases: Relational databases, NoSQL databases, graph databases, etc.
4. Files: Text files, CSV files, JSON files, etc.
5. Applications: Web applications, mobile applications, enterprise applications, etc.
2)Ingestion Layer:-
The Ingestion Layer is a critical component of the Big Data stack, responsible for collecting,
processing, and transporting large volumes of data from various sources to a centralized
data storage system.
Key Characteristics:
1. Data Collection: Gathering data from diverse sources, such as logs, sensors, social
media, and databases.
2. Data Processing: Transforming, aggregating, and filtering data in real-time or near-real-
time.
3. Data Transportation: Moving processed data to a centralized data storage system, such
as Hadoop Distributed File System (HDFS), NoSQL databases, or cloud storage.

Ingestion Layer Functions:-


1. Data Filtering: Removing unnecessary data to reduce storage and processing costs.
2. Data Transformation: Converting data formats, aggregating data, and performing
calculations.
3. Data Validation: Checking data for errors, inconsistencies, and completeness.
4. Data Serialization: Converting data into a format suitable for storage and processing.
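As a small, hedged illustration of these four functions, the Java sketch below filters, validates, transforms, and serializes a single comma-separated sensor record before it would be handed to the storage layer; the class name, method names, and record format are all hypothetical.

import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical ingestion step for one CSV record: id,reading,timestamp
public class IngestionStep {

    // Data Filtering + Data Validation: drop blank lines and records with missing fields
    static Optional<String[]> validate(String rawLine) {
        if (rawLine == null || rawLine.isBlank()) {
            return Optional.empty();
        }
        String[] fields = rawLine.split(",");
        return fields.length == 3 ? Optional.of(fields) : Optional.empty();
    }

    // Data Transformation: normalise fields (trim whitespace, lower-case the id)
    static String transform(String[] fields) {
        return fields[0].trim().toLowerCase() + "," + fields[1].trim() + "," + fields[2].trim();
    }

    // Data Serialization: convert to bytes for transport to the storage layer
    static byte[] serialize(String record) {
        return record.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        validate("  SensorA , 21.5 , 2024-01-01 ")
            .map(IngestionStep::transform)
            .map(IngestionStep::serialize)
            .ifPresent(bytes -> System.out.println(bytes.length + " bytes ready for storage"));
    }
}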

3) Storage Layer :-
The Storage Layer is a fundamental component of the Big Data stack, responsible for
storing and managing large volumes of structured, semi-structured, and unstructured
data.

Storage Layer Components :


1. Distributed File Systems: Hadoop Distributed File System (HDFS) and Ceph are examples
of distributed file systems designed for Big Data storage.
2. NoSQL Databases: Apache Cassandra, Apache HBase, Couchbase, and MongoDB are
popular NoSQL databases for storing structured, semi-structured, and unstructured data.
3. Cloud Storage: Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage are
examples of cloud storage services for Big Data.
4. Object Stores: OpenStack Swift, Riak, and Scality are examples of object stores
designed for storing and retrieving large amounts of unstructured data.

Storage Layer Functions


1. Data Storage: Stores large volumes of data in a scalable, flexible, and reliable manner.
2. Data Retrieval: Provides efficient data retrieval mechanisms, including batch, real-time,
and interactive query support.
3. Data Management: Offers data management capabilities, including data governance,
security, and lifecycle management.
4. Data Integration: Supports data integration with various data sources, including
relational databases, NoSQL databases, and cloud storage.
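As a brief illustration of the Data Storage and Data Retrieval functions, the sketch below writes a record to HDFS and reads it back through Hadoop's FileSystem Java API; the path is purely illustrative, and the configuration is assumed to point at a running cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Store a small record in HDFS and retrieve it again
public class HdfsStoreAndRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/bigdata/demo/record.txt");   // illustrative path

        // Data Storage: write bytes into the distributed file system
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello storage layer");
        }

        // Data Retrieval: read the same record back
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}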

4) Physical Infrastructure Layer :-


The Physical Infrastructure Layer is the foundation of the Big Data stack, providing the
underlying hardware and infrastructure necessary for storing, processing, and analyzing
large volumes of data.
Key Characteristics:
1. Scalability: The Physical Infrastructure Layer is designed to scale horizontally, adding
more servers, storage, and networking resources as needed.
2. Flexibility: The layer supports a variety of hardware and software configurations,
enabling organizations to choose the best infrastructure for their specific Big Data
workloads.
3. Reliability: Redundancy and failover mechanisms are built into the Physical
Infrastructure Layer to ensure high availability and minimize downtime.
4. Security: Physical and logical security measures, such as access controls and encryption,
protect the infrastructure and data from unauthorized access.

5) Platform Management Layer:-


The Platform Management Layer is a critical component of the Big Data stack, responsible
for managing and orchestrating the underlying infrastructure and resources.
Key Functions:
1. Resource Management: Manages compute, storage, and networking resources, ensuring
efficient allocation and utilization.
2. Cluster Management: Oversees the creation, configuration, and management of
clusters, including node management and scaling.
3. Job Scheduling: Schedules and manages jobs, workflows, and applications, ensuring
efficient execution and resource utilization.
4. Monitoring and Logging: Provides real-time monitoring and logging capabilities,
enabling administrators to track performance, identify issues, and optimize operations.
5. Security and Access Control: Implements security measures, such as authentication,
authorization, and encryption, to protect data and resources.
Key Components:
1. Cluster Managers: Apache Mesos, Apache Hadoop YARN, and Kubernetes are popular
cluster managers.
2. Resource Managers: Apache Hadoop YARN and Apache Mesos are examples of
resource managers.
3. Job Schedulers: Apache Oozie and Apache Airflow are popular job
schedulers.
4. Monitoring Tools: Apache Ambari, Prometheus, and Grafana are widely used monitoring
tools.

6) Security Layer :-
The Security Layer is a critical component of the Big Data stack, responsible for protecting
sensitive data and ensuring the confidentiality, integrity, and availability of Big Data assets.

Key Components:-
1. Authentication Systems: Kerberos, LDAP, and Active Directory are popular
authentication systems.
2. Authorization Frameworks: Apache Sentry, Apache Ranger, and Apache Knox provide
authorization and access control capabilities.
3. Encryption Tools: OpenSSL, SSL/TLS, and AES are widely used for encrypting data in
transit and at rest.
4. Data Masking Tools: tools such as Apache Ranger (which can apply column masking to
data queried through Apache Hive and Apache Impala) and dedicated data masking
libraries provide data masking capabilities.
5. Security Information and Event Management (SIEM) Systems: Splunk, ELK Stack, and
Apache Metron provide SIEM capabilities.
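Since AES is listed above, the following sketch shows the kind of field-level encryption the Security Layer can apply, using only the standard javax.crypto API; the key handling is deliberately simplified, as in practice keys would come from a key management service, and the field value is made up.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Encrypt one sensitive field with AES-GCM
public class FieldEncryption {
    public static void main(String[] args) throws Exception {
        // Generate a 256-bit AES key (simplified; real systems use a key management service)
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();

        // A fresh random 96-bit IV/nonce for each encryption
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        // Encrypt the field value
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal("account=1234567890".getBytes(StandardCharsets.UTF_8));

        System.out.println("encrypted field is " + ciphertext.length + " bytes");
    }
}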

7) Monitoring Layer:-
The Monitoring Layer is a critical component of the Big Data stack, responsible for tracking
and analyzing the performance, health, and efficiency of Big Data systems, applications,
and infrastructure.
Key Benefits:
1. Improved System Performance: Identifies performance bottlenecks and issues, enabling
administrators to optimize system performance.
2. Enhanced System Reliability: Detects system health issues, enabling administrators to
take proactive measures to prevent system crashes and downtime.
3. Increased Efficiency: Analyzes system efficiency, enabling administrators to optimize
resource utilization and reduce costs.
4. Faster Issue Resolution: Enables administrators to quickly identify and resolve issues,
reducing mean time to detect (MTTD) and mean time to resolve (MTTR); a simple way to
compute these two metrics is sketched after this list.
5. Data-Driven Decision Making: Provides insights and metrics, enabling administrators and
stakeholders to make data-driven decisions.
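As a small illustration of the MTTD and MTTR metrics mentioned in point 4, the sketch below averages detection and resolution times over a list of incidents; the timestamps and the exact definitions used (both measured from the time the issue occurred) are assumptions made for the example.

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Compute mean time to detect (MTTD) and mean time to resolve (MTTR) from incident records
public class IncidentMetrics {
    record Incident(Instant occurred, Instant detected, Instant resolved) {}

    static Duration mean(List<Duration> durations) {
        long totalSeconds = durations.stream().mapToLong(Duration::getSeconds).sum();
        return Duration.ofSeconds(totalSeconds / durations.size());
    }

    public static void main(String[] args) {
        List<Incident> incidents = List.of(
            new Incident(Instant.parse("2024-01-01T10:00:00Z"),
                         Instant.parse("2024-01-01T10:05:00Z"),
                         Instant.parse("2024-01-01T10:45:00Z")),
            new Incident(Instant.parse("2024-01-02T02:00:00Z"),
                         Instant.parse("2024-01-02T02:15:00Z"),
                         Instant.parse("2024-01-02T03:00:00Z")));

        // MTTD: average time from occurrence to detection
        Duration mttd = mean(incidents.stream()
            .map(i -> Duration.between(i.occurred(), i.detected())).toList());
        // MTTR: average time from occurrence to resolution
        Duration mttr = mean(incidents.stream()
            .map(i -> Duration.between(i.occurred(), i.resolved())).toList());

        System.out.println("MTTD = " + mttd.toMinutes() + " min, MTTR = " + mttr.toMinutes() + " min");
    }
}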

8) Visualization Layer:-
The Visualization Layer is the topmost layer of the Big Data stack, responsible for
presenting complex data insights and analytics in a clear, concise, and actionable manner.
Benefits:
1. Improved Decision Making: Enables stakeholders to make data-driven decisions by
presenting complex data insights in a clear and actionable manner.
2. Enhanced Collaboration: Facilitates collaboration among stakeholders by providing a
common platform for data visualization and exploration.
3. Increased Productivity: Automates reporting and analytics, freeing up resources for
more strategic and analytical work.
4. Better Storytelling: Presents data insights in a narrative format, making it easier for non-
technical stakeholders to understand and engage with the data.
