Unit-III

Chapter-I
Understanding MapReduce Fundamentals and HBase :-
1) MapReduce Framework:-
MapReduce is one of the three core components of Hadoop. The first component, the
Hadoop Distributed File System (HDFS), is responsible for storing files, while the second
component, MapReduce, is responsible for processing them.
MapReduce mainly has two tasks, which are divided phase-wise: Map is used in the first
phase and Reduce in the next.

Suppose there is a word file containing some text. Let us name this file sample.txt. Note
that we use Hadoop to deal with huge files, but for the sake of easy explanation here,
we are taking a small text file as an example. So, let's assume that this sample.txt file
contains a few lines of text. The content of the file is as follows:

Hello I am GeeksforGeeks
How can I help you
How can I assist you
Are you an engineer
Are you looking for coding
Are you looking for interview questions
what are you doing these days
what are your strengths
Hence, the above 8 lines are the content of the file. Let’s assume that while storing this file
in Hadoop, HDFS broke this file into four parts and named each part as first.txt, second.txt,
third.txt, and fourth.txt. So, you can easily see that the above file will be divided into four
equal parts and each part will contain 2 lines. First two lines will be in the file first.txt, next
two lines in second.txt, next two in third.txt and the last two lines will be stored in
fourth.txt. All these files will be stored in Data Nodes and the Name Node will contain the
metadata about them. All this is the task of HDFS. Now, suppose a user wants to process
this file. This is where Map-Reduce comes into the picture. Suppose this user wants to run
a query on this sample.txt. So, instead of bringing sample.txt to the local computer, we
will send the query to the data. To keep track of our request, we use the Job Tracker (a
master service). The Job Tracker traps our request and keeps track of it. Now suppose that
the user wants to run his query on sample.txt and wants the output in a result.output file.
Let the file containing the query be query.jar. The user will then run a command like:
$ hadoop jar query.jar DriverCode sample.txt result.output

1. query.jar: the jar file containing the query code that processes the input file.
2. sample.txt: the input file.
3. result.output: the directory in which the output of the processing will be received.
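The DriverCode named in the command above bundles the user's map and reduce logic. Below is a minimal sketch, assuming the standard Hadoop Java MapReduce API, of a word-count Mapper and Reducer that could process sample.txt; the class names are illustrative only, not taken from this document.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in a line of the input split
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts received for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}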

2) Exploring the features of MapReduce :-


Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a large
number of nodes in a cluster. This allows it to handle massive datasets, making it suitable
for Big Data applications.

Fault Tolerance
MapReduce incorporates built-in fault tolerance to ensure the reliable processing of data. It
automatically detects and handles node failures, rerunning tasks on available nodes as
needed.

Data Locality
MapReduce takes advantage of data locality by processing data on the same node where it
is stored, minimizing data movement across the network and improving overall
performance.

Simplicity
The MapReduce programming model abstracts away many complexities associated with
distributed computing, allowing developers to focus on their data processing logic rather
than low-level details.

Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make storing and
processing extensive data sets very economical.

Parallel Programming
A job is divided into independent tasks that can be executed simultaneously. Because these
distributed tasks are processed in parallel by multiple processors across the cluster, each
node handles only a portion of the work and the overall program completes faster.

3) Working of MapReduce:-
MapReduce is a framework in which we can write applications to process huge amounts
of data in parallel, on large clusters of commodity hardware, in a
reliable manner.
Different Phases of MapReduce:-
MapReduce model has three major and one optional phase.
• Mapping
• Shuffling and Sorting
• Reducing
• Combining (optional)
Map-Reduce is a processing framework used to process data over a large
number of machines. Hadoop uses Map-Reduce to process the data
distributed in a Hadoop cluster. Map-Reduce is not like other conventional
processing frameworks such as Hibernate, the JDK, or .NET. Those frameworks are
designed to be used with a traditional system where the data is stored at a single
location, such as a Network File System or an Oracle database. When we process
big data, however, the data is located on multiple commodity machines with the
help of HDFS.

Let’s take an example where you have a file of 10TB in size to process on
Hadoop. The 10TB of data is first distributed across multiple nodes on
Hadoop with HDFS. Now we have to process it, and for that we have the Map-
Reduce framework. To process this data with Map-Reduce, we write a Driver
code, which configures and submits the processing as a Job; a minimal driver
sketch follows.
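As a rough illustration, and assuming the Mapper and Reducer sketched earlier, a Driver for the query.jar example might look like the following. The class name DriverCode comes from the command shown earlier; everything else is an assumption, not the document's own code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver code: configures the Job and submits it to the cluster
public class DriverCode {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(DriverCode.class);
        job.setMapperClass(WordCountMapper.class);      // Mapping phase
        job.setCombinerClass(WordCountReducer.class);   // optional Combining phase
        job.setReducerClass(WordCountReducer.class);    // Reducing phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. sample.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. result.output
        // Shuffling and Sorting happen automatically between the map and reduce phases
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}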
4) Techniques to optimize MapReduce Job:-
Beyond the code written for the main application, the performance and reliability of
MapReduce jobs can be optimized by using some techniques. These MapReduce
optimization techniques can be organized into the following categories:
1) Hardware or network topology
2) Synchronization
3) File system
4)a) Hardware/Network Topology:-
MapReduce makes it possible to run MapReduce tasks on inexpensive
clusters of commodity computers. These computers can be connected through standard
networks. The performance and fault tolerance required for Big Data operations are also
influenced by the physical location of servers. Usually, the data center arranges the
hardware in racks. The performance offered by hardware systems that are located in the
same rack where the data is stored will be higher than the performance of hardware
systems that are located in a different rack than the one containing the data. The reason
for the low performance of the hardware that is located away from the data is the
requirement to move the data and/or application code.

4)b)Synchronization :-
The completion of map processing enables the reduce function to combine the various
outputs to provide the final result. However, performance would degrade if the mapping
results simply remained on the nodes where the processing began. To improve
performance, the results are copied from the mapping nodes to the reducing nodes, which
start their processing tasks immediately. In addition, a hashing function sends all the
values marked with the same key to the same reducer node (mirrored in the partitioner
sketch below), so the overall efficiency of the system increases. The best possible
outcomes are obtained by writing the outputs of the reduce operations directly to the file
system. This entire process is synchronized to provide efficiency to the MapReduce
programming model.
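To make the key-to-reducer routing concrete, the sketch below mirrors the behaviour of Hadoop's default HashPartitioner: hashing the key and taking it modulo the number of reduce tasks guarantees that all values with the same key reach the same reducer. The class name is illustrative.

import org.apache.hadoop.mapreduce.Partitioner;

// Hash-based routing: records with equal keys always map to the same partition (reducer)
public class HashKeyPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // mask the sign bit so the result is a valid, non-negative partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}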

4)c)File System :-
A distributed file system is used to support the implementation of the MapReduce
operation.
Distributed file systems differ from local file systems in that local file systems have far
less capacity for storing and organizing data.
The Big Data world has immense information to be processed and requires data to be
distributed among a number of systems or nodes in a cluster so that the data can be
handled efficiently.
The distribution model followed in implementing the MapReduce Programming approach
is to use the master and slaves model. All the metadata and access rights, along with the
block mapping and file locations, are stored with the master. On the other hand, the data
on which the application code will run is kept with the slaves. The master node receives all
the requests, which it forwards to the appropriate slaves to perform the required
actions.

CHAPTER-II
Understanding Big Data Technology Foundations :-
Exploring the Big Data Stack:-
Big Data Stack refers to the collection of technologies and tools used to handle and analyze
large volumes of data. These stacks typically include hardware, software, and
programming languages that work together to store, process, and analyze data.
Some standard components of a Big Data Stack include distributed file systems like
Hadoop, data processing frameworks like Apache Spark, and databases like Apache
Cassandra or MongoDB. These technologies allow organizations to efficiently manage and
derive insights from massive amounts of data, enabling them to make data-driven
decisions and gain a competitive advantage in today’s data-driven world.

Types of Layers:-
1) Data Source Layer:-
The Data Source Layer is the first layer in the Big Data stack, responsible for collecting and
ingesting data from various sources. This layer is also known as the Data Ingestion Layer or
Data Acquisition Layer.
Key Characteristics:
1. Data Collection: Gathering data from various sources, such as social media, sensors,
databases, files, and applications.
2. Data Ingestion: Ingesting collected data into the Big Data system, often in real-time or
near-real-time.
3. Data Variety: Handling diverse data formats, structures, and velocities.

Common Data Sources:


1. Social Media: Twitter, Facebook, Instagram, etc.
2. Sensors: IoT devices, industrial sensors, environmental sensors, etc.
3. Databases: Relational databases, NoSQL databases, graph databases, etc.
4. Files: Text files, CSV files, JSON files, etc.
5. Applications: Web applications, mobile applications, enterprise applications, etc.
2)Ingestion Layer:-
The Ingestion Layer is a critical component of the Big Data stack, responsible for collecting,
processing, and transporting large volumes of data from various sources to a centralized
data storage system.
Key Characteristics:
1. Data Collection: Gathering data from diverse sources, such as logs, sensors, social
media, and databases.
2. Data Processing: Transforming, aggregating, and filtering data in real-time or near-real-
time.
3. Data Transportation: Moving processed data to a centralized data storage system, such
as Hadoop Distributed File System (HDFS), NoSQL databases, or cloud storage.

Ingestion Layer Functions:-


1. Data Filtering: Removing unnecessary data to reduce storage and processing costs.
2. Data Transformation: Converting data formats, aggregating data, and performing
calculations.
3. Data Validation: Checking data for errors, inconsistencies, and completeness.
4. Data Serialization: Converting data into a format suitable for storage and processing.
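As a small, hedged illustration of these four functions, the Java sketch below filters, validates, transforms, and serializes a single comma-separated sensor record before it would be handed to the storage layer; the class name, method names, and record format are all hypothetical.

import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical ingestion step for one CSV record: id,reading,timestamp
public class IngestionStep {

    // Data Filtering + Data Validation: drop blank lines and records with missing fields
    static Optional<String[]> validate(String rawLine) {
        if (rawLine == null || rawLine.isBlank()) {
            return Optional.empty();
        }
        String[] fields = rawLine.split(",");
        return fields.length == 3 ? Optional.of(fields) : Optional.empty();
    }

    // Data Transformation: normalise fields (trim whitespace, lower-case the id)
    static String transform(String[] fields) {
        return fields[0].trim().toLowerCase() + "," + fields[1].trim() + "," + fields[2].trim();
    }

    // Data Serialization: convert to bytes for transport to the storage layer
    static byte[] serialize(String record) {
        return record.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        validate("  SensorA , 21.5 , 2024-01-01 ")
            .map(IngestionStep::transform)
            .map(IngestionStep::serialize)
            .ifPresent(bytes -> System.out.println(bytes.length + " bytes ready for storage"));
    }
}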

3) Storage Layer :-
The Storage Layer is a fundamental component of the Big Data stack, responsible for
storing and managing large volumes of structured, semi-structured, and unstructured
data.

Storage Layer Components :


1. Distributed File Systems: Hadoop Distributed File System (HDFS) and Ceph are examples
of distributed file systems designed for Big Data storage.
2. NoSQL Databases: Apache Cassandra, Apache HBase, Couchbase, and MongoDB are
popular NoSQL databases for storing structured, semi-structured, and unstructured data.
3. Cloud Storage: Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage are
examples of cloud storage services for Big Data.
4. Object Stores: OpenStack Swift, Riak, and Scality are examples of object stores
designed for storing and retrieving large amounts of unstructured data.

Storage Layer Functions


1. Data Storage: Stores large volumes of data in a scalable, flexible, and reliable manner.
2. Data Retrieval: Provides efficient data retrieval mechanisms, including batch, real-time,
and interactive query support.
3. Data Management: Offers data management capabilities, including data governance,
security, and lifecycle management.
4. Data Integration: Supports data integration with various data sources, including
relational databases, NoSQL databases, and cloud storage.
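As a brief illustration of the Data Storage and Data Retrieval functions, the sketch below writes a record to HDFS and reads it back through Hadoop's FileSystem Java API; the path is purely illustrative, and the configuration is assumed to point at a running cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Store a small record in HDFS and retrieve it again
public class HdfsStoreAndRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/bigdata/demo/record.txt");   // illustrative path

        // Data Storage: write bytes into the distributed file system
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello storage layer");
        }

        // Data Retrieval: read the same record back
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}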

4) Physical Infrastructure Layer :-


The Physical Infrastructure Layer is the foundation of the Big Data stack, providing the
underlying hardware and infrastructure necessary for storing, processing, and analyzing
large volumes of data.
Key Characteristics:
1. Scalability: The Physical Infrastructure Layer is designed to scale horizontally, adding
more servers, storage, and networking resources as needed.
2. Flexibility: The layer supports a variety of hardware and software configurations,
enabling organizations to choose the best infrastructure for their specific Big Data
workloads.
3. Reliability: Redundancy and failover mechanisms are built into the Physical
Infrastructure Layer to ensure high availability and minimize downtime.
4. Security: Physical and logical security measures, such as access controls and encryption,
protect the infrastructure and data from unauthorized access.

5) Platform Management Layer:-


The Platform Management Layer is a critical component of the Big Data stack, responsible
for managing and orchestrating the underlying infrastructure and resources.
Key Functions:
1. Resource Management: Manages compute, storage, and networking resources, ensuring
efficient allocation and utilization.
2. Cluster Management: Oversees the creation, configuration, and management of
clusters, including node management and scaling.
3. Job Scheduling: Schedules and manages jobs, workflows, and applications, ensuring
efficient execution and resource utilization.
4. Monitoring and Logging: Provides real-time monitoring and logging capabilities,
enabling administrators to track performance, identify issues, and optimize operations.
5. Security and Access Control: Implements security measures, such as authentication,
authorization, and encryption, to protect data and resources.
Key Components:
1. Cluster Managers: Apache Mesos, Apache Hadoop YARN, and Kubernetes are popular
cluster managers.
2. Resource Managers: Apache Hadoop YARN and Apache Mesos are examples of
resource managers.
3. Job Schedulers: Apache Oozie and Apache Airflow are popular job
schedulers.
4. Monitoring Tools: Apache Ambari, Prometheus, and Grafana are widely used monitoring
tools.

6) Security Layer :-
The Security Layer is a critical component of the Big Data stack, responsible for protecting
sensitive data and ensuring the confidentiality, integrity, and availability of Big Data assets.

Key Components:-
1. Authentication Systems: Kerberos, LDAP, and Active Directory are popular
authentication systems.
2. Authorization Frameworks: Apache Sentry, Apache Ranger, and Apache Knox provide
authorization and access control capabilities.
3. Encryption Tools: OpenSSL, SSL/TLS, and AES are widely used for encrypting data in
transit and at rest.
4. Data Masking Tools: tools such as Apache Ranger (which can apply column masking to
data queried through Apache Hive and Apache Impala) and dedicated data masking
libraries provide data masking capabilities.
5. Security Information and Event Management (SIEM) Systems: Splunk, ELK Stack, and
Apache Metron provide SIEM capabilities.
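Since AES is listed above, the following sketch shows the kind of field-level encryption the Security Layer can apply, using only the standard javax.crypto API; the key handling is deliberately simplified, as in practice keys would come from a key management service, and the field value is made up.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Encrypt one sensitive field with AES-GCM
public class FieldEncryption {
    public static void main(String[] args) throws Exception {
        // Generate a 256-bit AES key (simplified; real systems use a key management service)
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();

        // A fresh random 96-bit IV/nonce for each encryption
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        // Encrypt the field value
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal("account=1234567890".getBytes(StandardCharsets.UTF_8));

        System.out.println("encrypted field is " + ciphertext.length + " bytes");
    }
}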

7) Monitoring Layer:-
The Monitoring Layer is a critical component of the Big Data stack, responsible for tracking
and analyzing the performance, health, and efficiency of Big Data systems, applications,
and infrastructure.
Key Benefits:
1. Improved System Performance: Identifies performance bottlenecks and issues, enabling
administrators to optimize system performance.
2. Enhanced System Reliability: Detects system health issues, enabling administrators to
take proactive measures to prevent system crashes and downtime.
3. Increased Efficiency: Analyzes system efficiency, enabling administrators to optimize
resource utilization and reduce costs.
4. Faster Issue Resolution: Enables administrators to quickly identify and resolve issues,
reducing mean time to detect (MTTD) and mean time to resolve (MTTR); a simple way to
compute these two metrics is sketched after this list.
5. Data-Driven Decision Making: Provides insights and metrics, enabling administrators and
stakeholders to make data-driven decisions.
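As a small illustration of the MTTD and MTTR metrics mentioned in point 4, the sketch below averages detection and resolution times over a list of incidents; the timestamps and the exact definitions used (both measured from the time the issue occurred) are assumptions made for the example.

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Compute mean time to detect (MTTD) and mean time to resolve (MTTR) from incident records
public class IncidentMetrics {
    record Incident(Instant occurred, Instant detected, Instant resolved) {}

    static Duration mean(List<Duration> durations) {
        long totalSeconds = durations.stream().mapToLong(Duration::getSeconds).sum();
        return Duration.ofSeconds(totalSeconds / durations.size());
    }

    public static void main(String[] args) {
        List<Incident> incidents = List.of(
            new Incident(Instant.parse("2024-01-01T10:00:00Z"),
                         Instant.parse("2024-01-01T10:05:00Z"),
                         Instant.parse("2024-01-01T10:45:00Z")),
            new Incident(Instant.parse("2024-01-02T02:00:00Z"),
                         Instant.parse("2024-01-02T02:15:00Z"),
                         Instant.parse("2024-01-02T03:00:00Z")));

        // MTTD: average time from occurrence to detection
        Duration mttd = mean(incidents.stream()
            .map(i -> Duration.between(i.occurred(), i.detected())).toList());
        // MTTR: average time from occurrence to resolution
        Duration mttr = mean(incidents.stream()
            .map(i -> Duration.between(i.occurred(), i.resolved())).toList());

        System.out.println("MTTD = " + mttd.toMinutes() + " min, MTTR = " + mttr.toMinutes() + " min");
    }
}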

8) Visualization Layer:-
The Visualization Layer is the topmost layer of the Big Data stack, responsible for
presenting complex data insights and analytics in a clear, concise, and actionable manner.
Benefits:
1. Improved Decision Making: Enables stakeholders to make data-driven decisions by
presenting complex data insights in a clear and actionable manner.
2. Enhanced Collaboration: Facilitates collaboration among stakeholders by providing a
common platform for data visualization and exploration.
3. Increased Productivity: Automates reporting and analytics, freeing up resources for
more strategic and analytical work.
4. Better Storytelling: Presents data insights in a narrative format, making it easier for non-
technical stakeholders to understand and engage with the data.
