

Big Data Analytics


Unit-1 (Chapter - 3: Big Data Processing)
• Parallel Data Processing,
• Distributed Data Processing,
• Hadoop,
• MapReduce

Dear students, in today’s world we deal with a huge amount of data. This is not something new, but
the way we handle large amounts of data has changed over time.

Today, the common approach is to break large data into smaller parts.


Imagine you have a big book that needs to be read quickly. Instead of one person reading the whole
book, you divide it into chapters and give them to different people to read at the same time. This
makes the process much faster.

Similarly, in data processing:


❖ A data warehouse is like a library storing a huge collection of books (large data).
❖ A data mart is like a smaller bookshelf that contains books on a specific subject (a smaller
part of the large data).
❖ Dividing data into smaller parts helps in processing it faster.

Traditional databases store all the data in one central place and process it there (like a single computer
doing all the work). But in Big Data, information is stored in different locations, and multiple computers
process it at the same time. This is called distributed processing and makes handling huge amounts of
data much faster.

Dear students, there are two types of data processing in Big Data: Batch Processing and Real-time
Processing.
1. Batch Processing – Data is collected over a period of time and processed together as a group at a later time.
2. Real-time Processing – Data keeps arriving continuously and needs to be processed
immediately.
For real-time data, computers use in-memory storage (storing data in an in-memory data grid [IMDG],
i.e., in RAM instead of on a hard disk) to process data quickly. This helps in making instant decisions,
such as detecting fraud in online transactions or predicting weather changes.

Big Data processing is guided by principles such as speed, consistency, and volume.

To further the discussion of Big Data processing, each of the following concepts will be examined in
turn:
• Parallel data processing,
• Distributed data processing,
• Hadoop,
• MapReduce.


Parallel Data Processing:


Parallel Data Processing is a computing method where multiple processors work on different
parts of a large dataset simultaneously to complete a task faster and more efficiently.
For example: In general, imagine you have 1000 exam papers to check. If only one teacher checks all
the papers, it will take a long time. But if 10 teachers check 100 papers each at the same time, the work
will be finished much faster.
This is exactly how Parallel Data Processing works: instead of one processor doing all the
work, multiple processors handle different parts of the data at the same time. In other words, a large
task is split into smaller parts that are worked on simultaneously by multiple processors, which speeds
up the overall process and makes it more efficient.

Figure: a task divided into 3 sub-tasks that are executed in parallel on 3 different processors within
the same machine.

❖ Parallel processing works on the principle of the divide-and-conquer algorithm.


❖ The Divide and Conquer algorithm is a problem-solving technique that solves a problem by
dividing the main problem into subproblems, solving them individually, and then merging the
results to obtain the solution to the original problem. Divide and conquer is mainly useful when
the problem can be divided into independent subproblems.
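
As a small illustration (a minimal sketch; the function name and the choice of summing a list are only for demonstration), the Python snippet below splits a list into two halves, solves each half independently, and merges the partial results:

def dc_sum(numbers):
    # Base case: a trivially small subproblem is solved directly
    if len(numbers) <= 1:
        return numbers[0] if numbers else 0
    mid = len(numbers) // 2
    left = dc_sum(numbers[:mid])    # divide: solve the left half independently
    right = dc_sum(numbers[mid:])   # divide: solve the right half independently
    return left + right             # conquer: merge the partial results

print(dc_sum([4, 8, 15, 16, 23, 42]))   # prints 108

Because the two halves are independent subproblems, they could just as easily be handed to two different processors, which is exactly what parallel processing does.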


Working of Parallel Processing:


1. Task Division:
A large problem is broken down into smaller tasks that can be executed simultaneously.
2. Assigning Tasks to Processors:
These smaller tasks are distributed among multiple processors. Each processor may have its
own memory or share a common memory.
3. Simultaneous Execution:
All processors execute their assigned tasks at the same time. Depending on the type of parallel
processing, they may follow the same or different instructions.
4. Data Exchange & Synchronization:
If needed, processors communicate and share intermediate results. Synchronization ensures
that tasks complete in the correct order.
5. Combining Results:
Once all tasks are finished, the outputs from different processors are combined to get the final
result.
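
The five steps above can be sketched on a single machine using Python's multiprocessing module (a minimal sketch; the chunk size of 250, the 4 worker processes, and the summing task are illustrative assumptions):

from multiprocessing import Pool

def process_chunk(chunk):
    # Step 3: each processor executes its assigned sub-task at the same time
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1, 1001))                                    # the large problem
    chunks = [data[i:i + 250] for i in range(0, len(data), 250)]   # Step 1: task division
    with Pool(processes=4) as pool:                                # Step 2: assign tasks to processors
        partial_results = pool.map(process_chunk, chunks)          # Steps 3-4: execute and collect results
    print(sum(partial_results))                                    # Step 5: combine results (prints 500500)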

Benefits of Parallel Processing:


1. Faster Execution:
Multiple processors work together, reducing the time needed to complete tasks.
2. Efficient Resource Utilization:
Uses multiple processors effectively, preventing them from sitting idle.
3. Handles Large Data Sets:
Useful for processing massive amounts of data, like big data analytics and AI training.
4. Improved Performance in Complex Tasks:
Ideal for simulations, scientific computations, and real-time applications.

Types of parallel processing:


There are four types of parallel processing:
1. Single instruction single data (SISD)
2. Single instruction multiple data (SIMD)
3. Multiple instruction single data (MISD)
4. Multiple instruction multiple data (MIMD)

1. Single instruction single data (SISD):


❖ It is a uniprocessor machine that executes a single instruction operating on a single data
stream.
❖ In this type, instructions are processed sequentially, so these computers are also called
sequential computers.


2. Single instruction multiple data (SIMD):


❖ It is a multi-processor machine capable of executing the same instruction on all of its
CPUs.
❖ Each CPU is given a different set of data, but all of the CPUs are operated by the same single
instruction.
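
NumPy's vectorized operations give a convenient feel for the SIMD idea in Python (a rough analogy rather than a hardware-level demonstration; the array contents and the "multiply by 2" instruction are illustrative): a single instruction is applied to many data elements at once.

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
doubled = data * 2    # one instruction ("multiply by 2") applied to every element of the data
print(doubled)        # [ 2  4  6  8 10 12 14 16]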

3. Multiple instruction single data (MISD):


It is a multiprocessor system capable of executing different instructions on different CPUs, with all of
them operating on the same data stream.

4. Multiple instruction multiple data (MIMD):


❖ It is a multi-processor system capable of executing multiple instructions on multiple data
streams (datasets).
❖ Each processor in the MIMD model has its own separate instructions and its own separate data
input.
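
A rough MIMD-style sketch in Python (the two worker functions and their inputs are purely illustrative assumptions): two workers run different instructions on different data at the same time.

from concurrent.futures import ProcessPoolExecutor

def total_sales(amounts):
    # one instruction stream working on its own data
    return sum(amounts)

def longest_name(names):
    # a completely different instruction stream working on different data
    return max(names, key=len)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(total_sales, [120, 340, 95])
        f2 = pool.submit(longest_name, ["Asha", "Prasad", "Lee"])
        print(f1.result(), f2.result())   # 555 Prasad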


Distributed Data Processing:


Distributed Processing refers to a computing approach where a task is divided into smaller sub-tasks,
which are executed across multiple physically separate machines (nodes) connected via a network.
These machines work together to complete the task efficiently.
Distributed data processing is closely related to parallel data processing in that the same principle of
“divide-and-conquer” is applied. However, distributed data processing is always achieved through
physically separate machines that are networked together as a cluster. In the figure below, a task is
divided into 3 sub-tasks that are then executed on 3 different machines sharing one physical switch.

Key Characteristics of Distributed Processing:


❖ Scalability: Distributed systems can easily scale by adding more nodes to handle increased
workloads, accommodating growth without performance degradation.
❖ Multiple Machines (Nodes): Unlike parallel processing (which uses multiple cores within a
single machine), distributed processing involves multiple computers working together.
❖ Fault Tolerance: The system can continue operating even if one or more nodes fail, with
redundancy and data replication ensuring that a failure doesn't impact overall functionality.
❖ Concurrency: Multiple nodes can operate simultaneously, performing tasks in parallel,
improving system efficiency and speed.
❖ Resource Sharing: Nodes can share resources such as processing power, storage, and data,
allowing for efficient utilization of resources across the system.
❖ Transparency: Distributed systems aim to hide the complexity of multiple machines from the
end-user, providing a unified interface as if the system were running on a single computer.
❖ Heterogeneity: Distributed systems can consist of different types of machines, operating
systems, or networks, requiring seamless handling of this diversity.
❖ Openness: The system should be designed to be easily extended and improved, with software
developed and shared openly.
❖ Reliability: Failure of a single node doesn't halt the whole system, which can reassign tasks to
other nodes.


Example of Distributed Processing: Imagine processing a large dataset in a big data environment
❖ A large dataset is divided into smaller chunks.
❖ Each chunk is processed by a separate machine.
❖ The results from all machines are combined to get the final output.

Types of Distributed Data Processing


Distributed data processing can be categorized based on how data is processed and the nature of
tasks performed. Here are the main types:
1. Batch Processing
2. Stream processing (Real-time processing)

1. Batch Processing
Definition: Batch processing is a type of distributed data processing where large volumes of data
are collected, stored & processed in groups (batches) at scheduled intervals, rather than in real-
time. This method is efficient for handling vast amounts of data that do not require immediate
responses.
Example Technologies: Hadoop (MapReduce), Apache Spark (Batch Mode)
Applications:
❖ Data Warehousing
❖ Periodic Reports (e.g., daily sales reports)
❖ Log Processing (e.g., analyzing server logs)
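
As a rough illustration of batch processing with one of the technologies listed above, here is a minimal PySpark sketch that builds a daily sales report from a file collected during the day (the file paths, column names, and application name are assumptions made for this example):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DailySalesReport").getOrCreate()

# Read the whole batch of records collected during the day (path is illustrative)
sales = spark.read.csv("hdfs:///data/sales/2024-01-15.csv", header=True, inferSchema=True)

# Process the batch in one go: total sales per store
report = sales.groupBy("store_id").agg(F.sum("amount").alias("total_sales"))

# Write the periodic report back to distributed storage
report.write.mode("overwrite").parquet("hdfs:///reports/daily_sales/2024-01-15")
spark.stop()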

2. Stream Processing (Real-Time Processing)


Definition: Stream processing, also known as real-time processing, is a distributed data processing
technique where continuous data streams are processed as they arrive. Unlike batch processing,
which handles data in fixed intervals, stream processing processes data in real time, making it
ideal for applications that require immediate insights and actions.
Example Technologies: Apache Kafka, Apache Flink, Apache Spark Streaming, Google Dataflow
Applications:
❖ Fraud Detection in Banking
❖ Real-Time Analytics (e.g., website traffic monitoring)
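
For contrast, the widely used Spark Structured Streaming word-count sketch below processes lines as they arrive on a network socket instead of waiting for a complete batch (the host, port, and console output are illustrative choices):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Data keeps arriving continuously on this socket (host/port are illustrative)
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Results are updated continuously as new data arrives, instead of once at the end
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()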

Working of Distributed Data Processing


1. Data Partitioning
❖ Large datasets are divided into smaller chunks (partitions).
❖ These partitions are distributed across multiple nodes.
2. Parallel Processing
❖ Each node processes its assigned data independently.
❖ This parallelism improves performance and reduces processing time.
3. Distributed Storage
❖ Data is stored across multiple nodes using distributed file systems like HDFS.
❖ Ensures fault tolerance and scalability.
4. Task Scheduling & Execution
❖ A central coordinator assigns tasks to worker nodes.
❖ Worker nodes execute tasks and return results.
5. Aggregation & Final Output
❖ Partial results from worker nodes are combined to produce the final result.
❖ This is done using reducers in Hadoop or actions in Spark.
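
These steps can be imitated on a single machine as a simplified stand-in for real worker nodes (the sample log lines and helper names below are assumptions of this sketch): the data is partitioned, each worker counts log levels in its own partition, and the partial counts are aggregated into the final result.

from concurrent.futures import ProcessPoolExecutor
from collections import Counter

def count_levels(partition):
    # Each "worker node" processes only its own partition of the log
    return Counter(line.split()[0] for line in partition)

if __name__ == "__main__":
    logs = ["INFO user logged in", "ERROR disk full", "INFO job finished",
            "WARN slow response", "ERROR timeout", "INFO user logged out"]
    partitions = [logs[0:2], logs[2:4], logs[4:6]]                    # 1. data partitioning
    with ProcessPoolExecutor() as scheduler:                          # 4. tasks assigned to workers
        partials = list(scheduler.map(count_levels, partitions))      # 2. parallel processing
    final = sum(partials, Counter())                                  # 5. aggregation of partial results
    print(dict(final))   # e.g. {'INFO': 3, 'ERROR': 2, 'WARN': 1}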


Hadoop
Apache Hadoop is an open-source framework for distributed data storage and processing. It is
designed to handle large-scale datasets across a cluster of machines using a parallel computing
model. Hadoop enables efficient batch processing of big data, making it a fundamental technology in
data engineering and analytics.

Core Components of Hadoop:


1. Hadoop Distributed File System (HDFS)
❖ A fault-tolerant, distributed storage system that splits large files into smaller blocks
and distributes them across multiple nodes.
❖ Ensures data reliability through replication (typically three copies per block).
2. MapReduce
❖ A programming model for parallel data processing.
❖ Map Phase: Splits and processes data in parallel across multiple nodes.
❖ Reduce Phase: Aggregates results from the Map phase to produce the final output.
3. Yet Another Resource Negotiator (YARN)
❖ Manages cluster resources and job scheduling.
❖ Allocates CPU, memory, and processing power to tasks dynamically.
4. Hadoop Common
❖ A set of shared utilities and libraries that support other Hadoop components.

Working of Hadoop
1. Data Storage: Large files are divided into blocks (e.g., 128 MB or 256 MB) and stored across
HDFS nodes.
2. Job Submission: A user submits a job using MapReduce or another framework (e.g., Spark).
3. Processing:
❖ The Map task processes chunks of data in parallel.
❖ The Reduce task aggregates and processes the results.
4. Output Generation: The final results are stored back in HDFS or another storage system.

Key Features of Hadoop:


❖ Scalability – Can handle petabytes of data by adding more nodes.
❖ Fault Tolerance – Data replication ensures reliability even if nodes fail.
❖ Cost-Effective – Uses commodity hardware to store and process data.
❖ Parallel Processing – Distributes tasks across multiple nodes for efficiency.
❖ Support for Multiple Processing Frameworks – Works with Apache Spark, Hive, and Pig for
flexible data processing.

Applications of Hadoop:
❖ Big Data Analytics – Processing large datasets for insights (e.g., customer behavior
analysis).
❖ Data Warehousing – Storing and managing structured and unstructured data.
❖ Log Analysis – Processing server logs for system monitoring.
❖ ETL Pipelines – Extracting, transforming, and loading large datasets for business
intelligence.


MapReduce:
MapReduce is a programming model used for processing and generating large datasets in a
distributed and parallel manner.
❖ MapReduce is a component of Hadoop.
❖ MapReduce is used to process very large datasets.
❖ MapReduce has 2 functions: a Map() function and a Reduce() function.
❖ The Map() function processes input data and generates intermediate key-value pairs.
❖ The Reduce() function processes the intermediate results to produce the final output.
❖ MapReduce is used in big data processing, log analysis, machine learning, etc.

How MapReduce Works


Map Stage
➢ Input data is divided into smaller chunks or blocks.
➢ Several worker nodes work in parallel to process each chunk independently.
➢ A "Map" function is applied to each data chunk, generating intermediate key-value pairs.
➢ The Map function's goal is to extract relevant information from the input data and prepare it
for further processing.
Reduce Stage
➢ After the Map stage, the intermediate key-value pairs are grouped by key.
➢ The grouped key-value pairs are then shuffled and sorted based on their keys.
➢ The purpose of the shuffle and sort phase is to bring together all the intermediate values
associated with the same key & make them available to the corresponding Reduce function.
➢ Once the shuffling and sorting are complete, a "Reduce" function is applied to perform
aggregation, analysis, or other computations on the grouped data.
➢ The output is a set of final key-value pairs from the computation.


MapReduce – Word Count Example
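
The original figure for this example is not reproduced here, so below is a minimal Python sketch of the same word-count job with the Map, shuffle-and-sort, and Reduce stages written out explicitly (the three input lines are the usual illustrative sample):

from itertools import groupby

lines = ["deer bear river", "car car river", "deer car bear"]

# Map stage: each line is processed independently, emitting (word, 1) key-value pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & sort: group all intermediate pairs that share the same key
mapped.sort(key=lambda pair: pair[0])
grouped = {key: [v for _, v in group]
           for key, group in groupby(mapped, key=lambda pair: pair[0])}

# Reduce stage: aggregate the grouped values for each key into the final output
word_counts = {word: sum(values) for word, values in grouped.items()}
print(word_counts)   # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}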

Prof. Prasad Patil,


Department of Computer Applications,
KLE Tech University, Belagavi.
