
1. Introduction to Big Data & Hadoop
What is Big Data?
• Big Data refers to large, complex and fast-growing collections of data
that are too big to be handled using traditional methods like
spreadsheets or relational databases (e.g., Excel or MySQL).
• It requires special tools and technologies to store, manage and
analyze.
• Imagine all the data generated every second by:
People using Facebook, Instagram and YouTube
Sensors in smart devices and machines
Online shopping websites like Amazon
Banking systems and mobile payments
All this information adds up to Big Data.
What is an Example of Big Data?
• The New York Stock Exchange is an example of Big Data
that generates about one terabyte of new trade data per day.
• Social Media
Statistics show that 500+ terabytes of new data get
ingested into the databases of the social media site Facebook every
day. This data is mainly generated from photo and video
uploads, message exchanges, posting of comments, etc.
• A single jet engine can generate 10+ terabytes of data in 30
minutes of flight time. With many thousand flights per day,
the data generated reaches many petabytes.
Characteristics of Big Data
1. Volume (Size of Data)
• The amount of data is enormous — in terabytes, petabytes or more.
• Example: Facebook generates over 4 petabytes of data per day.
2. Velocity (Speed of Data)
• Data is created and streamed at very high speeds.
• Example: 5000+ tweets per second on Twitter; stock market data updating in milliseconds.
3. Variety (Types of Data)
• Data comes in different forms:
  - Structured: Tables, databases (e.g., customer names and emails)
  - Unstructured: Images, videos, audio, social media posts
  - Semi-structured: JSON, XML, logs
4. Veracity (Data Quality & Accuracy)
• Big Data may be incomplete, inconsistent or inaccurate.
• Example: Fake reviews or duplicate customer records.
• Cleaning and verifying data is important.
5. Value (Usefulness of Data)
• The true benefit lies in analyzing Big Data to extract meaningful insights.
• Example: Companies analyze user behavior to improve products or increase sales.
Why is Big Data Important?
Big Data helps in:
• Improving decision-making in business and government
• Predictive analysis (e.g., when a machine might fail)
• Personalized recommendations (e.g., Netflix, Amazon)
• Fraud detection (in banking or insurance)
• Public health tracking (e.g., during pandemics)
Real-Life Examples of Big Data Use
1. Google Maps
• Analyzes GPS data from millions of users to suggest fastest routes.
2. Amazon
• Uses purchase history and browsing behavior to recommend products.
3. Netflix
• Tracks what you watch to suggest shows and movies you’ll like.
4. Healthcare
• Hospitals use Big Data to predict disease outbreaks and improve diagnosis.
Types of Big Data
1. Structured
• Structured data is highly organized and easily searchable using
traditional tools like SQL.
• It is stored in tabular formats such as rows and columns in
relational databases.
• Characteristics:
Clearly defined fields (columns)
Stored in relational databases (RDBMS)
Easy to enter, store and analyze
Data stored in a relational database management system is one example of
'structured' data.
Examples Of Structured Data
An ‘Employee’ table in a database is an example of Structured Data

Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000
Example : Bank Transaction Records
• What it contains: Account number, transaction ID, amount, date, time, balance
• Format: Stored in rows and columns in a relational database (RDBMS)
• Usage:
• Easily searchable (e.g., show all transactions over ₹10,000; see the sketch below)
• Used in financial reporting and fraud detection
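To make the "easily searchable" point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a real banking RDBMS; the table name, columns and sample rows are invented for illustration.

import sqlite3

# In-memory database as a stand-in for a real banking RDBMS (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    txn_id INTEGER PRIMARY KEY,
    account_no TEXT,
    amount REAL,
    txn_date TEXT)""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [(1, "ACC001", 2500.0, "2024-01-05"),
     (2, "ACC002", 15000.0, "2024-01-05"),
     (3, "ACC001", 45000.0, "2024-01-06")])

# Structured data is easily searchable: show all transactions over 10,000.
for row in conn.execute(
        "SELECT txn_id, account_no, amount FROM transactions WHERE amount > 10000"):
    print(row)

The same idea applies to the student and inventory examples below: because every record follows a fixed schema, one declarative query can filter or aggregate millions of rows.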
Example : Student Information System
• What it contains: Student ID, Name, Class, Attendance, Marks
• Format: Tabular data in school management software
• Usage:
• Automatic report generation
• Academic progress analysis
Example : Inventory Management
• What it contains: Product ID, Name, Quantity, Supplier, Cost
• Format: SQL databases like Oracle or MySQL
• Usage:
• Real-time stock updates
• Automated reordering systems
2. Unstructured
• Unstructured data means any data that doesn’t have a clear format or
structure.
• It’s usually very large in size and difficult to process. This makes it
hard to get useful information from it.
• A common example of unstructured data is when different types of
files—like text documents, images and videos—are stored together in
one place.
• Today, many organizations have a huge amount of such data, but they
struggle to use it properly because it’s still in its raw, unorganized
form.
• Examples of Unstructured Data
The output returned by ‘Google Search’
Example : YouTube Videos
• What it contains: Audio, video, captions, comments
• Format: Binary multimedia files (MP4, MKV)
• Usage:
• Processed using video analysis and speech-to-text
• Used for recommendation systems, content moderation
Example : Customer Feedback (Text)
• What it contains: Free-form user reviews
• Format: Paragraphs of unstructured text
• Usage:
• Analysed using Natural Language Processing (NLP)
• Helps companies improve products/services
Example : Medical Images (X-rays, MRIs)
• What it contains: Visual scan data with no direct text format
• Format: DICOM image files
• Usage:
• Requires image processing and AI for diagnosis
• Used in hospitals for patient treatment records
3. Semi-structured
• Semi-structured data does not reside in traditional databases but has some
organizational properties like tags or markers, which make it easier to analyze
than unstructured data.
Characteristics:
• Doesn’t fit into a table, but has identifiable patterns
• Contains tags (like XML or JSON)
• Flexible structure
Examples:
• Data represented in XML files
Personal data stored in an XML file (parsed in the sketch below):
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
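Because of the tags, such data is straightforward to parse programmatically. A minimal sketch using Python's standard xml.etree module; the wrapping <records> root element is added here only so the snippet is well-formed XML.

import xml.etree.ElementTree as ET

# The records from the slide, wrapped in one root element so the XML is well-formed.
xml_data = """<records>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</records>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # The tags act as markers, so each field can be looked up by name.
    print(rec.find("name").text, rec.find("sex").text, rec.find("age").text)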
Example : JSON from E-Commerce Website
• What it contains (JSON):
{ "product": "Laptop", "brand": "Dell", "price": 55000,
  "features": ["i5 processor", "8GB RAM", "512GB SSD"] }
• Usage:
• Used in APIs to exchange data between client and server
• Flexible; new fields can be added without breaking the system (see the sketch below)
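A minimal sketch of how an API consumer might handle this JSON with Python's standard json module; it also illustrates the "flexible fields" point, since readers that only access known keys keep working when new fields appear. The warranty field is a hypothetical addition.

import json

# The product record from the slide, as it might arrive from an API.
payload = '{"product": "Laptop", "brand": "Dell", "price": 55000, "features": ["i5 processor", "8GB RAM", "512GB SSD"]}'

product = json.loads(payload)      # parse the semi-structured text into a dict
print(product["brand"], product["price"])

# A newer API version can add fields without breaking older readers,
# as long as readers only access the keys they know about.
product["warranty_years"] = 2      # hypothetical new field
print(json.dumps(product, indent=2))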
Example : Email Metadata
• What it contains: To, From, Subject, Time, Attachment info
• Format: Semi-structured headers + unstructured body
• Usage:
• Spam filtering
• Email tracking and archiving
Comparison
Data Type        | Real Example                  | Format             | Use Case
Structured       | Student records, banking logs | Tables (SQL)       | Easy to query and analyze
Semi-Structured  | JSON APIs, emails, XML        | Tagged format      | Flexible data exchange
Unstructured     | Videos, reviews, images       | Free form (binary) | Requires AI tools for understanding
Quasi-Structured | Web logs, sensor data         | Irregular patterns | Needs preprocessing before analysis
Big Data vs Traditional Data
Parameters              | Traditional Data                  | Big Data
Volume                  | Ranges in GB (manageable size)    | Ranges in TB or PB (extremely large & complex), constantly updated
Generation Rate         | Data generated per hour, per day… | More rapid (almost every second or millisecond)
Structure               | Works with structured data        | Semi-structured and unstructured data
Data Source             | Managed in a centralized system   | Fully distributed system
Data Integration        | Easy                              | Difficult
Data Store              | RDBMS (uses basic database tools) | HDFS, NoSQL (needs special big data tools)
Access                  | Interactive                       | Batch or near real time
Update Scenarios        | Repeated read and write           | Write once, repeated read
Data Structure (Format) | Static schema (fixed)             | Dynamic schema (flexible)
Scaling Potential       | Non-linear (e.g. a traditional SQL database may perform well with 10,000 records, but slow down sharply with 1 million records) | Somewhat close to linear (e.g. if your online store gets more customers and data, you can add more servers to manage the load instead of replacing your current server with a bigger one)
Case Study
1. Big Data Case Study – Walmart
• Walmart is the largest retailer in the world and the world's largest company by revenue,
with more than 2 million employees and 20,000 stores in 28 countries. It started making
use of big data analytics well before the term Big Data came into the picture.
• Walmart uses Data Mining to discover patterns that can be used to provide product
recommendations to the user, based on which products were bought together.
• Walmart by applying effective Data Mining has increased its conversion rate of
customers. It has been speeding along big data analysis to provide best-in-class
e-commerce technologies with a motive to deliver superior customer experience.
• The main objective of holding big data at Walmart is to optimize the shopping experience
of customers when they are in a Walmart store.
• Big data solutions at Walmart are developed with the intent of redesigning global
websites and building innovative applications to customize the shopping experience for
customers whilst increasing logistics efficiency.
• Hadoop and NoSQL technologies are used to provide internal customers with access to
real-time data collected from different sources and centralized for effective use.
Data Mining is the process of finding useful patterns, trends and relationships in
large sets of data.
Simple Explanation:
Imagine you have a big box full of customer shopping receipts. Data mining helps
you look through all those receipts to find hidden patterns like:
• Which products are often bought together (e.g., bread and butter; see the sketch after this list),
• What time people shop the most,
• What kind of offers increase sales.
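A toy sketch of the "bought together" pattern mentioned above: count how often each pair of products appears on the same receipt. The receipts are invented, and real retailers use far more sophisticated association-rule mining.

from collections import Counter
from itertools import combinations

# A handful of made-up receipts (each one is the set of items on a single bill).
receipts = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for receipt in receipts:
    # Count every unordered pair of items that appears on the same receipt.
    for pair in combinations(sorted(receipt), 2):
        pair_counts[pair] += 1

# The most frequent pairs are candidates for "frequently bought together" offers.
print(pair_counts.most_common(3))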
Why is Data Mining useful?
• It helps businesses understand customer behavior.
• It improves product recommendations.
• It helps in better decision-making.
Walmart Big Data Case Study – Using the 5 V’s
1. Volume (Huge amount of data)
Walmart is the world’s largest retailer, with over 2 million employees and 20,000+ stores in 28 countries.
This creates a massive amount of data daily from sales, customers, inventory and online activities.
2. Velocity (Speed of data processing)
Walmart uses Big Data tools to analyze data quickly and in real-time. For example, product
recommendations and stock updates are made instantly, helping improve customer service and operations.
3. Variety (Different types of data)
Walmart works with many types of data like online purchases, in-store sales, customer reviews, product
images, and social media activity. Technologies like Hadoop and NoSQL help them handle this mix of data.
4. Veracity (accuracy of data)
Walmart uses data mining to find reliable patterns in customer behavior, helping to provide accurate
product suggestions and avoid fake or incorrect data (like duplicate orders or wrong inventory counts).
5. Value (Usefulness of data)
Walmart uses big data to improve customer experience, make shopping easier and increase sales. It also
helps in designing better websites, apps and managing supply chains efficiently.
2. Big Data Case Study – Uber
• Uber is the first choice for people around the world when they think of
moving people and making deliveries. It uses the personal data of the
user to closely monitor which features of the service are used most,
to analyze usage patterns and to determine where the services should
be more focused.
• Uber focuses on the supply and demand of its services, due to which
the prices of the services provided change. Therefore, one of Uber's
biggest uses of data is surge pricing. For instance, if you are running
late for an appointment and you book a cab in a crowded place, then
you must be ready to pay twice the usual amount.
• For example, on New Year's Eve, the price for driving one mile can
go from 200 to 1000. In the short term, surge pricing affects the rate of
demand, while long-term use could be the key to retaining or losing
customers. Machine learning algorithms are used to determine
where the demand is strong (a toy illustration follows below).
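A purely illustrative toy formula for the supply-and-demand idea behind surge pricing; this is not Uber's actual pricing model, only a sketch of how a multiplier could grow when ride requests outnumber available drivers (the fare figure echoes the 200-per-mile example above).

def surge_multiplier(ride_requests, available_drivers, cap=5.0):
    """Toy surge factor: grows as demand outstrips supply, capped at `cap`."""
    if available_drivers == 0:
        return cap
    return min(cap, max(1.0, ride_requests / available_drivers))

base_fare_per_mile = 200  # illustrative figure taken from the slide
for requests, drivers in [(50, 50), (100, 50), (250, 50)]:
    price = base_fare_per_mile * surge_multiplier(requests, drivers)
    print(f"{requests} requests, {drivers} drivers -> {price:.0f} per mile")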
Uber Big Data Case Study – Using the 5 V’s
1. Volume (Large amount of data)
Uber collects huge amounts of data every day from users around the world — including
ride bookings, locations, timings, prices and customer feedback.
2. Velocity (Speed of data)
Uber uses data in real time to quickly update ride prices (surge pricing), match riders with
drivers and respond to changing demand immediately — like during rush hours or
holidays.
3. Variety (Different types of data)
Uber works with different kinds of data: user locations (GPS), payment details, app usage,
traffic updates, driver ratings, and even calendar events (like New Year’s Eve).
4. Veracity (Accuracy of data)
Uber uses machine learning algorithms to analyze real-time and historical data. This helps
them understand where demand is real and avoid mistakes in pricing or driver placement.
5. Value (Usefulness of data)
Uber uses big data to offer dynamic pricing (surge pricing), improve customer service,
keep wait times low, and help drivers earn more — making the whole system smarter and
more efficient.
3. Big Data Case Study – Netflix
• It is the most loved American entertainment company specializing in online
on-demand streaming video for its customers.
• Using Big Data, Netflix has been determined to predict exactly what its customers
will enjoy watching. As such, Big Data analytics is the fuel that
fires the 'recommendation engine' designed to serve this purpose. More
recently, Netflix started positioning itself as a content creator, not just a
distribution method.
• Unsurprisingly, this strategy has been firmly driven by data. Netflix's
recommendation engines and new content decisions are fed by data points
such as what titles customers watch, how often playback is stopped, the ratings
given, etc. The company's data infrastructure includes Hadoop, Hive and Pig along
with other traditional business intelligence tools.
• Netflix shows us that knowing exactly what customers want becomes much easier
when companies stop relying on assumptions and instead make
decisions based on Big Data.
Netflix Big Data Case Study – Using the 5 V’s
1. Volume (Large amount of data)
Netflix collects massive data from millions of users worldwide — like what shows people
watch, how long they watch, when they pause, what they rate, and more.
2. Velocity (Speed of data processing)
Netflix processes data very quickly to give users real-time recommendations and updates. It
uses fast systems to instantly suggest shows based on your recent activity.
3. Variety (Different types of data)
Netflix handles many types of data: viewing history, search keywords, device type, screen
resolution, pause/play behavior, and user ratings. Tools like Hadoop, Hive, and Pig help
manage this mix of data.
4. Veracity (Accuracy and trust in data)
Netflix relies on real user data, not assumptions, to decide what content to recommend or
create. This reduces guesswork and helps in making accurate decisions.
5. Value (Usefulness of data)
Big Data helps Netflix recommend shows people will actually enjoy, which keeps users
watching longer. It also helps Netflix decide what new content to produce, increasing user
satisfaction and business success.
4. Big Data Case Study – eBay
• A big technical challenge for eBay, as a data-intensive business, is to
exploit a system that can rapidly analyze and act on data as it arrives
(streaming data). There are many rapidly evolving methods to support
streaming data analysis.
• eBay is working with several tools including Apache Spark, Storm and
Kafka. These allow the company's data analysts to search for information
tags that have been associated with the data (metadata) and make it
consumable to as many people as possible, with the right level of
security and permissions (data governance).
• The company has been at the forefront of using big data solutions and
actively contributes its knowledge back to the open-source
community.
eBay Big Data Case Study – Using the 5 V’s
1. Volume (Large amount of data)
eBay handles huge amounts of data every second from users buying, selling and browsing
on its platform. This makes it a highly data-intensive business.
2. Velocity (Speed of data processing)
eBay deals with streaming data, meaning data that comes in real time. The company needs
to analyze and act on this data instantly, using fast tools like Apache Spark, Kafka and
Storm.
3. Variety (Different types of data)
eBay processes different types of data — user clicks, product listings, search queries, prices
and customer reviews — all coming in from various locations and devices.
4. Veracity (Accuracy and security of data)
eBay uses metadata (information about data) and applies data governance (rules for access
and security) to make sure data is shared safely and correctly across teams.
5. Value (Usefulness of data)
Big data allows eBay’s analysts to gain insights, improve customer experience and make
better business decisions. eBay also shares its knowledge with the open-source community,
helping others benefit too.
What is Hadoop?
• Hadoop is an open-source framework that allows us to store &
process large data in a parallel & distributed manner.
• Created by Doug Cutting and Mike Cafarella.
• Two main components : HDFS & MapReduce.
• Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications.
• MapReduce is the processing unit of Hadoop.
• Basically, it accomplishes the following two tasks:
1. Massive data storage.
2. Faster processing.
The main goals of Hadoop are :
1. Scalable : Hadoop is designed to scale horizontally, which means you can increase processing power
simply by adding more machines (nodes) to the cluster. Example : Suppose an e-commerce website
starts small with 3 servers to process user transaction data. As their user base grows, they can expand
their Hadoop cluster to 1000 servers without changing the code. This ensures it can handle large
amounts of data generated during major sales events like Diwali or Black Friday.
2. Fault Tolerance : Hadoop automatically recovers from hardware or node failures without losing data or
affecting processing. Example : If one DataNode storing part of a file crashes, Hadoop does not stop.
The data is already replicated (typically 3 copies) across different nodes. So, another node with a replica
of the same data takes over automatically. This ensures zero data loss.
3. Economical : Hadoop is cost-effective because it uses commodity hardware — affordable,
general-purpose machines instead of expensive, specialized servers. Example : Instead of buying
high-end servers worth lakhs of rupees, a company can build a Hadoop cluster with dozens of regular
PCs. Facebook, for instance, uses Hadoop on thousands of such machines to analyze user data and
trends cost-effectively.
4. Handle Hardware Failures : Hadoop is built to detect and manage failures at the software level,
meaning even if a disk, server, or switch fails, the system keeps running. Example : If a server crashes
while processing a job, Hadoop's JobTracker/ResourceManager detects the failure and reassigns that
task to another working server. This is why large companies like Yahoo! or LinkedIn use Hadoop — their
systems keep running even when hardware components fail unexpectedly.
Core Components of Hadoop Architecture
• Core Hadoop Components
Hadoop consists of the following components:
1. Hadoop Common: This package provides file system and OS level
abstractions. It contains libraries and utilities required by other Hadoop
modules.
2. Hadoop Distributed File System (HDFS): HDFS is a distributed file system
that provides a limited interface for managing the file system.
3. Hadoop MapReduce: MapReduce is the key algorithm that the Hadoop
MapReduce engine uses to distribute work around a cluster.
4. Hadoop Yet Another Resource Negotiator (YARN) (MapReduce 2.0): It is a
resource management platform responsible for managing compute
resources in clusters and using them for scheduling of users’ applications
Hadoop Architecture
1. HDFS Layer (Storage Layer)
This layer is responsible for storing large amounts of data across multiple machines.
Components:
NameNode (Master Node):
• Manages the file system namespace.
• Keeps track of metadata (like filenames, file locations, permissions).
• Does not store actual data.
DataNode (Slave Node):
• Stores the actual data blocks.
• Regularly sends a signal and block report to the NameNode.
Interaction:
• NameNode instructs DataNodes where to store/replicate data.
• DataNodes communicate back with the NameNode.

2. MapReduce Layer (Processing Layer)
This layer is responsible for processing the data stored in HDFS (intermediate results, key-value pairs).
Components:
JobTracker (Master Node):
• Coordinates the execution of MapReduce jobs.
• Assigns tasks to TaskTrackers on available nodes.
• Monitors job progress and handles failures.
TaskTracker (Slave Node):
• Executes the actual tasks (Map/Reduce) assigned by the JobTracker.
• Reports progress/status to the JobTracker.
Interaction:
• JobTracker sends jobs to TaskTrackers.
• TaskTrackers perform computations and send results/status back.
Hadoop Distributed File System (HDFS) Architecture
Rather than keeping data in a single place, HDFS splits it across
multiple distributed blocks.
Example: a 1280 MB file is split into blocks of the default size of 128 MB
(10 blocks).
These blocks are saved in different places called DataNodes.
The information about which block is stored on which DataNode is kept
on the master node; this master file is called metadata.
Each block is replicated on 2-3 different DataNodes, so there is no risk of
data loss if one node fails (see the sketch below).
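A minimal sketch of the block-and-replica idea described above: a 1280 MB file is split into 128 MB blocks, and each block is assigned 3 replicas on different (hypothetical) DataNodes. Real HDFS placement also considers rack awareness and free space; this is only a toy model of the metadata kept by the NameNode.

import math

BLOCK_SIZE_MB = 128       # default HDFS block size
REPLICATION_FACTOR = 3    # default replication factor
data_nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # hypothetical DataNodes

def place_blocks(file_size_mb):
    """Toy 'metadata' map: block id -> list of DataNodes holding a replica."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    metadata = {}
    for block_id in range(num_blocks):
        # Simple round-robin placement; real HDFS also considers racks and free space.
        replicas = [data_nodes[(block_id + r) % len(data_nodes)]
                    for r in range(REPLICATION_FACTOR)]
        metadata[f"block_{block_id}"] = replicas
    return metadata

# A 1280 MB file becomes 10 blocks, each stored on 3 different nodes.
for block, nodes in place_blocks(1280).items():
    print(block, "->", nodes)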
Hadoop Distributed file system
1. NameNode:
• Also known as the Master node.
• Contains the metadata file.
• E.g. the name of each file, how many replicas it has, permissions on files, and which DataNodes it uses.
2. DataNode:
• The slaves of HDFS.
• Default replication factor is 3.
• For replication of data, DataNodes may communicate with each other.
3. Secondary NameNode: periodically merges the NameNode's EditLogs into the FsImage
(checkpointing); it assists the primary NameNode rather than continuously mirroring it.

Rack Awareness (diagram)
Read and Write Operation (diagram)
Hadoop Distributed file system
Functions of NameNode:
• It records the metadata of all the files stored in the cluster.
E.g. The location of blocks stored, the size of the file, permissions,
hierarchy, etc.
• Two files associated with metadata are
FsImage – It contains the complete state of the file system
namespace since the start of the NameNode.
EditLogs – It contains all the recent modifications made to the file
system with respect to the most recent FsImage.
• It regularly receives a heartbeat and block report from all the
DataNodes in the cluster to ensure that each DataNode is alive (a toy
heartbeat tracker is sketched below).
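A toy sketch of the heartbeat idea: the NameNode remembers when each DataNode last reported in and treats nodes that have been silent for too long as dead. The timeout and node names are made up; real HDFS uses configurable heartbeat and recheck intervals.

import time

HEARTBEAT_TIMEOUT = 10   # seconds of silence after which a node is treated as dead (made-up value)
last_heartbeat = {}      # DataNode name -> timestamp of its last heartbeat

def receive_heartbeat(datanode):
    """Called whenever a DataNode sends its periodic heartbeat."""
    last_heartbeat[datanode] = time.time()

def live_and_dead_nodes():
    """Split the known DataNodes into live and dead based on heartbeat age."""
    now = time.time()
    live = [n for n, t in last_heartbeat.items() if now - t <= HEARTBEAT_TIMEOUT]
    dead = [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]
    return live, dead

receive_heartbeat("dn1")
receive_heartbeat("dn2")
last_heartbeat["dn3"] = time.time() - 60   # simulate a node that stopped reporting a minute ago
print(live_and_dead_nodes())               # (['dn1', 'dn2'], ['dn3'])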
Map Reduce
• MapReduce performs the processing of large datasets in a distributed
and parallel manner (less time, more efficiency).
• Two main tasks:
1. Map(): processes the input data and produces intermediate key-value pairs.
2. Reduce(): combines the intermediate results.
• Two main daemons (small processes):
1. Master (only one), the JobTracker: schedules jobs, provides resources to
the TaskTrackers, and monitors them.
2. Slaves (multiple), the TaskTrackers: execute the tasks using those resources and
report the status of each task to the master.
- Data is kept on the local disk (less propagation delay).
- Not on a remote machine, because remote access takes more time, i.e. more delay.
- The processing unit is moved to where the data is stored (data locality).
Map Reduce Example
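A minimal word-count sketch in plain Python that simulates the Map, shuffle/group and Reduce phases; it mirrors the classic Hadoop word-count example but does not use the actual Hadoop API.

from collections import defaultdict

lines = ["big data is big", "hadoop processes big data"]

# Map phase: emit a (word, 1) key-value pair for every word in every line.
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle/sort phase: group all values belonging to the same key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the intermediate values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}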
Hadoop Ecosystem
Apart from HDFS and MapReduce, the Hadoop ecosystem contains
several other components.
The main ecosystem components of the Hadoop
architecture are as follows:
1. Apache HBase: Columnar (Non-relational)
database.
2. Apache Hive: Data access and query.
3. Apache HCatalog: Metadata services.
4. Apache Pig: Scripting platform.
5. Apache Mahout: Machine learning libraries for
Data Mining.
6. Apache Oozie: Workflow and scheduling
services.
7. Apache ZooKeeper : Cluster coordination.
8. Apache Sqoop: Data integration services.
Additional Functionalities:
- Proper monitoring, proper management, security features, scalability
and so on; different ecosystem components are used for these.
MapReduce
- In the first version of Hadoop (1.0), MapReduce handled both data
processing and resource management, and because of this it became slow
as data kept increasing.
- So the resource management part was removed from MapReduce and a
separate component, Hadoop YARN, was introduced.
- Analogy: the CPU (does the actual processing).
Hadoop YARN
- Decides how resources should be managed.
- Decides how jobs should be scheduled.
- Analogy: the OS (Windows, Linux).
Hadoop HDFS
- Analogy: a file system (NTFS, NFS).
Flume & Sqoop
- Data collection (data ingestion).
- Analogy: on a PC, data comes in from different sources (downloads, pen
drives, external hard drives).
- Flume – unstructured or semi-structured data (photos, videos, live
streaming), continuous data, long files, real-time data.
- Sqoop – structured data (SQL databases such as Oracle, MySQL).
HBase
- While an RDBMS stores data in the form of rows, HBase stores data in the
form of columns (data can be stored without first creating a structure for it).
- The NoSQL concept is used.
- Modelled on the Google Bigtable project and specially designed to store
large data.
HIVE
- After processing, the output that comes out needs to be analysed (data
analysis).
- HIVE is used for SQL-like queries.
- Used for structured data.
PIG
- Component used for scripting.
- Reduces the lines of code (e.g. the code written in the Mapper, Reducer and
Driver classes is lengthy; Pig can reduce 200-300 lines of code to 10-20 lines).
Mahout
- A set of machine learning libraries that includes the common machine
learning algorithms.
- If you want to design a Hadoop-based application that uses ML algorithms,
Apache Mahout is used there (it provides system scalability by including the
ML algorithms).
Cloudera
- Four phases:
  1. Input data (data ingestion) – Flume, Sqoop
  2. Data management (HDFS) – HBase (way of storing the data)
  3. Data processing (MapReduce) – generates the output
  4. Proper way to analyse and visualize the data – Pig, Mahout
- Cloudera provides the way a user can search or explore the data in such a big
system.
- Provides a platform where data can be managed and monitored.
Oozie
- Purpose: to schedule jobs (workflow).
- For example, like setting an alarm for the entire system (which job should
complete at what time, automation); this can be done using Oozie.
Zookeeper
- Proper monitoring and proper management of all components.
- The different ecosystem components are combined together (clusters).
- ZooKeeper manages these big clusters together.
Hadoop Limitations
1. Small File Problem:
• Hadoop's HDFS (Hadoop Distributed File System) is optimized for large files. Storing a large number of small files can
lead to performance issues because the Namenode (which stores metadata) can become overloaded.
• This is because each file, regardless of size, requires metadata storage in the Namenode.
• The default HDFS block size is typically 128 MB or 256 MB, making files significantly smaller than this block
size problematic (see the sketch after this list).
2. Slow Processing Speed:
• Hadoop's core processing engine, MapReduce, primarily handles batch processing, which can result in slower
processing times compared to other frameworks designed for real-time or interactive data processing.
• This is because data is read from and written to disk during processing, which is an expensive operation, especially
for large datasets.
3. Limited Real-Time and Iterative Processing:
• Hadoop is not well-suited for real-time or interactive data analysis, as it is designed for batch processing of data at
rest.
• Iterative processing (where the same data is processed multiple times in a loop) is also inefficient in Hadoop.
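A back-of-the-envelope sketch of the small file problem from point 1: it assumes roughly 150 bytes of NameNode memory per file or block object (a commonly cited rule of thumb, treated here as an assumption) and compares the same amount of data stored as a few large files versus millions of tiny ones.

BYTES_PER_OBJECT = 150   # assumed NameNode memory per file or block object (rule of thumb)
BLOCK_SIZE_MB = 128

def namenode_memory_mb(num_files, avg_file_size_mb):
    """Rough, toy estimate of NameNode memory needed to track the given files."""
    blocks_per_file = max(1, -(-avg_file_size_mb // BLOCK_SIZE_MB))   # ceiling division
    objects = num_files * (1 + blocks_per_file)   # one file object plus its block objects
    return objects * BYTES_PER_OBJECT / (1024 * 1024)

# Roughly 1 TB stored as a few large files vs. millions of tiny ones.
print(namenode_memory_mb(num_files=8_000, avg_file_size_mb=128))       # ~2 MB of metadata
print(namenode_memory_mb(num_files=10_000_000, avg_file_size_mb=0.1))  # ~2,800 MB of metadata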
Hadoop Limitations
4. Complexity:
• Hadoop can be complex to set up, configure, and manage, particularly in large clusters.
• The learning curve for Hadoop can be steep, especially for those unfamiliar with MapReduce programming.
5. Security Concerns:
• Hadoop's default security features are basic, and it lacks encryption at the storage and network levels.
• Implementing robust security measures in Hadoop often requires integrating additional tools, which can add to
the complexity.
