801 – BIG DATA
Unit -1
Introduction to Big Data
What is Big Data?
Big Data refers to massive volumes of data that are generated, stored, and
analyzed for insights to improve decision-making. This data can be
structured, semi-structured, or unstructured and is too complex to be
processed using traditional data management tools. Big Data is collected
from multiple sources, including social media, sensors, IoT devices, financial
transactions, healthcare records, and digital applications.
The growth of digital technology has significantly contributed to the
explosion of data. Companies, governments, and research institutions use
Big Data analytics to enhance productivity, improve customer experiences,
optimize business operations, and drive innovation.
Importance of Big Data
The significance of Big Data lies in its ability to uncover hidden patterns,
correlations, and insights that were previously inaccessible due to
computational limitations. Organizations use Big Data to:
Enhance Business Decisions – Helps businesses understand market
trends, customer behavior, and operational performance.
Improve Healthcare – Used for disease prediction, personalized
medicine, and efficient patient care.
Boost Customer Experience – Enables businesses to deliver
personalized recommendations and services.
Detect Fraud & Security Threats – Helps financial institutions and
cybersecurity experts identify fraudulent activities.
Optimize Supply Chain & Logistics – Enables better inventory
management, demand forecasting, and transportation efficiency.
Future of Big Data
The future of Big Data is driven by advancements in AI, cloud computing,
edge computing, and blockchain technology. Emerging trends include:
AI & Machine Learning – Automating data-driven decision-making.
Edge Computing – Processing data closer to the source to reduce
latency.
Blockchain & Big Data – Enhancing security and transparency in
data transactions.
Characteristics of Big Data (5Vs of Big Data)
Big Data is defined by five primary characteristics, known as the 5Vs:
1. Volume (Data Size and Scale)
The most distinguishing feature of Big Data is its enormous size.
Data is collected from various sources such as social media, IoT
sensors, transaction logs, and online activity.
Traditional database systems struggle to handle petabytes (PB) and
exabytes (EB) of data efficiently.
Example:
Facebook generates 4+ petabytes of data daily from user activities,
posts, comments, and likes.
A single Boeing 787 aircraft generates 500GB of data per flight from
sensors monitoring engine performance.
2. Velocity (Speed of Data Generation and Processing)
Big Data is generated at an incredibly high speed and needs to be
processed in real-time.
Businesses must analyze data streams instantly to make timely
decisions (e.g., fraud detection, stock market analysis).
Example:
Stock trading platforms process millions of transactions per
second, requiring real-time analytics to detect anomalies.
Social media platforms such as Twitter generate over 500 million
tweets per day, which need to be processed for trend analysis and
sentiment detection.
3. Variety (Different Types of Data)
Data comes in different formats:
o Structured data (organized, relational databases).
o Semi-structured data (XML, JSON, log files).
o Unstructured data (images, videos, social media posts, sensor
data).
Handling and integrating these diverse formats is a challenge.
Example:
A single e-commerce transaction includes:
o Structured Data: Customer ID, transaction amount, payment
details.
o Semi-structured Data: JSON/XML containing order details.
o Unstructured Data: Customer reviews, images, and voice
support calls.
4. Veracity (Data Quality and Reliability)
The accuracy and trustworthiness of data are critical, as poor-quality
data can lead to incorrect decisions.
Issues such as missing values, inconsistencies, and noise must be
handled through data cleansing and preprocessing.
Example:
In fraud detection, financial institutions filter out false positives
(legitimate transactions flagged as fraudulent) by analyzing spending
patterns and customer behavior.
Fake news on social media platforms requires verification to prevent
misinformation.
5. Value (Extracting Meaningful Insights from Data)
The main purpose of Big Data is to extract useful business insights that
improve decision-making.
Organizations invest in analytics, AI, and machine learning to derive
value from data.
Example:
Netflix analyzes users' watch history to provide personalized
recommendations, improving user engagement and retention.
Retail stores analyze shopping patterns to offer targeted promotions
and optimize inventory.
Types of Big Data
Big Data can be classified into three main categories:
1. Structured Data
Data that follows a predefined schema and is stored in relational
databases.
Can be easily searched and analyzed using SQL queries.
Examples:
Customer databases (ID, Name, Age, Address).
Bank transaction records.
Inventory management systems.
2. Unstructured Data
Data that does not have a specific format, making it difficult to store
and analyze using traditional tools.
Requires advanced AI/ML models for processing.
Examples:
Social media posts (tweets, Facebook comments).
Audio and video recordings.
Satellite images and CCTV footage.
3. Semi-structured Data
Data that does not follow a strict schema but contains tags, metadata,
or markers to define structure.
Often stored in NoSQL databases.
Examples:
XML and JSON files.
Log files from web servers.
Emails (structured header + unstructured body).
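To make the three categories concrete, the short Python sketch below shows how a single (hypothetical) e-commerce order can contain all three types of data at once: structured fields, a semi-structured JSON payload, and unstructured review text. The field names and values are illustrative only.

```python
import json

# Structured data: fixed fields with a predefined schema (as in an RDBMS row).
customer_id = 1024
transaction_amount = 59.99

# Semi-structured data: a JSON payload whose keys act as self-describing tags.
order_json = '{"order_id": "A-981", "items": [{"sku": "B-10", "qty": 2}], "coupon": null}'
order = json.loads(order_json)          # parsed into a nested dict/list structure

# Unstructured data: free text with no schema; needs NLP/ML tools to analyze at scale.
review_text = "Fast delivery, but the packaging was damaged."

print(customer_id, transaction_amount)
print(order["items"][0]["sku"], order["items"][0]["qty"])
print(len(review_text.split()), "words in the review")
```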
Traditional Data vs. Big Data: A Detailed
Comparison
As businesses and industries evolve, data management has shifted from
traditional data processing methods to Big Data technologies. Below is a
detailed comparison between Traditional Data and Big Data based on
various factors.
1. Definition
Traditional Data
Refers to structured and well-organized data stored in relational
databases (RDBMS).
Managed using SQL-based systems like MySQL, Oracle, PostgreSQL.
Suitable for small to medium-sized datasets that fit into traditional
databases.
Big Data
Refers to extremely large, complex, and diverse datasets that
traditional databases cannot efficiently handle.
Includes structured, semi-structured, and unstructured data.
Requires distributed computing frameworks like Hadoop, Apache
Spark, NoSQL databases for processing.
2. Data Volume (Size of Data)
Traditional Data
Handles limited amounts of data (usually gigabytes to terabytes).
Designed for small datasets with structured relationships.
Big Data
Deals with massive datasets (ranging from terabytes to petabytes
and beyond).
Grows exponentially with data generated from IoT, social media, and
real-time applications.
✅ Example:
A traditional banking system stores customer account details in a
relational database.
Big Data applications analyze customer transactions, fraud
detection, and market trends in real time.
3. Data Variety (Types of Data)
Traditional Data
Mostly structured data with predefined formats (e.g., tables in a
database).
Can be stored and retrieved using SQL queries.
✅ Example:
Employee records stored in an SQL database (name, age, salary,
department).
Big Data
Can be structured, semi-structured, or unstructured.
Includes data from social media, sensors, emails, videos, logs, IoT
devices.
Requires specialized tools like MongoDB (NoSQL), HDFS (Hadoop),
and Apache Spark.
✅ Example:
Customer reviews (text), social media posts (images/videos), and GPS
location data.
4. Data Velocity (Speed of Processing)
Traditional Data
Processes data in batch mode (stored and then analyzed later).
Does not support real-time analytics.
Suitable for businesses with slow-changing datasets.
Big Data
Requires real-time or near real-time processing due to fast data
generation.
Uses Apache Kafka, Apache Spark, and Flink for streaming data
processing.
Essential for applications like fraud detection, stock trading, and IoT
monitoring.
✅ Example:
Traditional Approach: A retail store analyzes last month’s sales for
future planning.
Big Data Approach: Analyzing real-time customer purchases to offer
instant discounts.
5. Data Storage & Management
Traditional Data
Stored in centralized databases (RDBMS) like MySQL, PostgreSQL,
and Oracle.
Uses structured schemas to define data formats.
Cannot efficiently handle unstructured or semi-structured data.
Big Data
Stored in distributed file systems (HDFS, Google BigTable, Amazon
S3).
Uses NoSQL databases (MongoDB, Cassandra, HBase) to handle
various data types.
Enables horizontal scaling for handling massive data volumes.
✅ Example:
Traditional: A company's financial records stored in an Oracle
database.
Big Data: Google indexing billions of web pages for search results.
6. Data Processing Techniques
Traditional Data
Uses SQL-based queries to retrieve structured data.
Relies on single-server architecture for computations.
Not optimized for distributed parallel processing.
Big Data
Uses parallel distributed processing across multiple nodes.
Processes data using MapReduce (Hadoop), Spark, and Flink.
Supports advanced machine learning and AI-driven analytics.
✅ Example:
Traditional: A business runs monthly reports using SQL queries.
Big Data: Facebook processes millions of user activities in real-time.
7. Scalability
Traditional Data
Vertical scaling (Scaling Up) – Requires upgrading a single
machine’s hardware (CPU, RAM).
Limited by hardware capacity and performance constraints.
Big Data
Horizontal scaling (Scaling Out) – Increases capacity by adding
multiple machines.
Distributed computing ensures high availability and fault tolerance.
✅ Example:
Traditional Approach: Increasing database performance by adding
more RAM to a server.
Big Data Approach: Using a Hadoop cluster with 100+ machines to
process data simultaneously.
8. Cost & Infrastructure
Traditional Data
Requires high-end relational database licenses and expensive
infrastructure.
Involves high maintenance costs for centralized database
management.
Big Data
Uses open-source technologies (Hadoop, Spark, NoSQL) to
reduce costs.
Leverages cloud-based solutions like AWS, Google Cloud, and Azure for
cost-effective scaling.
✅ Example:
Traditional: A bank purchases Oracle licenses for customer data
management.
Big Data: Netflix uses AWS cloud storage for its massive video
streaming data.
9. Security & Data Privacy
Traditional Data
Uses authentication & authorization (e.g., role-based access
control).
Easier to manage due to structured data and centralized databases.
Big Data
Requires advanced security measures due to distributed
architecture.
Involves data encryption, anonymization, and compliance with GDPR &
CCPA.
Security challenges in handling real-time, large-scale transactions.
✅ Example:
Traditional: Secure access to a company’s HR database using SQL
authentication.
Big Data: Securing global IoT device communications from cyber
threats.
10. Use Cases & Applications
Traditional Data Use Cases
1. Banking & Finance: Storing customer transactions in relational
databases.
2. HR Systems: Employee payroll and attendance records.
3. Inventory Management: Tracking stock levels using SQL databases.
Big Data Use Cases
1. Healthcare: AI-driven disease prediction and patient data analysis.
2. E-commerce: Personalized product recommendations (Amazon,
Flipkart).
3. Social Media: Sentiment analysis of Twitter, Facebook, Instagram.
4. Smart Cities: Traffic pattern analysis and IoT-based energy
management.
✅ Example:
Traditional: A telecom company maintains customer billing records in
a relational database.
Big Data: Analyzing millions of customer calls to detect network
failures.
Evolution of Big Data: A Detailed Overview
Big Data has evolved over the decades as technology, computing power, and
data generation capabilities have advanced. The journey from traditional
databases to the era of AI-driven analytics showcases how businesses and
industries have adapted to massive, fast-growing, and diverse
datasets.
1. Pre-Big Data Era (Before 2000s) – The Age of Traditional Data
Key Characteristics:
Data Volume: Small to moderate (Megabytes to Gigabytes).
Data Type: Mostly structured data (tables in relational databases).
Storage & Processing: Traditional Relational Database
Management Systems (RDBMS) such as Oracle, MySQL,
PostgreSQL, and SQL Server.
Challenges:
o Could not handle unstructured data (images, videos, emails,
logs).
o Data storage and processing had hardware limitations.
o No real-time analytics; only batch processing was possible.
Example:
Banks stored customer transaction details in SQL databases.
Businesses used Excel spreadsheets for data management.
2. Early Big Data (2000s) – The Birth of Distributed Systems
Key Characteristics:
Data Volume: Increased significantly (Gigabytes to Terabytes).
Data Type: Semi-structured and unstructured data emerged
(emails, XML, JSON, log files).
Storage & Processing:
o Google developed MapReduce (2004) – a distributed
computing framework for large-scale data processing.
o Hadoop (2006) was introduced as an open-source
implementation of MapReduce.
o NoSQL databases (MongoDB, Cassandra, HBase) emerged
to handle unstructured data.
Major Developments:
Search Engines: Google revolutionized search with its PageRank
algorithm and distributed computing techniques.
Social Media Rise: Platforms like Facebook, Twitter, and LinkedIn
started generating massive amounts of unstructured user data.
Cloud Storage: Companies like Amazon (AWS S3) and Google
Cloud Storage introduced scalable storage solutions.
Challenges:
Limited real-time processing – Hadoop was batch-based.
High latency – Not suitable for instant decision-making.
Example:
Yahoo! used Hadoop to process search engine data.
Walmart analyzed sales trends using distributed computing.
3. Big Data Boom (2010s) – Real-Time Analytics & AI Integration
Key Characteristics:
Data Volume: Exploded to Terabytes, Petabytes, and even
Exabytes.
Data Type:
o Structured (SQL databases).
o Semi-structured (JSON, XML, web logs).
o Unstructured (videos, images, sensor data, social media).
Storage & Processing:
o Apache Spark (2014) complemented Hadoop by enabling in-memory
and near real-time data processing.
o Streaming frameworks like Apache Kafka and Flink were
introduced.
o AI and Machine Learning started leveraging Big Data for
predictive analytics.
Major Developments:
Real-Time Applications:
o Fraud detection in banking.
o Personalized recommendations (Netflix, Amazon).
o Self-driving cars using sensor data.
Cloud & Edge Computing:
o Companies adopted Google Cloud, AWS, Microsoft Azure for
scalable data processing.
o Edge Computing emerged to process data closer to the source
(IoT devices).
Deep Learning & AI:
o Deep learning frameworks (such as TensorFlow and PyTorch) leveraged
Big Data for speech recognition, image processing, and automation.
Challenges:
Data privacy concerns (GDPR, CCPA).
Cybersecurity threats due to large-scale data breaches.
Complexity in managing multi-cloud environments.
Example:
Netflix processes user viewing data in real time to recommend
shows.
Tesla’s autonomous cars analyze millions of sensor inputs for
navigation.
4. Modern Era (2020s – Present) – AI-Driven Big Data & Quantum
Computing
Key Characteristics:
Data Volume: Massive expansion to Zettabytes.
Data Type: Multi-source, heterogeneous, streaming data from IoT,
blockchain, AI, and 5G networks.
Storage & Processing:
o Serverless computing allows automatic scaling (AWS Lambda,
Google Cloud Functions).
o Quantum computing experiments with ultra-fast Big Data
analytics.
o Federated learning allows AI models to train on decentralized data
without moving sensitive data to a central server, reducing privacy risks.
Major Developments:
Big Data + AI + Blockchain:
o AI predicts diseases from medical Big Data.
o Blockchain ensures data security and traceability.
5G & IoT Revolution:
o 5G enables real-time streaming and analytics at a massive scale.
o Smart cities use IoT-generated data for traffic, waste, and energy
management.
Augmented Analytics:
o AI automatically cleans, processes, and interprets data.
o NLP (Natural Language Processing) allows businesses to ask
data-related questions in human language.
Challenges:
Ethical AI – Bias in AI models using Big Data.
Sustainability – Huge energy consumption for AI and data centers.
Data Sovereignty – Legal battles over where data is stored and
processed.
Example:
Google’s DeepMind uses AI and Big Data for protein structure
predictions.
Facebook detects fake news using real-time Big Data analytics.
Future of Big Data (Beyond 2030)
Expected Developments:
✅ AI-powered autonomous systems:
AI-driven data pipelines will fully automate data collection, processing,
and insights generation.
✅ Quantum Computing for Big Data:
Quantum algorithms will analyze exabytes of data in seconds.
✅ DNA Data Storage:
Storing vast amounts of data in DNA molecules for near-infinite
storage.
✅ AI-Augmented Decision-Making:
Governments and businesses will rely on AI-driven insights for
policymaking, defense, and economy.
Challenges with Big Data: A Detailed Analysis
Big Data brings immense opportunities, but it also presents several
challenges related to data storage, processing, security, privacy, and
real-time analytics. As organizations deal with high-volume, high-
velocity, and high-variety data, they face significant hurdles in
managing, analyzing, and deriving insights efficiently.
Below is an in-depth exploration of the major challenges associated with
Big Data.
1. Data Storage & Scalability Issues
Problem:
The massive increase in data volume (from Terabytes to
Petabytes and beyond) creates a storage bottleneck.
Traditional relational databases (SQL-based systems) struggle to
handle this scale efficiently.
Storage costs increase as data volume grows, requiring distributed
and cloud storage solutions.
Solutions:
✅ Distributed File Systems – Apache Hadoop’s HDFS (Hadoop
Distributed File System) stores large-scale data across multiple machines.
✅ Cloud Storage – AWS S3, Google Cloud Storage, and Azure Blob Storage
provide scalable and cost-effective solutions.
✅ Compression Techniques – Reduce storage costs by using efficient
compression algorithms like Snappy, LZ4, or Gzip.
Example:
Facebook stores petabytes of user data using a combination of
Hadoop, Hive, and cloud-based infrastructure.
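As a small illustration of the compression point above, the sketch below uses Python's built-in gzip module (Snappy and LZ4 require third-party libraries) on a repetitive, log-like payload. The data and the resulting sizes are purely illustrative.

```python
import gzip

# A repetitive log-like payload (hypothetical); real Big Data files compress
# similarly well when they contain many repeated field names and values.
log_lines = ("2024-01-01T00:00:00 INFO user=42 action=view page=/home\n" * 10_000).encode()

compressed = gzip.compress(log_lines)

print(f"raw size:        {len(log_lines):>9,} bytes")
print(f"gzip-compressed: {len(compressed):>9,} bytes")
print(f"ratio:           {len(log_lines) / len(compressed):.1f}x smaller")
```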
2. Data Processing & Real-Time Analytics
Problem:
Traditional data processing tools (SQL, RDBMS) are too slow to
analyze massive datasets.
Real-time analytics is critical for fraud detection, stock trading,
self-driving cars, and other applications.
High latency in batch processing (Hadoop’s MapReduce) makes
real-time decision-making difficult.
Solutions:
✅ In-Memory Processing – Apache Spark and Google’s BigQuery enable
faster analytics by processing data in memory instead of disk.
✅ Streaming Frameworks – Apache Kafka, Flink, and Storm provide
real-time data stream processing.
✅ Edge Computing – Process data closer to the source (IoT devices,
sensors) instead of sending everything to a central data center.
Example:
Uber processes ride requests in real time using Apache Kafka and
Spark Streaming.
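The sketch below shows what consuming such a real-time stream can look like with the kafka-python client. It assumes a broker reachable at localhost:9092 and a hypothetical topic named ride-requests; it is an illustrative outline, not Uber's actual pipeline.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Subscribe to a hypothetical stream of ride-request events.
consumer = KafkaConsumer(
    "ride-requests",                              # hypothetical topic name
    bootstrap_servers="localhost:9092",           # assumed local broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:                          # blocks, yielding events as they arrive
    event = message.value
    # A real pipeline would enrich, aggregate, or score the event here
    # (e.g., fraud scoring, surge-pricing signals) within milliseconds.
    print(event.get("rider_id"), event.get("pickup_zone"))
```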
3. Data Integration & Heterogeneity
Problem:
Big Data comes from multiple sources – IoT sensors, social media,
emails, logs, images, videos, etc.
Data is structured (SQL databases), semi-structured (JSON,
XML), and unstructured (videos, images, social media posts).
Integrating diverse datasets into a unified platform is complex.
Solutions:
✅ ETL (Extract, Transform, Load) Pipelines – Tools like Apache NiFi,
Talend, and Apache Beam automate data integration.
✅ Data Lakes – Platforms like AWS Lake Formation store raw,
unstructured data efficiently.
✅ Schema-on-Read Approach – Allows flexible querying of diverse data
formats without predefining strict schemas.
Example:
Netflix integrates data from different sources (user interactions,
streaming behavior, network logs) into a unified analytics system.
4. Data Quality & Cleaning
Problem:
Incomplete, inconsistent, duplicate, and inaccurate data leads to
poor decision-making.
Different data sources may have conflicting formats (e.g., date
formats in YYYY-MM-DD vs. MM-DD-YYYY).
Missing values and noise make data less reliable for analytics and AI
models.
Solutions:
✅ Automated Data Cleaning Tools – OpenRefine, Trifacta, and Apache
Griffin help clean and standardize data.
✅ AI & ML for Data Cleaning – AI-powered tools detect anomalies,
inconsistencies, and missing values automatically.
✅ Master Data Management (MDM) – Ensures a single, consistent version
of data across an organization.
Example:
Healthcare data cleaning ensures that patient records are complete
and consistent for accurate diagnostics.
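A minimal data-cleaning sketch with pandas is shown below; it tackles the exact issues listed above (duplicates, conflicting date formats, missing values). The column names and records are hypothetical, and format="mixed" assumes pandas 2.x or newer.

```python
import pandas as pd

# Hypothetical patient records with duplicates, mixed date formats, and gaps.
records = pd.DataFrame({
    "patient_id": [101, 101, 102, 103],
    "visit_date": ["2024-03-01", "2024-03-01", "03-05-2024", None],
    "weight_kg":  [70.5, 70.5, None, 82.0],
})

cleaned = records.drop_duplicates().copy()                       # remove exact duplicates
cleaned["visit_date"] = pd.to_datetime(cleaned["visit_date"],    # unify date formats
                                       format="mixed", errors="coerce")
cleaned["weight_kg"] = cleaned["weight_kg"].fillna(              # fill missing values
    cleaned["weight_kg"].mean())

print(cleaned)
```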
5. Security & Privacy Concerns
Problem:
Big Data contains sensitive personal, financial, and business
information.
Cyberattacks, data breaches, and unauthorized access pose
major threats.
Regulatory compliance (GDPR, CCPA) requires organizations to
protect user data.
Solutions:
✅ Encryption & Access Control – Data is encrypted using AES-256, and
role-based access controls (RBAC) restrict access.
✅ Blockchain for Data Security – Ensures tamper-proof data storage and
audit trails.
✅ Anomaly Detection with AI – AI-driven cybersecurity tools detect
unusual patterns and prevent breaches.
Example:
Equifax Data Breach (2017): A cyberattack exposed 147 million
users’ personal data, highlighting the need for stronger Big Data
security.
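To ground the encryption point above, here is a minimal AES-256-GCM sketch using the widely used cryptography package. Key handling is deliberately simplified (the key is generated in memory); a production system would obtain it from a key-management service, and the sample record is hypothetical.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

key = AESGCM.generate_key(bit_length=256)      # 256-bit key, i.e. AES-256
aesgcm = AESGCM(key)
nonce = os.urandom(12)                         # must be unique per message

record = b'{"customer_id": 101, "card_last4": "4242"}'   # hypothetical sensitive record
ciphertext = aesgcm.encrypt(nonce, record, None)          # None = no associated data

# Only holders of the key (and the nonce) can recover the plaintext.
assert aesgcm.decrypt(nonce, ciphertext, None) == record
print(len(ciphertext), "encrypted bytes")
```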
6. High Infrastructure & Maintenance Costs
Problem:
Storing and processing petabytes of data requires expensive
servers, cloud services, and high-performance computing
clusters.
Maintenance costs rise due to data replication, backup, and
redundancy.
Solutions:
✅ Hybrid Cloud Solutions – Use on-premises + cloud storage for cost
optimization.
✅ Serverless Computing – AWS Lambda, Google Cloud Functions
dynamically allocate resources as needed.
✅ Data Archiving & Compression – Move infrequently accessed data to
cold storage (cheaper but slower).
Example:
Google optimizes Big Data costs by using AI-powered workload
scheduling and serverless infrastructure.
7. Ethical Issues & AI Bias in Big Data
Problem:
AI models trained on biased data make unfair decisions (e.g.,
biased hiring, racial profiling).
Privacy invasion – Companies use Big Data to track users without
consent.
Manipulation of public opinion using AI-driven fake news and
deepfakes.
Solutions:
✅ Fair AI Algorithms – Ensure diverse training data to remove bias.
✅ Ethical AI Regulations – Enforce AI transparency and explainability.
✅ User Control Over Data – Implement opt-in policies for data collection.
Example:
Facebook’s AI algorithm was accused of racial bias in advertising
placement, leading to stricter AI fairness policies.
8. Data Governance & Compliance
Problem:
Governments enforce strict data protection laws (GDPR in Europe,
CCPA in California).
Companies must track, store, and process user data legally.
Failure to comply leads to heavy fines (e.g., GDPR fines of up to €20
million or 4% of global annual turnover, whichever is higher).
Solutions:
✅ Data Masking & Tokenization – Hide sensitive user data to protect
privacy.
✅ Compliance Audits – Conduct regular checks to ensure legal compliance.
✅ Metadata Management – Properly label and classify sensitive data.
Example:
Amazon was fined €746 million for GDPR violations in 2021 due
to improper handling of user data.
Technologies Available for Big Data
Big Data technologies help in storing, processing, analyzing, and
visualizing massive amounts of structured and unstructured data.
These technologies are categorized into data storage, processing
frameworks, databases, real-time analytics, machine learning, and
visualization tools.
Below is a detailed classification of Big Data technologies and their
applications.
1. Data Storage Technologies
Big Data storage technologies are essential for storing massive volumes
of structured and unstructured data efficiently.
a) Distributed File Systems
Used to store data across multiple nodes to ensure scalability and fault
tolerance.
✅ Hadoop Distributed File System (HDFS) – The backbone of Apache
Hadoop, used for storing large-scale datasets across clusters.
✅ Google File System (GFS) – Google's proprietary distributed file system,
which inspired the design of HDFS.
✅ Amazon S3 (Simple Storage Service) – A cloud-based object storage
solution for handling large data workloads.
b) Cloud Storage
Used for storing and managing data on remote cloud servers.
✅ AWS S3, Google Cloud Storage, Azure Blob Storage – Cloud storage
services that offer high availability and scalability.
✅ Snowflake – A cloud-based data warehouse optimized for analytics.
✅ MinIO – An open-source alternative to AWS S3 for private cloud storage.
c) Data Warehousing
Optimized for storing structured and semi-structured data for analytical
processing.
✅ Amazon Redshift – A cloud-based data warehouse optimized for complex
queries.
✅ Google BigQuery – A serverless data warehouse for large-scale analytics.
✅ Apache Hive – A data warehouse built on top of Hadoop, enabling SQL-like
queries on Big Data.
2. Data Processing Technologies
These frameworks process large datasets efficiently using batch and real-
time processing methods.
a) Batch Processing Frameworks
Process large datasets in batches at scheduled intervals.
✅ Apache Hadoop (MapReduce) – A framework that processes massive
datasets using the MapReduce programming model.
✅ Apache Spark – Faster than Hadoop, uses in-memory processing for high-
speed batch analytics.
✅ Apache Flink – Handles both batch and real-time stream processing.
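To illustrate the batch side of these frameworks, the PySpark sketch below counts words in a text file. It assumes a local Spark installation (pip install pyspark) and a hypothetical input path; on a real cluster the same code runs unchanged across many nodes.

```python
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-word-count").getOrCreate()

# Hypothetical input path; could equally be an hdfs:// or s3:// URI on a cluster.
lines = spark.read.text("data/server_logs.txt").rdd.map(lambda row: row[0])

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(add))                    # sum counts per word in parallel

for word, count in counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print(word, count)

spark.stop()
```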
b) Real-Time & Stream Processing Frameworks
Process continuous streams of data from IoT devices, sensors, and social
media.
✅ Apache Kafka – A distributed messaging system used for streaming data
pipelines.
✅ Apache Storm – Processes real-time streaming data with low latency.
✅ Apache Flink – Supports event-driven real-time stream processing.
✅ Google Dataflow – A serverless data processing service for batch and
real-time streams.
3. Big Data Databases
Big Data requires specialized databases that can handle massive amounts
of structured, semi-structured, and unstructured data.
a) NoSQL Databases
Designed for scalability and high availability, ideal for handling
unstructured data.
✅ MongoDB – A document-oriented NoSQL database for flexible schema
storage.
✅ Cassandra – A highly scalable, distributed database used by Facebook,
Netflix, and Apple.
✅ HBase – A NoSQL database that runs on Hadoop, optimized for large
tables.
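A short pymongo sketch is given below to show the document model in practice; the connection URI, database, and collection names are assumptions for illustration.

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")   # assumed local MongoDB instance
reviews = client["shop"]["reviews"]                 # hypothetical database/collection

# Documents in the same collection can have different (flexible) schemas.
reviews.insert_one({"product": "B-10", "stars": 4, "text": "Works well"})
reviews.insert_one({"product": "B-10", "stars": 2, "tags": ["late", "damaged"]})

# Query by field; no table schema had to be declared up front.
for doc in reviews.find({"product": "B-10", "stars": {"$gte": 4}}):
    print(doc["stars"], doc.get("text", ""))
```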
b) NewSQL Databases
Combine the benefits of traditional SQL databases with the scalability
of NoSQL.
✅ Google Spanner – A globally distributed database that provides strong
consistency.
✅ CockroachDB – A fault-tolerant, horizontally scalable SQL database.
✅ MemSQL – A real-time, high-performance distributed database.
c) Graph Databases
Designed for storing complex relationships in social networks, fraud
detection, and recommendation systems.
✅ Neo4j – A popular graph database used for social media and fraud
analytics.
✅ Amazon Neptune – A fully managed graph database optimized for deep
link analytics.
✅ ArangoDB – A multi-model NoSQL database that supports graph,
document, and key-value data.
4. Machine Learning & AI for Big Data
AI and Machine Learning technologies analyze Big Data to extract patterns,
trends, and predictive insights.
✅ TensorFlow & PyTorch – Deep learning frameworks for training large-
scale AI models.
✅ Apache Mahout – A scalable machine learning library built for Hadoop.
✅ MLlib (Apache Spark) – A distributed machine learning library for high-
performance AI workloads.
✅ Google AI Platform – A cloud-based AI service for training and deploying
ML models.
✅ H2O.ai – An open-source AI platform for predictive analytics and deep
learning.
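The sketch below shows the general shape of distributed training with Spark MLlib (one of the libraries listed above): features are assembled into a vector column and a logistic regression model is fit across the cluster. The tiny in-memory dataset and column names are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical labeled data: (amount, transactions in the last hour, is_fraud label).
df = spark.createDataFrame(
    [(20.0, 1.0, 0), (15.5, 2.0, 0), (900.0, 14.0, 1), (750.0, 11.0, 1)],
    ["amount", "tx_last_hour", "label"],
)

# Combine raw columns into the single vector column MLlib expects.
features = VectorAssembler(inputCols=["amount", "tx_last_hour"],
                           outputCol="features").transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
print(model.coefficients)

spark.stop()
```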
5. Data Visualization & Business Intelligence (BI) Tools
Visualization tools help in interpreting Big Data insights in a more
intuitive way.
✅ Tableau – A leading BI tool for interactive data visualization.
✅ Power BI – A Microsoft analytics platform for real-time dashboards.
✅ Apache Superset – An open-source visualization tool for large datasets.
✅ Google Data Studio – A free tool for connecting and visualizing data from
multiple sources.
✅ D3.js – A JavaScript library for creating custom data visualizations.
6. Data Security & Privacy Technologies
Big Data security ensures data confidentiality, integrity, and protection
against cyber threats.
✅ Apache Ranger – Security and policy framework for Hadoop and Big Data
environments.
✅ Apache Knox – Provides authentication and access control for Big Data
systems.
✅ GDPR & CCPA Compliance Tools – Tools like BigID and Privacera help
companies comply with privacy laws.
✅ Encryption (AES-256, SSL, TLS) – Ensures data is encrypted during
transmission and storage.
✅ Blockchain for Data Security – Used in fraud detection and tamper-proof
audit trails.
7. Orchestration & Workflow Management
Big Data workflows require tools for scheduling, automating, and
monitoring data pipelines.
✅ Apache Airflow – A powerful workflow automation tool for orchestrating
ETL jobs.
✅ Apache Oozie – A workflow scheduler designed for Hadoop jobs.
✅ Prefect – A Python-based workflow automation tool for data pipelines.
✅ Luigi – A Python library for building complex batch workflows.
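As an example of workflow orchestration, here is a minimal Apache Airflow DAG (assuming Airflow 2.x) with two dependent Python tasks; the DAG id and task bodies are placeholders, not a real pipeline.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")    # placeholder task body

def transform():
    print("clean and aggregate the extracted data")  # placeholder task body

with DAG(
    dag_id="daily_etl_sketch",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",           # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task        # transform runs only after extract succeeds
```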
Big Data Infrastructure: A Detailed Definition
Big Data infrastructure refers to the collection of hardware, software,
networking, and cloud solutions that are specifically designed to store,
process, manage, and analyze vast amounts of data. It is the
backbone of any system or platform dealing with Big Data and aims to
ensure the scalability, reliability, security, and efficiency needed to
handle massive volumes of data that traditional infrastructures cannot.
Big Data infrastructure is built to manage the three Vs of Big Data:
Volume: The amount of data generated.
Velocity: The speed at which data is generated, processed, and
analyzed.
Variety: The different types of data—structured, semi-structured, and
unstructured—that must be handled.
Key Components of Big Data Infrastructure
1. Data Storage Layer
This layer is responsible for the storage and management of large
datasets. Since Big Data often consists of structured, semi-structured,
and unstructured data, storage systems must be distributed and
scalable to handle this large volume of data.
o Distributed File Systems (DFS): These are used to store large
datasets across multiple servers. Examples include HDFS
(Hadoop Distributed File System) and Google File System
(GFS).
o Cloud Storage Solutions: To handle elastic scalability, cloud
storage such as AWS S3, Google Cloud Storage, and Azure
Blob Storage are commonly used.
o Data Warehouses and Databases: Specialized systems like
Amazon Redshift, Google BigQuery, and NoSQL databases
(like MongoDB, Cassandra) allow for fast data retrieval and
analytical processing.
2. Data Processing Layer
This layer handles the processing of data, whether it's batch
processing (data processed in chunks) or stream processing (real-time
data processing).
o Batch Processing Frameworks: For large-scale data
processing, frameworks like Apache Hadoop (MapReduce)
and Apache Spark are commonly used.
o Stream Processing Frameworks: Real-time data processing
can be managed by systems like Apache Kafka and Apache
Flink, which allow for the processing of data as it arrives.
3. Data Management Layer
This layer is responsible for the management of databases and
other forms of data storage.
o SQL Databases (NewSQL): For scenarios requiring consistency
and transactions, databases such as Google Spanner and
CockroachDB provide relational database features with
scalability.
o NoSQL Databases: For handling large amounts of unstructured
data, NoSQL systems like MongoDB, Cassandra, and HBase
are used.
o Graph Databases: These are used for managing relationship-
based data and include systems like Neo4j and Amazon
Neptune.
4. Data Integration & Ingestion Layer
This layer focuses on getting data from various sources and
bringing it into the system.
o Data Ingestion Tools: Tools like Apache NiFi, Apache Flume,
and Kafka Connect are used to collect, cleanse, and route data
from disparate sources into the Big Data system.
o ETL (Extract, Transform, Load): These tools are used to
extract data from various sources, transform it into a usable
format, and load it into storage systems or databases. Popular
ETL tools include Apache Airflow and Talend.
5. Networking Infrastructure
Big Data infrastructures rely heavily on high-bandwidth, low-
latency networking to transfer data quickly and efficiently across
various components of the system.
o High-Speed Networking: Technologies like InfiniBand, 100 GbE,
and 10 GbE (Gigabit Ethernet) are used to handle large data-transfer
needs.
o Cloud Networking: For cloud-based infrastructure, solutions like
AWS Direct Connect and Google Cloud Interconnect provide
fast and secure network connections.
6. Security & Governance Layer
Ensuring the security, privacy, and compliance of data is critical.
This layer focuses on protecting the data from unauthorized access
and ensuring that it is stored and processed according to relevant laws
and regulations.
o Access Control: Apache Ranger and Apache Knox provide
security by enforcing access policies in Hadoop ecosystems.
o Data Encryption: Sensitive data is encrypted both in transit
(using SSL/TLS) and at rest (using AES-256 encryption).
o Compliance: Big Data systems must ensure compliance with
industry standards like GDPR, HIPAA, and other data protection
laws.
o Data Masking & Anonymization: For privacy, personal and
sensitive data can be masked or anonymized.
7. Data Analytics & Machine Learning Layer
Once data is stored and processed, organizations use analytics and
machine learning (ML) tools to derive insights from the data.
o Analytics Tools: Tools like Apache Hive, Apache Presto, and
Apache Drill allow for querying and analyzing structured and
unstructured data.
o Machine Learning Tools: Platforms like Apache Mahout,
MLlib (Apache Spark), and AI frameworks like TensorFlow and
PyTorch are used for building and training machine learning
models on Big Data.
o Real-Time Analytics: Streaming platforms like Apache Flink
and Apache Storm are used for analyzing real-time data to gain
immediate insights.
8. Data Visualization Layer
Visualizing Big Data insights is crucial for making data-driven
decisions. This layer includes Business Intelligence (BI) tools and
visualization platforms.
o BI Tools: Tools like Tableau, Power BI, and QlikView are used
for creating interactive dashboards and reports.
o Custom Visualization: D3.js and Apache Superset can be
used for building custom visualizations for Big Data applications.
Big Data Infrastructure Architecture
The architecture of Big Data infrastructure consists of multiple layers, each
handling different aspects of the Big Data lifecycle:
1. Data Collection: Sources include IoT devices, social media, sensors,
databases, and logs.
2. Data Ingestion: Collecting data using tools like Kafka, NiFi, or
Flume.
3. Storage: Using distributed file systems like HDFS or cloud storage
platforms.
4. Processing: Distributed computing using Hadoop, Spark, or Flink.
5. Analysis: Querying with Hive, Presto, and using machine learning
models.
6. Visualization: Displaying insights using Tableau, Power BI, or
custom dashboards.
Infrastructure Considerations
When designing Big Data infrastructure, organizations must consider:
Scalability: The ability to handle growing volumes of data by scaling
horizontally (adding more machines) or vertically (upgrading
hardware).
Fault Tolerance: Ensuring the system can recover quickly from
hardware or software failures. This is achieved through replication
and redundancy.
Performance: The infrastructure must be optimized for both data
processing speed and real-time analytics.
Cost Management: Managing the cost of storing and processing vast
amounts of data, especially in cloud environments.
Use of Data Analytics in Big Data
Data analytics plays a critical role in Big Data as it helps organizations make
data-driven decisions by extracting valuable insights from massive
datasets. Big Data analytics involves examining large datasets (often with
complex and varied data types) to uncover hidden patterns, correlations, and
trends. Here are some key uses:
1. Predictive Analytics
Predicting trends and future outcomes based on historical data, such
as predicting customer behavior, stock market trends, or potential
risks.
Applications: Retailers use predictive analytics to forecast demand,
manufacturers predict equipment failures, and healthcare systems
predict disease outbreaks.
2. Descriptive Analytics
Summarizing historical data to identify patterns and trends.
Descriptive analytics answers "What happened?"
Applications: Businesses use descriptive analytics to understand past
sales trends, consumer behavior, or operational performance.
3. Diagnostic Analytics
Finding reasons behind specific outcomes or events by analyzing
data. Diagnostic analytics answers "Why did it happen?"
Applications: Identifying the root cause of a problem (e.g., why sales
dropped last quarter or why a machine malfunctioned).
4. Real-time Analytics
Processing data in real-time to make instant decisions. Real-time
analytics is essential for scenarios that require immediate insights, like
monitoring transactions or detecting fraud.
Applications: Stock market trading, IoT devices, and real-time
marketing strategies.
5. Prescriptive Analytics
Providing recommendations for actions based on data analysis.
Prescriptive analytics answers "What should we do?"
Applications: Optimizing supply chain management, personalized
marketing recommendations, or identifying the best course of action in
healthcare (e.g., treatment plans).
6. Machine Learning and AI
Training algorithms on Big Data to recognize patterns, make
predictions, and improve over time without human intervention.
Applications: Personalized product recommendations (e.g., Netflix,
Amazon), fraud detection, and autonomous vehicles.
7. Data Visualization
Displaying data in an easy-to-understand visual format such as
charts, graphs, or dashboards. This makes it easier to identify trends
and insights.
Applications: Business Intelligence tools (e.g., Power BI, Tableau)
allow executives to track key performance indicators (KPIs) and
metrics.
Desired Properties of a Big Data System
Big Data systems must be equipped with several essential properties to
manage vast amounts of data efficiently. Below are the most important
characteristics:
1. Scalability
A Big Data system should be able to handle growing amounts of
data and adapt as the dataset increases in size, speed, and
complexity.
Horizontal scaling (adding more nodes to a cluster) and vertical
scaling (upgrading the existing hardware) are common approaches.
2. Reliability
Big Data systems need to provide consistent data availability and
fault tolerance to ensure that they work even during hardware or
software failures.
Techniques like data replication (duplicating data across multiple
machines) and data recovery mechanisms are used to ensure
reliability.
3. Flexibility (Variety)
A Big Data system must be able to handle structured, semi-
structured, and unstructured data (e.g., text, images, video, and
audio).
It should support data formats like JSON, XML, CSV, and Parquet,
and be able to process data from different sources (social media,
sensor data, web logs, etc.).
4. Speed (Low Latency)
Big Data systems should be capable of processing data at high speed,
especially when real-time or near-real-time analytics are required.
For example, stream processing frameworks like Apache Kafka or
Apache Flink help achieve low-latency processing for real-time data.
5. High Throughput
The system should be able to process large volumes of data in a short
period. This is achieved by leveraging distributed computing and
parallel processing techniques.
Batch processing (for large datasets) and stream processing (for
real-time data) are optimized for high throughput.
6. Data Consistency
Consistency is crucial for ensuring that all data copies across
distributed systems are in sync. Distributed systems must balance the
trade-offs described by the CAP theorem (Consistency, Availability,
Partition tolerance) when deciding how strictly consistency is enforced.
For high consistency, systems like Google Spanner or Apache
HBase are commonly used.
7. Fault Tolerance
The ability of a system to recover from failures without losing data. A
Big Data system must ensure automatic failover, data replication,
and the ability to resynchronize after a failure.
Technologies like Hadoop Distributed File System (HDFS) and
Apache Spark ensure fault tolerance by replicating data blocks across
nodes.
8. Security and Privacy
As Big Data often involves sensitive information, ensuring data
security is essential. Security measures should include encryption,
user authentication, and role-based access control (RBAC) to
protect data from unauthorized access.
Privacy is also important, with compliance to GDPR, HIPAA, or other
data protection laws.
9. Manageability
A Big Data system should be easy to manage, with monitoring tools,
dashboards, and automatic updates.
It should allow data integration, metadata management, and data
lineage tracking to make it easier for data engineers and analysts to
work with.
10. Cost-Effectiveness
Big Data systems should be designed to manage large datasets while
keeping costs manageable. Often, organizations leverage cloud
infrastructure for elastic scalability and on-demand resource
allocation to reduce costs.
Open-source technologies like Hadoop and Apache Spark are used to
minimize software licensing costs.
11. Interoperability
A Big Data system should be able to integrate with other systems,
platforms, and applications. It should allow easy interaction with third-
party tools, APIs, and data exchange formats.
This ensures seamless data transfer between the Big Data system and
other enterprise systems like ERP, CRM, and BI tools.
Unit -2
Introduction to Hadoop
Hadoop is an open-source framework designed to store and process large
amounts of data in a distributed computing environment. Developed by
Apache Software Foundation, Hadoop allows users to process and
analyze massive datasets that cannot be handled by traditional data-
processing systems due to their size or complexity. It is built to scale from a
single server to thousands of machines, providing flexibility and fault
tolerance.
Hadoop is based on a distributed computing model, which breaks down
data into smaller chunks and distributes them across a cluster of
commodity hardware, allowing parallel processing. It is widely used for big
data analytics, data warehousing, and data storage tasks, and is
particularly effective when handling unstructured data (e.g., text, images,
videos, etc.).
Core Hadoop Components
Hadoop has several core components, each responsible for a specific
function in the processing of big data. The primary components are:
1. Hadoop Distributed File System (HDFS)
HDFS is the storage layer of Hadoop, designed to store large volumes
of data in a distributed manner across multiple machines in a cluster. It
provides high fault tolerance, scalability, and efficiency in storing
vast amounts of data.
Key Features:
o Block-level storage: Files are split into fixed-size blocks
(usually 128 MB or 256 MB) and distributed across different
nodes in the cluster.
o Replication: Each data block is replicated across multiple
machines (default is 3 copies) to ensure reliability and data
recovery in case of failure.
o Fault tolerance: If a node fails, data can be retrieved from other
replicas stored on different nodes.
o High throughput: It is optimized for throughput rather than low
latency, making it ideal for batch processing of large datasets.
2. MapReduce
MapReduce is a programming model and processing framework used
for processing large data sets in parallel across a distributed cluster. It
breaks down tasks into smaller sub-tasks and processes them
concurrently across multiple nodes.
Key Functions:
o Map phase: The input data is divided into smaller chunks, which
are processed by individual mapper tasks. These mappers output
key-value pairs.
o Reduce phase: The output of the mappers is shuffled and
sorted based on keys, and the reducers process the data to
generate the final output.
o It allows for scalable and parallel processing of data across large
clusters, providing a distributed computation model.
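A common way to run the Map and Reduce phases described above from Python is Hadoop Streaming, where the mapper and reducer are plain scripts that read stdin and write tab-separated key-value pairs to stdout. The word-count sketch below is illustrative; on a cluster it would be submitted with the hadoop-streaming jar, but it can also be tested locally with shell pipes.

```python
#!/usr/bin/env python3
"""Word count written in the MapReduce style (Hadoop Streaming conventions)."""
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    """Reduce phase: input arrives sorted by key; sum the counts per word."""
    pairs = (line.strip().split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Local test:  cat input.txt | python wc.py map | sort | python wc.py reduce
    mapper(sys.stdin) if sys.argv[1:] == ["map"] else reducer(sys.stdin)
```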
3. YARN (Yet Another Resource Negotiator)
YARN is the resource management layer in Hadoop. It is
responsible for managing and allocating resources in the cluster to
ensure that the applications have enough resources to run.
Key Features:
o Resource management: It decides how to distribute
computational resources (CPU, memory) among the running
applications.
o Job scheduling: YARN manages job execution and schedules the
tasks across the cluster.
o Fault tolerance: If a task or container fails, YARN reschedules it
(often on another node) so the application can continue running.
o Multi-tenancy: It supports multiple applications running
simultaneously in the same cluster by providing resource
isolation.
4. Hadoop Common (Hadoop Core Libraries)
Hadoop Common consists of the set of shared libraries and
utilities required by other Hadoop components. It provides the
necessary tools for Hadoop to run across a distributed system.
Key Features:
o Java libraries: These libraries provide essential functionality for
working with Hadoop components (like HDFS, MapReduce, etc.).
o Configuration files: The configuration files used by all the
Hadoop components for setting up properties like file paths,
directories, memory allocations, etc.
o Distributed computing support: It provides support for
different cluster nodes to communicate, interact, and execute
tasks effectively.
Hadoop Ecosystem
The Hadoop Ecosystem consists of various tools and frameworks that
enhance the capabilities of Hadoop, making it more versatile and scalable for
different big data processing tasks. These tools and frameworks address the
need for distributed storage, data processing, data management, real-time
analytics, and more.
Core Hadoop Components:
HDFS (Hadoop Distributed File System): Distributed storage layer.
MapReduce: Distributed data processing model.
YARN (Yet Another Resource Negotiator): Resource management
layer.
Hadoop Common: Common utilities and libraries used by all Hadoop
modules.
Additional Components in the Hadoop Ecosystem:
1. Apache Hive:
o A data warehousing and SQL-like query engine built on top of
Hadoop.
o It provides an SQL-based interface (HiveQL) to interact with
the Hadoop distributed storage (HDFS) for querying and
managing data.
o Ideal for data analysts who are familiar with SQL but need to
process massive amounts of data.
o Use cases: Querying large datasets, summarizing data, and
creating reports.
2. Apache HBase:
o A NoSQL database built on top of HDFS designed for real-time
random read/write access to large datasets.
o Suitable for applications requiring fast access to large amounts of
structured data.
3. Apache Pig:
o A high-level data flow scripting language used to process
large datasets in Hadoop.
o It is designed to simplify the complexity of writing raw
MapReduce code through Pig Latin, a simple data-flow scripting
language.
4. Apache Spark:
o A real-time, in-memory data processing engine that provides
a faster and more flexible alternative to MapReduce.
o It supports batch processing, streaming analytics, machine
learning, and graph processing.
o Spark can be used for interactive querying and real-time
processing.
5. Apache Kafka:
o A distributed messaging system used for real-time data
streaming.
o Kafka allows the collection, storage, and real-time processing of
data streams, and is widely used for integrating different
components of the Hadoop ecosystem.
6. Apache Zookeeper:
o A coordination service that ensures synchronization and
management of distributed systems.
o Zookeeper is used by several Hadoop-related components (e.g.,
HBase, Kafka) to maintain distributed locks, configuration
management, and leader election.
7. Apache Flume:
o A service for collecting, aggregating, and moving large
amounts of log data to HDFS or other destinations in a Hadoop
cluster.
o Commonly used for streaming log data from multiple sources
into HDFS.
8. Apache Sqoop:
o A tool used for importing/exporting data between Hadoop and
relational databases like MySQL, Oracle, etc.
o It is used for bulk transfer of structured data to and from HDFS
and RDBMS.
9. Apache Oozie:
o A workflow scheduler system to manage Hadoop jobs.
o It allows the scheduling and coordination of complex data
processing workflows and jobs in Hadoop (like MapReduce jobs,
Hive jobs, etc.).
10. Apache Mahout:
A machine learning library for creating scalable machine learning
algorithms.
It leverages Hadoop's MapReduce framework to scale machine learning
models across large datasets.
Hive Overview
Apache Hive is a data warehousing system built on top of Hadoop that
enables users to perform SQL-like queries on large datasets stored in
HDFS. Hive abstracts the complexity of writing MapReduce code by allowing
users to query data using HiveQL, which is similar to SQL.
Key Features of Hive:
1. SQL-Like Query Language: HiveQL allows users to run SQL-style
queries on data stored in HDFS.
2. Scalable Data Warehousing: Hive is designed to scale and handle
large datasets, often used for data summarization, queries, and
reporting.
3. Extensibility: Hive supports user-defined functions (UDFs) to extend
its capabilities.
4. Integration with other Hadoop Ecosystem Tools: Hive can
integrate with tools like HBase, Apache Pig, and Apache Spark for a
more complete big data solution.
5. Partitioning and Bucketing: It supports partitioning and bucketing of
data, improving query performance and organizing large datasets.
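As a concrete illustration of feature 1 (and of partitioning from feature 5), the sketch below submits HiveQL through the PyHive client. The HiveServer2 host and port, database, table, and column names are assumptions for illustration only.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Assumed HiveServer2 endpoint; in practice this points at your cluster.
conn = hive.Connection(host="localhost", port=10000, database="sales")
cursor = conn.cursor()

# HiveQL looks like SQL, but under the hood Hive compiles it into
# MapReduce/Tez/Spark jobs that run over files stored in HDFS.
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    WHERE order_date = '2024-01-01'     -- a typical partition column
    GROUP BY region
""")

for region, total in cursor.fetchall():
    print(region, total)
```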
Hive Physical Architecture
The physical architecture of Hive defines how the system works
internally, focusing on how it processes, stores, and retrieves data from
HDFS. Hive interacts with HDFS for storage and leverages MapReduce for
processing queries.
Key Components of Hive Physical Architecture:
1. Hive Client:
o The interface through which users submit HiveQL queries.
o Users can interact with Hive using the command-line interface
(CLI), Web UI, or through JDBC/ODBC connections.
2. Hive Metastore:
o The central repository that stores metadata for Hive tables
(schema information, partition details, etc.).
o It maintains the structure of the data, but not the actual data,
which is stored in HDFS.
o The metastore is essential for schema management and helps in
providing structured access to data stored in HDFS.
3. Hive Execution Engine:
o The component responsible for executing HiveQL queries.
o Hive uses MapReduce as the default execution engine for query
processing, but it also supports other engines such as Apache
Tez or Apache Spark for faster processing.
o The execution engine converts HiveQL queries into a sequence of
MapReduce jobs and executes them on Hadoop.
4. Hive Driver:
o The interface between the Hive client and the execution
engine.
o It handles the parsing of the HiveQL queries and interacts with
the execution engine.
o It also manages session states, such as user configurations and
session variables.
5. Hive Query Compiler:
o The compiler parses each HiveQL query and generates an
abstract syntax tree (AST).
o It converts the SQL-like syntax into an execution plan of
MapReduce (or Tez/Spark) jobs that can be run on the Hadoop cluster.
6. Hive Optimizer:
o The optimizer works on the generated query plan and applies
optimization techniques, such as predicate pushdown,
column pruning, and join reordering, to improve query
performance.
7. HDFS (Hadoop Distributed File System):
o Hive queries are executed on data stored in HDFS. The actual
data files are managed and stored on HDFS, which provides
scalability and fault tolerance.
8. Execution Framework:
o As mentioned, the execution framework can be MapReduce,
Apache Spark, or Apache Tez (a faster, optimized engine). This
framework executes the distributed jobs based on the parsed
and compiled Hive queries.
Hive Architecture Flow:
1. User Interaction:
o A user submits a HiveQL query using the Hive CLI, Web UI, or
programmatically through JDBC/ODBC connections.
2. Driver:
o The driver receives the query and forwards it to the query
compiler.
3. Compiler:
o The compiler parses the query, checks syntax, and generates an
abstract syntax tree (AST) to determine the execution plan.
4. Optimizer:
o The optimizer enhances the execution plan by applying various
performance optimization techniques.
5. Execution:
o The execution engine translates the optimized plan into
MapReduce jobs (or other execution engines like Spark or Tez),
which are then executed on the Hadoop cluster.
6. Result:
o After the query is executed, the result is returned to the user,
either via the command-line interface or the chosen client.
Hadoop Limitations
Despite its immense capabilities in handling large-scale data, Hadoop also
has certain limitations that make it unsuitable for all use cases. Below are
some of the key limitations of Hadoop:
1. Complexity:
o Hadoop has a steep learning curve, especially for users
unfamiliar with distributed systems or MapReduce programming.
o Setting up and managing Hadoop clusters require skilled
personnel and is complex for newcomers.
o The integration of multiple tools in the Hadoop ecosystem also
requires expertise.
2. Real-Time Processing:
o Hadoop is primarily designed for batch processing, meaning it
is not well-suited for real-time data processing.
o While tools like Apache Spark and Apache Storm can provide
real-time capabilities, Hadoop itself is not optimized for low-
latency processing.
3. I/O Intensive:
o Hadoop is heavily reliant on disk I/O, which can lead to
bottlenecks in data processing.
o Since Hadoop’s processing model is based on MapReduce, the
intermediate data between tasks is written to the disk, resulting
in high disk I/O operations and relatively slower processing
speeds compared to in-memory processing.
4. Lack of ACID Transactions:
o Hadoop does not natively support ACID (Atomicity,
Consistency, Isolation, Durability) transactions, which are
critical for certain types of applications that require data
integrity.
o While there are workarounds, such as using HBase or integrating
with Apache Phoenix to provide transactional capabilities, they
do not provide full ACID compliance in the way traditional
databases do.
5. Limited Support for Advanced Analytics:
o Hadoop itself is primarily a distributed storage and
processing system; it does not inherently offer advanced
analytics capabilities (like machine learning, AI models, or
complex queries).
o This limitation can be addressed with additional tools, such as
Apache Mahout for machine learning or Apache Spark for in-
memory processing, but these tools add complexity to the
ecosystem.
6. Security Issues:
o Although Hadoop provides basic security features, such as
Kerberos authentication, authorization, and data
encryption, it often requires additional third-party security tools
to ensure a robust security model.
o Data privacy and access control can be challenging to
implement properly within a Hadoop ecosystem without
additional configurations.
7. Data Quality and Consistency:
o Since Hadoop is designed for unstructured and semi-structured
data, ensuring data consistency and quality can be challenging.
o Managing data formats, schemas, and data quality issues
becomes more complex, especially in large datasets.
8. Cost of Implementation:
o While Hadoop is often touted as being cheaper than traditional
databases for storing large amounts of data, the cost of
infrastructure, maintenance, and management of Hadoop
clusters can add up, especially as the scale of the deployment
increases.
o Cloud services like Amazon EMR help reduce infrastructure
costs but can still be expensive in the long run.
RDBMS (Relational Database Management System) vs Hadoop
The comparison between RDBMS and Hadoop is a key consideration for
organizations looking to process large-scale data. Both systems have their
strengths and weaknesses depending on the type of workload. Below is a
detailed comparison between RDBMS and Hadoop:
Comparison by Aspect (RDBMS vs. Hadoop):

1. Data Structure
RDBMS: Structured data (tables with a predefined schema).
Hadoop: Unstructured and semi-structured data (e.g., text, logs, JSON, XML).

2. Scalability
RDBMS: Limited scalability (vertical scaling).
Hadoop: Highly scalable (horizontal scaling with commodity hardware).

3. Data Integrity
RDBMS: Provides ACID (Atomicity, Consistency, Isolation, Durability) properties for data integrity.
Hadoop: Does not natively support ACID properties.

4. Data Processing
RDBMS: Primarily transactional and real-time processing (using SQL).
Hadoop: Primarily batch processing with MapReduce (though real-time processing can be added using additional tools like Apache Spark).

5. Performance
RDBMS: Performs well for small to medium-sized datasets with indexed queries.
Hadoop: Performs well for large datasets, but has higher latency for I/O-bound operations due to disk-based processing.

6. Query Language
RDBMS: Uses SQL for querying.
Hadoop: Uses HiveQL (an SQL-like query language) or custom MapReduce scripts.

7. Data Storage
RDBMS: Uses tables for storing data, often on disk.
Hadoop: Uses HDFS (Hadoop Distributed File System) for distributed storage of large datasets.

8. Complexity
RDBMS: Relatively easy to set up and manage for small-scale systems.
Hadoop: Complex to set up and manage, especially for large-scale clusters.

9. Cost
RDBMS: Typically more expensive due to licensing fees for enterprise-grade solutions (e.g., Oracle, SQL Server).
Hadoop: Can be cheaper for storing vast amounts of data on commodity hardware, but infrastructure and management costs can add up.

10. Concurrency
RDBMS: Supports high concurrency and many simultaneous transactions.
Hadoop: Not optimized for high-concurrency transactional workloads.

11. Use Cases
RDBMS: Suitable for transactional systems, such as banking or inventory systems where data consistency and real-time access are critical.
Hadoop: Best suited for big data applications like data warehousing, data lakes, and real-time analytics on massive datasets.

12. Flexibility
RDBMS: Limited flexibility with data types and schema; schema changes are difficult.
Hadoop: Highly flexible in terms of data types and can handle various formats like JSON, XML, or plain text.

13. Data Consistency
RDBMS: Strong consistency ensured by relational constraints.
Hadoop: Eventual consistency model (in HDFS and MapReduce), which may lead to temporary inconsistency.

14. Fault Tolerance
RDBMS: Data backup is required for fault tolerance; some RDBMS support replication (e.g., MySQL, PostgreSQL).
Hadoop: Built-in fault tolerance using replication in HDFS, where data is replicated across nodes in the cluster.
When to Use RDBMS:
When you need strong data integrity and support for complex
transactions (banking, e-commerce platforms, etc.).
When the data is structured and you need to work with real-time
queries.
When the dataset is small to medium-sized (fits within the limits of a
traditional server or database system).
When to Use Hadoop:
When dealing with big data that cannot be processed or stored
effectively using traditional relational databases.
When data is unstructured or semi-structured (e.g., logs, social media
posts, sensor data).
When you need to process large datasets using distributed storage
and parallel processing (e.g., data warehousing, data lakes, and big
data analytics).
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the primary storage
system of Hadoop. It is designed to store large volumes of data across a
distributed cluster of machines, ensuring reliability and scalability. HDFS is
optimized for high throughput and fault tolerance rather than low-
latency access to small files.
Key Characteristics of HDFS:
1. Distributed Storage:
o HDFS divides large files into fixed-size blocks (typically 128 MB
or 256 MB), which are distributed across the nodes in a Hadoop
cluster.
o Each block is replicated multiple times (default is 3 replicas)
across different nodes to ensure fault tolerance.
2. Fault Tolerance:
o In the event of node failure, HDFS can recover by accessing
replicas of the data stored on other nodes. This replication
ensures data availability even when hardware fails.
o The NameNode in HDFS tracks the locations of each block in the
cluster, and the DataNodes store the actual blocks of data.
3. Block-based Architecture:
o Files are split into blocks, and each block is stored across
different machines in the cluster. This allows parallel processing
and ensures efficient data access.
o Blocks are typically large, which minimizes the overhead caused
by seeking between blocks and improves read/write throughput.
4. Master-Slave Architecture:
o NameNode (Master): The NameNode manages the filesystem
namespace, keeps track of the metadata of all files (e.g., file
names, block locations), and coordinates the storage of data.
o DataNode (Slave): The DataNodes store the actual data blocks.
These nodes handle read and write requests from clients and
send periodic heartbeat signals to the NameNode to indicate
they are alive.
5. High Throughput:
o HDFS is designed for high throughput access to data, making it
well-suited for batch processing. It is optimized for reading and
writing large files, and not for frequent random reads/writes to
small files.
6. Write-once, Read-many Model:
o HDFS follows a write-once, read-many model, which means
that once a file is written, it cannot be modified. This simplifies
data consistency models and is ideal for applications that append
data or need to process data sequentially.
o If modifications are needed, the data must be rewritten as new
files, and the old versions can be discarded.
7. Scalability:
o HDFS can scale by adding more nodes to the cluster. As the data
volume grows, more DataNodes can be added to store the data,
ensuring that the system remains efficient.
8. Data Locality:
o HDFS tries to move the computation to the data (instead of
moving data to the computation), minimizing network congestion
and optimizing performance.
o This is accomplished by running the MapReduce jobs on the
nodes where the data resides, reducing the amount of data
transfer across the network.
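A few basic HDFS shell commands illustrate this storage model; the file
names and paths below are only examples, not part of any specific
deployment:
# Copy a local file into HDFS (it is split into blocks and replicated automatically)
hdfs dfs -put sales_2024.csv /user/data/
# List files and read one back
hdfs dfs -ls /user/data
hdfs dfs -cat /user/data/sales_2024.csv
# Inspect how a file is stored: its blocks, replicas, and their locations
hdfs fsck /user/data/sales_2024.csv -files -blocks -locations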
Processing Data with Hadoop (MapReduce)
MapReduce is the computational model that Hadoop uses to process large
datasets in a distributed manner across a Hadoop cluster. It breaks down the
processing of data into two main stages: Map and Reduce.
1. Map Phase:
In the Map phase, the input data (typically stored in HDFS) is processed in
parallel across the nodes of the cluster.
The data is split into chunks (blocks) and distributed across
DataNodes.
Each Map task processes a block of data and outputs a set of key-
value pairs.
o Input: Each record in the input is processed by the Mapper
function.
o Output: The output of the Map function is a set of intermediate
key-value pairs.
Example: In a word count program, the input data might be a text file, and
the Mapper reads the text, breaking it down into words (key-value pairs like:
"word": 1).
2. Shuffle and Sort:
After the Map phase, Hadoop automatically performs a Shuffle and Sort
step, which groups the intermediate key-value pairs by key and sorts them.
Shuffle: Groups all the values associated with the same key together
across all nodes in the cluster.
Sort: Sorts the intermediate key-value pairs so that the Reducer can
process them efficiently.
3. Reduce Phase:
In the Reduce phase, the Reducer processes each group of key-value
pairs that have been shuffled and sorted.
The Reducer receives the key and a list of values associated with that
key.
It then processes the values and returns a final key-value pair.
In the example of word count, the Reducer would aggregate the word
counts for each word, summing up the counts.
Example: The output could be something like: ("word", 5) indicating that the
word appeared 5 times in the text.
4. Output:
The final output from the Reduce phase is written back to HDFS as a set of
files.
MapReduce in Hadoop: Key Components
InputFormat: Defines how the input data is split and read. It
determines how the data is divided into manageable chunks for the
Map phase.
Mapper: The function that processes input data, applies a
transformation, and emits key-value pairs.
Partitioner: Determines how the intermediate key-value pairs are
distributed across Reducers.
Combiner: An optional optimization that can perform a local reduce
operation on the output of a Mapper before sending it to the Reducer,
reducing the amount of data transferred between Map and Reduce.
Reducer: The function that aggregates the key-value pairs produced
by the Mappers and generates the final output.
Hadoop Data Processing Workflow:
1. Data Splitting: The input data is split into smaller chunks (blocks) by
HDFS.
2. Map Task Execution: The Map tasks are distributed across nodes in
the Hadoop cluster, each task processing its chunk of the data and
emitting key-value pairs.
3. Shuffle and Sort: The intermediate data produced by Mappers is
shuffled and sorted by the system to group the values for each key.
4. Reduce Task Execution: Reducers process the grouped key-value
pairs, applying the required computations (e.g., summing the counts).
5. Writing Output: The results of the Reduce phase are written back to
HDFS, where they can be accessed for further analysis.
Benefits of Hadoop for Data Processing:
1. Scalability: Hadoop can process massive datasets by distributing the
data and computation across many nodes in a cluster.
2. Fault Tolerance: HDFS ensures that data is replicated across nodes,
so even if a node fails, data can still be accessed from another replica.
3. Cost-Effectiveness: Hadoop can run on commodity hardware, making
it much more cost-effective compared to traditional data storage and
processing systems.
4. Flexibility: It can process a wide variety of data types (structured,
semi-structured, unstructured) and formats (e.g., JSON, XML, text,
images).
Limitations of MapReduce:
1. Latency: MapReduce can be slower due to its reliance on writing
intermediate data to disk between the Map and Reduce phases.
2. Difficulty in Real-Time Processing: Hadoop is optimized for batch
processing and is not suitable for low-latency, real-time data
processing without additional frameworks like Apache Spark.
3. Complexity: Developing efficient MapReduce jobs can be complex,
particularly for developers unfamiliar with the programming model.
Managing Resources and Applications with Hadoop YARN
YARN (Yet Another Resource Negotiator) is the resource management
layer of Hadoop. It is a critical component that was introduced in Hadoop
2.x to improve the resource management and job scheduling in a Hadoop
cluster. YARN decouples the resource management and job scheduling
functionalities, allowing the system to handle multiple applications
concurrently in a more efficient manner.
Key Components of YARN:
1. ResourceManager (RM):
o The ResourceManager is the master daemon responsible for
managing resources across the cluster.
o It has two main components:
Scheduler: The Scheduler is responsible for allocating
resources to applications based on user-defined policies
(e.g., fairness, capacity).
ApplicationsManager: The ApplicationsManager accepts
application submissions and manages their lifecycle, ensuring
they are started, executed, and monitored correctly.
2. NodeManager (NM):
o NodeManager is responsible for managing resources on a single
node within the Hadoop cluster.
o It monitors resource usage (memory, CPU) on each node and
reports back to the ResourceManager. It also launches and
manages containers (the execution environment for tasks).
3. ApplicationMaster (AM):
o The ApplicationMaster is responsible for managing the
lifecycle of a specific application. It negotiates resources from the
ResourceManager, coordinates with NodeManagers to execute
tasks, and monitors the application's progress.
o There is one ApplicationMaster per application running in the
cluster.
4. Containers:
o Containers are the execution environments allocated by
NodeManagers on nodes. A container can hold one or more tasks
and has a defined amount of resources (CPU, memory, etc.)
based on the application's requirements.
o The ApplicationMaster requests containers from the
ResourceManager and the NodeManager launches these
containers to run tasks.
How YARN Works:
1. Job Submission:
o When a user submits a job to the cluster, the
ResourceManager first allocates resources for the job by
determining which nodes have available resources.
o The ApplicationMaster for the job is then launched in one of
the containers. The ApplicationMaster is responsible for
managing the job's execution.
2. Resource Allocation:
o The ResourceManager uses the Scheduler to allocate
resources based on policies like capacity, fairness, and priority.
o NodeManagers communicate with the ResourceManager,
reporting available resources on their respective nodes, enabling
the ResourceManager to make informed decisions.
3. Task Execution:
o Once resources are allocated, the ApplicationMaster requests
containers from the NodeManagers. The containers hold the
tasks that are executed.
o NodeManagers monitor the execution of tasks within
containers, ensuring that resource usage is within specified limits
and reporting status back to the ResourceManager.
4. Job Completion:
o The ApplicationMaster monitors the progress of the tasks and,
once all tasks are completed, it signals the job's completion. The
ResourceManager cleans up the resources and updates the job
status.
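For reference, YARN ships with a command-line client that exposes this
lifecycle; the commands below are a quick sketch, and the application ID is
only a placeholder:
# List applications currently tracked by the ResourceManager
yarn application -list
# Show the nodes (NodeManagers) in the cluster and their resource usage
yarn node -list
# Check the status and logs of a specific application (placeholder ID)
yarn application -status application_1700000000000_0001
yarn logs -applicationId application_1700000000000_0001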
Advantages of YARN:
Multi-Tenancy: YARN enables multiple applications (MapReduce,
Spark, Tez, etc.) to run concurrently in the same Hadoop cluster,
allowing for improved resource utilization.
Resource Isolation: YARN allows for resource isolation, ensuring that
different applications do not interfere with each other and each gets
the necessary resources.
Scalability: YARN can handle a much larger scale of cluster and
applications than the original Hadoop 1.x version, enabling it to
efficiently manage resources in big clusters.
Flexibility: YARN supports different types of workloads, including
batch processing, real-time processing, and interactive queries.
MapReduce Programming in Hadoop
MapReduce is a programming model used for processing large data sets
with a distributed algorithm on a Hadoop cluster. It is the core computational
model in the Hadoop ecosystem. MapReduce allows developers to write
distributed applications that process data in parallel on a large number of
nodes.
MapReduce Workflow:
MapReduce programs consist of two primary phases:
1. Map Phase:
o The Map function takes an input key-value pair and produces a
set of intermediate key-value pairs.
o This phase involves splitting the input data into chunks (called
splits), which are then processed in parallel by different Mapper
tasks.
o Each Mapper reads a split, processes the data, and emits
intermediate key-value pairs (e.g., "word" -> 1 for a word count
program).
2. Reduce Phase:
o After the Map phase, the intermediate data is shuffled and sorted
based on the key.
o The Reduce function takes each key and a list of associated
values, processes them, and outputs a final result (e.g., sum of
counts for each word in a word count program).
MapReduce Programming Model:
1. Mapper Function:
o Input: A chunk of data (record) from the input file, represented
as a key-value pair.
o Output: Intermediate key-value pairs.
o Example: In a word count program, the Mapper reads lines of
text, splits them into words, and outputs key-value pairs like
("word": 1).
2. Reducer Function:
o Input: Key-value pairs generated by the Map phase (grouped by
key).
o Output: The final result after reducing the intermediate data. For
a word count example, it would sum the counts of each word.
o Example: In the word count program, the Reducer sums the
counts for each word and outputs a final result, like ("word", 5).
MapReduce Example:
Consider a word count program where the task is to count how often each
word appears in a large text file.
1. Map Phase:
o Input: A text file, with lines like "apple orange banana apple".
o Mapper emits: ("apple", 1), ("orange", 1), ("banana", 1), ("apple",
1).
2. Shuffle and Sort:
o The system groups all the pairs by their keys:
Key: "apple", Values: [1, 1]
Key: "orange", Values: [1]
Key: "banana", Values: [1]
3. Reduce Phase:
o The Reducer processes each key and aggregates the values:
Key: "apple", Values: [1, 1], Output: "apple", 2
Key: "orange", Values: [1], Output: "orange", 1
Key: "banana", Values: [1], Output: "banana", 1
4. Final Output:
o The final output is written to the HDFS:
"apple", 2
"orange", 1
"banana", 1
Writing a MapReduce Program:
To write a MapReduce program, you typically implement three key methods:
1. Mapper:
o Processes input data and outputs intermediate key-value pairs.
2. Reducer:
o Aggregates the intermediate key-value pairs and outputs the
final result.
3. Driver:
o Configures the job, sets up input/output paths, and runs the
MapReduce job.
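A minimal word count sketch using the standard Hadoop MapReduce Java API
(org.apache.hadoop.mapreduce) shows how these three pieces fit together.
Class names and the input/output paths passed on the command line are
illustrative, and error handling is kept to a minimum:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits ("word", 1) for every word in a line of the input
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts received for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures the job, sets input/output paths, and submits it
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input/books
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output/wordcount
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}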
Unit 3
🐝 1. Introduction to Hive
📌 What is Apache Hive?
Apache Hive is a data warehouse infrastructure built on top of Hadoop
for providing data summarization, query, and analysis. Hive allows users
to read, write, and manage large datasets residing in distributed storage
using SQL-like syntax called HiveQL (HQL).
🎯 Why Hive?
Writing MapReduce manually is complex — Hive simplifies this.
Ideal for data analysts and non-programmers to query large-scale
datasets.
Converts HQL into MapReduce, Tez, or Spark jobs behind the scenes.
Supports structured data stored in formats like Text, ORC, Parquet,
etc.
2. Hive Architecture
Apache Hive's architecture consists of the following components:
🔷 1. User Interface (UI)
Provides various ways for users to interact:
CLI: Command Line Interface.
Web UI: Browsers like Hue.
ODBC/JDBC Drivers: For connecting external tools (e.g., Tableau, Java
apps).
🔷 2. Driver
It manages the lifecycle of a HiveQL query. It acts like the controller:
Parser: Validates syntax of the query.
Planner: Creates an execution plan.
Optimizer: Optimizes the plan for better performance.
Executor: Executes the plan.
🔷 3. Compiler
Converts the HiveQL query into DAG (Directed Acyclic Graph) of
tasks.
Uses MapReduce, Tez, or Spark as execution engines.
🔷 4. Metastore
Stores metadata: table names, column types, partitions, etc.
Typically uses RDBMS like MySQL/PostgreSQL.
Essential for query planning and optimization.
🔷 5. Execution Engine
Works with Hadoop/YARN to execute jobs.
Translates the logical plan into physical plan.
🔷 6. HDFS (Hadoop Distributed File System)
The storage layer where data is actually stored.
Hive only reads/writes; does not manage storage directly.
🧮 3. Hive Data Types
Hive data types are categorized into Primitive and Complex types.
➤ Primitive Data Types:
TINYINT: 1-byte signed integer
SMALLINT: 2-byte signed integer
INT: 4-byte signed integer
BIGINT: 8-byte signed integer
BOOLEAN: true/false value
FLOAT: 4-byte floating point
DOUBLE: 8-byte floating point
DECIMAL: arbitrary-precision numbers
STRING: variable-length string
VARCHAR: string with a specified maximum length
CHAR: fixed-length string
DATE: date without a time component
TIMESTAMP: date and time
➤ Complex Data Types:
ARRAY<T>: ordered collection of elements
MAP<K,V>: key-value pairs
STRUCT: collection of named fields grouped under one record
UNIONTYPE: supports multiple types in a single field
Example:
CREATE TABLE student_info (
name STRING,
marks ARRAY<INT>,
address STRUCT<city:STRING, state:STRING>,
metadata MAP<STRING, STRING>
);
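Such a table could then be queried as sketched below (the map key 'grade' is
just an assumed example): array elements are accessed by index, struct
fields with dot notation, and map values with bracket notation.
SELECT name,
       marks[0] AS first_mark,
       address.city AS city,
       metadata['grade'] AS grade
FROM student_info;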
💻 4. Hive Query Language (HQL)
🔸 HiveQL is SQL-like, with some differences:
Case-insensitive
Schema-on-read (parses data at query time)
🛠 DDL (Data Definition Language)
Used to define and manage schema.
CREATE TABLE employees (
  id INT,
  name STRING,
  department STRING,  -- included so the GROUP BY query below works
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
DROP TABLE employees;
DESCRIBE employees;
🧱 DML (Data Manipulation Language)
Used to load or insert data into tables.
LOAD DATA LOCAL INPATH '/user/data/employees.csv' INTO TABLE
employees;
INSERT INTO TABLE employees VALUES (1, 'Alice', 'HR', 7000.0);
🔍 SELECT Queries
SELECT name, salary FROM employees WHERE salary > 5000;
SELECT department, COUNT(*) FROM employees GROUP BY department;
SELECT * FROM employees ORDER BY salary DESC LIMIT 5;
🗂 Partitioning & Bucketing
Partitioning:
Divides table data based on a column's value.
CREATE TABLE logs (
  id INT,
  log_message STRING
)
PARTITIONED BY (log_date STRING);
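Data is then loaded into a specific partition; the file path and date below
are illustrative:
LOAD DATA LOCAL INPATH '/user/data/logs_2024_01_01.csv'
INTO TABLE logs
PARTITION (log_date = '2024-01-01');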
Bucketing:
Splits data into fixed number of files (buckets).
CREATE TABLE user_data (
  user_id INT,
  name STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;
🧰 Joins in Hive:
SELECT a.name, b.salary
FROM dept a
JOIN emp b
ON a.id = b.dept_id;
✅ Summary
Hive: SQL-like querying for Hadoop; built for scalability
Architecture: UI → Driver → Compiler → Execution Engine ↔ Metastore & HDFS
Data Types: Primitive (INT, STRING, ...) and Complex (ARRAY, MAP, STRUCT)
HiveQL: Similar to SQL; supports DDL, DML, SELECT, JOIN, partitioning, and
bucketing
🐷 1. Introduction to Pig
📌 What is Apache Pig?
Apache Pig is a high-level platform for creating MapReduce programs used
with Hadoop. It uses a scripting language called Pig Latin, which
abstracts the complexity of writing raw MapReduce code.
🎯 Key Goals of Pig:
Simplify the development of big data transformation tasks.
Allow developers to process large datasets without writing low-level
Java MapReduce code.
✅ Features of Pig:
Pig Latin: A high-level, procedural scripting language.
Automatically converts Pig Latin scripts into MapReduce jobs.
Works with structured, semi-structured, and unstructured data.
Can process data stored in HDFS, HBase, or Hive.
Supports UDFs (User Defined Functions) in Java, Python, etc.
🧬 2. Anatomy of Pig
Here’s what a typical Pig environment looks like and how it functions:
🔷 Components of Pig:
Pig Latin Scripts: code written by the user to process data
Pig Compiler: converts scripts into logical and physical plans
Execution Engine: runs the actual tasks (MapReduce/Spark/Tez)
HDFS: stores the input and output data
🔁 Execution Flow (Anatomy)
1. Write Pig Script (in Pig Latin)
data = LOAD '/user/data/employees.csv' USING PigStorage(',') AS (id:int,
name:chararray, salary:float);
highEarners = FILTER data BY salary > 5000;
DUMP highEarners;
2. Parse & Semantic Check
o Syntax checking and type resolution.
3. Logical Plan Generation
o Operator-based logical structure is built (LOAD → FILTER →
DUMP).
4. Optimization
o Apply rule-based optimizations (e.g., push filters early).
5. Physical Plan Generation
o Convert into a plan of physical operations.
6. MapReduce Jobs
o Converted into one or more MR jobs and executed on Hadoop.
📘 Modes of Execution:
Local Mode: runs on the local file system without Hadoop.
MapReduce Mode: runs on HDFS using a Hadoop cluster (default).
🧩 3. Pig on Hadoop
Apache Pig is deeply integrated with Hadoop, and it executes scripts as
MapReduce jobs.
🧱 Integration Points:
HDFS: Pig reads input from and writes output to HDFS.
YARN: Pig scripts are executed using MapReduce jobs scheduled by
Hadoop YARN.
MapReduce Engine: Pig converts each step of its script into a
MapReduce job.
Example Pig Latin Script on Hadoop:
-- Load data from HDFS
logs = LOAD '/data/server_logs.txt' USING PigStorage('\t') AS (ip:chararray,
url:chararray);
-- Filter entries
filtered = FILTER logs BY url MATCHES '.*.jpg';
-- Count accesses
grouped = GROUP filtered BY ip;
counts = FOREACH grouped GENERATE group AS ip, COUNT(filtered) AS
access_count;
-- Store output in HDFS
STORE counts INTO '/output/image_access_counts' USING PigStorage(',');
🔄 Pig vs Hive vs MapReduce
Language: Pig uses Pig Latin (procedural); Hive uses HiveQL (declarative);
MapReduce uses Java (low-level).
Use Case: Pig is for data transformation; Hive is for data analysis;
MapReduce is for custom processing.
Ease of Use: Pig is medium; Hive is easy; MapReduce is complex.
Execution Engine: Pig runs on MapReduce/Tez/Spark; Hive runs on MapReduce;
MapReduce runs natively.
✅ Summary
Apache Pig: high-level platform for writing MapReduce programs using Pig
Latin
Anatomy of Pig: Scripting → Parsing → Logical Plan → Optimization → Physical
Plan → MR Jobs
Pig on Hadoop: Pig runs on top of Hadoop, using HDFS for storage and
MapReduce for execution
✅ Use Case for Pig (In Detail)
🔍 What kind of problems does Pig solve?
Pig is best for analyzing and transforming large datasets. It's especially
useful for:
ETL jobs (Extract, Transform, Load)
Data preparation for Machine Learning
Log file processing
Ad-hoc data analysis
🔧 Real-world Use Cases:
📊 1. Log Analysis (e.g., Server Logs)
Problem: A company has terabytes of log data from web servers and wants
to find how many requests came from each country.
Solution with Pig:
Load logs into Pig from HDFS.
Extract the IP field and map it to locations.
Group by country and count.
🛒 2. Retail Analytics
Problem: An e-commerce platform wants to analyze the average spend per
customer.
Pig Tasks:
Load transaction data.
Group by customer ID.
Calculate total and average spend.
📈 3. Preprocessing for ML
Problem: Before feeding data into a Machine Learning algorithm, it needs to
be cleaned, normalized, and filtered.
Pig Tasks:
Remove nulls/duplicates.
Normalize values.
Generate features from raw data.
🧪 4. Data Sampling
For data scientists who need only a sample of data for testing or
visualization.
🔄 ETL Processing in Pig (In Detail)
📌 What is ETL?
Extract: Get data from sources like HDFS, Hive, relational databases.
Transform: Apply filters, joins, calculations, or clean data.
Load: Store it back to HDFS, Hive, or another system.
🔃 Pig in ETL
Pig makes ETL processes simpler through Pig Latin – a script-based
language that supports:
Filtering
Sorting
Joining
Grouping
Aggregation
🔧 Example ETL Pipeline:
Input (CSV file in HDFS):
1,John,Sales,5600
2,Alice,HR,4300
3,Bob,Sales,7000
🔤 Pig Script:
-- Extract
data = LOAD '/user/hr/employees.csv' USING PigStorage(',')
AS (id:int, name:chararray, dept:chararray, salary:float);
-- Transform
filtered = FILTER data BY salary > 5000;
grouped = GROUP filtered BY dept;
avg_sal = FOREACH grouped GENERATE group AS department,
AVG(filtered.salary) AS average_salary;
-- Load
STORE avg_sal INTO '/user/output/high_earners' USING PigStorage(',');
✅ Output (in HDFS):
Sales,6300.0
🔢 Data Types in Pig (In Detail)
Pig supports both primitive types and complex types.
🔹 Primitive (Scalar) Data Types:
int: 32-bit signed integer (e.g., 100)
long: 64-bit integer (e.g., 10000000000)
float: 32-bit floating point (e.g., 12.5f)
double: 64-bit floating point (e.g., 13.456)
chararray: string of characters (e.g., "John")
bytearray: raw byte data (uninterpreted binary data)
boolean: boolean value (e.g., true)
🔸 Complex Data Types:
1. Tuple
A tuple is a collection of fields (like a row).
(id, name, salary)
(1, 'Alice', 5000)
2. Bag
A bag is a collection of tuples (like a table). Bags can have duplicates.
{(1, 'Alice'), (2, 'Bob'), (1, 'Alice')}
3. Map
A map is a key-value pair.
[name#'John', age#30, dept#'HR']
🧪 Data Type Example in Schema:
student = LOAD 'students.txt' USING PigStorage(',')
AS (roll:int, name:chararray, scores:bag{t:tuple(subject:chararray,
marks:int)});
This defines:
A roll number (int)
A name (string)
A bag of subject-mark pairs
🏃 Running Pig (In Detail)
Modes of Execution:
Local Mode: for testing on a local machine
MapReduce Mode: for real execution on a Hadoop cluster
🔹 Starting Pig in Local Mode:
pig -x local
🔹 Starting Pig in Hadoop Mode:
pig
🧾 Running Pig Scripts
1. Interactive Mode (Grunt Shell)
Open shell:
pig
Commands:
grunt> data = LOAD 'file.txt' AS (name:chararray);
grunt> DUMP data;
2. Batch Mode (Script File)
Write script:
-- File: etl_script.pig
data = LOAD '/data/input.csv' USING PigStorage(',') AS (id:int,
name:chararray);
filtered = FILTER data BY id > 5;
DUMP filtered;
Run script:
pig etl_script.pig
3. Embedded Mode (Java)
You can also run Pig from Java using PigServer.
🔚 Summary Table
Use Case: data transformation, analysis, and preparation
ETL: extract from HDFS, transform via Pig, load results back
Data Types: primitive (int, float, chararray), complex (tuple, bag, map)
Running Pig: CLI (Grunt), script file, local/Hadoop modes
✅ 1. Execution Model of Pig
Apache Pig follows a step-by-step dataflow model using a scripting
language called Pig Latin, which is translated into MapReduce jobs under
the hood.
🔸 Pig Execution Flow:
1. Pig Latin Script
o You write data transformation logic in Pig Latin.
2. Parser
o The script is parsed to check syntax and semantics.
o A logical plan is generated.
3. Optimizer
o The logical plan is optimized (e.g., combining filters, removing
redundant steps).
o Converts into a physical plan.
4. Compiler
o Converts the physical plan into a sequence of MapReduce jobs.
5. Execution
o Jobs are submitted to the Hadoop cluster (or run locally if in local
mode).
o Results are collected and returned.
Execution Modes:
Local: runs on a single JVM; good for testing.
MapReduce: default mode; runs on Hadoop.
🔄 2. Operators in Pig
Pig provides relational operators similar to SQL but more flexible.
🔹 Core Pig Operators:
LOAD: loads data from HDFS or the local file system. Example: LOAD 'file.csv'
STORE: saves output. Example: STORE result INTO 'output/'
DUMP: prints data to the console. Example: DUMP A
FILTER: filters rows. Example: FILTER A BY salary > 5000
FOREACH: iterates over each row. Example: FOREACH A GENERATE name, salary
GROUP: groups rows by key. Example: GROUP A BY dept
JOIN: joins two datasets. Example: JOIN A BY id, B BY id
ORDER: orders data. Example: ORDER A BY salary DESC
DISTINCT: removes duplicates. Example: DISTINCT A
LIMIT: limits the number of records. Example: LIMIT A 10
UNION: combines datasets. Example: UNION A, B
SPLIT: splits a dataset. Example: SPLIT A INTO X IF cond, Y IF cond2
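A short sketch combining several of these operators; the file paths, field
names, and schemas are assumed purely for illustration:
emp = LOAD '/data/emp.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:float);
depts = LOAD '/data/dept.csv' USING PigStorage(',') AS (dname:chararray, location:chararray);
high = FILTER emp BY salary > 5000;
joined = JOIN high BY dept, depts BY dname;
grouped = GROUP joined BY high::dept;
summary = FOREACH grouped GENERATE group AS dept, COUNT(joined) AS employees;
ordered = ORDER summary BY employees DESC;
top5 = LIMIT ordered 5;
DUMP top5;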
3. Functions in Pig
Pig offers a wide range of built-in functions and allows custom UDFs
(User Defined Functions).
🔹 Built-in Functions:
📊 Aggregate Functions:
COUNT(): counts rows. Example: FOREACH G GENERATE COUNT(A)
SUM(): total value. Example: SUM(A.salary)
AVG(): average value. Example: AVG(A.salary)
MIN(): minimum value. Example: MIN(A.salary)
MAX(): maximum value. Example: MAX(A.salary)
🔤 String Functions:
CONCAT(): combines strings
STRSPLIT(): splits a string
UPPER(): converts to uppercase
LOWER(): converts to lowercase
🔢 Math Functions:
ABS(): absolute value
ROUND(): rounds a value
SQRT(): square root
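A brief sketch of these built-in functions in use; the relation A and its
fields dept, name, and salary are assumed for illustration:
grouped = GROUP A BY dept;
stats = FOREACH grouped GENERATE group AS dept, COUNT(A) AS num_rows, AVG(A.salary) AS avg_salary, ROUND(MAX(A.salary)) AS max_salary;
tagged = FOREACH A GENERATE CONCAT(UPPER(name), '_EMP') AS tag;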
🔸 UDFs (User Defined Functions)
You can write custom functions in Java, Python, or other languages.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
public class UpperCase extends EvalFunc<String> {
    // exec() is called once per input tuple; returns the upper-cased first field
    public String exec(Tuple input) throws IOException {
        if (input == null || input.get(0) == null) return null;
        return input.get(0).toString().toUpperCase();
    }
}
Register in Pig:
REGISTER 'myfuncs.jar';
DEFINE ToUpper com.example.UpperCase();
🔢 4. Data Types in Pig (Deep Dive)
🟢 Primitive Data Types:
int: 32-bit integer (e.g., 25)
long: 64-bit integer (e.g., 10000000000)
float: 32-bit float (e.g., 12.5f)
double: 64-bit float (e.g., 3.14159)
chararray: string (e.g., 'hello')
bytearray: binary data
boolean: true/false (e.g., true)
🔵 Complex Data Types:
Tuple: ordered collection of fields. Example: (1, 'John')
Bag: collection of tuples. Example: {(1,'A'),(2,'B')}
Map: key-value pairs. Example: [name#'John', age#25]
🔸 Example of Nested Data:
students = LOAD 'marks.txt' USING PigStorage(',')
AS (id:int, name:chararray, scores:bag{t:tuple(subject:chararray, mark:int)});
Here:
scores is a bag of tuples containing subject and mark.
📘 Summary Table:
Execution Model: converts Pig Latin → logical plan → MapReduce jobs
Operators: LOAD, FILTER, JOIN, GROUP, DUMP, FOREACH, ORDER, etc.
Functions: built-in (COUNT, AVG, CONCAT) and custom UDFs
Data Types: primitive (int, chararray), complex (tuple, bag, map)
UNIT 4
🔰 Introduction to NoSQL
✅ What is NoSQL?
NoSQL stands for "Not Only SQL". It refers to a category of non-relational
databases designed to handle:
Large volumes of unstructured or semi-structured data,
High scalability and performance,
Flexible schemas (no fixed table structure),
Real-time or near-real-time data processing.
💡 Key Features:
Schema-less: no predefined schema; flexible data models
Scalability: easily scales out using distributed clusters
High Availability: built to handle failures gracefully
Fast Performance: optimized for read/write throughput
🧠 Why Use NoSQL?
Traditional RDBMS may struggle with:
Huge datasets (big data),
High-speed streaming data (e.g., logs, IoT),
Complex hierarchical data (like JSON/XML),
Cloud-based scalability.
Hence, NoSQL fits use cases where relational schemas limit
performance or flexibility.
💼 NoSQL Business Drivers
🔍 Why Businesses Adopt NoSQL:
Big Data Growth: data from social media, IoT, logs, and sensors is
exploding; NoSQL handles large-scale data better.
Agility and Speed: developers need faster development cycles; NoSQL allows
dynamic data models.
Cloud-Native Architecture: NoSQL is often designed to run on distributed
cloud environments.
Real-Time Analytics: applications like fraud detection or recommendation
systems need real-time responses.
Global Scalability: NoSQL systems (like Cassandra) support geo-distributed
databases.
🧾 Business Examples:
Facebook, Twitter – Store and retrieve user-generated content.
Netflix – Uses Cassandra for global scalability.
Amazon – Uses DynamoDB to handle millions of transactions per
second.
NoSQL Data Architectural Patterns
NoSQL offers multiple data models and architectural patterns for different
needs:
🔸 1. Key-Value Store
🧱 Structure: Key → Value (like a hashmap)
⚡ Use Case: Session storage, caching
🧰 Examples: Redis, Riak, DynamoDB
{ "user123": { "name": "John", "age": 25 } }
🔹 2. Document Store
📦 Structure: Documents in JSON, BSON, or XML
📖 Use Case: Content management systems, catalogs
🧰 Examples: MongoDB, CouchDB, Firebase
{
  "id": "123",
  "name": "Alice",
  "address": {
    "city": "NY",
    "zip": "10001"
  }
}
🔺 3. Column Family Store
📊 Structure: Columns grouped in families (like wide tables)
💼 Use Case: Analytics, event logs, data warehousing
🧰 Examples: Apache Cassandra, HBase
RowKey: 101
Name: Alice
Subject: Math
Marks: 95
🔘 4. Graph Store
🌐 Structure: Nodes (entities) and Edges (relationships)
🔄 Use Case: Social networks, fraud detection, recommendation engines
🧰 Examples: Neo4j, ArangoDB
(Alice) --[FRIEND]--> (Bob)
🧩 Summary of Patterns:
Key-Value: best for caching and session data. Examples: Redis, DynamoDB
Document: best for CMS and product catalogs. Examples: MongoDB, CouchDB
Columnar: best for logs and analytics. Examples: Cassandra, HBase
Graph: best for relationship data. Examples: Neo4j, Amazon Neptune
📌 Architectural Patterns in NoSQL Systems:
Sharding: horizontal partitioning; splits data across nodes
Replication: copies data to multiple servers for fault tolerance
CAP Theorem: you can only guarantee 2 of 3: Consistency, Availability,
Partition Tolerance
Eventually Consistent: common in distributed NoSQL; data becomes consistent
over time
🔄 Variations of NoSQL Architectural Patterns
NoSQL databases support various architectural variations to optimize
performance, scalability, and fault tolerance when managing Big Data.
✅ 1. Shared Nothing Architecture
Every node is independent.
No single point of failure.
Best for horizontal scaling.
Used by: Cassandra, MongoDB, DynamoDB
✅ 2. Sharding (Horizontal Partitioning)
Data is split across multiple shards (nodes) using a shard key.
Enables parallel processing of large datasets.
Example:
o Shard 1: users with ID 1–1000
o Shard 2: users with ID 1001–2000
✅ 3. Replication
Copies of data are maintained across multiple servers.
Provides high availability and fault tolerance.
Replication factor determines how many copies exist.
✅ 4. MapReduce Pattern
Batch processing of large datasets.
Data is divided into chunks and processed in parallel.
Common in document stores and columnar databases.
✅ 5. Eventual Consistency
In distributed systems, updates propagate gradually.
Prioritizes availability and partition tolerance over immediate
consistency.
Used by systems like Cassandra, DynamoDB.
🧠 Use of NoSQL to Manage Big Data
NoSQL databases are optimized for handling:
Volume – Can handle petabytes of data.
Velocity – Supports fast insert/read operations.
Variety – Handles structured, semi-structured, and unstructured data.
📦 Example Use Cases:
Real-time analytics: column-oriented pattern. Example database: Apache
Cassandra
Product catalogs: document store. Example database: MongoDB
Social networking: graph store. Example database: Neo4j
IoT & sensors: key-value store. Example database: Redis
Fraud detection: graph + document. Example databases: ArangoDB, MongoDB
🍃 Introduction to MongoDB
📌 What is MongoDB?
MongoDB is a document-oriented NoSQL database.
Stores data in JSON-like documents (BSON format).
Highly flexible, scalable, and widely used in web and big data apps.
🧱 MongoDB Architecture
Document: basic unit of data (like a row)
Collection: group of documents (like a table)
Database: container for collections
Replica Set: group of MongoDB servers holding copies of the data for
redundancy
Sharding: splitting data across multiple machines for scaling
MongoDB Document Example:
{
  "_id": "123",
  "name": "John Doe",
  "email": "[email protected]",
  "orders": [
    { "id": 1, "item": "Laptop", "price": 750 },
    { "id": 2, "item": "Mouse", "price": 25 }
  ]
}
Nested structures are allowed
No fixed schema – fields can vary between documents
Key Features:
Schema-less: no need to define the structure in advance
Query Language: MongoDB Query Language (MQL), which is JSON-based
Indexing: fast search on any field
Aggregation Framework: like SQL GROUP BY, but more powerful
High Availability: supports replication and automatic failover
Horizontal Scalability: built-in sharding support
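A minimal sketch of MQL in the MongoDB shell; the collection name
(customers) and its fields are assumed for illustration:
// Insert a document
db.customers.insertOne({ name: "Alice", city: "NY", orders: [ { item: "Laptop", price: 750 } ] });
// Find customers with at least one order over 100; return only name and orders
db.customers.find({ "orders.price": { $gt: 100 } }, { name: 1, orders: 1 });
// Aggregation: total spend per city (roughly the equivalent of SQL GROUP BY)
db.customers.aggregate([
  { $unwind: "$orders" },
  { $group: { _id: "$city", totalSpend: { $sum: "$orders.price" } } }
]);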
🚀 Use Cases of MongoDB:
Real-time analytics
Product catalogs
CMS (Content Management Systems)
IoT platforms
Social apps
UNIT 5
Mining Social Network Graphs
📌 What is Social Network Mining?
Social network mining is the process of extracting patterns,
relationships, and useful information from social network data. It
involves using graph theory, data mining, and machine learning to
analyze social connections.
🔰 Introduction to Social Network Mining
Social networks represent people or entities as nodes and their
relationships as edges in a graph.
Examples:
Facebook: Users are nodes, friendships are edges.
Twitter: Users are nodes, "follows" are directed edges.
LinkedIn: Professional connections.
🔍 Goals of Social Network Mining:
Community Detection: finding groups of closely connected nodes
Influencer Identification: finding nodes with high influence (centrality)
Recommendation Systems: suggesting friends, content, or products
Anomaly Detection: spotting unusual patterns like spam bots or fraud
Information Propagation: studying how content or ideas spread
📱 Applications of Social Network Mining
1. Marketing & Advertisement
Identifying influencers to promote products.
Viral marketing strategies.
2. Recommendation Engines
Friend recommendations, product suggestions (like Amazon or Netflix).
3. Fraud & Spam Detection
Detecting fake accounts or abnormal patterns in communication.
4. Epidemic Modeling
Studying how diseases or information spread across people.
5. Security & Surveillance
Monitoring suspicious social interactions in criminal networks.
6. Political & Sentiment Analysis
Understanding public opinion or political campaigns on social
platforms.
📊 Social Networks as a Graph
Social networks can be modeled using graph data structures:
🧱 Graph Components:
Nodes (Vertices): represent people, accounts, or entities
Edges (Links): represent relationships (friendship, follows, etc.)
Directed Edge: A → B means A follows B (e.g., Twitter)
Undirected Edge: A – B means a mutual relationship (e.g., Facebook friendship)
Weighted Edge: represents the strength of a connection (number of messages,
likes, etc.)
🧠 Example Graph Types:
1. Undirected Graph – Mutual relationships
2. Directed Graph (Digraph) – One-way relationships
📏 Common Graph Metrics:
Degree Centrality: number of connections a node has
Betweenness Centrality: influence of a node over information flow
Closeness Centrality: how quickly a node can reach all others
Density: how interconnected the network is
Clustering Coefficient: how likely nodes are to form triangles (tight groups)
🧮 Tools & Libraries:
NetworkX (Python) – Graph manipulation and analysis
Gephi – Visualization of social graphs
Neo4j – Graph database used for social network modeling
GraphX (Apache Spark) – Scalable graph processing
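Using NetworkX, the metrics above can be computed in a few lines; the small
friendship graph here is made up purely for illustration:
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"),
    ("Carol", "Dave"), ("Dave", "Eve"),
])

print(nx.degree_centrality(G))       # number of connections (normalized)
print(nx.betweenness_centrality(G))  # influence over shortest paths
print(nx.closeness_centrality(G))    # how quickly a node reaches others
print(nx.density(G))                 # overall interconnectedness
print(nx.clustering(G))              # tendency to form triangles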
Types of Social Networks
Social networks can be categorized based on the nature of relationships,
purpose, and structure. Here are the main types:
1. Personal or Ego-Centric Networks
Focus on a single individual (ego) and their direct connections.
Example: Facebook profile with friends.
2. Collaboration Networks
Formed by individuals working together.
Example: Co-authorship networks in academic research (authors as
nodes, shared papers as edges).
3. Communication Networks
Nodes are individuals; edges represent communication (calls, emails).
Example: Email exchange networks within an organization.
4. Information Networks
Nodes are pieces of content, and edges represent citation or reference.
Example: Citation network in research papers.
5. Online Social Networks (OSNs)
Platforms like Facebook, Instagram, Twitter where nodes are users and
edges represent various interactions (likes, comments, follows).
🧩 Clustering of Social Graphs
Clustering in social graphs is the process of grouping users (nodes) who
are more connected to each other than to the rest of the graph.
🔍 Why Cluster?
Identify communities, interest groups, or social circles
Useful for:
o Targeted marketing
o Recommendation systems
o Influencer identification
📈 Common Clustering Techniques:
K-means (after graph embedding): clusters nodes based on feature vectors
Hierarchical Clustering: builds a tree of clusters
Spectral Clustering: uses eigenvalues of the graph Laplacian matrix
Label Propagation: spreads labels through the network to form clusters
Girvan–Newman Algorithm: removes edges with high betweenness to find
communities
🧠 Direct Discovery of Communities in a Social Graph
Community detection is the process of identifying dense subgraphs or
clusters of nodes that are more connected within than outside.
📌 Popular Algorithms:
1. Modularity-Based Detection (Louvain Algorithm)
o Measures how well a graph is partitioned into communities.
o High modularity = strong community structure.
2. Clique Percolation
o Communities are overlapping groups formed by k-cliques.
3. Edge Betweenness (Girvan–Newman)
o Removes “bridge” edges (high betweenness) to split the
network.
4. Label Propagation Algorithm (LPA)
o Labels are propagated iteratively and nodes adopt the most
frequent label among neighbors.
Visual Example:
[Community A] — [Bridge Nodes] — [Community B]
Each community is tightly connected within, but has few connections
outside.
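A small sketch of two of these algorithms using NetworkX; the bundled karate
club graph is used here only as a toy social network:
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()  # a classic small social network that ships with NetworkX

# Label Propagation Algorithm (LPA)
lpa_communities = list(community.label_propagation_communities(G))
print("LPA found", len(lpa_communities), "communities")

# Girvan-Newman: the first split removes the highest-betweenness ("bridge") edges
first_split = next(community.girvan_newman(G))
print("Girvan-Newman first split sizes:", [len(c) for c in first_split])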
🎯 Introduction to Recommender Systems
Recommender systems are software tools and techniques that provide
suggestions for items (movies, products, people, etc.) that are likely to be
of interest to a user.
📦 Types of Recommender Systems:
Content-Based Filtering: recommends items similar to those the user liked in
the past. Example: "You watched Inception → Suggest Interstellar"
Collaborative Filtering: recommends items liked by similar users. Example:
"People who liked this also liked..."
Hybrid Systems: combines content-based and collaborative filtering.
Examples: Netflix, Amazon
Social Recommenders: uses data from social networks (friends' likes).
Example: Spotify's "Your friend liked this playlist"
🔧 Algorithms Used:
Cosine Similarity
Matrix Factorization (SVD, ALS)
Deep Learning (Neural Collaborative Filtering)
Graph-Based Recommendations (Personalized PageRank)
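As a minimal sketch of one of these ideas, the snippet below does user-based
collaborative filtering with cosine similarity on a tiny, made-up rating
matrix (NumPy only; a 0 means the item is unrated):
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 1, 0],   # user 1
    [1, 0, 5, 4],   # user 2
])

def cosine(u, v):
    # cosine similarity between two rating vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# How similar are users 1 and 2 to user 0?
sims = [cosine(ratings[0], ratings[i]) for i in range(1, len(ratings))]
print(sims)  # user 1 is far more similar to user 0 than user 2 is

# Recommend the unrated item of user 0 that the most similar user rated highest
most_similar = 1 + int(np.argmax(sims))
unrated = np.where(ratings[0] == 0)[0]
best = unrated[np.argmax(ratings[most_similar][unrated])]
print("Recommend item", int(best), "to user 0")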
🧠 Real-Life Examples:
Amazon: products based on purchase/view history
Netflix: movies based on viewing and ratings
LinkedIn: "People you may know"
YouTube: videos based on watch history and subscriptions