BIG DATA ASSIGNMENT NOTES
ASSIGNMENT 3
1. How Does MapReduce Work in Hadoop?
MapReduce is a programming model used in Hadoop for processing large
data sets in a distributed manner.
How It Works:
🟩 Step 1: Input Splitting
Large files are split into chunks (blocks).
Each chunk is assigned to a Map task.
🟩 Step 2: Mapping
Each Mapper processes a data block and produces key-value pairs.
Example: Processing logs → (IP address, 1)
🟩 Step 3: Shuffling and Sorting
Hadoop groups all values by key across all Mappers.
Intermediate data is sorted and sent to Reducers.
🟩 Step 4: Reducing
Reducers process each group of key-value pairs to produce final
output.
Example: Summing counts per IP → (IP address, total visits)
🟩 Step 5: Output
Final output is written to HDFS.
Example Use Case: Word count, log analysis, clickstream processing.
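To make the phases concrete, here is a minimal sketch of the Map and Reduce steps in plain Python (in the spirit of Hadoop Streaming). It follows the IP-counting example above; the sample log lines are illustrative, and a real job would read its splits from HDFS rather than an in-memory list.

```python
from collections import defaultdict

# --- Map step: emit (IP address, 1) for each log line ---
def mapper(lines):
    for line in lines:
        fields = line.split()
        if fields:                 # skip blank lines
            yield fields[0], 1     # assume the first field is the IP address

# --- Shuffle/sort + Reduce step: group by key and sum the counts per IP ---
def reducer(pairs):
    totals = defaultdict(int)
    for ip, count in pairs:
        totals[ip] += count
    return dict(totals)

if __name__ == "__main__":
    sample_logs = [
        "10.0.0.1 GET /index.html",
        "10.0.0.2 GET /about.html",
        "10.0.0.1 POST /login",
    ]
    print(reducer(mapper(sample_logs)))  # {'10.0.0.1': 2, '10.0.0.2': 1}
```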
2. Difference Between HDFS and Traditional File Systems
| Feature | HDFS (Hadoop Distributed File System) | Traditional File System (e.g., NTFS, ext4) |
| --- | --- | --- |
| Architecture | Distributed across multiple nodes | Centralized or single-machine |
| Fault Tolerance | Built-in data replication | Manual or external backup required |
| Scalability | Scales horizontally (add more nodes) | Limited to hardware |
| Data Size Handling | Optimized for large files (GBs to TBs) | Not ideal for massive files |
| Write Support | Write-once, read-many | Supports frequent read-write operations |
| Data Locality | Computation moves to data | Data moves to computation |
| Block Size | Large (default 128 MB or 256 MB) | Smaller (4 KB – 64 KB typically) |
3. How Does Spark Compare to Hadoop for Big Data Processing?
| Feature | Apache Spark | Hadoop (MapReduce) |
| --- | --- | --- |
| Processing | In-memory | Disk-based |
| Speed | Up to 100x faster for some workloads | Slower due to frequent disk I/O |
| Ease of Use | High-level APIs (Python, Scala, Java, R) | Requires Java-based MapReduce code |
| Real-time Support | Yes (Spark Streaming) | No (batch only) |
| Machine Learning | Built-in MLlib | Limited support (needs external tools) |
| Fault Tolerance | DAG lineage and RDDs | Through task re-execution and replication |
| Data Processing Modes | Batch, Streaming, Interactive, Graph | Only batch processing |
Summary:
Use Hadoop MapReduce for batch jobs on extremely large datasets.
Use Apache Spark for faster, in-memory, interactive or real-time
data processing.
UNIT 4
1. What is NoSQL, and How is it Used in Big Data Storage?
✅ Definition:
NoSQL (Not Only SQL) databases are non-relational databases designed
to handle large volumes of unstructured, semi-structured, or
structured data with high performance and scalability.
✅ Types of NoSQL Databases:
| Type | Description | Examples |
| --- | --- | --- |
| Document-based | Stores data as JSON-like documents | MongoDB, CouchDB |
| Key-Value | Stores key-value pairs for fast lookups | Redis, Amazon DynamoDB |
| Column-based | Stores data in columns instead of rows | Apache Cassandra, HBase |
| Graph-based | Optimized for relationships/networks | Neo4j, Amazon Neptune |
✅ Use in Big Data:
Handles high volume, velocity, and variety of data.
Scales horizontally across distributed clusters.
Useful in real-time analytics, IoT, recommendation systems, and social
media platforms.
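As a small illustration of the document-based model, here is a minimal sketch using pymongo. The connection URL, database, and collection names are hypothetical, and it assumes a MongoDB server is running locally.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (connection URL is illustrative)
client = MongoClient("mongodb://localhost:27017/")
db = client["iot_platform"]          # hypothetical database
readings = db["sensor_readings"]     # hypothetical collection

# Documents are schemaless JSON-like dicts; fields can vary per record
readings.insert_one({"sensor_id": "s-101", "temp_c": 27.4, "tags": ["lab", "rack-3"]})
readings.insert_one({"sensor_id": "s-102", "humidity": 0.61})

# Query by field, much like a key or document lookup
for doc in readings.find({"sensor_id": "s-101"}):
    print(doc)
```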
2. How Do You Handle Data Quality Issues in Big Data Sets?
Big data often contains noise, duplication, or missing values. Here's how you
can manage quality issues:
✅ Steps to Handle Data Quality:
| Issue Type | Handling Techniques |
| --- | --- |
| Missing Data | Imputation (mean/median), data interpolation, deletion |
| Duplicate Records | Use hashing or unique IDs to remove duplicates |
| Inconsistent Formats | Standardize units (e.g., date formats, case normalization) |
| Outliers/Noise | Use statistical or ML techniques to detect and handle |
| Incorrect Data | Cross-validation with reference datasets or rules |
✅ Tools Commonly Used:
Apache Spark, Talend, Trifacta, OpenRefine, Pandas (in Python)
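A minimal Pandas sketch of the techniques in the table above, using a made-up dataset: de-duplication, median imputation for missing values, and standardization of inconsistent formats.

```python
import pandas as pd

# Toy dataset with typical quality problems (values are illustrative)
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "country": ["IN", "in", "in", None],
    "spend":   [120.0, None, None, 80.0],
    "signup":  ["2024-01-05", "Jan 5, 2024", "Jan 5, 2024", "2024-02-10"],
})

df = df.drop_duplicates()                               # duplicate records
df["spend"] = df["spend"].fillna(df["spend"].median())  # missing data: median imputation
df["country"] = df["country"].str.upper()               # inconsistent case
df["signup"] = pd.to_datetime(df["signup"], format="mixed")  # mixed date formats (pandas >= 2.0)

print(df)
```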
3. Techniques for Data Preprocessing in Big Data
Data preprocessing prepares raw data for analytics or machine learning
models.
✅ Common Techniques:
| Technique | Purpose |
| --- | --- |
| Data Cleaning | Fix/remove incorrect, incomplete, or inconsistent data |
| Data Transformation | Normalize, scale, encode data for algorithms |
| Data Integration | Combine data from multiple sources (ETL processes) |
| Data Reduction | Dimensionality reduction (e.g., PCA), sampling, aggregation |
| Data Discretization | Convert continuous data into categories or intervals |
| Tokenization & Parsing | For text data: splitting sentences into words |
| Streaming Preprocessing | Real-time data transformation using tools like Kafka, Spark |
✅ Big Data Tools for Preprocessing:
Apache Spark (with PySpark or Scala)
Apache NiFi
Hadoop MapReduce
ETL pipelines (Airflow, Talend)
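Here is a minimal PySpark sketch of a few of these techniques (cleaning, categorical encoding, feature scaling) on a made-up DataFrame; the column names and values are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StandardScaler, StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("PreprocessingSketch").getOrCreate()

# Hypothetical raw events: schema and values are illustrative
df = spark.createDataFrame(
    [("u1", "mobile", 12.0), ("u2", "desktop", None), ("u3", "mobile", 48.5)],
    ["user_id", "device", "session_minutes"],
)

# Data cleaning: drop rows with missing numeric values
clean = df.dropna(subset=["session_minutes"])

# Data transformation: encode the categorical column and scale the numeric one
indexed = StringIndexer(inputCol="device", outputCol="device_idx").fit(clean).transform(clean)
assembled = VectorAssembler(inputCols=["session_minutes"], outputCol="features").transform(indexed)
scaled = StandardScaler(inputCol="features", outputCol="features_scaled").fit(assembled).transform(assembled)

scaled.show(truncate=False)
```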
UNIT 5
✅ 1. How Do You Implement Data Governance in a Big Data
Environment?
Data governance ensures that data is accurate, secure, consistent, and
used responsibly.
📌 Key Components of Data Governance in Big Data:
| Component | Description |
| --- | --- |
| Data Catalog | Centralized metadata store (e.g., Apache Atlas, Alation) |
| Data Lineage | Tracks data flow from source to destination (e.g., OpenLineage, Talend) |
| Access Control | Role-Based Access Control (RBAC), policies for who can access what |
| Data Quality Rules | Define valid values, types, ranges, null handling |
| Data Stewardship | Assign responsible roles for maintaining data integrity |
| Policy Management | Compliance with GDPR, HIPAA, etc. |
🔧 Tools for Data Governance:
Apache Atlas (metadata management)
Apache Ranger (fine-grained access control)
Collibra, Informatica, AWS Glue Data Catalog
✅ 2. Common Big Data Security Threats & Mitigation Strategies
⚠️ Common Security Threats:
| Threat | Description | Mitigation Strategies |
| --- | --- | --- |
| Data Breaches | Unauthorized access to sensitive data | Encryption (at-rest/in-transit), access controls |
| Unauthorized Access | Lack of strict access policies | Use Kerberos, LDAP, or OAuth authentication |
| Data Leakage in Pipelines | Leakage during processing or transfers | Secure APIs, TLS/SSL, audit trails |
| Malicious Code Injection | Attacks via open-source or shared scripts | Code scanning, sandboxing jobs |
| Lack of Audit Trails | No monitoring of data usage | Use logging systems like Apache Ranger, audit tools |
🔐 Key Security Techniques:
Kerberos: Secure authentication in Hadoop/Spark
Apache Ranger: Role-based policies and audit logs
Tokenization & Encryption: Protects PII data
Network Layer Security: VPN, firewalls, VPCs
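To illustrate field-level protection of PII, here is a minimal sketch using Fernet symmetric encryption from the Python cryptography package. The record fields are made up, and in practice the key would come from a secrets manager or KMS rather than being generated inside the job.

```python
from cryptography.fernet import Fernet

# In practice the key comes from a secrets manager / KMS, not from code
key = Fernet.generate_key()
cipher = Fernet(key)

record = {"user_id": "u-1001", "email": "alice@example.com"}

# Encrypt the PII field before it lands in the data lake
record["email"] = cipher.encrypt(record["email"].encode()).decode()
print("stored:", record)

# Authorized consumers holding the key can decrypt
plain_email = cipher.decrypt(record["email"].encode()).decode()
print("decrypted:", plain_email)
```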
✅ 3. How Do You Scale Big Data Processing for Real-Time Analytics?
Real-time analytics requires fast ingestion, low-latency processing, and
scalable architecture.
⚙️ Architecture for Real-Time Analytics:
```
[Data Sources]
      ↓
[Ingestion Layer]     — Kafka / Flume / Kinesis
      ↓
[Processing Layer]    — Apache Spark Streaming / Flink / Storm
      ↓
[Storage Layer]       — Cassandra / HBase / Elasticsearch
      ↓
[Visualization Layer] — Grafana / Kibana / Tableau
```
🧠 Key Techniques:
| Technique | Purpose |
| --- | --- |
| Stream Processing | Real-time data computation (Spark Streaming, Flink) |
| Micro-Batching | Efficient processing in small time windows |
| Autoscaling Infrastructure | Dynamic resource allocation in the cloud (K8s, EMR) |
| Event-Driven Architecture | Process events instantly via Kafka or Pulsar |
| In-Memory Computing | Fast processing using RAM (Spark, Ignite) |
🛠 Example Tools:
Kafka + Spark Structured Streaming for low-latency pipelines
AWS Kinesis + Lambda for serverless real-time processing
Apache Flink for advanced stream processing with stateful operators
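Below is a minimal sketch of the Kafka + Spark Structured Streaming pattern mentioned above: it counts events per key from a Kafka topic and prints a running aggregation in micro-batches. The broker address and topic name are illustrative, and the spark-sql-kafka connector package must be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ClickstreamCounts").getOrCreate()

# Read a live stream from Kafka (broker address and topic are illustrative)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Kafka keys/values arrive as bytes; cast to string and count events per key
counts = (
    events.selectExpr("CAST(key AS STRING) AS page")
    .groupBy("page")
    .count()
)

# Write the running aggregation to the console (micro-batch processing)
query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```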
🚀 What is Apache Spark?
Apache Spark is an open-source, distributed computing framework
designed for fast processing of large-scale data. It supports batch
processing, streaming, machine learning, and SQL-based analytics —
all in one platform.
| Feature | Description |
| --- | --- |
| In-Memory Computing | Keeps intermediate data in memory for faster processing than Hadoop MapReduce |
| Unified Engine | Supports SQL, MLlib (Machine Learning), GraphX (Graph Processing), and Spark Streaming |
| Language Support | APIs available in Python, Scala, Java, and R |
| Distributed Computing | Splits tasks across a cluster for parallel execution |
| Fault Tolerant | Automatically handles failures using RDD lineage |
🔄 Spark vs Hadoop (MapReduce)
| Feature | Apache Spark | Hadoop MapReduce |
| --- | --- | --- |
| Speed | Faster (in-memory) | Slower (disk-based) |
| Ease of Use | Rich APIs in Python/Scala/Java | Low-level Java APIs |
| Data Processing | Batch + Streaming | Batch only |
| Machine Learning | Built-in MLlib | Needs integration with external tools |
🔥 Core Components of Apache Spark
1. Spark Core – The execution engine (RDDs, memory mgmt, fault
tolerance)
2. Spark SQL – Query structured data using SQL or DataFrames
3. Spark Streaming – Real-time data processing from sources like Kafka
4. MLlib – Machine learning library (classification, clustering, etc.)
5. GraphX – Graph processing (e.g., PageRank, graph traversal)
💡 Example: PySpark Code for Word Count
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Load text file
rdd = spark.sparkContext.textFile("sample.txt")

# Word count logic
counts = (
    rdd.flatMap(lambda line: line.split())  # split each line into words
    .map(lambda word: (word, 1))            # emit (word, 1) pairs
    .reduceByKey(lambda a, b: a + b)        # sum the counts per word
)

print(counts.collect())
```
✅ Common Use Cases
Real-time analytics (e.g., fraud detection, log monitoring)
ETL (Extract, Transform, Load) pipelines
Recommendation engines
Social media data analysis
IoT stream processing
🚀 Apache Spark Architecture: Overview
Apache Spark follows a master-slave architecture with the following key
components:
🧱 Core Components of Apache Spark:
| Component | Role |
| --- | --- |
| Driver Program | Controls the application, manages SparkContext, and coordinates tasks |
| Cluster Manager | Allocates resources across Spark applications (e.g., YARN, Mesos, Kubernetes) |
| Executors | Run tasks and return results to the driver |
| Tasks | Individual units of work sent to executors |
🔧 Detailed Components of Apache Spark
1. Spark Core (Foundation of everything)
Manages memory, fault-tolerance, job scheduling.
Provides the RDD (Resilient Distributed Dataset) abstraction for
distributed data.
2. Spark SQL
Allows querying structured and semi-structured data using SQL, DataFrames, and Datasets.
Can read from Hive, Parquet, JSON, JDBC, etc. (a short Spark SQL sketch follows this component list).
3. Spark Streaming
Enables real-time data processing.
Processes live data streams using micro-batching.
4. MLlib (Machine Learning Library)
Built-in library for scalable machine learning tasks:
o Classification, Regression, Clustering, Recommendation
5. GraphX
API for graph processing (e.g., social networks, recommendation
graphs).
Includes graph algorithms like PageRank and connected components.
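Here is the short Spark SQL sketch referenced above: the same data can be queried through the DataFrame API or through plain SQL over a temporary view. The file path and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()

# Load semi-structured data into a DataFrame (path and fields are illustrative)
orders = spark.read.json("orders.json")

# Query it with the DataFrame API...
orders.groupBy("country").count().show()

# ...or with plain SQL over a temporary view
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT country, SUM(amount) AS total_amount
    FROM orders
    GROUP BY country
    ORDER BY total_amount DESC
""").show()
```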
⚙️ How Apache Spark Works (Step-by-Step Execution)
Let's understand with an example.
💼 Suppose: you want to count the words in a large text file using Spark (a short PySpark sketch of this workflow follows the steps below).
🔁 Spark Job Workflow:
1. Driver Program Starts
o It creates a SparkContext (entry point to Spark cluster).
2. Cluster Manager Allocates Resources
o The driver asks for executors on cluster nodes.
3. RDD/DataFrame Created
o Data is loaded into an RDD (e.g., from a text file).
4. Transformations Applied
o Operations like .map(), .filter(), .flatMap() define a DAG (Directed
Acyclic Graph).
5. Actions Trigger Execution
o An action like .collect() or .saveAsTextFile() starts actual
processing.
6. Task Scheduling
o Spark breaks the DAG into stages and tasks.
7. Tasks Sent to Executors
o Executors perform computations in parallel.
8. Results Returned
o Executors return the results to the driver, or write to storage.
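The PySpark sketch below mirrors the word-count workflow above; the numbered comments map to the steps, and the file name is illustrative. Note that transformations are lazy: nothing executes until the action at the end.

```python
from pyspark.sql import SparkSession

# 1-2. Driver starts, creates the SparkContext, and requests executors
spark = SparkSession.builder.appName("WorkflowSketch").getOrCreate()
sc = spark.sparkContext

# 3. An RDD is created from a text file (file name is illustrative)
lines = sc.textFile("sample.txt")

# 4. Transformations only build the DAG; nothing runs yet (lazy evaluation)
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# 5-8. The action triggers DAG -> stages -> tasks on the executors,
#      and the results come back to the driver
print(counts.take(10))

spark.stop()
```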
🖼 Spark Architecture Diagram (Text-Based)
```
+----------------------+
|    Driver Program    |  ← Controls the job
+----------------------+
           |
           v
+----------------------+        +----------------------+
|   Cluster Manager    | <----> |    Executors (n)     |
+----------------------+        +----------------------+
           |                              |
           v                              v
   Distribute tasks              Process data, store cache
```
🧠 Summary
| Component | Responsibility |
| --- | --- |
| Driver | Main controller, builds the job, sends tasks to workers |
| Executor | Workers that run tasks and store data |
| Cluster Manager | Manages resources and task scheduling |
| RDD/DataFrame | Data abstraction used for processing |