BIG DATA ASSIGNMENT NOTES
ASSIGNMENT 3
1. How Does MapReduce Work in Hadoop?
MapReduce is a programming model used in Hadoop for processing large
data sets in a distributed manner.
How It Works:
🟩 Step 1: Input Splitting
Large files are split into chunks (blocks).
Each chunk is assigned to a Map task.
🟩 Step 2: Mapping
Each Mapper processes a data block and produces key-value pairs.
Example: Processing logs → (IP address, 1)
🟩 Step 3: Shuffling and Sorting
Hadoop groups all values by key across all Mappers.
Intermediate data is sorted and sent to Reducers.
🟩 Step 4: Reducing
Reducers process each group of key-value pairs to produce final
output.
Example: Summing counts per IP → (IP address, total visits)
🟩 Step 5: Output
Final output is written to HDFS.
Example Use Case: Word count, log analysis, clickstream processing.
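To make the phases concrete, here is a minimal sketch of the Map and Reduce steps in plain Python (in the spirit of Hadoop Streaming). It follows the IP-counting example above; the sample log lines are illustrative, and a real job would read its splits from HDFS rather than an in-memory list.

```python
from collections import defaultdict

# --- Map step: emit (IP address, 1) for each log line ---
def mapper(lines):
    for line in lines:
        fields = line.split()
        if fields:                 # skip blank lines
            yield fields[0], 1     # assume the first field is the IP address

# --- Shuffle/sort + Reduce step: group by key and sum the counts per IP ---
def reducer(pairs):
    totals = defaultdict(int)
    for ip, count in pairs:
        totals[ip] += count
    return dict(totals)

if __name__ == "__main__":
    sample_logs = [
        "10.0.0.1 GET /index.html",
        "10.0.0.2 GET /about.html",
        "10.0.0.1 POST /login",
    ]
    print(reducer(mapper(sample_logs)))  # {'10.0.0.1': 2, '10.0.0.2': 1}
```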
2. Difference Between HDFS and Traditional File Systems
| Feature | HDFS (Hadoop Distributed File System) | Traditional File System (e.g., NTFS, ext4) |
| --- | --- | --- |
| Architecture | Distributed across multiple nodes | Centralized or single-machine |
| Fault Tolerance | Built-in data replication | Manual or external backup required |
| Scalability | Scales horizontally (add more nodes) | Limited to hardware |
| Data Size Handling | Optimized for large files (GBs to TBs) | Not ideal for massive files |
| Write Support | Write-once, read-many | Supports frequent read-write operations |
| Data Locality | Computation moves to data | Data moves to computation |
| Block Size | Large (default 128 MB or 256 MB) | Smaller (4 KB – 64 KB typically) |
3. How Does Spark Compare to Hadoop for Big Data Processing?
| Feature | Apache Spark | Hadoop (MapReduce) |
| --- | --- | --- |
| Processing | In-memory | Disk-based |
| Speed | Up to 100x faster for some workloads | Slower due to frequent disk I/O |
| Ease of Use | High-level APIs (Python, Scala, Java, R) | Requires Java-based MapReduce code |
| Real-time Support | Yes (Spark Streaming) | No (batch only) |
| Machine Learning | Built-in MLlib | Limited support (needs external tools) |
| Fault Tolerance | DAG lineage and RDDs | Through task re-execution and replication |
| Data Processing Modes | Batch, Streaming, Interactive, Graph | Only batch processing |
Summary:
Use Hadoop MapReduce for batch jobs on extremely large datasets.
Use Apache Spark for faster, in-memory, interactive or real-time
data processing.
UNIT 4
1. What is NoSQL, and How is it Used in Big Data Storage?
✅ Definition:
NoSQL (Not Only SQL) databases are non-relational databases designed
to handle large volumes of unstructured, semi-structured, or
structured data with high performance and scalability.
✅ Types of NoSQL Databases:
| Type | Description | Examples |
| --- | --- | --- |
| Document-based | Stores data as JSON-like documents | MongoDB, CouchDB |
| Key-Value | Stores key-value pairs for fast lookups | Redis, Amazon DynamoDB |
| Column-based | Stores data in columns instead of rows | Apache Cassandra, HBase |
| Graph-based | Optimized for relationships/networks | Neo4j, Amazon Neptune |
✅ Use in Big Data:
Handles high volume, velocity, and variety of data.
Scales horizontally across distributed clusters.
Useful in real-time analytics, IoT, recommendation systems, and social
media platforms.
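As a small illustration of the document-based model, here is a minimal sketch using pymongo. The connection URL, database, and collection names are hypothetical, and it assumes a MongoDB server is running locally.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (connection URL is illustrative)
client = MongoClient("mongodb://localhost:27017/")
db = client["iot_platform"]          # hypothetical database
readings = db["sensor_readings"]     # hypothetical collection

# Documents are schemaless JSON-like dicts; fields can vary per record
readings.insert_one({"sensor_id": "s-101", "temp_c": 27.4, "tags": ["lab", "rack-3"]})
readings.insert_one({"sensor_id": "s-102", "humidity": 0.61})

# Query by field, much like a key or document lookup
for doc in readings.find({"sensor_id": "s-101"}):
    print(doc)
```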
2. How Do You Handle Data Quality Issues in Big Data Sets?
Big data often contains noise, duplication, or missing values. Here's how you
can manage quality issues:
✅ Steps to Handle Data Quality:
| Issue Type | Handling Techniques |
| --- | --- |
| Missing Data | Imputation (mean/median), data interpolation, deletion |
| Duplicate Records | Use hashing or unique IDs to remove duplicates |
| Inconsistent Formats | Standardize units (e.g., date formats, case normalization) |
| Outliers/Noise | Use statistical or ML techniques to detect and handle |
| Incorrect Data | Cross-validation with reference datasets or rules |
✅ Tools Commonly Used:
Apache Spark, Talend, Trifacta, OpenRefine, Pandas (in Python)
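A minimal Pandas sketch of the techniques in the table above, using a made-up dataset: de-duplication, median imputation for missing values, and standardization of inconsistent formats.

```python
import pandas as pd

# Toy dataset with typical quality problems (values are illustrative)
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "country": ["IN", "in", "in", None],
    "spend":   [120.0, None, None, 80.0],
    "signup":  ["2024-01-05", "Jan 5, 2024", "Jan 5, 2024", "2024-02-10"],
})

df = df.drop_duplicates()                               # duplicate records
df["spend"] = df["spend"].fillna(df["spend"].median())  # missing data: median imputation
df["country"] = df["country"].str.upper()               # inconsistent case
df["signup"] = pd.to_datetime(df["signup"], format="mixed")  # mixed date formats (pandas >= 2.0)

print(df)
```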
3. Techniques for Data Preprocessing in Big Data
Data preprocessing prepares raw data for analytics or machine learning
models.
✅ Common Techniques:
| Technique | Purpose |
| --- | --- |
| Data Cleaning | Fix/remove incorrect, incomplete, or inconsistent data |
| Data Transformation | Normalize, scale, encode data for algorithms |
| Data Integration | Combine data from multiple sources (ETL processes) |
| Data Reduction | Dimensionality reduction (e.g., PCA), sampling, aggregation |
| Data Discretization | Convert continuous data into categories or intervals |
| Tokenization & Parsing | For text data: splitting sentences into words |
| Streaming Preprocessing | Real-time data transformation using tools like Kafka, Spark |
✅ Big Data Tools for Preprocessing:
Apache Spark (with PySpark or Scala)
Apache NiFi
Hadoop MapReduce
ETL pipelines (Airflow, Talend)
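Here is a minimal PySpark sketch of a few of these techniques (cleaning, categorical encoding, feature scaling) on a made-up DataFrame; the column names and values are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StandardScaler, StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("PreprocessingSketch").getOrCreate()

# Hypothetical raw events: schema and values are illustrative
df = spark.createDataFrame(
    [("u1", "mobile", 12.0), ("u2", "desktop", None), ("u3", "mobile", 48.5)],
    ["user_id", "device", "session_minutes"],
)

# Data cleaning: drop rows with missing numeric values
clean = df.dropna(subset=["session_minutes"])

# Data transformation: encode the categorical column and scale the numeric one
indexed = StringIndexer(inputCol="device", outputCol="device_idx").fit(clean).transform(clean)
assembled = VectorAssembler(inputCols=["session_minutes"], outputCol="features").transform(indexed)
scaled = StandardScaler(inputCol="features", outputCol="features_scaled").fit(assembled).transform(assembled)

scaled.show(truncate=False)
```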
UNIT 5
✅ 1. How Do You Implement Data Governance in a Big Data
Environment?
Data governance ensures that data is accurate, secure, consistent, and
used responsibly.
📌 Key Components of Data Governance in Big Data:
| Component | Description |
| --- | --- |
| Data Catalog | Centralized metadata store (e.g., Apache Atlas, Alation) |
| Data Lineage | Tracks data flow from source to destination (e.g., OpenLineage, Talend) |
| Access Control | Role-Based Access Control (RBAC), policies for who can access what |
| Data Quality Rules | Define valid values, types, ranges, null handling |
| Data Stewardship | Assign responsible roles for maintaining data integrity |
| Policy Management | Compliance with GDPR, HIPAA, etc. |
🔧 Tools for Data Governance:
Apache Atlas (metadata management)
Apache Ranger (fine-grained access control)
Collibra, Informatica, AWS Glue Data Catalog
✅ 2. Common Big Data Security Threats & Mitigation Strategies
⚠️ Common Security Threats:
| Threat | Description | Mitigation Strategies |
| --- | --- | --- |
| Data Breaches | Unauthorized access to sensitive data | Encryption (at-rest/in-transit), access controls |
| Unauthorized Access | Lack of strict access policies | Use Kerberos, LDAP, or OAuth authentication |
| Data Leakage in Pipelines | Leakage during processing or transfers | Secure APIs, TLS/SSL, audit trails |
| Malicious Code Injection | Attacks via open-source or shared scripts | Code scanning, sandboxing jobs |
| Lack of Audit Trails | No monitoring of data usage | Use logging systems like Apache Ranger, audit tools |
🔐 Key Security Techniques:
Kerberos: Secure authentication in Hadoop/Spark
Apache Ranger: Role-based policies and audit logs
Tokenization & Encryption: Protects PII data
Network Layer Security: VPN, firewalls, VPCs
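To illustrate field-level protection of PII, here is a minimal sketch using Fernet symmetric encryption from the Python cryptography package. The record fields are made up, and in practice the key would come from a secrets manager or KMS rather than being generated inside the job.

```python
from cryptography.fernet import Fernet

# In practice the key comes from a secrets manager / KMS, not from code
key = Fernet.generate_key()
cipher = Fernet(key)

record = {"user_id": "u-1001", "email": "alice@example.com"}

# Encrypt the PII field before it lands in the data lake
record["email"] = cipher.encrypt(record["email"].encode()).decode()
print("stored:", record)

# Authorized consumers holding the key can decrypt
plain_email = cipher.decrypt(record["email"].encode()).decode()
print("decrypted:", plain_email)
```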
✅ 3. How Do You Scale Big Data Processing for Real-Time Analytics?
Real-time analytics requires fast ingestion, low-latency processing, and
scalable architecture.
⚙️ Architecture for Real-Time Analytics:
```
[Data Sources]
      ↓
[Ingestion Layer]     — Kafka / Flume / Kinesis
      ↓
[Processing Layer]    — Apache Spark Streaming / Flink / Storm
      ↓
[Storage Layer]       — Cassandra / HBase / Elasticsearch
      ↓
[Visualization Layer] — Grafana / Kibana / Tableau
```
🧠 Key Techniques:
| Technique | Purpose |
| --- | --- |
| Stream Processing | Real-time data computation (Spark Streaming, Flink) |
| Micro-Batching | Efficient processing in small time windows |
| Autoscaling Infrastructure | Dynamic resource allocation in the cloud (K8s, EMR) |
| Event-Driven Architecture | Process events instantly via Kafka or Pulsar |
| In-Memory Computing | Fast processing using RAM (Spark, Ignite) |
🛠 Example Tools:
Kafka + Spark Structured Streaming for low-latency pipelines
AWS Kinesis + Lambda for serverless real-time processing
Apache Flink for advanced stream processing with stateful operators
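Below is a minimal sketch of the Kafka + Spark Structured Streaming pattern mentioned above: it counts events per key from a Kafka topic and prints a running aggregation in micro-batches. The broker address and topic name are illustrative, and the spark-sql-kafka connector package must be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ClickstreamCounts").getOrCreate()

# Read a live stream from Kafka (broker address and topic are illustrative)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Kafka keys/values arrive as bytes; cast to string and count events per key
counts = (
    events.selectExpr("CAST(key AS STRING) AS page")
    .groupBy("page")
    .count()
)

# Write the running aggregation to the console (micro-batch processing)
query = (
    counts.writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```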
🚀 What is Apache Spark?
Apache Spark is an open-source, distributed computing framework
designed for fast processing of large-scale data. It supports batch
processing, streaming, machine learning, and SQL-based analytics —
all in one platform.
| Feature | Description |
| --- | --- |
| In-Memory Computing | Keeps intermediate data in memory for faster processing than Hadoop MapReduce |
| Unified Engine | Supports SQL, MLlib (Machine Learning), GraphX (Graph Processing), and Spark Streaming |
| Language Support | APIs available in Python, Scala, Java, and R |
| Distributed Computing | Splits tasks across a cluster for parallel execution |
| Fault Tolerant | Automatically handles failures using RDD lineage |
🔄 Spark vs Hadoop (MapReduce)
| Feature | Apache Spark | Hadoop MapReduce |
| --- | --- | --- |
| Speed | Faster (in-memory) | Slower (disk-based) |
| Ease of Use | Rich APIs in Python/Scala/Java | Low-level Java APIs |
| Data Processing | Batch + Streaming | Batch only |
| Machine Learning | Built-in MLlib | Needs integration with external tools |
🔥 Core Components of Apache Spark
1. Spark Core – The execution engine (RDDs, memory mgmt, fault
tolerance)
2. Spark SQL – Query structured data using SQL or DataFrames
3. Spark Streaming – Real-time data processing from sources like Kafka
4. MLlib – Machine learning library (classification, clustering, etc.)
5. GraphX – Graph processing (e.g., PageRank, graph traversal)
💡 Example: PySpark Code for Word Count
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Load text file
rdd = spark.sparkContext.textFile("sample.txt")

# Word count logic
counts = (
    rdd.flatMap(lambda line: line.split())  # split each line into words
    .map(lambda word: (word, 1))            # emit (word, 1) pairs
    .reduceByKey(lambda a, b: a + b)        # sum the counts per word
)

print(counts.collect())
```
✅ Common Use Cases
Real-time analytics (e.g., fraud detection, log monitoring)
ETL (Extract, Transform, Load) pipelines
Recommendation engines
Social media data analysis
IoT stream processing
🚀 Apache Spark Architecture: Overview
Apache Spark follows a master-slave architecture with the following key
components:
🧱 Core Components of Apache Spark:
| Component | Role |
| --- | --- |
| Driver Program | Controls the application, manages SparkContext, and coordinates tasks |
| Cluster Manager | Allocates resources across Spark applications (e.g., YARN, Mesos, Kubernetes) |
| Executors | Run tasks and return results to the driver |
| Tasks | Individual units of work sent to executors |
🔧 Detailed Components of Apache Spark
1. Spark Core (Foundation of everything)
Manages memory, fault-tolerance, job scheduling.
Provides the RDD (Resilient Distributed Dataset) abstraction for
distributed data.
2. Spark SQL
Allows querying structured and semi-structured data using SQL, DataFrames, and Datasets.
Can read from Hive, Parquet, JSON, JDBC, etc. (a short Spark SQL sketch follows this component list).
3. Spark Streaming
Enables real-time data processing.
Processes live data streams using micro-batching.
4. MLlib (Machine Learning Library)
Built-in library for scalable machine learning tasks:
o Classification, Regression, Clustering, Recommendation
5. GraphX
API for graph processing (e.g., social networks, recommendation
graphs).
Includes graph algorithms like PageRank and connected components.
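Here is the short Spark SQL sketch referenced above: the same data can be queried through the DataFrame API or through plain SQL over a temporary view. The file path and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLSketch").getOrCreate()

# Load semi-structured data into a DataFrame (path and fields are illustrative)
orders = spark.read.json("orders.json")

# Query it with the DataFrame API...
orders.groupBy("country").count().show()

# ...or with plain SQL over a temporary view
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT country, SUM(amount) AS total_amount
    FROM orders
    GROUP BY country
    ORDER BY total_amount DESC
""").show()
```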
⚙️ How Apache Spark Works (Step-by-Step Execution)
Let's understand with an example.
💼 Suppose: you want to count the words in a large text file using Spark (a short PySpark sketch of this workflow follows the steps below).
🔁 Spark Job Workflow:
1. Driver Program Starts
o It creates a SparkContext (entry point to Spark cluster).
2. Cluster Manager Allocates Resources
o The driver asks for executors on cluster nodes.
3. RDD/DataFrame Created
o Data is loaded into an RDD (e.g., from a text file).
4. Transformations Applied
o Operations like .map(), .filter(), .flatMap() define a DAG (Directed
Acyclic Graph).
5. Actions Trigger Execution
o An action like .collect() or .saveAsTextFile() starts actual
processing.
6. Task Scheduling
o Spark breaks the DAG into stages and tasks.
7. Tasks Sent to Executors
o Executors perform computations in parallel.
8. Results Returned
o Executors return the results to the driver, or write to storage.
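The PySpark sketch below mirrors the word-count workflow above; the numbered comments map to the steps, and the file name is illustrative. Note that transformations are lazy: nothing executes until the action at the end.

```python
from pyspark.sql import SparkSession

# 1-2. Driver starts, creates the SparkContext, and requests executors
spark = SparkSession.builder.appName("WorkflowSketch").getOrCreate()
sc = spark.sparkContext

# 3. An RDD is created from a text file (file name is illustrative)
lines = sc.textFile("sample.txt")

# 4. Transformations only build the DAG; nothing runs yet (lazy evaluation)
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# 5-8. The action triggers DAG -> stages -> tasks on the executors,
#      and the results come back to the driver
print(counts.take(10))

spark.stop()
```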
🖼 Spark Architecture Diagram (Text-Based)
```
+----------------------+
|    Driver Program    |  ← Controls the job
+----------------------+
           |
           v
+----------------------+        +----------------------+
|   Cluster Manager    | <----> |    Executors (n)     |
+----------------------+        +----------------------+
           |                              |
           v                              v
   Distribute tasks              Process data, store cache
```
🧠 Summary
| Component | Responsibility |
| --- | --- |
| Driver | Main controller, builds the job, sends tasks to workers |
| Executor | Workers that run tasks and store data |
| Cluster Manager | Manages resources and task scheduling |
| RDD/DataFrame | Data abstraction used for processing |