BIG DATA - COMPLETE SEMESTER NOTES
UNIT – I: INTRODUCTION TO BIG DATA
Types and Classification of Digital Data
Types of Digital Data:
1. Structured Data
• Organized in a tabular format.
• Stored in RDBMS.
• Examples: Bank transactions, sensor logs.
2. Semi-Structured Data
• Partially organized.
• Does not conform to a formal data model.
• Examples: XML, JSON, NoSQL documents.
3. Unstructured Data
• No fixed format.
• Examples: Emails, audio, video, social media content.
Classification of Digital Data:
• Human-generated Data: Emails, social media posts.
• Machine-generated Data: Sensor data, server logs.
• Metadata: Data about other data.
Introduction to Big Data
Evolution of Big Data:
• Emerged due to exponential growth of internet, mobile data, IoT.
• Traditional systems could not process unstructured data or very large volumes.
Definition of Big Data:
• Big Data is defined by the 5 V’s:
• Volume – Large amounts of data.
• Velocity – Speed of data generation and processing.
• Variety – Different types of data (text, video, logs).
• Veracity – Trustworthiness of the data.
• Value – Useful insights extracted from data.
Traditional BI vs Big Data
Feature        Traditional BI    Big Data
Storage        GB to TB          TB to PB
Data Type      Structured        All types
Architecture   Centralized       Distributed
Processing     Batch             Batch + Real-time
Tools          SQL, OLAP         Hadoop, Spark
Coexistence of Big Data and Data Warehouse
• Big Data complements data warehouses.
• Warehouses handle structured historical data.
• Big Data handles real-time and semi/unstructured data.
Big Data Analytics
What It Is:
• Advanced techniques to extract actionable insights from huge and diverse data.
What It Isn’t:
• Not merely collecting massive amounts of data or running faster computers.
• Not a discipline reserved for data scientists alone.
Why Sudden Hype:
• Cost-effective storage.
• Real-time decisions.
• Cloud computing.
Classification of Analytics:
1. Descriptive – What happened?
2. Diagnostic – Why did it happen?
3. Predictive – What will happen?
4. Prescriptive – What action should be taken?
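Two of these levels can be made concrete in a few lines of Python (a toy sketch; the sales figures and the naive forecasting rule are made up for illustration):

```python
# Toy monthly sales series (made-up numbers).
sales = [100, 120, 90, 130, 150]

# Descriptive analytics: what happened? Summarize the past.
average = sum(sales) / len(sales)

# Predictive analytics: what will happen? A naive forecast
# (last value plus the average month-to-month change).
changes = [b - a for a, b in zip(sales, sales[1:])]
forecast = sales[-1] + sum(changes) / len(changes)
print(average, forecast)  # 118.0 162.5
```

Diagnostic and prescriptive analytics go further, asking why the dip in month 3 occurred and what action would maximize future sales.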
Challenges for Businesses:
• Poor data quality.
• Lack of skilled professionals.
• Integration with existing systems.
• Privacy and security.
Importance of Big Data Analytics:
• Customer behavior analysis.
• Fraud detection.
• Operational efficiency.
• Real-time alerts.
Data Science and Terminologies
Data Science:
• Interdisciplinary field.
• Combines statistics, machine learning, data engineering, domain expertise.
Important Terminologies:
• HDFS: Distributed file storage.
• MapReduce: Batch processing framework.
• Hive: SQL-based query tool.
• Pig: Dataflow scripting language.
• Spark: In-memory data processing engine.
• Flume: Ingests streaming log and event data.
• Sqoop: Transfers data between RDBMS and Hadoop.
• YARN: Resource manager in Hadoop.
UNIT – II: HADOOP ECOSYSTEM
Features of Hadoop:
• Open-source.
• Highly scalable.
• Fault-tolerant.
• Runs on commodity hardware.
• Data replication for fault recovery.
Key Advantages:
• Cost-effective.
• Handles structured, semi-structured, and unstructured data.
• Supports multiple languages (Java, Python, etc.).
• Ecosystem includes various tools for different tasks.
Versions of Hadoop:
• Hadoop 1.x: Single NameNode, scalability issues.
• Hadoop 2.x: Introduced YARN, better resource management.
• Hadoop 3.x: Erasure coding, containerization support, better performance.
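The benefit of erasure coding can be seen with back-of-the-envelope arithmetic (a sketch comparing 3x replication with the RS(6,3) Reed-Solomon policy that Hadoop 3.x supports):

```python
# Storage overhead for 600 MB of data: 3x replication vs RS(6,3) erasure coding.
data_mb = 600

replicated_mb = data_mb * 3              # three full copies -> 200% overhead
parity_mb = data_mb / 6 * 3              # RS(6,3): 3 parity cells per 6 data cells
erasure_coded_mb = data_mb + parity_mb   # -> only 50% overhead
print(replicated_mb, erasure_coded_mb)   # 1800 900.0
```

Both schemes tolerate the loss of up to three nodes, but erasure coding halves the raw storage bill.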
Hadoop Ecosystem Overview:
• HDFS – Storage layer.
• MapReduce – Processing layer.
• YARN – Resource manager.
• Hive – SQL-like queries.
• Pig – Scripting language.
• HBase – Columnar storage DB.
• Oozie – Workflow scheduler.
• Flume – Ingests streaming logs.
• Sqoop – Transfers data between RDBMS and Hadoop.
Distributions:
• Cloudera, Hortonworks, MapR, Amazon EMR.
Need for Hadoop:
• Traditional RDBMSs can’t handle high volume and variety.
• Provides distributed storage and processing.
RDBMS vs Hadoop
Aspect        RDBMS        Hadoop
Data Types    Structured   All types
Schema        Fixed        Dynamic
Scalability   Vertical     Horizontal
Cost          Expensive    Low-cost (commodity hardware)
Real-time     Possible     Not in MapReduce (Spark preferred)
Distributed Computing Challenges:
• Node failure.
• Network latency.
• Synchronization.
• Load balancing.
History of Hadoop:
• Inspired by Google File System (GFS).
• Created by Doug Cutting and Mike Cafarella.
• Yahoo adopted and funded development.
HDFS:
• Master-slave architecture.
• NameNode: Metadata.
• DataNodes: Store blocks.
• Replication factor (default = 3).
• Designed for write-once, read-many workloads.
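The block and replication arithmetic can be sketched in a few lines (assuming the default 128 MB block size and replication factor 3):

```python
import math

# Hypothetical file: how HDFS splits and replicates it (default settings).
BLOCK_SIZE_MB = 128   # default HDFS block size in Hadoop 2.x/3.x
REPLICATION = 3       # default replication factor

file_size_mb = 1024   # a 1 GB file
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)  # blocks the file is split into
raw_storage_mb = file_size_mb * REPLICATION           # total stored cluster-wide
print(num_blocks, raw_storage_mb)  # 8 3072
```

The NameNode records only the metadata (which DataNodes hold which of the 8 blocks); the 3 GB of raw block data lives on the DataNodes.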
UNIT – III: PROCESSING DATA WITH HADOOP & NOSQL
MapReduce Programming
Introduction:
• Programming model for distributed processing of large datasets.
Components:
• Mapper: Processes input data and emits key-value pairs.
• Reducer: Aggregates values based on keys from the mapper.
• Combiner: Optional local reducer to optimize performance.
• Partitioner: Decides which reducer a key-value pair should go to.
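These components can be simulated on a single machine with the classic word-count sketch in Python (the sort step stands in for the framework's shuffle-and-sort, which the partitioner drives between map and reduce):

```python
from itertools import groupby
from operator import itemgetter

# Mapper: emit a (word, 1) pair for every word in a line of input.
def mapper(line):
    for word in line.lower().split():
        yield (word, 1)

# Reducer: sum all counts that arrived for one key.
def reducer(word, counts):
    return (word, sum(counts))

# Shuffle-and-sort phase, simulated by sorting all emitted pairs by key.
lines = ["big data is big", "data is everywhere"]
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(
    reducer(key, (count for _, count in group))
    for key, group in groupby(pairs, key=itemgetter(0))
)
print(result)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

A combiner would apply the same summing logic on each mapper's local output before the shuffle, reducing network traffic.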
NoSQL Databases
Introduction:
• Non-relational databases designed for horizontal scalability and flexible data models.
Types:
1. Key-Value Stores (e.g., Redis, Riak)
2. Document Stores (e.g., MongoDB, CouchDB)
3. Column Stores (e.g., Cassandra, HBase)
4. Graph Databases (e.g., Neo4j)
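The key-value model can be sketched with a toy in-memory store in Python (illustration only; real stores such as Redis add persistence, expiry, and replication):

```python
# A toy in-memory key-value store illustrating the NoSQL key-value model.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values can be any shape: the store is schema-free.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:1", {"name": "Alice", "age": 25})  # a document-like value
print(store.get("user:1")["name"])  # Alice
```

Document stores extend this idea by indexing and querying inside the stored values rather than by key alone.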
Advantages:
• Schema-free
• Horizontal scaling
• High performance
• Better handling of unstructured data
Use in Industry:
• Real-time web apps
• E-commerce
• Social media analytics
• IoT applications
SQL vs NoSQL vs NewSQL
Feature          SQL               NoSQL                          NewSQL
Schema           Fixed             Dynamic                        Fixed
Scalability      Vertical          Horizontal                     Horizontal
ACID Support     Full              Limited                        Full
Query Language   SQL               Varies                         SQL
Ideal for        Structured data   Unstructured/semi-structured   OLTP + Big Data
UNIT – IV: MONGODB
Necessity of MongoDB
• High availability and scalability
• Schema flexibility
• Rich querying and indexing capabilities
Terms in MongoDB vs RDBMS
MongoDB      RDBMS
Document     Row
Collection   Table
Field        Column
Index        Index
_id          Primary Key
Datatypes in MongoDB
• String, Integer, Double, Boolean
• Array
• ObjectId
• Embedded documents
• Null, Date
MongoDB Query Language
// Insert a document
> db.users.insertOne({name: "Alice", age: 25});
// Find users older than 20
> db.users.find({age: {$gt: 20}});
// Update a document
> db.users.updateOne({name: "Alice"}, {$set: {age: 26}});
// Delete a document
> db.users.deleteOne({name: "Alice"});
UNIT – V: R PROGRAMMING
Introduction to R
• Statistical computing language
• Open-source and powerful for data analysis and visualization
Operators in R
• Arithmetic: +, -, *, /, ^
• Relational: <, <=, >, >=, ==, !=
• Logical: &, |, !
Control Statements and Functions
• if, else, for, while, repeat
add <- function(x, y) {
  return(x + y)
}
Data Structures
• Vectors: One-dimensional
• Matrices: Two-dimensional
• Lists: Ordered collections that can hold elements of different types
• Data Frames: Table-like structure
• Factors: Categorical data
• Tables: Frequency counts
Input and Output
name <- readline("Enter your name: ")  # read a line from the console
write.csv(df, "output.csv")            # write a data frame df to a CSV file
Graphs in R
• plot(), barplot(), hist(), boxplot(), pie()
Apply Family
• apply(), lapply(), sapply(), tapply(), mapply()
• Apply a function over the elements of a data structure without writing explicit loops.
END OF SEMESTER NOTES