BIG DATA - COMPLETE SEMESTER NOTES
UNIT – I: INTRODUCTION TO BIG DATA
Types and Classification of Digital Data
Types of Digital Data:
1. Structured Data
• Organized in a tabular format.
• Stored in RDBMS.
• Examples: Bank transactions, sensor logs.
2. Semi-Structured Data
• Partially organized.
• Does not conform to a formal data model.
• Examples: XML, JSON, NoSQL documents.
3. Unstructured Data
• No fixed format.
• Examples: Emails, audio, video, social media content.
Classification of Digital Data:
• Human-generated Data: Emails, social media posts.
• Machine-generated Data: Sensor data, server logs.
• Metadata: Data about other data.
Introduction to Big Data
Evolution of Big Data:
• Emerged due to exponential growth of internet, mobile data, IoT.
• Traditional systems could not process unstructured data or very large volumes.
Definition of Big Data:
• Big Data is defined by the 5 V’s:
• Volume – Large amounts of data.
• Velocity – Speed of data generation and processing.
• Variety – Different types of data (text, video, logs).
• Veracity – Trustworthiness of the data.
• Value – Useful insights extracted from data.
Traditional BI vs Big Data
Feature        Traditional BI    Big Data
Storage        GB to TB          TB to PB
Data Type      Structured        All types
Architecture   Centralized       Distributed
Processing     Batch             Batch + Real-time
Tools          SQL, OLAP         Hadoop, Spark
Coexistence of Big Data and Data Warehouse
• Big Data complements data warehouses.
• Warehouses handle structured historical data.
• Big Data handles real-time and semi/unstructured data.
Big Data Analytics
What It Is:
• Advanced techniques to extract actionable insights from huge and diverse data.
What It Isn’t:
• Not merely collecting massive amounts of data or running faster computers.
• Not a discipline reserved for data scientists alone.
Why Sudden Hype:
• Cost-effective storage.
• Real-time decisions.
• Cloud computing.
Classification of Analytics:
1. Descriptive – What happened?
2. Diagnostic – Why did it happen?
3. Predictive – What will happen?
4. Prescriptive – What action should be taken?
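Two of these levels can be made concrete in a few lines of Python (a toy sketch; the sales figures and the naive forecasting rule are made up for illustration):

```python
# Toy monthly sales series (made-up numbers).
sales = [100, 120, 90, 130, 150]

# Descriptive analytics: what happened? Summarize the past.
average = sum(sales) / len(sales)

# Predictive analytics: what will happen? A naive forecast
# (last value plus the average month-to-month change).
changes = [b - a for a, b in zip(sales, sales[1:])]
forecast = sales[-1] + sum(changes) / len(changes)
print(average, forecast)  # 118.0 162.5
```

Diagnostic and prescriptive analytics go further, asking why the dip in month 3 occurred and what action would maximize future sales.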
Challenges for Businesses:
• Poor data quality.
• Lack of skilled professionals.
• Integration with existing systems.
• Privacy and security.
Importance of Big Data Analytics:
• Customer behavior analysis.
• Fraud detection.
• Operational efficiency.
• Real-time alerts.
Data Science and Terminologies
Data Science:
• Interdisciplinary field.
• Combines statistics, machine learning, data engineering, domain expertise.
Important Terminologies:
• HDFS: Distributed file storage.
• MapReduce: Batch processing framework.
• Hive: SQL-based query tool.
• Pig: Dataflow scripting language.
• Spark: In-memory data processing engine.
• Flume: Ingests streaming log and event data.
• Sqoop: Transfers data between RDBMS and Hadoop.
• YARN: Resource manager in Hadoop.
UNIT – II: HADOOP ECOSYSTEM
Features of Hadoop:
• Open-source.
• Highly scalable.
• Fault-tolerant.
• Runs on commodity hardware.
• Data replication for fault recovery.
Key Advantages:
• Cost-effective.
• Handles structured, semi-structured, and unstructured data.
• Supports multiple languages (Java, Python, etc.).
• Ecosystem includes various tools for different tasks.
Versions of Hadoop:
• Hadoop 1.x: Single NameNode, scalability issues.
• Hadoop 2.x: Introduced YARN, better resource management.
• Hadoop 3.x: Erasure coding, containerization support, better performance.
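The benefit of erasure coding can be seen with back-of-the-envelope arithmetic (a sketch comparing 3x replication with the RS(6,3) Reed-Solomon policy that Hadoop 3.x supports):

```python
# Storage overhead for 600 MB of data: 3x replication vs RS(6,3) erasure coding.
data_mb = 600

replicated_mb = data_mb * 3              # three full copies -> 200% overhead
parity_mb = data_mb / 6 * 3              # RS(6,3): 3 parity cells per 6 data cells
erasure_coded_mb = data_mb + parity_mb   # -> only 50% overhead
print(replicated_mb, erasure_coded_mb)   # 1800 900.0
```

Both schemes tolerate the loss of up to three nodes, but erasure coding halves the raw storage bill.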
Hadoop Ecosystem Overview:
• HDFS – Storage layer.
• MapReduce – Processing layer.
• YARN – Resource manager.
• Hive – SQL-like queries.
• Pig – Scripting language.
• HBase – Columnar storage DB.
• Oozie – Workflow scheduler.
• Flume – Ingests streaming logs.
• Sqoop – Transfers data between RDBMS and Hadoop.
Distributions:
• Cloudera, Hortonworks, MapR, Amazon EMR.
Need for Hadoop:
• Traditional RDBMSs can’t handle high volume and variety.
• Provides distributed storage and processing.
RDBMS vs Hadoop
Aspect        RDBMS        Hadoop
Data Types    Structured   All types
Schema        Fixed        Dynamic
Scalability   Vertical     Horizontal
Cost          Expensive    Low-cost (commodity hardware)
Real-time     Possible     Not in MapReduce (Spark preferred)
Distributed Computing Challenges:
• Node failure.
• Network latency.
• Synchronization.
• Load balancing.
History of Hadoop:
• Inspired by Google File System (GFS).
• Created by Doug Cutting and Mike Cafarella.
• Yahoo adopted and funded development.
HDFS:
• Master-slave architecture.
• NameNode: Metadata.
• DataNodes: Store blocks.
• Replication factor (default = 3).
• Designed for write-once, read-many workloads.
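The block and replication arithmetic can be sketched in a few lines (assuming the default 128 MB block size and replication factor 3):

```python
import math

# Hypothetical file: how HDFS splits and replicates it (default settings).
BLOCK_SIZE_MB = 128   # default HDFS block size in Hadoop 2.x/3.x
REPLICATION = 3       # default replication factor

file_size_mb = 1024   # a 1 GB file
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)  # blocks the file is split into
raw_storage_mb = file_size_mb * REPLICATION           # total stored cluster-wide
print(num_blocks, raw_storage_mb)  # 8 3072
```

The NameNode records only the metadata (which DataNodes hold which of the 8 blocks); the 3 GB of raw block data lives on the DataNodes.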
UNIT – III: PROCESSING DATA WITH HADOOP & NOSQL
MapReduce Programming
Introduction:
• Programming model for distributed processing of large datasets.
Components:
• Mapper: Processes input data and emits key-value pairs.
• Reducer: Aggregates values based on keys from the mapper.
• Combiner: Optional local reducer to optimize performance.
• Partitioner: Decides which reducer a key-value pair should go to.
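These components can be simulated on a single machine with the classic word-count sketch in Python (the sort step stands in for the framework's shuffle-and-sort, which the partitioner drives between map and reduce):

```python
from itertools import groupby
from operator import itemgetter

# Mapper: emit a (word, 1) pair for every word in a line of input.
def mapper(line):
    for word in line.lower().split():
        yield (word, 1)

# Reducer: sum all counts that arrived for one key.
def reducer(word, counts):
    return (word, sum(counts))

# Shuffle-and-sort phase, simulated by sorting all emitted pairs by key.
lines = ["big data is big", "data is everywhere"]
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(
    reducer(key, (count for _, count in group))
    for key, group in groupby(pairs, key=itemgetter(0))
)
print(result)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

A combiner would apply the same summing logic on each mapper's local output before the shuffle, reducing network traffic.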
NoSQL Databases
Introduction:
• Non-relational databases designed for horizontal scalability and flexible data models.
Types:
1. Key-Value Stores (e.g., Redis, Riak)
2. Document Stores (e.g., MongoDB, CouchDB)
3. Column Stores (e.g., Cassandra, HBase)
4. Graph Databases (e.g., Neo4j)
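The key-value model can be sketched with a toy in-memory store in Python (illustration only; real stores such as Redis add persistence, expiry, and replication):

```python
# A toy in-memory key-value store illustrating the NoSQL key-value model.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values can be any shape: the store is schema-free.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:1", {"name": "Alice", "age": 25})  # a document-like value
print(store.get("user:1")["name"])  # Alice
```

Document stores extend this idea by indexing and querying inside the stored values rather than by key alone.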
Advantages:
• Schema-free
• Horizontal scaling
• High performance
• Better handling of unstructured data
Use in Industry:
• Real-time web apps
• E-commerce
• Social media analytics
• IoT applications
SQL vs NoSQL vs NewSQL
Feature          SQL               NoSQL                          NewSQL
Schema           Fixed             Dynamic                        Fixed
Scalability      Vertical          Horizontal                     Horizontal
ACID Support     Full              Limited                        Full
Query Language   SQL               Varies                         SQL
Ideal for        Structured data   Unstructured/semi-structured   OLTP + Big Data
UNIT – IV: MONGODB
Necessity of MongoDB
• High availability and scalability
• Schema flexibility
• Rich querying and indexing capabilities
Terms in MongoDB vs RDBMS
MongoDB      RDBMS
Document     Row
Collection   Table
Field        Column
Index        Index
_id          Primary Key
Datatypes in MongoDB
• String, Integer, Double, Boolean
• Array
• ObjectId
• Embedded documents
• Null, Date
MongoDB Query Language
// Insert a document
> db.users.insertOne({name: "Alice", age: 25});
// Find users older than 20
> db.users.find({age: {$gt: 20}});
// Update a document
> db.users.updateOne({name: "Alice"}, {$set: {age: 26}});
// Delete a document
> db.users.deleteOne({name: "Alice"});
UNIT – V: R PROGRAMMING
Introduction to R
• Statistical computing language
• Open-source and powerful for data analysis and visualization
Operators in R
• Arithmetic: +, -, *, /, ^
• Relational: <, <=, >, >=, ==, !=
• Logical: &, |, !
Control Statements and Functions
• if, else, for, while, repeat
add <- function(x, y) {
  return(x + y)
}
Data Structures
• Vectors: One-dimensional
• Matrices: Two-dimensional
• Lists: Ordered collections that can hold elements of different types
• Data Frames: Table-like structure
• Factors: Categorical data
• Tables: Frequency counts
Input and Output
name <- readline("Enter your name: ")  # read a line from the console
write.csv(df, "output.csv")            # write a data frame df to a CSV file
Graphs in R
• plot(), barplot(), hist(), boxplot(), pie()
Apply Family
• apply(), lapply(), sapply(), tapply(), mapply()
• Apply a function over the elements of a data structure without writing explicit loops.
END OF SEMESTER NOTES