Big Data and Spark
What is Big Data?
• Definition: A term that describes massive volumes of data that
traditional systems struggle to handle due to size, complexity, and
speed.
• Scale: Approximately 2.5 quintillion bytes
(2,500,000,000,000,000,000) of data generated daily worldwide.
◦ Example: Imagine every photo uploaded to Instagram, every tweet,
and every Google search in a single day—combined, that’s Big Data!
• Why it matters: Businesses use Big Data to uncover trends, predict
behaviors, and make smarter decisions.
The 3 Vs of Big Data (+1 Bonus V)
1. Volume: The sheer scale of data.
◦ Example: Netflix storing petabytes of user watch history.
2. Variety: Different types and forms of data.
◦ Structured: Organized data like spreadsheets or databases (e.g., MySQL customer
records, CSV files).
◦ Semi-Structured: Partially organized, like JSON or XML files (e.g., API responses).
◦ Unstructured: Unorganized data like audio (podcasts), video (YouTube clips), images
(memes), and log files (server logs).
3. Velocity: The speed at which data is generated and processed.
◦ Examples:
▪ 900 million photos uploaded daily on Facebook.
▪ 600 million tweets posted on Twitter daily.
▪ 0.5 million hours of video uploaded to YouTube daily.
▪ 3.5 billion searches on Google daily.
4. Veracity (The 4th V): The uncertainty, noise, or poor quality of data.
◦ Example: Social media posts with typos, incomplete sensor data from IoT devices, or
outdated customer records.
Fun Fact: Some experts also talk about a 5th V—Value—extracting meaningful insights
from data.
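The three shapes under Variety can be sketched in a few lines of Python; the records below are made-up examples, not real datasets:

```python
import csv
import io
import json

# Structured: rows with a fixed schema, e.g. a CSV export of a database table
csv_text = "id,name\n1,Ada\n2,Grace\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: nested, self-describing JSON, e.g. an API response
api_response = '{"user": "Ada", "tags": ["ml", "spark"]}'
record = json.loads(api_response)

# Unstructured: raw text (a server log line) with no declared schema;
# structure must be imposed at read time, e.g. with split() or a regex
log_line = "2024-01-01 12:00:00 ERROR disk full"
level = log_line.split()[2]

print(rows[0]["name"], record["tags"][0], level)  # Ada ml ERROR
```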
Why Big Data?
• Purpose: To process and analyze massive datasets that traditional
systems (e.g., relational databases) can’t handle efficiently.
• Real-World Use Cases:
◦ E-commerce: Amazon recommending products based on your
browsing history.
◦ Healthcare: Analyzing patient data to predict disease outbreaks.
◦ Finance: Detecting fraudulent transactions in real-time.
Key Insight: Big Data isn’t just about size—it’s about unlocking hidden
patterns and insights.
Big Data System Requirements
1. Store: Must store massive amounts of data reliably.
◦ Example: Storing years of social media posts or IoT sensor readings.
2. Process: Must process data quickly and efficiently.
◦ Example: Analyzing customer reviews to improve a product in
hours, not weeks.
3. Scale: Must grow seamlessly as data needs increase.
◦ Example: Adding more servers to handle Black Friday shopping
spikes.
Two Ways to Build a System
1. Monolithic:
◦ Definition: One powerful machine with lots of CPU, RAM, and
storage.
◦ Pros: Simple to set up initially.
◦ Cons:
▪ Hard to scale after hitting hardware limits.
▪ Adding resources (vertical scaling) doesn’t always double
performance.
◦ Example: A single supercomputer struggling to process a year’s
worth of Twitter data.
2. Distributed:
◦ Definition: Many smaller machines working together as one system.
◦ Pros:
▪ Linear scalability (2x machines = ~2x performance).
▪ True horizontal scaling—add more machines as needed.
◦ Cons: More complex to manage.
◦ Example: Google’s search engine running on thousands of servers
worldwide.
◦ Key Takeaway: All modern Big Data systems (like Hadoop and
Spark) use distributed architecture.
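The "many machines, one job" idea can be mimicked on a single machine, with a thread pool standing in for cluster nodes. This is a toy sketch of horizontal partitioning, not a real cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    """Each 'node' counts the words in its own partition of the data."""
    return sum(len(line.split()) for line in partition)

lines = ["to be or not to be"] * 1000   # pretend this is a huge dataset
workers = 4

# Round-robin partitioning: every 4th line goes to the same worker,
# just as a distributed system splits data across machines.
partitions = [lines[i::workers] for i in range(workers)]

# Workers compute partial results in parallel; a coordinator sums them.
with ThreadPoolExecutor(max_workers=workers) as pool:
    partials = list(pool.map(count_words, partitions))
total = sum(partials)
print(total)  # 6000
```

Adding more workers (machines) splits the same data into more partitions, which is exactly the horizontal scaling described above.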
What is Hadoop?
• Definition: An open-source framework
designed to solve Big Data problems by
enabling distributed storage and processing.
• Core Idea: Break data into smaller chunks,
store them across multiple machines, and
process them in parallel.
Hadoop Evolution
• 2003: Google publishes the Google File System (GFS) paper—how to
store massive datasets across many machines.
• 2004: Google releases the MapReduce paper—a programming model
for processing large datasets in parallel.
• 2006: Yahoo builds HDFS (Hadoop Distributed File System) and
MapReduce based on Google’s ideas.
• 2009: Hadoop becomes an Apache open-source project, freely
available to all.
• 2013: Hadoop 2.0 introduces YARN and major performance upgrades.
Fun Fact: Hadoop is named after a toy elephant belonging to its creator
Doug Cutting’s son!
Hadoop Core Components
1. HDFS (Hadoop Distributed File System):
◦ Distributed storage system that splits data into blocks and spreads
them across multiple nodes.
◦ Example: A 1TB video file split into 128MB chunks stored on 10
machines.
2. YARN (Yet Another Resource Negotiator):
◦ Manages resources (CPU, memory) across the cluster and schedules
tasks.
◦ Example: Ensures one job doesn’t hog all the computing power.
3. MapReduce:
◦ A programming model for distributed data processing.
◦ Example: Counting word frequencies in a massive text file by
splitting the task across nodes.
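Both examples above can be worked through in plain Python: the HDFS block arithmetic is a one-liner, and the word-count job follows MapReduce's map → shuffle → reduce shape (here run sequentially, where Hadoop would spread the phases across nodes):

```python
from collections import defaultdict
from math import ceil

# HDFS example: a 1 TB file (1024 * 1024 MB) split into 128 MB blocks
blocks = ceil(1024 * 1024 / 128)
print(blocks)  # 8192 blocks, spread across the cluster's nodes

# MapReduce word count: map emits (word, 1) pairs, shuffle groups the
# pairs by key, reduce sums the counts for each word.
chunks = ["big data big", "data spark"]   # one chunk per mapper

mapped = [(word, 1) for chunk in chunks for word in chunk.split()]  # map
shuffled = defaultdict(list)
for word, one in mapped:                  # shuffle: group by key
    shuffled[word].append(one)
counts = {word: sum(ones) for word, ones in shuffled.items()}       # reduce
print(counts)  # {'big': 2, 'data': 2, 'spark': 1}
```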
Hadoop Ecosystem
• Hive: SQL-like tool for querying and analyzing data stored in HDFS.
◦ Example: Finding the most popular product in a sales dataset.
• Pig: Scripting language to process and transform data (great for
unstructured data).
◦ Example: Converting raw log files into a structured report.
• Sqoop: Transfers data between Hadoop and relational databases.
◦ Example: Importing customer data from MySQL into HDFS.
• HBase: NoSQL database for real-time, random access to data on HDFS.
◦ Example: Storing and querying live Twitter feeds.
• Oozie: Workflow scheduler to manage and automate Hadoop jobs.
◦ Example: Running a daily report generation job at midnight.
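Hive's "most popular product" example is just SQL run over files in HDFS. The query shape can be tried locally with sqlite3 standing in for Hive; the table name and rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("laptop", 3), ("phone", 7), ("laptop", 2), ("phone", 9)],
)

# An equivalent HiveQL query would run over a table backed by HDFS files.
top = conn.execute(
    "SELECT product, SUM(qty) AS total FROM sales "
    "GROUP BY product ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)  # ('phone', 16)
```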
Introduction to Apache Spark
• Definition: A distributed, general-purpose, in-memory compute engine
designed for speed and flexibility.
• Key Features:
◦ Processes data in-memory (much faster than Hadoop’s disk-based
MapReduce).
◦ Plug & Play: Works with various systems:
▪ Storage: Local storage, HDFS, Amazon S3, etc.
▪ Resource Managers: YARN, Mesos, Kubernetes.
◦ Written in Scala, with official support for Java, Scala, Python, and
R.
• Why Spark?:
◦ Up to 100x faster than Hadoop MapReduce for certain tasks (e.g.,
iterative machine learning).
◦ Easier to use with high-level APIs.
Example: Analyzing live streaming data (e.g., stock market ticks) in real-time.
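Spark programs are chains of transformations followed by an action. The pipeline shape can be mimicked in plain Python with the ticker data invented for illustration; in real PySpark the same chain would be written against an RDD (e.g., `sc.parallelize(ticks).filter(...).map(...).reduce(...)`) with the work distributed and held in memory across the cluster:

```python
from functools import reduce

# Fake stream of (symbol, price) ticks standing in for live market data
ticks = [("AAPL", 189.5), ("MSFT", 402.1), ("AAPL", 190.2)]

aapl = filter(lambda t: t[0] == "AAPL", ticks)   # transformation: keep one symbol
prices = map(lambda t: t[1], aapl)               # transformation: extract prices
total = reduce(lambda a, b: a + b, prices)       # action: aggregate the result
print(round(total, 1))  # 379.7
```

In Spark the transformations are lazy: nothing runs until the action (the reduce) forces the whole chain, which is part of what lets Spark keep intermediate data in memory instead of writing it to disk between steps.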
Thank You