Big Data and Spark
What is Big Data?
• Definition: A term that describes massive volumes of data that
traditional systems struggle to handle due to size, complexity, and
speed.
• Scale: Approximately 2.5 quintillion bytes
(2,500,000,000,000,000,000) of data generated daily worldwide.
◦ Example: Imagine every photo uploaded to Instagram, every tweet,
and every Google search in a single day—combined, that’s Big Data!
• Why it matters: Businesses use Big Data to uncover trends, predict
behaviors, and make smarter decisions.
The 3 Vs of Big Data (+1 Bonus V)
1. Volume: The sheer scale of data.
◦ Example: Netflix storing petabytes of user watch history.
2. Variety: Different types and forms of data.
◦ Structured: Organized data like spreadsheets or databases (e.g., MySQL customer
records, CSV files).
◦ Semi-Structured: Partially organized, like JSON or XML files (e.g., API responses).
◦ Unstructured: Unorganized data like audio (podcasts), video (YouTube clips), images
(memes), and log files (server logs).
3. Velocity: The speed at which data is generated and processed.
◦ Examples:
▪ 900 million photos uploaded daily on Facebook.
▪ 600 million tweets posted on Twitter daily.
▪ 0.5 million hours of video uploaded to YouTube daily.
▪ 3.5 billion searches on Google daily.
4. Veracity (The 4th V): The uncertainty, noise, or poor quality of data.
◦ Example: Social media posts with typos, incomplete sensor data from IoT devices, or
outdated customer records.
Fun Fact: Some experts also talk about a 5th V—Value—extracting meaningful insights
from data.
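The three shapes under Variety can be sketched in a few lines of Python; the records below are made-up examples, not real datasets:

```python
import csv
import io
import json

# Structured: rows with a fixed schema, e.g. a CSV export of a database table
csv_text = "id,name\n1,Ada\n2,Grace\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: nested, self-describing JSON, e.g. an API response
api_response = '{"user": "Ada", "tags": ["ml", "spark"]}'
record = json.loads(api_response)

# Unstructured: raw text (a server log line) with no declared schema;
# structure must be imposed at read time, e.g. with split() or a regex
log_line = "2024-01-01 12:00:00 ERROR disk full"
level = log_line.split()[2]

print(rows[0]["name"], record["tags"][0], level)  # Ada ml ERROR
```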
Why Big Data?
• Purpose: To process and analyze massive datasets that traditional
systems (e.g., relational databases) can’t handle efficiently.
• Real-World Use Cases:
◦ E-commerce: Amazon recommending products based on your
browsing history.
◦ Healthcare: Analyzing patient data to predict disease outbreaks.
◦ Finance: Detecting fraudulent transactions in real-time.
Key Insight: Big Data isn’t just about size—it’s about unlocking hidden
patterns and insights.
Big Data System Requirements
1. Store: Must store massive amounts of data reliably.
◦ Example: Storing years of social media posts or IoT sensor readings.
2. Process: Must process data quickly and efficiently.
◦ Example: Analyzing customer reviews to improve a product in
hours, not weeks.
3. Scale: Must grow seamlessly as data needs increase.
◦ Example: Adding more servers to handle Black Friday shopping
spikes.
Two Ways to Build a System
1. Monolithic:
◦ Definition: One powerful machine with lots of CPU, RAM, and
storage.
◦ Pros: Simple to set up initially.
◦ Cons:
▪ Hard to scale after hitting hardware limits.
▪ Adding resources (vertical scaling) doesn’t always double
performance.
◦ Example: A single supercomputer struggling to process a year’s
worth of Twitter data.
2. Distributed:
◦ Definition: Many smaller machines working together as one system.
◦ Pros:
▪ Linear scalability (2x machines = ~2x performance).
▪ True horizontal scaling—add more machines as needed.
◦ Cons: More complex to manage.
◦ Example: Google’s search engine running on thousands of servers
worldwide.
◦ Key Takeaway: All modern Big Data systems (like Hadoop and
Spark) use distributed architecture.
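The "many machines, one job" idea can be mimicked on a single machine, with a thread pool standing in for cluster nodes. This is a toy sketch of horizontal partitioning, not a real cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    """Each 'node' counts the words in its own partition of the data."""
    return sum(len(line.split()) for line in partition)

lines = ["to be or not to be"] * 1000   # pretend this is a huge dataset
workers = 4

# Round-robin partitioning: every 4th line goes to the same worker,
# just as a distributed system splits data across machines.
partitions = [lines[i::workers] for i in range(workers)]

# Workers compute partial results in parallel; a coordinator sums them.
with ThreadPoolExecutor(max_workers=workers) as pool:
    partials = list(pool.map(count_words, partitions))
total = sum(partials)
print(total)  # 6000
```

Adding more workers (machines) splits the same data into more partitions, which is exactly the horizontal scaling described above.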
What is Hadoop?
• Definition: An open-source framework
designed to solve Big Data problems by
enabling distributed storage and processing.
• Core Idea: Break data into smaller chunks,
store them across multiple machines, and
process them in parallel.
Hadoop Evolution
• 2003: Google publishes the Google File System (GFS) paper—how to
store massive datasets across many machines.
• 2004: Google releases the MapReduce paper—a programming model
for processing large datasets in parallel.
• 2006: Yahoo builds HDFS (Hadoop Distributed File System) and
MapReduce based on Google’s ideas.
• 2009: Hadoop becomes an Apache open-source project, freely
available to all.
• 2013: Hadoop 2.0 introduces YARN and major performance upgrades.
Fun Fact: Hadoop is named after a toy elephant belonging to its creator
Doug Cutting’s son!
Hadoop Core Components
1. HDFS (Hadoop Distributed File System):
◦ Distributed storage system that splits data into blocks and spreads
them across multiple nodes.
◦ Example: A 1TB video file split into 128MB chunks stored on 10
machines.
2. YARN (Yet Another Resource Negotiator):
◦ Manages resources (CPU, memory) across the cluster and schedules
tasks.
◦ Example: Ensures one job doesn’t hog all the computing power.
3. MapReduce:
◦ A programming model for distributed data processing.
◦ Example: Counting word frequencies in a massive text file by
splitting the task across nodes.
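Both examples above can be worked through in plain Python: the HDFS block arithmetic is a one-liner, and the word-count job follows MapReduce's map → shuffle → reduce shape (here run sequentially, where Hadoop would spread the phases across nodes):

```python
from collections import defaultdict
from math import ceil

# HDFS example: a 1 TB file (1024 * 1024 MB) split into 128 MB blocks
blocks = ceil(1024 * 1024 / 128)
print(blocks)  # 8192 blocks, spread across the cluster's nodes

# MapReduce word count: map emits (word, 1) pairs, shuffle groups the
# pairs by key, reduce sums the counts for each word.
chunks = ["big data big", "data spark"]   # one chunk per mapper

mapped = [(word, 1) for chunk in chunks for word in chunk.split()]  # map
shuffled = defaultdict(list)
for word, one in mapped:                  # shuffle: group by key
    shuffled[word].append(one)
counts = {word: sum(ones) for word, ones in shuffled.items()}       # reduce
print(counts)  # {'big': 2, 'data': 2, 'spark': 1}
```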
Hadoop Ecosystem
• Hive: SQL-like tool for querying and analyzing data stored in HDFS.
◦ Example: Finding the most popular product in a sales dataset.
• Pig: Scripting language to process and transform data (great for
unstructured data).
◦ Example: Converting raw log files into a structured report.
• Sqoop: Transfers data between Hadoop and relational databases.
◦ Example: Importing customer data from MySQL into HDFS.
• HBase: NoSQL database for real-time, random access to data on HDFS.
◦ Example: Storing and querying live Twitter feeds.
• Oozie: Workflow scheduler to manage and automate Hadoop jobs.
◦ Example: Running a daily report generation job at midnight.
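Hive's "most popular product" example is just SQL run over files in HDFS. The query shape can be tried locally with sqlite3 standing in for Hive; the table name and rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("laptop", 3), ("phone", 7), ("laptop", 2), ("phone", 9)],
)

# An equivalent HiveQL query would run over a table backed by HDFS files.
top = conn.execute(
    "SELECT product, SUM(qty) AS total FROM sales "
    "GROUP BY product ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)  # ('phone', 16)
```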
Introduction to Apache Spark
• Definition: A distributed, general-purpose, in-memory compute engine
designed for speed and flexibility.
• Key Features:
◦ Processes data in-memory (much faster than Hadoop’s disk-based
MapReduce).
◦ Plug & Play: Works with various systems:
▪ Storage: Local storage, HDFS, Amazon S3, etc.
▪ Resource Managers: YARN, Mesos, Kubernetes.
◦ Written in Scala, with official support for Java, Scala, Python, and
R.
• Why Spark?:
◦ Up to 100x faster than Hadoop MapReduce for certain tasks (e.g.,
iterative machine learning).
◦ Easier to use with high-level APIs.
Example: Analyzing live streaming data (e.g., stock market ticks) in real-time.
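Spark programs are chains of transformations followed by an action. The pipeline shape can be mimicked in plain Python with the ticker data invented for illustration; in real PySpark the same chain would be written against an RDD (e.g., `sc.parallelize(ticks).filter(...).map(...).reduce(...)`) with the work distributed and held in memory across the cluster:

```python
from functools import reduce

# Fake stream of (symbol, price) ticks standing in for live market data
ticks = [("AAPL", 189.5), ("MSFT", 402.1), ("AAPL", 190.2)]

aapl = filter(lambda t: t[0] == "AAPL", ticks)   # transformation: keep one symbol
prices = map(lambda t: t[1], aapl)               # transformation: extract prices
total = reduce(lambda a, b: a + b, prices)       # action: aggregate the result
print(round(total, 1))  # 379.7
```

In Spark the transformations are lazy: nothing runs until the action (the reduce) forces the whole chain, which is part of what lets Spark keep intermediate data in memory instead of writing it to disk between steps.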
Thank You