14 December 2024 11:30
Here’s an overview of the key Hadoop and big data concepts:
1. Data Format
• Refers to the structure or organization of data to make it usable and analyzable. Examples:
○ Structured Data: Stored in relational databases (rows/columns).
○ Semi-Structured Data: JSON, XML, etc.
○ Unstructured Data: Logs, images, audio, video.
2. Analyzing Data with Hadoop
• Hadoop is an open-source framework for processing and storing large datasets using a
distributed system.
• Core Components:
1. HDFS (Hadoop Distributed File System):
▪ Stores data across a cluster of machines.
▪ Provides fault tolerance and high throughput.
2. MapReduce:
▪ Programming model for processing large datasets in parallel.
▪ Splits jobs into smaller tasks (Map) and aggregates results (Reduce).
• Steps in Analyzing Data:
○ Load data into HDFS.
○ Use MapReduce or other tools like Hive, Pig, or Spark to process the data (a word-count sketch follows after these steps).
○ Export the processed data for visualization or further analysis.
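Example (a minimal word-count sketch of this workflow, written against the standard org.apache.hadoop.mapreduce API; class names and the command-line input/output paths are placeholders):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}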
3. Scaling Out
• Scaling out involves adding more nodes to a Hadoop cluster to handle larger datasets and
increase computational power.
○ Horizontal Scaling: Adding commodity hardware to the cluster.
○ Key advantages: fault tolerance and near-linear scalability.
4. Hadoop Streaming
• A utility that allows you to write MapReduce jobs in any programming language (Python,
Ruby, Perl, etc.).
• How it Works:
○ Mapper and Reducer scripts communicate with Hadoop through standard input and
output (stdin/stdout).
○ Example:
hadoop jar /path/to/hadoop-streaming.jar \
  -mapper /path/to/mapper.py \
  -reducer /path/to/reducer.py \
  -input /path/to/input \
  -output /path/to/output
5. Hadoop Pipes
• A C++ API for writing MapReduce programs.
• Benefits:
○ Allows developers to use C++ for high-performance tasks.
○ Offers tighter integration with the Hadoop framework than Streaming (it communicates with the task runtime over sockets rather than stdin/stdout).
6. Design of Hadoop Distributed File System (HDFS)
• HDFS is the storage layer of Hadoop, optimized for large datasets.
• Key Features:
○ Block Storage:
▪ Files are split into blocks (default size: 128 MB) and distributed across nodes.
○ Fault Tolerance:
▪ Replicates data (default: 3 copies) to ensure reliability.
○ Write Once, Read Many:
▪ Optimized for workloads where data is written once and read multiple times.
7. HDFS Concepts
• NameNode:
○ Manages metadata (directory structure, file locations).
○ Acts as the master node.
• DataNode:
○ Stores actual data blocks.
○ Sends regular heartbeats and block reports to the NameNode.
• Secondary NameNode:
○ Periodically merges the NameNode's edit log into the filesystem image (checkpointing).
○ It is not a failover node; its role is to keep the edit log from growing too large.
• Replication:
○ Data is replicated across multiple DataNodes for fault tolerance (the sketch below shows how a client can inspect this metadata).
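Example (a sketch of how a client can query this metadata through the FileSystem API: the NameNode supplies the block size, replication factor, and which DataNodes hold each block; the path is a placeholder):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/data.txt")); // placeholder path

        // Metadata served by the NameNode
        System.out.println("Block size : " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // Which DataNodes hold each block of the file
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " on hosts " + String.join(", ", block.getHosts()));
        }
    }
}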
8. Java Interface for HDFS
• HDFS can be interacted with using Java APIs for file operations like reading, writing, and
deleting files.
• Example: Writing a file to HDFS using Java.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
Path path = new Path("/user/data.txt");
FSDataOutputStream outputStream = fs.create(path);
outputStream.writeUTF("Hello, HDFS!");
outputStream.close();
9. Data Flow in Hadoop
• Write Operation:
○ The client asks the NameNode to allocate blocks, then streams each block to a pipeline of DataNodes.
○ Metadata is updated in the NameNode.
• Read Operation:
○ Client requests metadata from the NameNode.
○ Reads block data directly from the DataNodes that hold each block (see the read sketch below).
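Example (a minimal read sketch mirroring the writeUTF() example in section 8; open() obtains the block locations from the NameNode, and the bytes are then streamed from the DataNodes; the path is a placeholder):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode where the blocks live; the data itself
        // is read directly from the DataNodes that hold those blocks
        try (FSDataInputStream in = fs.open(new Path("/user/data.txt"))) {
            System.out.println(in.readUTF()); // mirrors the writeUTF() call in section 8
        }
    }
}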
10. Hadoop I/O
• Data Integrity:
○ Ensures data consistency using checksums.
○ Data corruption is detected and repaired using replicas.
• Compression:
○ Reduces storage and speeds up data transfer.
○ Supported formats: Gzip, Snappy, LZO, etc.
• Serialization:
○ Converts objects into a format that can be stored and transmitted.
○ Hadoop uses the Writable interface for serializing keys and values (a minimal example follows below).
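Example (a minimal custom Writable sketch; the PageView type and its fields are illustrative, not part of Hadoop):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hadoop serializes this type by calling write() and rebuilds it on the
// receiving side by calling readFields() on a fresh instance.
public class PageView implements Writable {
    private String url;
    private long hits;

    public PageView() { }                  // no-arg constructor required by Hadoop

    public PageView(String url, long hits) {
        this.url = url;
        this.hits = hits;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);                 // fields written in a fixed order
        out.writeLong(hits);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();                // fields read back in the same order
        hits = in.readLong();
    }
}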
11. Avro
• A data serialization framework used in Hadoop.
• Features:
○ Compact and efficient.
○ Schema evolution (can update schema without breaking compatibility).
• Example of Avro schema:
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
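Example of writing records that match the User schema above using Avro's GenericRecord API (a sketch; the record values and the output file name users.avro are placeholders):
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Parse the same schema shown above (it could also be loaded from a .avsc file)
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // The schema is embedded in the file header, making the data self-describing
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, new File("users.avro")); // placeholder output file
            writer.append(user);
        }
    }
}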
12. File-Based Data Structures
• Hadoop supports various file formats:
○ Text: Plain text (inefficient for large datasets).
○ SequenceFile: Binary format for key-value pairs (see the sketch after this list).
○ Parquet/ORC: Columnar formats optimized for analytics.
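Example (a SequenceFile writer sketch using Hadoop's Writer options API; the path and the key-value pairs are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/data.seq"); // placeholder path

        // Write binary key-value pairs; keys and values are Writable types
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("apple"), new IntWritable(3));
            writer.append(new Text("banana"), new IntWritable(5));
        }
    }
}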
13. Cassandra-Hadoop Integration
• Cassandra is a NoSQL database that can integrate with Hadoop to process data stored in
Cassandra using Hadoop's MapReduce framework.
• Integration Approaches:
○ Use Cassandra Hadoop InputFormat and OutputFormat to read/write Cassandra
tables.
○ Run Spark jobs on top of Cassandra for faster and more flexible analytics.
Hadoop MapReduce Data Flow
Data moves through the following stages from input to output. Below is a detailed explanation of each step:
1. Input Splits
• Description:
○ The input data stored in HDFS is split into smaller chunks called input splits.
○ Each split is assigned to a single Mapper for processing.
• Key Feature:
○ Splits ensure parallelism by dividing the workload among multiple nodes.
2. Mapper Phase
• Description:
○ Each Mapper processes one input split, reading records line-by-line or based on key-
value pairs.
○ Transforms data into intermediate key-value pairs.
• Example:
○ Input: A text file of words.
○ Output: Key-value pairs like (word, 1).
3. Shuffle and Sort
• Description:
○ Intermediate key-value pairs from all Mappers are sorted and grouped by key.
○ The Shuffle step ensures that all values for a single key are sent to the same
Reducer.
• Key Operation:
○ Distributed sorting to group similar keys together.
4. Reducer Phase
• Description:
○ Reducers aggregate the grouped data received from the Shuffle phase.
○ Performs computations like counting, summing, or merging.
• Example:
○ Input: (word, [1, 1, 1]).
○ Output: (word, 3).
5. Final Output
• Description:
○ The results from Reducers are written back to HDFS.
○ Output can be used for further processing or analysis (a small end-to-end trace of these stages follows below).
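Example (a small stand-alone sketch that traces these five stages in plain Java on an in-memory input, with no Hadoop cluster involved; the sample lines are placeholders):
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceFlowDemo {
    public static void main(String[] args) {
        // 1. Input split: two "records" (lines) that would go to Mappers
        List<String> split = Arrays.asList("big data big", "data flow");

        // 2. Mapper phase: emit intermediate (word, 1) pairs
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : split) {
            for (String word : line.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // 3. Shuffle and sort: group all values for the same key, sorted by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // 4. Reducer phase: aggregate, e.g. (big, [1, 1]) -> (big, 2)
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            // 5. Final output
            System.out.println("(" + entry.getKey() + ", " + sum + ")");
        }
    }
}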
Avro: Key Features
• Schema-Based:
○ Avro uses JSON-based schemas to define the structure of the data.
○ Schemas allow for both forward and backward compatibility when data structures evolve over time.
• Compact and Efficient:
○ Data is stored in a binary format, minimizing size and improving processing speed.
• Interoperability:
○ Avro supports multi-language bindings, allowing data exchange between applications written in different programming languages (Java, Python, etc.).
• Self-Describing:
○ Data files contain the schema, so they can be read without external metadata.
Data Serialization and Deserialization
• Serialization:
○ Converts data into a compact binary format.
○ Useful for storage or transmission.
• Deserialization:
○ Converts binary Avro data back into its original structure using the schema (see the sketch below).
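Example of reading records back (a sketch continuing the Avro writing example in section 11; the reader recovers the writer's schema from the file header, so no external schema definition is needed):
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroReadExample {
    public static void main(String[] args) throws Exception {
        // The embedded writer schema is read from the file header when the file is opened
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("users.avro"), new GenericDatumReader<>())) {
            for (GenericRecord user : reader) {
                System.out.println(user.get("name") + " is " + user.get("age"));
            }
        }
    }
}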
Common Use Cases
1. Data Storage:
○ Efficient storage of structured data in HDFS or other systems.
2. Data Transfer:
○ Used in Kafka or REST APIs for transferring compact data.
3. Schema Evolution:
○ Handles changes to data structure without breaking existing systems.
Advantages
• Compact and fast.
• Schema evolution support.
• Language-agnostic.
• Optimized for Hadoop and distributed systems.
Limitations
• Requires schema management.
• Less flexible for unstructured data.
Cassandra and Hadoop Integration
Cassandra and Hadoop integration is used to leverage the distributed data storage capabilities of
Cassandra with the data processing power of Hadoop. This integration helps businesses process
large-scale, distributed data efficiently.
Cassandra, being a NoSQL database, offers real-time data management, while Hadoop excels at batch
processing and analytics. Combining the two systems creates a hybrid architecture suitable for high-
throughput real-time applications and analytics.
Why Integrate Cassandra with Hadoop?
1. Real-Time + Batch Processing:
○ Cassandra supports real-time applications with low latency.
○ Hadoop is ideal for offline batch processing of massive datasets.
2. Distributed and Scalable Systems:
○ Both systems are designed for scalability and fault tolerance, making them a natural fit for
large-scale data processing.
3. Unified Data Management:
○ Cassandra stores raw, semi-structured, or structured data, while Hadoop analyzes and
transforms the data stored in Cassandra.
Components Used in Cassandra-Hadoop Integration
1. Hadoop MapReduce:
○ Used for distributed batch processing of data stored in Cassandra.
○ Provides a mechanism to process large datasets in parallel.
2. Hadoop InputFormat/OutputFormat:
○ CassandraInputFormat: Reads data from Cassandra tables to Hadoop jobs.
○ CassandraOutputFormat: Writes processed data from Hadoop jobs back to Cassandra
tables.
3. Hive with Cassandra (Optional):
○ Apache Hive, a data warehouse system, can run SQL-like queries on data stored in
Cassandra.
4. Spark (Optional for Real-Time Processing):
○ Apache Spark is often integrated into this workflow for real-time analytics on Cassandra
data.
Architecture Overview
The integration typically works as follows:
1. Data Ingestion:
○ Raw data is ingested into Cassandra via APIs or streaming platforms.
○ Cassandra acts as the primary storage for transactional or semi-structured data.
2. Hadoop MapReduce Job:
○ Data is read from Cassandra tables using CassandraInputFormat.
○ A MapReduce job performs the required transformations or analytics.
○ Results are written back to Cassandra using CassandraOutputFormat or to HDFS for further
processing.
3. Optional Data Pipeline:
○ Data processed by Hadoop can be stored in HDFS or used for ETL (Extract, Transform, Load)
pipelines.
○ Hive or Pig can be used for querying and additional analysis.
Steps for Cassandra and Hadoop Integration
1. Prerequisites
• Cassandra Cluster:
○ Ensure that Apache Cassandra is installed and running on your cluster.
• Hadoop Cluster:
○ Set up a Hadoop cluster with HDFS and MapReduce configured.
• Hadoop Cassandra Connector:
○ Use the Hadoop-Cassandra integration connector provided by Apache.
2. Configuring Cassandra Input and Output Format
1. CassandraInputFormat:
○ Reads data from Cassandra tables into Hadoop jobs.
○ Example Configuration:
conf.set("cassandra.input.split.size", "64"); // Splits data into chunks
ConfigHelper.setInputColumnFamily(conf, "keyspace_name", "table_name");
ConfigHelper.setInputInitialAddress(conf, "127.0.0.1"); // Cassandra node IP
2. CassandraOutputFormat:
○ Writes processed results from Hadoop back to Cassandra.
○ Example Configuration:
ConfigHelper.setOutputColumnFamily(conf, "keyspace_name", "output_table_name");
ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1"); // Cassandra node IP
3. Writing a MapReduce Job for Cassandra
A typical MapReduce job consists of:
• Mapper:
○ Reads data from Cassandra using CassandraInputFormat.
• Reducer:
○ Performs aggregation or computation.
• Output:
○ Writes results back to Cassandra using CassandraOutputFormat.
Example Code Snippet:
public class CassandraHadoopExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setInputFormatClass(CassandraInputFormat.class);
        job.setOutputFormatClass(CassandraOutputFormat.class);

        // Set Mapper and Reducer classes
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Configure Cassandra Input
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "keyspace_name", "table_name");

        // Configure Cassandra Output
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "keyspace_name", "output_table_name");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
4. Deploying and Running the Job
1. Compile the MapReduce job into a JAR file.
2. Submit the JAR file to the Hadoop cluster using the following command:
hadoop jar cassandra_hadoop_integration.jar MainClassName
Use Case: Analyzing User Behavior
Scenario:
• Cassandra stores clickstream data for an e-commerce website.
• Hadoop processes this data to find popular products.
Steps:
1. Store Raw Data:
○ Clickstream data is ingested into a Cassandra table (clickstream).
2. Process Data with Hadoop:
○ Read data using CassandraInputFormat.
○ Perform analytics using a MapReduce job (e.g., count product clicks).
○ Write results back to Cassandra (popular_products table).
Advantages of Cassandra-Hadoop Integration
1. Real-Time and Batch Hybrid:
○ Cassandra handles real-time workloads; Hadoop provides offline analytics.
2. High Scalability:
○ Both Cassandra and Hadoop are highly scalable and work well for distributed systems.
3. Efficient Data Movement:
○ Input and output formats ensure seamless integration without additional ETL overhead.
4. Fault Tolerance:
○ Both systems replicate data and jobs for fault-tolerant operations.
Challenges
1. Setup Complexity:
○ Integrating and maintaining the two systems can be challenging.
2. Query Limitations:
○ Cassandra’s query capabilities are limited compared to traditional SQL databases.
3. Performance Tuning:
○ Ensuring optimal performance for both systems may require significant tuning.