Big Data Analytics

The document provides an overview of key concepts related to data formats and the Hadoop ecosystem, including data storage, processing, and integration with Cassandra. It details the structure of Hadoop's components such as HDFS, MapReduce, and data flow, along with the advantages of using Avro for data serialization. Additionally, it discusses the integration of Cassandra with Hadoop for efficient real-time and batch data processing, outlining the setup and operational steps involved.

14 December 2024 11:30

Here’s an overview of the key concepts and topics:

1. Data Format
• Refers to the structure or organization of data to make it usable and analyzable. Examples:
○ Structured Data: Stored in relational databases (rows/columns).
○ Semi-Structured Data: JSON, XML, etc.
○ Unstructured Data: Logs, images, audio, video.

2. Analyzing Data with Hadoop


• Hadoop is an open-source framework for processing and storing large datasets using a
distributed system.
• Core Components:
1. HDFS (Hadoop Distributed File System):
▪ Stores data across a cluster of machines.
▪ Provides fault tolerance and high throughput.
2. MapReduce:
▪ Programming model for processing large datasets in parallel.
▪ Splits jobs into smaller tasks (Map) and aggregates results (Reduce).
• Steps in Analyzing Data:
○ Load data into HDFS (see the sketch after this list).
○ Use MapReduce or other tools like Hive, Pig, or Spark to process the data.
○ Export the processed data for visualization or further analysis.
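For illustration, the first step (loading data into HDFS) can also be done from Java through the FileSystem API; the class name and paths below are assumed for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadToHdfs {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath for cluster settings
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into an HDFS input directory (paths are illustrative)
        fs.copyFromLocalFile(new Path("sample.txt"), new Path("/data/input/sample.txt"));
        fs.close();
    }
}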

3. Scaling Out
• Scaling out involves adding more nodes to a Hadoop cluster to handle larger datasets and
increase computational power.
○ Horizontal Scaling: Adding commodity hardware to the cluster.
○ Key advantage: Fault tolerance and linear scalability.

4. Hadoop Streaming
• A utility that allows you to write MapReduce jobs in any programming language (Python,
Ruby, Perl, etc.).
• How it Works:
○ Mapper and Reducer scripts communicate with Hadoop through standard input and
output (stdin/stdout).
○ Example:
hadoop jar /path/to/hadoop-streaming.jar \
-mapper /path/to/mapper.py \
-reducer /path/to/reducer.py \
-input /path/to/input \
-output /path/to/output

5. Hadoop Pipes
• A C++ API for writing MapReduce programs.
• Benefits:
○ Allows developers to use C++ for high-performance tasks.
○ Offers tighter integration with the Hadoop framework.

6. Design of Hadoop Distributed File System (HDFS)

• HDFS is the storage layer of Hadoop, optimized for large datasets.
• Key Features:
○ Block Storage:
▪ Files are split into blocks (default size: 128 MB) and distributed across nodes.
○ Fault Tolerance:
▪ Replicates data (default: 3 copies) to ensure reliability.
○ Write Once, Read Many:
▪ Optimized for workloads where data is written once and read multiple times.

7. HDFS Concepts
• NameNode:
○ Manages metadata (directory structure, file locations).
○ Acts as the master node.
• DataNode:
○ Stores actual data blocks.
○ Communicates with the NameNode for tasks.
• Secondary NameNode:
○ Periodically merges the NameNode's edit log into the fsimage (checkpointing); it is not a standby or failover node.
• Replication:
○ Data is replicated across multiple DataNodes for fault tolerance.

8. Java Interface for HDFS


• Applications interact with HDFS through its Java API for file operations such as reading, writing, and
deleting files.
• Example: Writing a file to HDFS using Java.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
FileSystem fs = FileSystem.get(new Configuration());
Path path = new Path("/user/data.txt");
FSDataOutputStream outputStream = fs.create(path);
outputStream.writeUTF("Hello, HDFS!");
outputStream.close();
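A companion sketch for reading the file back; it assumes the file written above and the default cluster configuration (the wrapping class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/data.txt");

        // Open the file and read back the UTF string written in the example above
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}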

9. Data Flow in Hadoop


• Write Operation:
○ Client splits the file into blocks and writes them to DataNodes.
○ Metadata is updated in the NameNode.
• Read Operation:
○ Client requests metadata from the NameNode.
○ Reads data directly from DataNodes.
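The metadata step of the read path is visible through the FileSystem API: the client asks the NameNode which DataNodes hold each block, then streams the blocks from those DataNodes. A small illustrative sketch (class name and path are assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/data.txt");

        // Metadata lookup: the NameNode reports which DataNodes hold each block
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}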

10. Hadoop I/O


• Data Integrity:
○ Ensures data consistency using checksums.
○ Data corruption is detected and repaired using replicas.
• Compression:
○ Reduces storage and speeds up data transfer.
○ Supported formats: Gzip, Snappy, LZO, etc.
• Serialization:
○ Converts objects into a format that can be stored and transmitted.
○ Hadoop uses Writable interface for serialization.
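To illustrate the Writable interface, here is a minimal custom Writable (the record fields are assumed for the example); Hadoop serializes it by calling write() and deserializes it by calling readFields():

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class WordStat implements Writable {
    private String word;
    private int count;

    public WordStat() { }                      // Writable types need a no-arg constructor

    public WordStat(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeUTF(word);
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        word = in.readUTF();
        count = in.readInt();
    }
}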

11. Avro
• A data serialization framework used in Hadoop.
• Features:
○ Compact and efficient.
○ Schema evolution (can update schema without breaking compatibility).
• Example of Avro schema:
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

12. File-Based Data Structures


• Hadoop supports various file formats:
○ Text: Plain text (inefficient for large datasets).
○ SequenceFile: Binary format for key-value pairs (see the sketch below).
○ Parquet/ORC: Columnar formats optimized for analytics.
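As a sketch of the SequenceFile format, the snippet below writes a few Text/IntWritable key-value pairs to a binary SequenceFile (class name, path, and sample pairs are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/pairs.seq");

        // Write key-value pairs in Hadoop's binary SequenceFile format
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("hadoop"), new IntWritable(1));
            writer.append(new Text("avro"), new IntWritable(2));
        }
    }
}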

13. Cassandra-Hadoop Integration


• Cassandra is a NoSQL database that can integrate with Hadoop to process data stored in
Cassandra using Hadoop's MapReduce framework.
• Integration Approaches:
○ Use Cassandra Hadoop InputFormat and OutputFormat to read/write Cassandra
tables.
○ Run Spark jobs on top of Cassandra for faster and more flexible analytics.

Hadoop MapReduce Data Flow
Data moves through the following stages from input to output; each step is explained below:

1. Input Splits
• Description:
○ The input data stored in HDFS is split into smaller chunks called input splits.
○ Each split is assigned to a single Mapper for processing.
• Key Feature:
○ Splits ensure parallelism by dividing the workload among multiple nodes.

2. Mapper Phase
• Description:
○ Each Mapper processes one input split, reading records line-by-line or based on key-
value pairs.
○ Transforms data into intermediate key-value pairs.
• Example:
○ Input: A text file of words.
○ Output: Key-value pairs like (word, 1).

3. Shuffle and Sort


• Description:
○ Intermediate key-value pairs from all Mappers are sorted and grouped by key.
○ The Shuffle step ensures that all values for a single key are sent to the same
Reducer.
• Key Operation:
○ Distributed sorting to group similar keys together.

4. Reducer Phase
• Description:
○ Reducers aggregate the grouped data received from the Shuffle phase.
○ Performs computations like counting, summing, or merging.
• Example:
○ Input: (word, [1, 1, 1]).
○ Output: (word, 3).
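Putting the Mapper and Reducer examples together, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API (class and field names are illustrative): the mapper emits (word, 1) pairs and the reducer sums them after shuffle and sort.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper phase: emits (word, 1) for every token in its input split
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer phase: receives (word, [1, 1, 1]) after shuffle/sort and emits (word, 3)
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}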

5. Final Output
• Description:
○ The results from Reducers are written back to HDFS.
○ Output can be used for further processing or analysis.


Avro: Key Features
• Schema-Based:
○ Avro uses JSON-based schemas to define the structure of the data.
○ Schemas allow for both forward and backward compatibility when data structures evolve over time.
• Compact and Efficient:
○ Data is stored in binary format, minimizing size and improving processing speed.
• Interoperability:
○ Avro supports multi-language bindings, allowing data exchange between applications written in different programming languages (Java, Python, etc.).
• Self-Describing:
○ Data files contain the schema, so they can be read without external metadata.

Data Serialization and Deserialization
• Serialization:
○ Converts data into a compact binary format.
○ Useful for storage or transmission.
• Deserialization:
○ Converts binary Avro data back into its original structure using the schema.
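A minimal round-trip sketch using Avro's Java generic API and the User schema from earlier (file name and record values are illustrative):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
    private static final String USER_SCHEMA =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(USER_SCHEMA);

        // Serialization: write one record to a binary Avro data file (the schema is embedded)
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialization: read records back using the schema stored in the file
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("name") + " is " + record.get("age"));
            }
        }
    }
}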

Use Cases
1. Data Storage:
○ Efficient storage of structured data in HDFS or other systems.
2. Data Transfer:
○ Used in Kafka or REST APIs for transferring compact data.
3. Schema Evolution:
○ Handles changes to data structure without breaking existing systems.

Advantages
• Compact and fast.
• Schema evolution support.
• Language-agnostic.
• Optimized for Hadoop and distributed systems.
Limitations
• Requires schema management.
• Less flexible for unstructured data.

Cassandra and Hadoop Integration


Cassandra and Hadoop integration is used to leverage the distributed data storage capabilities of
Cassandra with the data processing power of Hadoop. This integration helps businesses process large-
scale, distributed data efficiently.
Cassandra, being a NoSQL database, offers real-time data management, while Hadoop excels at batch
processing and analytics. Combining the two systems creates a hybrid architecture suitable for high-
throughput real-time applications and analytics.

Why Integrate Cassandra with Hadoop?


1. Real-Time + Batch Processing:
○ Cassandra supports real-time applications with low latency.
○ Hadoop is ideal for offline batch processing of massive datasets.
2. Distributed and Scalable Systems:
○ Both systems are designed for scalability and fault tolerance, making them a natural fit for
large-scale data processing.
3. Unified Data Management:
○ Cassandra stores raw, semi-structured, or structured data, while Hadoop analyzes and
transforms the data stored in Cassandra.

Components Used in Cassandra-Hadoop Integration


1. Hadoop MapReduce:
○ Used for distributed batch processing of data stored in Cassandra.
○ Provides a mechanism to process large datasets in parallel.
2. Hadoop InputFormat/OutputFormat:
○ CassandraInputFormat: Reads data from Cassandra tables to Hadoop jobs.
○ CassandraOutputFormat: Writes processed data from Hadoop jobs back to Cassandra
tables.
3. Hive with Cassandra (Optional):
○ Apache Hive, a data warehouse system, can run SQL-like queries on data stored in
Cassandra.
4. Spark (Optional for Real-Time Processing):
○ Apache Spark is often integrated into this workflow for real-time analytics on Cassandra
data.

Architecture Overview
The integration typically works as follows:
1. Data Ingestion:
○ Raw data is ingested into Cassandra via APIs or streaming platforms.
○ Cassandra acts as the primary storage for transactional or semi-structured data.
2. Hadoop MapReduce Job:
○ Data is read from Cassandra tables using CassandraInputFormat.
○ A MapReduce job performs the required transformations or analytics.
○ Results are written back to Cassandra using CassandraOutputFormat or to HDFS for further
processing.
3. Optional Data Pipeline:
○ Data processed by Hadoop can be stored in HDFS or used for ETL (Extract, Transform, Load)
pipelines.
○ Hive or Pig can be used for querying and additional analysis.

Steps for Cassandra and Hadoop Integration


1. Prerequisites
• Cassandra Cluster:
○ Ensure that Apache Cassandra is installed and running on your cluster.
• Hadoop Cluster:
○ Set up a Hadoop cluster with HDFS and MapReduce configured.
• Hadoop Cassandra Connector:
○ Use the Hadoop-Cassandra integration connector provided by Apache.

2. Configuring Cassandra Input and Output Format


1. CassandraInputFormat:
○ Reads data from Cassandra tables into Hadoop jobs.
○ Example Configuration:
conf.set("cassandra.input.split.size", "64"); // Splits data into chunks
ConfigHelper.setInputColumnFamily(conf, "keyspace_name", "table_name");
ConfigHelper.setInputInitialAddress(conf, "127.0.0.1"); // Cassandra node IP
2. CassandraOutputFormat:
○ Writes processed results from Hadoop back to Cassandra.
○ Example Configuration:
ConfigHelper.setOutputColumnFamily(conf, "keyspace_name", "output_table_name");
ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1"); // Cassandra node IP

3. Writing a MapReduce Job for Cassandra


A typical MapReduce job consists of:
• Mapper:
○ Reads data from Cassandra using CassandraInputFormat.
• Reducer:
○ Performs aggregation or computation.
• Output:
○ Writes results back to Cassandra using CassandraOutputFormat.
Example Code Snippet:
public class CassandraHadoopExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setInputFormatClass(CassandraInputFormat.class);
        job.setOutputFormatClass(CassandraOutputFormat.class);

        // Set Mapper and Reducer classes
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Configure Cassandra input
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "keyspace_name", "table_name");

        // Configure Cassandra output
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "keyspace_name", "output_table_name");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4. Deploying and Running the Job


1. Compile the MapReduce job into a JAR file.
2. Submit the JAR file to the Hadoop cluster using the following command:
hadoop jar cassandra_hadoop_integration.jar MainClassName

Use Case: Analyzing User Behavior


Scenario:
• Cassandra stores clickstream data for an e-commerce website.
• Hadoop processes this data to find popular products.
Steps:
1. Store Raw Data:
○ Clickstream data is ingested into a Cassandra table (clickstream).
2. Process Data with Hadoop:
○ Read data using CassandraInputFormat.
○ Perform analytics using a MapReduce job (e.g., count product clicks).
○ Write results back to Cassandra (popular_products table).

Advantages of Cassandra-Hadoop Integration


1. Real-Time and Batch Hybrid:
○ Cassandra handles real-time workloads; Hadoop provides offline analytics.
2. High Scalability:
○ Both Cassandra and Hadoop are highly scalable and work well for distributed systems.
3. Efficient Data Movement:
○ Input and output formats ensure seamless integration without additional ETL overhead.
4. Fault Tolerance:
○ Both systems replicate data and jobs for fault-tolerant operations.

Challenges
1. Setup Complexity:
○ Integrating and maintaining the two systems can be challenging.
2. Query Limitations:
○ Cassandra’s query capabilities are limited compared to traditional SQL databases.
3. Performance Tuning:
○ Ensuring optimal performance for both systems may require significant tuning.
