Hadoop I/O
Definition
Hadoop I/O refers to the framework and mechanisms used for data input and output operations in the
Hadoop ecosystem. Efficient I/O is critical in distributed systems like Hadoop to ensure scalable and reliable
data processing. Hadoop provides a set of libraries and utilities to handle various data formats, serialization
frameworks, and compression methods for optimal storage and transmission.
Data Integrity
Data integrity in Hadoop ensures that the data being read or written is accurate and uncorrupted. Hadoop
uses checksums to verify the correctness of data blocks. Each file in HDFS is divided into blocks, and for
every block, a checksum is calculated and stored separately. When the data is read, the checksum is
recalculated and compared to the stored value. If a mismatch occurs, the system attempts to read the block
from another replica, thereby ensuring fault tolerance and reliability.
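As a minimal sketch of how this looks from the client side, the Java snippet below reads a file through the FileSystem API with checksum verification left at its default (enabled). The path /data/sample.txt is an illustrative placeholder, not part of the original material.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Checksum verification is on by default; it can be disabled with
        // setVerifyChecksum(false), e.g. when trying to salvage a corrupt file.
        fs.setVerifyChecksum(true);

        Path file = new Path("/data/sample.txt");   // hypothetical path
        try (FSDataInputStream in = fs.open(file)) {
            // If a recomputed checksum does not match the stored one, the read
            // fails with a ChecksumException and HDFS falls back to another replica.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}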
Hadoop Local File System
The Hadoop Local File System is a non-distributed file system abstraction over the local disk of a
single machine, used primarily for temporary data such as intermediate job outputs. It is not suitable for
large-scale distributed data storage. MapReduce tasks typically use it to spill and store intermediate map
output locally before it is shuffled to reducers, with final results then written to HDFS. Although it does not
offer replication and fault tolerance like HDFS, it provides fast local read/write operations that are critical
for performance in processing pipelines.
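The sketch below shows how the local file system is accessed through the same FileSystem API; the scratch path is a hypothetical example. Hadoop's LocalFileSystem adds client-side checksums (a hidden .crc file next to each data file), while getRawFileSystem() bypasses that layer.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFsExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // LocalFileSystem wraps the operating system's file system
        // and adds client-side checksum files (.crc).
        LocalFileSystem localFs = FileSystem.getLocal(conf);

        Path scratch = new Path("/tmp/hadoop-demo/scratch.txt");  // hypothetical path
        try (FSDataOutputStream out = localFs.create(scratch, true)) {
            out.writeUTF("intermediate data kept on the local disk");
        }

        // The raw local file system skips the checksum layer when raw speed matters.
        FileSystem raw = localFs.getRawFileSystem();
        System.out.println("Exists on raw local FS: " + raw.exists(scratch));
    }
}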
Compression
Compression in Hadoop reduces the size of data stored and transmitted across the network, improving
performance and reducing disk I/O. Hadoop supports various compression codecs such as Gzip, Bzip2, LZO,
and Snappy. Compression can be applied at different stages:
- Input Compression: Reduces storage and bandwidth when reading input files.
- Intermediate Compression: Compresses intermediate MapReduce outputs.
- Output Compression: Minimizes size of final job output.
Proper use of compression increases throughput but may add CPU overhead during
compression/decompression.
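As a minimal sketch of how compression is enabled when configuring a MapReduce job, the snippet below turns on intermediate (map output) and final output compression. The codec choices (Snappy for intermediate data, Gzip for output) and the job name are illustrative assumptions, not requirements.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Intermediate compression: compress map outputs before the shuffle.
        // Snappy trades a lower compression ratio for very low CPU cost,
        // but requires the native Snappy library to be available.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression-demo");   // hypothetical job name

        // Output compression: compress the final job output files.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}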
Serialization
Serialization is the process of converting data structures or objects into a format that can be stored or
transmitted and reconstructed later. Hadoop relies on serialization to transfer data between nodes during a
MapReduce job. Writable is Hadoop's native serialization mechanism, providing efficient, compact binary
representations. A serialization format used with Hadoop should be compact, fast, extensible (so schemas
can evolve across versions), and interoperable across languages.
Common serialization frameworks used in Hadoop:
- Writable (native)
- Avro
- Protocol Buffers
- Thrift
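To make the Writable contract concrete, the sketch below defines a hypothetical custom Writable with two fields; the class and field names are illustrative. The key requirement is that write() and readFields() handle the fields in exactly the same order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class PageViewWritable implements Writable {
    private Text url = new Text();
    private IntWritable views = new IntWritable();

    // Serialize the fields in a fixed order to a compact binary form.
    @Override
    public void write(DataOutput out) throws IOException {
        url.write(out);
        views.write(out);
    }

    // Deserialize the fields in exactly the same order.
    @Override
    public void readFields(DataInput in) throws IOException {
        url.readFields(in);
        views.readFields(in);
    }
}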
Avro
Avro is a serialization framework developed within the Hadoop ecosystem, used for compact, fast, binary
data serialization. It uses JSON for defining schemas and supports schema evolution, making it highly
suitable for big data.
Features of Avro:
- Row-based storage format.
- Supports dynamic typing through schemas.
- Enables inter-language communication (e.g., Java and Python).
- Facilitates big data exchange between systems using different programming languages.
- Efficient serialization with minimal overhead.
Avro is often used in Kafka, Hive, and Pig as well as for storing log data and in data lake solutions.
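As an illustration of schema-based serialization, the sketch below parses a JSON schema and binary-encodes a single record using Avro's generic API. The User schema and its fields are hypothetical examples.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializeExample {
    public static void main(String[] args) throws Exception {
        // Schemas are defined in JSON; this "User" record is a made-up example.
        String schemaJson = "{"
                + "\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record dynamically; no generated classes are required.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Binary-encode the record; a reader with an evolved but compatible
        // schema can still decode it (schema evolution).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();

        System.out.println("Serialized bytes: " + out.size());
    }
}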