Job Scheduling in MapReduce
Definition
Job scheduling in MapReduce is the mechanism by which Hadoop decides the order in which submitted jobs
run and how cluster resources (such as CPU and memory) are allocated to them. It determines how fairly
and efficiently those resources are shared among multiple users and their applications. The goals are to
maximize throughput, minimize response time, and provide fairness and resource guarantees where required.
MapReduce Algorithm
MapReduce is a data processing paradigm for distributed computation on large datasets across a
Hadoop cluster. It consists of three phases:
- Map Phase: Processes input data to generate intermediate key-value pairs.
- Shuffle and Sort Phase: Intermediate data is sorted and grouped by key.
- Reduce Phase: Aggregates values associated with a specific key to produce the final output.
In this context, the scheduler decides which map and reduce tasks run on which available nodes, and when.
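The three phases above can be sketched on a single machine with a word count. This is only an illustration of the data flow, not Hadoop's API: the function names (map_phase, shuffle_and_sort, reduce_phase, word_count) are made up for this example, and a real cluster would run many map and reduce tasks in parallel.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit one intermediate (word, 1) pair per word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group the values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into one final result."""
    return {key: sum(values) for key, values in grouped}


def word_count(lines):
    """Chain the three phases in order (hypothetical helper)."""
    return reduce_phase(shuffle_and_sort(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```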
Hadoop Schedulers
Schedulers in Hadoop manage how jobs are assigned to resources. They aim to enforce policies such as
fairness, prioritization, and guaranteed capacities. Hadoop supports different types of schedulers to match
varying workload and resource-sharing requirements.
Types of Job Scheduling in MapReduce
1. FIFO Scheduler
- First-In-First-Out (FIFO) was the default scheduler in early Hadoop versions.
- Jobs are placed in a single queue and executed in the order of submission.
- It is simple to understand and easy to implement.
- Provides no fairness: a short job submitted after a long job must wait until the long job finishes.
- Poorly suited to multi-user or multi-tenant clusters, because one user's jobs can monopolize the cluster.
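The delay that FIFO imposes on short jobs can be seen in a small simulation. It is a simplified sketch, assuming every job is submitted at time 0 and uses the whole cluster while it runs; the function name fifo_finish_times and the job durations are invented for illustration.

```python
def fifo_finish_times(jobs):
    """Return {job_name: finish_time} for jobs run strictly in submission order.

    jobs is a list of (name, duration) tuples, already in submission order.
    Assumption: each job occupies the whole cluster, so jobs never overlap.
    """
    clock = 0
    finish = {}
    for name, duration in jobs:
        clock += duration  # the next job starts only when this one ends
        finish[name] = clock
    return finish


if __name__ == "__main__":
    # A 5-minute job submitted behind a 100-minute job finishes at minute 105.
    print(fifo_finish_times([("long", 100), ("short", 5)]))
    # Submitted first, the same short job would finish at minute 5.
    print(fifo_finish_times([("short", 5), ("long", 100)]))
```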
2. Capacity Scheduler
- Designed to allow sharing of cluster resources among multiple organizations.
- Cluster is divided into multiple queues, each with a guaranteed capacity.
- Queues can have sub-queues to provide more granular resource control.
- Unused capacity in one queue can be temporarily allocated to other queues.
- Supports user-based access control and job priorities.
- Encourages multi-tenancy and fair resource distribution.
- Suitable for enterprise environments with strict capacity guarantees.
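The idea of guaranteed capacity with borrowing can be sketched as a two-step allocation. This is a deliberately simplified model and not the Capacity Scheduler's actual algorithm: it assumes capacity is counted in abstract "slots", that the guarantees add up to no more than the cluster size, and that spare capacity is lent in the order the queues are listed. The function name capacity_allocate and the queue names are hypothetical.

```python
def capacity_allocate(total_slots, queues):
    """Allocate slots to queues with guaranteed capacity and borrowing.

    queues maps a queue name to (guaranteed_slots, demanded_slots).
    Assumption: the sum of all guarantees does not exceed total_slots.
    """
    # Step 1: every queue receives its demand, capped at its guarantee.
    alloc = {name: min(guaranteed, demand)
             for name, (guaranteed, demand) in queues.items()}
    spare = total_slots - sum(alloc.values())

    # Step 2: capacity left unused is lent to queues that still want more.
    for name, (_, demand) in queues.items():
        extra = min(spare, demand - alloc[name])
        alloc[name] += extra
        spare -= extra
    return alloc


if __name__ == "__main__":
    # "prod" uses only 20 of its 60 guaranteed slots, so "dev" can borrow.
    print(capacity_allocate(100, {"prod": (60, 20), "dev": (40, 70)}))
```

When both queues are busy, each is held to its guarantee; borrowing happens only while another queue leaves capacity idle.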
3. Fair Scheduler
- Developed by Facebook to provide fair sharing of resources among all running jobs.
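- Aims to give every user or job an approximately equal share of resources, averaged over time.
- Supports job pools, each with a guaranteed minimum share and a fair share of the remaining capacity.
- Allows preemption: if a pool stays below its share for too long, the scheduler may kill tasks of pools
  running above their share; the killed tasks are rerun later.
- Can be configured with pool weights and job priorities to favor some workloads and keep short,
  interactive jobs responsive.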
- Best suited for environments with mixed workloads and multiple users.
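Fair sharing is commonly explained as max-min fairness: capacity is split equally, and whatever a pool does not need is redistributed among the pools that still want more. The sketch below shows that idea only. It ignores minimum shares, weights, and preemption, so it should not be read as the Fair Scheduler's real implementation; the function name fair_shares and the pool names are invented for illustration.

```python
def fair_shares(total_slots, demands):
    """Split total_slots among pools by max-min fairness.

    demands maps a pool name to the number of slots it could use.
    Simplification: no minimum shares, weights, or preemption.
    """
    shares = {name: 0.0 for name in demands}
    active = {name for name, demand in demands.items() if demand > 0}
    remaining = float(total_slots)

    while active and remaining > 1e-9:
        equal = remaining / len(active)
        # Pools whose whole demand fits inside an equal split are satisfied.
        satisfied = {n for n in active if demands[n] - shares[n] <= equal}
        if not satisfied:
            # Everyone wants more than an equal split: divide evenly and stop.
            for n in active:
                shares[n] += equal
            break
        for n in satisfied:
            remaining -= demands[n] - shares[n]
            shares[n] = float(demands[n])
        active -= satisfied  # their leftover is re-split among the rest
    return shares


if __name__ == "__main__":
    # Pool "a" needs little, so "b" and "c" split what it leaves unused.
    print(fair_shares(100, {"a": 10, "b": 50, "c": 80}))
```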
Advantages
- Ensures fair sharing of resources among users.
- Supports priorities, allowing urgent jobs to be prioritized.
- Enhances resource utilization and system throughput.
- Supports multi-tenancy and queue-based resource management.
- Elastic resource sharing (e.g., Capacity Scheduler allows borrowing unused capacity).
- Fair Scheduler improves responsiveness of short, interactive jobs.
- Adaptable to both small and large-scale cluster environments.
Disadvantages
- FIFO provides no fairness; at most, job priorities reorder the queue and cannot preempt a running job.
- Configuration of Capacity and Fair schedulers can be complex.
- Requires careful tuning to avoid resource starvation or imbalance.
- Preemption can kill long-running tasks, so their completed work is lost and must be redone.
- Monitoring and managing multiple queues and pools can add overhead.
- Improper setup may lead to inefficient cluster usage or unfair resource distribution.