Add best-effort cleanup to EmrCreateJobFlowOperator on post-creation failure #61010
Attempt best-effort termination of EMR clusters when failures occur after successful job flow creation. Cleanup does not mask the original exception.
@SameerMesiah97 Could you please resolve conflicts? Edit: I took care of it so I could include it in the upcoming release (used Gen. AI, GitHub Copilot + Claude Sonnet 4.6)
# Conflicts:
# providers/amazon/src/airflow/providers/amazon/aws/operators/emr.py
# providers/amazon/tests/unit/amazon/aws/operators/test_emr_create_job_flow.py
So you no longer want me to fix it?
No need, I took care of it as I'm starting the release process very soon :)
Description
Added best-effort cleanup to EmrCreateJobFlowOperator to terminate EMR clusters when failures occur after successful cluster creation. Cleanup behavior is guarded by a flag and is enabled by default.
In certain failure modes, the operator could previously create a cluster via create_job_flow and then fail during later execution steps (for example, while waiting for completion when DescribeCluster permissions are missing). In these cases, the task failed while leaving the cluster running. The operator now attempts to terminate the created job flow if an exception is raised after creation. Cleanup is best-effort and does not override or mask the original exception.
This change applies a failure-handling approach similar to the one recently introduced for EC2CreateInstanceOperator in PR #60904. However, cleanup is only triggered for post-start EMR job flow failures (including waiter-related errors), so termination is attempted only when a job flow was successfully created, and non-AWS exceptions are not intercepted.
Rationale
EmrCreateJobFlowOperator is responsible for provisioning and coordinating an external, stateful service whose lifecycle extends beyond task execution. If the task fails after cluster creation, Airflow can no longer reliably manage or observe the cluster’s state. Adding opportunistic cleanup in these scenarios reduces the risk of orphaned EMR clusters and unexpected infrastructure costs, while preserving existing failure semantics. Cleanup errors are logged and do not affect the task’s final failure state.
Restricting cleanup to post-creation EMR job flow failures prevents unintended termination in unrelated failure paths while still addressing orphaned job flows created during execution.
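The failure semantics described above can be illustrated with a minimal, self-contained sketch. It is not the operator's actual code: FakeEmrClient, FakeWaiterError, and create_and_wait are hypothetical stand-ins invented for illustration, and only the flag name terminate_job_flow_on_failure comes from this PR. The sketch assumes that cleanup runs only once a job flow ID exists, that a failed cleanup is logged, and that the original error is re-raised in every case.

```python
import logging

log = logging.getLogger(__name__)


class FakeWaiterError(Exception):
    """Hypothetical stand-in for a waiter-related AWS error."""


class FakeEmrClient:
    """Hypothetical in-memory stand-in for an EMR client (not the boto3 API)."""

    def __init__(self, fail_terminate=False):
        self.fail_terminate = fail_terminate
        self.terminated = []

    def create_job_flow(self):
        return "j-EXAMPLE"

    def wait_for_completion(self, job_flow_id):
        # Simulate a post-creation failure, e.g. missing DescribeCluster permissions.
        raise FakeWaiterError(f"cannot describe {job_flow_id}")

    def terminate_job_flow(self, job_flow_id):
        if self.fail_terminate:
            raise RuntimeError("terminate call failed")
        self.terminated.append(job_flow_id)


def create_and_wait(client, terminate_job_flow_on_failure=True):
    # A failure here means no job flow exists, so no cleanup is attempted.
    job_flow_id = client.create_job_flow()
    try:
        client.wait_for_completion(job_flow_id)
    except FakeWaiterError:
        if terminate_job_flow_on_failure:
            try:
                client.terminate_job_flow(job_flow_id)
            except Exception:
                # Best-effort: log the cleanup failure and fall through.
                log.exception("Failed to terminate job flow %s", job_flow_id)
        # The bare raise re-raises the original waiter error, not the cleanup error.
        raise
    return job_flow_id
```

Catching only the waiter-style error in the outer handler mirrors the stated intent of not intercepting unrelated exceptions; a broader handler would change which failures trigger termination.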
Tests
Documentation
The docstring for EmrCreateJobFlowOperator has been updated with a brief description of the new flag terminate_job_flow_on_failure.
Backwards Compatibility
A new flag called terminate_job_flow_on_failure has been added to EmrCreateJobFlowOperator with a default setting of True. Cleanup will now be attempted on a best-effort basis if a WaiterError is encountered.
Reproducibility
The failure scenario could not be reproduced directly due to personal AWS account permissions. However, based on the current control flow of EmrCreateJobFlowOperator, it is possible for cluster creation to succeed while a later step fails, leaving the EMR cluster running without cleanup. This change defensively addresses that case. Contributors reading this PR are welcome to provide a reproduction of this failure mode if they can.
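Without AWS access, the failure mode can at least be simulated locally with unittest.mock. The sketch below is an assumption-laden illustration rather than the PR's test code: run_with_cleanup and SimulatedAccessDenied are hypothetical, and the client method names (run_job_flow, get_waiter, terminate_job_flows) are chosen to resemble boto3's EMR client, although on a MagicMock any attribute name would be accepted.

```python
from unittest import mock


class SimulatedAccessDenied(Exception):
    """Hypothetical stand-in for the error raised when DescribeCluster is denied."""


def run_with_cleanup(emr):
    """Simplified stand-in for the operator's control flow (not the real implementation)."""
    job_flow_id = emr.run_job_flow()["JobFlowId"]
    try:
        emr.get_waiter("cluster_running").wait(ClusterId=job_flow_id)
    except SimulatedAccessDenied:
        # Cluster exists but the wait failed: attempt termination, then re-raise.
        emr.terminate_job_flows(JobFlowIds=[job_flow_id])
        raise
    return job_flow_id


# Creation succeeds, but the subsequent wait fails.
emr = mock.MagicMock()
emr.run_job_flow.return_value = {"JobFlowId": "j-EXAMPLE"}
emr.get_waiter.return_value.wait.side_effect = SimulatedAccessDenied("DescribeCluster denied")

try:
    run_with_cleanup(emr)
except SimulatedAccessDenied:
    pass  # the original error is expected to propagate

# Termination was attempted exactly once for the created job flow.
emr.terminate_job_flows.assert_called_once_with(JobFlowIds=["j-EXAMPLE"])
```

A simulation like this only exercises the control flow; it does not confirm how a real AWS account behaves when DescribeCluster permissions are missing.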