DAGs: The Definitive Guide
Revised Edition
Everything you need to know about Airflow DAGs
Powered by Astronomer
Editor’s Note
Welcome to the ultimate guide to Apache Airflow DAGs, brought to you
by the Astronomer team. This ebook covers everything you need to know
to work with DAGs, from the building blocks that make them up to best
practices for writing them, dynamically generating them, testing and
debugging them, and more. It’s a guide written by practitioners
for practitioners.
Contents
DAG Design
DAG Writing Best Practices in Apache Airflow
Passing Data Between Airflow Tasks
Using Task Groups in Airflow
Cross-DAG Dependencies
1. DAGs
Where to Begin?
Directed Acyclic Graph
What Exactly is a DAG?
A DAG is a Directed Acyclic Graph — a conceptual representation of a series of activities, or, in other words, a mathematical abstraction of a data pipeline. Although used in different circles, both terms, DAG and data pipeline, represent an almost identical mechanism. In a nutshell, a DAG (or a pipeline) defines a sequence of execution stages in any non-recurring algorithm.
DIRECTED — In general, if multiple tasks exist, each must have at least one defined upstream (previous) or downstream (subsequent) task, or one or more of both. (It's important to note, however, that there are also DAGs that have multiple parallel tasks — meaning no dependencies.)
ACYCLIC — No task can create data that goes on to reference itself. That could cause an infinite loop, which could give rise to a problem or two. There are no cycles in DAGs.
"At Astronomer, we believe using a code-based data pipeline tool like Airflow should be a standard," says Kenten Danas, Lead Developer Advocate at Astronomer. There are many reasons for this, but the high-level concepts covered in the DAGs in Airflow section below are crucial.
An Example of a DAG
[Diagram: a directed graph whose nodes (A, B, D, E, F, G) are connected by arrows]
Consider the directed acyclic graph above. In this DAG, each edge (line) has a specific direction (denoted by the arrow) connecting different nodes. This is the key quality of a directed graph: data can flow only in the direction of the edge. In this example, data can go from A to B, but never B to A. In the same way that water flows through pipes in one direction, data must follow the direction defined by the graph. Nodes from which a directed edge extends are considered upstream, while nodes at the receiving end of an edge are considered downstream.
Why must this be true for data pipelines? If F had a downstream process
in the form of D, we would see a graph where D informs E, which informs F,
which informs D, and so on. It creates a scenario where the pipeline could
run indefinitely without ever ending. Like water that never makes it to the
faucet, such a loop would be a waste of data flow.
To put this example in real-world terms, imagine the DAG above represents a
data engineering story:
DAGs in Airflow
• DAG dependencies ensure that your data tasks are executed in the same
order every time, making them reliable for your everyday data infrastructure.
• The graphing component of DAGs allows you to visualize dependencies in
Airflow’s user interface.
• Because every path in a DAG is linear, it’s easy to develop and test your
data pipelines against expected outcomes.
An Airflow DAG starts with a task written in Python. You can think of tasks as the nodes of your DAG: each one represents a single action, and it can have both upstream and downstream dependencies.
Tasks are wrapped by operators, which are the building blocks of Airflow,
defining the behavior of their tasks. For example, a Python Operator task will
execute a Python function, while a task wrapped in a Sensor Operator will
wait for a signal before completing an action.
The following diagram shows how these concepts work in practice. As you
can see, by writing a single DAG file in Python, you can begin to define
complex relationships between data and actions.
[Diagram: a DAG containing operators, each of which wraps a task]
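To make these concepts concrete, here is a minimal sketch (not taken from the original example) of a DAG file with two operators, each instantiation becoming a task; the DAG name, command, and function are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def _process():
    # Placeholder for the "single action" a task performs.
    print("processing data")


with DAG(
    dag_id="intro_example_dag",  # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each operator instantiation is a task, i.e. a node in the DAG.
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    process = PythonOperator(task_id="process", python_callable=_process)

    # The >> operator draws the directed edge: extract runs upstream of process.
    extract >> process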
You can see the flexibility of DAGs in the following real-world example:
Using a single DAG (like the Customer Operations one shown in yellow), you
are able to:
• Extract data from a legacy data store and load it into an AWS S3 bucket.
• Either train a data model or complete a data transformation, depending
on the data you’re using.
• Store the results of the previous action in a database.
• Send information about the entire process to various metrics and
reporting systems.
From Operators to DagRuns:
Implementing DAGs in Airflow
While DAGs are simple structures, defining them in code requires some more
complex infrastructure and concepts beyond nodes and vertices. This is
especially true when you need to execute DAGs on a frequent,
reliable basis.
• Tasks are nodes in a DAG.
In Airflow, a DAG is a group of tasks that have been configured to run in
a directed, acyclic manner. Airflow’s Scheduler parses DAGs to find tasks
which are ready for execution based on their dependencies. If a task is
ready for execution, the Scheduler sends it to an Executor.
2. DAG Building Blocks
Scheduling and Timetables in Airflow
One of the fundamental features of Apache Airflow is the ability to schedule jobs. Historically, Airflow users could schedule their DAGs by specifying a schedule with a cron expression, a timedelta object, or a preset Airflow schedule.
Additionally, Airflow 2.4 introduced datasets and the ability to schedule your DAGs on updates to a dataset rather than a time-based schedule. A more in-depth explanation of these features can be found in the Datasets and Data Driven Scheduling in Airflow guide.
In this guide, we'll walk through Airflow scheduling concepts and the different ways you can schedule a DAG, with a focus on timetables. For additional instructions, check out our Scheduling in Airflow webinar.
Assumed knowledge
To get the most out of this guide, you should have knowledge of:
Scheduling concepts
There are a couple of terms and parameters in Airflow that are important to
understand related to scheduling.
• Data Interval: The data interval is a property of each DAG run that represents the period of data that each task should operate on. For example, for a DAG scheduled hourly, each data interval will begin at the top of the hour (minute 0) and end at the close of the hour (minute 59). The DAG run is typically executed at the end of the data interval, depending on whether your DAG's schedule has "gaps" in it.
• Logical Date: The logical date of a DAG run is the same as the start of
the data interval. It does not represent when the DAG will actually be
executed. Prior to Airflow 2.2, this was referred to as the execution date.
• Timetable: The timetable is a property of a DAG that dictates the data
interval and logical date for each DAG run (i.e. it determines when a
DAG will be scheduled).
• Run After: The earliest time the DAG can be scheduled. This date is
shown in the Airflow UI, and may be the same as the end of the data
interval depending on your DAG’s timetable.
• Backfilling and Catchup: We won’t cover these concepts in depth here,
but they can be related to scheduling. We recommend reading the
Apache Airflow documentation on them to understand how they work
and whether they’re relevant for your use case.
Parameters
The following parameters are derived from the concepts described above
and are important for ensuring your DAG runs at the correct time.
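The parameter table from the original layout isn't reproduced here. As a hedged sketch, these are the DAG arguments those concepts typically map to in Airflow 2.4+ (the DAG name and values are illustrative):

from datetime import datetime, timedelta

from airflow import DAG

with DAG(
    dag_id="scheduling_params_example",   # hypothetical name
    start_date=datetime(2022, 8, 1),      # earliest logical date to consider
    end_date=None,                        # optional last date to schedule
    schedule=timedelta(minutes=5),        # cron string, preset, timedelta, timetable, or list of datasets
    catchup=False,                        # whether to backfill missed data intervals
) as dag:
    ...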
Example
If we look at the next DAG run in the UI, the logical date is 2022-08-28 22:42:33, which is shown as the Next Run timestamp in the UI. This is 5 minutes after the previous logical date, and the same as the Data interval end of the last DAG run because there are no gaps in the schedule. If we hover over Next Run, we can see that Run After, which is the date and time that the next DAG run will actually start, is also the same as the next DAG run's Data interval end:
In summary, we've described two DAG runs:
In the sections below, we'll walk through how to use cron-based schedules, timetables, or datasets to schedule your DAG.
Cron-based schedules
For pipelines with simple scheduling needs, you can define a schedule in
your DAG using:
• A cron expression.
• A cron preset.
• A timedelta object.
Setting a cron-based schedule
Cron expressions
You can pass any cron expression as a string to the schedule parameter in
your DAG. For example, if you want to schedule your DAG at 4:05 AM every
day, you would use schedule='5 4 * * *'.
If you need help creating the correct cron expression, crontab guru is a great
resource.
Cron presets
Airflow can utilize cron presets for common, basic schedules. For example, schedule='@hourly' will schedule the DAG to run at the beginning of every hour. For the full list of presets, check out the Airflow documentation. If your DAG does not need to run on a schedule and will only be triggered manually or externally triggered by another process, you can set schedule=None.
Timedelta objects
If you want to schedule your DAG on a particular cadence (hourly, every 5 minutes, etc.) rather than at a specific time, you can pass a timedelta object imported from the datetime package to the schedule parameter. For example, schedule=timedelta(minutes=30) will run the DAG every thirty minutes, and schedule=timedelta(days=1) will run the DAG every day.
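A hedged sketch pulling the three options above together; the DAG names are illustrative and only the schedule argument changes:

from datetime import datetime, timedelta

from airflow import DAG

# Cron expression: run at 4:05 AM every day.
with DAG("cron_expression_dag", start_date=datetime(2022, 1, 1), schedule='5 4 * * *', catchup=False):
    ...

# Cron preset: run at the beginning of every hour.
with DAG("cron_preset_dag", start_date=datetime(2022, 1, 1), schedule='@hourly', catchup=False):
    ...

# Timedelta object: run every thirty minutes.
with DAG("timedelta_dag", start_date=datetime(2022, 1, 1), schedule=timedelta(minutes=30), catchup=False):
    ...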
Cron-based schedules & the logical date
Airflow was originally developed for ETL under the expectation that data is
constantly flowing in from some source and then will be summarized on a
regular interval. If you want to summarize Monday’s data, you can only do
it after Monday is over (Tuesday at 12:01 AM). However, this assumption has
turned out to be ill-suited to the many other things Airflow is being used for
now. This discrepancy is what led to Timetables, which were introduced in
Airflow 2.2.
Each DAG run therefore has a logical_date that is separate from the time that the DAG run is expected to begin (logical_date was called execution_date before Airflow 2.2). A DAG run is not actually allowed to run until the
logical_date for the following DAG run has passed. So if you are running a
daily DAG, Monday’s DAG run will not actually execute until Tuesday. In this
example, the logical_date would be Monday 12:01 AM, even though the
DAG run will not actually begin until Tuesday 12:01 AM.
If you want to pass a timestamp to the DAG run that represents “the earliest
time at which this DAG run could have started”, use {{ next_ds }} from the
jinja templating macros.
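As a hedged sketch of how that macro might be used, here is an illustrative task that simply echoes the value (the task itself is not from the original text):

from airflow.operators.bash import BashOperator

# {{ next_ds }} renders to the logical date of the following DAG run,
# i.e. roughly the earliest time this run could actually have started.
print_earliest_start = BashOperator(
    task_id="print_earliest_start",
    bash_command="echo 'This run could not have started before {{ next_ds }}'",
)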
Limitations of cron-based schedules
In the next section, we’ll describe how these limitations were addressed in
Airflow 2.2 with the introduction of timetables.
Timetables
Timetables, introduced in Airflow 2.2, address the limitations of cron expressions and timedelta objects by allowing users to define their own schedules in Python code. All DAG schedules are ultimately determined by their internal timetable, and if a cron expression or timedelta object is not sufficient for your use case, you can define your own.
Custom timetables can be registered as part of an Airflow plugin. They must be a subclass of Timetable, and they should contain the following methods, both of which return a DataInterval with a start and an end:
For this implementation, let's run our DAG at 6:00 and 16:30. Because this schedule has run times with differing hours and minutes, it can't be represented by a single cron expression. But we can easily implement this schedule with a custom timetable!
• Run at 6:00: Data interval is from 16:30 on the previous day to 6:00 on the current day
• Run at 16:30: Data interval is from 6:00 to 16:30 on the current day
With that in mind, first we'll define next_dagrun_info. This method provides Airflow with the logic to calculate the data interval for scheduled runs. It also contains logic to handle the DAG's start_date, end_date, and catchup parameters. To implement the logic in this method, we use the Pendulum package, which makes dealing with dates and times simple. The method looks like this:
def next_dagrun_info(
    self,
    *,
    last_automated_data_interval: Optional[DataInterval],
    restriction: TimeRestriction,
) -> Optional[DagRunInfo]:
    if last_automated_data_interval is not None:  # There was a previous run on the regular schedule.
        last_start = last_automated_data_interval.start
        delta = timedelta(days=1)
        if last_start.hour == 6:  # If previous period started at 6:00, next period will start at 16:30 and end at 6:00 the following day
            next_start = last_start.set(hour=16, minute=30).replace(tzinfo=UTC)
            next_end = (last_start + delta).replace(tzinfo=UTC)
        else:  # If previous period started at 16:30, next period will start at 6:00 next day and end at 16:30
            next_start = (last_start + delta).set(hour=6, minute=0).replace(tzinfo=UTC)
            next_end = (last_start + delta).replace(tzinfo=UTC)
    else:  # This is the first ever run on the regular schedule. First data interval will always start at 6:00 and end at 16:30
        next_start = restriction.earliest
        if next_start is None:  # No start_date. Don't schedule.
            return None
        if not restriction.catchup:  # If the DAG has catchup=False, today is the earliest to consider.
            next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
        next_start = next_start.set(hour=6, minute=0).replace(tzinfo=UTC)
        next_end = next_start.set(hour=16, minute=30).replace(tzinfo=UTC)
    if restriction.latest is not None and next_start > restriction.latest:
        return None  # Over the DAG's scheduled end; don't schedule.
    return DagRunInfo.interval(start=next_start, end=next_end)
• If the DAG has an end date, do not schedule the DAG after that date has passed.
Then we define the data interval for manually triggered DAG runs by defining the infer_manual_data_interval method. The code looks like this:
def infer_manual_data_interval(self, run_after: DateTime) -> DataInterval:
    delta = timedelta(days=1)
    # If time is between 6:00 and 16:30, period ends at 6am and starts at 16:30 previous day
    if run_after >= run_after.set(hour=6, minute=0) and run_after <= run_after.set(hour=16, minute=30):
        start = (run_after - delta).set(hour=16, minute=30, second=0).replace(tzinfo=UTC)
        end = run_after.set(hour=6, minute=0, second=0).replace(tzinfo=UTC)
    # If time is after 16:30 but before midnight, period is between 6:00 and 16:30 the same day
    elif run_after >= run_after.set(hour=16, minute=30) and run_after.hour <= 23:
        start = run_after.set(hour=6, minute=0, second=0).replace(tzinfo=UTC)
        end = run_after.set(hour=16, minute=30, second=0).replace(tzinfo=UTC)
    # If time is after midnight but before 6:00, period is between 6:00 and 16:30 the previous day
    else:
        start = (run_after - delta).set(hour=6, minute=0).replace(tzinfo=UTC)
        end = (run_after - delta).set(hour=16, minute=30).replace(tzinfo=UTC)
    return DataInterval(start=start, end=end)
This method figures out what the most recent complete data interval is
based on the current time. There are three scenarios:
• The current time is between 6:00 and 16:30: In this case, the data interval is from 16:30 the previous day to 6:00 the current day.
• The current time is after 16:30 but before midnight: In this case, the data
interval is from 6:00 to 16:30 the current day.
• The current time is after midnight but before 6:00: In this case, the data
interval is from 6:00 to 16:30 the previous day.
We need to account for time periods in the same timeframe (6:00 to 16:30)
on different days than the day that the DAG is triggered, which requires
three sets of logic. When defining custom timetables, always keep in mind
what the last complete data interval should be based on when the DAG
should run.
Now we can take those two methods and combine them into a Timetable
class which will make up our Airflow plugin. The full custom timetable plugin
is below:
from datetime import timedelta
from typing import Optional

from pendulum import Date, DateTime, Time, timezone

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

UTC = timezone("UTC")


class UnevenIntervalsTimetable(Timetable):

    def infer_manual_data_interval(self, run_after: DateTime) -> DataInterval:
        delta = timedelta(days=1)
        # If time is between 6:00 and 16:30, period ends at 6am and starts at 16:30 previous day
        if run_after >= run_after.set(hour=6, minute=0) and run_after <= run_after.set(hour=16, minute=30):
            start = (run_after - delta).set(hour=16, minute=30, second=0).replace(tzinfo=UTC)
            end = run_after.set(hour=6, minute=0, second=0).replace(tzinfo=UTC)
        # If time is after 16:30 but before midnight, period is between 6:00 and 16:30 the same day
        elif run_after >= run_after.set(hour=16, minute=30) and run_after.hour <= 23:
            start = run_after.set(hour=6, minute=0, second=0).replace(tzinfo=UTC)
            end = run_after.set(hour=16, minute=30, second=0).replace(tzinfo=UTC)
        # If time is after midnight but before 6:00, period is between 6:00 and 16:30 the previous day
        else:
            start = (run_after - delta).set(hour=6, minute=0).replace(tzinfo=UTC)
            end = (run_after - delta).set(hour=16, minute=30).replace(tzinfo=UTC)
        return DataInterval(start=start, end=end)

    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: Optional[DataInterval],
        restriction: TimeRestriction,
    ) -> Optional[DagRunInfo]:
        if last_automated_data_interval is not None:  # There was a previous run on the regular schedule.
            last_start = last_automated_data_interval.start
            delta = timedelta(days=1)
            if last_start.hour == 6:  # If previous period started at 6:00, next period will start at 16:30 and end at 6:00 the following day
                next_start = last_start.set(hour=16, minute=30).replace(tzinfo=UTC)
                next_end = (last_start + delta).replace(tzinfo=UTC)
            else:  # If previous period started at 16:30, next period will start at 6:00 next day and end at 16:30
                next_start = (last_start + delta).set(hour=6, minute=0).replace(tzinfo=UTC)
                next_end = (last_start + delta).replace(tzinfo=UTC)
        else:  # This is the first ever run on the regular schedule. First data interval will always start at 6:00 and end at 16:30
            next_start = restriction.earliest
            if next_start is None:  # No start_date. Don't schedule.
                return None
            if not restriction.catchup:  # If the DAG has catchup=False, today is the earliest to consider.
                next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
            next_start = next_start.set(hour=6, minute=0).replace(tzinfo=UTC)
            next_end = next_start.set(hour=16, minute=30).replace(tzinfo=UTC)
        if restriction.latest is not None and next_start > restriction.latest:
            return None  # Over the DAG's scheduled end; don't schedule.
        return DagRunInfo.interval(start=next_start, end=next_end)


class UnevenIntervalsTimetablePlugin(AirflowPlugin):
    name = "uneven_intervals_timetable_plugin"
    timetables = [UnevenIntervalsTimetable]
Note that because timetables are plugins, you will need to restart the Airflow Scheduler and Webserver after adding or updating them.
In the DAG, we can then import the custom timetable plugin and use it to schedule the DAG by setting the timetable parameter:
from uneven_intervals_timetable import UnevenIntervalsTimetable

with DAG(
    dag_id="example_timetable_dag",
    start_date=datetime(2021, 10, 9),
    max_active_runs=1,
    timetable=UnevenIntervalsTimetable(),
    default_args={
        "retries": 1,
        "retry_delay": timedelta(minutes=3),
    },
    catchup=True
) as dag:
Looking at the Tree View in the UI, we can see that this DAG has run twice
per day at 6:00 and 16:30 since the start date of 2021-10-09.
The next scheduled run is for the interval starting on 2021-10-12 at 16:30 and
ending the following day at 6:00. This run will be triggered at the end of the
data interval, so after 2021-10-13 6:00.
If we run the DAG manually after 16:30 but before midnight, we can see the
data interval for the triggered run was between 6:00 and 16:30 that day as
expected.
This is a simple timetable that could easily be adjusted to suit other use cases. In general, timetables are completely customizable as long as the methods above are implemented.
Current Limitations
• Timetable methods should return the same result every time they are
called (e.g. avoid things like HTTP requests). They are not designed to
implement event-based triggering.
• Timetables are parsed by the scheduler when creating DAG runs, so
avoid slow or lengthy code that could impact Airflow performance.
dataset1 = Dataset(f"{DATASETS_PATH}/dataset_1.txt")
dataset2 = Dataset(f"{DATASETS_PATH}/dataset_2.txt")

with DAG(
    dag_id='dataset_dependent_example_dag',
    catchup=False,
    start_date=datetime(2022, 8, 1),
    schedule=[dataset1, dataset2],
    tags=['consumes', 'dataset-scheduled'],
) as dag:
This DAG runs only when both dataset1 and dataset2 have been updated. These updates can occur by tasks in different DAGs as long as they are located in the same Airflow environment.
In the Airflow UI, the DAG now has a schedule of Dataset and the Next Run
column shows how many datasets the DAG depends on and how many of
them have been updated.
To learn more about datasets and data driven scheduling, check out the
Datasets and Data Driven Scheduling in Airflow guide.
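The producer side isn't shown above. As a hedged sketch, a task in any other DAG in the same Airflow environment can mark one of these datasets as updated by declaring it as an outlet (the DAG and task names are illustrative; DATASETS_PATH is the same constant used in the consumer snippet):

from datetime import datetime

from airflow import DAG, Dataset
from airflow.decorators import task

dataset1 = Dataset(f"{DATASETS_PATH}/dataset_1.txt")  # same URI as in the consumer DAG

with DAG(
    dag_id='dataset_producer_example_dag',
    start_date=datetime(2022, 8, 1),
    schedule='@daily',
    catchup=False,
) as dag:

    @task(outlets=[dataset1])
    def update_dataset_1():
        # Any successful run of this task registers an update to dataset_1,
        # which counts toward triggering the dataset-scheduled DAG above.
        with open(dataset1.uri, "a") as f:
            f.write("new data\n")

    update_dataset_1()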
Operators 101
Overview
Operators are the building blocks of Airflow DAGs. They contain the logic of how data is processed in a pipeline. Each task in a DAG is defined by instantiating an operator.
There are many different types of operators available in Airflow. Some operators execute general code provided by the user, like a Python function, while other operators perform very specific actions such as transferring data from one system to another.
In this guide, we’ll cover the basics of using operators in Airflow and show an
example of how to implement them in a DAG.
Operator Basics
Under the hood, operators are Python classes that encapsulate logic to do a
unit of work. They can be thought of as a wrapper around each unit of work
that defines the actions that will be completed and abstracts away a lot of
code you would otherwise have to write yourself. When you create an instance of an operator in a DAG and provide it with its required parameters, it becomes a task.
All operators inherit from the abstract BaseOperator class, which contains
the logic to execute the work of the operator within the context of a DAG.
The work each operator does varies widely. Some of the most frequently
used operators in Airflow are:
Operators are easy to use and typically only a few parameters are required.
There are a few details that every Airflow user should know about operators:
• The Astronomer Registry is the best place to go to learn about what op-
erators are out there and how to use them.
• The core Airflow package contains basic operators such as the PythonOperator and BashOperator. These operators are automatically available in your Airflow environment. All other operators are part of provider packages, which must be installed separately. For example, the SnowflakeOperator is part of the Snowflake provider.
• If an operator exists for your specific use case, you should always use it over your own Python functions or hooks. This makes your DAGs easier to read and maintain.
• If an operator doesn't exist for your use case, you can extend an operator to meet your needs. For more on how to customize operators, check out our previous Anatomy of an Operator webinar.
• Sensors are a type of operator that wait for something to happen. They can be used to make your DAGs more event-driven.
• Deferrable Operators are a type of operator that release their worker slot while waiting for their work to be completed. This can result in cost savings and greater scalability. Astronomer recommends using deferrable operators whenever one exists for your use case and your task takes longer than about a minute. Note that you must be using Airflow 2.2+ and have a triggerer running to use deferrable operators.
• Any operator that interacts with a service external to Airflow will typically require a connection so that Airflow can authenticate to that external system. More information on how to set up connections can be found in our guide on managing connections or in the examples to follow.
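As a quick sketch of these basics (the task IDs, query, and DAG name are illustrative), a core operator can be used as-is, while a provider operator such as the PostgresOperator needs its provider package installed and a connection ID:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="operator_basics_example",   # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Core operator: available in any Airflow environment.
    say_hello = BashOperator(task_id="say_hello", bash_command="echo 'hello'")

    # Provider operator: requires the Postgres provider package and a connection.
    run_query = PostgresOperator(
        task_id="run_query",
        postgres_conn_id="postgres_default",
        sql="SELECT 1;",
    )

    say_hello >> run_query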
Example Implementation
This example shows how to use several common operators in a DAG used to
transfer data from S3 to Redshift and perform data quality checks.
Note: The full code and repository for this example can be found on
the Astronomer Registry.
• LocalFilesystemToS3Operator: This operator is part of the AWS provider
and is used to upload a file from a local filesystem to S3.
• S3ToRedshiftOperator: This operator is part of the AWS provider and is
used to transfer data from S3 to Redshift.
• PostgresOperator: This operator is part of the Postgres provider and is
used to execute a query against a Postgres database.
• SQLCheckOperator: This operator is part of core Airflow and is used to
perform checks against a database using a SQL query.
The following code shows how each of those operators can be instantiated in
a DAG file to define the pipeline:
import hashlib
import json

from airflow import DAG, AirflowException
from airflow.decorators import task
from airflow.models import Variable
from airflow.models.baseoperator import chain
from airflow.operators.empty import EmptyOperator
from airflow.utils.dates import datetime
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.amazon.aws.transfers.local_to_s3 import (
    LocalFilesystemToS3Operator
)
from airflow.providers.amazon.aws.transfers.s3_to_redshift import (
    S3ToRedshiftOperator
)
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.sql import SQLCheckOperator
from airflow.utils.task_group import TaskGroup


# The file(s) to upload shouldn't be hardcoded in a production setting,
# this is just for demo purposes.
CSV_FILE_NAME = "forestfires.csv"
CSV_FILE_PATH = f"include/sample_data/forestfire_data/{CSV_FILE_NAME}"

with DAG(
    "simple_redshift_3",
    start_date=datetime(2021, 7, 7),
    description="""A sample Airflow DAG to load data from csv files to S3
    and then Redshift, with data integrity and quality checks.""",
    schedule_interval=None,
    template_searchpath="/usr/local/airflow/include/sql/redshift_examples/",
    catchup=False,
) as dag:

    """
    Before running the DAG, set the following in an Airflow or Environment Variable:
    - key: aws_configs
    - value: { "s3_bucket": [bucket_name], "s3_key_prefix": [key_prefix],
      "redshift_table": [table_name]}
    Fully replacing [bucket_name], [key_prefix], and [table_name].
    """

    upload_file = LocalFilesystemToS3Operator(
        task_id="upload_to_s3",
        filename=CSV_FILE_PATH,
        dest_key="{{ var.json.aws_configs.s3_key_prefix }}/" + CSV_FILE_PATH,
        dest_bucket="{{ var.json.aws_configs.s3_bucket }}",
        aws_conn_id="aws_default",
        replace=True,
    )

    @task
    def validate_etag():
        """
        #### Validation task
        Check the destination ETag against the local MD5 hash to ensure
        the file was uploaded without errors.
        """
        s3 = S3Hook()
        aws_configs = Variable.get("aws_configs", deserialize_json=True)
        obj = s3.get_key(
            key=f"{aws_configs.get('s3_key_prefix')}/{CSV_FILE_PATH}",
            bucket_name=aws_configs.get("s3_bucket"),
        )
        obj_etag = obj.e_tag.strip('"')
        # Change `CSV_FILE_PATH` to `CSV_CORRUPT_FILE_PATH` for the "sad path".
        file_hash = hashlib.md5(
            open(CSV_FILE_PATH).read().encode("utf-8")).hexdigest()
        if obj_etag != file_hash:
            raise AirflowException(
                f"""Upload Error: Object ETag in S3 did not match
                hash of local file."""
            )

    # Tasks that were created using decorators have to be called to be used
    validate_file = validate_etag()

    #### Create Redshift Table
    create_redshift_table = PostgresOperator(
        task_id="create_table",
        sql="create_redshift_forestfire_table.sql",
        postgres_conn_id="redshift_default",
    )

    #### Second load task
    load_to_redshift = S3ToRedshiftOperator(
        task_id="load_to_redshift",
        s3_bucket="{{ var.json.aws_configs.s3_bucket }}",
        s3_key="{{ var.json.aws_configs.s3_key_prefix }}" + f"/{CSV_FILE_PATH}",
        schema="PUBLIC",
        table="{{ var.json.aws_configs.redshift_table }}",
        copy_options=["csv"],
    )

    #### Redshift row validation task
    validate_redshift = SQLCheckOperator(
        task_id="validate_redshift",
        conn_id="redshift_default",
        sql="validate_redshift_forestfire_load.sql",
        params={"filename": CSV_FILE_NAME},
    )

    #### Row-level data quality check
    with open("include/validation/forestfire_validation.json") as ffv:
        with TaskGroup(group_id="row_quality_checks") as quality_check_group:
            ffv_json = json.load(ffv)
            for id, values in ffv_json.items():
                values["id"] = id
                SQLCheckOperator(
                    task_id=f"forestfire_row_quality_check_{id}",
                    conn_id="redshift_default",
                    sql="row_quality_redshift_forestfire_check.sql",
                    params=values,
                )

    #### Drop Redshift table
    drop_redshift_table = PostgresOperator(
        task_id="drop_table",
        sql="drop_redshift_forestfire_table.sql",
        postgres_conn_id="redshift_default",
    )

    begin = EmptyOperator(task_id="begin")
    end = EmptyOperator(task_id="end")

    #### Define task dependencies
    chain(
        begin,
        upload_file,
        validate_file,
        create_redshift_table,
        load_to_redshift,
        validate_redshift,
        quality_check_group,
        drop_redshift_table,
        end
    )
The resulting DAG looks like this:
There are a few things to note about the operators in this DAG:
Hooks 101
Overview
Hooks are one of the fundamental building blocks of Airflow. At a high level,
a hook is an abstraction of a specific API that allows Airflow to interact with
an external system. Hooks are built into many operators, but they can also be
used directly in DAG code.
In this guide, we'll cover the basics of using hooks in Airflow and when to use them directly in DAG code. We'll also walk through an example of implementing two different hooks in a DAG.
Hook Basics
Hooks wrap around APIs and provide methods to interact with different
external systems. Because hooks standardize the way you can interact with
external systems, using them makes your DAG code cleaner, easier to read,
and less prone to errors.
All hooks inherit from the BaseHook class, which contains the logic to set up an external connection given a connection ID. On top of making the connection to an external system, each hook might contain additional methods to perform various actions within that system. These methods might rely on different Python libraries for these interactions.
For example, the S3Hook, which is one of the most widely used hooks, relies on the boto3 library to manage its connection with S3.
• Hooks should always be used over manual API interaction to connect to
external systems.
• If you write a custom operator to interact with an external system, it should use a hook to do so.
• If an operator with built-in hooks exists for your specific use case, then
it is best practice to use the operator over setting up hooks manually.
• If you regularly need to connect to an API for which no hook exists yet,
consider writing your own and sharing it with the community!
Example Implementation
The following example shows how you can use two hooks (S3Hook and SlackHook) to retrieve values from files in an S3 bucket, run a check on them, post the result of the check on Slack, and log the response of the Slack API.
For this use case, we use hooks directly in our Python functions because none of the existing S3 Operators can read data from several files within an S3 bucket. Similarly, none of the existing Slack Operators can return the response of a Slack API call, which you might want to log for monitoring purposes.
The full source code of the hooks used can be found here:
Before running the example DAG, make sure you have the necessary Airflow
providers installed. If you are using the Astro CLI, you can do this by adding
the following packages to your requirements.txt:
apache-airflow-providers-amazon
apache-airflow-providers-slack
Next you will need to set up connections to the S3 bucket and Slack in the
Airflow UI.
1. Go to Admin -> Connections and click on the plus sign to add a new
connection.
2. Select Amazon S3 as connection type for the S3 bucket (if the connection type is not showing up, double check that you installed the provider correctly) and provide the connection with your AWS access key ID as login and your AWS secret access key as password (see AWS documentation for how to retrieve your AWS access key ID and AWS secret access key).
The DAG below uses Airflow Decorators to define tasks and XCom to pass
information between them. The name of the S3 bucket and the names of the
files that the first task reads are stored as environment variables for security
purposes.
# importing necessary packages
import os
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
from airflow.providers.slack.hooks.slack import SlackHook
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# import environmental variables for privacy (set in Dockerfile)
S3BUCKET_NAME = os.environ.get('S3BUCKET_NAME')
S3_EXAMPLE_FILE_NAME_1 = os.environ.get('S3_EXAMPLE_FILE_NAME_1')
S3_EXAMPLE_FILE_NAME_2 = os.environ.get('S3_EXAMPLE_FILE_NAME_2')
S3_EXAMPLE_FILE_NAME_3 = os.environ.get('S3_EXAMPLE_FILE_NAME_3')

# task to read 3 keys from your S3 bucket
@task.python
def read_keys_form_s3():
    s3_hook = S3Hook(aws_conn_id='hook_tutorial_s3_conn')
    response_file_1 = s3_hook.read_key(key=S3_EXAMPLE_FILE_NAME_1,
                                       bucket_name=S3BUCKET_NAME)
    response_file_2 = s3_hook.read_key(key=S3_EXAMPLE_FILE_NAME_2,
                                       bucket_name=S3BUCKET_NAME)
    response_file_3 = s3_hook.read_key(key=S3_EXAMPLE_FILE_NAME_3,
                                       bucket_name=S3BUCKET_NAME)

    response = {'num1': int(response_file_1),
                'num2': int(response_file_2),
                'num3': int(response_file_3)}

    return response

# task running a check on the data retrieved from your S3 bucket
@task.python
def run_sum_check(response):
    if response['num1'] + response['num2'] == response['num3']:
        return (True, response['num3'])
    return (False, response['num3'])

# task posting to slack depending on the outcome of the above check
# and returning the server response
@task.python
def post_to_slack(sum_check_result):
    slack_hook = SlackHook(slack_conn_id='hook_tutorial_slack_conn')

    if sum_check_result[0] == True:
        server_response = slack_hook.call(api_method='chat.postMessage',
                                          json={"channel": "#test-airflow",
                                                "text": f"""All is well in your bucket!
                        Correct sum: {sum_check_result[1]}!"""})
    else:
        server_response = slack_hook.call(api_method='chat.postMessage',
                                          json={"channel": "#test-airflow",
                                                "text": f"""A test on your bucket contents failed!
                        Target sum not reached: {sum_check_result[1]}"""})

    # return the response of the API call (for logging or use downstream)
    return server_response

# implementing the DAG
with DAG(dag_id='hook_tutorial',
         start_date=datetime(2022, 5, 20),
         schedule_interval='@daily',
         catchup=False,
         ) as dag:

    # the dependencies are automatically set by XCom
    response = read_keys_form_s3()
    sum_check_result = run_sum_check(response)
    post_to_slack(sum_check_result)
2. With the results of the first task, use a second decorated Python
Operator to complete a simple sum check.
3. Post the result of the check to a Slack channel using the call method of
the SlackHook and return the response from the Slack API.
Sensors 101
Sensors are a special kind of operator. When they run, they check to see if a certain criterion is met before they let downstream tasks execute. This is a great way to have portions of your DAG wait on some external check or process to complete.
To browse and search all of the available Sensors in Airflow, visit the
Astronomer Registry. Take the following sensor as an example:
s1 = S3KeySensor(
    task_id='s3_key_sensor',
    bucket_key='{{ ds_nodash }}/my_file.csv',
    bucket_name='my_s3_bucket',
    aws_conn_id='my_aws_connection',
)
S3 Key Sensor
The S3KeySensor checks for the existence of a specified key in S3 every few seconds until it finds it or times out. If it finds the key, it will be marked as a success and allow downstream tasks to run. If it times out, it will fail and prevent downstream tasks from running.
S3KeySensor Code
Sensor Params
There are sensors for many use cases, such as ones that check a database for
a certain row, wait for a certain time of day, or sleep for a certain amount of
time. All sensors inherit from the BaseSensorOperator and have 4 parameters
you can set on any sensor.
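The parameter table isn't reproduced here. As a hedged sketch, these are the BaseSensorOperator arguments usually meant, shown on the S3KeySensor from above with illustrative values (the import path is the one used by recent versions of the Amazon provider):

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

s1 = S3KeySensor(
    task_id='s3_key_sensor',
    bucket_key='{{ ds_nodash }}/my_file.csv',
    bucket_name='my_s3_bucket',
    aws_conn_id='my_aws_connection',
    # Parameters inherited from BaseSensorOperator:
    poke_interval=60,       # seconds to wait between checks
    timeout=60 * 60 * 3,    # fail the sensor after three hours of waiting
    mode='reschedule',      # release the worker slot between checks
    soft_fail=False,        # if True, mark the task skipped instead of failed on timeout
)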
Deferrable Operators
Prior to Airflow 2.2, all task execution occurred within your worker resources.
For tasks whose work was occurring outside of Airflow (e.g. a Spark Job),
your tasks would sit idle waiting for a success or failure signal. These idle tasks
would occupy worker slots for their entire duration, potentially queuing other
tasks and delaying their start times.
With the release of Airflow 2.2, Airflow has introduced a new way to run tasks
in your environment: deferrable operators. These operators leverage
Python’s asyncio library to efficiently run tasks waiting for an external
resource to finish. This frees up your workers, allowing you to utilize those
resources more effectively. In this guide, we’ll walk through the concepts of
deferrable operators, as well as the new components introduced to Airflow
related to this feature.
There are some terms and concepts that are important to understand when
discussing deferrable operators:
Note: The terms "deferrable" and "async" or "asynchronous" are often used interchangeably. They mean the same thing in this context.
With deferrable operators, worker slots can be released while polling for job status. When the task is deferred (suspended), the polling process is offloaded as a trigger to the triggerer, freeing up the worker slot. The triggerer has the potential to run many asynchronous polling tasks concurrently, preventing this work from occupying your worker resources. When the terminal status for the job is received, the task resumes, taking a worker slot while it finishes. Visually, this is represented in the diagram below:
[Diagram: Submit Job to Spark Cluster → Poll Spark Cluster for Job Status → Receive Terminal Status for Job on Spark Cluster]
When and Why to Use Deferrable Operators
• TimeSensorAsync
• DateTimeSensorAsync
However, this list will grow quickly as the Airflow community makes more
investments into these operators. In the meantime, you can also create your
own (more on this in the last section of this guide). Additionally, Astronomer
maintains some deferrable operators available only on Astro Runtime.
There are numerous benefits to using deferrable operators. Some of the most
notable are:
• Paves the way to event-based DAGs: The presence of asyncio in core
Airflow is a potential foundation for event-triggered DAGs.
Let's say we have a DAG that is scheduled to run a sensor every minute, where each task can take up to 20 minutes. Using the default settings with 1 worker, we can see that after 20 minutes we have 16 tasks running, each holding a worker slot:
Because worker slots are held during task execution time, we would need at
least 20 worker slots available for this DAG to ensure that future runs are not
delayed. To increase concurrency, we would need to add additional resources
to our Airflow infrastructure (e.g. another worker pod).
from datetime import datetime
from airflow import DAG
from airflow.sensors.date_time import DateTimeSensor

with DAG(
    "sync_dag",
    start_date=datetime(2021, 12, 22, 20, 0),
    end_date=datetime(2021, 12, 22, 20, 19),
    schedule_interval="* * * * *",
    catchup=True,
    max_active_runs=32,
    max_active_tasks=32
) as dag:

    sync_sensor = DateTimeSensor(
        task_id="sync_task",
        target_time="""{{ macros.datetime.utcnow() + macros.timedelta(minutes=20) }}""",
    )
from datetime import datetime
from airflow import DAG
from airflow.sensors.date_time import DateTimeSensorAsync

with DAG(
    "async_dag",
    start_date=datetime(2021, 12, 22, 20, 0),
    end_date=datetime(2021, 12, 22, 20, 19),
    schedule_interval="* * * * *",
    catchup=True,
    max_active_runs=32,
    max_active_tasks=32
) as dag:

    async_sensor = DateTimeSensorAsync(
        task_id="async_task",
        target_time="""{{ macros.datetime.utcnow() + macros.timedelta(minutes=20) }}""",
    )
Note that if you are running Airflow on Astro, the triggerer runs automatically
if you are on Astro Runtime 4.0+. If you are using Astronomer Software 0.26+,
you can add a triggerer to an Airflow 2.2+ deployment in the Deployment
Settings tab. This guide details the steps for configuring this feature in the
platform.
As tasks are raised into a deferred state, triggers are registered in the triggerer. You can set the number of concurrent triggers that can run in a single triggerer process with the default_capacity configuration setting in Airflow. This can also be set via the AIRFLOW__TRIGGERER__DEFAULT_CAPACITY environment variable. By default, this variable's value is 1,000.
High Availability
Note that triggers are designed to be highly available. You can implement this by starting multiple triggerer processes. Similar to the HA scheduler introduced in Airflow 2.0, Airflow ensures that they co-exist with correct locking and HA. You can reference the Airflow docs for further information on this topic.
3. DAG Design
Because Airflow is 100% code, knowing the basics of Python is all it takes to
get started writing DAGs. However, writing DAGs that are efficient, secure,
and scalable requires some Airflow-specific finesse. In this section, we will
cover some best practices for developing DAGs that make the most of what
Airflow has to offer.
In general, most of the best practices we cover here fall into one of
two categories:
• DAG design
• Using Airflow as an orchestrator
Reviewing Idempotency
Before we jump into best practices specific to Airflow, we need to review one concept which applies to all data pipelines.
Idempotency is the foundation for many computing practices, including the Airflow best practices in this section. Specifically, a computational operation is considered idempotent if running it multiple times with the same inputs always produces the same result.
DAG Design
The following DAG design principles will help to make your DAGs idempotent, efficient, and readable.
For example, in an ETL pipeline you would ideally want your Extract, Transform, and Load operations covered by three separate tasks. Atomizing these tasks allows you to rerun each operation in the pipeline independently, which supports idempotence.
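As a hedged sketch of this principle (the function bodies and DAG name are placeholders), each ETL step becomes its own task so any one of them can be rerun on its own:

from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2022, 1, 1), schedule_interval="@daily", catchup=False)
def atomic_etl():

    @task
    def extract():
        # Pull raw records from the source system.
        return [{"id": 1, "value": 10}]

    @task
    def transform(records):
        # Apply business logic to the extracted records.
        return [{**r, "value": r["value"] * 2} for r in records]

    @task
    def load(records):
        # Write the transformed records to the destination.
        print(f"loading {len(records)} records")

    load(transform(extract()))


dag = atomic_etl()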
Contrary to our best practices, the following example defines variables
based on datetime Python functions:
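The original snippet isn't reproduced here, but the anti-pattern being described looks roughly like this (variable names are illustrative):

from datetime import datetime, timedelta

# Anti-pattern (sketch): dates computed at the top level of the DAG file.
# These are evaluated every time the scheduler parses the file, not when the
# task runs, so backfills and reruns won't get the values you expect.
today = datetime.today()
yesterday = datetime.today() - timedelta(days=1)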
You can use one of Airflow’s many built-in variables and macros, or you can
create your own templated field to pass in information at runtime. For more on
this topic check out our guide on templating and macros in Airflow.
Incremental Record Filtering
It is ideal to break out your pipelines into incremental extracts and loads
wherever possible. For example, if you have a DAG that runs hourly, each DAG
Run should process only records from that hour, rather than the whole dataset.
When the results in each DAG Run represent only a small subset of your total
dataset, a failure in one subset of the data won’t prevent the rest of your DAG
Runs from completing successfully. And if your DAGs are idempotent, you can
rerun a DAG for only the data that failed rather than reprocessing the entire
dataset.
There are multiple ways you can achieve incremental pipelines. The two best
and most common methods are described below.
• Sequence IDs
When a last modified date is not available, a sequence or incrementing
ID can be used for incremental loads. This logic works best when the
source records are only being appended to and never updated. While we
recommend implementing a “last modified” date system in your
records if possible, basing your incremental logic off of a sequence ID
can be a sound way to filter pipeline records without a last modified
date.
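As a hedged sketch of incremental filtering (the table and column names are illustrative), the templated data interval boundaries can be used to pull only the records that belong to the current DAG Run:

from airflow.providers.postgres.operators.postgres import PostgresOperator

extract_increment = PostgresOperator(
    task_id="extract_increment",
    postgres_conn_id="postgres_default",
    sql="""
        SELECT *
        FROM raw_events
        WHERE last_modified >= '{{ data_interval_start }}'
          AND last_modified <  '{{ data_interval_end }}';
    """,
)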
Avoid Top-Level Code in Your DAG File
In the context of Airflow, we use “top-level code” to mean any code that isn’t
part of your DAG or operator instantiations.
Treat your DAG file like a config file and leave all of the heavy lifting to the
hooks and operators that you instantiate within the file. If your DAGs need to
access additional code such as a SQL script or a Python function, keep that
code in a separate file that can be read into a DAG Run.
For one example of what not to do, in the DAG below a PostgresOperator
executes a SQL query that was dropped directly into the DAG file:
#Instantiate DAG
with DAG('bad_practices_dag_1',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(task_id='start')

    #Bad example with top level SQL code in the DAG file
    query_1 = PostgresOperator(
        task_id='covid_query_wa',
        postgres_conn_id='postgres_default',
        sql='''with yesterday_covid_data as (
                SELECT *
                FROM covid_state_data
                WHERE date = {{ params.today }}
                AND state = 'WA'
            ),
            today_covid_data as (
                SELECT *
                FROM covid_state_data
                WHERE date = {{ params.yesterday }}
                AND state = 'WA'
            ),
            two_day_rolling_avg as (
                SELECT AVG(a.state, b.state) as two_day_avg
                FROM yesterday_covid_data a
                JOIN yesterday_covid_data b
                ON a.state = b.state
            )
            SELECT a.state, b.state, c.two_day_avg
            FROM yesterday_covid_data a
            JOIN today_covid_data b
            ON a.state=b.state
            JOIN two_day_rolling_avg c
            ON a.state=b.two_day_avg;''',
        params={'today': today, 'yesterday': yesterday}
    )
Keeping the query in the DAG file like this makes the DAG harder to read and maintain. Instead, in the DAG below we call in a file named covid_state_query.sql into our PostgresOperator instantiation, which embodies the best practice:
with DAG('good_practices_dag_1',
         start_date=datetime(2021, 1, 1),
         max_active_runs=3,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False,
         template_searchpath='/usr/local/airflow/include'  #include path to look for external files
         ) as dag:

    query = PostgresOperator(
        task_id='covid_query_{0}'.format(state),
        postgres_conn_id='postgres_default',
        sql='covid_state_query.sql',  #reference query kept in separate file
        params={'state': "'" + state + "'"}
    )
Use a Consistent Method for Task Dependencies
In Airflow, task dependencies can be set multiple ways. You can use set_upstream() and set_downstream() functions, or you can use << and >> operators. Which method you use is a matter of personal preference, but for readability it's best practice to choose one method and stick with it.
task_1.set_downstream(task_2)
task_3.set_upstream(task_2)
task_3 >> task_4
Leverage Airflow Features
For easy discovery of all the great provider packages out there, check out
the Astronomer Registry.
We recommend that you consider the size of your data now and in the future when deciding whether to process data within Airflow or offload to an external tool. If your use case is well suited to processing data within Airflow, then we would recommend the following:
Depending on your data retention policy, you could modify the load logic
and rerun the entire historical pipeline without having to rerun the extracts.
This is also useful in situations where you no longer have access to the source
system (e.g. you hit an API limit).
Other Best Practices
Finally, here are a few other noteworthy best practices that don’t fall under
the two categories above.
Additionally, if you change the start_date of your DAG you should also change
the DAG name. Changing the start_date of a DAG creates a new entry in
Airflow’s database, which could confuse the scheduler because there will be two
DAGs with the same name but different schedules.
Changing the name of a DAG also creates a new entry in the database, which powers the dashboard, so follow a consistent naming convention since changing a DAG's name doesn't delete the entry in the database for the old name.
Set Retries at the DAG Level
Even if your code is perfect, failures happen. In a distributed environment
where task containers are executed on shared hosts, it’s possible for tasks to
be killed off unexpectedly. When this happens, you might see Airflow’s logs
mention a zombie process.
Issues like this can be resolved by using task retries. The best practice is to
set retries as a default_arg so they are applied at the DAG level and get
more granular for specific tasks only where necessary. A good range to try
is ~2–4 retries.
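A hedged sketch of that pattern (the DAG, task, and command are illustrative): retries set once in default_args apply to every task, and an individual task can override them where needed:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Applied to every task in the DAG.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="retries_example",   # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # Override for a single task that is known to be flaky.
    flaky_api_call = BashOperator(
        task_id="flaky_api_call",
        bash_command="curl --fail https://example.com/health",
        retries=5,
    )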
Passing Data Between Airflow Tasks
Sharing data between tasks is a very common use case in Airflow. If you've been writing DAGs, you probably know that breaking them up into appropriately small tasks is the best practice for debugging and recovering quickly from failures. But maybe one of your downstream tasks requires metadata about an upstream task or processes the results of the task immediately before it.
There are a few methods you can use to implement data sharing between your Airflow tasks. In this section, we will walk through the two most commonly used methods, discuss when to use each, and show some example DAGs to demonstrate the implementation. Before we dive into the specifics, there are a couple of high-level concepts that are important when writing DAGs where data is shared between tasks.
Ensure Idempotency
An important concept for any data pipeline, including an Airflow DAG, is
idempotency. This is the property whereby an operation can be applied
multiple times without changing the result. We often hear about this concept
as it applies to your entire DAG; if you execute the same DAGRun multiple
times, you will get the same result. However, this concept also applies to
tasks within your DAG; if every task in your DAG is idempotent, your full DAG
will be idempotent as well.
XCom
The first method for passing data between Airflow tasks is to use XCom,
which is a key Airflow feature for sharing task data.
What is XCom
XCom (short for cross-communication) is a native feature within Airflow.
XComs allow tasks to exchange task metadata or small amounts of data. They
are defined by a key, value, and timestamp.
If an operator returns a value (e.g. if your Python callable for your PythonOperator has a return), that value will automatically be pushed to XCom. Tasks can also be configured to push XComs by calling the xcom_push() method. Similarly, xcom_pull() can be used in a task to receive an XCom.
You can view your XComs in the Airflow UI by navigating to Admin → XComs. While there is nothing stopping you from passing small data sets with XCom, be very careful when doing so. This is not what XCom was designed for, and using it to pass data like pandas dataframes can degrade the performance of your DAGs and take up storage in the metadata database.
XCom cannot be used for passing large data sets between tasks. The limit
for the size of the XCom is determined by which metadata database you are
using:
• Postgres: 1 Gb
• SQLite: 2 Gb
• MySQL: 64 Kb
You can see that these limits aren’t very big. And even if you think your data
might squeak just under, don’t use XComs. Instead, see the section below on
using intermediary data storage, which is more appropriate for larger chunks
of data.
Example DAGs
This section will show a couple of example DAGs that use XCom to pass data between tasks. For this example, we are interested in analyzing the increase in the total number of Covid tests for the current day for a particular state. To implement this use case, we will have one task that makes a request to the Covid Tracking API and pulls the totalTestResultsIncrease parameter from the results. We will then use another task to take that result and complete some sort of analysis. This is a valid use case for XCom because the data being passed between the tasks is a single integer.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta

import requests
import json

url = 'https://covidtracking.com/api/v1/states/'
state = 'wa'

def get_testing_increase(state, ti):
    """
    Gets totalTestResultsIncrease field from Covid API for given state and returns value
    """
    res = requests.get(url + '{0}/current.json'.format(state))
    testing_increase = json.loads(res.text)['totalTestResultsIncrease']

    ti.xcom_push(key='testing_increase', value=testing_increase)

def analyze_testing_increases(state, ti):
    """
    Evaluates testing increase results
    """
    testing_increases = ti.xcom_pull(key='testing_increase', task_ids='get_testing_increase_data_{0}'.format(state))
    print('Testing increases for {0}:'.format(state), testing_increases)
    #run some analysis here

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('xcom_dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=2,
         schedule_interval=timedelta(minutes=30),
         default_args=default_args,
         catchup=False
         ) as dag:

    opr_get_covid_data = PythonOperator(
        task_id='get_testing_increase_data_{0}'.format(state),
        python_callable=get_testing_increase,
        op_kwargs={'state': state}
    )

    opr_analyze_testing_data = PythonOperator(
        task_id='analyze_data',
        python_callable=analyze_testing_increases,
        op_kwargs={'state': state}
    )

    opr_get_covid_data >> opr_analyze_testing_data
In this DAG we have two PythonOperator tasks which share data using the xcom_push and xcom_pull methods. Note that in the get_testing_increase function, we used the xcom_push method so that we could specify the key name. Alternatively, we could have made the function return the testing_increase value, because any value returned by an operator in Airflow will automatically be pushed to XCom; if we had used this method, the XCom key would be "return_value".
If we run this DAG and then go to the XComs page in the Airflow UI, we see
that a new row has been added for our get_testing_increase_data_wa task
with the key testing_increase and value returned from the API.
In the logs for the analyze_data task, we can see the value from the prior
task was printed, meaning the value was successfully retrieved from XCom.
TaskFlow API
Another way to implement this use case is to use the TaskFlow API that was
released with Airflow 2.0. With the TaskFlow API, returned values are pushed
to XCom as usual, but XCom values can be pulled simply by adding the key
as an input to the function as shown in the following DAG:
@dag('xcom_taskflow_dag', schedule_interval='@daily', default_args=default_args, catchup=False)
def taskflow():

    @task
    def get_testing_increase(state):
        """
        Gets totalTestResultsIncrease field from Covid API for given state and returns value
        """
        res = requests.get(url + '{0}/current.json'.format(state))
        return {'testing_increase': json.loads(res.text)['totalTestResultsIncrease']}

    @task
    def analyze_testing_increases(testing_increase: int):
        """
        Evaluates testing increase results
        """
        print('Testing increases for {0}:'.format(state), testing_increase)
        #run some analysis here

    analyze_testing_increases(get_testing_increase(state))

dag = taskflow()
This DAG is functionally the same as the first one, but thanks to the TaskFlow
API there is less code required overall and no additional code required for
passing the data between the tasks using XCom.
Intermediary Data Storage
As mentioned above, XCom can be a great option for sharing data between
tasks because it doesn’t rely on any tools external to Airflow itself. Howev-
er, it is only designed to be used for very small amounts of data. What if the
data you need to pass is a little bit larger, for example, a small dataframe?
The best way to manage this use case is to use intermediary data storage.
This means saving your data to some system external to Airflow at the end of
one task, then reading it in from that system in the next task. This is common-
ly done using cloud file storage such as S3, GCS, Azure Blob Storage, etc.,
but it could also be done by loading the data in either a temporary or per-
sistent table in a database.
We will note here that while this is a great way to pass data that is too large
to be managed with XCom, you should still exercise caution. Airflow is meant
to be an orchestrator, not an execution framework. If your data is very large,
it is probably a good idea to complete any processing using a framework like
Spark, or to push the work down to a compute-optimized data warehouse like
Snowflake (for example, with a transformation tool such as dbt).
Example DAG
Building on our Covid example above, let’s say instead of a specific value of
testing increases, we are interested in getting all of the daily Covid data for a
state and processing it. This case would not be ideal for XCom, but since the
data returned is a small dataframe, it is likely okay to process using Airflow.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from datetime import datetime, timedelta

from io import StringIO
import pandas as pd
import requests

s3_conn_id = 's3-conn'
bucket = 'astro-workshop-bucket'
state = 'wa'
date = '{{ yesterday_ds_nodash }}'

def upload_to_s3(state, date):
    '''Grabs data from Covid endpoint and saves to flat file on S3
    '''
    # Connect to S3
    s3_hook = S3Hook(aws_conn_id=s3_conn_id)

    # Get data from API
    url = 'https://covidtracking.com/api/v1/states/'
    res = requests.get(url+'{0}/{1}.csv'.format(state, date))

    # Save data to CSV on S3
    s3_hook.load_string(res.text, '{0}_{1}.csv'.format(state, date), bucket_name=bucket, replace=True)

def process_data(state, date):
    '''Reads data from S3, processes, and saves to new S3 file
    '''
    # Connect to S3
    s3_hook = S3Hook(aws_conn_id=s3_conn_id)

    # Read data
    data = StringIO(s3_hook.read_key(key='{0}_{1}.csv'.format(state, date), bucket_name=bucket))
    df = pd.read_csv(data, sep=',')

    # Process data
    processed_data = df[['date', 'state', 'positive', 'negative']]

    # Save processed data to CSV on S3
    s3_hook.load_string(processed_data.to_string(), '{0}_{1}_processed.csv'.format(state, date), bucket_name=bucket, replace=True)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

with DAG('intermediary_data_storage_dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False
         ) as dag:

    generate_file = PythonOperator(
        task_id='generate_file_{0}'.format(state),
        python_callable=upload_to_s3,
        op_kwargs={'state': state, 'date': date}
    )

    process_data = PythonOperator(
        task_id='process_data_{0}'.format(state),
        python_callable=process_data,
        op_kwargs={'state': state, 'date': date}
    )

    generate_file >> process_data
In this DAG we make use of the S3Hook to save data retrieved from the API
to a CSV on S3 in the generate_file task. The process_data task then grabs
that data from S3, converts it to a dataframe for processing, and then saves
the processed data back to a new CSV on S3.
Using Task Groups in Airflow
Overview
Prior to the release of Airflow 2.0 in December 2020, the only way to group
tasks and create modular workflows within Airflow was to use SubDAGs.
SubDAGs were a way of presenting a cleaner-looking DAG by capitalizing on
code patterns. For example, ETL DAGs usually share a pattern of tasks that
extract data from a source, transform the data, and load it somewhere. The
SubDAG would visually group the repetitive tasks into one UI task, making
the pattern between tasks clearer.
However, SubDAGs were really just DAGs embedded in other DAGs. This
caused both performance and functional issues:
• When a SubDAG is triggered, the SubDAG and child tasks take up work-
er slots until the entire SubDAG is complete. This can delay other task
processing and, depending on your number of worker slots, can lead to
deadlocking.
• SubDAGs have their own parameters, schedule, and enabled settings.
When these are not consistent with their parent DAG, unexpected be-
havior can occur.
Unlike SubDAGs, Task Groups are just a UI grouping concept. Starting in Air-
flow 2.0, you can use Task Groups to organize tasks within your DAG’s graph
view in the Airflow UI. This avoids the added complexity and performance
issues of SubDAGs, all while using less code!
In this section, we will walk through how to create Task Groups and show
some example DAGs to demonstrate their scalability.
Creating Task Groups
To use Task Groups you’ll need to use the following import statement.
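In Airflow 2, that import is:

from airflow.utils.task_group import TaskGroup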
For our first example, we will instantiate a Task Group using a with statement
and provide a group_id. Inside our Task Group, we will define our two tasks,
t1 and t2, and their respective dependencies.
You can use dependency operators (<< and >>) on Task Groups in the same
way that you can with individual tasks. Dependencies applied to a Task Group
are applied across its tasks. In the following code, we will add additional de-
pendencies to t0 and t3 to the Task Group, which automatically applies the
same dependencies across t1 and t2:
t0 = DummyOperator(task_id='start')

# Start Task Group definition
with TaskGroup(group_id='group1') as tg1:
    t1 = DummyOperator(task_id='task1')
    t2 = DummyOperator(task_id='task2')

    t1 >> t2
# End Task Group definition

t3 = DummyOperator(task_id='end')

# Set Task Group's (tg1) dependencies
t0 >> tg1 >> t3
In the Airflow UI, Task Groups look like tasks with blue shading. When we
expand group1 by clicking on it, we see blue circles where the Task Group's
dependencies have been applied to the grouped tasks. The task(s) immediately
to the right of the first blue circle (t1) get the group's upstream dependencies,
and the task(s) immediately to the left of the last blue circle (t2) get the
group's downstream dependencies.
Note: When your task is within a Task Group, your callable task_id
will be the task_id prefixed with the group_id (i.e. group_id.task_id).
This ensures the uniqueness of the task_id across the DAG. This is
important to remember when calling specific tasks with XCom passing
or branching operator decisions.
Dynamically Generating Task Groups
Just like with DAGs, Task Groups can be dynamically generated to make use
of patterns within your code. In an ETL DAG, you might have similar down-
stream tasks that can be processed independently, such as when you call
different API endpoints for data that needs to be processed and stored in
the same way. For this use case, we can dynamically generate Task Groups
by API endpoint. Just like with manually written Task Groups, generated Task
Groups can be drilled into from the Airflow UI to see specific tasks.
In the code below, we use iteration to create multiple Task Groups. While the
tasks and dependencies remain the same across Task Groups, we can change
which parameters are passed in to each Task Group based on the group_id:
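A minimal sketch of that pattern (the group count and task names here are illustrative, mirroring the examples that follow):

from airflow.operators.dummy import DummyOperator
from airflow.utils.task_group import TaskGroup

# inside the DAG definition: one Task Group per iteration
for g_id in range(1, 3):
    with TaskGroup(group_id=f'group{g_id}') as tg:
        t1 = DummyOperator(task_id='task1')
        t2 = DummyOperator(task_id='task2')

        t1 >> t2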
This screenshot shows the expanded view of the Task Groups we generated
above in the Airflow UI:
By default, using a loop to generate your Task Groups will put them in paral-
lel. If your Task Groups are dependent on elements of another Task Group,
you’ll want to run them sequentially. For example, when loading tables with
foreign keys, your primary table records need to exist before you can load
your foreign table.
In the example below, our third dynamically generated Task Group has a foreign
key constraint on both our first and second dynamically generated Task Groups, so
we will want to process it last. To do this, we will create an empty list and append
our Task Group objects as they are generated. Using this list, we can reference the
Task Groups and define their dependencies to each other:
groups = []
for g_id in range(1, 4):
    tg_id = f'group{g_id}'
    with TaskGroup(group_id=tg_id) as tg1:
        t1 = DummyOperator(task_id='task1')
        t2 = DummyOperator(task_id='task2')

        t1 >> t2

        if tg_id == 'group1':
            t3 = DummyOperator(task_id='task3')
            t1 >> t3

        groups.append(tg1)

[groups[0], groups[1]] >> groups[2]
The following screenshot shows how these Task Groups appear in the
Airflow UI:
Conditioning on Task Groups
In the above example, we added an additional task to group1 based on our
group_id. This was to demonstrate that even though we are dynamically cre-
ating Task Groups to take advantage of patterns, we can still introduce vari-
ations to the pattern while avoiding code redundancies from building each
Task Group definition manually.
For additional complexity, you can nest Task Groups. Building on our previ-
ous ETL example, when calling API endpoints, we may need to process new
records for each endpoint before we can process updates to them.
In the following code, our top-level Task Groups represent our new and
updated record processing, while the nested Task Groups represent our API
endpoint processing:
groups = []
for g_id in range(1, 3):
    with TaskGroup(group_id=f'group{g_id}') as tg1:
        t1 = DummyOperator(task_id='task1')
        t2 = DummyOperator(task_id='task2')

        sub_groups = []
        for s_id in range(1, 3):
            with TaskGroup(group_id=f'sub_group{s_id}') as tg2:
                st1 = DummyOperator(task_id='task1')
                st2 = DummyOperator(task_id='task2')

                st1 >> st2
                sub_groups.append(tg2)

        t1 >> sub_groups >> t2
        groups.append(tg1)

groups[0] >> groups[1]
The following screenshot shows the expanded view of the nested Task
Groups in the Airflow UI:
Takeaways
Task Groups are a dynamic and scalable UI grouping concept that eliminates
the functional and performance issues of SubDAGs.
Ultimately, Task Groups give you the flexibility to group and organize your tasks
in a number of ways. To help guide your implementation of Task Groups, think
about:
Cross-DAG Dependencies
When designing Airflow DAGs, it is often best practice to put all related tasks
in the same DAG. However, it’s sometimes necessary to create dependencies
between your DAGs. In this scenario, one node of a DAG is its own complete
DAG, rather than just a single task. Throughout this guide, we’ll use the fol-
lowing terms to describe DAG dependencies:
• Upstream DAG: A DAG that must reach a specified state before a down-
stream DAG can run
• Downstream DAG: A DAG that cannot run until an upstream DAG reaches a specified state
Dependencies between DAGs typically come up in scenarios like these:
• A DAG should only run after one or more datasets have been updated
by tasks in other DAGs.
• Two DAGs are dependent, but they have different schedules.
• Two DAGs are dependent, but they are owned by different teams.
• A task depends on another task but for a different execution date.
For any scenario where you have dependent DAGs, we’ve got you covered!
In this guide, we’ll discuss multiple methods for implementing cross-DAG
dependencies, including how to implement dependencies if your dependent
DAGs are located in different Airflow deployments.
Note: All code in this section can be found in this Github repo.
Assumed knowledge
To get the most out of this guide, you should have knowledge of:
In this section, we detail how to use each method and ideal scenarios for
each, as well as how to view dependencies in the Airflow UI.
Dataset Dependencies
You should use this method if you have a downstream DAG that should only
run after a dataset has been updated by an upstream DAG, especially if
those updates are very irregular. This type of dependency also provides
you with increased observability into the dependencies between your DAGs
and datasets in the Airflow UI.
Any task can be made into a producing task by providing one or more data-
sets to the outlets parameter as shown below.
dataset1 = Dataset('s3://folder1/dataset_1.txt')

# producing task in the upstream DAG
EmptyOperator(
    task_id="producing_task",
    outlets=[dataset1]  # flagging to Airflow that dataset1 was updated
)
The downstream DAG is scheduled to run after dataset1 has been updated
by providing it to the schedule parameter.
dataset1 = Dataset('s3://folder1/dataset_1.txt')

# consuming DAG
with DAG(
    dag_id='consuming_dag_1',
    catchup=False,
    start_date=datetime.datetime(2022, 1, 1),
    schedule=[dataset1]
) as dag:
In the Airflow UI, the Next Run column for the downstream DAG shows
how many datasets the DAG depends on and how many of those have been
updated since the last DAG run. The screenshot below shows that the DAG
dataset_dependent_example_dag runs only after two different datasets have
been updated. One of those datasets has already been updated by an up-
stream DAG.
Check out the Datasets and Data Driven Scheduling in Airflow guide to learn
more and see an example implementation of this feature.
TriggerDagRunOperator
The TriggerDagRunOperator lets a task in one DAG trigger a run of another
DAG, optionally waiting for the triggered DAG to complete before moving on.
A common use case for this implementation is when an upstream DAG fetches
new testing data for a machine learning pipeline, runs and tests a model,
and publishes the model's prediction. In case of the model underperforming,
the TriggerDagRunOperator is used to kick off a separate DAG that retrains
the model while the upstream DAG waits. Once the model is retrained and
tested by the downstream DAG, the upstream DAG resumes and publishes
the new model's results.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from datetime import datetime, timedelta

def print_task_type(**kwargs):
    """
    Dummy function to call before and after dependent DAG.
    """
    print(f"The {kwargs['task_type']} task has completed.")

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'trigger-dagrun-dag',
    start_date=datetime(2021, 1, 1),
    max_active_runs=1,
    schedule_interval='@daily',
    default_args=default_args,
    catchup=False
) as dag:

    start_task = PythonOperator(
        task_id='starting_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'starting'}
    )

    trigger_dependent_dag = TriggerDagRunOperator(
        task_id="trigger_dependent_dag",
        trigger_dag_id="dependent-dag",
        wait_for_completion=True
    )

    end_task = PythonOperator(
        task_id='end_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'ending'}
    )

    start_task >> trigger_dependent_dag >> end_task
In the following graph view, you can see that the trigger_dependent_dag
task in the middle is the TriggerDagRunOperator, which runs the dependent-dag.
Note that if your dependent DAG requires a config input or a specific
execution date, these can be specified in the operator using the conf and
execution_date params respectively.
ExternalTaskSensor
To create cross-DAG dependencies from a downstream DAG, consider using
one or more ExternalTaskSensors. The downstream DAG will pause until a
task is completed in the upstream DAG before resuming.
For example, you could have upstream tasks modifying different tables in a
data warehouse and one downstream DAG running one branch of data qual-
ity checks for each of those tables. You can use one ExternalTaskSensor at
the start of each branch to make sure that the checks running on each table
only start once the update to that specific table has finished.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.operators.empty import EmptyOperator
from datetime import datetime, timedelta

def downstream_function_branch_1():
    print('Upstream DAG 1 has completed. Starting tasks of branch 1.')

def downstream_function_branch_2():
    print('Upstream DAG 2 has completed. Starting tasks of branch 2.')

def downstream_function_branch_3():
    print('Upstream DAG 3 has completed. Starting tasks of branch 3.')

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'external-task-sensor-dag',
    start_date=datetime(2022, 8, 1),
    max_active_runs=3,
    schedule='*/1 * * * *',
    catchup=False
) as dag:

    start = EmptyOperator(task_id="start")
    end = EmptyOperator(task_id="end")

    ets_branch_1 = ExternalTaskSensor(
        task_id="ets_branch_1",
        external_dag_id='upstream_dag_1',
        external_task_id='my_task',
        allowed_states=['success'],
        failed_states=['failed', 'skipped']
    )

    task_branch_1 = PythonOperator(
        task_id='task_branch_1',
        python_callable=downstream_function_branch_1,
    )

    ets_branch_2 = ExternalTaskSensor(
        task_id="ets_branch_2",
        external_dag_id='upstream_dag_2',
        external_task_id='my_task',
        allowed_states=['success'],
        failed_states=['failed', 'skipped']
    )

    task_branch_2 = PythonOperator(
        task_id='task_branch_2',
        python_callable=downstream_function_branch_2,
    )

    ets_branch_3 = ExternalTaskSensor(
        task_id="ets_branch_3",
        external_dag_id='upstream_dag_3',
        external_task_id='my_task',
        allowed_states=['success'],
        failed_states=['failed', 'skipped']
    )

    task_branch_3 = PythonOperator(
        task_id='task_branch_3',
        python_callable=downstream_function_branch_3,
    )

    start >> [ets_branch_1, ets_branch_2, ets_branch_3]

    ets_branch_1 >> task_branch_1
    ets_branch_2 >> task_branch_2
    ets_branch_3 >> task_branch_3

    [task_branch_1, task_branch_2, task_branch_3] >> end
In this DAG, each ExternalTaskSensor waits for my_task to complete successfully
in its respective upstream DAG before the downstream tasks in that branch are
allowed to run.
If you want the downstream DAG to wait for the entire upstream DAG to
finish instead of a specific task, you can set the external_task_id to None.
In the example above, we specify that the external task must have a state
of success for the downstream task to succeed, as defined by the allowed_
states and failed_states.
Also note that in the example above, the upstream DAGs (upstream_dag_1,
upstream_dag_2, and upstream_dag_3) and the downstream DAG
(external-task-sensor-dag) must have the same
start date and schedule interval. This is because the ExternalTaskSensor will
look for completion of the specified task or DAG at the same logical_date
(previously called execution_date). To look for completion of the external
task at a different date, you can make use of either of the execution_delta
or execution_date_fn parameters (these are described in more detail in the
documentation linked above).
Airflow API
This method is useful if your dependent DAGs live in different Airflow en-
vironments (more on this in the Cross-Deployment Dependencies section
below). The task triggering the downstream DAG will complete once the API
call is complete.
Using the API to trigger a downstream DAG can be implemented within a
DAG by using the SimpleHttpOperator as shown in the example DAG below:
with DAG(
    'api-dag',
    start_date=datetime(2021, 1, 1),
    max_active_runs=1,
    schedule_interval='@daily',
    catchup=False
) as dag:

    start_task = PythonOperator(
        task_id='starting_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'starting'}
    )

    api_trigger_dependent_dag = SimpleHttpOperator(
        task_id="api_trigger_dependent_dag",
        http_conn_id='airflow-api',
        endpoint='/api/v1/dags/dependent-dag/dagRuns',
        method='POST',
        headers={'Content-Type': 'application/json'},
        data=json_body
    )

    end_task = PythonOperator(
        task_id='end_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'ending'}
    )

    start_task >> api_trigger_dependent_dag >> end_task
This DAG has a similar structure to the TriggerDagRunOperator DAG above,
but instead uses the SimpleHttpOperator to trigger the dependent-dag using
the Airflow API. The graph view looks like this:
DAG Dependencies View
In Airflow 2.1, a new cross-DAG dependencies view was added to the Airflow
UI. This view shows all dependencies between DAGs in your Airflow environ-
ment as long as they are implemented using one of the following methods:
• Dataset dependencies
• The TriggerDagRunOperator
• The ExternalTaskSensor
When DAGs are scheduled depending on datasets, both the DAG containing
the producing task, as well as the dataset itself will be shown upstream of the
consuming DAG.
In Airflow 2.4 an additional Datasets tab was added, which shows all depen-
dencies between datasets and DAGs.
Cross-Deployment Dependencies
If your upstream and downstream DAGs live in different Airflow deployments,
the Airflow API method described above is the most straightforward way to
implement the dependency: the triggering task simply calls the other
deployment's API endpoint to start the downstream DAG run.
4. Dynamically
Generating DAGs
in Airflow
Overview
In Airflow, DAGs are defined as Python code. Airflow executes all Python
code in the DAG_FOLDER and loads any DAG objects that appear in globals().
The simplest way of creating a DAG is to write it as a static Python file.
However, sometimes manually writing DAGs isn’t practical. Maybe you have
hundreds or thousands of DAGs that do similar things with just a parameter
changing between them. Or perhaps you need a set of DAGs to load tables
but don’t want to manually update DAGs every time those tables change.
In these cases and others, it can make more sense to generate DAGs
dynamically.
Because everything in Airflow is code, you can dynamically generate DAGs
using Python alone. As long as a DAG object in globals() is created by
Python code that lives in the DAG_FOLDER, Airflow will load it. In this section,
we will cover a few of the many ways of generating DAGs. We will also dis-
cuss when DAG generation is a good option and some pitfalls to watch out
for when doing this at scale.
Single-File Methods
One method for dynamically generating DAGs is to have a single Python file
that generates DAGs based on some input parameter(s) (e.g., a list of APIs
or tables). An everyday use case for this is an ETL or ELT-type pipeline with
many data sources or destinations. It would require creating many DAGs that
all follow a similar pattern.
In the following examples, the single-file method is implemented differently
based on which input parameters are used for generating DAGs.
EXAMPLE:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def create_dag(dag_id, schedule, dag_number, default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id, schedule_interval=schedule, default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag
In this example, the input parameters can come from any source that the
Python script can access. We can then set a simple loop (range(1, 4)) to
generate these unique parameters and pass them to the global scope, there-
by registering them as valid DAGs within the Airflow scheduler:
# build a DAG for each number in range(1, 4)
for n in range(1, 4):
    dag_id = 'loop_hello_world_{}'.format(str(n))

    default_args = {'owner': 'airflow',
                    'start_date': datetime(2021, 1, 1)
                    }

    schedule = '@daily'
    dag_number = n

    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)
And if we look at the Airflow UI, we can see the DAGs have been created.
Success!
EXAMPLE:
We can retrieve the number of DAGs to generate from an Airflow Variable by
importing the Variable class and passing the value into our range. Because
default_var is set to 3, the interpreter still registers this file as valid
even if the Variable has not been created yet.
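A minimal sketch of that lookup, assuming the Variable is named dag_number (the name is illustrative):

from airflow.models import Variable

# default_var keeps the file parseable even if the Variable doesn't exist yet
number_of_dags = Variable.get('dag_number', default_var=3)
number_of_dags = int(number_of_dags)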
for n in range(1, number_of_dags):
    dag_id = 'hello_world_{}'.format(str(n))

    default_args = {'owner': 'airflow',
                    'start_date': datetime(2021, 1, 1)
                    }

    schedule = '@daily'
    dag_number = n
    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)
If we look at the scheduler logs, we can see that this variable was pulled into
the DAG and that 15 DAGs were added to the DagBag based on its value.
We can then go to the Airflow UI and see all of the new DAGs that have
been created.
EXAMPLE:
from airflow import DAG, settings
from airflow.models import Connection
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag


session = settings.Session()
conns = (session.query(Connection.conn_id)
                .filter(Connection.conn_id.ilike('%MY_DATABASE_CONN%'))
                .all())

for conn in conns:
    dag_id = 'connection_hello_world_{}'.format(conn[0])

    default_args = {'owner': 'airflow',
                    'start_date': datetime(2018, 1, 1)
                    }

    schedule = '@daily'
    dag_number = conn

    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)
Notice that, as before, we access the Models library to bring in the Connec-
tion class (as we did previously with the Variable class). We are also ac-
cessing the Session() class from settings, which will allow us to query the
current database session.
We can see that all of the connections that match our filter have now been
created as a unique DAG. The one connection we had which did not match
(SOME_OTHER_DATABASE) has been ignored.
Multiple-File Methods
EXAMPLE:
with dag:
    t1 = PostgresOperator(
        task_id='postgres_query',
        postgres_conn_id=connection_id,
        sql=querytoreplace)
Next, we create a dag-config folder that will contain a JSON config file
for each DAG. The config file should define the parameters that we noted
above: the DAG ID, schedule interval, and query to be executed.
{
    "DagId": "dag_file_1",
    "Schedule": "'@daily'",
    "Query": "'SELECT * FROM table1;'"
}
Finally, we write a Python script to create the DAG files based on the
template and the config files. The script loops through every config file in
the dag-config/ folder, makes a copy of the template in the dags/ folder, and
overwrites the placeholder parameters in that file with the values from the
config file.
import json
import os
import shutil
import fileinput

config_filepath = 'include/dag-config/'
dag_template_filename = 'include/dag-template.py'

for filename in os.listdir(config_filepath):
    f = open(config_filepath + filename)
    config = json.load(f)

    new_filename = 'dags/'+config['DagId']+'.py'
    shutil.copyfile(dag_template_filename, new_filename)

    for line in fileinput.input(new_filename, inplace=True):
        line = line.replace("dag_id", "'"+config['DagId']+"'")
        line = line.replace("scheduletoreplace", config['Schedule'])
        line = line.replace("querytoreplace", config['Query'])
        print(line, end="")
To generate our DAG files, we either run this script ad-hoc as part of our CI/
CD workflow, or we create another DAG that would run it periodically. Af-
ter running the script, our final directory would look like the example below,
where the include/ directory contains the files shown above, and the dags/
directory contains the two dynamically generated DAGs:
dags/
├── dag_file_1.py
├── dag_file_2.py
include/
├── dag-template.py
├── generate-dag-files.py
└── dag-config
    ├── dag1-config.json
    └── dag2-config.json
This is obviously a simple starting example that works only if all DAGs fol-
low the same pattern. However, it could be expanded upon to have dynamic
inputs for tasks, dependencies, different operators, etc.
DAG Factory
To use dag-factory, you can install the package in your Airflow environment
and create YAML configuration files for generating your DAGs. You can then
build the DAGs by calling the dag-factory.generate_dags() method in a
Python script, like this example from the dag-factory README:
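A sketch of that pattern, following the shape of the dag-factory README (the YAML path below is illustrative, and the exact API may differ between dag-factory versions):

from airflow import DAG
import dagfactory

# Point dag-factory at the YAML file that describes your DAGs
dag_factory = dagfactory.DagFactory('/usr/local/airflow/dags/config_file.yml')

dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())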
Scalability
• If the DAG parsing time (i.e., the time to parse all code in the DAG_
FOLDER) is greater than the Scheduler heartbeat interval, the scheduler
can get locked up, and tasks won’t get executed. If you are dynamically
generating DAGs and tasks aren’t running, this is a good metric to review
in the beginning of troubleshooting.
Upgrading to Airflow 2.0 to make use of the HA Scheduler should help with
these performance issues. But it can still take some additional optimization
work depending on the scale you’re working at. There is no single right way
to implement or scale dynamically generated DAGs. Still, the flexibility of
Airflow means there are many ways to arrive at a solution that works for a
particular use case.
5. Testing Airflow DAGs
Overview
One of the core principles of Airflow is that your DAGs are defined as Python
code. Because you can treat data pipelines like you would any other piece of
code, you can integrate them into a standard software development lifecycle
using source control, CI/CD, and automated testing.
Although DAGs are 100% Python code, effectively testing DAGs requires
accounting for their unique structure and relationship to other code and data
in your environment. This guide will discuss a couple of types of tests that we
would recommend to anybody running Airflow in production, including DAG
validation testing, unit testing, and data and pipeline integrity testing.
Note on test runners: Before we dive into different types of tests for
Airflow, we have a quick note on test runners. There are multiple test
runners available for Python, including unittest, pytest, and nose2.
The OSS Airflow project uses pytest, so we will do the same in this
section. However, Airflow doesn’t require using a specific test runner.
In general, choosing a test runner is a matter of personal preference
and experience level, and some test runners might work better than
others for a given use case.
DAG Validation Testing
DAG validation tests are designed to ensure that your DAG objects are de-
fined correctly, acyclic, and free from import errors.
These are things that you would likely catch if you were starting with the local
development of your DAGs. But in cases where you may not have access
to a local Airflow environment or want an extra layer of security, these tests
can ensure that simple coding errors don’t get deployed and slow down your
development.
DAG validation tests apply to all DAGs in your Airflow environment, so you
only need to create one test suite.
To test whether your DAG can be loaded, meaning there aren’t any syntax
errors, you can run the Python file:
1 python your-dag-file.py
Or to test for import errors specifically (which might be syntax related but
could also be due to incorrect package import paths, etc.), you can use
something like the following:
import pytest
from airflow.models import DagBag

def test_no_import_errors():
    dag_bag = DagBag()
    assert len(dag_bag.import_errors) == 0, "No Import Failures"
You may also use DAG validation tests to test for properties that you want to
be consistent across all DAGs. For example, if your team has a rule that all
DAGs must have two retries for each task, you might write a test like this to
enforce that rule:
def test_retries_present():
    dag_bag = DagBag()
    for dag in dag_bag.dags:
        retries = dag_bag.dags[dag].default_args.get('retries', [])
        error_msg = 'Retries not set to 2 for DAG {id}'.format(id=dag)
        assert retries == 2, error_msg
Unit Testing
Unit testing is a software testing method where small chunks of source code
are tested individually to ensure they function as intended. The goal is to iso-
late testable logic inside of small, well-named functions, for example:
1 def test_function_returns_5():
2 assert my_function(input) == 5
In the context of Airflow, you can write unit tests for any part of your DAG,
but they are most frequently applied to hooks and operators. All official Air-
flow hooks, operators, and provider packages have unit tests that must pass
before merging the code into the project. For an example, check out the
AWS S3Hook, which has many accompanying unit tests.
If you have your own custom hooks or operators, we highly recommend using
unit tests to check logic and functionality. For example, say we have a custom
operator that checks if a number is even:
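The operator itself might look something like this minimal sketch (the class and parameter names simply mirror the test file that follows):

from airflow.models.baseoperator import BaseOperator

class EvenNumberCheckOperator(BaseOperator):
    def __init__(self, my_operator_param, **kwargs):
        super().__init__(**kwargs)
        self.my_operator_param = my_operator_param

    def execute(self, context):
        # Returns True if the number is even, False otherwise
        return self.my_operator_param % 2 == 0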
We would then write a test_evencheckoperator.py file with unit tests like
the following:
import unittest
import pytest
from datetime import datetime
from airflow import DAG
from airflow.models import TaskInstance
from airflow.operators import EvenNumberCheckOperator

DEFAULT_DATE = datetime(2021, 1, 1)

class TestEvenNumberCheckOperator(unittest.TestCase):

    def setUp(self):
        super().setUp()
        self.dag = DAG('test_dag', default_args={'owner': 'airflow', 'start_date': DEFAULT_DATE})
        self.even = 10
        self.odd = 11

    def test_even(self):
        """Tests that the EvenNumberCheckOperator returns True for 10."""
        task = EvenNumberCheckOperator(my_operator_param=self.even, task_id='even', dag=self.dag)
        ti = TaskInstance(task=task, execution_date=datetime.now())
        result = task.execute(ti.get_template_context())
        assert result is True

    def test_odd(self):
        """Tests that the EvenNumberCheckOperator returns False for 11."""
        task = EvenNumberCheckOperator(my_operator_param=self.odd, task_id='odd', dag=self.dag)
        ti = TaskInstance(task=task, execution_date=datetime.now())
        result = task.execute(ti.get_template_context())
        assert result is False
Note that if your DAGs contain PythonOperators that execute your Python
functions, it is a good idea to write unit tests for those functions as well.
Mocking
Sometimes unit tests require mocking: the imitation of an external system,
dataset, or another object. For example, you might use mocking with an
Airflow unit test if you are testing a connection but don’t have access to the
metadata database. Another example could be testing an operator that exe-
cutes an external service through an API endpoint, but you don’t want to wait
for that service to run a simple test.
Many Airflow tests have examples of mocking. This blog post also has a helpful
section on mocking Airflow that may help you get started.
Data Integrity Testing
Data integrity tests are designed to prevent data quality issues from breaking
your pipelines or negatively impacting downstream systems. These tests
could also be used to ensure your DAG tasks produce the expected output
when processing a given piece of data. They are somewhat different in scope
than the code-related tests described in previous sections since your data is
not static like a DAG.
One straightforward way of implementing data integrity tests is to build them
directly into your DAGs. This allows you to use Airflow dependencies to man-
age any errant data in whatever way makes sense for your use case.
There are many ways you could integrate data checks into your DAG. One
method worth calling out is using Great Expectations (GE), an open-source
Python framework for data validations. You can make use of the Great Ex-
pectations provider package to easily integrate GE tasks into your DAGs. In
practice, you might have something like the following DAG, which runs an
Azure Data Factory pipeline that generates data then runs a GE check on the
data before sending an email.
    # Make connection to ADF, and run pipeline with parameter
    hook = AzureDataFactoryHook('azure_data_factory_conn')
    hook.run_pipeline(pipeline_name, parameters=params)

def get_azure_blob_files(blobname, output_filename):
    '''Downloads file from Azure blob storage
    '''
    azure = WasbHook(wasb_conn_id='azure_blob')
    azure.get_file(output_filename, container_name='covid-data', blob_name=blobname)


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5)
}

with DAG('adf_great_expectations',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False
         ) as dag:

    run_pipeline = PythonOperator(
        task_id='run_pipeline',
        python_callable=run_adf_pipeline,
        op_kwargs={'pipeline_name': 'pipeline1', 'date': yesterday_date}
    )

    download_data = PythonOperator(
        task_id='download_data',
        python_callable=get_azure_blob_files,
        op_kwargs={'blobname': 'or/'+ yesterday_date +'.csv', 'output_filename': data_file_path+'or_'+yesterday_date+'.csv'}
    )

    ge_check = GreatExpectationsOperator(
        task_id='ge_checkpoint',
        expectation_suite_name='azure.demo',
        batch_kwargs={
            'path': data_file_path+'or_'+yesterday_date+'.csv',
            'datasource': 'data__dir'
        },
        data_context_root_dir=ge_root_dir
    )

    send_email = EmailOperator(
        task_id='send_email',
        to='[email protected]',
        subject='Covid to S3 DAG',
        html_content='<p>The great expectations checks passed successfully. <p>'
    )
If the GE check fails, any downstream tasks will be skipped. Implementing
checkpoints like this allows you to either conditionally branch your pipeline
to deal with data that doesn’t meet your criteria or potentially skip all down-
stream tasks so problematic data won’t be loaded into your data warehouse
or fed to a model. For more information on conditional DAG design, check
out the documentation on Airflow Trigger Rules and our guide on branching
in Airflow.
It’s also worth noting that data integrity testing will work better at scale if
you design your DAGs to load or process data incrementally. We talk more
about incremental loading in our Airflow Best Practices guide. Still, in short,
processing smaller, incremental chunks of your data in each DAG Run en-
sures that any data quality issues have a limited blast radius and are easier to
recover from.
DAG Authoring
for Apache Airflow
The Astronomer Certification: DAG Authoring for Apache
Airflow gives you the opportunity to challenge yourself and show
the world your ability to create incredible data pipelines.
And don’t worry, we’ve also prepared a preparation course to
give you the best chance of success!
Concepts Covered:
• Variables
• Pools
• Trigger Rules
• DAG Dependencies
• Idempotency
• Dynamic DAGs
• DAG Best Practices
• DAG Versioning and much more
Get Certified
6. Debugging DAGs
7 Common Errors to Check when
Debugging Airflow DAGs
Apache Airflow is the industry standard for workflow orchestration. It’s an
incredibly flexible tool that powers mission-critical projects, from machine
learning model training to traditional ETL at scale, for startups and Fortune
50 teams alike.
Whether you’re new to Airflow or an experienced user, check out this list of
common errors and some corresponding fixes to consider.
Note: Following the Airflow 2.0 release in December of 2020, the
open-source project has addressed a significant number of pain
points
commonly reported by users running previous versions. We strongly
encourage your team to upgrade to Airflow 2.x.
You wrote a new DAG that needs to run every hour and you’re ready to turn it
on. You set an hourly interval beginning today at 2pm, setting a reminder to
check back in a couple of hours. You hop on at 3:30pm to find that your DAG
did in fact run, but your logs indicate that there was only one recorded exe-
cution at 2pm. Huh — what happened to the 3pm run?
Before you jump into debugging mode (you wouldn’t be the first), rest
assured that this is expected behavior. The functionality of the Airflow
Scheduler can be counterintuitive, but you’ll get the hang of it.
The two most important things to keep in mind about scheduling are:
• By design, an Airflow DAG will run at the end of its schedule_interval.
• Airflow operates in UTC by default.
Airflow’s Schedule Interval
As stated above, an Airflow DAG will execute at the completion of its sched-
ule_interval, which means one schedule_interval AFTER the start date.
An hourly DAG, for example, will execute its 2:00 PM run when the clock
strikes 3:00 PM. This happens because Airflow can’t ensure that all of the
data from 2:00 PM - 3:00 PM is present until the end of that hourly interval.
There are some data engineering use cases that are difficult or even impossible
to address with Airflow’s original scheduling method. Scheduling DAGs to skip
holidays, run only at certain times, or otherwise run on varying intervals can
cause major headaches if you’re relying solely on cron jobs or timedeltas.
This is why Airflow 2.2 introduced timetables as a more flexible way to define
schedules. Essentially, timetable is a DAG-level parameter that you can set to
an instance of a Timetable class that encodes your execution schedule.
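As a sketch, a DAG can opt into a timetable directly; the example below uses the built-in CronDataIntervalTimetable to run at 9:00 UTC on weekdays only (the dag_id and schedule are illustrative), while fully custom timetables are registered through an Airflow plugin:

from airflow import DAG
from airflow.timetables.interval import CronDataIntervalTimetable
import pendulum

with DAG(
    dag_id='timetable_example',
    start_date=pendulum.datetime(2021, 1, 1, tz='UTC'),
    # 9:00 AM UTC, Monday through Friday
    timetable=CronDataIntervalTimetable('0 9 * * 1-5', timezone=pendulum.timezone('UTC')),
    catchup=False,
) as dag:
    ...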
Airflow Time Zones
Airflow stores datetime information in UTC internally and in the database. This
behavior is shared by many databases and APIs, but it’s worth clarifying.
You should not expect your DAG executions to correspond to your local time-
zone. If you’re based in US Pacific Time, a DAG run of 19:00 will correspond to
12:00 local time.
In recent releases, the community has added more time zone-aware features
to the Airflow UI. For more information, refer to Airflow documentation.
It’s intuitive to think that if you tell your DAG to start “now” that it’ll execute
immediately. But that’s not how Airflow reads datetime.now().
For a DAG to be executed, the start_date must be a time in the past, other-
wise Airflow will assume that it’s not yet ready to execute. When Airflow eval-
uates your DAG file, it interprets datetime.now() as the current timestamp
(i.e. NOT a time in the past) and decides that it’s not ready to run.
To properly trigger your DAG to run, make sure to insert a fixed time in the
past and set catchup=False if you don’t want to perform a backfill.
Note: You can manually trigger a DAG run via Airflow’s UI directly on
your dashboard (it looks like a “Play” button). A manual trigger exe-
cutes immediately and will not interrupt regular scheduling, though
it will be limited by any concurrency configurations you have at the
deployment level, DAG level, or task level. When you look at corre-
sponding logs, the run_id will show manual__ instead of scheduled__.
If your Airflow UI is entirely inaccessible via web browser, you likely have a
Webserver issue.
If you’ve already refreshed the page once or twice and continue to see a 503
error, read below for some Webserver-related guidelines.
A 503 error might indicate an issue with your Deployment’s Webserver, which
is the Airflow component responsible for rendering task state and task execu-
tion logs in the Airflow UI. If it’s underpowered or otherwise experiencing an
issue, you can expect it to affect UI loading time or web browser accessibili-
ty.
In our experience, a 503 often indicates that your Webserver is crashing.
If you push up a deploy and your Webserver takes longer than a few seconds
to start, it might hit a timeout period (10 secs by default) that “crashes” the
Webserver before it has time to spin up. That triggers a retry, which crashes
again, and so on and so forth.
Raising the Webserver timeout settings (web_server_master_timeout and
web_server_worker_timeout in the [webserver] section of airflow.cfg) will tell
your Airflow Webserver to wait a bit longer to load before it hits you with a
503 (a timeout). You might still experience slow loading times if your
Webserver is underpowered, but you'll likely avoid hitting a 503.
Avoid Making Requests Outside of an Operator
When Airflow interprets a file to look for any valid DAGs, it first runs all code
at the top level (i.e. outside of operators). Even if the operator itself only
gets executed at execution time, everything outside of an operator is called
every heartbeat, which can be very taxing on performance.
We'd recommend taking the logic you currently have running outside of an
operator and moving it inside of a PythonOperator if possible, as shown in the
sketch below.
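For example (the URL here is purely illustrative):

import requests
from airflow.operators.python import PythonOperator

# Bad: runs every time the scheduler parses this file
# config = requests.get('https://example.com/config').json()

def get_config():
    # Good: only runs when the task itself is executed
    return requests.get('https://example.com/config').json()

get_config_task = PythonOperator(
    task_id='get_config',
    python_callable=get_config
)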
If your sensor tasks are failing, it might not be a problem with your task.
It might be a problem with the sensor itself.
By default, Airflow sensors run continuously and occupy a task slot in
perpetuity until they find what they're looking for, often causing concurrency
issues. Unless you never have more than a few tasks running concurrently, we
recommend avoiding sensors unless you know they won't take long to exit.
For example, if a worker can only run X number of tasks simultaneously and
you have three sensors running, then you’ll only be able to run X-3 tasks at
any given point. Keep in mind that if you’re running a sensor at all times, that
limits how and when a scheduler restart can occur (or else it will fail
the sensor).
Depending on your use case, we'd suggest considering alternatives such as
running sensors in reschedule mode (so the task slot is released between
pokes) or, on newer Airflow versions, using deferrable operators.
Update Concurrency Settings
The potential root cause for a bottleneck is specific to your setup. For ex-
ample, are you running many DAGs at once, or one DAG with hundreds of
concurrent tasks?
Most users can set parameters in Airflow's airflow.cfg file. If you're using
Astro, you can also set environment variables via the Astro UI or your
project's Dockerfile. We've formatted these settings as parameters for
readability; the environment variables for these settings are formatted as
AIRFLOW__CORE__PARAMETER_NAME. For all default values, refer here.
Parallelism
parallelism determines how many task instances can run in parallel across
all DAGs given your environment resources. Think of this as “maximum active
tasks anywhere.” To increase the limit of tasks set to run in parallel, set this
value higher than its default of 32.
DAG Concurrency
dag_concurrency (renamed max_active_tasks_per_dag in Airflow 2.2+) determines
how many task instances are allowed to run concurrently within a single DAG.
The default value is 16.
Max Active Runs per DAG
max_active_runs_per_dag determines how many active runs of a single DAG are
allowed at once. The default value is 16.
Worker Concurrency
worker_concurrency determines how many tasks a single Celery worker can run at
any given time. It's important to note that this number will naturally be
limited by dag_concurrency. If you have 1 Worker and want it to match your
Deployment's capacity, worker_concurrency should be equal to parallelism. The
default value is 16.
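As a rough sketch, these settings can be set either in airflow.cfg or as environment variables (the values below are illustrative, not recommendations; note that worker_concurrency lives in the [celery] section, so its environment variable is prefixed with AIRFLOW__CELERY__ rather than AIRFLOW__CORE__):

[core]
parallelism = 64
dag_concurrency = 32
max_active_runs_per_dag = 16

[celery]
worker_concurrency = 32

AIRFLOW__CORE__PARALLELISM=64
AIRFLOW__CORE__DAG_CONCURRENCY=32
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=16
AIRFLOW__CELERY__WORKER_CONCURRENCY=32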
Try Scaling Up Your Scheduler or Adding a Worker
If tasks are getting bottlenecked and your concurrency configurations are al-
ready optimized, the issue might be that your Scheduler is underpowered or
that your Deployment could use another worker. If you’re running on Astro,
we generally recommend 5 AU (0.5 CPUs and 1.88 GiB of memory) as the
default minimum for the Scheduler and 10 AU (1 CPUs and 3.76 GiB of mem-
ory) for workers.
Whether or not you scale your current resources or add an extra Celery
Worker depends on your use case, but we generally recommend the follow-
ing:
• If you’re running a relatively high number of light tasks across DAGs and
at a relatively high frequency, you’re likely better off having 2 or 3 “light”
workers to spread out the work.
• If you’re running fewer but heavier tasks at a lower frequency, you’re like-
ly better off with a single but “heavier” worker that can more efficiently
execute those tasks.
If you're missing logs, you might see something like this under "Log by
attempts" in the Airflow UI:

Failed to fetch log file from worker. Invalid URL 'http://:8793/log/staging_to_presentation_pipeline_v5/redshift_to_s3_Order_Payment_17461/2019-01-11T00:00:00+00:00/1.log': No host supplied
If your tasks are slower than usual to get scheduled, you might need to up-
date Scheduler settings to increase performance and optimize your environ-
ment.
Just like with concurrency settings, users can set parameters in Airflow’s air-
flow.cfg file. If you’re using Astro, you can also set environment variables via
the Astro UI or your project’s Dockerfile. We’ve formatted these settings as
parameters for readability – the environment variables for these settings are
formatted as AIRFLOW__CORE__PARAMETER_NAME. For all default values, refer
here.
Pro-tip: Scheduler performance was a critical part of the Airflow 2
release and has seen significant improvements since December of
2020. If you are experiencing Scheduler issues, we strongly recom-
mend upgrading to Airflow 2.x. For more information, read our blog
post: The Airflow 2.0 Scheduler.
Error Notifications in Airflow
Overview
A key question when using any data orchestration tool is “How do I know
if something has gone wrong?” Airflow users always have the option to
check the UI to see the status of their DAGs, but this is an inefficient way
of managing errors systematically, especially if certain failures need to be
addressed promptly or by multiple team members. Fortunately, Airflow has
built-in notification mechanisms that can be leveraged to configure error
notifications in a way that works for your team.
In this section, we will cover the basics of Airflow notifications and how to
set up common notification mechanisms including email, Slack, and SLAs.
We will also discuss how to make the most of Airflow alerting when using the
Astronomer platform.
Airflow has an incredibly flexible notification system. Having your DAGs de-
fined as Python code gives you full autonomy to define your tasks and notifi-
cations in whatever way makes sense for your use case.
In this section, we will cover some of the options available when working with
notifications in Airflow.
Notification Levels
• Sometimes it makes sense to standardize notifications across your entire
DAG. Notifications set at the DAG level will filter down to each task in
the DAG. These notifications are usually defined in default_args.
• For example, in the following DAG, email_on_failure is set to True,
meaning any task in this DAG’s context will send a failure email to all
addresses in the email array.
from datetime import datetime
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 30),
    'email': ['[email protected]'],
    'email_on_failure': True
}

with DAG('sample_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    ...
In contrast, it’s sometimes useful to have notifications only for certain tasks.
The BaseOperator that all Airflow Operators inherit from has support for
built-in notification arguments, so you can configure each task individually as
needed. In the DAG below, email notifications are turned off by default at
the DAG level but are specifically enabled for the will_email task.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 30),
    'email_on_failure': False,
    'email': ['[email protected]'],
    'retries': 1
}

with DAG('sample_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    wont_email = DummyOperator(
        task_id='wont_email'
    )

    will_email = DummyOperator(
        task_id='will_email',
        email_on_failure=True
    )
Notification Triggers
The most common trigger for notifications in Airflow is a task failure. However,
notifications can be set based on other events, including retries and successes.
Emails on retries can be useful for debugging indirect failures; if a task need-
ed to retry but eventually succeeded, this might indicate that the problem
was caused by extraneous factors like a load on an external system. To turn
on email notifications for retries, simply set the email_on_retry parameter
to True as shown in the DAG below.
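A minimal sketch of what that looks like (the email address is a placeholder):

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 30),
    'email': ['you@example.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('sample_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    ...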
When working with retries, you should configure a retry_delay. This is the
amount of time between a task failure and when the next try will begin. You
can also turn on retry_exponential_backoff, which progressively increases
the wait time between retries. This can be useful if you expect that extrane-
ous factors might cause failures periodically.
Finally, you can also set any task to email on success by setting the email_
on_success parameter to True. This is useful when your pipelines have con-
ditional branching, and you want to be notified if a certain path is taken (i.e.
certain tasks get run).
Custom Notifications
The email notification parameters shown in the sections above are an exam-
ple of built-in Airflow alerting mechanisms. These simply have to be turned
on and don’t require any configuration from the user.
You can also define your own notifications to customize how Airflow alerts you
about failures or successes. The most straightforward way of doing this is by
defining on_failure_callback and on_success_callback Python functions.
These functions can be set at the DAG or task level, and the functions will be
called when a failure or success occurs respectively. For example, the following
DAG has a custom on_failure_callback function set at the DAG level and an
on_success_callback function for just the success_task.
def custom_failure_function(context):
    "Define custom failure notification behavior"
    dag_run = context.get('dag_run')
    task_instances = dag_run.get_task_instances()
    print("These task instances failed:", task_instances)

def custom_success_function(context):
    "Define custom success notification behavior"
    dag_run = context.get('dag_run')
    task_instances = dag_run.get_task_instances()
    print("These task instances succeeded:", task_instances)

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 30),
    'on_failure_callback': custom_failure_function,
    'retries': 1
}

with DAG('sample_dag',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    failure_task = DummyOperator(
        task_id='failure_task'
    )

    success_task = DummyOperator(
        task_id='success_task',
        on_success_callback=custom_success_function
    )
Note that custom notification functions can be used in addition to email
notifications.
Email Notifications
Email notifications are a native feature in Airflow and are easy to set up. As
shown above, the email_on_failure and email_on_retry parameters can be
set to True either at the DAG level or task level to send emails when tasks
fail or retry. The email parameter can be used to specify which email(s) you
want to receive the notification. If you want to enable email alerts on all fail-
ures and retries in your DAG, you can define that in your default arguments
like this:
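For example, a minimal default_args sketch (the address is a placeholder):

default_args = {
    'owner': 'airflow',
    'email': ['you@example.com'],
    'email_on_failure': True,
    'email_on_retry': True
}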
In order for Airflow to send emails, you need to configure an SMTP server in
your Airflow environment. You can do this by filling out the SMTP section of
your airflow.cfg like this:
[smtp]
# If you want airflow to send emails on retries, failure, and you want to use
# the airflow.utils.email.send_email_smtp function, you have to configure an
# smtp server here
smtp_host = your-smtp-host.com
smtp_starttls = True
smtp_ssl = False
# Uncomment and set the user/pass settings if you want to use SMTP AUTH
# smtp_user =
# smtp_password =
smtp_port = 587
smtp_mail_from = [email protected]
You can also set these values using environment variables. In this case, all
parameters are preceded by AIRFLOW__SMTP__, consistent with Airflow envi-
ronment variable naming convention. For example, smtp_host can be speci-
fied by setting the AIRFLOW__SMTP__SMTP_HOST variable. For more on Airflow
email configuration, check out the Airflow documentation.
Note: If you are running on the Astronomer platform, you can set up
SMTP using environment variables since the airflow.cfg cannot be
directly edited. For more on email alerting on the Astronomer plat-
form, see the ‘Notifications on Astronomer’ section below.
Customizing Email Notifications
By default, email notifications will be sent in a standard format as defined
in the email_alert() and get_email_subject_content() methods of the
TaskInstance class. The default email content is defined like this:
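In Airflow 2.x, the defaults are approximately the following (paraphrased
from the Airflow source; the exact content varies by version):

default_subject = 'Airflow alert: {{ti}}'
default_html_content = (
    'Try {{try_number}} out of {{max_tries + 1}}<br>'
    'Exception:<br>{{exception_html}}<br>'
    'Log: <a href="{{ti.log_url}}">Link</a><br>'
    'Host: {{ti.hostname}}<br>'
    'Mark success: <a href="{{ti.mark_success_url}}">Link</a><br>'
)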
To see the full method, check out the source code here.
You can override these defaults with your own content by setting the
subject_template and/or html_content_template variables in your airflow.cfg
to the paths of your Jinja template files for the subject and content, respectively.
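For example, in Airflow 2.x these settings live in the [email] section of
airflow.cfg; the template paths below are hypothetical:

[email]
# Jinja templates used to render the alert subject and body
subject_template = /usr/local/airflow/include/email_subject_template.j2
html_content_template = /usr/local/airflow/include/email_content_template.j2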
Slack Notifications
There are multiple ways you can send messages to Slack from Airflow. In this
section, we will cover how to use the Slack Provider’s SlackWebhookOperator
with a Slack Webhook to send messages, since this is Slack’s recommended way
of posting messages from apps. To get started, follow these steps:
1. From your Slack workspace, create a Slack app and an incoming Webhook.
The Slack documentation here walks through the necessary steps.
Make a note of the Slack Webhook URL to use in your Python function.
2. Create an Airflow connection to provide your Slack Webhook to Airflow.
Choose an HTTP connection type (if you are using Airflow 2.0 or greater,
you will need to install the apache-airflow-providers-http provider for
the HTTP connection type to appear in the Airflow UI). Enter https://
hooks.slack.com/services/ as the Host, and enter the remainder of
your Webhook URL from the last step as the Password (formatted as
T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX).
3. Create a Python function to use as your on_failure_callback method.
Within the function, define the information you want to send and invoke
the SlackWebhookOperator to send the message. Here’s an example:
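A minimal sketch, assuming the Slack connection from step 2 is named
slack_webhook (argument names can differ between provider versions):

from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

def slack_failure_notification(context):
    # Build a message from the task instance available in the callback context
    ti = context.get('task_instance')
    slack_msg = (
        ":red_circle: Task Failed.\n"
        f"*Task*: {ti.task_id}\n"
        f"*Dag*: {ti.dag_id}\n"
        f"*Execution Time*: {context.get('execution_date')}\n"
        f"*Log Url*: {ti.log_url}"
    )
    failed_alert = SlackWebhookOperator(
        task_id='slack_notification',
        http_conn_id='slack_webhook',
        message=slack_msg
    )
    return failed_alert.execute(context=context)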
Note: In Airflow 2.0 or greater, to use the SlackWebhookOperator you will
need to install the apache-airflow-providers-slack provider package.
States
One of the key pieces of data stored in Airflow’s metadata database is State.
States are used to keep track of what condition task instances and DAG Runs are
in. In the screenshot below, we can see how states are represented in the Airflow
UI:
Task States
• None (Light Blue): No associated state. Syntactically, set as Python None.
• Queued (Gray): The task is waiting to be executed, set as queued.
• Scheduled (Tan): The task has been scheduled to run.
• Running (Lime): The task is currently being executed.
• Failed (Red): The task failed.
• Success (Green): The task was executed successfully.
• Skipped (Pink): The task has been skipped due to an upstream condition.
• Shutdown (Blue): The task was externally requested to shut down while it was running.
• Removed (Light Grey): The task has been removed.
• Retry (Gold): The task is up for retry.
• Upstream Failed (Orange): The task will not run because of a failed
upstream dependency.
Airflow SLAs
Airflow SLAs are a type of notification that you can use if your tasks are tak-
ing longer than expected to complete. If a task takes longer than a maximum
amount of time to complete as defined in the SLA, the SLA will be missed
and notifications will be triggered. This can be useful in cases where you have
potentially long-running tasks that might require user intervention after a
certain period of time or if you have tasks that need to complete by a certain
deadline.
Note that exceeding an SLA will not stop a task from running. If you want
tasks to stop running after a certain time, try using timeouts instead.
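As a point of comparison, here is a minimal sketch of a timeout using the
standard execution_timeout task argument (the DAG, task, and function names
are illustrative):

from datetime import datetime, timedelta
import time

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def slow_function(**kwargs):
    time.sleep(600)

with DAG('timeout-dag',
         start_date=datetime(2021, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:

    # Unlike an SLA miss, exceeding execution_timeout stops the task and marks it failed
    long_running_task = PythonOperator(
        task_id='long_running_task',
        python_callable=slow_function,
        execution_timeout=timedelta(minutes=5)
    )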
You can set an SLA for all tasks in your DAG by defining 'sla' as a default
argument, as shown in the DAG below:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import time

def my_custom_function(ts, **kwargs):
    print("task is sleeping")
    time.sleep(40)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': '[email protected]',
    'email_on_retry': False,
    'sla': timedelta(seconds=30)
}

# Using a DAG context manager, you don't have to specify the dag property of each task
with DAG('sla-dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval=timedelta(minutes=2),
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(
        task_id='start'
    )

    t1 = DummyOperator(
        task_id='end'
    )

    sla_task = PythonOperator(
        task_id='sla_task',
        python_callable=my_custom_function
    )

    t0 >> sla_task >> t1
SLAs have some unique behaviors that you should consider before implementing
them:
• SLAs are relative to the DAG execution date, not the task start time.
For example, in the DAG above, the sla_task will miss the 30-second
SLA because it takes at least 40 seconds to complete. The t1 task will
also miss the SLA, because it is executed more than 30 seconds after
the DAG execution date. In that case, the sla_task will be considered
"blocking" to the t1 task.
• SLAs will only be evaluated on scheduled DAG Runs. They will not be
evaluated on manually triggered DAG Runs.
• SLAs can be set at the task level if a different SLA is required for each
task. In this case, all task SLAs are still relative to the DAG execution
date. For example, in the DAG below, t1 has an SLA of 500 seconds.
If the upstream tasks (t0 and sla_task) take a combined 450 seconds
to complete, and t1 takes 60 seconds to complete, then t1 will miss its
SLA even though the task itself took well under 500 seconds, because it
finished 510 seconds after the DAG execution date.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import time

def my_custom_function(ts, **kwargs):
    print("task is sleeping")
    time.sleep(40)

# Default settings applied to all tasks
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': '[email protected]',
    'email_on_retry': False
}

# Using a DAG context manager, you don't have to specify the dag property of each task
with DAG('sla-dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval=timedelta(minutes=2),
         default_args=default_args,
         catchup=False
         ) as dag:

    t0 = DummyOperator(
        task_id='start',
        sla=timedelta(seconds=50)
    )

    t1 = DummyOperator(
        task_id='end',
        sla=timedelta(seconds=500)
    )

    sla_task = PythonOperator(
        task_id='sla_task',
        python_callable=my_custom_function,
        sla=timedelta(seconds=5)
    )

    t0 >> sla_task >> t1
Any SLA misses will be shown in the Airflow UI. You can view them by going
to Browse → SLA Misses, which looks something like this:
If you configured an SMTP server in your Airflow environment, you will also
receive an email with notifications of any missed SLAs.
Note that there is no functionality to disable email alerting for SLAs. If you
have an 'email' list defined and an SMTP server configured in your Airflow
environment, an email will be sent to those addresses for each DAG Run
that has missed SLAs.
Notifications on Astronomer
If you are running Airflow on the Astronomer platform, you have multiple
options for managing your Airflow notifications. All of the methods above for
sending task notifications from Airflow are easily implemented on Astrono-
mer. Our documentation discusses how to leverage these notifications on the
platform, including how to set up SMTP to enable email alerts.
Thank you
We hope you've enjoyed our guide to DAGs. Please follow us on
Twitter and LinkedIn, and share any feedback you have.
Get Started