
S - Dagster

Note
notable notes
- search along with the example code to see example usage of a function; the Dagster documentation gives very vague instructions
notable resources
- Dagster documentation: link
- Dagster Essentials course: link
- fully featured project: link
1. Theory
1.1. wtf is orchestration
- orchestration is a tool that helps automate, coordinate, and manage complex workflows, data pipelines, or processes. Core features:
  - Directed Acyclic Graphs (DAGs): a data structure that visually represents the steps in a pipeline
  - Scheduling and Workflow Management: at what time and in what order the steps are executed
  - Error Handling and Retry Mechanisms: how to behave when an error occurs at a specific step
  - Monitoring and Logging
  - ...
1.2. orchestrating approaches (OA) and Dagster's OA
- task-centric: focuses on managing and coordinating the execution of tasks. It focuses on the hows and less on the whats.
- asset-centric: assets are what we call the outputs made by workflows. Asset-centric workflows make it easy to, at a glance, focus on the whats and less on the hows.
- advantages of asset-centric compared to task-centric:
  - easily understand the data lineage and how data assets relate to each other
  - allows reusing assets without changing an existing sequence of tasks
  - tells you exactly why assets are out-of-date, whether it's late upstream data or errors in code
  - ...
- Dagster uses the asset-centric approach
1.3. relationship between assets
- Asset dependencies can be:
  - Downstream, which means an asset depends on another asset
  - Upstream, which means an asset is depended on by another asset (see the sketch below)
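A minimal sketch of these relationships; the asset names are made up for illustration. Here raw_orders is upstream of cleaned_orders, and cleaned_orders is downstream of raw_orders:

from dagster import asset

@asset
def raw_orders():
    # upstream asset: produces raw data
    return [120, 87, 403]

@asset
def cleaned_orders(raw_orders):
    # downstream asset: dagster passes raw_orders' return value in as input
    return [order for order in raw_orders if order < 400]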
2. Installation
- pip install 'dagster~=1.4'
3. Operation with dagster (and more theory)
3.1. web UI
- Inote: the web server by default creates a temporary folder to save changes made via actions on the server; when the server goes down, this folder is removed
- Inote: the asset UI is straightforward and easy to remember; reading the concept definitions and code implementations in the following sections to understand how dagster behaves and getting familiar with the keywords is enough, then just get familiar with the UI (or Google search) and check the following notes if you hit any problem
- note: each time you modify the code, you should `reload definitions`
- note: failed-asset information is placed in the asset log
- note: values printed out by each asset are placed in the asset log
3.2. run dagster as a single file
- dagster dev -f <file>.py
> p_remind: this file must contain a Definitions declaration (covered later)
3.3. recommended project structure and basic actions with a project
- Create a project skeleton: dagster project scaffold --name my-project-name
- Install the project's dependencies: pip install -e ".[dev]"
  - -e == --editable : installs the project's dependencies and the project itself as a module in editable mode
    note: the real effect of -e is not clear; just install with this command to be safe
    additional info (not checked in action): by using -e, you'll only need to reload definitions when adding new assets or other Dagster objects, such as schedules, to your project
- Components in the project (see the sketch after this list):
  - Add python dependencies: add the package name to /setup.py (where to put it is straightforward, or ask ChatGPT)
  - /.env: environment variables, described in a later separate section
  - /my-project-name: folder that contains the dagster code as a python module
  - /my-project-name/__init__.py : imports and combines the stuff in /my-project-name/ into a Definitions declaration; this is called a Code Location (covered later)
  - other subfolders with relative references to __init__.py: resources/, jobs/, assets/, ...
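A sketch of the resulting layout, assuming the folder names used above (the scaffold may name things slightly differently):

my-project-name/
├── setup.py                 # python dependencies go here
├── .env                     # environment variables
└── my-project-name/         # dagster code as a python module
    ├── __init__.py          # Definitions declaration (the Code Location)
    ├── assets/
    ├── jobs/
    ├── resources/
    └── schedules/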
- Run the dagster project (as a module): dagster dev
  - note: this command is shorthand for opening multiple subservices with one command (details about dagster services are covered in a later section)
  - by default it opens at localhost port 3000
3.4. .env file
- Inote: this section is not tested and is missing core content; read and extract from the documentation when having the chance: link
- env variables in dagster are the same concept as in many other programming languages and tools (EX: Node.js)
- conventionally located at /.env
- how to use
  - approach 1:
    import os
    os.getenv("DUCKDB_DATABASE") # assuming DUCKDB_DATABASE is a var defined in /.env
  - approach 2:
    from dagster import EnvVar
    EnvVar("DUCKDB_DATABASE")
    Inote: there seem to be some conflicts and conventions when using EnvVar: EnvVar can only be used in resources and some other places; check this doc example from dagster: link
  - the difference:
    - EnvVar fetches the environment variable's value every time a run starts (a deployment can have many runs)
    - os.getenv fetches the environment variable when the code location is loaded, so updating the variable requires reloading the code location/definitions
    - By using EnvVar instead of os.getenv, you can dynamically customize a resource's configuration. For example, you can change which DuckDB database is being used without having to restart Dagster's web server.
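A minimal sketch contrasting the two approaches, assuming DUCKDB_DATABASE is defined in /.env:

import os
from dagster import EnvVar
from dagster_duckdb import DuckDBResource

# resolved once, when the code location is loaded:
static_resource = DuckDBResource(database=os.getenv("DUCKDB_DATABASE", ""))

# resolved at the start of every run, so editing /.env takes effect
# without restarting the web server:
dynamic_resource = DuckDBResource(database=EnvVar("DUCKDB_DATABASE"))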
3.5. Resource
- Resources are the tools and services you use to make assets; a resource can be an API connection, a database connection, ...
- resources are conventionally located at /<project_name>/resources/__init__.py
- syntax: (example with DuckDB)
  from dagster import EnvVar
  from dagster_duckdb import DuckDBResource
  database_resource = DuckDBResource(database=EnvVar("DUCKDB_DATABASE")) # DUCKDB_DATABASE holds the path to the database
  Inote: the syntax for declaring the resource in the Definitions and using it in an asset is described in the respective later sections
- Doc on integrating other kinds of resources into the dagster definitions: link

3.6. Jobs and Schedule

- Jobs are a Dagster utility to take a slice of your asset graph and focus specifically on running materializations of those assets.
> Jobs are conventionally defined at /<project_name>/jobs/__init__.py
> syntax:
from dagster import AssetSelection, define_asset_job
from ..partitions import monthly_partition
my_assets = AssetSelection.keys("<asset_name>") # pass one or more asset key strings
# my_assets = AssetSelection.all() - other_assets
my_job = define_asset_job(
    name="my_job",
    selection=my_assets,
    partitions_def=monthly_partition,
)
# note: partitions are introduced in a later section
- Cron expression: same as crontab in Linux, not covered here
- Schedules are objects that manage the time to run jobs
> Schedules are conventionally defined at /<project_name>/schedules/__init__.py
> syntax:
from dagster import ScheduleDefinition
from ..jobs import my_job
my_schedule = ScheduleDefinition(job=my_job, cron_schedule="0 0 5 * *") # crontab expression: 00:00 on the 5th of every month
- Inote: the syntax for declaring the jobs and schedules in the Definitions is described in the respective section
- mock test: the action of manually running a schedule to test whether the schedule runs correctly, accomplished via the dagster web UI (simple, not covered)
3.6.1. Partitions
- Inote: this section seems to be missing core content; recheck the documentation when having a chance
- conventionally defined at /<project_name>/partitions/__init__.py
- Partitions are a way to split your data into smaller, easier-to-use chunks; a partition is usually divided by date (EX: each month of the year is a partition). Some notable benefits:
  - split your data into smaller, easier-to-use chunks
  - treat partitioned assets differently to obtain the best efficiency (EX: store recent orders in hot storage and older orders in cheaper, cold storage)
  - distribute partitions across multiple servers or storage systems and run multiple partitions in parallel
- Backfilling is the process of running partitions for assets that either don't exist (EX: not run yet because of the first-time deploy of a pipeline) or updating existing records (EX: when you've changed the logic of an asset and need to update historical data with the new logic)
- syntax: (example defining a monthly partition over a specific range of time)
from dagster import MonthlyPartitionsDefinition
monthly_partition = MonthlyPartitionsDefinition(start_date="2023-01-01", end_date="2023-12-01")
# for key in monthly_partition.get_partition_keys():
#     print(key)
# output (as of 2023/11): "2023-01-01", "2023-02-01", ..., "2023-10-01"
# note: this example was written in 2023/11; a monthly partition only appears once its window
# has fully elapsed, so the November partition does not exist yet. end_date is exclusive, so
# the final partition of this definition will be November 2023.
- Inote: the syntax for adding the partition setup to assets, jobs and the Definitions is described in the respective sections

3.6.1.1.1.1 Q: why, when the partition is already declared on the asset, do we repeatedly declare it on the schedule (and job), while some assets in these schedules don't use it at all?
> A??: it seems that only when you add it like this does running the schedule activate the partitions in those assets
3.7. IOManager *need-update
- An IOManager is used to handle an asset's return value (output) and the upstream assets' return values (inputs to the current asset)
- syntax:

# define a CSVIOManager in resources/<io file>
from dagster import IOManager, InputContext, OutputContext
import pandas as pd

class CSVIOManager(IOManager):
    def handle_output(self, context: OutputContext, obj: pd.DataFrame) -> None:
        # save the asset's return value as a CSV named after the asset key
        file_name = context.asset_key.path[-1]
        obj.to_csv(f"/tmp/{file_name}.csv", index=False)

    def load_input(self, context: InputContext) -> pd.DataFrame:
        # load an upstream asset's value back from its CSV file
        # (required: IOManager is abstract without it)
        file_name = context.asset_key.path[-1]
        return pd.read_csv(f"/tmp/{file_name}.csv")

Inote: a config is needed to tell the asset to use the implemented IO setup; this is described in the `asset` section
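A minimal usage sketch, assuming the resource key "csv_io_manager" (registering the resource in the Definitions is covered later):

from dagster import asset
import pandas as pd

# in Definitions: resources={"csv_io_manager": CSVIOManager()}
@asset(io_manager_key="csv_io_manager")
def my_report() -> pd.DataFrame:
    # handle_output will write this to /tmp/my_report.csv
    return pd.DataFrame({"a": [1, 2]})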
3.8. Sensor
- Sensors are a way to monitor a specific event and create runs based on it. Sensors continuously execute checking logic to know whether to kick off a run; by default, a sensor polls every 30s
- Sensors are commonly used in situations where you want to materialize an asset after something happens:
  - a new file arrives
  - another asset has been materialized elsewhere
- A sensor cursor is a stored value used to manage the state of the sensor:
  - store the ID of the last fetched record, keep track of which requests it has already made a report for
  - where the computation last left off
> example: the sensor retrieves all the file names in the data/requests directory, compares them to the list of files it already looked at (stored in its stateful cursor), updates the cursor with the new files, and kicks off a new run for each of those files (see the sketch after this list)
- syntax: sensor lecture in the Dagster Essentials course: link
> summary of the process:
  - problem: stakeholders request a report: they want to know how trips changed within a specified time range
  - implementation process: create a class that defines the configuration (info) of a request → write assets that create the report based on the received request info → create a job containing the assets related to the request → define a sensor that checks for the event and runs the job → register the sensor in the Definitions
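A minimal sketch of the file-watching sensor described above, assuming a hypothetical report_job and the data/requests directory:

import json
import os
from dagster import RunRequest, sensor
from ..jobs import report_job  # hypothetical job built around the report assets

@sensor(job=report_job)
def report_sensor(context):
    # the cursor stores the file names already handled, serialized as JSON
    seen = set(json.loads(context.cursor)) if context.cursor else set()
    current = set(os.listdir("data/requests"))
    for file_name in sorted(current - seen):
        # one run per new file; run_key deduplicates repeated requests
        yield RunRequest(run_key=file_name)
    context.update_cursor(json.dumps(sorted(current)))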
3.9. Asset
3.9.1. General
- Assets are conventionally implemented at /<project_name>/assets/<file_name>.py
- An asset's name should be a noun; the asset name by default is the function name (see the asset syntax)
- Asset key (formally: AssetKey): the key that uniquely identifies the asset in Dagster; by default it is the function name (when a key prefix is not used, covered later)
- (not checked) the execution location of an asset function is the project home folder, but the import paths are relative to the current asset file
3.9.2. general syntax
3.9.2.1. Asset with a single resource
from dagster import asset, Output, Definitions, AssetIn, AssetOut, multi_asset
from ..partitions import my_partition
from dagster_duckdb import DuckDBResource

@asset(
    deps=["<upstream_asset_name>"],
    partitions_def=my_partition,
    io_manager_key="minio_io_manager", # set the IOManager strategy for the current asset
    required_resource_keys={"mysql_io_manager"}, # require a resource to use, in this case "mysql_io_manager"; we can then use it via context.resources.mysql_io_manager.<resource feature> (see the usage example in the function body)
    name="asset_name",
    key_prefix=["dir1", "dir2"], # in the UI, the asset is placed in the folder tree dir1/dir2/asset_name
    metadata=<dict_of_info_about_asset>, # a dictionary describing the asset (as info for other users)
    compute_kind="<method>", # marks which kind of tool is used in this asset (it seems to be just a decoration for visual purposes), EX: python, sql
    group_name="<group_name>", # specifies which group of assets this asset belongs to, EX: bronze, gold, ...; groups seem to make the UI look clearer and allow materializing multiple assets of the same group at once
)
def my_asset(context,
    database: DuckDBResource, # `database` is a resource name defined in the Definitions
):
    ...do_something...
    # pd_data = context.resources.mysql_io_manager.extract_data(sql_stm)
    # with database.get_connection() as conn: conn.execute("<sql query>")
    # partition_date_str = context.asset_partition_key_for_output() # returns a string with format "YYYY-MM-DD"
    # return pd.DataFrame() # by default (without io_manager_key specified) saved in pickle format in $DAGSTER_HOME/??path
    return Output(<return_value>, metadata=<a dictionary describing the asset as info for other users>) # return the value along with metadata after the asset executes

note: instead of using deps as above to specify upstream assets (usually used for assets that don't return a value), another approach is using ins. This method overcomes the limitation of the old one by allowing the use of return values from upstream assets:
@asset(
    ins={"my_asset": AssetIn(key_prefix=["dir1"])},
)
def my_asset_downstream(context, my_asset):
    print(my_asset) # my_asset here is the return value from the my_asset asset
3.9.2.2. Asset associated with multiple assets
- when an asset has multiple upstream and downstream assets (each with a different io_manager strategy), we use multi_asset (other cases not sure)
- syntax: "mywork - DE tools - essential example code">L3
  summary: it supports an additional decorator parameter `outs` to define multiple downstream assets (and the io_manager of each one); see the sketch below
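A minimal sketch of `outs`, with asset names and io_manager keys made up for illustration:

from dagster import AssetOut, Output, multi_asset
import pandas as pd

@multi_asset(
    outs={
        "orders": AssetOut(io_manager_key="csv_io_manager"),
        "users": AssetOut(io_manager_key="minio_io_manager"),
    }
)
def split_raw_data():
    raw = pd.DataFrame({"kind": ["order", "user"], "value": [1, 2]})
    # yield one Output per asset declared in `outs`, matched by output_name
    yield Output(raw[raw.kind == "order"], output_name="orders")
    yield Output(raw[raw.kind == "user"], output_name="users")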
3.9.3. asset context
- context is the first parameter passed to an asset function; it provides information about how Dagster is running and materializing your asset. For example, you can use it to find out which partition Dagster is materializing, which job triggered the materialization, or what metadata was attached to its previous materializations.
3.9.4. actions within an asset
3.9.4.1. get the asset key prefix and asset name
Inote: debug to check
- # context.step_key
- # "__".join(context.asset_key.path)
3.9.4.2. print value: metadata
- metadata is key-value pairs of additional info printed out to the user when an asset materializes successfully
- context.add_output_metadata({"row_count": random_number})
3.9.4.3. print value: log
- context.log.info(<things to print out>) # print a value out as a log
> similarly, we have .warning(), .error(), and .critical()
3.9.4.4. input a value to an asset (aka configure the asset) while running
- syntax: (EX: enter a value and print it out to the log)
@asset(config_schema={"api_endpoint": str})
def my_asset(context):
    ...do something...
    api_endpoint = context.op_config.get("api_endpoint", "no endpoint")
    context.log.info(f"API: {api_endpoint}")
then, when materializing the asset, provide the value in the `configure assets` box as:
ops:
  my_asset:
    config:
      api_endpoint: "https://api.io/data"

3.10. Code location and Definitions object

3.10.1. Code location
- Code location is an abstract term referring to a space that runs a dagster project; it enables users to run code with their own versions of Python and other dependencies. It includes:
  - A Python module that contains a Definitions object
  - A Python environment that can load the module above
  Inote: not tested using multiple code locations yet
- Code locations are all maintained in one single Dagster deployment. This allows you to silo packages and versions, but still create connections between data assets as needed. For example, an asset in one code location can depend on an asset in another code location.
- By default, a code location is named using the name of the module (the folder of dagster code) loaded by Dagster
3.10.2. Definitions
- Definitions is conventionally implemented in /<project_name>/__init__.py
- The Definitions object is used to assign definitions (asset, job, schedule, ... definitions) to a code location, and each code location can only have a single Definitions object. This object maps to one code location.
  Inote: not tested using multiple code locations yet
- syntax:
from dagster import Definitions, load_assets_from_modules
from .jobs import my_job
from .schedules import my_schedule
from .assets import asset_file_1 # import other asset modules similarly
from .resources import database_resource
from .resources.csv_io_manager import CSVIOManager # Inote: the path is not checked

defs = Definitions(
    assets=[*load_assets_from_modules([asset_file_1])],
    resources={"database": database_resource, "csv_io_manager": CSVIOManager()}, # pass an instance of the IOManager
    jobs=[my_job],
    schedules=[my_schedule],
)
4. Dagster debug
Inote: only tested debugging in VSCode with a single file
- assume the project folder as root (aka /) contains /etl_pipeline/assets/app.py, /etl_pipeline/resource/..., ...
- create a launch.json file: debug → create a launch.json file
> launch.json file content:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Dagster",
            "type": "python",
            "request": "launch",
            "program": "C:/Users/LENOVO/Home/OneDrive - VNU-HCMUS/H - Tech/Code/Python 3.10 DSK - IntelliJ/Scripts/dagster.exe",
            "args": [
                "dev",
                "-f",
                "${workspaceFolder}/etl_pipeline/assets/one_table.py"
            ],
            "cwd": "${workspaceFolder}/etl_pipeline", // relative refs in one_table.py start from /etl_pipeline
            "console": "integratedTerminal", // coerce the debug log to the current VSCode terminal
            "stopOnEntry": false
        }
    ]
}
5. Archived - detailed architecture of the dagster system, deploying dagster components one by one

The dagster-daemon is a long-running process that does things like checking the time to see if a schedule should run or if a sensor should be ticked; running dagster dev automatically spins up the dagster-daemon

- ./dagster_home/workspace.yaml
load_from:
  - grpc_server:
      host: localhost
      port: 4800
      location_name: "demo_grpc"
commands:
- note: each of the following commands is run in a different terminal as a separate service
- create a server that receives API commands in dagster; formally: starts the Dagster API server in gRPC mode
  export DAGSTER_HOME=$PWD/dagster_home # the dagster_home folder contains the workspace.yaml file
  dagster api grpc -p 4800 -f asset_partitioning.py # -f: use the pipeline placed in asset_partitioning.py; the port must match workspace.yaml
- start the Dagster daemon, which is responsible for running background processes, such as sensors and asset materialization
  * declare the DAGSTER_HOME var as above
  dagster-daemon run -w dagster_home/workspace.yaml
- start Dagit, the web-based user interface for Dagster
  * declare the DAGSTER_HOME var as above
  dagit -w dagster_home/workspace.yaml
6. Archived - dagster used to be task-centric
- some time ago, dagster used to be task-centric, with solids (an abstraction of a task)
7. Dockerize * updating (emp)
