S - Dagster
Note
notable note
search along with the example code to see example usage of a function; the Dagster documentation alone gives very vague instructions
notable resources
dagster documentation: link
Dagster essential course: link
fully featured project: link
1. Theory
1.1. wtf is orchestration
- orchestration is a tool that helps automate, coordinate, and manage complex workflows, data pipelines, or
processes. Core features:
Directed Acyclic Graphs (DAGs): the data structure that models the steps of a pipeline and their ordering
Scheduling and Workflow Management: at what time and in what order the steps need to be executed
Error Handling and Retry Mechanisms: how to behave when an error occurs at a specific step
Monitoring and Logging
...
1.2. orchestrating approaches (OA) and Dagster OA
task-centric: focuses on managing and coordinating the execution of tasks. It focuses on the
hows and less on the whats.
asset-centric: Assets are what we call the outputs made by workflows. Asset-centric workflows
make it easy to, at a glance, focus on the whats and less on the hows
advantages of asset-centric compared to task-centric:
easily understand the data lineage and how data assets relate to each other
allow for reusing assets without changing an existing sequence of tasks
tell exactly why assets are out-of-date, whether it might be late upstream data or errors in
code
...
Dagster uses the asset-centric approach
1.3. relationship between assets
Asset dependencies can be:
Downstream, which means an asset is dependent on another asset
Upstream, which means an asset is depended on by another asset
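a minimal sketch of this relationship (asset names raw_orders / cleaned_orders are hypothetical): raw_orders is upstream of cleaned_orders, and cleaned_orders is downstream of raw_orders
from dagster import asset

@asset
def raw_orders():
    # upstream asset: other assets depend on its output
    return [1, 2, 3]

@asset(deps=["raw_orders"])
def cleaned_orders():
    # downstream asset: depends on raw_orders
    ...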
2. Installation
pip install 'dagster~=1.4'
3. Operating Dagster (and more theory)
3.1. web UI
Inote: by default, actions taken in the web server are saved to a temporary folder; when the server goes down, this folder is removed
Inote: the asset UI is straightforward and easy to remember; reading the concept definitions and code implementations in the following sections is enough to understand how Dagster behaves and get familiar with the keywords, then just get familiar with the UI (or Google it) and check the following notes if any problem comes up
note: each time the code is modified, `reload definitions` should be run
note: failed asset information is placed in the asset logs
note: values printed out by each asset are placed in the asset logs
3.2. run dagster as a single file
dagster dev -f <file>.py
> p_remind: this file must contain a Definitions declaration (covered later)
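a minimal single-file sketch (file and asset names are hypothetical) that `dagster dev -f my_pipeline.py` could load:
# my_pipeline.py
from dagster import Definitions, asset

@asset
def hello_asset():
    return "hello"

defs = Definitions(assets=[hello_asset])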
3.3. recommended project structure and basic actions with a project
Create a project skeleton: dagster project scaffold --name my-project-name
Install dependencies of project: pip install -e ".[dev]"
-e == --editable : installs the project dependencies and the project itself as a module in editable mode
note: the real effect of -e is not clear; just install with this command to be safe
additional info (not verified in action): by using -e, you'll only need to reload definitions when adding new assets or other Dagster objects, such as schedules, to your project
Components in the project:
Add Python dependencies: add the package name to /setup.py (where to put it is straightforward, or ask ChatGPT)
/.env: environment variables, described in a later separate section
./my-project-name: folder that contains the Dagster code as a Python module
/my-project-name/__init__.py : imports and combines the stuff in /my-project-name/ with the Definitions declaration; this is called a Code Location (covered later)
other subfolders, referenced relative to __init__.py: resources/, jobs/, assets/, ...
Run the dagster project (as a module): dagster dev
note: this command is shorthand for starting multiple subservices with one command (details about Dagster services are covered in a later section)
by default it opens at localhost port 3000
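a rough sketch of the resulting layout, using the /<project_name>/ placeholder style from the sections below (exact files vary by Dagster version):
<project_name>/
    setup.py                  # declare python dependencies here
    .env                      # environment variables (see the .env section)
    <project_name>/
        __init__.py           # Definitions object, i.e. the code location
        assets/
        jobs/
        schedules/
        sensors/
        resources/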
3.4. .env file
Inote: this section is not tested and is missing core content; read and extract from the documentation when there is a chance: link
env variables in Dagster are the same concept as in many other programming languages and tools (EX: Node.js)
conventionally located at /.env
how to use
approach 1:
import os
os.getenv("DUCKDB_DATABASE") # assuming DUCKDB_DATABASE is a var defined in /.env
approach 2:
from dagster import EnvVar
EnvVar("DUCKDB_DATABASE")
Inote: there seem to be some conflicts/conventions when using EnvVar: EnvVar can only be used in resources and a few other places, check this doc example from Dagster: link
the difference:
EnvVar fetches the environment variable's value every time a run is launched, so the value can change between runs without redeploying
os.getenv fetches the environment variable when the code location is loaded (the code location / definitions can be reloaded many times per deployment)
By using EnvVar instead of os.getenv, you can dynamically customize a resource's configuration. For example, you can change which DuckDB database is being used without having to restart Dagster's web server.
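a small sketch of the difference in practice (the variable name DUCKDB_DATABASE follows the example above; DuckDBResource is from the Resource section below):
import os
from dagster import EnvVar
from dagster_duckdb import DuckDBResource

# os.getenv: the value is read once, when this module (code location) is loaded
DUCKDB_PATH_AT_LOAD = os.getenv("DUCKDB_DATABASE")

# EnvVar: resolved by Dagster each time a run is launched, so the database
# can be switched in .env without restarting the web server
database_resource = DuckDBResource(database=EnvVar("DUCKDB_DATABASE"))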
3.5. Resource
Resources are the tools and services you use to make assets; a resource can be an API connection, a database connection, ...
resources are conventionally located at /<project_name>/resources/__init__.py
syntax: (example with DuckDB)
from dagster import EnvVar
from dagster_duckdb import DuckDBResource
database_resource = DuckDBResource(database=EnvVar("DUCKDB_DATABASE")) # DUCKDB_DATABASE holds the path to the database file
Inote: the syntax for declaring a resource in the Definitions and using it in an asset is described in the later respective sections
Docs on integrating other kinds of resources into a Dagster definition: link
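besides prebuilt resources like DuckDBResource, a custom resource can be sketched by subclassing ConfigurableResource (class, field, and URL names below are hypothetical):
import requests
from dagster import ConfigurableResource, EnvVar

class WeatherAPIResource(ConfigurableResource):
    base_url: str
    api_key: str

    def fetch(self, city: str) -> dict:
        # wrap the external service so assets only talk to this resource
        resp = requests.get(f"{self.base_url}/weather", params={"q": city, "key": self.api_key})
        resp.raise_for_status()
        return resp.json()

weather_api_resource = WeatherAPIResource(
    base_url="https://api.example.com",
    api_key=EnvVar("WEATHER_API_KEY"),
)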
3.6. Jobs and Schedule
Jobs are a Dagster utility to take a slice of your asset graph and focus specifically on running
materializations of those assets.
> Jobs are conventionally defined at /<project_name>/jobs/__init__.py
> syntax:
from dagster import AssetSelection, define_asset_job
from ..partitions import monthly_partition
my_assets = AssetSelection.keys("<asset_name>", )
# my_assets = AssetSelection.all() - other_assets
my_job = define_asset_job(
    name="my_job",
    selection=my_assets,
    partitions_def=monthly_partition,
)
# note: partitions are introduced in a later section
Cron expressions: same as crontab in Linux, not covered here
Schedules are objects that manage when jobs run
> Schedules are conventionally defined at /<project_name>/schedules/__init__.py
> syntax:
from dagster import ScheduleDefinition
from ..jobs import my_job
my_schedule = ScheduleDefinition(job=my_job, cron_schedule="0 0 5 * *") # crontab expression
Inote: the syntax for declaring jobs and schedules in the Definitions is described in the respective section
mock test: manually running a schedule to test whether it runs correctly; done via the Dagster web UI (simple, not covered)
3.6.1. > Partitions
Inote: this section seems to be missing core content; recheck the documentation when there is a chance
conventionally defined at /<project_name>/partitions/__init__.py
Partitions are a way to split your data into smaller, easier-to-use chunks; partitions are usually divided by date (EX: each month of the year is a partition). Some notable benefits:
split your data into smaller, easier-to-use chunks
treat partitioned assets differently to obtain the best efficiency (EX: store recent orders in hot storage and older orders in cheaper, cold storage)
distribute partitions across multiple servers or storage systems and run multiple partitions
in parallel
Backfilling is the process of running partitions for assets that either don't exist (EX: not run yet because a pipeline is deployed for the first time) or updating existing records (EX: when you've changed the logic for an asset and need to update historical data with the new logic)
syntax: (example of defining a monthly partition over a specific range of time)
from dagster import MonthlyPartitionsDefinition
monthly_partition = MonthlyPartitionsDefinition(start_date="2023-01-01", end_date="2023-12-01")
# print(monthly_partition.get_partition_keys())
# output (roughly): ['2023-01-01', '2023-02-01', ..., '2023-10-01']
# note: this example was written in 2023/11, so only the partitions for months 1-10 exist,
# up to but not including November. (need recheck)
Inote: syntax for adding partition setup to asset, job and Definitions is described in respective
section
Q: why is the partition, already defined on the asset, declared again on the schedule's job, while some assets in that schedule do not use it at all?
> A??: it seems that only when the partition is attached to the job this way will running the schedule actually activate the partition on those assets (see the sketch below)
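a sketch of how the same partition definition ends up attached to both the asset and the job (asset/job names are hypothetical); build_schedule_from_partitioned_job then derives the schedule from the partitioned job:
from dagster import (
    MonthlyPartitionsDefinition,
    asset,
    build_schedule_from_partitioned_job,
    define_asset_job,
)

monthly_partition = MonthlyPartitionsDefinition(start_date="2023-01-01")

@asset(partitions_def=monthly_partition)
def monthly_orders(context):
    # the asset knows which partition (month) it is materializing
    context.log.info(context.asset_partition_key_for_output())

# the job carries the same partitions_def, so a run launched by the schedule
# targets a specific partition of the selected assets
monthly_job = define_asset_job(
    name="monthly_job",
    selection=["monthly_orders"],
    partitions_def=monthly_partition,
)

# one run per partition, launched once the partition's time window has passed
monthly_schedule = build_schedule_from_partitioned_job(monthly_job)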
3.7. IOManager *need-update
IOManager is used to handle an asset's return (output) value and the upstream assets' return values (the inputs to the current asset)
syntax:
# define a CSVIOManager in resources/<io file>
from dagster import IOManager, InputContext, OutputContext
import pandas as pd

class CSVIOManager(IOManager):
    def handle_output(self, context: OutputContext, obj: pd.DataFrame) -> None:
        # called with the asset's return value; persist it as a CSV file
        file_name = context.asset_key.path[-1]
        obj.to_csv(f"/tmp/{file_name}.csv", index=False)

    def load_input(self, context: InputContext) -> pd.DataFrame:
        # called when a downstream asset needs this asset's value as input
        file_name = context.asset_key.path[-1]
        return pd.read_csv(f"/tmp/{file_name}.csv")
Inote: a config is needed to tell the asset to use the implemented IO setup; this is described in the `asset` section and sketched right below
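a short sketch of that wiring (the import path and asset name are assumptions, consistent with the conventions in this note):
from dagster import Definitions, asset
import pandas as pd
from .resources.csv_io_manager import CSVIOManager  # hypothetical module path

@asset(io_manager_key="csv_io_manager")  # point the asset at the registered key
def my_csv_asset():
    return pd.DataFrame({"a": [1, 2]})

defs = Definitions(
    assets=[my_csv_asset],
    resources={"csv_io_manager": CSVIOManager()},  # register the IOManager under that key
)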
3.8. Sensor
Sensors are a way to monitor a specific event and create runs based on it. Sensors continuously check and execute logic to know whether to kick off a run; by default, a sensor polls every 30s
Sensors are commonly used for situations where you want to materialize an asset after something happened:
a new file arrives
another asset has been materialized elsewhere
A sensor cursor is a stored value used to manage the state of the sensor:
store the ID of the last fetched record, keep track of which requests it has already made a report for
where the computation last left off
> the sensor will retrieve all the file names in the data/requests directory, compare them to the list of files it already looked at (stored in its stateful cursor), update the cursor with the new files, and kick off a new run for each of those files
syntax: sensor lecture in the Dagster Essentials course: link
> summary of the process:
problem: stakeholders request a report: they want to know how the trips within a specified time range changed
implementation process: create a class that defines the configuration (info) of the request -> write assets that create the report based on the received request info -> create a job containing the assets related to the request -> define a sensor that checks the event and runs the job -> register the sensor in the Definitions (a sketch follows below)
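a minimal sensor sketch consistent with the process above (the job report_job, the asset report_asset, and its filename config field are assumptions; data/requests is the directory mentioned above):
import json
import os
from dagster import RunRequest, SensorEvaluationContext, sensor
from ..jobs import report_job  # hypothetical job that builds the report assets

@sensor(job=report_job)
def new_request_sensor(context: SensorEvaluationContext):
    PATH = "data/requests"
    previous = json.loads(context.cursor) if context.cursor else []
    current = sorted(os.listdir(PATH)) if os.path.isdir(PATH) else []
    for filename in current:
        if filename not in previous:
            # one run per new request file; run_key makes the request idempotent
            yield RunRequest(
                run_key=filename,
                run_config={"ops": {"report_asset": {"config": {"filename": filename}}}},
            )
    # remember which files were already handled
    context.update_cursor(json.dumps(current))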
3.9. Asset
3.9.1. General
Assets are conventionally implemented at /<project_name>/assets/<file_name>.py
Asset names should be nouns; the asset name by default is the function name (see asset syntax)
Asset key (formally: AssetKey): a key that uniquely identifies the asset in Dagster; by default it is the function name (when key_prefix is not used, covered later)
(not verified) the working directory when an asset function executes is the project home folder, but imports are resolved relative to the current asset file
3.9.2. general syntax
3.9.2.1. Asset with single resource
from dagster import asset, Output, Definitions, AssetIn, AssetOut, multi_asset
from ..partitions import my_partition
from dagster_duckdb import DuckDBResource
@asset(
    deps=["<upstream_asset_name>", ],
    partitions_def=my_partition,
    io_manager_key="minio_io_manager",  # set the IOManager strategy for the current asset
    required_resource_keys={"mysql_io_manager"},  # require a resource, in this case "mysql_io_manager"; the resource is then available via context.resources.mysql_io_manager.<resource feature> (see usage example in the function body)
    name="asset_name",
    key_prefix=["dir1", "dir2"],  # in the UI the asset is then placed in the folder tree: dir1/dir2/asset_name
    metadata={"<key>": "<info about the asset>"},  # a dictionary describing the asset (as info for other users)
    compute_kind="<method>",  # marks which kind of tool is used in this asset (it seems to be just decoration for visual purposes), EX: python, sql
    group_name="<group_name>",  # specify which group of assets the asset belongs to, EX: bronze, gold, ...; groups seem to make the UI clearer and allow materializing multiple assets in the same group at once
)
def my_asset(
    context,
    database: DuckDBResource,  # `database` is the resource name, defined in `Definitions`
):
    ...  # do something
    # pd_data = context.resources.mysql_io_manager.extract_data(sql_stm)
    # with database.get_connection() as conn: conn.execute("<sql query>")
    # partition_date_str = context.asset_partition_key_for_output()  # return value is a string with format "YYYY-MM-DD"
    # return pd.DataFrame()  # by default (without io_manager_key specified) saved in pickle format under $DAGSTER_HOME/??path
    return Output(<return_value>, metadata=<dict describing the output as info for other users>)  # return the value along with metadata after the asset executes
note: instead of using deps as above to specify upstream assets (usually used for assets that don't return a value), another approach is using ins. This approach overcomes the limitation of the former by allowing the use of return values from upstream assets:
@asset(
ins={ "my_asset": AssetIn( key_prefix=["dir1", ], ) },
)
def my_asset_downstream(context, my_asset):
print(my_asset) # is return value from my_asset
3.9.2.2. Asset associated with multiple upstream/downstream assets
when an asset has multiple upstream and downstream assets (each with a different io_manager strategy), we use multi_asset (other cases not sure)
syntax: "mywork - DE tools - essential example code" > L3
summary: it adds the decorator parameter `outs` to define multiple downstream/output assets (and the io_manager of each one); a sketch follows below
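a minimal multi_asset sketch (asset names and IOManager keys are hypothetical): one function yields two output assets, each stored with its own IOManager:
from dagster import AssetOut, Output, multi_asset

@multi_asset(
    outs={
        "orders_csv": AssetOut(io_manager_key="csv_io_manager"),
        "orders_minio": AssetOut(io_manager_key="minio_io_manager"),
    }
)
def split_orders():
    # one function materializes two assets; output_name selects the matching `outs` entry
    yield Output([1, 2, 3], output_name="orders_csv")
    yield Output({"rows": 3}, output_name="orders_minio")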
3.9.3. asset context
context is the first parameter passed to an asset function; it provides information about how Dagster is running and materializing your asset. For example, you can use it to find out which partition Dagster is materializing, which job triggered the materialization, or what metadata was attached to its previous materializations.
3.9.4. action within asset
3.9.4.1. get asset key prefix and asset name
Inote: debug to check
# context.step_key
# "__".join(context.asset_key.path)
3.9.4.2. print value: metadata
metadata is a set of key-value pairs printed out as additional info to the user when an asset materializes successfully
context.add_output_metadata({ "row_count": random_number, })
3.9.4.3. print value: log
context.log.info(<things to print out>) # print value out as log
> similarly, we have .warning(), .error(), and .critical()
3.9.4.4. input a value to an asset (aka configure the asset) while running
- syntax: (EX: enter a value and print it out to the log)
@asset(config_schema={"api_endpoint": str})
def my_asset(context):
    ...  # do something
    api_endpoint = context.op_config.get("api_endpoint", "no endpoint")
    context.log.info(f"API: {api_endpoint}")
then, when materializing the asset, provide the value in the `configure assets` box as:
ops:
  my_asset:
    config:
      api_endpoint: "https://api.io/data"
3.10. Code location and Definitions object
3.10.1. Code location
Code location is an abstract term referring to the environment in which Dagster code runs; it enables users to run code with their own versions of Python and other dependencies. It includes:
A Python module that contains a Definitions object
A Python environment that can load the module above
Inote: not tested using multiple code locations yet
Code locations are all maintained in one single Dagster deployment. This allows you to silo
packages and versions, but still create connections between data assets as needed. For example,
an asset in one code location can depend on an asset in another code location.
By default, code locations are named using the name of the module (folder of dagster code)
loaded by Dagster
3.10.2. > Definitions
Definitions is conventionally implemented in /<project_name>/__init__.py
The Definitions object is used to assign definitions (asset, job, schedule, ... definitions) to a code location, and each code location can only have a single Definitions object. This object maps to one code location.
Inote: not test using multiple code locations yet
syntax:
from dagster import Definitions, load_assets_from_modules
from .jobs import my_job
from .schedules import my_schedule
from .assets import asset_file_1, ...
from .resources import database_resource
from .resources.csv_io_manager import CSVIOManager # Inote: the path is not checked
defs = Definitions(
assets=[*load_assets_from_modules([asset_file_1]),],
resources={"database": database_resource,"csv_io_manager": CSVIOManager },
jobs=[my_job, ],
schedules= [my_schedule, ],
)
4. Dagster debug
Inote: only tested debugging in VSCode with a single file
assume the project folder is the root (aka /) and contains /etl_pipeline/assets/app.py, /etl_pipeline/resource/..., ...
create a launch.json file: in the VSCode Run and Debug view, click `create a launch.json file`
> launch.json file content:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Dagster",
            "type": "python",
            "request": "launch",
            "program": "C:/Users/LENOVO/Home/OneDrive - VNU-HCMUS/H - Tech/Code/Python 3.10 DSK - IntelliJ/Scripts/dagster.exe",
            "args": [
                "dev",
                "-f",
                "${workspaceFolder}/etl_pipeline/assets/one_table.py"
            ],
            "cwd": "${workspaceFolder}/etl_pipeline", // relative refs in one_table.py start from /etl_pipeline
            "console": "integratedTerminal", // coerce debug logs into the current VSCode terminal
            "stopOnEntry": false
        }
    ]
}
5. Archived - detailed architecture of the dagster system, deploying dagster components one by one
The dagster-daemon is a long-running process that does things like check the time to see if a schedule should be run or if a sensor should be ticked; running dagster dev automatically spins up the dagster-daemon
./dagster_home/workspace.yaml
load_from:
- grpc_server:
host: localhost
port: 4800
location_name: "demo_grpc"
command:
note: each of the following commands is run in a different terminal as a separate service
create a server for the API that receives commands in dagster; formally: starts the Dagster API server in gRPC mode.
    export DAGSTER_HOME=$PWD/dagster_home # with the dagster_home folder containing the workspace.yaml file
    dagster api grpc -p 4800 -f asset_partitioning.py # -f: use the pipeline placed in asset_partitioning.py; port matches workspace.yaml
starts the Dagster daemon, which is responsible for running background processes, such as sensors and asset materialization
    * declare the DAGSTER_HOME var as above
    dagster-daemon run -w dagster_home/workspace.yaml
starts Dagit, the web-based user interface for Dagster.
    * declare the DAGSTER_HOME var as above
    dagit -w dagster_home/workspace.yaml
6. Archived - dagster used to be task-centric
some time ago, dagster used to be task-centric, with "solids" (the abstraction for a task)
7. Dockerize * updating (emp)