Long Scheduler iteration times #31594
Replies: 3 comments
-
Just noticed that in the previous discussion, we were asked to run py-spy.
-
Here is the output of py-spy record. Of note is that all but one of the four daemon workers are idle; here is the flamegraph of the third, showing the 15 seconds between run requests. It looks like all the time is spent handling events? Not sure exactly what that means or how to act on it to improve things. Hoping @gibsondan might have a magical solve. Thanks in advance.
[flamegraph screenshot]
-
We implemented PgBouncer, which has reduced the time between run requests to 4s. That still seems slow, but it's a good improvement. We're also noticing that all but one of the daemon workers are idle, so we'll look into that next.
-
Hello,
We are experiencing some performance issues with the Dagster scheduler. This appears to be a re-emergence of an issue we mitigated by increasing the available resources, but this time we do not seem resource-constrained.
Details
Our daemon scheduler thread is taking ~15 seconds to process an individual run request. We have an hourly schedule that launches ~550 partitions on its busiest tick, meaning that the scheduler takes around 2 hours and 20 minutes just to pass our run requests to the queued run coordinator.
The individual partitions take between 20 minutes and 3 hours to complete.
Here is an example of one of our long schedule tick evaluations:
[screenshot: schedule tick evaluation]
and logs showing that RunRequests are being processed about 15s apart:
[screenshot: daemon logs with RunRequests ~15s apart]
Resources
The Dagster daemon's container seems to have ample resources.
Similarly, Dagster's database seems to have plenty of resources.
[Dagster DB resource screenshots]
Logs & pyspy dumps
Here are the first 10 minutes of logs from part of the relevant timerange, 3:00AM EDT - 3:10AM
downloaded-logs-20250806-141211.json
And some py-spy dumps taken while this was happening, as suggested in the previous discussion:
PySpy Dump 1
PySpy Dump 2
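(For context, these were captured with py-spy attached to the daemon process, roughly along these lines; the PID, duration, and output paths below are examples, not the exact commands we ran.)

```shell
# Thread-by-thread stack snapshot of the dagster-daemon process (PID is an example)
py-spy dump --pid 1234 > pyspy-dump-1.txt

# Flamegraph recorded over a window while the slow tick is being processed
py-spy record --pid 1234 --duration 60 --output scheduler-flamegraph.svg
```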
RunRequests
Our actual run requests are fairly lightweight; our schedule looks like this:
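(The real definitions didn't survive the copy into this post, so the sketch below is just the rough shape of it; the job name, partition definition, and the way the ~550 partition keys are selected are illustrative stand-ins, not our actual code.)

```python
from dagster import (
    HourlyPartitionsDefinition,
    RunRequest,
    ScheduleEvaluationContext,
    schedule,
)

# Illustrative partition definition; the real start date differs.
hourly_partitions = HourlyPartitionsDefinition(start_date="2025-01-01-00:00")


@schedule(job_name="my_partitioned_job", cron_schedule="0 * * * *")
def hourly_partitioned_schedule(context: ScheduleEvaluationContext):
    # One lightweight RunRequest per partition key; on the busiest tick this
    # yields ~550 requests, which the daemon hands to the queued run coordinator.
    for partition_key in hourly_partitions.get_partition_keys(
        context.scheduled_execution_time
    ):
        yield RunRequest(run_key=partition_key, partition_key=partition_key)
```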
Our run coordinator configuration looks like this:
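(Same issue here, the YAML block got dropped, so this is a representative sketch of the QueuedRunCoordinator section of our dagster.yaml; the numbers are placeholders rather than our real limits.)

```yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 50        # placeholder, not our real limit
    dequeue_interval_seconds: 5    # placeholder
    dequeue_use_threads: true
    dequeue_num_workers: 4         # placeholder
```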
Naively, I would expect that all 550 RunRequests would be queued reasonably quickly, and then started as resources become available to process them.
What is the expected behaviour here? Is there some setting or other metric we should check?