Description
For non-ephemeral runners, the status of the workflow job is checked, and scaling is done only for queued jobs. For ephemeral runners this check is not applied, because the assumption was that every job needs a runner.
We found that once we started scaling to a few hundred runners, this idea did not work as expected when there was a large number of cancelled jobs, for example because of a job timeout. The events for those jobs are still on the queue. This is typically the case when we have reached the maximum number of runners. The lambdas will create all the runners, but the runners remain idle because the jobs are cancelled. This is not a problem with a few cancelled jobs, but with a huge number of cancelled jobs it can cause a large fleet of useless runners.
Solution
We have tested a modified scale-up lambda in which we applied the job check in the same way as for non-ephemeral runners. In our case this solved the problem. However, since there is no correlation between job and runner, this approach could mean that events are not used for scaling in cases where they should be. As mitigation, we keep a very small fleet of runners in the pool to handle those missed events.
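
The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual lambda code: the names `ScaleUpEvent`, `shouldScaleUp`, `eventsNeedingRunner`, and `statusOf` are hypothetical, and the status lookup is kept synchronous for brevity, whereas a real lambda would presumably query the GitHub API asynchronously. It assumes a cancelled job no longer reports the status `queued`.

```typescript
// Hypothetical types for illustration only; not the project's real definitions.
type JobStatus = "queued" | "in_progress" | "completed";

interface ScaleUpEvent {
  jobId: number;
  ephemeral: boolean; // deliberately not consulted below
}

// Scale only when the job behind the event is still waiting for a runner.
function shouldScaleUp(status: JobStatus): boolean {
  return status === "queued";
}

// Filters queued events down to those whose job is still queued.
// `statusOf` stands in for a lookup of the job's current status (assumption).
function eventsNeedingRunner(
  events: ScaleUpEvent[],
  statusOf: (jobId: number) => JobStatus,
): ScaleUpEvent[] {
  return events.filter((event) => shouldScaleUp(statusOf(event.jobId)));
}
```

Because the `ephemeral` flag is ignored, ephemeral and non-ephemeral events go through the same check, so events for cancelled jobs no longer create runners.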