
@germanfgv
Contributor

ErrorHandler.maxFailTime sets the maximum run time for which a failed job is still eligible for retry. If a failed job ran for longer than maxFailTime, it skips the retry logic and is failed immediately.
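
As a rough illustration of the rule described above, the following minimal sketch shows the decision in isolation. It is not the actual ErrorHandler code: `FailedJob`, `next_state`, the state names, and the threshold used in the demo are all hypothetical, and the run time is assumed to be measured in seconds.

```python
from dataclasses import dataclass


@dataclass
class FailedJob:
    """Hypothetical stand-in for a failed job record (not a WMCore class)."""
    name: str
    run_time: float  # assumed: wall-clock seconds the job ran before failing


def next_state(job: FailedJob, max_fail_time: float) -> str:
    """Return an illustrative next state for a failed job.

    Jobs that ran for longer than max_fail_time bypass the retry logic.
    """
    if job.run_time > max_fail_time:
        return "exhausted"  # failed immediately, retry logic is skipped
    return "retry"  # handed over to the retry logic


if __name__ == "__main__":
    threshold = 7_200  # hypothetical threshold, in seconds
    for job in (FailedJob("short", 3_600), FailedJob("long", 20_000)):
        print(job.name, next_state(job, threshold))
```

With a sufficiently large threshold, every failed job takes the `"retry"` branch, which is the behaviour T0 is after.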

This allows regular production agents to avoid retrying long-running jobs. T0, however, never wants to skip the retry logic, because we want to be able to recover jobs using the PauseAlgo plugin.

The current configuration of maxFailTime=120000 (33.3h) means that if a job fails after running for longer than this, it is exhausted directly and never reaches the usual paused state, causing issues like the one described here:
https://cms-talk.web.cern.ch/t/summary-of-failures-in-promptreco-jobs/16373/5

This PR sets maxFailTime=601200 (167h), effectively disabling this feature for T0 agents.
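
For reference, the hour figure can be checked with a short conversion. This sketch assumes the parameter is expressed in seconds, which the 601200 / 167h pairing implies; the helper name `seconds_to_hours` is made up for illustration.

```python
SECONDS_PER_HOUR = 3600


def seconds_to_hours(seconds: int) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / SECONDS_PER_HOUR


new_max_fail_time = 601200  # value set in this PR, assumed to be in seconds

hours = seconds_to_hours(new_max_fail_time)
days, remaining_hours = divmod(int(hours), 24)

print(hours)                  # 167.0
print(days, remaining_hours)  # 6 23
```

In other words, the new limit is just under seven days of run time.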

Increase maxFailTime so the jobs are not mistakenly exhausted by ErrorHandler
@germanfgv germanfgv merged commit eca7d7b into master Nov 14, 2022
@germanfgv germanfgv deleted the maxFailTime branch February 29, 2024 16:27