Add new script to alarm on late workflows #4020
Why use JSON for human-facing files (configuration files)? I would say XML, at least.
Who produces that JSON file? Is it provided by an operator? If so, then yes, one should be able to just edit it...
It is supposed to be provided by operators. JSON was just a matter of taste and ease of use. We can change it to a more user-friendly format if you suggest one.
JSON is easy to use in the code, but not that easy for an operator to handle or edit. If an operator should be able to modify this manually (i.e. in a text editor), it should be XML.
Either XML, or the parameters should be passable to the script individually (though that stops making sense past a certain number of parameters, or once there are dependencies between them).
I'm for the classic (Unix) configuration file: as long as it doesn't break on spaces, I'm happy. There should be a Python library for that; for Perl there are many, and I have used several in the past.
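Python's standard library does include such a parser, `configparser`, which reads INI-style files and keeps spaces inside values intact. The sketch below is only an illustration of that option, not the format this PR adopted: the section name, the keys `limit_hours`, `run_blacklist` and `output`, and the helper `read_alarm_settings` are all hypothetical.

```python
import configparser

# Hypothetical INI content: one section per alarm; values may contain spaces.
EXAMPLE = """
[WorkflowTimeouts]
# hours a workflow may sit in a state before it is considered late
limit_hours = 12
run_blacklist = 100001, 100002
output = /tmp/late workflows.xml
"""


def read_alarm_settings(text, section):
    """Return (limit_hours, run_blacklist, output_path) for one alarm section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    cfg = parser[section]
    limit = cfg.getfloat("limit_hours")
    # Comma-separated run numbers; empty entries are ignored.
    runs = {int(r) for r in cfg.get("run_blacklist", "").split(",") if r.strip()}
    # Internal spaces in the value survive; only surrounding whitespace is stripped.
    return limit, runs, cfg.get("output")


if __name__ == "__main__":
    print(read_alarm_settings(EXAMPLE, "WorkflowTimeouts"))
```

Full-line `#` comments are accepted by default, which keeps such a file readable for operators editing it by hand.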
Since we have WMCore support, what about using Configuration files like the agent configuration? It could be a single file for all alarms, with one section_() per alarm.
I don't like the sections approach and prefer separate files for separate things, but I think that's only prejudice; we can go with it.
Second iteration: it now uses a config file. I modified the backlog script to use it as well. Please take a look.
Format looks good to me. @samircury, please sign off on this and I can merge.
Instead of: [...] I would put something like: [...] so it's more intuitive; if I didn't know what the alarm was about, I would never guess it from the name. "WorkflowLimits" doesn't really say it; maybe WorkflowDelay or WorkflowTimeout. I wouldn't die over the current naming, though. Diego, you choose what to do; either is OK.
Implement it under bin/. It uses an external config file to get the limits and supports run blacklists.
It writes the file to: /afs/cern.ch/user/c/cmsprod/www/sls/cmst0_late_workflows.xml
The service name in the XML is CMST0-late-workflows.
Also move the backlog alarm to bin/ and add a functional version of the config file to etc/operations.
Note that the scripts are now named with underscores instead of dashes, to make them compatible with attribute and module names in Python.
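The core check described above (per-state limits plus a run blacklist) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `Workflow` record, the limits expressed in hours, and `find_late_workflows` are hypothetical, and the real script would obtain workflow states and timestamps from the Tier-0 system rather than from an in-memory list.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class Workflow:
    """Hypothetical record: a workflow, its run and its current state."""
    name: str
    run: int
    state: str
    entered_state: float  # epoch seconds when the workflow entered `state`


def find_late_workflows(workflows: Iterable[Workflow],
                        limits_hours: Dict[str, float],
                        run_blacklist: Set[int],
                        now: float) -> List[str]:
    """Return names of workflows that exceeded the limit for their current state."""
    late = []
    for wf in workflows:
        if wf.run in run_blacklist:
            continue  # blacklisted runs never raise the alarm
        limit = limits_hours.get(wf.state)
        if limit is None:
            continue  # no limit configured for this state
        if now - wf.entered_state > limit * 3600:
            late.append(wf.name)
    return late


if __name__ == "__main__":
    now = 1_000_000.0
    wfs = [
        Workflow("Repack_Run100", 100, "running", now - 10 * 3600),
        Workflow("Express_Run101", 101, "running", now - 1 * 3600),
        Workflow("PromptReco_Run102", 102, "running", now - 30 * 3600),
    ]
    print(find_late_workflows(wfs, {"running": 6}, {102}, now))
    # → ['Repack_Run100']
```

Passing `now` explicitly keeps the check deterministic and easy to test, instead of reading the clock inside the loop.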
Indeed it was ambiguous; I changed it to WorkflowTimeouts.
OK, I'll merge this now then. The format isn't set in stone either; it can easily be changed later if it becomes too annoying to use.
Add new script to alarm on late workflows
Implement it under bin/. It uses an external
JSON file to get the limits and supports run blacklists.
It writes the file to: /afs/cern.ch/user/c/cmsprod/www/sls/cmst0-late-workflows.xml
The service name in the XML is CMST0-late-workflows.
Also move the backlog alarm to bin/ and add an example
of the JSON file to etc/operations.
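Writing the status file could look roughly like the sketch below. The element names (`serviceupdate`, `id`, `availability`, `timestamp`, `availabilityinfo`) are an assumption about the SLS update format and should be checked against the SLS documentation; the 100/0 availability policy and both function names are hypothetical. The file is written to a temporary file in the same directory and then renamed into place, so the web server never serves a half-written file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import List


def build_sls_xml(service_id: str, late: List[str], now: datetime) -> str:
    """Build a small status document; element names are assumed, not verified."""
    root = ET.Element("serviceupdate")
    ET.SubElement(root, "id").text = service_id
    # Hypothetical policy: fully available when nothing is late, 0 otherwise.
    ET.SubElement(root, "availability").text = "0" if late else "100"
    ET.SubElement(root, "timestamp").text = now.strftime("%Y-%m-%dT%H:%M:%S")
    info = ", ".join(late) if late else "No late workflows"
    ET.SubElement(root, "availabilityinfo").text = info
    return ET.tostring(root, encoding="unicode")


def write_atomically(path: str, text: str) -> None:
    """Write to a temp file next to `path`, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # mkstemp creates the file as 0600; make it readable by the web server.
        os.chmod(tmp_path, 0o644)
        # Rename is atomic on POSIX when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    xml_text = build_sls_xml("CMST0-late-workflows", ["Repack_Run100"], datetime.now())
    print(xml_text)
```

In the script, the result of `build_sls_xml` would be passed to `write_atomically` together with the output path given above.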
Here there is something to discuss. I propose the following file organization:
etc/operations -> general operations data, such as the JSON files that configure the alarms; only basic functional examples. We wouldn't open pull requests for changes there unless the format changes, and therefore the scripts as well.
bin/ -> Executable scripts such as these two and the one in #4016
The alarms shouldn't keep run blacklists, or other information we would want to change often, in the code, but rather in external JSON files (a straightforward format). Note that this first version still uses some hardcoded run blacklists.
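Moving the blacklist out of the code could look like the following minimal sketch. The JSON keys (`limits`, `runBlacklist`) and the helper `load_alarm_config` are hypothetical and do not describe the example file added to etc/operations.

```python
import json
from typing import Dict, Set, Tuple

# Hypothetical example of an operator-maintained file in etc/operations.
EXAMPLE_JSON = """
{
    "limits": {"running": 6, "merging": 12},
    "runBlacklist": [200100, 200105]
}
"""


def load_alarm_config(text: str) -> Tuple[Dict[str, float], Set[int]]:
    """Parse per-state limits (hours) and the run blacklist from JSON text."""
    cfg = json.loads(text)
    limits = {state: float(hours) for state, hours in cfg["limits"].items()}
    # A missing blacklist simply means no runs are excluded.
    blacklist = {int(run) for run in cfg.get("runBlacklist", [])}
    return limits, blacklist


if __name__ == "__main__":
    print(load_alarm_config(EXAMPLE_JSON))
```

In the script itself the text would come from the file on disk, so changing the blacklist becomes an edit to that file rather than a code change and pull request.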
@hufnagel, @samircury comments?