How to configure alerts for a job
This page shows you how to configure alerts for Scaleway Serverless Jobs using Scaleway Cockpit and Grafana.
Before you start
To complete the actions presented below, you must have:
- A Scaleway account logged into the console
- Owner status or IAM permissions allowing you to perform actions in the intended Organization
- Scaleway resources you can monitor
- Created Grafana credentials with the Editor role
- Enabled the alert manager
- Added at least one contact in the Scaleway console or contact points in Grafana
- Selected the Scaleway Alerting alert manager in Grafana
- Log in to Grafana using your credentials.
- Click the Grafana icon in the top-left corner of your screen to open the menu.
- Click the arrow next to Alerting on the left-side menu, then click Alert rules.
- Click + New alert rule.
- Enter a name for your alert.
- In the Define query and alert condition section, toggle Advanced options.
- Select the data source you want to configure alerts for. This documentation uses the Scaleway Metrics data source.
- In the Rule type subsection, click the Data source-managed tab.
- In the query field next to the Loading metrics... > button, select the metric you want to configure an alert for. Refer to the table below for details on each alert for Serverless Jobs.
AnyJobError
- Pending period: 5s
- Summary: Job run {{ $labels.resource_id }} is in error.
- Query and alert condition: (serverless_job_run:state_failed == 1) OR (serverless_job_run:state_internal_error == 1)
- Description: Job run {{ $labels.resource_id }} from the job definition {{ $labels.resource_name }} finished in error. Check the console to find out the error message.
JobError
- Pending period: 5s
- Summary: Job run {{ $labels.resource_id }} is in error.
- Query and alert condition: (serverless_job_run:state_failed{resource_name="your-job-name-here"} == 1) OR (serverless_job_run:state_internal_error{resource_name="your-job-name-here"} == 1)
- Description: Job run {{ $labels.resource_id }} from the job definition {{ $labels.resource_name }} finished in error. Check the console to find out the error message.
AnyJobHighCPUUsage
- Pending period: 10s
- Summary: High CPU usage for job run {{ $labels.resource_id }}.
- Query and alert condition: serverless_job_run:cpu_usage_seconds_total:rate30s / serverless_job_run:cpu_limit * 100 > 90
- Description: Job run {{ $labels.resource_id }} from the job definition {{ $labels.resource_name }} has been using more than {{ printf "%.0f" $value }}% of its available CPU for the last 10s.
JobHighCPUUsage
- Pending period: 10s
- Summary: High CPU usage for job run {{ $labels.resource_id }}.
- Query and alert condition: serverless_job_run:cpu_usage_seconds_total:rate30s{resource_name="your-job-name-here"} / serverless_job_run:cpu_limit{resource_name="your-job-name-here"} * 100 > 90
- Description: Job run {{ $labels.resource_id }} from the job definition {{ $labels.resource_name }} has been using more than {{ printf "%.0f" $value }}% of its available CPU for the last 10s.
AnyJobHighMemoryUsage
- Pending period: 10s
- Summary: High memory usage for job run {{ $labels.resource_id }}.
- Query and alert condition: (serverless_job_run:memory_usage_bytes / serverless_job_run:memory_limit_bytes) * 100 > 80
- Description: Job run {{ $labels.resource_id }} from the job definition {{ $labels.resource_name }} has been using more than {{ printf "%.0f" $value }}% of its available RAM for the last 10s.
JobHighMemoryUsage
- Pending period: 10s
- Summary: High memory usage for job run {{ $labels.resource_id }}.
- Query and alert condition: (serverless_job_run:memory_usage_bytes{resource_name="your-job-name-here"} / serverless_job_run:memory_limit_bytes{resource_name="your-job-name-here"}) * 100 > 80
- Description: Job run {{ $labels.resource_id }} from the job definition {{ $labels.resource_name }} has been using more than {{ printf "%.0f" $value }}% of its available RAM for the last 10s.
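The job-specific alerts differ from their "Any" counterparts only by a label matcher on each metric. As an illustration, the Python sketch below builds the job-scoped error query from a job name. The helper names `scope_to_job` and `job_error_query` are hypothetical, and the sketch assumes the job definition name is exposed in the `resource_name` label, as in the JobError query.

```python
def scope_to_job(metric: str, job_name: str) -> str:
    """Append a resource_name label matcher to a metric name (hypothetical helper)."""
    # Escape backslashes and double quotes so the label value stays a valid string.
    escaped = job_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'{metric}{{resource_name="{escaped}"}}'


def job_error_query(job_name: str) -> str:
    """Build the job-scoped variant of the AnyJobError query."""
    failed = scope_to_job("serverless_job_run:state_failed", job_name)
    internal = scope_to_job("serverless_job_run:state_internal_error", job_name)
    return f"({failed} == 1) OR ({internal} == 1)"


if __name__ == "__main__":
    # "nightly-export" is a placeholder job name, not a real resource.
    print(job_error_query("nightly-export"))
```

The printed string can be pasted into the query field in place of the `your-job-name-here` placeholder.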
- Make sure that the values for the labels you have selected correspond to those of the target resource.
- In the Set alert evaluation behavior section, specify how long the condition must be met before triggering the alert.
- Enter a name in the Namespace and Group fields to categorize and manage your alert rules. Rules that share the same group use the same configuration, including the evaluation interval, which determines how often the rule is evaluated (by default, every 1 minute). You can modify this interval later in the group settings.
- In the Configure labels and notifications section, click + Add labels. A pop-up appears.
- Enter a label name and value, then click Save. You can skip this step if you want your alerts to be sent to the contacts you may already have created in the Scaleway console.
- Click Save rule and exit in the top-right corner of your screen to save and activate your alert. Once your alert meets the requirements you have configured, you will receive an email to inform you that your alert has been triggered.
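The pending period and the group's evaluation interval interact: a rule is only checked at each evaluation, so a pending period shorter than the interval cannot shorten the delay below one interval. The Python sketch below is a simplified model of this, assuming Prometheus-style semantics in which the alert becomes pending at the first evaluation where the condition holds and fires once it has held for at least the pending period. The function name `first_firing_time` is hypothetical and the model is not Grafana's actual implementation.

```python
from typing import List, Optional


def first_firing_time(
    condition_by_eval: List[bool], interval_s: int, pending_s: int
) -> Optional[int]:
    """Return the time in seconds (from the first evaluation) at which the
    alert would fire, or None if it never fires. Simplified model."""
    active_since: Optional[int] = None
    for i, condition_met in enumerate(condition_by_eval):
        now = i * interval_s
        if not condition_met:
            # The condition stopped holding: the pending state is reset.
            active_since = None
            continue
        if active_since is None:
            # First evaluation where the condition holds: alert becomes pending.
            active_since = now
        if now - active_since >= pending_s:
            return now
    return None


if __name__ == "__main__":
    # 1-minute evaluation interval, 10s pending period, condition always true.
    print(first_firing_time([True, True, True], interval_s=60, pending_s=10))
```

Under these assumptions, a 10s pending period with a 1-minute evaluation interval fires at the second consecutive evaluation where the condition holds, and a single evaluation where the condition is false resets the wait.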