Add Prometheus alert metrics #1140
|
@MemoMeto35 Remember the authors files - see DEVELOPERS.md |
|
@MemoMeto35 And, the manual |
|
Hello @jesperpedersen, thanks for the feedback. I wanted to ask about your comment on the manual. Do we need a new chapter for alerts or in |
|
Hello, We need to add to the manual how users can configure alert rules with Grafana, similar to pgexporter: We also need to add an introductory guide explaining alerts and why they are important (similar to the first part of pgexporter here: https://github.com/pgexporter/pgexporter/blob/main/doc/ALERT.md). I think we can keep all of this in |
|
Yes, it belongs in the Prometheus chapter, and in doc/PROMETHEUS.md |
|
Hi @jesperpedersen @Abdelrhmansersawy , thanks for your feedback. I applied the changes needed. Let me know if any further modifications should be done! |
Abdelrhmansersawy left a comment
Great work, I have just left a few comments.
| All alert metrics carry three labels: `server` (the server identifier), `alert` (the alert | ||
| name), and `type` (the alert type, currently `state`). | ||
|
|
||
| # Alerting with Grafana |
Could you add images that guide the reader through each step, similar to pgexporter?
https://github.com/pgexporter/pgexporter/blob/main/doc/manual/en/09-grafana.md#alerting-with-grafana
We try to keep the documentation consistent across projects.
| | libev | `auto` | String | No | Select the [libev](http://software.schmorp.de/pkg/libev.html) backend to use. Valid options: `auto`, `select`, `poll`, `epoll`, `iouring`, `devpoll` and `port` | | ||
| | max_rate | 0 | Int | No | The maximum backup transfer rate in bytes per second. Use 0 to disable | | ||
| | progress | off | Bool | No | Enable backup progress tracking | | ||
| | alert | off | Bool | No | Enable Prometheus alert metrics | |
|
@MemoMeto35 Please follow the commit message format |
c1bec03 to 4b3f10a
|
@Abdelrhmansersawy, your requested changes should be done now. Please let me know if any further modifications are needed. |
| config->progress = false; | ||
| } | ||
| } | ||
| else if (!strcmp(key, "alert")) |
Same here.
Please make sure we use alert(s) the same way everywhere (keep it consistent between the docs and the codebase).
|
Hello @Abdelrhmansersawy, the requested changes have been applied. |
Abdelrhmansersawy left a comment
Good job, just a few things from my side.
| config = (struct main_configuration*)shmem; | ||
|
|
||
| /* pgmoneta_alert_server_down */ | ||
| data = pgmoneta_append(data, "#HELP pgmoneta_alert_server_down Alert: server is not online (1 = down, 0 = up)\n"); |
There should be a space between # and HELP: it should be # HELP.
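For illustration, here is a minimal sketch of the metadata lines with the space after the hash, as the Prometheus text exposition format expects. The helper name format_gauge_header is hypothetical and not part of pgmoneta; the PR itself builds the text with pgmoneta_append.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not pgmoneta code): write the metadata lines for one
 * gauge into buf. Note the space after '#', giving "# HELP" and "# TYPE".
 * Returns the number of characters snprintf would have written.
 */
int
format_gauge_header(char* buf, size_t size, const char* name, const char* help)
{
   return snprintf(buf, size,
                   "# HELP %s %s\n"
                   "# TYPE %s gauge\n",
                   name, help, name);
}
```

The same pattern would apply to every alert metric in the patch, so fixing the prefix in one place avoids repeating the mistake.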
|
|
||
| /* pgmoneta_alert_server_down */ | ||
| data = pgmoneta_append(data, "#HELP pgmoneta_alert_server_down Alert: server is not online (1 = down, 0 = up)\n"); | ||
| data = pgmoneta_append(data, "#TYPE pgmoneta_alert_server_down gauge\n"); |
| data = NULL; | ||
|
|
||
| /* pgmoneta_alert_wal_streaming_down */ | ||
| data = pgmoneta_append(data, "#HELP pgmoneta_alert_wal_streaming_down Alert: WAL streaming is not active (1 = down, 0 = streaming)\n"); |
The same applies to all of the others.
| } | ||
| int critical = 0; | ||
|
|
||
| if (total_s > 0 && free_s < total_s / 10) |
Avoid the magic number. Please #define PGMONETA_ALERT_DISK_CRITICAL_THRESHOLD 10 at the top.
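A minimal sketch of what this could look like, assuming the check stays a simple "free space below one tenth of total" comparison. The function name disk_space_critical is hypothetical; only the macro name comes from the comment above.

```c
#include <stdbool.h>

/* Divisor for the critical threshold: free space below total / 10 (10%). */
#define PGMONETA_ALERT_DISK_CRITICAL_THRESHOLD 10

/*
 * Hypothetical helper: true when free space is below the critical fraction
 * of total space. A total of 0 is treated as "unknown" and never critical.
 */
bool
disk_space_critical(unsigned long free_s, unsigned long total_s)
{
   return total_s > 0 && free_s < total_s / PGMONETA_ALERT_DISK_CRITICAL_THRESHOLD;
}
```

Naming the divisor also documents that 10 means "one tenth", not "10 bytes" or "10 percent as a literal".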
Replace pgexporter-slack with pgmoneta-slack.
We also need to add the Alert sections to the manual, 10-prometheus.md, in both en and es.
| srv.workers = -1; | ||
| srv.max_rate = -1; | ||
| srv.progress_enabled = -1; | ||
| srv.alert_enabled = -1; |
We also need to set the default value: config->alerts = false.
| continue; | ||
| } | ||
| int stale = 0; | ||
| int retention = config->common.servers[i].retention_days; |
Should we also consider retention_weeks/months/years, or just skip when no retention is configured?
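One possible shape for the "skip when nothing is configured" option, as a sketch only. It assumes that a value of 0 or less means the field is unset, which may not match how pgmoneta stores retention; retention_configured is a hypothetical helper.

```c
#include <stdbool.h>

/*
 * Hypothetical helper. Assumption: a value <= 0 means "not configured".
 * Returns true when at least one retention period is set, so the caller
 * can skip the stale-backup check otherwise.
 */
bool
retention_configured(int days, int weeks, int months, int years)
{
   return days > 0 || weeks > 0 || months > 0 || years > 0;
}
```

How the weeks/months/years values should feed into the actual staleness comparison is left open here, since that depends on the retention semantics the project wants.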
| if (backup_ts > 0) | ||
| { | ||
| time_t now = time(NULL); | ||
| double age_days = difftime(now, backup_ts) / 86400.0; |
Replace the magic number 86400 with a #define.
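For example, something along these lines. The macro name SECONDS_PER_DAY and the helper backup_age_days are placeholders, not names from the codebase.

```c
#include <time.h>

/* Placeholder name: number of seconds in one day. */
#define SECONDS_PER_DAY 86400.0

/* Hypothetical helper: age of a backup in days, given two timestamps. */
double
backup_age_days(time_t now, time_t backup_ts)
{
   return difftime(now, backup_ts) / SECONDS_PER_DAY;
}
```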
| free_s = pgmoneta_free_space(base_dir); | ||
| total_s = pgmoneta_total_space(base_dir); |
free_s and total_s are computed once from config->base_dir, then emitted
for every server with only the server= label changing. All servers report
the exact same value.
Could we move the measurement inside the loop and use each server's own backup
path? Something like:
for (int i = 0; i < config->common.number_of_servers; i++)
{
   if (!pgmoneta_is_alert_enabled(i))
   {
      continue;
   }
   char* server_path = pgmoneta_get_server_backup(i);
   unsigned long free_s = pgmoneta_free_space(server_path);
   unsigned long total_s = pgmoneta_total_space(server_path);
   free(server_path);
   int critical = (total_s > 0 && free_s < total_s / 10) ? 1 : 0;
   /* ... emit metric ... */
}
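To show why measuring per path matters, here is a standalone sketch using POSIX statvfs directly. It does not use the pgmoneta helpers, and path_space is a hypothetical name; it only illustrates that free and total space are properties of the filesystem behind a given path, so servers with backups on different mounts would report different values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper: query free and total bytes for the filesystem that
 * holds 'path'. Returns false if the path is NULL or cannot be queried.
 */
bool
path_space(const char* path, unsigned long long* free_s, unsigned long long* total_s)
{
   struct statvfs st;

   if (path == NULL || statvfs(path, &st) != 0)
   {
      return false;
   }

   /* f_bavail: blocks available to unprivileged users; f_frsize: fragment size */
   *free_s = (unsigned long long)st.f_bavail * st.f_frsize;
   *total_s = (unsigned long long)st.f_blocks * st.f_frsize;

   return true;
}
```

Calling this once per server path inside the loop, rather than once for the base directory, is what makes the per-server label meaningful.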
This PR solves #1096