Hi there,
I’m currently trying to set up alerting based on Loki logs.
I have a cron job that creates database backups every day. The backup agent reports success in the logs.
My goal is to be notified if this message is missing, so I came up with this rule:
- name: app-db-backup
  rules:
    - alert: appDbNoBackup
      expr: absent_over_time(({app="percona-server-mongodb", container="backup-agent", pod=~"app-db.*"} != "[pitr]" |= "backup finished")[24h])
      for: 2h
      labels:
        severity: critical
      annotations:
        summary: No Backups have been finished within 24h
Running this query in Grafana generates this graph:
Running the above query without absent_over_time returns these logs:
2021-11-21 19:00:45
{"log":"2021-11-21T18:00:44.000+0000 I [backup/2021-11-21T18:00:15Z] backup finished\n","stream":"stderr","time":"2021-11-21T18:00:44.881235862Z"}
2021-11-20 19:00:44
{"log":"2021-11-20T18:00:44.000+0000 I [backup/2021-11-20T18:00:15Z] backup finished\n","stream":"stderr","time":"2021-11-20T18:00:44.84431218Z"}
2021-11-19 19:00:42
{"log":"2021-11-19T18:00:42.000+0000 I [backup/2021-11-19T18:00:14Z] backup finished\n","stream":"stderr","time":"2021-11-19T18:00:42.845762133Z"}
2021-11-18 19:00:45
{"log":"2021-11-18T18:00:44.000+0000 I [backup/2021-11-18T18:00:14Z] backup finished\n","stream":"stderr","time":"2021-11-18T18:00:44.872780481Z"}
2021-11-18 09:01:34
{"log":"2021-11-18T08:01:34.000+0000 I [backup/2021-11-18T08:01:04Z] backup finished\n","stream":"stderr","time":"2021-11-18T08:01:34.580555628Z"}
2021-11-17 19:00:43
{"log":"2021-11-17T18:00:43.000+0000 I [backup/2021-11-17T18:00:14Z] backup finished\n","stream":"stderr","time":"2021-11-17T18:00:43.792733085Z"}
2021-11-17 16:20:48
{"log":"2021-11-17T15:20:48.000+0000 I [backup/2021-11-17T15:20:18Z] backup finished\n","stream":"stderr","time":"2021-11-17T15:20:48.061940266Z"}
However, the alert is triggered 3.5h after the last successful backup and stays active until the next backup is created.
This means the alert fires every day from 21:30 until 18:00 the next day.
Any idea why this is happening? The range for absent_over_time is set to 24h, so this alert should not trigger as long as matching messages appear within 24h. Looking at the graph, I can’t see any indication of why the alert behaves this way.
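To make my expectation concrete, here is a rough Python sketch of how I assume the rule is evaluated. It is only my mental model, not how Loki's ruler is actually implemented: the one-minute evaluation interval and the helper names `is_absent` and `firing_times` are my own assumptions, and the timestamps are the `time` fields from the logs above, truncated to whole seconds.

```python
from datetime import datetime, timedelta

# Backup completion times (UTC), copied from the "time" field of the log lines
# above and truncated to whole seconds.
BACKUPS = [datetime.fromisoformat(s) for s in (
    "2021-11-17T15:20:48",
    "2021-11-17T18:00:43",
    "2021-11-18T08:01:34",
    "2021-11-18T18:00:44",
    "2021-11-19T18:00:42",
    "2021-11-20T18:00:44",
    "2021-11-21T18:00:44",
)]

WINDOW = timedelta(hours=24)   # the [24h] range in the expression
HOLD = timedelta(hours=2)      # the "for: 2h" clause
STEP = timedelta(minutes=1)    # assumed evaluation interval (not from my config)


def is_absent(now, events, window=WINDOW):
    """True if no matching log line falls inside (now - window, now]."""
    return not any(now - window < ts <= now for ts in events)


def firing_times(start, end, events, step=STEP, hold=HOLD):
    """Evaluation times at which I would expect the alert to be firing."""
    firing = []
    absent_since = None
    now = start
    while now <= end:
        if is_absent(now, events):
            if absent_since is None:
                absent_since = now          # alert becomes pending
            if now - absent_since >= hold:
                firing.append(now)          # pending for at least "for"
        else:
            absent_since = None             # a backup is in the window again
        now += step
    return firing


start = datetime.fromisoformat("2021-11-18T00:00:00")
end = datetime.fromisoformat("2021-11-21T19:00:00")
fired = firing_times(start, end, BACKUPS)
print(len(fired))
```

With the timestamps above this prints 0, which is why I expected the alert to stay silent for the whole period.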
Any help would be greatly appreciated.
-Markus