Hello Grafana Community,
I have set up alerts for monitoring containers with the following conditions:
[DIE, OOM, START]
- Target: Quiet containers
- Alert: Final event of the container
- Frequency: Every 2 minutes
- Condition: Based on the node’s final state within the last 2 minutes + 3 seconds
- Trigger: DIE, START, or OOM occurring fewer than 2 times
[RE-STARTING]
- Target: Noisy containers
- Alert: Number of DIE ↔ START repetitions
- Frequency: Every 1 hour
- Condition: Based on the node’s final state within the last 1 hour + 10 seconds
- Trigger: DIE, START, or OOM occurring more than 6 times (both trigger conditions are sketched as LogQL counts right after this list)
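For context, both triggers are really just event counts per container over each rule's window. Written as standalone LogQL (separate from the actual alert query I paste at the end of this post; the name and ip labels come from my JSON logs, and the container matcher is the same one I use there), they would look roughly like this:

# Noisy containers: more than 6 DIE/START/OOM events in the last hour
sum by (name, ip) (
  count_over_time(
    {container=~"A-.*"}
      | json
      | status =~ "DIE|START|OOM" [1h]
  )
) > 6

# Quiet containers: at least one event, but fewer than 2, in the last 2 minutes
# (containers with zero matching events produce no series, so they never match)
sum by (name, ip) (
  count_over_time(
    {container=~"A-.*"}
      | json
      | status =~ "DIE|START|OOM" [2m]
  )
) < 2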
The problem I’m facing is that important alerts (such as container crashes or OOM errors) are not getting through, while I keep receiving alerts from containers that repeatedly restart or are noisy.
The first alert is expected, but I would like to ignore further alerts from the same “noisy” containers for a certain period after that. How can I suppress repeat alerts for a noisy container once it has already fired within a given interval (e.g., within 1 hour), while making sure that the important alerts still come through?
My current configurations are as follows:
- [DIE, OOM, START]
- Pending: 2 minutes
- Evaluation: 10 seconds
- Time Range: 2 minutes + 3 seconds
- Repeat Interval: 4 hours
- [RE-STARTING]
- Pending: 1 hour
- Evaluation: 10 seconds
- Time Range: 1 hour + 10 seconds
- Repeat Interval: 4 hours
I’ve already explored the Grafana Alert Notification Policy Docs, but I would love some guidance on how to adjust these settings to achieve the desired behavior.
My query looks like this:
# keep only the most recent DIE/START/OOM event per container
topk(1,
  last_over_time(
    {container=~"A-.*"}
      | json
      | status =~ "DIE|START|OOM"
      | unwrap time [120s]
  ) by (name, ip, text, status)
) by (name, ip, text)
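To spell out the intent of that query: the inner last_over_time keeps the timestamp of the most recent matching event per (name, ip, text, status), and the outer topk(1, ...) by (name, ip, text) then keeps only the latest status per container, so each alert instance reflects the container's final state inside the 120-second window.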
Any help or suggestions would be greatly appreciated!