Hi!
Goal
I’ve set up application metrics to ship with an ‘availability’ label. I’m planning to set up black-box monitoring for a series of applications with different ‘availability’ levels.
To that end, I’m setting up some multidimensional alerts and am looking for the best way to go about this. I realize that even with black-box monitoring, the norms (thresholds) for each application are going to be different. My plan is to use thresholds based on the ‘availability’ rating as a sensible default and allow overrides in some way (e.g. with an extra alert rule and nested notification policies that redirect the default to /dev/null).
What I have
Here’s an example with one threshold:
Query A: sum(increase(logback_events_total{level="error"}[10m])) by (app_kubernetes_io_name, company.com/availability, company.com/team)
Expression B: Reduce max A
Expression C: Math $B > 5
(a notification policy then redirects alerts to the right team, based on company.com/team)
One way to do it
Now, to go to multiple thresholds, I can add a filter on availability to query A (company.com/availability="2"), duplicate that query for each of the 4 possible values, duplicate the expressions too, and make the final expression something like $E > 10 || $F > 5 || $G > 3 || $H > 1.
This involves a lot of duplication, and if no application with a given availability exists yet, the alert does not show correctly in the GUI, because the preview can’t handle no-data situations.
Another way to do it
I could do the same, but in 4 different alerts. That avoids the no-data problem, but it still involves a lot of duplication.
What I’m looking for
What I would actually want is something like this (PromQL style) in expression C:
$B{company.com/availability="1"} > 10 || $B{company.com/availability="2"} > 5 || $B{company.com/availability="3"} > 3 || $B{company.com/availability="4"} > 1
The above is invalid and, as far as I can see, expressions cannot operate on labels. Is there any other way to do this?
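To make the behaviour I’m after concrete, here is a minimal Python sketch of the per-label threshold lookup. It is only an illustration of the semantics, not Grafana expression syntax: the function name firing, the THRESHOLDS table and the sample values are all made up for this example, and the threshold numbers are the ones from the expression above.

```python
# Hypothetical illustration only: this is NOT Grafana's expression syntax.
# Thresholds per 'availability' level, kept apart from the evaluation logic.
THRESHOLDS = {"1": 10, "2": 5, "3": 3, "4": 1}

AVAILABILITY_LABEL = "company.com/availability"


def firing(series):
    """Return the label sets whose value exceeds the threshold for their availability."""
    alerts = []
    for labels, value in series:
        threshold = THRESHOLDS.get(labels.get(AVAILABILITY_LABEL))
        # Series without a known availability level are skipped, not guessed at.
        if threshold is not None and value > threshold:
            alerts.append(labels)
    return alerts


# Stand-in for the reduced values of $B, one per label set (made-up numbers).
b = [
    ({"app_kubernetes_io_name": "shop", AVAILABILITY_LABEL: "1", "company.com/team": "a"}, 7),
    ({"app_kubernetes_io_name": "billing", AVAILABILITY_LABEL: "3", "company.com/team": "b"}, 7),
]

print([labels["app_kubernetes_io_name"] for labels in firing(b)])
```

The same error count of 7 fires for the availability "3" application but not for the availability "1" one, and because the output keeps the full label set, the notification policy could still route on company.com/team.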
Bonus question: Is there a way to store the thresholds separately from the alert definitions (as we can do with constants in dashboards or with Prometheus recording rules), but in a way that they can be adjusted in the Grafana GUI?