Hello,
I am trying to understand how to write alert rules, and specifically how to provision alerts from Terraform / OpenTofu.
I have the following with Grafana v12.1.1:
resource "grafana_rule_group" "common_alerts" {
  depends_on       = [grafana_folder.opentofu_alerts]
  name             = "Common alerts for all Host"
  folder_uid       = grafana_folder.opentofu_alerts.uid
  interval_seconds = 60

  rule {
    name      = "All systemd services must be running"
    for       = "2m"
    condition = "above 0"

    data {
      ref_id         = "failed services"
      datasource_uid = data.grafana_data_source.mimir.uid
      relative_time_range {
        from = 120
        to   = 0
      }
      model = jsonencode({
        refId   = "failed services"
        instant = true
        expr    = <<-EOT
          node_systemd_unit_state{state="failed", name=~".+\\.service"}
        EOT
      })
    }

    data {
      ref_id         = "above 0"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        refId      = "above 0"
        expression = "failed services"
        type       = "threshold"
        conditions = [
          {
            evaluator = {
              params = [
                0
              ],
              type = "gt"
            },
            operator = {
              type = "and"
            },
            query = {
              params = [
                "above 0"
              ]
            },
            type = "query"
          }
        ],
        datasource = {
          type = "__expr__",
          uid  = "__expr__"
        }
      })
    }
  }
}
And my question is about what kind of queries you can use in alert rules, as documented at Queries and conditions where we have:
① Time series data — The query returns a collection of time series, where each series must be reduced to a single numeric value for evaluating the alert condition.
One of the difficulties I encountered when authoring the above resource was that I didn’t have instant = true under model in the first data block, which led to the error message:
invalid format of evaluation results for the alert definition : frame cannot uniquely be identified by its labels: has duplicate results with labels {}
My understanding is that the mandatory relative_time_range block (which cannot be an empty range; that gets rejected with a 4xx bad request) means you are doing a range query unless you set instant = true, so you end up with multiple values per series and violate point ①.
Now I guess using a range makes sense if you, e.g., sum up the data in your PromQL expr and fall back to one value. But then why can’t I specify an empty range to get the last value, and what is the purpose of the reduce functions in the alert condition (the second data block in my resource) if you cannot get there with multiple values?
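To make the second question concrete, here is a sketch of the variant I would have expected to work: keep the range query and add a reduce expression between the query and the threshold. I have not tested it. The resource name common_alerts_reduced and the refIds A, B and C are placeholders I made up for this example, and the reduce model fields (type, expression, reducer) are my assumption of what Grafana expects, based on what the UI exports.

```hcl
# Untested sketch. The resource name and the refIds A/B/C are placeholders.
resource "grafana_rule_group" "common_alerts_reduced" {
  name             = "Common alerts for all Host (reduce variant)"
  folder_uid       = grafana_folder.opentofu_alerts.uid
  interval_seconds = 60

  rule {
    name      = "All systemd services must be running"
    for       = "2m"
    condition = "C"

    # A: no instant = true here, so this stays a range query over the
    # last 120 seconds and returns several points per series.
    data {
      ref_id         = "A"
      datasource_uid = data.grafana_data_source.mimir.uid
      relative_time_range {
        from = 120
        to   = 0
      }
      model = jsonencode({
        refId = "A"
        expr  = <<-EOT
          node_systemd_unit_state{state="failed", name=~".+\\.service"}
        EOT
      })
    }

    # B: reduce each series from A to a single number (assumed field names).
    data {
      ref_id         = "B"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        refId      = "B"
        type       = "reduce"
        expression = "A"
        reducer    = "last"
      })
    }

    # C: alert condition, fires for every reduced series above 0.
    data {
      ref_id         = "C"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        refId      = "C"
        type       = "threshold"
        expression = "B"
        conditions = [
          {
            evaluator = {
              type   = "gt"
              params = [0]
            }
          }
        ]
      })
    }
  }
}
```

If this is the intended pattern, then the reduce step is what satisfies point ① for range queries, and instant = true is only a shortcut for the case where the last value is all you need. I would appreciate confirmation either way.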
Thank you for all the work, and let me know if I need to clarify anything.