Terraform: How to use the reduce functions in alert condition

Hello,

I am trying to understand how to write alert rules, and specifically provision alerts from Terraform / OpenTofu.

I have the following with Grafana v12.1.1:

resource "grafana_rule_group" "common_alerts" {
  depends_on = [grafana_folder.opentofu_alerts]

  name = "Common alerts for all Host"

  folder_uid       = grafana_folder.opentofu_alerts.uid
  interval_seconds = 60

  rule {
    name      = "All systemd services must be running"
    for       = "2m"
    condition = "above 0"

    data {
      ref_id         = "failed services"
      datasource_uid = data.grafana_data_source.mimir.uid
      relative_time_range {
        from = 120
        to   = 0
      }
      model = jsonencode({
        refId   = "failed services"
        instant = true
        expr    = <<-EOT
          node_systemd_unit_state{state="failed", name=~".+\\.service"}
        EOT
      })
    }

    data {
      ref_id         = "above 0"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        refId      = "above 0"
        expression = "failed services"
        type       = "threshold"
        conditions = [
          {
            evaluator = {
              params = [
                0
              ],
              type = "gt"
            },
            operator = {
              type = "and"
            },
            query = {
              params = [
                "above 0"
              ]
            },
            type = "query"
          }
        ],
        datasource = {
          "type" : "__expr__",
          "uid" : "__expr__"
        }
      })
    }
  }
}

And my question is about what kind of queries you can use in alert rules, as documented at Queries and conditions where we have:

Time series data — The query returns a collection of time series, where each series must be reduced to a single numeric value for evaluating the alert condition.

One of the difficulties I encountered when authoring the above resource was that I didn’t have instant = true under model in the first data block, which led to the error message:

invalid format of evaluation results for the alert definition : frame cannot uniquely be identified by its labels: has duplicate results with labels {}

This happens because the mandatory relative_time_range block (which cannot be an empty range: that gets rejected with a 4xx Bad Request) means you are doing a range query unless you set instant = true. You therefore end up with multiple values per series and violate point ① quoted above.

Now I guess using a range makes sense if you e.g. sum up the data in your promql expr and fall back to one value. But then why can’t I specify an empty range to get the last value, and what is the purpose of the reduce functions in the alert condition (the second data block in my resource) if you cannot get there with multiple values?
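To illustrate the "fall back to one value in PromQL" idea, here is a minimal sketch of what the first data block could look like if the aggregation over the window is done in the query itself. The choice of max_over_time and the 2m window are assumptions made for illustration (they are not in the resource above); the rest is copied from it.

```hcl
# Sketch only: a drop-in variant of the first data block of the rule above.
data {
  ref_id         = "failed services"
  datasource_uid = data.grafana_data_source.mimir.uid
  relative_time_range {
    from = 120
    to   = 0
  }
  model = jsonencode({
    refId   = "failed services"
    instant = true # one sample per series, so no reduce step is needed
    # Assumption: max_over_time collapses the last 2 minutes of each series
    # to one number (1 if the unit was reported as failed at any point).
    expr = <<-EOT
      max_over_time(node_systemd_unit_state{state="failed", name=~".+\\.service"}[2m])
    EOT
  })
}
```

With this shape the window lives in the PromQL range selector, and the query still returns a single value per series for the threshold to evaluate.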

Thank you for all the work, and let me know if I need to clarify anything.

The HCL I posted is valid and working.

I still have my question, thank you.

Hi @lopter,

As @jangaraj suggested, we recommend starting by building the alert rule in the Grafana UI.

The query and expression details are tightly coupled with the Grafana UI and API, so it’s the easiest way to get them right.

In the UI, you can also export your alert rule in Terraform (HCL) format.

This workflow lets you modify the rule in the UI and observe how those changes are reflected in the Terraform (HCL) configuration. It’s a great way to learn how the Grafana Alerting API works under the hood.

Hello @pepecano,

And this is exactly what I did, as I am certainly not smart enough to figure out the obscure (undocumented?) schema to use if you want to define alerts as code, or how many backslashes one needs to match a literal . in a regexp… If that sounds negative, rest assured that I am old enough to be very grateful for the work done by Prometheus and Grafana Labs…

As I told @jangaraj, the code in my original post is working. My question stems from authoring it, so let me try to rephrase it a bit:

My code was not working at first because I didn’t have instant = true under model in the first data block, which led to the error message:

invalid format of evaluation results for the alert definition : frame cannot uniquely be identified by its labels: has duplicate results with labels {}

This happens because the mandatory relative_time_range block (which cannot be an empty range: that gets rejected with a 4xx Bad Request) means you are doing a range query unless you set instant = true. You therefore end up with multiple values per series and violate point ① documented at Queries and conditions:

Time series data — The query returns a collection of time series, where each series must be reduced to a single numeric value for evaluating the alert condition.

Now I guess using a range makes sense if you e.g. sum up the data in your promql expr and fall back to one value. But then why can’t I specify an empty range to get the last value, and what is the purpose of the reduce functions in the alert condition (the second data block in my resource) if you cannot get there with multiple values?

edit: if that helps, the title of this thread is « Terraform: How to use the reduce functions in alert condition », as opposed to « How to define alerts in Terraform ». In hindsight maybe I could have dropped Terraform from the title entirely, maybe that would have helped…

Yes. I would recommend creating a new post and simplifying it: don’t use TF at all there (that’s an implementation detail on your end). Right now you need someone with both TF and alerting knowledge, so you will be far better off if your post needs only alerting knowledge.

Alright… Here we go: How to use the reduce functions in alert condition?

Replying to this from the other thread, because the context in this thread is helpful:

@jangaraj, as we can see in my TF resource (or in @pepecano’s second screenshot), at least two queries are involved in the definition of an alert rule: the data source query and the alert condition query, to quote the docs:

The alert condition is the query or expression that determines whether the alert fires or not […].

Now whether you read the documentation or look at the code generated by default (using the export button), it feels like the reduce function is meant to be used on the alert condition query¹. And I don’t understand the point of doing that if you are not allowed to have a time series with more than one value in your alert condition.

It may just be that the generated code you can export is slightly nonsensical, that the documentation is confusing, and that the reduce functions are in fact not meant to be used on the alert condition queries, but on the data source queries only.

I’ve not talked with Grot about this, but I did talk with Sonnet 4.5. The (low) quality of its response, together with the fact that “classical search engine” results for the error message in my original post are about a different issue/context, is why I made this thread in the first place.

edit: to be clear, I understand the alert condition can only be evaluated on a single value, I am not asking or arguing about that.


¹ And the Terraform resource in my original post had this attribute under model.conditions.0 in the second data block:

reducer = {
  params = [],
  type   = "last"
},

Which I removed because I didn’t understand how it made any sense.
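For comparison, a reducer of this shape is what a classic condition expression uses: there the condition reads a query that may still hold several data points per series, reduces it, and evaluates it in one step. The fragment below is a sketch of that layout, not something taken from this thread; the ref IDs A and B are placeholders, and whether a threshold expression pays any attention to a leftover reducer is an assumption worth checking against your own export.

```hcl
# Sketch only: a classic condition that reduces query "A" itself.
# It would sit inside a rule { ... } block whose condition is "B".
data {
  ref_id         = "B"
  datasource_uid = "__expr__"
  relative_time_range {
    from = 0
    to   = 0
  }
  model = jsonencode({
    refId = "B"
    type  = "classic_conditions"
    conditions = [
      {
        evaluator = { params = [0], type = "gt" }
        operator  = { type = "and" }
        query     = { params = ["A"] }               # the data source query to read
        reducer   = { params = [], type = "last" }   # how its points are collapsed
        type      = "query"
      }
    ]
    datasource = { type = "__expr__", uid = "__expr__" }
  })
}
```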

Hi again @lopter,

You didn’t sound negative. I agree the format isn’t obvious, and it’s not properly documented (which is why I recommended learning by using the UI).

You wrote:

means you are doing a range query (unless you have instant = true that is) and you therefore end up with multiple values and violate point ① documented at Queries and conditions:

DOC: The query returns a collection of time series, where each series must be reduced to a single numeric value for evaluating the alert condition.

That doc sentence could be clearer. The initial query does not need to return a reduced value. The important distinction is:

  • The initial query can return a time series (or multiple time series).

  • Normally, a reduce expression then reduces all data points in the time series to a single numeric value for the alert condition.
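The two points above can be sketched as a three-step chain in Terraform (HCL): a range query, a reduce expression, and a threshold. This is a minimal sketch under assumptions: it reuses the folder and data source references from the original post, while the resource name, rule name, and the ref IDs A, B and C are placeholders. The model fields follow the layout the UI export usually produces, so compare it against your own export before relying on it.

```hcl
# Sketch only: resource name, rule name and ref IDs A/B/C are placeholders.
resource "grafana_rule_group" "reduce_example" {
  name             = "Reduce example"
  folder_uid       = grafana_folder.opentofu_alerts.uid
  interval_seconds = 60

  rule {
    name      = "Failed systemd services (range query, then reduce)"
    for       = "2m"
    condition = "C" # the threshold expression is the alert condition

    # A: range query; each matching series carries several data points.
    data {
      ref_id         = "A"
      datasource_uid = data.grafana_data_source.mimir.uid
      relative_time_range {
        from = 120
        to   = 0
      }
      model = jsonencode({
        refId   = "A"
        instant = false
        range   = true
        expr    = <<-EOT
          node_systemd_unit_state{state="failed", name=~".+\\.service"}
        EOT
      })
    }

    # B: reduce expression; collapses each series from A to its last value.
    data {
      ref_id         = "B"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        refId      = "B"
        type       = "reduce"
        expression = "A"    # input: the range query
        reducer    = "last" # could also be e.g. min, max, mean or sum
        datasource = { type = "__expr__", uid = "__expr__" }
      })
    }

    # C: threshold on the reduced value; fires when it is above 0.
    data {
      ref_id         = "C"
      datasource_uid = "__expr__"
      relative_time_range {
        from = 0
        to   = 0
      }
      model = jsonencode({
        refId      = "C"
        type       = "threshold"
        expression = "B" # input: the reduced value, not the raw series
        conditions = [
          {
            evaluator = { params = [0], type = "gt" }
            operator  = { type = "and" }
            query     = { params = ["C"] }
            type      = "query"
          }
        ]
        datasource = { type = "__expr__", uid = "__expr__" }
      })
    }
  }
}
```

The difference from the resource in the original post is step B: the query is allowed to return several points per series, and the reduce expression is what turns them into the single value the threshold needs.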

The initial query could also provide the reduced value — I guess this is where you’re running into an issue.

I’d still suggest using the UI to get it working first, and then exporting the Terraform (HCL) code. I believe this workflow will make things easier.

If you want, ping me on Slack and I’ll try to help if I can replicate the issue.