CloudWatch metric math in alert rules?

  • What Grafana version and what operating system are you using?

Grafana v10.4.2 (701c851be7) from Docker image “grafana/grafana-oss:10.4.2”

  • What are you trying to achieve?

Build an alert rule based on the CloudWatch metric HTTPCode_ELB_5XX_Count (gaps with no data are common for this metric)

When I build the series in Explore or in a “Time series” visualization, FILL() works well.

However, when I try to use the same function while building the alert rule, it doesn’t work.
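
For reference, the two queries in the alert rule look roughly like this:

Query A (CloudWatch): Namespace = AWS/ApplicationELB, Metric = HTTPCode_ELB_5XX_Count, Statistic = Sum, Id = t5xx
Query B (CloudWatch metric math): Expression = FILL(t5xx, 0)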

Should I use some other ID instead of t5xx? I tried D to no avail. Can FILL() work at all in alert rules? I saw this paragraph in the docs:

If you use the expression field to reference another query, like queryA * 2, you can’t create an alert rule based on that query.

but honestly, I don’t understand if that applies here.

  • What did you expect to happen?

FILL(t5xx, 0) would produce a series with zeroes where t5xx has no values.

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.

The error was

[sse.dataQueryError] failed to execute query [E]: metric request error: "ValidationError: Error in expression 'queryd2a8fefdcf004519a6f2d8edcb28730b': ID 't5xx' not found\n\tstatus code: 400, request id: e1fbdb41-2db1-45b7-b7a0-22b45d0f0fd0"

@jangaraj thank you for the link, it helped me solve the issue.
I am on 10.4, but Grafana still needs the sseGroupByDatasource feature toggle. As far as I understand, without it Grafana sends each query to CloudWatch in a separate request, so the FILL() expression can’t see the t5xx ID.
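
In case it helps someone else on the Docker image, the toggle can be enabled either in grafana.ini or via an environment variable, for example:

[feature_toggles]
enable = sseGroupByDatasource

or

GF_FEATURE_TOGGLES_ENABLE=sseGroupByDatasource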

When I enabled sseGroupByDatasource, another problem popped up, this time with CloudWatch itself.

“Error in expression e1 [No valid range found for FILL() function.]”

Eventually, I was able to craft the alert rule. I’ll describe the full solution here, in case somebody runs into the same problem.

I created one rule per load balancer. The wildcard dimension LoadBalancer=* didn’t work well. It’s easy to get the list of load balancers in Terraform and create the rules with a dynamic block, so I preferred that over overloading a single Grafana query with the LoadBalancer=* dimension.

The Terraform code looks something like this:

data "aws_lbs" "all" {}

data "aws_alb" "all" {
  for_each = data.aws_lbs.all.arns
  arn      = each.key
}
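
# One rule group in Grafana, with one alert rule per application load balancer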
resource "grafana_rule_group" "sr" {
  name               = "Success Rate"
...
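  # One rule per ALB; network load balancers ("/net/" in the ARN) are skipped
  # because they don't emit HTTP status code metrics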
  dynamic "rule" {
    for_each = toset(
      [
        for arn in data.aws_lbs.all.arns : arn if !strcontains(arn, "/net/")
      ]
    )
    content {
      name = "${data.aws_alb.all[rule.key].name} service=${lookup(data.aws_alb.all[rule.key].tags, "service", "unknown")} SR"
...

It made Grafana’s side of the problem easier and more reliable (as far as I can tell after 12-ish hours).

First, I get RequestCount.

Next, I get the 5XX count (HTTPCode_ELB_5XX_Count).

Next, I use FILL() to replace the “no data” points with zeros. The trick was to fill only the metrics with “5xx” in their names, and there is only one such metric here, HTTPCode_ELB_5XX_Count.

And finally, the usual alert rule logic. A rough sketch of the whole query stack is below.
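
Inside each content { ... } block, the query stack can be expressed roughly like this. Treat it as an illustrative sketch only: the datasource UID is a placeholder, the Reduce/Threshold expressions (refs D and E) are omitted, and the exact CloudWatch model fields are easiest to copy from your own rule (build it once in the UI, then export it), so double-check them there.

      condition = "E"   # E = the final threshold expression (not shown)

      # A: total requests for this load balancer
      data {
        ref_id         = "A"
        datasource_uid = "my-cloudwatch-uid"   # placeholder, use your CloudWatch datasource UID
        relative_time_range {
          from = 600
          to   = 0
        }
        model = jsonencode({
          refId      = "A"
          region     = "default"
          namespace  = "AWS/ApplicationELB"
          metricName = "RequestCount"
          statistic  = "Sum"
          period     = "300"
          id         = "req"
          dimensions = { LoadBalancer = data.aws_alb.all[rule.key].arn_suffix }
        })
      }

      # B: raw 5XX count, which has the "no data" gaps
      data {
        ref_id         = "B"
        datasource_uid = "my-cloudwatch-uid"
        relative_time_range {
          from = 600
          to   = 0
        }
        model = jsonencode({
          refId      = "B"
          region     = "default"
          namespace  = "AWS/ApplicationELB"
          metricName = "HTTPCode_ELB_5XX_Count"
          statistic  = "Sum"
          period     = "300"
          id         = "t5xx"
          dimensions = { LoadBalancer = data.aws_alb.all[rule.key].arn_suffix }
        })
      }

      # C: CloudWatch metric math that turns the gaps into zeroes
      data {
        ref_id         = "C"
        datasource_uid = "my-cloudwatch-uid"
        relative_time_range {
          from = 600
          to   = 0
        }
        model = jsonencode({
          refId      = "C"
          region     = "default"
          expression = "FILL(t5xx, 0)"
          id         = "e1"
          period     = "300"
        })
      }

      # D, E: the usual Grafana server-side Reduce and Threshold expressions
      # on top of A and C (success rate = 1 - $C / $A); omitted here.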
