Values from the AWS CloudWatch exporter not matching the ones on AWS (in this case AWS/ApplicationELB)

  • What Grafana version and what operating system are you using?

Grafana 11.3.1; the OS is Bottlerocket (Linux); data is persisted in Mimir 2.14.0.

  • What are you trying to achieve?

I want accurate 1-minute values from our AWS load balancer, so we can rely on them matching the values that AWS registered. In particular the Sum for ActiveConnectionCount, NewConnectionCount, and RequestCount; on top of those there are still the 4XX and 5XX count metrics.

  • How are you trying to achieve it?

We do this via the prometheus.exporter.cloudwatch component: fetching the data with Alloy (see config below) and then displaying it by running this query:

Expr: (aws_applicationelb_new_connection_count_sum{})
Step: 1m0s
  • What happened?

When comparing the values I have with those in CloudWatch directly, my count is way off. E.g. for NewConnectionCount, the count on AWS CloudWatch metrics was always considerably higher than the one in Prometheus.

Apart from the totals being different, I could match peaks occurring on both sides, but those peaks were registered at a timestamp roughly 3 minutes later in Prometheus. This indicates there is a delay, so if I shift the Prometheus series back by that amount, the metrics match those on AWS better.

Even when I compared the count on AWS CloudWatch with Prometheus and accounted for that delay, a considerable number of connections still got missed.

E.g. over a duration of 147 minutes, AWS counted 1362 new connections; Prometheus missed 270 of them (265 when accounting for the delay), which is around 20%.
E.g. over a duration of 657 minutes, AWS counted 2708 new connections; Prometheus missed 625 of them (626 when accounting for the delay), which is around 23%.
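
For illustration, one way to total such a window in PromQL and to apply the 3-minute shift (a sketch; it assumes exactly one scraped sample per 1-minute period, and the 147m range is just the window from the first example above):

Expr: sum_over_time(aws_applicationelb_new_connection_count_sum[147m])
Expr (shifted for the observed delay): sum_over_time(aws_applicationelb_new_connection_count_sum[147m] offset 3m)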

  • What did you expect to happen?

I expected the values associated with the timestamps to be correct as soon as AWS CloudWatch had settled on the metrics. I could live with having to deal with a delay (even though that would complicate defining alarms), as long as I knew that I could rely on the metrics.

  • Can you copy/paste the configuration(s) that you are having problems with?

Our config is set via:

  prometheus.exporter.cloudwatch "ourmonitoredcloud" {
    sts_region = "eu-west-1"

    debug = false

    decoupled_scraping {
      enabled = true
      scrape_interval = "1m"
    }

    discovery {
      type = "AWS/ApplicationELB"
    regions = ["eu-west-1"]

      role {
        role_arn = [rolehere]
      }

      search_tags = {
        "Monitored" = "true",
      }

      dimension_name_requirements = ["LoadBalancer"]

      metric {
        name       = "NewConnectionCount"
        statistics = ["Sum"]
        period     = "1m"
        length     = "1m"
      }
      // Same statistics, period, and length in the metric blocks for the other
      // metrics (ActiveConnectionCount, RequestCount, and the 4XX/5XX counts).
    }
  }
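
For completeness, the exporter's targets are then scraped and remote-written roughly like this (a sketch; the component labels and the Mimir URL are placeholders, not our real values):

  prometheus.scrape "cloudwatch" {
    // Scrape the targets exported by the CloudWatch exporter component above.
    targets         = prometheus.exporter.cloudwatch.ourmonitoredcloud.targets
    scrape_interval = "1m"
    forward_to      = [prometheus.remote_write.mimir.receiver]
  }

  prometheus.remote_write "mimir" {
    endpoint {
      // Placeholder URL; in reality this points to our Mimir 2.14.0 installation.
      url = "http://mimir.example.internal/api/v1/push"
    }
  }
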
  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.

No

  • Did you follow any online instructions? If so, what is the URL?

It took a while at first to realise that CloudWatch can’t deliver correct values for the load balancer in “real time”. E.g. if I request a count via the CLI now and execute the same request 20 seconds later, the value for the same 1-minute period might already be different.
So I went looking into how to account for that. The yet-another-cloudwatch-exporter repository on GitHub mentions the use of a delay, which seemed relevant for my particular use case.
But even though that repo is used by Alloy, as the Alloy documentation itself states:

The prometheus.exporter.cloudwatch component embeds yet-another-cloudwatch-exporter, letting you collect Amazon CloudWatch metrics, translate them to a prometheus-compatible format and remote write them.

I couldn’t find any indication that this is something I can leverage (in the Alloy source on GitHub it seems to set the delay to 0 and points to the rounding of the period as the solution). Based on the numbers in my findings, Alloy is already applying some delay, but one that I cannot seem to control via configuration.

Any help on getting accurate AWS CloudWatch load balancer data is appreciated. Having reliable numbers means I can consider detecting an unexpected increase and quickly triggering reactions accordingly.

I would say that’s because CloudWatch metrics are usually “delayed”.

For example, the Alloy scrape at 10:01:45 scrapes the period 10:00:00–10:00:59 and gets the value 10,
but if you check the same period later, e.g. at 10:05:00, the value can be higher.

If you need to replicate exact values from CloudWatch, then I would recommend a different exporter where you can set a delay (I would suggest a delay of 5 minutes): yet-another-cloudwatch-exporter/docs/configuration.md at master · prometheus-community/yet-another-cloudwatch-exporter · GitHub
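
For reference, such a YACE discovery job could look roughly like this, based on that configuration doc (a sketch; the role ARN and tag are placeholders mirroring the Alloy config above, and the exact field names and defaults should be verified against the YACE version you deploy):

  apiVersion: v1alpha1
  sts-region: eu-west-1
  discovery:
    jobs:
      - type: AWS/ApplicationELB
        regions:
          - eu-west-1
        roles:
          - roleArn: "<role ARN here>"
        searchTags:
          - key: Monitored
            value: "true"
        metrics:
          - name: NewConnectionCount
            statistics: [Sum]
            period: 60
            length: 60
            # Shift the query window 5 minutes into the past so CloudWatch has
            # had time to settle on the final value for each period.
            delay: 300
          # ... same settings for the other metrics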

If you don’t need Prometheus, then configure the CloudWatch data source in Grafana and read the metrics directly from CloudWatch.

Indeed, I have experienced that delay with their CLI. I had hoped that it would either be possible to account for it (via a parameter I missed), or that inaccurate values were unexpected, so that it would count as a “bug”.

So I understand your response:

  1. the fact that metrics currently register under a timestamp roughly 3 minutes after the fact is a given; that delay cannot be modified

  2. it is expected to have inaccurate values when using prometheus.exporter.cloudwatch (Grafana Alloy documentation)

Would ‘have accurate values’ be a feature request that I could raise?
Because, as you said, yet-another-cloudwatch-exporter has this and Grafana says that they embed it. So why not make use of the functionality in yet-another-cloudwatch-exporter to help with that?
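
Concretely, the request would be to expose the existing delay setting in the Alloy config, something along these lines (purely hypothetical syntax for illustration; as discussed above, the component does not currently expose this):

      metric {
        name       = "NewConnectionCount"
        statistics = ["Sum"]
        period     = "1m"
        length     = "1m"
        delay      = "5m" // hypothetical attribute: only query periods CloudWatch has settled on
      }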

That doesn’t sound good. Ask them to expose/document all config options from the underlying lib/tool that is used. Then you can configure what and how you need (if you understand how it works, of course). It doesn’t make sense to hardcode a delay, because some people may prefer “realtime” metrics.

Hi, my apologies, I would of course be more diplomatic in the wording of a feature request :).
I only wanted to find out whether it is worth raising, so that it would enable having a choice. Then people could choose to go for

  • real-time (but possibly inaccurate) values by using the default
  • accurate (but not real-time) values by e.g. using a delay

Best regards,
Diederik

Yes, it is worth it, for example for your use case. That CloudWatch delay is a CloudWatch feature, not a bug (I guess there are some asynchronous requests on the AWS side, so it is delayed), so there should be an option on the exporter side to work with that feature.

That feature request is already there: