- What Grafana version and what operating system are you using?
Grafana 11.3.1, OS is Bottlerocket (Linux); data is persisted in Mimir 2.14.0.
- What are you trying to achieve?
I want accurate 1-minute values from our AWS load balancer, so we can rely on them matching the values that AWS registered. In particular the Sum statistic for ActiveConnectionCount, NewConnectionCount and RequestCount, and after that also the 4XX and 5XX count metrics.
- How are you trying to achieve it?
We want to do this via prometheus.exporter.cloudwatch: fetching the data with Alloy (see config below) and then displaying it by running the query:
Expr: (aws_applicationelb_new_connection_count_sum{})
Step: 1m0s
- What happened?
When comparing the values I have with those on CloudWatch directly, my count is way off. E.g. for NewConnectionCount, the count on AWS CloudWatch metrics was always considerably higher than the one in Prometheus.
Apart from the totals being different, I could match peaks happening on both sides. But those peaks were registered on a timestamp about 3 minutes later in Prometheus. This indicates there is a delay: if I shift the Prometheus series back by that amount, the metrics match those on AWS much better.
Even when comparing the AWS CloudWatch count with Prometheus and accounting for that delay, a considerable number of connections were still missing.
E.g. over a duration of 147 minutes AWS counted 1362 new connections; Prometheus missed 270 of them (265 when accounting for the delay). That's around 20%.
E.g. over a duration of 657 minutes AWS counted 2708 new connections; Prometheus missed 625 of them (626 when accounting for the delay). That's around 23%.
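For reference, this is roughly how I totalled the Prometheus side over such a window (a sketch; it assumes exactly one 1-minute Sum sample per minute per series, with no duplicates):
Expr: sum_over_time(aws_applicationelb_new_connection_count_sum{}[147m])
If the exporter emits the same CloudWatch data point at more than one scrape timestamp, or drops late-arriving ones, a window total like this will be off accordingly.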
- What did you expect to happen?
I expected to have correct values associated with the timestamps as soon as AWS CloudWatch settled on the metrics. I could live with a delay (even though that would complicate defining alarms), as long as I knew I could rely on the metrics.
- Can you copy/paste the configuration(s) that you are having problems with?
Our config is set via:
prometheus.exporter.cloudwatch "ourmonitoredcloud" {
  sts_region = "eu-west-1"
  debug      = false
  decoupled_scraping {
    enabled         = true
    scrape_interval = "1m"
  }
  discovery {
    type    = "AWS/ApplicationELB"
    regions = ["eu-west-1"]
    role {
      role_arn = [rolehere]
    }
    search_tags = {
      "Monitored" = "true",
    }
    dimension_name_requirements = ["LoadBalancer"]
    metric {
      name       = "NewConnectionCount"
      statistics = ["Sum"]
      period     = "1m"
      length     = "1m"
    }
    // same metric blocks for the other metrics as with NewConnectionCount
  }
}
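One thing I have considered but not yet verified (a sketch only; I have not confirmed how Alloy/YACE handles it): keeping period at 1m but making length longer, so that each scrape asks CloudWatch for several 1-minute data points and late-settling values still fall inside the requested window:
    metric {
      name       = "NewConnectionCount"
      statistics = ["Sum"]
      period     = "1m"
      // assumption: look back over 5 minutes of 1-minute data points so that
      // values CloudWatch settles late are still inside the request window
      length     = "5m"
    }
If someone can confirm whether the exporter then re-emits older data points or only the most recent one, that would already help me.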
- Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
No
- Did you follow any online instructions? If so, what is the URL?
It took a while at first to realise that CloudWatch can't give correct values for the load balancer in "real time". E.g. if I request a count via the CLI now and run the same call 20 seconds later, the value for the same 1-minute period might already be different.
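This is the kind of check I ran (the LoadBalancer dimension value and the time range are placeholders); executing it twice a short while apart can return different Sum values for the most recent minute:
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name NewConnectionCount \
  --dimensions Name=LoadBalancer,Value=app/my-alb/1234567890abcdef \
  --statistics Sum \
  --period 60 \
  --start-time 2024-12-01T10:00:00Z \
  --end-time 2024-12-01T10:05:00Z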
So I went looking at how to account for that. In the yet-another-cloudwatch-exporter GitHub repo they mention a delay setting, which seemed relevant to my particular use case.
But even though that repo is what Alloy builds on (the prometheus.exporter.cloudwatch component "embeds yet-another-cloudwatch-exporter, letting you collect Amazon CloudWatch metrics, translate them to a Prometheus-compatible format and remote write them"), I couldn't find any indication that this delay is something I can set from Alloy: in the Alloy GitHub repo it seems to hard-code the Delay to 0 and points to the rounding of the period as the solution. Based on the numbers in my findings, Alloy is applying some delay already, but one that I cannot seem to control via configuration.
Any help on getting accurate AWS CloudWatch load balancer data is appreciated. Having reliable numbers means I can consider detecting an unexpected increase and triggering a reaction quickly.