We use CloudWatch as our data source, and we believe that some of our alerts may be misconfigured. However, with many thousands of alerts, it's hard to tell which ones.
So is there a way to log the query expression of every GetMetricData API call to CloudWatch for diagnostic purposes?
We've already tried changing the log level to debug, but that doesn't help: it logs the API calls made when you view dashboards in a browser, but not the API calls made by alert execution.
Any ideas please?
I'd say a better approach is to use the API to list all alert rules and their alert queries.
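As a rough sketch of what that listing could look like: this assumes Grafana unified alerting (9.x or later) with the alerting provisioning endpoint `/api/v1/provisioning/alert-rules` available, and a service account token. The function names are made up for illustration, and the CloudWatch model fields read here (`namespace`, `metricName`, `period`, `dimensions`) are my assumption about what the stored query JSON contains, so check them against one of your own rules first.

```python
import json
import urllib.request

# Datasource UIDs Grafana uses for server-side expression nodes (reduce/math/threshold).
# Assumption: "__expr__" on recent versions, "-100" on older ones.
EXPRESSION_UIDS = {"__expr__", "-100"}


def fetch_alert_rules(base_url, token):
    """Fetch all alert rules via the alerting provisioning API (assumed available)."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/provisioning/alert-rules",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize_queries(rules):
    """Return one row per data source query, skipping expression nodes."""
    rows = []
    for rule in rules:
        for query in rule.get("data", []):
            if query.get("datasourceUid") in EXPRESSION_UIDS:
                continue
            model = query.get("model", {})
            time_range = query.get("relativeTimeRange", {})
            rows.append({
                "rule": rule.get("title"),
                "refId": query.get("refId"),
                # relativeTimeRange is expressed in seconds before "now"
                "range_seconds": time_range.get("from", 0) - time_range.get("to", 0),
                # Assumed CloudWatch model fields; missing keys become None
                "namespace": model.get("namespace"),
                "metricName": model.get("metricName"),
                "period": model.get("period"),
                "dimensions": model.get("dimensions"),
            })
    return rows
```

Calling `summarize_queries(fetch_alert_rules("https://grafana.example.com", token))` (the URL is a placeholder) would give you a flat list you can dump to CSV and sort.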
What we’re trying to do is to figure out which queries are causing excessive CloudWatch throttling. So we need the execution timestamps as well as the queries.
So unfortunately, getting the alert definitions themselves won't help much. Unless I'm misunderstanding your suggestion? If so, please correct me. Thanks.
OK, got it. Which CloudWatch quota are you hitting?
Quotas can generally be increased, so why not request an increase?
It’s the GetMetricData DPS (Datapoints Per Second) quota. Sadly this is a hard limit and cannot be changed.
I would still list all rules with their CloudWatch queries, time ranges, and aggregation periods. You can then run them with the AWS CLI to evaluate the DPS each query consumes.
You can focus on suspicious queries first, e.g. long time ranges, fine aggregation periods, wildcard dimension values, …
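To make "suspicious" a bit more concrete, here is a minimal back-of-the-envelope sketch for ranking queries before you run anything against CloudWatch. It assumes a query requests roughly (time range / period) datapoints per matched series, and spreads that over the rule's evaluation interval to get an average rate. The `RuleQuery` structure and the sample numbers are hypothetical, and this is only an estimate, not how CloudWatch itself accounts for DPS.

```python
import math
from dataclasses import dataclass


@dataclass
class RuleQuery:
    name: str
    range_seconds: int          # query time range
    period_seconds: int         # aggregation period
    series_count: int           # series matched (wildcards inflate this)
    eval_interval_seconds: int  # how often the rule is evaluated


def datapoints_per_evaluation(q):
    """Estimated datapoints requested by one evaluation of the query."""
    return math.ceil(q.range_seconds / q.period_seconds) * q.series_count


def average_dps(q):
    """Estimated datapoints per second, averaged over the evaluation interval."""
    return datapoints_per_evaluation(q) / q.eval_interval_seconds


def rank_by_dps(queries):
    """Heaviest estimated consumers first."""
    return sorted(queries, key=average_dps, reverse=True)


# Hypothetical examples: a narrow query, a wildcard over 24h, a coarse weekly one.
rules = [
    RuleQuery("cpu-high", 300, 60, 1, 60),
    RuleQuery("all-queues-depth", 86400, 60, 200, 60),
    RuleQuery("disk-weekly", 604800, 3600, 10, 300),
]

for q in rank_by_dps(rules):
    print(f"{q.name}: {datapoints_per_evaluation(q)} datapoints/eval, "
          f"~{average_dps(q):.1f} DPS on average")
```

In this made-up sample the wildcard query over 24 hours at a 60-second period dominates: 1440 periods times 200 series is 288,000 datapoints per evaluation, which is the kind of rule worth checking first.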