Cloudwatch data source 504 Gateway Timeout when assuming a role

  • What Grafana version and what operating system are you using?
    11.6.3 via Amazon Linux 2 in AWS ECS

  • What are you trying to achieve?
    Query Cloudwatch logs and metrics from various AWS accounts within my organization from a single Grafana instance running in an ECS cluster in its own AWS account.

As a first step, I am just trying to get the Cloudwatch logs and metrics from the same AWS account where Grafana is running.

  • How are you trying to achieve it?
    Assuming a role within the Cloudwatch data source.
  1. I created a role “GrafanaCloudwatchAccessRole” with a) permissions to access Cloudwatch and b) a trust policy that allows the ECS task to assume the role.
  2. The ECS task role has permissions to assume the “GrafanaCloudwatchAccessRole”.
  3. In the Grafana UI of the Cloudwatch data source, I add the ARN of the “GrafanaCloudwatchAccessRole” role.
  • What happened?
    UI showing error: 504 Gateway Time-out

  • What did you expect to happen?
    That the connection to Cloudwatch can be established by assuming the role.

  • Can you copy/paste the configuration(s) that you are having problems with?
    data "aws_iam_policy_document" "grafana_cloudwatch_access_assume" {
      statement {
        actions = ["sts:AssumeRole"]
        effect  = "Allow"
        principals {
          type        = "AWS"
          identifiers = ["arn:aws:iam::<account_id_a>:role/GATSHA-shared-EcsTask-grafana"]
        }
      }
    }

    resource "aws_iam_role" "grafana_cloudwatch_access" {
      name                 = "GrafanaCloudwatchAccessRole"
      path                 = "/${var.coordinates.scope}/monitoring/"
      assume_role_policy   = data.aws_iam_policy_document.grafana_cloudwatch_access_assume.json
      permissions_boundary = "arn:aws:iam::${var.coordinates.account_id}:policy/ScopePermissionBoundary"
    }

    resource "aws_iam_policy" "grafana_cloudwatch_access" {
      name        = "${var.coordinates.scope}-grafana-cloudwatch-access-policy"
      description = "Policy to allow Grafana in the shared monitoring account to access CloudWatch."
      policy = jsonencode({
        Version = "2012-10-17",
        Statement = [
          {
            Sid    = "AllowReadingMetricsFromCloudWatch",
            Effect = "Allow",
            Action = [
              "cloudwatch:DescribeAlarmsForMetric",
              "cloudwatch:DescribeAlarmHistory",
              "cloudwatch:DescribeAlarms",
              "cloudwatch:ListMetrics",
              "cloudwatch:GetMetricStatistics",
              "cloudwatch:GetMetricData",
              "cloudwatch:GetInsightRuleReport"
            ],
            Resource = "*"
          },
          {
            Sid    = "AllowReadingLogsFromCloudWatch",
            Effect = "Allow",
            Action = [
              "logs:DescribeLogGroups",
              "logs:GetLogGroupFields",
              "logs:StartQuery",
              "logs:StopQuery",
              "logs:GetQueryResults",
              "logs:GetLogEvents"
            ],
            Resource = "*"
          },
          {
            Sid      = "AllowReadingResourceMetricsFromPerformanceInsights",
            Effect   = "Allow",
            Action   = "pi:GetResourceMetrics",
            Resource = "*"
          },
          {
            Sid    = "AllowReadingTagsInstancesRegionsFromEC2",
            Effect = "Allow",
            Action = [
              "ec2:DescribeTags",
              "ec2:DescribeInstances",
              "ec2:DescribeRegions"
            ],
            Resource = "*"
          },
          {
            Sid      = "AllowReadingResourcesForTags",
            Effect   = "Allow",
            Action   = "tag:GetResources",
            Resource = "*"
          },
          {
            Sid    = "AllowReadingOAMResources",
            Effect = "Allow",
            Action = [
              "oam:ListSinks",
              "oam:ListAttachedLinks"
            ],
            Resource = "*"
          }
        ]
      })
    }

    resource "aws_iam_role_policy_attachment" "grafana_cloudwatch_access" {
      role       = aws_iam_role.grafana_cloudwatch_access.name
      policy_arn = aws_iam_policy.grafana_cloudwatch_access.arn
    }

And here is the policy that allows the ECS task role to assume the above role:

    resource "aws_iam_policy" "assume_cross_account_cloudwatch" {
      name        = "${var.coordinates.scope}-grafana-cloudwatch-assume-policy"
      description = "Policy to allow Grafana to assume a cross-account role for CloudWatch access."
      policy = jsonencode({
        Version = "2012-10-17",
        Statement = [
          {
            Sid    = "AllowAssumingCrossAccountCloudwatchRole",
            Effect = "Allow",
            Action = [
              "sts:AssumeRole"
            ],
            Resource = "arn:aws:iam::<account_id_a>:role/GATSHA/monitoring/GrafanaCloudwatchAccessRole"
          }
        ]
      })
    }

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
    In the UI after adding the role arn: 504 Gateway Timeout

Grafana logs:
logger=tsdb.cloudwatch endpoint=callResource pluginId=cloudwatch dsName=cloudwatch dsUID=ber70rddn9y4ga uname=admin t=2025-07-07T14:19:15.612800886Z level=error msg="Error handling resource request" error="error getting accounts for current user or role: ListSinks error: RequestError: send request failed\ncaused by: Post "https://sts.amazonaws.com/": dial tcp 52.94.139.12:443: i/o timeout"

logger=tsdb.cloudwatch endpoint=callResource pluginId=cloudwatch dsName=cloudwatch dsUID=ber70rddn9y4ga uname=admin t=2025-07-07T14:19:15.613006569Z level=error msg="Failed to get regions: " error="RequestError: send request failed\ncaused by: Post "https://sts.amazonaws.com/": dial tcp 52.94.139.12:443: i/o timeout"

What am I missing to connect to Cloudwatch? FYI, when I attach the policy directly to the task role, I can access Cloudwatch, but that is not a solution for querying Cloudwatch in multiple accounts. Also, within the Grafana container I can use the AWS CLI to assume the role and query Cloudwatch.

That appears to be a connectivity issue, so focus on the TCP network level (security groups, network ACLs, NAT, gateways, route tables, firewall, …) rather than on IAM. Test it with Reachability Analyzer, or ask AWS support. This doesn’t look like a Grafana app issue, but an issue with your infrastructure.

Hi jangaraj, thank you for your response.

I think the connectivity between the ECS task that runs Grafana and STS is OK, as I’m able to assume the role using the CLI within the container.

Checking the permission boundaries for the role, I see that oam:* is currently not available for my organization. When I try to use list-sinks with the CLI I get an AccessDeniedException, and I guess this is what eventually leads to the timeout raised by Grafana. I’ll get the permissions for oam and test again afterwards.

/usr/share/grafana# aws oam list-sinks
An error occurred (AccessDeniedException) when calling the ListSinks operation: User: arn:aws:sts::xxxxxx:assumed-role/GrafanaCloudwatchAccessRole/MySessionName is not authorized to perform: oam:ListSinks on resource: arn:aws:oam:eu-central-1:xxxx:/ListSinks

No, this is a TCP connectivity error: Grafana is not able to establish a TCP connection to sts.amazonaws.com. The IAM magic only happens once the TCP/HTTPS connection is established:

Post "https://sts.amazonaws.com/": dial tcp 52.94.139.12:443: i/o timeout

STS is ok as I’m able to assume the role using the cli within the container.

Yes, but that is a different case, and the details may differ: for example, the CLI may be connecting to a regional STS endpoint instead of the global STS endpoint.

There is an obvious check: run curl -v https://sts.amazonaws.com/ and also run that working “assume the role using the CLI within the container” call with debug output. Check which endpoint it connects to, which IP it resolves to, …
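
A minimal sketch of that comparison, assuming curl is available in the Grafana container. The function names sts_regional_host and check_endpoint are hypothetical, the 5-second default timeout is arbitrary, and the hostname pattern assumes the standard AWS partition (amazonaws.com):

```shell
#!/bin/sh
# Hypothetical helper: build the regional STS hostname for a region.
# ASSUMPTION: standard "aws" partition, so the suffix is amazonaws.com.
sts_regional_host() {
    printf 'sts.%s.amazonaws.com\n' "$1"
}

# Hypothetical helper: try to open an HTTPS connection within a timeout.
# Any HTTP response counts as reachable; only connection failures are errors.
check_endpoint() {
    host=$1
    timeout=${2:-5}
    curl --silent --output /dev/null --connect-timeout "$timeout" "https://$host/"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "$host reachable"
    else
        echo "$host NOT reachable (curl exit code $rc)"
    fi
}

# Example usage (not executed here, it needs network access):
#   check_endpoint sts.amazonaws.com
#   check_endpoint "$(sts_regional_host eu-central-1)"
```

If the global host times out while the regional host answers, the difference between Grafana and the CLI is most likely the endpoint each one talks to, not IAM.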

I get your point. Within the Grafana ECS task, curl -v https://sts.amazonaws.com/ runs into a timeout:

ip-10-202-235-114:/usr/share/grafana# curl -v https://sts.amazonaws.com/
* Host sts.amazonaws.com:443 was resolved.
* IPv6: (none)
* IPv4: 52.94.141.74
*   Trying 52.94.141.74:443...
* connect to 52.94.141.74 port 443 from 10.202.235.114 port 49072 failed: Operation timed out

After adding proxy configuration to the ECS task to enable internet access, the same request succeeds:

ip-10-202-235-127:/usr/share/grafana# curl -v https://sts.amazonaws.com/
* Uses proxy env variable NO_PROXY == 'xxxx,,.eks.amazonaws.com,s3.eu-central-1.amazonaws.com'
* Uses proxy env variable HTTPS_PROXY == 'https://proxy.eu-central-1.xxxxxxxxx.com:443'
* Host proxy.eu-central-1.aws.xxxxxx.com:443 was resolved.
* IPv6: (none)
* IPv4: 10.202.16.240, 10.202.18.84, 10.202.17.19
*   Trying 10.202.16.240:443...
* CONNECT tunnel: HTTP/1.1 negotiated
* allocate connect buffer
* Establish HTTP proxy tunnel to sts.amazonaws.com:443
> CONNECT sts.amazonaws.com:443 HTTP/1.1
> Host: sts.amazonaws.com:443
>
< HTTP/1.1 200 Connection established

Assuming the role with the --debug flag shows that the CLI successfully gets the credentials from the regional host sts.eu-central-1.amazonaws.com:

ip-10-202-235-53:/usr/share/grafana# aws sts assume-role --role-arn arn:aws:iam::xxxx:role/GATSHA/monitoring/GrafanaCloudwatchAccessRole --role-session-name DebugSession --debug
...
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sts.eu-central-1.amazonaws.com

20250709T135455Z
20250709/eu-central-1/sts/aws4_request
{
    "Credentials": {
        "AccessKeyId": "xxx",
        "SecretAccessKey": "xxxxxx",
        "SessionToken": "xxx"
    }
}
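
To pull just the endpoint out of a long --debug trace, a small filter along these lines may help. It is only a sketch: extract_sts_hosts is a hypothetical helper name, and it assumes the trace contains host: header lines at the start of a line, like the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: read AWS CLI --debug output on stdin and print the
# distinct hostnames found in "host:" header lines.
extract_sts_hosts() {
    grep -o '^host:[A-Za-z0-9.-]*' | sort -u | cut -d: -f2
}

# Example usage (not executed here; debug output is written to stderr,
# hence the 2>&1):
#   aws sts assume-role \
#       --role-arn arn:aws:iam::xxxx:role/GATSHA/monitoring/GrafanaCloudwatchAccessRole \
#       --role-session-name DebugSession --debug 2>&1 | extract_sts_hosts
```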

Within the Grafana UI, when I try again to connect to Cloudwatch by assuming the role, I still get errors, but they are different now:

logger=tsdb.cloudwatch endpoint=callResource pluginId=cloudwatch dsName=cloudwatch dsUID=ceriy65bj7thcc uname=admin t=2025-07-10T16:22:05.633298804Z level=error msg="Failed to get regions: " error="NoCredentialProviders: no valid providers in chain\ncaused by: EnvAccessKeyNotFound: failed to find credentials in the environment.\nSharedCredsLoad: failed to load profile, .\nCredentialsEndpointError: failed to load credentials
logger=tsdb.cloudwatch endpoint=callResource pluginId=cloudwatch dsName=cloudwatch dsUID=ceriy65bj7thcc uname=admin t=2025-07-10T16:22:05.634330588Z level=error msg="Error handling resource request" error="error getting accounts for current user or role: ListSinks error: NoCredentialProviders: no valid providers in chain\ncaused by: EnvAccessKeyNotFound: failed to find credentials in the environment.\nSharedCredsLoad: failed to load profile, .\nCredentialsEndpointError: failed to load credentials\ncaused by: SerializationError: failed to decode error message\n\tstatus code: 403,

I think adding the proxy fixed the TCP-level connectivity issue that you mentioned. At least there are no more timeouts.

Am I right in assuming that the missing oam:ListSinks permission is now what makes the connection fail? Do we even need it when assuming a role in the same account, as in this example?

I only commented on the “TCP timeout issue”. There can be billions and billions of other issues once you have sorted out the TCP timeout.

IMHO proxy variables may introduce other problems. The CloudWatch data source config has an Endpoint setting, where you can specify your regional STS endpoint. Since sts.eu-central-1.amazonaws.com works for your infra, use that in Grafana.
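
For a provisioned data source, the same setting could be sketched as below. This is an illustration under assumptions, not a verified configuration: render_cloudwatch_datasource is a hypothetical helper, the jsonData keys (authType, assumeRoleArn, defaultRegion, endpoint) are assumed to mirror the UI fields and should be checked against the CloudWatch provisioning documentation, and whether the endpoint value needs the https:// scheme is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print a CloudWatch data source provisioning entry that
# assumes a role and points the Endpoint setting at the regional STS host.
# ASSUMPTION: the jsonData key names below match the provisioning format.
render_cloudwatch_datasource() {
    region=$1
    role_arn=$2
    cat <<EOF
apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default
      defaultRegion: ${region}
      assumeRoleArn: ${role_arn}
      endpoint: https://sts.${region}.amazonaws.com
EOF
}

# Example usage (the output path is an assumption about the provisioning directory):
#   render_cloudwatch_datasource eu-central-1 "<role_arn>" \
#       > /etc/grafana/provisioning/datasources/cloudwatch.yaml
```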

After specifying the regional sts endpoint I can now assume roles across accounts and query data from Cloudwatch, without using a proxy.

Thank you for your help, jangaraj! It is highly appreciated!