Thanks for taking the time to read this.
We recently had a problem where our ingestion endpoints failed, and the Grafana Agents running on our EC2 instances were unable to send metrics and logs.
The agent then ate all the system memory until the OOM killer got invoked and started killing processes; sometimes it was the agent, sometimes other large processes. This caused a right headache, as you can imagine.
Is there a way to prevent this from happening again, perhaps with a configuration setting in the agent? My other idea is to use cgroups, but ideally I want a simple, quick fix of course!
This is what we saw in the logs:
# egrep -i "oom|memory" /var/log/messages
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.326855] grafana-agent invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.474421] oom_kill_process+0x223/0x420
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.478000] out_of_memory+0x102/0x4c0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.776955] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.261818] Out of memory: Kill process 9793 (grafana-agent) score 431 or sacrifice child
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.471967] oom_reaper: reaped process 9793 (grafana-agent), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
This is the process:
# ps -ewfwl | grep grafa
4 S root 1990 1 1 80 0 - 346297 - Jun29 ? 00:23:34 /usr/bin/grafana-agent --config.file /etc/grafana-agent/agent-config.yml -config.expand-env
Despite the bot marking this as resolved, it isn’t.
I’m happy to receive any ideas.
I read the docs on the command-line options and there was nothing obvious I could use to control this behaviour.
I am pondering learning about cgroups to try to mitigate it.
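If it helps, since the agent runs under systemd here, the cgroup route might just be a couple of resource-control directives in a drop-in. A minimal sketch, assuming the unit is named grafana-agent.service and using 512M purely as a placeholder, not a tested value:

# /etc/systemd/system/grafana-agent.service.d/memory.conf
[Service]
# hard cgroup cap: the kernel OOM-kills inside this unit above 512M (cgroup v2; older systems use MemoryLimit=)
MemoryMax=512M
# soft limit: reclaim and throttling kick in before the hard cap
MemoryHigh=384M

Then systemctl daemon-reload and restart the unit. This only contains the damage to the agent's own cgroup rather than stopping it from trying to buffer everything in memory.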
You can try to configure the backoff behaviour, e.g. for logs: Configuration | Grafana Loki documentation
I guess all the data queued between retries is kept in memory, so just drop it (metrics, logs, traces) earlier, e.g. max_period: 30s, max_retries: 3. Of course you will lose the dropped data.
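For the logs part that would look roughly like this in agent-config.yml; a sketch only, with a placeholder push URL, the 30s/3 figures from above, and the layout assuming the agent's static-mode logs config (which follows Promtail's client settings):

logs:
  configs:
    - name: default
      clients:
        - url: https://loki.example.com/loki/api/v1/push
          backoff_config:
            min_period: 500ms   # initial wait between retries
            max_period: 30s     # never wait longer than 30s between retries
            max_retries: 3      # then drop the batch instead of keeping it buffered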
Have you found any solution? I’m having the same issue and I don’t know what to do.
Ideally the grafana-agent process will run as its own user, which makes it easier to limit.
Say the user is “grafana_agent”; then something like this added to /etc/security/limits.conf may work:
grafana_agent hard core 2097152
grafana_agent hard data 268435456
grafana_agent hard stack 131072
grafana_agent hard nproc 4
to limit its core, data, stack, and process count. Note that the core/data/stack values in limits.conf are in KB, so as written these work out to a 2 GB core limit, 256 GB of data, 128 MB of stack, and 4 processes.
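One way to confirm what the kernel is actually enforcing on the running agent (assuming pidof resolves to a single agent PID):

# grep -E 'Max (core|data|stack|processes)' /proc/$(pidof grafana-agent)/limits

Also worth noting that limits.conf is applied by pam_limits at login, so if the agent is started by systemd it's the unit's Limit* directives that normally take effect.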
I’ll update these numbers when I’ve had time to play. I’m adding them to the systemd service file, since the GA is not running as a unique user on my systems.
So far this has been stable.
Here’s our systemd service file:
# limit the agent to 2 MB of core, 256 MB of data, 4 processes and 128 KB of stack
# at least that's what I think these numbers mean (systemd Limit* values are in bytes unless suffixed)
ExecStart=/bin/bash -c 'HOSTNAME=$(hostname) exec /usr/bin/grafana-agent --config.file=/etc/grafana-agent/agent-config.yml -config.expand-env'
I realised you don’t need to set the “core” value, as that only affects core-dump size.
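For reference, the Limit* directives go in the [Service] section; based on the comment above they'd be something like this (a sketch, values are illustrative), with data and nproc being the ones that matter for the OOM problem:

[Service]
# systemd Limit* values are in bytes unless you add a suffix (limits.conf above used KB)
LimitDATA=256M   # cap the agent's data segment at 256 MB; allocations beyond this fail instead of eating the box
LimitNPROC=4     # allow at most 4 processes
# LimitCORE omitted, it only bounds core-dump size
# ExecStart=... as shown above

After editing the unit, run systemctl daemon-reload and restart the agent for the limits to apply.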