Thanks for taking the time to read this.
We recently had a problem where our ingestion endpoints failed, and the Grafana Agents running on our EC2 instances were unable to send metrics and logs.
The agent then ate all the system memory until the OOM killer got invoked and started killing processes; sometimes it was the agent, sometimes other large processes. This caused a right headache, as you can imagine.
Is there a way to prevent this from happening again, perhaps with a configuration setting in the agent? My other idea is to use cgroups, but ideally I want a simple, quick fix of course!
This is what we saw in the logs:
# egrep -i "oom|memory" /var/log/messages
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.326855] grafana-agent invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.474421] oom_kill_process+0x223/0x420
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.478000] out_of_memory+0x102/0x4c0
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300912.776955] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.261818] Out of memory: Kill process 9793 (grafana-agent) score 431 or sacrifice child
Jun 27 22:12:32 ip-10-152-157-140 kernel: [80300913.471967] oom_reaper: reaped process 9793 (grafana-agent), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
This is the process:
# ps -ewfwl | grep grafa
4 S root 1990 1 1 80 0 - 346297 - Jun29 ? 00:23:34 /usr/bin/grafana-agent --config.file /etc/grafana-agent/agent-config.yml -config.expand-env
Despite the bot marking this as resolved, it isn’t.
I’m happy to receive any ideas.
I read the docs on the command-line options and there was nothing obvious I could use to control this behaviour.
I am pondering learning about cgroups to try to mitigate it.
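If it helps, since the agent runs under systemd here, the cgroup route might just be a couple of resource-control directives in a drop-in. A minimal sketch, assuming the unit is named grafana-agent.service and using 512M purely as a placeholder, not a tested value:

# /etc/systemd/system/grafana-agent.service.d/memory.conf
[Service]
# hard cgroup cap: the kernel OOM-kills inside this unit above 512M (cgroup v2; older systems use MemoryLimit=)
MemoryMax=512M
# soft limit: reclaim and throttling kick in before the hard cap
MemoryHigh=384M

Then systemctl daemon-reload and restart the unit. This only contains the damage to the agent's own cgroup rather than stopping it from trying to buffer everything in memory.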
You can try to configure the backoff behaviour, e.g. for logs: Configuration | Grafana Loki documentation
I guess all the data queued between retries is kept in memory, so just drop it (metrics, logs, traces) earlier, e.g. max_period: 30s, max_retries: 3. Of course you will lose the dropped data.
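For the logs part that would look roughly like this in agent-config.yml; a sketch only, with a placeholder push URL, the 30s/3 figures from above, and the layout assuming the agent's static-mode logs config (which follows Promtail's client settings):

logs:
  configs:
    - name: default
      clients:
        - url: https://loki.example.com/loki/api/v1/push
          backoff_config:
            min_period: 500ms   # initial wait between retries
            max_period: 30s     # never wait longer than 30s between retries
            max_retries: 3      # then drop the batch instead of keeping it buffered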
Have you found any solution? I’m having the same issue and I don’t know what to do.
Ideally the grafana-agent process will run as its own user, which makes it easier to limit.
Say the user is “grafana_agent”; then something like this added to /etc/security/limits.conf may work:
grafana_agent hard core 2097152
grafana_agent hard data 268435456
grafana_agent hard stack 131072
grafana_agent hard nproc 4
to limit its core, data, stack, and process count. Note that the core/data/stack values in limits.conf are in KB, so as written these work out to a 2 GB core limit, 256 GB of data, 128 MB of stack, and 4 processes.
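One way to confirm what the kernel is actually enforcing on the running agent (assuming pidof resolves to a single agent PID):

# grep -E 'Max (core|data|stack|processes)' /proc/$(pidof grafana-agent)/limits

Also worth noting that limits.conf is applied by pam_limits at login, so if the agent is started by systemd it's the unit's Limit* directives that normally take effect.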
I’ll update these numbers when I’ve had time to play. I’m adding them to the systemd service file, since the GA is not running as a unique user on my systems.
So far this has been stable.
Here’s our systemd service file:
# limit the agent to 2 MB of core, 256 MB of data, 4 processes and 128 KB of stack
# at least that's what I think these numbers mean (systemd Limit* values are in bytes unless suffixed)
ExecStart=/bin/bash -c 'HOSTNAME=$(hostname) exec /usr/bin/grafana-agent --config.file=/etc/grafana-agent/agent-config.yml -config.expand-env'
I realised you don’t need to set the “core” value, as that only affects core-dump size.
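For reference, the Limit* directives go in the [Service] section; based on the comment above they'd be something like this (a sketch, values are illustrative), with data and nproc being the ones that matter for the OOM problem:

[Service]
# systemd Limit* values are in bytes unless you add a suffix (limits.conf above used KB)
LimitDATA=256M   # cap the agent's data segment at 256 MB; allocations beyond this fail instead of eating the box
LimitNPROC=4     # allow at most 4 processes
# LimitCORE omitted, it only bounds core-dump size
# ExecStart=... as shown above

After editing the unit, run systemctl daemon-reload and restart the agent for the limits to apply.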