Grafana Agent Flow Cluster

matt328 · July 15, 2023, 3:12pm

I’m evaluating the LGTM stack and deploying Grafana Agent in flow mode using the official Helm chart. I would like to use a discovery.kubernetes component to discover nodes from which to scrape Prometheus metrics. The chart deploys Grafana Agent as a DaemonSet, so I get one instance on each node in my cluster. My understanding (which could be incorrect) is that each of the 4 agents will scrape metrics from all nodes in the cluster, resulting in duplicated metrics. In the logs I see a large number of out-of-order metrics warnings, along with unusually high CPU usage, even though the LGTM stack is the only thing running. I suspect the cause is 4 agents scraping all nodes and overwhelming Mimir.
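To make the suspected duplication concrete, here is a rough back-of-the-envelope model in Python. It is not agent configuration, and it assumes the worst case I described: every agent scrapes every discovered node and nothing deduplicates the targets. The function name and the counts are illustrative only.

```python
# Rough model of the suspected duplication: every agent scrapes every node.
# Assumption: no clustering or target filtering is in effect.

def scrapes_per_interval(num_agents: int, num_targets: int) -> int:
    """Total scrapes issued per scrape interval when each agent scrapes all targets."""
    return num_agents * num_targets


agents = 4  # one agent per node, because the chart deploys a DaemonSet
nodes = 4   # targets returned by node discovery

total = scrapes_per_interval(agents, nodes)

# Each node is scraped once by every agent, so the same series is
# written by this many independent senders.
writers_per_series = total // nodes

print(f"scrapes per interval: {total}")
print(f"senders writing each series: {writers_per_series}")
```

If that model is right, each series arrives from 4 senders with slightly different timestamps, which would fit the out-of-order warnings I am seeing.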

What is a typical scrape interval? I started at 15s, but I suspect that is far too frequent, and that it may not give each agent a wide enough window to stagger scrapes without stomping on each other when reporting metrics to Mimir. One option I see is to filter out targets on ‘other’ nodes, via either discovery.kubernetes or discovery.relabel, so that each Grafana Agent is responsible only for scraping the node it is running on.
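The filtering idea can be sketched as follows. This is a plain Python model of a "keep" rule, not real agent configuration: the `keep_local` helper, the node names, and the addresses are made up for illustration. The only thing taken from Prometheus-style Kubernetes discovery is the `__meta_kubernetes_node_name` label that node targets carry.

```python
from typing import Dict, List

Target = Dict[str, str]


def keep_local(targets: List[Target], local_node: str) -> List[Target]:
    """Keep only targets whose node-name label matches the agent's own node."""
    return [
        t for t in targets
        if t.get("__meta_kubernetes_node_name") == local_node
    ]


# Hypothetical discovery result: every agent sees all four nodes.
node_names = [f"node-{i}" for i in range(4)]
all_targets: List[Target] = [
    {"__address__": f"{name}:10250", "__meta_kubernetes_node_name": name}
    for name in node_names
]

# Each agent applies the same rule with its own node name.
assignments = {name: keep_local(all_targets, name) for name in node_names}

for agent_node, kept in assignments.items():
    print(agent_node, "->", [t["__address__"] for t in kept])
```

In this model each agent ends up with exactly one target, so the cluster performs 4 scrapes per interval instead of 16. Whether the real components can express the same rule depends on the agent knowing its own node name, which I have not verified.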

Is this a valid approach, or does it defeat the purpose of clustering? Or should I increase the interval to something like the default of 1m, and end up with metrics being scraped every (interval / number of nodes) seconds? Any advice is greatly appreciated.
