I’m evaluating the LGTM stack, and I am deploying Grafana Agent in flow mode using the official Helm chart. I would like to use a discovery.kubernetes component to discover nodes to scrape Prometheus metrics from. Grafana Agent is deployed as a DaemonSet, so I get one instance on each of the 4 nodes in my cluster. From my understanding (which could be incorrect), each of those 4 agents will scrape metrics from all nodes in the cluster, resulting in duplicated metrics. What I’m seeing in the logs is tons of out-of-order sample warnings, along with unusually high CPU usage even though the LGTM stack is the only thing running. I suspect the cause is 4 agents each scraping every node and overwhelming Mimir.
What I am wondering is: what is a typical scrape interval? I started at 15s, but I suspect that is far too frequent, and that it doesn’t give each agent a wide enough window to stagger its scrapes and avoid stomping on the others when writing metrics to Mimir. One thing I think I could do, either via discovery.kubernetes selectors or a discovery.relabel rule, is filter out targets that are ‘other’ nodes, so that each agent is only responsible for scraping the node it is running on.
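For reference, the per-node filtering I have in mind would look roughly like this (a sketch in River config; it assumes the node name is exposed to each pod in a `NODE_NAME` environment variable via the Downward API, which I’d have to wire up in the chart values):

```river
discovery.kubernetes "nodes" {
  role = "node"
}

// Keep only the node this agent instance is running on.
// Assumes NODE_NAME is injected via the Downward API (spec.nodeName).
discovery.relabel "local_node" {
  targets = discovery.kubernetes.nodes.targets

  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    regex         = env("NODE_NAME")
    action        = "keep"
  }
}

prometheus.scrape "local_node" {
  targets    = discovery.relabel.local_node.output
  forward_to = [prometheus.remote_write.mimir.receiver]
}
```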
Is this a valid approach, or does it kind of defeat the purpose of clustering? Or should I increase the interval to something like the 1m default, so that with staggered agents each node effectively gets a sample every (interval / number of agents) seconds? Any advice here is greatly appreciated.
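For comparison, the clustering approach I’m asking about would, if I understand the docs correctly, look something like this (a sketch; it assumes clustering is turned on in the Helm chart so the agents form a cluster and divide the discovered targets among themselves):

```river
discovery.kubernetes "nodes" {
  role = "node"
}

prometheus.scrape "nodes" {
  targets    = discovery.kubernetes.nodes.targets
  forward_to = [prometheus.remote_write.mimir.receiver]

  // With agent clustering enabled, each agent takes ownership of a
  // subset of the discovered targets instead of scraping all of them.
  clustering {
    enabled = true
  }
}
```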