Enterprise LGTM Stack - Missing Logs

interrobang · November 26, 2024, 4:33am

Hello, we're building a large Grafana Enterprise LGTM stack to replace our current solution.

We're nearing the final phases of the platform build. The major unresolved headscratcher in testing so far, now that more production-replica-level data is coming in, is inconsistent or missing logs.

Our current solution collects detail from several hundred OTEL-enabled services, such as VMs and OpenShift (OCP) microservices, and although the technology is different, consistency has always worked out of the box.

However, in our LGTM stack testing so far, in particular with OTEL-enabled microservices sending to Grafana, we've regularly observed missing logs. For example, a microservice run generates 20k logs …but we only see 16k in Loki, and we're somewhat at a loss. Resources seem fine. There are lots of different config settings for this kind of thing that we think we've covered, but about the only thing we can confirm is that a percentage goes missing, at volumes not even approaching a tiny fraction of what production will handle.

Of course, this is the type of thing that worries people expecting 1:1 results, and a few of us are freaking out that the sky is falling. We're going back and forth checking configs, the Loki setup, etc., but so far we're still confused about the how/where/why of the inconsistency.

So I guess my question is: is there any detailed info, best practice, or methodology for approaching this problem with Grafana and OpenTelemetry? Verification and checks and balances for this kind of thing.

As a result, I now have audit acceptance criteria to meet where we'll need to show and prove 1:1 consistency. My first instinct was to start from the beginning. So far, I've quickly spun up a sample application with an OTEL collector and LGTM stack and run countless tests hitting it with K6 locally. I found many instances of inconsistency, for which the most obvious cause seems to be resource problems at some point, but the area where these problems can arise seems large, especially as it moves beyond my local dev env. The next step is more controlled testing within our prod replica envs, but I'm struggling to find information and best approaches to this problem around the Grafana stack.
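One way to make the 1:1 check concrete, as a minimal sketch rather than a finished audit tool: have the test generator stamp every log line with a run ID and a sequence number, then compare what was sent against what comes back out of Loki. The `run=... seq=...` marker format and the `reconcile` helper are assumptions made up for this illustration; fetching the lines from Loki (for example through its HTTP query API) is left out.

```python
import re
from collections import Counter

# Hypothetical marker that the test generator embeds in every log line,
# e.g. "run=abc123 seq=42 some message". Not a Loki or OTel convention.
MARKER = re.compile(r"run=(?P<run>\S+) seq=(?P<seq>\d+)")


def reconcile(run_id, sent_count, received_lines):
    """Compare sequence numbers 0..sent_count-1 with the lines read back."""
    seen = Counter()
    for line in received_lines:
        m = MARKER.search(line)
        if m and m.group("run") == run_id:
            seen[int(m.group("seq"))] += 1

    # Sequence numbers that never arrived, and ones that arrived more than once.
    missing = [s for s in range(sent_count) if s not in seen]
    duplicated = sorted(s for s, n in seen.items() if n > 1)
    return {
        "sent": sent_count,
        "received_unique": sent_count - len(missing),
        "missing": missing,
        "duplicated": duplicated,
    }
```

Knowing exactly which sequence numbers are missing (a contiguous burst versus scattered lines) should help narrow down whether the loss looks like a queue overflow, a rate limit, or something else, instead of only seeing 16k versus 20k.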

Any advice from anyone who has faced similar issues, or on how best to approach this concern, would be much appreciated! Thanks.

jangaraj · November 26, 2024, 4:43am

Use the Billing/Usage dashboard and check Discarded Log Samples. It can report issues such as rate limiting, line too long, stream limit, or too far behind. That may give you some idea of whether you are hitting Grafana Cloud limits.

OTel Collector logs and metrics can also provide insights.
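As a rough sketch of what to look for in those metrics: both the collector and a self-hosted Loki expose Prometheus-format metrics, and a few counters tend to point at where records are dropped. The counter names below are the ones I would expect (collector builds differ on whether a `_total` suffix is appended, so verify against your own `/metrics` output), and `drop_totals` is a hypothetical helper that only parses text you have already scraped.

```python
import re
from collections import defaultdict

# Counters that usually indicate dropped or rejected log records.
# Assumed names; confirm them against your collector and Loki versions.
DROP_METRICS = (
    "otelcol_receiver_refused_log_records",
    "otelcol_exporter_send_failed_log_records",
    "otelcol_exporter_enqueue_failed_log_records",
    "loki_discarded_samples_total",
)

# One sample line of the Prometheus text format: name{labels} value
SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)"
)
REASON = re.compile(r'reason="([^"]*)"')


def drop_totals(metrics_text):
    """Sum non-zero drop counters, keyed by (metric name, reason label)."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        m = SAMPLE.match(line.strip())
        if not m:
            continue  # comments (# HELP / # TYPE) and blank lines
        name = m.group("name")
        if name not in DROP_METRICS and name.removesuffix("_total") not in DROP_METRICS:
            continue
        reason = REASON.search(m.group("labels") or "")
        key = (name, reason.group(1) if reason else "")
        totals[key] += float(m.group("value"))
    return {k: v for k, v in totals.items() if v > 0}
```

If the Loki counter is non-zero, its `reason` label should line up with the categories mentioned above (rate limiting, line too long, stream limit, too far behind). If only the collector counters move, the loss is probably happening before Loki ever sees the data.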

interrobang · November 26, 2024, 4:56am

I'm assuming the Billing/Usage dashboard is a Grafana Cloud thing? We built our own on enterprise licensing; is there anything equivalent?

jangaraj · November 26, 2024, 5:00am

If you have an on-prem enterprise license, then you also have enterprise support. Why don't you ask them?
