Hello, we’re building a large Grafana Enterprise LGTM stack to replace our current solution.
We’re nearing the final phases of the platform, and as more production-replica-level data has come in during testing, one major unresolved head-scratcher is inconsistent or missing logs.
Our current solution collects detail from several hundred OTEL-enabled services, such as VMs and OpenShift (OCP) microservices, and whilst it is a different technology, consistency has always worked out of the box.
However, in our LGTM stack testing so far, in particular with OTEL-enabled microservices hitting Grafana, we’ve regularly observed missing logs. For example, a microservice run generates 20k logs, but we only see 16k in Loki (roughly 20% missing), and we’re somewhat at a loss. Resources seem fine, etc. There are lots of different config settings for this kind of thing, which we think we’ve covered, but about the only thing we can confirm is that a percentage of logs goes missing, at volumes that don’t even approach a tiny fraction of what production will handle.
Of course, this is the type of thing that worries people expecting 1:1 results, so it has a few of us freaking out that the sky is falling. We’re going back and forth checking configs, the Loki setup, etc., but so far we’re generally confused about the how, where and why of the inconsistency.
So I guess my question is: is there any detailed info, best practice or methodology for approaching this problem with Grafana and OpenTelemetry? I’m thinking of verification and checks and balances for this kind of thing.
As a result, I now have audit acceptance criteria to meet, where we’ll need to show and prove 1:1 consistency. My first instinct was to start from the beginning. So far, I’ve quickly spun up a sample application with an OTEL collector and an LGTM stack and run countless tests hitting it with k6 locally. I found many instances of inconsistency, for which the most obvious cause seems to be resource problems at some point, but the area where these problems can arise seems large, especially as it moves beyond my local dev env. The next step is more controlled testing within our prod replica envs, but I’m struggling to find information and best approaches to this problem around the Grafana stack.
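For the audit side, the rough idea I’m working towards is below. It is only a minimal sketch, assuming each test log line carries a sequence number (0 to N-1) that I can read back after the run. The names `reconcile` and `ReconcileReport` are my own placeholders, the sample data is made up, and the step of pulling the stored sequence numbers back out of Loki isn’t shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ReconcileReport:
    sent: int                # number of log lines the test run emitted
    stored_unique: int       # distinct expected sequence numbers found in storage
    missing: list[int]       # sent but never found
    duplicated: list[int]    # found more than once
    unexpected: list[int]    # found, but outside the range that was sent

    @property
    def loss_pct(self) -> float:
        # Percentage of sent lines that never showed up.
        return 100.0 * len(self.missing) / self.sent if self.sent else 0.0


def reconcile(sent_count: int, stored_seqs: list[int]) -> ReconcileReport:
    """Compare sequence numbers 0..sent_count-1 against those read back."""
    counts = Counter(stored_seqs)
    missing = [s for s in range(sent_count) if s not in counts]
    duplicated = sorted(s for s, n in counts.items() if n > 1)
    unexpected = sorted(s for s in counts if s < 0 or s >= sent_count)
    stored_unique = sum(1 for s in counts if 0 <= s < sent_count)
    return ReconcileReport(sent_count, stored_unique, missing, duplicated, unexpected)


if __name__ == "__main__":
    # Made-up data: 10 lines sent, seq 3 and 7 never arrived, seq 5 stored twice.
    report = reconcile(10, [0, 1, 2, 4, 5, 5, 6, 8, 9])
    print(report.missing, report.duplicated, f"{report.loss_pct:.1f}% missing")
```

The point of tracking sequence numbers rather than just totals is that a plain count can hide duplicates that offset drops, and it doesn’t tell you which lines went missing.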
If anyone has faced similar issues, any advice on how best to approach this concern would be much appreciated! Thanks.