Hello all,
We are using the OpenTelemetry Collector to send traces to Grafana Cloud. We have deployed the collector as a Deployment in a Kubernetes cluster.
[opentelemetry-collector 0.90.1 · opentelemetry/opentelemetry-helm]
We have noticed that some traces are taking very long to get ingested into Grafana Cloud, e.g. 2h, 6h, 8h, 22h… We have found this behaviour in about 20% of traces.
We have given the collector generous resources, the environment has autoscaling, and we have found no errors/warnings in the collector over the past days related to ingestion or exporting. What could be the reason behind this latency, with traces taking so long to ingest?
Does anyone have an idea about this, or has anyone faced a similar kind of issue? I can share my collector config if needed as well.
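For context, the pipeline looks roughly like this. It is only a trimmed-down sketch, not the exact config; the endpoint, auth extension, and credentials are placeholders:

receivers:
  otlp:
    protocols:
      grpc:
      http:

extensions:
  basicauth/grafana:
    client_auth:
      username: "<instance-id>"         # placeholder
      password: "<grafana-cloud-token>" # placeholder

exporters:
  otlphttp:
    endpoint: https://otlp-gateway-<region>.grafana.net/otlp  # placeholder region
    auth:
      authenticator: basicauth/grafana

service:
  extensions: [basicauth/grafana]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]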
How do you measure it? Do you have any trace ingestion errors reported in the trace ingestion panel of your Grafana Cloud Billing dashboard? What do the collector's exporter metrics look like?
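For example, if the collector's internal metrics are scraped by Prometheus, queries like these show whether the exporter is keeping up (exact metric names can vary between collector versions):

# spans successfully sent vs. failed by the exporter
rate(otelcol_exporter_sent_spans[5m])
rate(otelcol_exporter_send_failed_spans[5m])

# sending-queue size; a queue that keeps growing means backpressure or retries
otelcol_exporter_queue_size
otelcol_exporter_queue_capacity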
@jangaraj
I measured the duration with this TraceQL query: {traceDuration>1h}
Regarding your question about errors reported in the trace ingestion panel, I have attached a screenshot.
Do you know the possible reasons why my traces are taking so long to ingest into Grafana Cloud? Also, I don't fully understand your question about the collector exporter, but there have been no errors in the collector in the past 30 days related to ingestion/exporting.
That’s the trace duration, not the duration of ingestion. You have some long-running operation, so the trace is closed and sent when the operation is finished.
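One way to check this yourself in TraceQL (using the traceDuration and duration intrinsics) is to compare traces whose total duration is long with traces that actually contain a long individual span.

Traces whose end-to-end duration exceeds 1h (what you queried):
{ traceDuration > 1h }

Traces containing at least one span that itself ran longer than 1h:
{ duration > 1h }

If the first query matches a trace but the second doesn't, the trace is long because of gaps between span start times (or clock skew), not because a single operation ran for hours.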
@jangaraj
I was thinking the same way, that this is the duration of the API call completing the whole operation, until the Grafana Cloud support team told me that this is the ingestion duration of that trace.
Can you describe what the duration means here? Can I say that API x took this much time to complete successfully? Right now I am confused about what this parameter means.
@jangaraj
So, according to my understanding from this document, when we say trace duration, that means the time taken to complete that whole API call. Right?
And any idea how to identify which long-running processes are slowing down ingestion?
Please show your full trace, not just the duration.
Here is the full trace:
You can see the time 11h 52 minutes. What does that time mean here?
Again, that doesn't prove that trace ingestion took 12 hours. It is the duration of your app's operations (there are Redis, fs, … operations); they took 12 hours.
So there is nothing wrong with Grafana Cloud, but with your app (so it is out of scope for this forum).
@jangaraj
I had a similar understanding of the duration, but someone from the Grafana support team said that it denotes the ingestion time and not the time of the application operations.
I think I should ask them again about this and tell them that maybe they misunderstood.
you can see the time 11h 52 minutes
Yes, that 11h52m timespan is the min/max of the span times. It's hard to tell from the screenshot, but if you click each span it will show its start time relative to the start of the trace, and we can find the spans that are causing the long run time. Based on the durations printed on the right side, the timeline seems to have a gap in the middle, like:
xxxx-care-prod (duration 48ms)
---> 11h gap
---> remaining spans (duration < 1s)
Tempo and Grafana Cloud don't alter the timestamps of spans, so this information should be accurate to what was uploaded.
Edit: Meant to add that span start times are uploaded as Unix nanoseconds, so they are absolute and not affected by the time of ingestion. But it could be explained if the system clocks between hosts don't match, where Host 2 is outputting spans with timestamps 11h behind Host 1. Would be worth double-checking.
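A quick way to check for that is a sketch like the one below, assuming the trace has been exported as OTLP JSON (e.g. via a file exporter; the filename is a placeholder and field names may differ if pulled from another API). It prints the earliest span start time per service; an ~11h offset for one service would point to clock skew rather than a genuinely long operation.

import json
from collections import defaultdict

# Sketch: walk an OTLP-JSON trace (resourceSpans -> scopeSpans -> spans)
# and print the earliest span start per service.name to spot clock skew.
with open("trace.json") as f:  # hypothetical export of the trace in question
    trace = json.load(f)

earliest = defaultdict(lambda: float("inf"))
for rs in trace.get("resourceSpans", []):
    attrs = {a["key"]: a["value"].get("stringValue")
             for a in rs.get("resource", {}).get("attributes", [])}
    service = attrs.get("service.name") or "unknown"
    for ss in rs.get("scopeSpans", []):
        for span in ss.get("spans", []):
            start_ns = int(span["startTimeUnixNano"])
            earliest[service] = min(earliest[service], start_ns)

base = min(earliest.values())
for service, start_ns in sorted(earliest.items(), key=lambda kv: kv[1]):
    print(f"{service}: +{(start_ns - base) / 3.6e12:.2f} h after the earliest span")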