I left every config value at its default, besides enabling OTLP.
The cluster is running fine, apps are sending data, and I can see log lines with trace IDs in Loki and follow them into Tempo.
But every trace ID is only visible for about an hour; after that I get a 404 from the querier. All log files look fine, no errors. The only thing that correlates is an hourly job in the ingester:
As I said, I am fairly new to Tempo, so I have no idea where to look next. block_retention: 48h gave me the impression I should be able to query trace IDs for 48 hours.
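For reference, this is roughly where that setting sits in a plain Tempo configuration; the surrounding keys are my understanding of the standard layout, not something copied from the chart:

compactor:
  compaction:
    block_retention: 48h   # how long the compactor keeps blocks in the backend before deleting them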
Hi, I’m not familiar with that chart, but let’s see if we can get this fixed. It looks like Tempo is installed in microservices mode, where each component runs as a separate pod, and Tempo’s backend data storage is a local disk. This section:
trace:
  backend: local
  local:
    path: /bitnami/grafana-tempo/data/traces
Is /bitnami/grafana-tempo/data/traces storage shared by all pods, or a per-pod folder? If it is per-pod, I think what is happening is that each pod only sees its own files. When installed in microservices mode, Tempo needs to be pointed at a shared storage backend such as AWS S3, Google Cloud Storage, or Azure Blob Storage. Traces are written to blocks by the ingester, the ingester flushes the blocks to the shared storage, and the querier then reads the blocks from that shared storage.
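To illustrate, pointing Tempo at a shared object store looks roughly like this; it’s only a sketch, and the bucket name, endpoint, and region are placeholders you’d replace with your own:

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces                  # placeholder bucket name
      endpoint: s3.us-east-1.amazonaws.com  # placeholder endpoint
      region: us-east-1                     # placeholder region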
I am facing the same issue. I deployed tempo-distributed. If it’s a shared-volume issue, how does it manage to return traces during the first hour?
Below is the Tempo part of my values.yaml. Chart.yaml specifies the dependency as tempo-distributed 0.21.8.
When we search for traces we will search both the backend and the ingesters. This way we can also find traces that have just been received by Tempo.
After a block has reached a certain size or age, the ingester will write it to the backend (= the flush operation). Once it’s in the backend, the queriers should be able to find it. Because it might take a while for a block to be detected, the ingesters will also hold on to completed blocks for a little bit longer. This way the traces are still searchable, even if the queriers haven’t found this new block yet.
This duration is configurable with complete_block_timeout.
The value of complete_block_timeout is also very high (4 weeks). This means blocks that have already been flushed to the backend will only be removed from the ingester’s local disk after 4 weeks. While this keeps them searchable on the ingester, you will be using a lot of local disk (= expensive) and searching the blocks will not scale.
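For reference, in a plain Tempo configuration the setting sits under the ingester block, roughly like this (15m is just an illustrative value, not your current setting or a recommendation):

ingester:
  complete_block_timeout: 15m   # how long the ingester keeps a completed block locally after it has been flushed to the backend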
We are required to use local as the backend, hence those settings weren’t changed. From this issue 1223, I understand all we can do is increase complete_block_timeout.
Even when complete_block_timeout is set to 4 weeks, why are the blocks getting flushed after just 1 hour? I understand it’s expensive; I can reduce it to maybe 5 days.
Does search.enabled have any impact here? I hope I understood you correctly.
PS: the above configuration had been retaining traces for more than a week ever since May. Only in the last 2 weeks have we been seeing this issue. We used to be on chart version 0.17.2, and even after upgrading to 0.21.8 the retention is only an hour.
When a block is flushed is controlled by two parameters: max_block_bytes and max_block_duration (by default 1G and 1 hour). So once a block reaches that size or age, it will be cut and flushed to the backend regardless. After a block has been flushed it is considered ‘completed’, and complete_block_timeout controls how long this block stays around on the ingester.
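In Tempo’s own config those two flush knobs live next to complete_block_timeout under the ingester block; a rough sketch using the default values mentioned above:

ingester:
  max_block_duration: 1h          # cut and flush the block once it reaches this age
  max_block_bytes: 1073741824     # ...or once it reaches roughly 1 GB, whichever comes first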
What is the version of Tempo you were running before?
Are you searching for traces, or only doing a trace ID lookup? We made some changes to the query-frontend config; in particular, the parameters query_backend_after and query_ingesters_until might be important.
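For reference, a sketch of where those two parameters typically sit; in recent Tempo versions they are under query_frontend.search, though the exact location has moved between versions, and the values below are only illustrative:

query_frontend:
  search:
    query_ingesters_until: 30m   # search the ingesters for data received within this window
    query_backend_after: 15m     # only search the backend for data older than this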
We have a derived field set up in Loki which links to Tempo’s query using the trace ID.
Just noticed that my config might be wrong: complete_block_timeout is directly under ingester, but according to values.yaml it should be under ingester.config.complete_block_timeout. I will test and update whether that was the issue.
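In other words, something along these lines in values.yaml (a sketch only; the nesting follows the chart’s values.yaml as described above, and 15m is just an illustrative value):

ingester:
  config:
    complete_block_timeout: 15m   # per the chart’s values.yaml, this should end up in the ingester section of the Tempo config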
I think that this ingester config is not the proper way to address your problem.
Its main responsibility is to feed data into storage and to serve queries for recently ingested traces. I don’t know if it rebuilds any cache after a restart, but I suspect that it doesn’t; in that case, you’d want your backend search to be configured properly.
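If it helps, in the chart version mentioned above backend search is toggled via the search.enabled value, which, as far as I know, maps to Tempo 1.x’s top-level search_enabled flag in the rendered config; a sketch, assuming that chart behaviour:

search:
  enabled: true   # values.yaml toggle; assumed to render as search_enabled: true in the Tempo 1.x config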