I have a 48h-long batch job that creates a huge trace. Although I’m not sure such a big trace is a good idea, the trace seems to be correctly stored on Tempo / S3.
However, when I try to visualize the trace (with jaeger-ui or with grafana), tempo consumes nearly 2GB RAM then fails with the following log:
GET /api/traces/0000000000000000015152c328c38ba0 (500) 2.327861491s
Response: \"response larger than the max (478537672 vs 16777216)\"
ws: false;
Accept: application/protobuf;
Accept-Encoding: gzip;
Uber-Trace-Id: 5f81c8deb13785e8:5f81c8deb13785e8:0000000000000000:0;
User-Agent: Go-http-client/1.1;
I’ve had a look at the query frontend and querier config, but I’m not sure how to tune this “max” 16MB value.
Yes, that’s a big trace, and I’m not even sure how my browser will behave when it receives it !
I tried to update the server section of my tempo.yml config file, but that did not change my error. Still the same 16MB max value.
The error is logged as coming from “frontend”, so I was looking at what could be a frontend for queries and found the querier’s config: querier config, line 30
grpc client seems to be initialized with MaxSendMsgSize: 16 << 20.
That seems a lot ! And a round count, not sure if it could be related to my 16777216 limit.
I’m not good at reading go, so I can’t see how to override this value from config. grpc_server_max_send_msg_size seemed a nice candidate, but did not change anything.
Anyhow, maybe some more log with more context can help:
prometheus_tempo.1.r5wfv6lrzf0z@admin-node-2 | level=error ts=2021-03-08T16:23:11.533322365Z caller=frontend_processor.go:131 msg="error processing query" err="response larger than the max (478537672 vs 16777216)"
prometheus_tempo.1.r5wfv6lrzf0z@admin-node-2 | level=info ts=2021-03-08T16:23:11.534160445Z caller=frontend.go:63 method=GET traceID=252ed34f3802cab9 url=/api/traces/0000000000000000015152c328c38ba0 duration=3.569090217s status=500
prometheus_tempo.1.r5wfv6lrzf0z@admin-node-2 | level=warn ts=2021-03-08T16:23:11.534258282Z caller=logging.go:71 traceID=252ed34f3802cab9 msg="GET /api/traces/0000000000000000015152c328c38ba0 (500) 3.56934289s Response: \"response larger than the max (478537672 vs 16777216)\" ws: false; Accept: application/protobuf; Accept-Encoding: gzip; Uber-Trace-Id: 252ed34f3802cab9:252ed34f3802cab9:0000000000000000:0; User-Agent: Go-http-client/1.1; "
Not that complex, it’s mainly a “get from postgres / process / write to elasticsearch” loop.
I guess the size comes from the big SQL queries, which add 300KB for each jdbc span.