How to query a 500MB trace?

Hello,

I have a 48h-long batch job that creates a huge trace. Although I’m not sure such a big trace is a good idea, the trace seems to be correctly stored on Tempo / S3.

However, when I try to visualize the trace (with jaeger-ui or with grafana), tempo consumes nearly 2GB RAM then fails with the following log:

GET /api/traces/0000000000000000015152c328c38ba0 (500) 2.327861491s
Response: \"response larger than the max (478537672 vs 16777216)\"
ws: false;
Accept: application/protobuf;
Accept-Encoding: gzip;
Uber-Trace-Id: 5f81c8deb13785e8:5f81c8deb13785e8:0000000000000000:0;
User-Agent: Go-http-client/1.1;

I’ve had a look at the query frontend and querier config, but I’m not sure how to tune this “max” 16MB value.

That is a fantastically large trace. At Grafana we increase the grpc send limits for similar reasons.

server:
    grpc_server_max_recv_msg_size: 1.572864e+07
    grpc_server_max_send_msg_size: 1.572864e+07

Tempo uses the weaveworks grpc/http server. All config options are available here:

Thanks for your answer @joeelliott !

Yes, that’s a big trace, and I’m not even sure how my browser will behave when it receives it ! :sweat_smile:

I tried to update the server section of my tempo.yml config file, but that did not change my error. Still the same 16MB max value.

The error is logged as coming from “frontend”, so I was looking at what could be a frontend for queries and found the querier’s config:
querier config, line 30
grpc client seems to be initialized with MaxSendMsgSize: 16 << 20.
That seems a lot ! And a round count, not sure if it could be related to my 16777216 limit.

I’m not good at reading go, so I can’t see how to override this value from config. grpc_server_max_send_msg_size seemed a nice candidate, but did not change anything. :man_shrugging:

Anyhow, maybe some more log with more context can help:

prometheus_tempo.1.r5wfv6lrzf0z@admin-node-2    | level=error ts=2021-03-08T16:23:11.533322365Z caller=frontend_processor.go:131 msg="error processing query" err="response larger than the max (478537672 vs 16777216)"
prometheus_tempo.1.r5wfv6lrzf0z@admin-node-2    | level=info ts=2021-03-08T16:23:11.534160445Z caller=frontend.go:63 method=GET traceID=252ed34f3802cab9 url=/api/traces/0000000000000000015152c328c38ba0 duration=3.569090217s status=500
prometheus_tempo.1.r5wfv6lrzf0z@admin-node-2    | level=warn ts=2021-03-08T16:23:11.534258282Z caller=logging.go:71 traceID=252ed34f3802cab9 msg="GET /api/traces/0000000000000000015152c328c38ba0 (500) 3.56934289s Response: \"response larger than the max (478537672 vs 16777216)\" ws: false; Accept: application/protobuf; Accept-Encoding: gzip; Uber-Trace-Id: 252ed34f3802cab9:252ed34f3802cab9:0000000000000000:0; User-Agent: Go-http-client/1.1; "

Ah, good catch:

querier:
    frontend_worker:
        grpc_client_config:
            max_send_msg_size: 1.34217728e+08

here is the max send message size on the querier client. Perhaps that will help?

1 Like

Great ! :+1: This solves the tempo server error.

Tempo and tempo-query require 3GB RAM each to manage the traces, but then it reaches my browser, and Grafana displayed it perfectly. :partying_face:

It seems that tempo-query is the weak link though: this request often makes it lose the connection to tempo, and keeps trying to reconnect unsuccessfully until I restart it.
I guess Remove dependency on Jaeger-Query · Issue #382 · grafana/tempo · GitHub should fix it.

If you don’t mind me asking, how many spans are in that beastly trace?

Sure, here are the trace infos:

  • Trace Start March 3 2021, 11:47:07.206
  • Duration 113845.71s
  • Services 1
  • Depth 3
  • Total Spans 4591

Not that complex, it’s mainly a “get from postgres / process / write to elasticsearch” loop.
I guess the size comes from the big SQL queries, which add 300KB for each jdbc span.

Yes 300KB is an absolutely enormous span. That is 3000 times larger than the average span size at Grafana. Very cool.

Good on Tempo for handling that :).