How to query a 500MB trace?

hervenicol · March 5, 2021, 11:24am

Hello,

I have a 48h-long batch job that creates a huge trace. Although I’m not sure such a big trace is a good idea, the trace seems to be stored correctly in Tempo / S3.

However, when I try to visualize the trace (with jaeger-ui or with Grafana), Tempo consumes nearly 2 GB of RAM and then fails with the following log:

GET /api/traces/0000000000000000015152c328c38ba0 (500) 2.327861491s
Response: \"response larger than the max (478537672 vs 16777216)\"
ws: false;
Accept: application/protobuf;
Accept-Encoding: gzip;
Uber-Trace-Id: 5f81c8deb13785e8:5f81c8deb13785e8:0000000000000000:0;
User-Agent: Go-http-client/1.1;

I’ve had a look at the query frontend and querier config, but I’m not sure how to tune this “max” 16MB value.

joeelliott · March 8, 2021, 3:28pm

That is a fantastically large trace. At Grafana we increase the gRPC message size limits for similar reasons:

server:
    grpc_server_max_recv_msg_size: 1.572864e+07
    grpc_server_max_send_msg_size: 1.572864e+07
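The limits above are byte counts written in scientific notation. The minimal Go sketch below uses only the standard library to confirm that 1.572864e+07 is exactly 15 MiB (15 × 2^20 bytes), so the same value could also be written as the plain integer 15728640. `parseSize` is a hypothetical helper for illustration, not part of Tempo.

```go
package main

import (
	"fmt"
	"strconv"
)

// mib is one mebibyte: 1 << 20 = 1048576 bytes.
const mib = 1 << 20

// parseSize converts a size written in scientific notation into an integer
// byte count. It is a hypothetical helper; any fractional part is truncated.
func parseSize(s string) (int64, error) {
	f, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, err
	}
	return int64(f), nil
}

func main() {
	n, err := parseSize("1.572864e+07")
	if err != nil {
		panic(err)
	}
	fmt.Println(n, n == 15*mib) // 15728640 true
}
```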

Tempo uses the weaveworks gRPC/HTTP server. All config options are available here:

github.com/weaveworks/common/blob/master/server/server.go#L46

// Config for a Server
type Config struct {
	MetricsNamespace  string `yaml:"-"`
	HTTPListenAddress string `yaml:"http_listen_address"`
	HTTPListenPort    int    `yaml:"http_listen_port"`
	HTTPConnLimit     int    `yaml:"http_listen_conn_limit"`
	GRPCListenAddress string `yaml:"grpc_listen_address"`
	GRPCListenPort    int    `yaml:"grpc_listen_port"`
	GRPCConnLimit     int    `yaml:"grpc_listen_conn_limit"`

	HTTPTLSConfig node_https.TLSStruct `yaml:"http_tls_config"`
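	GRPCTLSConfig node_https.TLSStruct `yaml:"grpc_tls_config"`
	// ... (excerpt truncated: the remaining fields, including the gRPC message size limits, are in the linked file)
}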

hervenicol · March 8, 2021, 4:44pm

Thanks for your answer, @joeelliott!

Yes, that’s a big trace, and I’m not even sure how my browser will behave when it receives it!

I updated the server section of my tempo.yml config file, but the error did not change: still the same 16MB max value.

The error is logged as coming from “frontend”, so I looked for what could act as a frontend for queries and found the querier’s config:
querier config, line 30
The gRPC client seems to be initialized with MaxSendMsgSize: 16 << 20.
That seems like a lot! It is also a round number, so I wonder whether it is related to my 16777216 limit.
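For reference, the two numbers do line up. In Go, `16 << 20` is 16 × 2^20 = 16777216 bytes, which is exactly the 16 MiB “max” in the error, while the rejected response is roughly 456 MiB. The minimal standard-library sketch below shows that arithmetic. `toMiB` is a hypothetical helper, for illustration only.

```go
package main

import "fmt"

// toMiB converts a byte count to mebibytes (1 MiB = 1 << 20 = 1048576 bytes).
func toMiB(bytes int64) float64 {
	return float64(bytes) / float64(1<<20)
}

func main() {
	// 16 << 20 shifts 16 left by 20 bits, i.e. 16 * 2^20.
	const maxSendMsgSize = 16 << 20
	fmt.Println(maxSendMsgSize == 16777216) // true

	// The two numbers from "response larger than the max (478537672 vs 16777216)".
	fmt.Printf("limit: %.0f MiB, response: %.1f MiB\n", toMiB(16777216), toMiB(478537672))
	// limit: 16 MiB, response: 456.4 MiB
}
```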

I’m not good at reading Go, so I can’t see how to override this value from the config. grpc_server_max_send_msg_size seemed like a good candidate, but it did not change anything.

Anyhow, maybe some more logs with more context can help:

prometheus_tempo.1.r5wfv6lrzf0z@admin-node-2    | level=error ts=2021-03-08T16:23:11.533322365Z caller=frontend_processor.go:131 msg="error processing query" err="response larger than the max (478537672 vs 16777216)"
prometheus_tempo.1.r5wfv6lrzf0z@admin-node-2    | level=info ts=2021-03-08T16:23:11.534160445Z caller=frontend.go:63 method=GET traceID=252ed34f3802cab9 url=/api/traces/0000000000000000015152c328c38ba0 duration=3.569090217s status=500
prometheus_tempo.1.r5wfv6lrzf0z@admin-node-2    | level=warn ts=2021-03-08T16:23:11.534258282Z caller=logging.go:71 traceID=252ed34f3802cab9 msg="GET /api/traces/0000000000000000015152c328c38ba0 (500) 3.56934289s Response: \"response larger than the max (478537672 vs 16777216)\" ws: false; Accept: application/protobuf; Accept-Encoding: gzip; Uber-Trace-Id: 252ed34f3802cab9:252ed34f3802cab9:0000000000000000:0; User-Agent: Go-http-client/1.1; "

joeelliott · March 8, 2021, 5:58pm

Ah, good catch:

querier:
    frontend_worker:
        grpc_client_config:
            max_send_msg_size: 1.34217728e+08

This is the max send message size for the querier’s gRPC client (the one used by its frontend worker). Perhaps that will help?
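Here is some arithmetic on this value, offered as a hedged sketch. 1.34217728e+08 is exactly 128 MiB (128 << 20 bytes), which is still below the 478537672-byte response in the original error. Assuming the querier compares the response size directly against this limit, as the error message suggests, a trace this large would need a higher value. The Go sketch below uses a hypothetical `nextPowerOfTwo` helper (not part of Tempo) to round the observed size up to a power of two, which is one simple way to pick such a value.

```go
package main

import "fmt"

// nextPowerOfTwo returns the smallest power of two that is >= n (for n > 0).
func nextPowerOfTwo(n int64) int64 {
	p := int64(1)
	for p < n {
		p <<= 1
	}
	return p
}

func main() {
	const suggested = 134217728 // 1.34217728e+08 from the config above
	const observed = 478537672  // response size reported in the error

	fmt.Println(suggested == 128<<20)  // true: the suggested value is 128 MiB
	fmt.Println(suggested >= observed) // false: still below this trace's response size

	limit := nextPowerOfTwo(observed)
	fmt.Println(limit, limit>>20) // 536870912 512
}
```

Under that assumption, the result would correspond to `max_send_msg_size: 536870912` (512 MiB) in the `querier.frontend_worker.grpc_client_config` section. Rounding up to a power of two is only a convention that leaves some headroom. Tempo does not require it.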

hervenicol · March 8, 2021, 6:30pm

Great! This solves the Tempo server error.

Tempo and tempo-query each require 3 GB of RAM to handle the trace, but the trace then reaches my browser, and Grafana displays it perfectly.

It seems that tempo-query is the weak link, though: this request often makes it lose its connection to Tempo, and it then keeps trying to reconnect, unsuccessfully, until I restart it.
I guess “Remove dependency on Jaeger-Query” (grafana/tempo issue #382 on GitHub) should fix it.

joeelliott · March 9, 2021, 1:14pm

If you don’t mind me asking, how many spans are in that beastly trace?

hervenicol · March 9, 2021, 1:44pm

Sure, here is the trace info:

Trace Start: March 3 2021, 11:47:07.206
Duration: 113845.71s
Services: 1
Depth: 3
Total Spans: 4591

It’s not that complex: it’s mainly a “get from postgres / process / write to elasticsearch” loop.
I guess the size comes from the big SQL queries, which add 300KB to each JDBC span.
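Here is a rough back-of-the-envelope check. It assumes that the 478537672-byte response from the original error reflects the whole trace, and it reads 300KB as 300,000 bytes. Spread across 4591 spans, that response averages about 104 KB per span. About 1595 spans of 300KB each would account for nearly all of it. The minimal Go sketch below shows the arithmetic. The helper names are hypothetical.

```go
package main

import "fmt"

// avgBytesPerSpan returns the mean size per span, using integer division.
func avgBytesPerSpan(totalBytes, spans int64) int64 {
	return totalBytes / spans
}

// spansOfSize returns how many whole spans of spanBytes each fit into
// totalBytes (integer division, so any remainder is dropped).
func spansOfSize(totalBytes, spanBytes int64) int64 {
	return totalBytes / spanBytes
}

func main() {
	const responseBytes = 478537672 // response size from the original error
	const totalSpans = 4591         // "Total Spans" from the trace info
	const jdbcSpanBytes = 300000    // assumption: 300KB read as 300,000 bytes

	fmt.Println(avgBytesPerSpan(responseBytes, totalSpans)) // 104233
	fmt.Println(spansOfSize(responseBytes, jdbcSpanBytes))  // 1595
}
```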

joeelliott · March 9, 2021, 1:48pm

Yes, 300KB is an absolutely enormous span. That is 3000 times larger than the average span size at Grafana. Very cool.

Good on Tempo for handling that :).

system · March 9, 2022, 1:49pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.
