[Caddy v2 | Traefik | nginx] proxying to Grafana

Hi there,

so I’m trying to set up a reverse proxy for Grafana (7.3.4) and I’m epically failing to do so! I’ve tried Caddy, nginx and Traefik on the frontend and every possible combination of Grafana deployment on the backend (plain Docker, HashiCorp Nomad-orchestrated Docker, barebones tarball) to no avail. We’re talking no subpath routing here. Plain top-level URL proxying that we all did since, well, forever. Following the Grafana reverse proxy nginx guide, it seems easy enough…

But after messing around with the grafana.ini and/or GF_* vars and such, as well as buffer timeouts and sizes on the various proxies, I’m out of ideas!

At this point I don’t even think this is a grafana.ini plus proxy specific combination type issue because all permutations kinda end up showing the same result.

So the constructed paths seem to be okay, which leads me to believe that the root_url and domain values in grafana.ini are set up correctly. The issue is that the two files, app.hash.js and vendor.app.hash.js, are only partially loaded.

Caddy is logging “connection reset by peer”, “broken pipe” or “context cancelled” every time one tries to refresh the page with cache disabled.

The content lengths reported in the headers of these two files are very different from the actual size the browser manages to load. Now if I enable caching and refresh often enough, eventually all 3MB and 6MB respectively of JS files will be loaded and Grafana will greet me with the login screen.
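For anyone wanting to put numbers on this mismatch outside the browser, here is a minimal Python sketch (standard library only) that downloads a URL without caching or compression and compares the Content-Length header with the bytes that actually arrive. The function name measure and the command-line handling are made up for illustration; the URLs to pass in are the full addresses of the two JS bundles as shown in the browser's network tab.

```python
import http.client
import sys
import urllib.request

# Opener that ignores HTTP(S)_PROXY environment variables, so the request
# goes straight to the host named in the URL.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def measure(url, timeout=30.0, chunk_size=64 * 1024):
    """Return (declared, received) byte counts for one GET request."""
    request = urllib.request.Request(
        url,
        headers={"Accept-Encoding": "identity", "Cache-Control": "no-cache"},
    )
    received = 0
    with _OPENER.open(request, timeout=timeout) as response:
        header = response.headers.get("Content-Length")
        declared = int(header) if header is not None else None
        try:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:  # end of body, or the peer closed early
                    break
                received += len(chunk)
        except http.client.IncompleteRead as exc:
            received += len(exc.partial)
        except ConnectionError:
            pass  # reset mid-transfer; keep what was counted so far
    return declared, received


if __name__ == "__main__":
    for url in sys.argv[1:]:
        declared, received = measure(url)
        if declared is None:
            status = "no Content-Length header"
        elif declared == received:
            status = "complete"
        else:
            status = "TRUNCATED"
        print(f"{url}: declared={declared} received={received} {status}")
```

Running it once on the proxy host itself and once from the workstation may help narrow down on which leg the body gets cut off.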

Well here’s the million dollar question then! What am I missing? Who is the culprit? Is this a bug in Grafana 7.3.4? Do I need to beef up proxy_buffers? Do I need more sleep?

Needless to say that I’m grateful for any hints on this. Not only will I include you in my prayers at night, I will nominate you for the “Kelsey-Hightower-extra-dope-person-award”.

Cheers
Ralph

Even Kelsey Hightower won’t be able to help you, because you didn’t provide a reproducible example.
Guessing your root cause from the millions of variables that could be wrong would be a lottery.

Fair enough @jangaraj!

Here’s my Caddy config:

http://monitor.topleveldomain.com {
  reverse_proxy {
    to http://172.18.131.60:3000
    flush_interval -1
    transport http {
      keepalive 310s
      compression off
    }
  }
}

#################################### Server ####################################
[server]
# Protocol (http, https, h2, socket)
protocol = http

# The ip address to bind to, empty will bind to all interfaces
http_addr =

# The http port  to use
http_port = 3000

# The public facing domain name used to access grafana from a browser
domain = monitor.topleveldomain.com

# Redirect to correct domain if host header does not match domain
# Prevents DNS rebinding attacks
;enforce_domain = true

# The full public facing url you use in browser, used for redirects and emails
# If you use reverse proxy and sub path specify full url (with sub path)
;root_url = %(protocol)s://%(domain)s:%(http_port)s/
root_url = http://monitor.topleveldomain.com/

# Serve Grafana from subpath specified in `root_url` setting. By default it is set to `false` for compatibility reasons.
serve_from_sub_path = false

# Log web requests
router_logging = true

# the path relative working path
;static_root_path = public

# enable gzip
enable_gzip = false

I did omit the unchanged parts from the Caddyfile and grafana.ini…

Here’s a FF screenshot showing that these two files in question are only partially loaded:

Those are all 200 responses btw.

This is how it looks if the files are loaded from the cache:

Again, everything is 200!

Here’s Caddy:

{"level":"error","ts":1606824379.7012222,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"write tcp 172.18.131.60:80->172.21.194.246:51389: write: connection reset by peer"}
{"level":"error","ts":1606824473.845335,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"write tcp 172.18.131.60:80->172.21.194.246:51395: write: connection reset by peer"}
{"level":"error","ts":1606824474.3236578,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"write tcp 172.18.131.60:80->172.21.194.246:51407: write: broken pipe"}
{"level":"error","ts":1606824520.4760795,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"write tcp 172.18.131.60:80->172.21.194.246:51394: write: connection reset by peer"}
{"level":"error","ts":1606824520.9819324,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"write tcp 172.18.131.60:80->172.21.194.246:51419: write: broken pipe"}
{"level":"error","ts":1606824552.8041818,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"write tcp 172.18.131.60:80->172.21.194.246:51385: write: connection reset by peer"}
{"level":"error","ts":1606824552.8913424,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"write tcp 172.18.131.60:80->172.21.194.246:51424: write: broken pipe"}
{"level":"error","ts":1606824577.9811473,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"write tcp 172.18.131.60:80->172.21.194.246:51428: write: connection reset by peer"}
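One detail worth pulling out of these lines: as far as I know, Go prints socket errors as "write tcp <local>-><remote>", so 172.18.131.60:80 would be Caddy's own listener and 172.21.194.246 the browser side. If that reading is right, the failing write is on the Caddy-to-client leg rather than the Caddy-to-Grafana one. A rough Python sketch to tally such errors per remote host and reason (the function name summarize and the regular expression are my own, not anything Caddy ships):

```python
import json
import re
import sys
from collections import Counter

# Assumed shape of the error string: "write tcp <local>-><remote>: write: <reason>"
WRITE_ERROR = re.compile(
    r"write tcp (?P<local>\S+?)->(?P<remote>\S+?): write: (?P<reason>.+)"
)


def summarize(lines):
    """Count aborted responses by (local address, remote host, reason)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if entry.get("msg") != "aborting with incomplete response":
            continue
        match = WRITE_ERROR.search(entry.get("error", ""))
        if match:
            # Drop the ephemeral client port so requests group by host.
            remote_host = match["remote"].rsplit(":", 1)[0]
            counts[(match["local"], remote_host, match["reason"])] += 1
    return counts


if __name__ == "__main__":
    # Usage: pass one or more Caddy log files as arguments.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for key, count in summarize(handle).most_common():
                print(count, *key)
```

If every entry shows the same remote host, that points at the path between Caddy and that client more than at Grafana.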

Why do you need a reverse proxy?
Why have you disabled compression?
Why can’t we see Caddy/Grafana logs at debug level?
What are the resource limitations?
What is the infrastructure setup, especially the networking?

Look man, I’m really grateful that you took the time to look into this. I really am. But respectfully, I’m not sure “why do I need a reverse proxy” is the question to ask here… But let me explain!

For starters, it’s the situation I found myself in doing my job. Secondly, it’s quite a common pattern. I have been fiddling around with this for roughly a week now, trying every combination I could come up with. I asked a colleague of mine to co-debug this with me. We skimmed through the usual resources (SO, docs, issues, changelogs) with little luck. I even found open issues on the tracker suggesting to “make reverse proxying simpler”. I found configurations that went quite a stretch farther than the proxy_pass http://grafana.staged-by-discourse.com/; that’s suggested in the docs. After you try the obvious stuff, one gets into the nitty-gritty: compression, trailing slashes, buffer sizes, request body sizes etc… That is for instance why I left compression off in the Caddy config… Still the same outcome.

Bottom line is, I can’t shake the impression that maybe there’s more to it. Maybe this specific version has a bug (there are plenty of closed issues around reverse proxy stuff). Given that the constructed URLs seem to be fine since I can reach them, and given that the JS files are big-ish, I thought that maybe body size or timeouts may have a hand in all of this. Again, I might be wrong. Since you don’t seem to completely disagree with my grafana.ini, I take it that I didn’t make the obvious config mistakes. I might be wrong here too…

With regard to the other questions… I even put the reverse proxy and Grafana on the same server. So basically this is localhost communication. It’s one off-the-shelf EC2 instance (t3.medium if I recall correctly).

In closing, again, let me thank you for your time and rephrase my question. Are there any obvious mistakes I made in the grafana.ini? Has anybody a working reverse proxy / Grafana 7.3.4 config that he/she would be able to share? Thanks a ton.

Cheers
Ralph

IMHO you don’t need a reverse proxy. I don’t see the benefit in your setup. If it works without the proxy, then it is clear that the proxy is the problem and not the Grafana configuration.

Apparently the connection from Grafana to Caddy was terminated. That may have many reasons, e.g. Grafana does not have enough memory to process the request (my question about resource limitations).

You are not using localhost in the reverse proxy config, so it may not really be local communication. In theory it may go through a WAF or a Docker network, or it may be hitting some TCP timeout. You also mentioned Kelsey, who is doing far more complicated networking (my question about networking).

Yes, the JS files are big, so compression is usually enabled so that they download faster. Maybe they are so huge and the network is so slow that some TCP timeout is reached (my question about why compression is disabled).

Anyway, it is still a lottery (still no Grafana logs provided). Instead of debugging the current setup, you will be better off setting up a new machine from scratch and following the official tutorial https://grafana.com/tutorials/run-grafana-behind-a-proxy/#2 on it.

I think it is beyond the scope of this thread to debate the usefulness of ingress gateways or reverse proxies. We’ll probably have to agree to disagree on this.

Since Grafana is basically JS and Go and I successfully ran it on devices as small as a Raspberry Pi in various combinations, I seriously doubt that a corporate network and AWS t3.medium instances are the bottleneck here… Furthermore, I went by the guide as I stated in my original post. And I started from scratch a couple of times.

It’s obviously difficult to transport fun, sarcasm or feelings in general on the interwebz which is why you probably failed to deduce my intention to close my post with a funny Kelsey-remark.

Returning back to the topic at hand and with some more testing I’m quite certain that the reverse proxy as well as the Grafana configs are okay.

What I will acknowledge though, @jangaraj, is that some entity between the corporate network and AWS is messing with the request. That may be a WAF, an IPsec tunnel, or whatever… It is beyond me how this never manifested itself before in other scenarios, but at this point it seems very likely.

With all that said, it seems not to be a Grafana problem per se. I’ll investigate further and update the post accordingly with what I’ve learned.

Cheers