We are getting a DatasourceError roughly once a day but are failing to find the root cause.
Our thinking is that this is due to a short network outage or a timeout.
The problem is that we don’t find any trace of it in the logs, even when we run Grafana with debug-level logging.
Is there any other way to zoom in on the root cause? Do we also need to configure Prometheus in a particular way, or increase its logging?
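For reference, this is roughly how we have debug logging enabled today on the Grafana side (a sketch of the relevant grafana.ini section); on the Prometheus side, the closest equivalent we know of would be starting the server with --log.level=debug:

```ini
; grafana.ini - what we currently run with
[log]
; Global log level; valid values include debug, info, warn, error (default is info).
level = debug
```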
You might also find Grafana’s own metrics helpful: they show datasource response times and HTTP errors (if any). Also, if you don’t want to receive DatasourceError alerts, you might want to check the other options under the Configure no data and error handling toggle on the alert creation screen. By default it is set to No Data / Error, so a single no-data or error result fires the alert. You might want to change it to OK / Alerting / Keep Last State.
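If it helps, Grafana exposes its own internal metrics at the /metrics endpoint by default, so one way to get at those numbers is to scrape it with Prometheus and graph the datasource-related series from there. A minimal sketch, assuming Grafana is reachable at grafana:3000 (adjust the target to your deployment):

```yaml
# prometheus.yml - scrape job for Grafana's built-in metrics endpoint
scrape_configs:
  - job_name: grafana
    metrics_path: /metrics          # Grafana's default path for internal metrics
    static_configs:
      - targets: ['grafana:3000']   # placeholder host:port
```

The exact names of the datasource request series vary between Grafana versions, so it is worth browsing the raw /metrics output of your instance to see what is available.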
| HTTP errors (if any)
I saw that you can get a nice visualisation on the alerting timeline, but does this show any more information than it can get from the logs? I suspect not, right? Loki ingests these logs, so it doesn’t add information AFAIK.
| Also, if you don’t want to receive DatasourceError alerts, you might want to check the other options under the Configure no data and error handling toggle on the alert creation screen
We are doing that at the moment, but this masks connection problems, so it is only a temporary solution.
I don’t know about the visualization; I don’t have Loki installed, so I haven’t seen it. Grafana’s metrics can be helpful, as they show query times etc., but I’m not sure they appear in the timeline. We use those metrics to monitor our datasource (you can also check whether your datasource exposes metrics of its own).
As for the connection problems being masked: everyone I’ve ever worked with (my professors in college and colleagues at work) has said that connectivity issues are a certainty in distributed systems. If they are really frequent (I think I missed whether they are or not), I’d turn to the metrics exposed by Grafana and by your datasource to confirm or rule out the network as the cause. Also, how many queries are there? Maybe your datasource is not powerful enough (more CPU and memory)?
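To give a rough idea of what I mean by checking the datasource’s own metrics and load, here is a sketch of the kind of queries we use. It assumes the datasource is Prometheus and that it is scraped under job="prometheus"; the handler and job label values are assumptions that depend on your setup:

```promql
# Query request rate seen by Prometheus itself, split by HTTP status code
sum by (code) (rate(prometheus_http_requests_total{handler="/api/v1/query"}[5m]))

# Resource usage of the Prometheus process, to spot CPU or memory pressure
rate(process_cpu_seconds_total{job="prometheus"}[5m])
process_resident_memory_bytes{job="prometheus"}
```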
| Grafana’s metrics can be helpful, as they show query times etc.
That is what I gather from the link you sent: it has a nice visualisation but doesn’t provide extra information about the root cause.
| As for the connection problems being masked: everyone I’ve ever worked with
True, you have to prepare for failure. We can do this with retries or increased timeouts, but what I am asking here is how to enable logging around the issue so we know what is causing it.
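For concreteness, the timeout knob we have in mind is the data proxy timeout in grafana.ini; a sketch, assuming the datasource is queried through Grafana’s proxy (the value shown is the documented default):

```ini
[dataproxy]
; How long Grafana waits for a datasource response before timing out, in seconds.
; Raising it only helps if the outage is shorter than the timeout itself.
timeout = 30
```

But that is a mitigation, not a diagnosis, which is why we are after better logging.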
| If they are really frequent
It happens about once a day. The problem is that when it occurs at night, the on-call engineers are woken up…
We found the log line that reports the network failure:
logger=ngalert.scheduler rule_uid=fdwvfo9oxnvgge org_id=1 t=2024-10-03T09:02:24.03889848Z level=error msg="Failed to evaluate rule" version=10 fingerprint=0e0f94072e47ac8e attempt=1 now=2024-10-03T09:02:20Z error="the result-set has errors that can be retried: [sse.dataQueryError] failed to execute query [A]: ReadObject: expect { or , or } or n, but found S, error found in #1 byte of …|Service Una|…, bigger context …|Service Unavailable|…"
The network is down for about 15-20 seconds.
The max_attempts setting is 1 by default, and I have bumped it to 40 to survive these blips.
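For reference, the change amounts to this in grafana.ini (a sketch; with unified alerting the setting lives in the [unified_alerting] section):

```ini
[unified_alerting]
; Maximum number of times a rule evaluation is attempted before it is reported as failed.
; Bumped from 1 (the default in our version) to 40 so that 15-20 second network blips
; are retried through instead of firing a DatasourceError alert.
max_attempts = 40
```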
One thing to note is that a retry happens every second, whereas I would have expected an (exponential) backoff strategy.