Grafana Loki sanitizing of dashboard variables

Since updating Grafana to version 11.x (currently 11.1.0) and Loki to 3.x (currently 3.1.0), I have had some strange issues with dashboard variables. While creating a new dashboard, adding some variables, and using them in the query, the Loki process starts to use far more resources than usual and may even crash and restart. The Grafana process only needs slightly more resources at the same time. I did not pay much attention to it until it recently left me with a corrupt Loki DB.

Unfortunately I have not yet found a reliable way to reproduce this behavior, but I will update as soon as I know more. It only happens during initial dashboard creation; once the dashboard is up and running, I have never experienced any further trouble. So I assume the problem is a variable whose default value is not set yet and is therefore null, but that is just an assumption.

The query for the dashboards looks something like this:

{source=~"some_source_.*", host=~"some-host.*"} |= `$var1` |= `$var2` |= `$var3`

Would it be better to use " instead of `?

{source=~"some_source_.*", host=~"some-host.*"} |= "$var1" |= "$var2" |= "$var3"

Or is there a sanitize option for Grafana when querying Loki that I am not aware of? Or are there any known changes in Grafana 11.x that could affect input sanitizing, so that rolling back to 10.x would be better (at least for a production environment)?
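
For illustration, this is what I assume the query expands to while the variables still have no value (just a sketch, not taken from the logs):

{source=~"some_source_.*", host=~"some-host.*"} |= `` |= `` |= ``

An empty |= `` line filter matches every line, so Loki would scan the whole time range unfiltered, which would at least fit the resource usage I am seeing.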

Personally, I have never seen this before. You should be able to find some error logs from Loki.

As far as I know, it makes no difference whether you use ` or " for filtering. ` might be easier when you are doing regex.
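
For example, the only practical difference is escape handling; a quick sketch (label names borrowed from your query):

{source=~"some_source_.*"} |~ "\\d+ errors"
{source=~"some_source_.*"} |~ `\d+ errors`

Both match the same lines; inside double quotes the backslash has to be escaped, while backticks are raw strings, so they are less error-prone for regex.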

Hi @tonyswumac,

I tried to extract the info from the Loki entries in syslog (by the way, why does Loki not have its own log location like /var/log/loki.log?).
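
On a systemd host (which this seems to be, given loki.service in the kernel log below), one way to view those entries separately is:

journalctl -u loki.service

but a dedicated /var/log/loki.log would still need its own syslog or journald rule.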

Below is the Loki log (slightly edited to hide user and domain) that was generated when editing a dashboard/variable from the Grafana UI. After the first query error, Loki did not accept further queries, and CPU and memory consumption went up to the point where the OOM reaper started to kill the Loki process.

Loki log:

Sep  4 23:25:08 host grafana[1210]: logger=tsdb.loki endpoint=queryData pluginId=loki dsName="Loki Default" dsUID=edl3mv17d5tz4c uname=some_user fromAlert=false t=2024-09-04T23:25:08.725600044+02:00 level=error msg="Error received from Loki" error="Get \"http://localhost:3101/loki/api/v1/query_range?direction=backward&end=1725485144754000000&limit=10&query=%7Bsource%3D%22some_source%22%2C+host%3D~%22some_host.%2A%22%7D+%7C%3D+%60User32%60+%7C%3D+%60EventID%3D1074%60+%7C%3D+%60param6%60+%7C+pattern+%60%3C_%3E+param1%3D%3Cparam1%3E+param2%3D%3Cparam2%3E+param3%3D%3Cparam3%3E+param4%3D%3Cparam4%3E+param5%3D%3Cparam5%3E+param6%3D%3Cparam6%3E+param7%3D%3Cparam7%3E+%3C_%3E%60+%7C%3D+%60%24Hostname%60&start=1725474344754000000&step=10800000ms\": context canceled" status=cancelled duration=6.268563348s stage=databaseRequest start=2024-09-04T18:25:44.754Z end=2024-09-04T21:25:44.754Z step=3h0m0s query="{source=\"some_source\", host=~\"some_host.*\"} |= `User32` |= `EventID=1074` |= `param6` | pattern `<_> param1=<param1> param2=<param2> param3=<param3> param4=<param4> param5=<param5> param6=<param6> param7=<param7> <_>` |= `$Hostname`" queryType=range direction=backward maxLines=10 supportingQueryType=dataSample lokiHost=localhost:3101 lokiPath=/loki/api/v1/query_range

Sep  4 23:25:08 host grafana[1210]: logger=context userId=25 orgId=2 uname=some_user t=2024-09-04T23:25:08.725876358+02:00 level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.204.16.55 time_ms=6316 duration=6.316472923s size=609 referer="https://grafana.domain.tld/d/bdww9757he5fkd/wa-patching-exit-codes?editPanel=1&var-Hostname=&var-hostname=" handler=/api/ds/query status_source=downstream

Sep  4 23:25:08 host grafana[1210]: logger=tsdb.loki endpoint=queryData pluginId=loki dsName="Loki Default" dsUID=edl3mv17d5tz4c uname=some_user fromAlert=false t=2024-09-04T23:25:08.74496371+02:00 level=info msg="Prepared request to Loki" duration=27.223µs queriesLength=1 stage=prepareRequest runInParallel=false

Sep  4 23:25:10 host loki[2655045]: level=error ts=2024-09-04T21:25:08.751157291Z caller=retry.go:95 org_id=fake traceID=1fbf167554551155 msg="error processing request" try=0 query="{source=\"some_source\", host=~\"some_host.*\"} |= \"User32\" |= \"EventID=1074\" |= \"param6\" | pattern \"<_> param1=<param1> param2=<param2> param3=<param3> param4=<param4> param5=<param5> param6=<param6> param7=<param7> <_>\" |= \"$Hostname\"" query_hash=3533705000 start=2024-09-04T21:00:00+02:00 end=2024-09-04T22:00:00+02:00 start_delta=2h25m8.75114217s end_delta=1h25m8.751142577s length=1h0m0s retry_in=465.1894ms err="context canceled"

kernel log:

Sep  4 23:25:23 ksl-vlnx107 kernel: [5437036.244986] loki invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

Sep  4 23:25:23 ksl-vlnx107 kernel: [5437036.245005] CPU: 29 PID: 2656585 Comm: loki Not tainted 5.15.0-113-generic #123-Ubuntu

...

Sep  4 23:26:52 ksl-vlnx107 kernel: [5437125.591491] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=grafana-server.service,mems_allowed=0-1,global_oom,task_memcg=/system.slice/loki.service,task=loki,pid=3421016,uid=119

Sep  4 23:26:52 ksl-vlnx107 kernel: [5437125.594922] Out of memory: Killed process 3421016 (loki) total-vm:72314680kB, anon-rss:64098124kB, file-rss:0kB, shmem-rss:0kB, UID:119 pgtables:130036kB oom_score_adj:0

Sep  4 23:26:59 ksl-vlnx107 kernel: [5437132.816549] oom_reaper: reaped process 3421016 (loki), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

At this point, a warning about using the option flush_on_shutdown: true. I had always set this option to false, and when the Loki process crashed I never had any trouble with database consistency. But when this crash happened I had it set to true, and it was the first and only time I got a corrupt table. Most likely the Loki process starts to shut down somewhere under OOM pressure, tries to write down all logs held in memory, and then finally gets killed in the middle of flushing to disk. If this is true, I prefer to have the WAL enabled so I can just safely kill the Loki process.

You should have WAL enabled. I generally prefer to disable flush on shutdown as well. You just need to make sure your WAL directory is persistent.
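
For reference, a minimal sketch of that part of the ingester config, assuming the option mentioned above is ingester.wal.flush_on_shutdown and treating the path as a placeholder:

ingester:
  wal:
    enabled: true              # replay unflushed data after a crash
    dir: /var/lib/loki/wal     # must live on persistent storage
    flush_on_shutdown: false   # rely on WAL replay instead of a final flush

With the WAL directory on persistent storage, a killed Loki process replays the unflushed data on the next start instead of losing it.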