Including full log error message in alert notification using Loki

Hello, guys! I have searched the entire Internet for the answer to the following question:
How do I include the full log error message in an alert notification when using Loki as a data source?
For example, I have the following query:
count_over_time({app="some-application"} |= "ERROR" [1s])
If the last value of this query is greater than zero, I want to send a notification that includes the FULL log line on which the alert was triggered (example: ERROR 1 — [ctor-http-nio-2] c.c.m.d.e.a.s.MarketDataEventSender : Error market data event sender [error occurred in message handler [org.springframework.integration.amqp.outbound.AmqpOutboundEndpoint@361a6caf]]).
Unfortunately, I didn’t find anything about this. Please help me: where can I find a step-by-step guide or some documentation about this? Or, if you know how, could you explain it here?
I would be very grateful for a detailed answer. Thank you in advance!


I don’t think this is supported, but there is a workaround. When creating a rule you can reference labels in the annotations, and you can manipulate the LogQL query to put the entire log line into a label. See Add log message in alert · Issue #5844 · grafana/loki · GitHub for an example.

@tonyswumac Can you please elaborate a little and give an example, maybe?

I am not sure how to put the entire log line into a label, as you suggest.

Thank you,
Thierry.

Here is a simple example (not tested). Consider the following log line:

TIMESTAMP ERROR Something went wrong

And let’s say you are catching the ERROR part with a rule like this:

  - name: "errorlog"
    rules:
      - alert: "errorlog"
        expr: 'count_over_time({<SELECTOR>} |~ "(?i)error" [5m]) > 0'
        for: 5m
        annotations:
          description: "got an error"
          message: "some error log"

And you’d get an alert with whatever labels you might have. You can now tweak the query a bit like so:

sum by (log_message) (count_over_time({<SELECTOR>} |~ "(?i)error" | pattern `<_> <_> <log_message>` [5m])) > 0

Notice the part where the log message is explicitly extracted, and how the sum by then turns that extracted part into a log_message label. You can now attempt to use that label in your ruler configuration:

  - name: "errorlog"
    rules:
      - alert: "errorlog"
        expr: 'sum by (log_message) (count_over_time({<SELECTOR>} |~ "(?i)error" | pattern `<_> <_> <log_message>` [5m])) > 0'
        for: 5m
        annotations:
          description: "got an error"
          message: "{{ $labels.log_message }}"

@tonyswumac Thank you for the fast response!

Not sure if you can help me further, but since this solution has not been tested, I don’t know whether I am missing something or whether Grafana has some kind of limitation.

Here is the query I am doing in the Editor of Alert rules/ Add rule:

sum by (log_message) (count_over_time({app="agones"} |~ "(?i)error" | pattern `<_> <_> <log_message>` [5m]))

The error I get is this:

Failed to evaluate queries and expressions: failed to execute query A: parse error at line 1, col 79: syntax error: unexpected STRING

I tried changing the query, but I am not sure what to put instead of a string.

For example, I have tried this:

sum by (log_message) (count_over_time({app="agones"} |~ "(?i)agones" | pattern `<_> "<log_message>"` [6h]))

Same error:

Failed to evaluate queries and expressions: failed to execute query A: parse error at line 1, col 80: syntax error: unexpected STRING

Here is an extract of the actual logs I would like to work with:

[screenshot of the log extract omitted]

If this sounds fairly easy to you and you can help me understand this, that would be greatly appreciated.

Just to be clear, “all I want” is to display the error log extract in the notification that is being sent. I am aware, though, that it probably is not as simple as I was hoping it would be :sweat_smile:

Been spending hours and hours on this. So if the solution is handy, that would be great!

Otherwise, I’ll do without it!
Thierry.

Maybe replace the backticks (`) with double quotes (") in the pattern. Example from the LogQL Analyzer: LogQL Analyzer | Grafana Loki documentation
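For instance, the earlier query would then look roughly like this (untested sketch, same selector and pattern as before, just with a double-quoted pattern string):

sum by (log_message) (count_over_time({app="agones"} |~ "(?i)error" | pattern "<_> <_> <log_message>" [5m]))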

Thanks for the link! That is helpful to verify my syntax.

I had already tried with the double quotes as well, and it returns the same error message. I have also reproduced the syntax using the Grafana query builder. Even when I do that, I am getting the same error, which leads me to think that it is related to how Grafana handles the LogQL query…

Actually, I got the query to work on Grafana 10. I was using Grafana 9 when it was returning an error.

Although, I’m still struggling to retrieve the log_message label.

I have put this in Grafana:

[screenshot of the annotation configuration omitted]

I am getting this:

 "<no value>"

The query is working, though.

I have tried different syntaxes. The only thing that gets me some data is if I put

"{{ . }}"

Which returns the following:

 "{map[__alert_rule_namespace_uid__:DYYJLTgSk __alert_rule_uid__:E2vxYTRSz alertname:Agones errors grafana_folder:log-monitor] map[] }"

I don’t see anything about labels…
How would you troubleshoot this?

Is there a way to display all the labels and annotations?

Thank you.
Thierry.

"{{ $labels.log_message }}" is for alert manager, I don’t use Grafana for alerting, so I am not sure what you’d put in there. Perhaps try {{ __log_message }} or {{ .log_message }}?

That’s the thing, I’m playing trial and error without really knowing what I am doing.
These two ways that you suggested did not work.

I don’t know if I can retrieve it in the annotations, because if I put {{ . }}, here is what I am getting:

"{map[__alert_rule_namespace_uid__:DYYJLTgSk __alert_rule_uid__:E2vxYTRSz alertname:Agones errors grafana_folder:log-monitor] map[] }"

If I go to the Grafana alert manager, I can also configure stuff there and that’s where I am seeing the error log, if I put {{ . }}.
However, it is way too verbose and I would like to be able to display only the fields that I am interested in.

Here is what I am putting in the Text body of my notification in the Grafana alert manager:

Alert summaries:

range alerts-firing: {{ range .Alerts.Firing }}

Summary: {{ .Annotations.summary }}

Message: {{ .Annotations.message }}

Description: {{ .Annotations.description }}

Full output: {{ . }}

{{ end }}

With
.Annotations.summary being: Here is the error summary: "{{ . }}"
.Annotations.message being: Here is the error message: "{{ .log_message }}"
.Annotations.description being: Here is the error description: "{{ __log_message }}"

And here is the output I am getting from there:

1 alerts are firing.
Alert summaries:
range alerts-firing:
Full output: {firing map[alertname:Agones errors grafana_folder:log-monitor] map[description:Here is the error description: "{{ __log_message  }}" message:Here is the error message: "{{ .log_message }}" summary:Here is the error summary: "{map[__alert_rule_namespace_uid__:DYYJLTgSk __alert_rule_uid__:E2vxYTRSz alertname:Agones errors grafana_folder:log-monitor] map[B0:1 B1:1 B10:1 B11:1 B12:1 B13:1 B14:1 B15:1 B16:1 B17:1 B18:1 B19:1 B2:1 B20:1 B21:1 B22:1 B23:1 B24:1 B25:1 B26:1 B27:1 B28:1 B29:1 B3:1 B30:1 B31:1 B4:1 B5:1 B6:1 B7:1 B8:1 B9:1] [ var='B0' metric='{log_message="http: TLS handshake error from 172.29.0.0:10064: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:10064: EOF} value=1 ], [ var='B1' metric='{log_message="http: TLS handshake error from 172.29.0.0:1235: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:1235: EOF} value=1 ], [ var='B2' metric='{log_message="http: TLS handshake error from 172.29.0.0:17575: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:17575: EOF} value=1 ], [ var='B3' metric='{log_message="http: TLS handshake error from 172.29.0.0:17622: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:17622: EOF} value=1 ], [ var='B4' metric='{log_message="http: TLS handshake error from 172.29.0.0:20110: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:20110: EOF} value=1 ], [ var='B5' metric='{log_message="http: TLS handshake error from 172.29.0.0:20214: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:20214: EOF} value=1 ], [ var='B6' metric='{log_message="http: TLS handshake error from 172.29.0.0:2039: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:2039: EOF} value=1 ], [ var='B7' metric='{log_message="http: TLS handshake error from 172.29.0.0:20426: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:20426: EOF} value=1 ], [ var='B8' metric='{log_message="http: TLS handshake error from 172.29.0.0:2086: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:2086: EOF} value=1 ], [ var='B9' metric='{log_message="http: TLS handshake error from 172.29.0.0:20969: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:20969: EOF} value=1 ], [ var='B10' metric='{log_message="http: TLS handshake error from 172.29.0.0:21871: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:21871: EOF} value=1 ], [ var='B11' metric='{log_message="http: TLS handshake error from 172.29.0.0:25474: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:25474: EOF} value=1 ], [ var='B12' metric='{log_message="http: TLS handshake error from 172.29.0.0:25786: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:25786: EOF} value=1 ], [ var='B13' metric='{log_message="http: TLS handshake error from 172.29.0.0:28767: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:28767: EOF} value=1 ], [ var='B14' metric='{log_message="http: TLS handshake error from 172.29.0.0:34484: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:34484: EOF} value=1 ], [ var='B15' metric='{log_message="http: TLS handshake error from 172.29.0.0:40926: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:40926: EOF} value=1 ], [ var='B16' metric='{log_message="http: TLS handshake error from 172.29.0.0:41173: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:41173: EOF} value=1 ], [ var='B17' metric='{log_message="http: TLS 
handshake error from 172.29.0.0:41727: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:41727: EOF} value=1 ], [ var='B18' metric='{log_message="http: TLS handshake error from 172.29.0.0:43126: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:43126: EOF} value=1 ], [ var='B19' metric='{log_message="http: TLS handshake error from 172.29.0.0:44664: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:44664: EOF} value=1 ], [ var='B20' metric='{log_message="http: TLS handshake error from 172.29.0.0:47387: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:47387: EOF} value=1 ], [ var='B21' metric='{log_message="http: TLS handshake error from 172.29.0.0:48222: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:48222: EOF} value=1 ], [ var='B22' metric='{log_message="http: TLS handshake error from 172.29.0.0:51410: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:51410: EOF} value=1 ], [ var='B23' metric='{log_message="http: TLS handshake error from 172.29.0.0:54182: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:54182: EOF} value=1 ], [ var='B24' metric='{log_message="http: TLS handshake error from 172.29.0.0:55419: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:55419: EOF} value=1 ], [ var='B25' metric='{log_message="http: TLS handshake error from 172.29.0.0:60327: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:60327: EOF} value=1 ], [ var='B26' metric='{log_message="http: TLS handshake error from 172.29.0.0:60330: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:60330: EOF} value=1 ], [ var='B27' metric='{log_message="http: TLS handshake error from 172.29.0.0:61002: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:61002: EOF} value=1 ], [ var='B28' metric='{log_message="http: TLS handshake error from 172.29.0.0:64058: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:64058: EOF} value=1 ], [ var='B29' metric='{log_message="http: TLS handshake error from 172.29.0.0:64065: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:64065: EOF} value=1 ], [ var='B30' metric='{log_message="http: TLS handshake error from 172.29.0.0:8404: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:8404: EOF} value=1 ], [ var='B31' metric='{log_message="http: TLS handshake error from 172.29.0.0:9816: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:9816: EOF} value=1 ]}"] 2023-08-22 18:22:30 +0000 UTC 0001-01-01 00:00:00 +0000 UTC http://localhost:3000/alerting/grafana/E2vxYTRSz/view 1661f2d935f5cc1b http://localhost:3000/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DAgones+errors&matcher=grafana_folder%3Dlog-monitor   map[B:1] [ var='B0' metric='{log_message="http: TLS handshake error from 172.29.0.0:10064: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:10064: EOF} value=1 ], [ var='B1' metric='{log_message="http: TLS handshake error from 172.29.0.0:1235: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:1235: EOF} value=1 ], [ var='B2' metric='{log_message="http: TLS handshake error from 172.29.0.0:17575: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:17575: EOF} value=1 ], [ var='B3' metric='{log_message="http: TLS handshake error from 172.29.0.0:17622: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:17622: EOF} value=1 ], [ 
var='B4' metric='{log_message="http: TLS handshake error from 172.29.0.0:20110: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:20110: EOF} value=1 ], [ var='B5' metric='{log_message="http: TLS handshake error from 172.29.0.0:20214: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:20214: EOF} value=1 ], [ var='B6' metric='{log_message="http: TLS handshake error from 172.29.0.0:2039: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:2039: EOF} value=1 ], [ var='B7' metric='{log_message="http: TLS handshake error from 172.29.0.0:20426: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0:20426: EOF} value=1 ], [ var='B8' metric='{log_message="http: TLS handshake error from 172.29.0.0:2086: EOF"}' labels={log_message=http: TLS handshake error from 172.29.0.0…

As you can see, we are seeing the actual error log log_message="http: TLS handshake error from 172.29.0.0:17622: EOF". My goal has been reached.
However, as you can see, it is quite a verbose output.
If you could help me display only the error message, that would greatly help.

Thank you!

Judging from your output I do see log_message in there, so "{{ $labels.log_message }}" should work in the alert manager configuration. If you can’t get that to work I am not sure I can be of further help, since I don’t use Grafana for alerting.

However, I do want to caution you again on your approach. If you look closely at your very detailed output, you would be firing 30 different alerts because each of them has a slightly different log_message label. Maybe that’s precisely what you are looking for, but I’d think that could get a bit noisy.
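One way to reduce that noise (a rough, untested sketch; it only covers the TLS handshake errors visible in the output above, and the error_type label name is made up for illustration) would be to extract a coarser label, so that the variable client address does not create a new series for every log line:

sum by (error_type) (count_over_time({app="agones"} |~ "(?i)error" | regexp `(?P<error_type>http: TLS handshake error)` [5m])) > 0

That way all of the TLS handshake errors would collapse into a single alert instance, at the cost of losing the per-line detail.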

Yeah, the alert itself is set on a time window of 6 hours, which is way too long and the reason why we see so many occurrences. That is set that way just so I am sure to always have it firing alerts so I can test the notifications. Later on, I’ll shrink the time window to a few minutes.

I would also expect "{{ $labels.log_message }}" to work, but it just crashes.

I appreciate your input and understand that you are not familiar with Grafana.

Still, this is a Grafana forum, so I would hope that someone will be able to help with this.
I will make a more specific post about the problem I am now having: fine-tuning this alert notification.

Thank you!

I would say that you need to group by the log message in the query (not sure how); then it should be available as a label for the alert.
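Something along the lines of the grouping shown earlier in the thread, for example (untested, reusing the selector from the question above):

sum by (log_message) (count_over_time({app="agones"} |~ "(?i)error" | pattern `<_> <_> <log_message>` [5m])) > 0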

So, I’ve been working with this for a bit, and I have an answer for you. According to the documentation on the templating language, what you want is something like this:

{{ define "yourtemplate" }}
   {{ range .Alerts }}
      {{ index .Labels "log_message" }}
   {{ end }}
{{ end }}

We first define the template, and then call it in the contact point’s text fields like this:

{{ template "yourtemplate" . }}

Then we iterate over the list of current Alerts so that we are in a scope in which we have access to their data. We then access the Labels map with the index function, providing the key of the label that we want to display.
There are also ways to iterate over all of the labels and annotations, as described in the docs (see the sketch at the end of this post).
While you could copy the alert’s labels into an annotation, you can simply access them directly as shown above.
To get a label containing the log line that triggered the alert, you need to do two things in the alert’s query: add a sum by (or any other aggregator, if you are not interested in the number of lines that triggered the alert) and add a pattern stage (I don’t know the syntax for formatting the text, only how to get the whole log line):

sum by(log_message) (count_over_time({filename="/var/log/latest.log"} |= `error` | pattern `<message>` [11s]))

This works on my installation.
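And to answer the earlier question about displaying all labels and annotations, here is a possible sketch (untested; it assumes Grafana's notification template data, where Labels and Annotations expose a SortedPairs helper, and the template name alldata is just a placeholder):

{{ define "alldata" }}
  {{/* untested sketch: SortedPairs is assumed from Grafana's template data; "alldata" is a placeholder name */}}
  {{ range .Alerts }}
    Labels:
    {{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}
    {{ end }}
    Annotations:
    {{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}
    {{ end }}
  {{ end }}
{{ end }}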