How to count events (errors) over a selected period? (Loki/Prometheus)

Hello!
(Grafana newbie here.)

I’m really frustrated by what should be a simple task: counting particular events during a selected time range.
I’ve got a simple dashboard with log lines from Loki (in the screenshot, filtered by the “ERROR” keyword), some graphs, etc.

I want to count (just count, not average, etc.) my app’s errors during a given time interval:
5m, 30m, or user-selected (by dragging with the mouse on the graph).

I’ve created a variable ($var_selected_interval, type=interval, Auto=ON) and used it in a panel:

So… as we can see in the first screenshot, the error count in the logs differs greatly from the “calculated” one.
What does this “1.9” mean?
1.9 (or even 19) errors in the 5-minute period selected by the user? Wrong!
1.9 per second? Doubtful.
So, I need only a count: a count of all error lines in my log.

I have tried count_over_time, sum, rate, etc… Nothing helps.

What am I doing wrong?
Please help.

Hmmm… looks like I should use the “$__range” variable, right?
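
For what it’s worth, the difference between a plain count and a per-second rate can be sketched with a toy model. $__range covers the whole time range selected on the dashboard, so a query along the lines of sum(count_over_time({app="my-app"} |= "ERROR" [$__range])) should give one total for the selection, whereas rate() reports entries per second. The label selector is hypothetical, the timestamps below are made up, and the half-open window is an assumption of this sketch; it only mimics the counting and is not how Loki is implemented.

```python
from bisect import bisect_right


def count_over_time(timestamps, end, window):
    """Count events whose timestamp falls in the window (end - window, end]."""
    ordered = sorted(timestamps)
    return bisect_right(ordered, end) - bisect_right(ordered, end - window)


# Made-up error timestamps, in seconds since the start of the dashboard range.
error_times = [12, 47, 48, 130, 299, 300, 301, 450]

selected_range = 300  # a 5-minute selection, i.e. what $__range would cover
dashboard_end = 300   # evaluate once, at the end of the selection

total = count_over_time(error_times, dashboard_end, selected_range)
per_second = total / selected_range  # what a rate-style query would report

print(total)       # plain count of events in the selection
print(per_second)  # events per second, a different quantity
```

The point of the sketch is only that a fractional value on the panel usually means a per-second or averaged quantity is being shown rather than a count.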

Do you have TOTAL selected in the stats value setting?

Not sure where to find this “TOTAL” option…

Hi. I got this to work in a similar use case. I wanted to tally up the total number of lines where a specific issue occurred. Here is what I did.

One thing I ran into is that when the interval was very large, the counts were not accurate; it was as if it was dropping some of the entries. This is why I set the max data points option. Once I did that, it was better, unless I went to a really large interval.

Hope this helps!

Thank you, this worked!! Amazing when you dig up a bit of gold like this :tada:

Interestingly, if I go above 10000 it returns no data. Not 100% sure why, but it works fine for now.

Hi!
Thanks for sharing your solution, it helped me figure out how to solve a similar objective.
I was also reading the docs, and while I couldn’t put the puzzle pieces together there, I remembered something that might be the cause of your issue with huge counts.
In your query you have a hardcoded interval of [1s], so as the inspected time range grows, you are totalling up an increasing number of 1s data points, growing towards and eventually exceeding the max data points limit (that’s why raising that limit helped).
But that is where the $__interval variable can help. If you increased the time range to a month and showed that curve on a graph, you would have far more 1s data points than pixels on the screen, so Grafana comes up with an interval that gives roughly one data point per pixel for the selected time range, and provides that value in $__interval for you to use.
Long story short, if you replace [1s] with [$__interval], your query will always total up a reasonably low number of intermediate counts and should not run into any data point limits (and may perform a little better too).
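
To put rough numbers on that, here is a small arithmetic sketch. It assumes $__interval is approximately the time range divided by the panel’s max data points; Grafana also rounds the result and applies a minimum interval, which this sketch ignores. The value of 1000 max data points is just a hypothetical setting.

```python
import math


def points_in_range(range_seconds, step_seconds):
    """Number of intermediate data points when stepping through a time range."""
    return math.ceil(range_seconds / step_seconds)


def approx_interval(range_seconds, max_data_points):
    """Rough stand-in for $__interval: time range divided by the point budget."""
    return range_seconds / max_data_points


MONTH = 30 * 24 * 3600   # 2,592,000 seconds
MAX_DATA_POINTS = 1000   # hypothetical panel setting

fixed_step_points = points_in_range(MONTH, 1)       # hardcoded [1s]
interval = approx_interval(MONTH, MAX_DATA_POINTS)  # what $__interval might be
scaled_points = points_in_range(MONTH, interval)    # points when using it

print(fixed_step_points, interval, scaled_points)
```

With a fixed 1s step the number of intermediate points grows with the time range, while with the scaled interval it stays at about the configured budget no matter how far you zoom out.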

Also check out this blog post; it explains more of the mystery behind those special Grafana variables: New in Grafana 7.2: $__rate_interval for Prometheus rate queries that just work | Grafana Labs