We have logs in Loki with a number of fields; the two that matter for this scenario are filename and computer name.
I want to create an alert with the following specifications:
If the count of machines on which the filename was observed is greater than a threshold, raise an alert.
I have managed to put this kind of aggregation together in a panel, but I am not able to use the same transformations and aggregation capabilities in a query/alert.
The panel output would look like this:
[image: screenshot of the panel output]
With these results in the panel, and a threshold of 5, we would only want an alert for File3, which was observed on 6 machines. Ideally the alert would contain the list of machines along with the filename.
I found a similar topic here: "Aggregating logs label's values in one line" (Grafana Loki, Grafana Labs Community Forums), but no solution was provided there.
Can you share your query, please? And the rule you’ve tried.
Starting from this:
count by (filename, winlog_computer_name) (
  rate(
    {index="wineventlog", source="json_we_source_events"}
      | json
      | regexp `"param1":(?P<filename>\S+)`
      | event_code = 1000
      | winlog_event_data_param5 = `` [$__auto]
  )
)
Apologies for any mistakes; I am new to Grafana.
I don’t think your query returns what you want. Try this (not tested):
count by (filename) (
  count_over_time(
    {index="wineventlog", source="json_we_source_events"}
      | json
      | regexp `"param1":(?P<filename>\S+)`
      | event_code = 1000
      | winlog_event_data_param5 = `` [10m]
  )
)
You want to count the number of series per filename, so the computer name doesn't belong in the outer grouping (including it there gives you the wrong results). And when putting queries in the ruler, it's best to use a fixed interval instead of auto.
Thank you for your reply! The query works, but what I really want is to count the distinct computer names on which the filename was observed.
The scenario is something like this: if a file is throwing errors on a single machine, that's fine, but if a file is throwing errors on more than 10 machines, I would like to get an alert. Additionally, if I could get the list of those machines, it would be perfect.
Perhaps try and add another aggregation level:
count by (filename) (
  sum by (filename, winlog_computer_name) (
    count_over_time(...)
  )
)
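To spell out how the two aggregation levels could feed an alert, here is an untested sketch. The stream selector and filters are copied from the queries earlier in the thread; the 10m range and the threshold of 5 are assumptions to adjust. The inner sum keeps filename in its grouping so that the outer count can still group by it. The right-hand side on its own (the count ... > 5 part) returns one series per filename seen on more than 5 machines. Joining it back with "and on (filename)" keeps one series per (filename, machine) pair for those files only, which is one possible way to get the machine names into the alert labels.

```logql
# Left side: one series per (filename, machine) pair seen in the last 10m.
sum by (filename, winlog_computer_name) (
  count_over_time(
    {index="wineventlog", source="json_we_source_events"}
      | json
      | regexp `"param1":(?P<filename>\S+)`
      | event_code = 1000
      | winlog_event_data_param5 = `` [10m]
  )
)
# Keep only the pairs whose filename also appears on the right side.
and on (filename)
(
  # Right side: number of distinct machines per filename, above the threshold.
  count by (filename) (
    sum by (filename, winlog_computer_name) (
      count_over_time(
        {index="wineventlog", source="json_we_source_events"}
          | json
          | regexp `"param1":(?P<filename>\S+)`
          | event_code = 1000
          | winlog_event_data_param5 = `` [10m]
      )
    )
  ) > 5
)
```

With this shape each machine becomes its own alert instance, so grouping notifications by filename on the alerting side would be needed to see the machines together in one message. If only the count matters, the right-hand side alone should be enough as the alert expression.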