Please find the alert definition below.
I do receive alerts, but not consistently: some alerts are missed even though I can see the matching log lines in the Grafana Loki UI.
groups:
  - name: rate-alerting
    rules:
      - alert: ServiceFailure
        expr: |
          sum(count_over_time({job="foo"} |~ "^.*status:.*Failure;" | regexp `(?P<log>.+)` | regexp `(?:custID:(?P<SomeID>[0-9]+);c)` [4m])) by (SomeID,log)
        labels:
          severity: warning
          category: logs
        annotations:
          summary: Service Failure for SomeID {{ $labels.SomeID }}
          description: Service Failure for SomeID {{ $labels.SomeID }} , Log Message {{ $labels.log }}
I don’t see anything obviously wrong here.
Are you sure your Alertmanager instance is 100% available?
Also, what’s your evaluation_interval value set at?
Yes, Alertmanager is 100% available. I have also set its log level to debug, so I can see all alerts coming from Loki.
We have not set evaluation_interval, so it is using the default value, i.e. 1m.
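For reference, a minimal sketch of where this setting lives, assuming the ruler is configured through the top-level ruler block of the Loki config file. The value shown is simply the default made explicit, not a recommended change:

```yaml
ruler:
  # How often the ruler evaluates each rule group.
  # When this key is omitted, Loki falls back to 1m.
  evaluation_interval: 1m
```

With a 1m interval and the [4m] range in the rule above, each matching log line is counted in several consecutive evaluations rather than just one.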
Are you sure the alert was not already raised? AFAIK the behaviour in Alertmanager is that it will not fire another notification if the same alert is received within a given period of time.
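The notification behaviour described above is controlled by the route timers in the Alertmanager config. The sketch below uses illustrative values; the receiver name, the grouping labels, and the durations are assumptions, not taken from this setup:

```yaml
route:
  receiver: default-receiver        # hypothetical receiver name
  group_by: ['alertname', 'SomeID'] # alerts sharing these labels are batched together
  group_wait: 30s       # wait before the first notification for a new group
  group_interval: 5m    # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h   # wait before re-sending a notification for alerts that are still firing
receivers:
  - name: default-receiver
```

If an alert with the same label set arrives again inside these windows, Alertmanager may fold it into an earlier notification instead of sending a new one, which can look like a missed alert.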
In any case, there are many moving parts here, which makes this pretty difficult to diagnose. If this is reproducible, I’d suggest removing AM from the equation: point your AM URL at a service that will simply receive the requests (like https://requestbin.net/) and validate that the requests are sent successfully. If they are not, we can try to dive deeper into why there appear to be gaps.
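A sketch of that temporary change, assuming the ruler's Alertmanager address is set through alertmanager_url in the ruler block. The URL below is a placeholder for whatever request-capture endpoint you create, not a real address:

```yaml
ruler:
  # Temporarily send alerts to a request-capture endpoint instead of Alertmanager.
  # Placeholder URL (assumption): replace it with the endpoint you created.
  alertmanager_url: https://request-capture.example/my-bin
```

Every firing alert should then show up as an HTTP POST at the capture endpoint, so a gap there would point at the ruler rather than at Alertmanager.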
Yes, I can confirm that Alertmanager is not the issue (verified from its logs). I can also see log entries in the Loki ruler whenever it fetches results; they are logged as “Rule evaluation result discarded… <>”.
We are on Loki 2.3.
One more thing I would like to add: the ruler and the ingester are on different hosts. I am not sure whether this setup is correct.
Server 1 (querier, frontend, ruler), server 2 (ingester, distributor)
Is there any other parameter or field in the Loki config that we should look into?