Loki ruler sometimes does not raise alert

We have seen that the Loki ruler sometimes does not raise an alert.
There are no errors in the logs.

Any pointers or settings we need to look into?

When I query from the Grafana Loki UI, I can get results during that time window.
The alert query used is similar to the one below:

sum(
  count_over_time(
    {foo="bar"}
      |= "bazzError"
      # extract the entire log line as a label
      | regexp `(?P<log>.+)`
    [4m]
  )
) by (log)

Hi @jsmorphiz

Can you please provide your alert definition as well?

Please find the alert definition below.
I am able to get alerts; however, it is not consistent and sometimes alerts are missed, though I am able to view the results in the Grafana Loki UI.

groups:
  - name: rate-alerting
    rules:
      - alert: ServiceFailure
        expr: |
          sum(count_over_time({job="foo"} |~ "^.*status:.*Failure;" | regexp `(?P<log>.+)` | regexp `(?:custID:(?P<SomeID>[0-9]+);c)` [4m])) by (SomeID, log)
        labels:
            severity: warning
            category: logs
        annotations:
            summary: Service Failure for SomeID {{ $labels.SomeID }}
            description: Service Failure for SomeID {{ $labels.SomeID }}, log message {{ $labels.log }}

OK thanks.

I don’t see anything obviously wrong here.
Are you sure your Alertmanager instance is 100% available?
Also, what’s your evaluation_interval value set at?

Hi @dannykopping ,

Yes, Alertmanager is 100% available. I have also set the log level to debug, so I can see all alerts coming from Loki.
We have not set evaluation_interval, so it is using the default value of 1m.
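For reference, in case the default needs adjusting: the evaluation interval can be set globally in the Loki config under the ruler block, or per rule group in the rules file (the values here are illustrative, not recommendations):

```yaml
# Loki config: global default for all rule groups
ruler:
  evaluation_interval: 1m

# Or per group in the rules file, which overrides the global default:
# groups:
#   - name: rate-alerting
#     interval: 1m
#     rules: [...]
```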

Are you sure the alert was not already raised? AFAIK the behaviour in Alertmanager is that it will not fire another notification if the same alert is received within a given period of time.
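For context, that notification suppression is governed by the route settings in the Alertmanager config: a still-firing alert is only re-notified after repeat_interval. A minimal sketch (receiver name and values are examples, not recommendations):

```yaml
route:
  receiver: default
  group_by: ['alertname', 'SomeID']
  group_wait: 30s      # wait before sending the first notification for a new group
  group_interval: 5m   # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h  # re-send a notification for a still-firing alert only after this long
```

If repeat_interval is long, an alert that fires, resolves, and fires again inside that window can look like a "missed" alert even though the ruler evaluated it correctly.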

In any case, there are many moving parts here, which makes this pretty difficult to diagnose. If this is reproducible, I'd suggest removing AM from the equation: configure your AM URL to point at a service that will capture the requests (like https://requestbin.net/), and validate that the requests are successfully sent. If they aren't, we can dig deeper into why there appear to be gaps.
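Concretely, that test amounts to temporarily pointing the ruler's Alertmanager URL at the request-capturing endpoint (the URL below is a placeholder, not a real bin):

```yaml
ruler:
  # Temporarily swap your real Alertmanager URL for a request-capturing endpoint
  # to verify that notifications actually leave the ruler:
  alertmanager_url: https://your-bin-id.requestbin.net
```

If every expected firing shows up in the bin, the gap is on the Alertmanager side; if not, the ruler itself is skipping evaluations or discarding results.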

Yes, I can confirm Alertmanager is not the issue (verified from its logs). I can also see logs in the Loki ruler whenever it fetches results; they are logged with “Rule evaluation result discarded… <>”.
We are on Loki 2.3.

That message originates from the underlying Prometheus code.

Do you see any other errors starting with “Error on ingesting”?

hi @dannykopping ,

I don’t see any errors with “Error on ingesting”.

One more thing I would like to add: the ruler and ingester are on different hosts. Not sure if this is correct.
Server 1 (querier, query-frontend, ruler), server 2 (ingester, distributor).
Are there any other parameters or fields in the Loki config we need to look into?

The ruler behaves like a querier, so that shouldn't matter.