Loki Rules and Alerts: Issues Sending to Mimir

This doesn’t feel like it should be so difficult.

I have a recording rule set up in Loki that remote-writes its samples to Mimir, yet the data never arrives, and there are no errors in either Loki or Mimir. An alert rule I created as a test does work. Here are the query and the relevant ruler config from my Helm values:

sum by (site)(count_over_time({job="cloudflare"} | json | ClientRequestURI!~".*(cdn-cgi|ico|js|jpg|jpeg|ico|tokens|png|webp|cfm|css|webmanifest)" [1m]))
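
For context, that query is deployed as a standard Prometheus-style rule group, roughly like the following (the group name and record name here are placeholders, not my actual ones):

groups:
  - name: cloudflare-rules                   # placeholder group name
    interval: 1m
    rules:
      - record: cloudflare:requests:rate1m   # placeholder metric name
        expr: |
          sum by (site)(count_over_time({job="cloudflare"} | json | ClientRequestURI!~".*(cdn-cgi|ico|js|jpg|jpeg|ico|tokens|png|webp|cfm|css|webmanifest)" [1m]))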

  rulerConfig:
    # Expose the ruler API so rules can be managed and inspected over HTTP.
    enable_api: true
    # Send alert notifications to the Mimir Alertmanager using its v2 API.
    enable_alertmanager_v2: true
    alertmanager_url: http://mimir-alertmanager.mimir.svc:8080/alertmanager
    # Rule definitions are loaded from this S3 bucket.
    storage:
      type: s3
      s3:
        s3: s3://keys@us-east-2
        bucketnames: monitoring-loki-rules
    # Samples produced by recording rules are remote-written to Mimir
    # through its gateway.
    remote_write:
      enabled: true
      client:
        url: http://mimir-nginx.mimir.svc:80/api/v1/push

In the Loki Backend pod, I can see the rule being executed:

level=info ts=2023-05-24T17:09:37.620230802Z caller=metrics.go:152 component=ruler org_id=fake latency=fast query="sum by (site)(count_over_time({job="cloudflare"} | json | ClientRequestURI!~".*(cdn-cgi|ico|js|jpg|jpeg|ico|tokens|png|webp|cfm|css|webmanifest)"[1m]))" query_hash=414661914 query_type=metric range_type=instant length=0s start_delta=174.601775ms end_delta=174.602052ms step=0s duration=89.152282ms status=200 limit=0 returned_lines=0 throughput=175MB total_bytes=16MB lines_per_second=127601 total_lines=11376 total_entries=1 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s

For alerts, I do see the alert triggering in Grafana Alerts; however, I'm also seeing the following in the logs:

Loki Backend:

caller=dedupe.go:112 storage=registry manager=tenant-wal instance=fake component=remote level=error remote_name=fake-rw-mimir url=http://mimir-nginx.mimir.svc:80/api/v1/push msg="non-recoverable error" count=3 exemplarCount=0 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=fake: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2023-05-24T19:24:37.445Z and is from series {__name__=\"ALERTS_FOR_STATE\", alertname=\"Test Loki Alert Rule\", site=\"sitename\"}"

Mimir:

caller=grpc_logging.go:43 level=warn duration=1.021925ms method=/cortex.Ingester/Push err="rpc error: code = Code(400) desc = user=fake: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2023-05-24T19:10:37.445Z and is from series {__name__=\"ALERTS_FOR_STATE\", alertname=\"Test Loki Alert Rule\", site=\"sitename\"}" msg=gRPC

It would appear that this alert is being generated by multiple Loki instances, all of which have the same labels. To work around this, you can add a label that uniquely identifies the pod from which the alert was generated; see the sketch below.
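
In Helm values, something along these lines might do it (a sketch only: it assumes the chart exposes extraArgs/extraEnv on the backend, that Loki is started with -config.expand-env=true so ${POD_NAME} is expanded at config load, and that the ruler's remote-write client honors write_relabel_configs; please verify all three against your versions):

backend:
  extraArgs:
    - -config.expand-env=true
  extraEnv:
    # Downward API: a value that is unique per pod.
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
loki:
  rulerConfig:
    remote_write:
      enabled: true
      client:
        url: http://mimir-nginx.mimir.svc:80/api/v1/push
        write_relabel_configs:
          # Stamp every remote-written sample (including ALERTS_FOR_STATE)
          # with the pod it came from, so series from different ruler
          # replicas no longer collide on identical label sets.
          - target_label: ruler_pod
            replacement: ${POD_NAME}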

It appears that you're using the simple-scalable Helm chart; is that correct?
This might be a bug in our chart if you have loki-backend scaled above 1.

Sorry I didn't see this sooner; I don't seem to be getting notifications.

The chart is loki-5.5.3, which defaults to simple-scalable mode.

The Loki backend is scaled to 3, which is the default in the Helm chart.

@mdiorio thanks for the confirmation.

Could you please try scaling the backend to 1 to see if this resolves your issue?
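
With the simple-scalable chart that should just be a values change like this (assuming the chart's default backend section):

backend:
  replicas: 1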

That doesn’t seem to have changed anything.

The alert rule is still firing in Loki but isn't being sent to the Mimir Alertmanager, so it can't be forwarded to On Call.
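
For anyone debugging the same thing: one way to check whether any alerts are reaching the Mimir Alertmanager at all is to hit its v2 API directly, e.g. with a throwaway pod like this (the pod name is arbitrary; the tenant header matches the user=fake seen in the logs above):

apiVersion: v1
kind: Pod
metadata:
  name: mimir-am-check          # arbitrary name
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl
      args:
        - -sS
        - -H
        - "X-Scope-OrgID: fake"
        - http://mimir-alertmanager.mimir.svc:8080/alertmanager/api/v2/alerts

kubectl logs mimir-am-check then shows whether the Alertmanager has received anything for that tenant.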

The recording rule is still running on the single Loki backend without error. In the Mimir distributor, the duplicate-sample push error went from constant to only occasional; I'm guessing the timestamps in the Loki backend pods are no longer aligning.

However, the recording rule is evaluated every 60 seconds and I'm not seeing a constant error, yet its results aren't showing up anywhere in Mimir.
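
Querying Mimir directly rules out a Grafana-side issue; something like this throwaway pod asks Mimir's query API for the recorded series (cloudflare:requests:rate1m is the placeholder record name from the sketch above; substitute your own):

apiVersion: v1
kind: Pod
metadata:
  name: mimir-query-check       # arbitrary name
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl
      args:
        - -sS
        - -G
        - http://mimir-nginx.mimir.svc:80/prometheus/api/v1/query
        - -H
        - "X-Scope-OrgID: fake"
        - --data-urlencode
        - query=cloudflare:requests:rate1m

An empty result list here, while the ruler logs show the rule evaluating, would confirm the samples are being lost somewhere between the ruler's remote write and Mimir's ingesters.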