Loki rules and alerts: issues sending to Mimir

This doesn’t feel like it should be so difficult.

I have a recording rule set up in Loki that remote-writes its results to Mimir, yet the data never arrives, and there are no errors in either Loki or Mimir. I have also tried creating an alert rule, and that does work.

sum by (site)(count_over_time({job="cloudflare"} | json | ClientRequestURI!~".*(cdn-cgi|ico|js|jpg|jpeg|ico|tokens|png|webp|cfm|css|webmanifest)" [1m]))

  rulerConfig:
    enable_api: true
    enable_alertmanager_v2: true
    alertmanager_url: http://mimir-alertmanager.mimir.svc:8080/alertmanager
    storage:
      type: s3
      s3:
        s3: s3://keys@us-east-2
        bucketnames: monitoring-loki-rules
    remote_write:
      enabled: true
      client:
        url: http://mimir-nginx.mimir.svc:80/api/v1/push

In the Loki Backend pod, I can see the rule being executed:

level=info ts=2023-05-24T17:09:37.620230802Z caller=metrics.go:152 component=ruler org_id=fake latency=fast query="sum by (site)(count_over_time({job="cloudflare"} | json | ClientRequestURI!~".*(cdn-cgi|ico|js|jpg|jpeg|ico|tokens|png|webp|cfm|css|webmanifest)"[1m]))" query_hash=414661914 query_type=metric range_type=instant length=0s start_delta=174.601775ms end_delta=174.602052ms step=0s duration=89.152282ms status=200 limit=0 returned_lines=0 throughput=175MB total_bytes=16MB lines_per_second=127601 total_lines=11376 total_entries=1 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s

For alerts, I do see the alert triggering in Grafana Alerts; however, I'm also seeing the following in the logs:

Loki Backend:

caller=dedupe.go:112 storage=registry manager=tenant-wal instance=fake component=remote level=error remote_name=fake-rw-mimir url=http://mimir-nginx.mimir.svc:80/api/v1/push msg="non-recoverable error" count=3 exemplarCount=0 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=fake: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2023-05-24T19:24:37.445Z and is from series {__name__=\"ALERTS_FOR_STATE\", alertname=\"Test Loki Alert Rule\", site=\"sitename\"}"      

Mimir:

caller=grpc_logging.go:43 level=warn duration=1.021925ms method=/cortex.Ingester/Push err="rpc error: code = Code(400) desc = user=fake: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2023-05-24T19:10:37.445Z and is from series {__name__=\"ALERTS_FOR_STATE\", alertname=\"Test Loki Alert Rule\", site=\"sitename\"}" msg=gRPC

It would appear that this alert is being generated by multiple Loki instances, all of which have the same labels. To work around this, you can add a label which uniquely identifies the pod from which the alert was generated.
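For example, something like this in the Helm values might do it (just a sketch; I'm assuming your Loki version supports external_labels on the ruler and that you can run with -config.expand-env=true so ${HOSTNAME} expands to the pod name at config load time; if external_labels isn't available, a write_relabel_configs entry on the remote_write client could add a similar static label):

      rulerConfig:
        external_labels:
          # hypothetical label name, just to disambiguate samples from each ruler pod
          loki_ruler_pod: ${HOSTNAME}

With a label that differs per pod, the series remote-written by each ruler instance no longer collide on identical label sets and timestamps.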

It appears that you’re using the simple-scalable Helm chart, is that correct?
This might be a bug in our chart if you have loki-backend scaled > 1.

Sorry I didn’t see this sooner; I don’t seem to be getting notifications.

The chart is loki-5.5.3, which defaults to simple-scalable mode.

The Loki backend is scaled to 3, which is the default in the Helm chart.

@mdiorio thanks for the confirmation.

Could you please try scaling the backend to 1 to see if this resolves your issue?

That doesn’t seem to have changed anything.

The alert rule is still firing in Loki, but it isn't getting sent to the Mimir Alertmanager so it can be forwarded to On Call.

The recording rule is still running on the single Loki backend without error. In the Mimir distributor, the push error for the sample rejection went from constant to only occasional; I'm guessing the timestamps from the Loki backend pods may no longer be aligning.

However, the recording rule is evaluated every 60 seconds and I'm not seeing a constant error, yet the data isn't showing up anywhere in Mimir.
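For reference, this is roughly how I'd expect to be able to check whether the recorded series has landed in Mimir (a sketch, assuming the default gateway routing where /prometheus is proxied to the query path, and that anonymous is the default tenant; adjust for your setup):

    curl -G \
      -H 'X-Scope-OrgID: anonymous' \
      'http://mimir-nginx.mimir.svc:80/prometheus/api/v1/query' \
      --data-urlencode 'query=cloudflare:site:requests'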

@dannykopping - any thoughts on what else I can try, or how to turn up logging? I figured I'd see an error somewhere on either the Loki or Mimir side if the metrics and alerts weren't being sent or recorded properly. It's very odd.

@mdiorio apologies for the delayed response.

Please send me your rule definition so I can have a look.
The ALERTS_FOR_STATE series is only created for rules that have a for: definition, AFAIK.

@dannykopping

Here are the rules as created in Grafana Alerting (screenshot not included here):

Here are the rules as retrieved from Loki’s /loki/api/v1/rules endpoint:

loki:
    - name: cloudflare
      interval: 1m
      rules:
        - record: cloudflare:site:requests
          expr: sum by (site)(count_over_time({job="cloudflare"} | json | ClientRequestURI!~".*(cdn-cgi|ico|js|jpg|jpeg|ico|tokens|png|webp|cfm|css|webmanifest)" [1m]))
        - alert: TestLokiAlertRule
          expr: sum by(site)(count_over_time({job="cloudflare"} | json | ClientRequestURI!~".*(cdn-cgi|ico|js|jpg|jpeg|ico|tokens|png|webp|cfm|css|webmanifest)"[1m]))
          for: 1m
          annotations:
            summary: Test alert

The alert rule is actually firing for a site, but it never makes it into the Mimir Alertmanager so it can be sent to On Call.

Thanks!

@dannykopping

I’m wondering if it has something to do with tenancy. I don’t use tenancy in Mimir or Loki. The Mimir Nginx config uses:

       # Ensure that X-Scope-OrgID is always present, default to the no_auth_tenant for backwards compatibility when multi-tenancy was turned off.
       map $http_x_scope_orgid $ensured_x_scope_orgid {
         default $http_x_scope_orgid;
         "" "anonymous";
       }

I'm wondering if the writes are just being discarded silently?
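One way to check whether Mimir is dropping anything would be to look at its own discard counters, something like this (assuming Mimir's /metrics are scraped and that the user/reason labels are present in your version):

    sum by (user, reason) (rate(cortex_discarded_samples_total[5m]))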

@dannykopping

This isn’t a great situation. I disabled multi-tenancy in Mimir and now my Loki rules are working. There has to be an easier, more logical way of handling this. Fortunately I don’t need multi-tenancy in Mimir at this time, but it may be in our future. If I need multi-tenancy in Mimir but not in Loki, what happens?

I haven’t had to use remote_write from the ruler yet, but since a lot of the Loki ruler’s implementation is very similar to Prometheus, I wonder if the remote_write configuration also supports headers and basic_auth.


@tonyswumac’s reply is correct; you can configure your tenant using a header when performing remote write.
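Extending the Helm values from earlier, that could look something like this (a sketch; I'm assuming the ruler's remote_write client accepts the same headers field as the Prometheus remote_write config, and that anonymous is the tenant you want to write to):

      rulerConfig:
        remote_write:
          enabled: true
          client:
            url: http://mimir-nginx.mimir.svc:80/api/v1/push
            headers:
              # assumption: write under Mimir's default no-auth tenant
              X-Scope-OrgID: anonymous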


@dannykopping - Thanks. Does it make more sense to make the default no-auth tenant names the same? Why is one fake and the other anonymous?

This has come up a few times. Loki chose fake previously (which I hate, btw, and causes so much confusion), and Mimir chose anonymous. We might change this in Loki v3, but it’s complicated in terms of backwards-compatibility.

It may be easier to update the Mimir gateway to rewrite the fake header value to anonymous? That still doesn’t change fake to something better, though. Personally, I would have gone with “default” as the default tenant :slight_smile:
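A minimal tweak to the map block shown earlier could look like this (an untested sketch; it just maps Loki's fake tenant onto Mimir's anonymous one in addition to the empty-header default):

       map $http_x_scope_orgid $ensured_x_scope_orgid {
         default $http_x_scope_orgid;
         "" "anonymous";
         "fake" "anonymous";
       }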
