Multi-tenant Loki Ruler is not sending Alerts to Mimir AlertManager

edtshuma · July 18, 2024, 12:59pm

I have an AlertRule set up in my Loki-Distributed instance via the Loki ruler as below:

ruler:
      directories:
        fake:
          rules.txt: |
            groups:
              - name: mimir_loki_test
                rules:
                  - alert: LokiAlertsMimir 
                    expr: |
                      sum(rate({node_name="ip-10-XXX-XXX-XXX.ec2.internal"} |~ ".*OOM-killed.*" [5m])) by (node_name)
                    for: 3m
                    labels:
                      severity: critical
                      job: mimir
                    annotations:
                      summary: Loki Alerts with MimirAM - OOM Alert

Before I enabled multi-tenancy on the Loki helm chart the AlertRule was firing as expected and it was visible in the Mimir AlertManager UI. The UI is accessed via an Nginx Reverse proxy that sits in front of the Loki instance.

After enabling multi-tenancy on the Loki-Distributed chart, the AlertRule is no longer visible in Mimir AlertManager, nor is it firing. With multi-tenancy enabled, the relevant part of the Loki-Distributed configuration looks like this:

gateway:
      # -- Specifies whether the gateway should be enabled
      enabled: true
      ingress:
        annotations:
          cert-manager.io/cluster-issuer: letsencrypt-prod
          external-dns.alpha.kubernetes.io/hostname: loki.${sre_domain}
          nginx.ingress.kubernetes.io/auth-secret: loki-basic-auth
          nginx.ingress.kubernetes.io/auth-type: basic
        enabled: true
        hosts:
          - host: loki.${sre_domain}
            paths:
              - path: /
                pathType: Prefix
        ingressClassName: nginx
        tls:
          - hosts:
              - loki.${sre_domain}
            secretName: loki.${sre_domain}-tls
      nginxConfig:
        httpSnippet: |-
          client_max_body_size 100M;
          proxy_connect_timeout       900;
          proxy_send_timeout          900;
          proxy_read_timeout          900;
          send_timeout                900;
        serverSnippet: |-
          client_max_body_size 100M;
          location ~ /loki/api/v1/rules/ { proxy_pass  http://loki-ruler.monitoring.svc.cluster.local:3100$request_uri; }
loki:
      config: |
        auth_enabled: true
        ruler:
          alertmanager_url: http://mimir-alertmanager.monitoring:8080/mimir-alertmanager
          external_url: https://monitoring.${sre_domain}/mimir-alertmanager
          enable_alertmanager_v2: true
          enable_api: true
          ring:
            kvstore:
              store: memberlist
          rule_path: /tmp/loki/scratch
          storage:
            local:
              directory: /etc/loki/rules
            type: local

In the Grafana helm chart I added the following for the Loki Datasource:

datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
          - access: proxy
            editable: true
            jsonData:
              alertmanagerUid: ORAMGR
              derivedFields:
                - datasourceUid: ORTEMPO
                  matcherRegex: '"traceId": "([^"]*)"'
                  name: traceId -> Tempo
                  url: '$${__value.raw}'
              httpHeaderName1: 'X-Scope-OrgID'
              maxLines: 1500
              timeout: "120"
            name: Logs
            secureJsonData:
              httpHeaderValue1: 'fake'
            type: loki
            uid: ORLOKI
            url: http://loki-gateway.monitoring

If I check the Loki-Ruler pod, I can see that the AlertRule is being evaluated:

level=info ts=2024-07-18T08:06:46.114021053Z caller=compat.go:66 user=fake rule_name="LokiAlertsMimir" rule_type=alerting query="sum by (node_name)(rate({node_name=\"ip-10-XXX-XXX-XXX.ec2.internal\"} |~ \".*OOM-killed.*\"[5m]))" query_hash=2362249182 msg="evaluating rule"
level=info ts=2024-07-18T08:06:46.114066624Z caller=engine.go:232 component=ruler evaluation_mode=local org_id=fake msg="executing query" type=instant query="sum by (node_name)(rate({node_name=\"ip-10-XXX-XXX-XXX.ec2.internal\"} |~ \".*OOM-killed.*\"[5m]))" query_hash=2362249182
level=info ts=2024-07-18T08:06:46.115572328Z caller=metrics.go:160 component=ruler evaluation_mode=local org_id=fake latency=fast query="sum by (node_name)(rate({node_name=\"ip-10-XXX-XXX-XXX.ec2.internal\"} |~ \".*OOM-killed.*\"[5m]))" query_hash=2362249182 query_type=metric range_type=instant length=0s start_delta=2.679478ms end_delta=2.679618ms step=0s duration=1.433063ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B total_bytes_structured_metadata=0B lines_per_second=0 total_lines=0 post_filter_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 chunk_refs_fetch_time=1.033714ms cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s
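
The metrics.go line above carries the query statistics for that evaluation. As a minimal sketch, the fields can be pulled out with Python's standard library to check whether the query read any log lines. The helper name parse_logfmt is hypothetical, and the sample line is shortened from the log quoted above. In that log, total_lines=0 and total_bytes=0B, which suggests (but does not prove) that this particular evaluation matched no data for tenant fake.

```python
import shlex


def parse_logfmt(line: str) -> dict[str, str]:
    """Split a logfmt line into key/value pairs.

    Quoted values may contain spaces and escaped double quotes.
    Assumption: the line contains no unbalanced single quotes.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that have no '='
            fields[key] = value
    return fields


# Shortened sample of the ruler's metrics.go line quoted above.
sample = (
    'level=info caller=metrics.go:160 component=ruler org_id=fake '
    'query="sum by (node_name)(rate({node_name=\\"ip-10-XXX-XXX-XXX.ec2.internal\\"} '
    '|~ \\".*OOM-killed.*\\"[5m]))" status=200 '
    'total_lines=0 total_bytes=0B total_entries=0'
)

stats = parse_logfmt(sample)
for key in ("org_id", "status", "total_lines", "total_bytes", "total_entries"):
    print(f"{key}={stats[key]}")
```

Running the same check against a line logged before multi-tenancy was enabled (if one is still available) would show whether the query used to read data under the same tenant.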

If I check the Mimir-AlertManager pod, I see nothing related to the AlertRule, not even errors:

ts=2024-07-18T08:04:22.457955222Z caller=multitenant.go:546 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
ts=2024-07-18T08:04:37.457257717Z caller=multitenant.go:546 level=info component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"

It appears the AlertManager is completely ignoring the AlertRule. The receiver configuration in the AlertManager is correct, and it was validated before I turned on multi-tenancy.

There is a comment here that says that in multi-tenancy mode “Loki forwards the X-Scope-OrgID header”. I am not sure exactly what this means. Do I need to set the X-Scope-OrgID header for requests to the Mimir AlertManager in this case, or is it already set on each POST call to the Mimir AlertManager URL?

Additional information
Loki-Distributed Chart: 0.79.1
Mimir-Distributed Chart: 5.4.0
Grafana Chart: 8.3.4

Do I need to set the X-Scope-OrgID header during the “remote push” operation of the AlertRule by the Loki Ruler?

Or is there a way I can check whether my Mimir AlertManager is tenant-aware?

How do I check whether the Loki ruler is sending the tenant ID to Mimir AlertManager, if at all?

What am I missing?
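
One way to see what the ruler actually sends is to temporarily point alertmanager_url at a small HTTP server that records the request headers. The following is a minimal sketch using only Python's standard library. The handler name, the SEEN list, and port 9093 are assumptions made for illustration. It is not a real Alertmanager; it answers every POST with 200 and records whether an X-Scope-OrgID header was present.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

SEEN = []  # (path, tenant header) for each POST received


class HeaderDumpHandler(BaseHTTPRequestHandler):
    """Accept any POST and record the X-Scope-OrgID header, if present."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        tenant = self.headers.get("X-Scope-OrgID")  # None if the header is absent
        SEEN.append((self.path, tenant))
        print(f"POST {self.path} X-Scope-OrgID={tenant!r} body_bytes={len(body)}")
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence the default access log; the print above is enough


def make_server(port: int = 9093) -> HTTPServer:
    """Create the server; port 9093 is an arbitrary choice for this sketch."""
    return HTTPServer(("0.0.0.0", port), HeaderDumpHandler)
```

Calling make_server().serve_forever() starts the listener. If the ruler is dispatching alerts at all, each notification should show up as one printed line, with or without a tenant value.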

tonyswumac · July 18, 2024, 8:33pm

Try disabling add_org_id_header and see what happens (see Grafana Loki configuration parameters | Grafana Loki documentation).

Personally I use a standalone alertmanager with simple auth, so I’ve never tried this before. But if I had to guess I’d say perhaps Loki ruler is adding the organization ID header and your mimir cluster doesn’t recognize the org ID (if your mimir cluster also has auth enabled).
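
To test that guess without involving the ruler, a hand-made alert can be posted to the Alertmanager v2 alerts endpoint with and without the tenant header. This is a minimal sketch using Python's standard library. The function names and the alert labels are hypothetical, and it assumes the endpoint lives at the same base URL as alertmanager_url plus /api/v2/alerts.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone


def build_test_alert_request(base_url, tenant=None):
    """Build a POST to <base_url>/api/v2/alerts, optionally with a tenant header."""
    alerts = [{
        # Hypothetical labels, chosen only for this manual test.
        "labels": {"alertname": "ManualTestAlert", "severity": "critical"},
        "annotations": {"summary": "Manual test alert"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]
    headers = {"Content-Type": "application/json"}
    if tenant is not None:
        headers["X-Scope-OrgID"] = tenant
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers=headers,
        method="POST",
    )


def send(request):
    """Send the request and return the HTTP status code, including error statuses."""
    try:
        with urllib.request.urlopen(request, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

For example, send(build_test_alert_request("http://mimir-alertmanager.monitoring:8080/mimir-alertmanager", tenant="fake")) can be compared with the same call using tenant=None. Differing status codes would indicate how the Mimir side treats the header.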

edtshuma · July 19, 2024, 7:48am

I can see that add_org_id_header appears under -ruler.remote-write.add-org-id-header.

Can you confirm that the alertmanager_url is also treated as a Prometheus remote-write endpoint of some sort in this case?

I am curious, though, about the potential effects of disabling add-org-id-header, especially since I have enabled X-Scope-OrgID on the Grafana helm chart.

tonyswumac · July 19, 2024, 7:55am

Not quite sure what you mean, but if you are asking whether you can use alertmanager_url to send alerts to a Prometheus-style Alertmanager, then yes.

edtshuma · July 19, 2024, 8:06am

Yes, that's what I mean. The Mimir AlertManager that my Loki Ruler is sending alerts to is a component of a Mimir-Distributed helm chart deployment:

apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: mimir-distributed
  namespace: monitoring
spec:
  chart:
    spec:
      chart: 3rdparty/grafana/mimir-distributed
      sourceRef:
        kind: HelmRepository
        name: infra1
        namespace: flux-system
      version: '5.4.0'
  dependsOn:
    - name: prometheus-operator-crds
      namespace: monitoring
  values:
    alertmanager:
      enabled: true
      fallbackConfig: |
        global:
          pagerduty_url: https://events.pagerduty.com/v2/enqueue
          resolve_timeout: 5m
....
...

My question is: if we disable add_org_id_header, which appears under the -ruler.remote-write portion of the Loki Ruler configuration, does that also stop the Loki Ruler from adding the organization ID header to alert payloads sent to Mimir AlertManager?

The Mimir AlertManager is Prometheus style btw.

edtshuma · July 19, 2024, 8:10am

It's a bit confusing. I would have expected to see a parameter for disabling that directly under the declaration of alertmanager_url, as below:

loki:
      config: |
        auth_enabled: true
        ruler:
          alertmanager_url: http://mimir-alertmanager.monitoring:8080/mimir-alertmanager
               [something-here]
          external_url: https://monitoring.${sre_domain}/mimir-alertmanager

zktaiga · August 16, 2024, 3:38pm

I’m in the same situation except I was never running single tenant to start with. Alerts just won’t fire to AlertManager at all, but the rule is definitely being evaluated. Even logs in DEBUG mode say nothing. Did you ever find the culprit / a workaround, or is switching to single-tenant my only option?

FWIW I’ve even tried to match the tenant between Mimir and Loki.
