ERROR:
level=error ts=2024-06-12T08:36:52.339930645Z caller=compat.go:78 user=myrules rule_name=GongjinLogsQingfenErrorCount rule_type=alerting query="(sum by (app)(count_over_time({project=\"xxx\", type=\"svclog\", app=~\"xxx\"} |= \"ERROR\" !~ \"INFO|客服\"[10m])) >= 1)" query_hash=1398603592 msg="rule evaluation failed" err="wrong chunk metadata"
level=warn ts=2024-06-12T08:36:52.339951352Z caller=manager.go:663 user=myrules file=/tmp/loki/scratch/myrules/alert_rules1.txt group=gongjin_error_logs name=GongjinLogsQingfenErrorCount index=0 msg="Evaluating rule failed" rule="alert: GongjinLogsQingfenErrorCount\nexpr: (sum by (app)(count_over_time({project=\"xxx\", type=\"svclog\", app=~\"xxx\"}\n |= \"ERROR\"[10m])) >= 1)\nlabels:\n alertype: logs\n cluster: \nannotations:\n description: '详细日志请查看: http://xxx.cn/d/ed223mvss4rcgb/6Iiq5L-hLeeUn-S6py3kuJrliqHml6Xlv5c?orgId=1&var-app=xxx&var-app=xxx&var-app=xxx&var-app=xxx&var-app=xxx&var-searchable_pattern=ERROR'\n summary: {{ $labels.app }} 服务 10 分钟内出现 {{ $value }} 次 `ERROR` 日志,请注意!\n" err="rule evaluation failed: wrong chunk metadata"
I have a ruler component, and there is some probability that the above error messages occur.
I used it to implement an alert, and I have indeed received alerts,
but I feel that the number of alerts differs from the results of my own manual Grafana queries.
I don't know whether this error is the cause. How can I troubleshoot and resolve it?
How are you sending alerts?
Most alerting solutions nowadays have some sort of deduplication. That could be it. Without more info it's hard to say.
I am alerting through Alertmanager, and the Loki configuration is as follows:
alertmanager_url: http://alertmanager-main.monitoring:9093
external_url: http://alertmanager-main.monitoring:9093
enable_api: true
The Alertmanager image version is as follows:
quay.io/prometheus/alertmanager:v0.26.0
Alertmanager itself does not have any error logs.
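For context, these three settings sit under the ruler block in my Loki config; a minimal sketch (the evaluation_interval shown here is only the illustrative default, I have not set it explicitly):
ruler:
  alertmanager_url: http://alertmanager-main.monitoring:9093
  external_url: http://alertmanager-main.monitoring:9093
  enable_api: true
  evaluation_interval: 1m   # illustrative default only, not explicitly set in my config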
Can you provide some log examples, and your ruler configurations?
Also, when you said you feel that the number of alerts is different, how is it different? Are you getting fewer or more alerts than you think you should? I'd like some concrete examples if you are able to provide any.
My Helm chart values.yaml:
# -- Directories containing rules files
directories:
  myrules:
    alert_rules1.txt: |
      groups:
        - name: example_error_logs
          rules:
            - alert: exampleLogsQingfenErrorCount
              expr: sum by(app) (count_over_time({project="k8s-prod", type="svclog", app=~"app1|app2|app3|app4|app5"} |= `ERROR` !~ `INFO|测试` [10m])) >= 1
              for: 0
              labels:
                alertype: logs
                cluster: 集群生产平台
              annotations:
                summary: "集群生产 {{ $labels.app }} 服务 10 分钟内出现 {{ $value }} 次 `ERROR` !~ `INFO|测试` 日志,请注意!"
                description: "详细日志请查看: http://logs.internal.cn/d/searchable_pattern=ERROR"
    record_rules1.txt: |
      groups:
        - name: exampleErrorRules
          interval: 5m
          rules:
            - record: example:error:count_over_time10m
              expr: count_over_time({project="k8s-prod", type="svclog", app=~"app1|app2|app3|app4|app5"} |= `ERROR` !~ `INFO|测试` [10m])
              labels:
                cluster: "集群生产"
Example log:
2024-06-18 09:05:03.046 [traceId:54488818918647de7cb24350a78b0a22] [TID: N/A] ERROR 1 --- [ Thread-13816] [doBcmCapitalChange][283]: [transId:HX8207BCM849121107661940326420240618][bankCode:BCM][interfaceNo:HX8207][明细查询失败:R(code=400, success=false, data=服务发生异常!Read timed out, msg=业务异常, traceId=b82da1b5c295a191997a7b35c6572531, bizCode=815, bizMsg=三方服务失败)][subAccountNo:1100609290188000473311375]
There are fewer alerts than I expected.
The result of my manual query is > 1, but I didn't receive any alerts.
My Alertmanager itself is working normally.
Alertmanager has its own deduplication mechanism. It groups alerts by labels, so if you have rules that send it alerts with the same labels, you won't get additional alerts.
You can try disabling Alertmanager's alert grouping and see if this gives you more alerts:
route:
  group_by: ['...'] # Disable alert grouping
  ...<other config>
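For illustration only, a minimal route block with grouping effectively disabled could look like this (the receiver name and webhook URL are placeholders, not from your setup):
route:
  receiver: default
  group_by: ['...']   # the special value '...' groups by all labels, effectively disabling aggregation
  group_wait: 0s      # notify immediately instead of waiting to batch alerts
  repeat_interval: 4h
receivers:
  - name: default
    webhook_configs:
      - url: http://example.internal/alert-webhook   # placeholder receiver endpoint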
Understood, I can modify it and test.
But I would still like to know why the ruler component reported the error msg="Evaluating rule failed" err="rule evaluation failed: wrong chunk metadata".
When I run the same LogQL query manually there is no problem. Does the ruler component query in some special way?
I've seen that error before when the rule file isn't placed in the right place.
If you are using the filesystem for rule files, you want to make sure they are placed at the right directory path (I believe they need to be under the directory named after the org ID; if you don't have auth_enabled set to true, then the default org ID is fake).
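As a rough sketch of the layout I mean (all paths here are illustrative, not your exact config):
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules        # illustrative; rule files go under <directory>/<org id>/
  rule_path: /tmp/loki/rules-temp   # illustrative scratch path used during evaluation
# With auth_enabled: false the org ID is "fake", so the ruler would expect e.g.
#   /loki/rules/fake/alert_rules1.txt
# A file under /loki/rules/myrules/ is loaded for tenant "myrules" instead,
# which I believe won't match the tenant your log data was written under.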
Based on what you said, I will change the ruler configuration to the following:
# -- Directories containing rules files
directories:
  fake:
    alert_rules1.txt: |
      groups:
        - name: k8s_error_logs
          rules:
I changed the directory name to fake, and after restarting the ruler service the error disappeared as expected.
The problem has been solved. Thank you very much, Tony.