ERROR:
level=error ts=2024-06-12T08:36:52.339930645Z caller=compat.go:78 user=myrules rule_name=GongjinLogsQingfenErrorCount rule_type=alerting query="(sum by (app)(count_over_time({project=\"xxx\", type=\"svclog\", app=~\"xxx\"} |= \"ERROR\" !~ \"INFO|客服\"[10m])) >= 1)" query_hash=1398603592 msg="rule evaluation failed" err="wrong chunk metadata"
level=warn ts=2024-06-12T08:36:52.339951352Z caller=manager.go:663 user=myrules file=/tmp/loki/scratch/myrules/alert_rules1.txt group=gongjin_error_logs name=GongjinLogsQingfenErrorCount index=0 msg="Evaluating rule failed" rule="alert: GongjinLogsQingfenErrorCount\nexpr: (sum by (app)(count_over_time({project=\"xxx\", type=\"svclog\", app=~\"xxx\"}\n |= \"ERROR\"[10m])) >= 1)\nlabels:\n alertype: logs\n cluster: \nannotations:\n description: '详细日志请查看: http://xxx.cn/d/ed223mvss4rcgb/6Iiq5L-hLeeUn-S6py3kuJrliqHml6Xlv5c?orgId=1&var-app=xxx&var-app=xxx&var-app=xxx&var-app=xxx&var-app=xxx&var-searchable_pattern=ERROR'\n summary: {{ $labels.app }} 服务 10 分钟内出现 {{ $value }} 次 `ERROR` 日志,请注意!\n" err="rule evaluation failed: wrong chunk metadata"
I have a ruler component, and there is some probability that the above error messages occur.
I used it to implement an alert, and I have indeed received alerts,
but I feel that the number of alerts differs from the results of my own manual Grafana queries.
I don't know whether this error is the cause. How can I troubleshoot and resolve it?
How are you sending alerts?
Most alerting solutions nowadays have some sort of deduplication. That could be it. Without more info it's hard to say.
I am alerting through Alertmanager, and the Loki configuration is as follows:
alertmanager_url: http://alertmanager-main.monitoring:9093
external_url: http://alertmanager-main.monitoring:9093
enable_api: true
The Alertmanager image version is as follows:
quay.io/prometheus/alertmanager:v0.26.0
Alertmanager itself does not have any error logs.
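For context, these three settings sit under the ruler block in my Loki config; a minimal sketch (the evaluation_interval shown here is only the illustrative default, I have not set it explicitly):
ruler:
  alertmanager_url: http://alertmanager-main.monitoring:9093
  external_url: http://alertmanager-main.monitoring:9093
  enable_api: true
  evaluation_interval: 1m   # illustrative default only, not explicitly set in my config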
Can you provide some log examples, and your ruler configurations?
Also, when you said you feel that the number of alerts is different, how is it different? Are you getting fewer or more alerts than you think you should? I'd like some concrete examples if you are able to provide any.
My Helm chart values.yaml:
# -- Directories containing rules files
directories:
  myrules:
    alert_rules1.txt: |
      groups:
        - name: example_error_logs
          rules:
            - alert: exampleLogsQingfenErrorCount
              expr: sum by(app) (count_over_time({project="k8s-prod", type="svclog", app=~"app1|app2|app3|app4|app5"} |= `ERROR` !~ `INFO|测试` [10m])) >= 1
              for: 0
              labels:
                alertype: logs
                cluster: 集群生产平台
              annotations:
                summary: "集群生产 {{ $labels.app }} 服务 10 分钟内出现 {{ $value }} 次 `ERROR` !~ `INFO|测试` 日志,请注意!"
                description: "详细日志请查看: http://logs.internal.cn/d/searchable_pattern=ERROR"
    record_rules1.txt: |
      groups:
        - name: exampleErrorRules
          interval: 5m
          rules:
            - record: example:error:count_over_time10m
              expr: count_over_time({project="k8s-prod", type="svclog", app=~"app1|app2|app3|app4|app5"} |= `ERROR` !~ `INFO|测试` [10m])
              labels:
                cluster: "集群生产"
Example log:
2024-06-18 09:05:03.046 [traceId:54488818918647de7cb24350a78b0a22] [TID: N/A] ERROR 1 --- [ Thread-13816] [doBcmCapitalChange][283]: [transId:HX8207BCM849121107661940326420240618][bankCode:BCM][interfaceNo:HX8207][明细查询失败:R(code=400, success=false, data=服务发生异常!Read timed out, msg=业务异常, traceId=b82da1b5c295a191997a7b35c6572531, bizCode=815, bizMsg=三方服务失败)][subAccountNo:1100609290188000473311375]
There are fewer alerts than I expected.
The result of my manual query is > 1, but I didn't receive any alerts.
My Alertmanager itself is working normally.
Alertmanager has its own deduplication mechanism. It groups alerts by labels, so if you have rules that send it alerts with the same labels, you won't get additional alerts.
You can try disabling Alertmanager's alert grouping and see if this gives you more alerts:
route:
  group_by: ['...'] # Disable alert grouping
  ...<other config>
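For illustration only, a minimal route block with grouping effectively disabled could look like this (the receiver name and webhook URL are placeholders, not from your setup):
route:
  receiver: default
  group_by: ['...']   # the special value '...' groups by all labels, effectively disabling aggregation
  group_wait: 0s      # notify immediately instead of waiting to batch alerts
  repeat_interval: 4h
receivers:
  - name: default
    webhook_configs:
      - url: http://example.internal/alert-webhook   # placeholder receiver endpoint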
Understood, I can modify it and test.
But I would still like to know why the ruler component reported the error msg="Evaluating rule failed" err="rule evaluation failed: wrong chunk metadata".
When I run the same LogQL query manually there is no problem. Does the ruler component query in some special way?
I've seen that error before when the rule file isn't placed in the right place.
If you are using the filesystem for rule files, you want to make sure they are placed at the right directory path (I believe they need to be under the directory named after the org ID; if you don't have auth_enabled set to true, then the default org ID is fake).
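As a rough sketch of the layout I mean (all paths here are illustrative, not your exact config):
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules        # illustrative; rule files go under <directory>/<org id>/
  rule_path: /tmp/loki/rules-temp   # illustrative scratch path used during evaluation
# With auth_enabled: false the org ID is "fake", so the ruler would expect e.g.
#   /loki/rules/fake/alert_rules1.txt
# A file under /loki/rules/myrules/ is loaded for tenant "myrules" instead,
# which I believe won't match the tenant your log data was written under.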
Based on what you said, I will change the ruler configuration to the following:
# -- Directories containing rules files
directories:
  fake:
    alert_rules1.txt: |
      groups:
        - name: k8s_error_logs
          rules:
I changed the directory name to fake, and after restarting the ruler service the error disappeared as expected.
The problem has been solved. Thank you very much, Tony.