Hello,
I have a Loki / Alertmanager setup running which works almost fine. There is just one small issue that I'm not able to fix. It is the following scenario.
I have a Loki alerting rule like the one below.
groups:
  - name: alerting
    interval: 60s
    rules:
      - alert: error
        expr: |
          sum by (group,log)(
            count_over_time(
              {level="ERROR"}
              # extract the entire log line as a label
              | regexp `(?P<log>(?s).+)`
              [60m]
            )
          )
          > 0.0
        for: 0s
If a log entry with “level” = “ERROR” arrives, a new alert is triggered immediately and is visible in Alertmanager. So far, so good.
In my Loki logs I can also see that “total_bytes=” is always bigger than 0B, which means there is a result. Then, after 5 minutes, the alert is gone. In my Loki logs I can see that “total_bytes=” is now zero bytes, which means there is no result anymore. If I copy the query from the Loki logs and run it manually in the Grafana Explore view, it looks fine too; there is the correct result. My guess is that it is somehow related to “range_type=instant”, which is also shown in my Loki logs, and that this overrides the “60m” range definition from the query itself.
By the way, the resolve_timeout in alertmanager.yml is 60m too.
The main idea is that a log-error-related alert should be visible in Alertmanager for at least one hour.
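For reference, the rule expression written as a single line, as it can be pasted into the Grafana Explore view (equivalent to the query that appears in the Loki logs below):

sum by (group,log)(count_over_time({level="ERROR"} | regexp `(?P<log>(?s).+)` [60m])) > 0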
Below, you can find the related log entries.
Okt 15 13:19:44 loki[1531]: level=info ts=2021-10-15T11:19:44.907635395Z caller=metrics.go:92 org_id=default traceID=01d4f00a1b1c022e latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.096544ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B
Okt 15 13:20:44 loki[1531]: level=info ts=2021-10-15T11:20:44.907727873Z caller=metrics.go:92 org_id=default traceID=47099432b44aeef9 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.163781ms status=200 limit=0 returned_lines=0 throughput=52kB total_bytes=60B
Okt 15 13:21:44 loki[1531]: level=info ts=2021-10-15T11:21:44.908689801Z caller=metrics.go:92 org_id=default traceID=44289286cc3d63a0 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.251504ms status=200 limit=0 returned_lines=0 throughput=48kB total_bytes=60B
Okt 15 13:22:44 loki[1531]: level=info ts=2021-10-15T11:22:44.908729988Z caller=metrics.go:92 org_id=default traceID=1653decfb62e8469 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.309034ms status=200 limit=0 returned_lines=0 throughput=46kB total_bytes=60B
Okt 15 13:23:44 loki[1531]: level=info ts=2021-10-15T11:23:44.90837527Z caller=metrics.go:92 org_id=default traceID=76d6f7b4c662abeb latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.520726ms status=200 limit=0 returned_lines=0 throughput=40kB total_bytes=60B
Okt 15 13:24:44 loki[1531]: level=info ts=2021-10-15T11:24:44.908638487Z caller=metrics.go:92 org_id=default traceID=06cc7835a9fdc881 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.186319ms status=200 limit=0 returned_lines=0 throughput=51kB total_bytes=60B
Okt 15 13:25:44 loki[1531]: level=info ts=2021-10-15T11:25:44.908343946Z caller=metrics.go:92 org_id=default traceID=0d99c5cd76878fd5 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.378795ms status=200 limit=0 returned_lines=0 throughput=58kB total_bytes=80B
Okt 15 13:26:44 loki[1531]: level=info ts=2021-10-15T11:26:44.908551422Z caller=metrics.go:92 org_id=default traceID=6b6e5aa1bb862f34 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.140961ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B
Hi @holgerhees
My guess is that it is somehow related to “range_type=instant”, which is also shown in my Loki logs, and that this overrides the “60m” range definition from the query itself.
No, that’s not the case; instant queries do take that time range into account, so effectively they start at the evaluation instant and do a count_over_time over the last 60m.
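To put that in terms of your rule, here is the same rule annotated with what each timing-related field does (the comments are mine, purely for illustration):

groups:
  - name: alerting
    interval: 60s              # the ruler evaluates the expr as an instant query once per minute
    rules:
      - alert: error
        expr: |
          sum by (group,log)(
            count_over_time(
              {level="ERROR"}
              | regexp `(?P<log>(?s).+)`
              [60m]            # each evaluation counts matching lines from the last 60 minutes
            )
          )
          > 0.0
        for: 0s                # the alert fires as soon as the expression is above 0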
Out of interest, can you run that query over a 1H range in Grafana and show the results please?
Also please send your Loki version and config.
Here is a new scenario, where I took a screenshot directly after the alert was gone.
Okt 15 14:45:44 loki[1531]: level=info ts=2021-10-15T12:45:44.907757977Z caller=metrics.go:92 org_id=default traceID=7a73fbdba8b0df5e latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.217731ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B
Okt 15 14:46:44 loki[1531]: level=info ts=2021-10-15T12:46:44.908212222Z caller=metrics.go:92 org_id=default traceID=440c1e6c1cc966d5 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.663494ms status=200 limit=0 returned_lines=0 throughput=36kB total_bytes=60B
Okt 15 14:47:44 loki[1531]: level=info ts=2021-10-15T12:47:44.908858766Z caller=metrics.go:92 org_id=default traceID=0bbfb276861be435 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.924627ms status=200 limit=0 returned_lines=0 throughput=31kB total_bytes=60B
Okt 15 14:48:44 loki[1531]: level=info ts=2021-10-15T12:48:44.907779184Z caller=metrics.go:92 org_id=default traceID=11b9146b62304b9f latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.240592ms status=200 limit=0 returned_lines=0 throughput=48kB total_bytes=60B
Okt 15 14:49:44 loki[1531]: level=info ts=2021-10-15T12:49:44.908712989Z caller=metrics.go:92 org_id=default traceID=656626a9277c9e36 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.691607ms status=200 limit=0 returned_lines=0 throughput=36kB total_bytes=60B
Okt 15 14:50:44 loki[1531]: level=info ts=2021-10-15T12:50:44.908702719Z caller=metrics.go:92 org_id=default traceID=0c3e124a2748020e latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.345019ms status=200 limit=0 returned_lines=0 throughput=45kB total_bytes=60B
Okt 15 14:51:44 loki[1531]: level=info ts=2021-10-15T12:51:44.908096427Z caller=metrics.go:92 org_id=default traceID=0e585ea19d0093bc latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.275716ms status=200 limit=0 returned_lines=0 throughput=63kB total_bytes=80B
Okt 15 14:52:44 loki[1531]: level=info ts=2021-10-15T12:52:44.908059088Z caller=metrics.go:92 org_id=default traceID=2f026112f61b7825 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.155692ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B
My Loki version is 2.3.0, and this is my config:
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /loki/index
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: true
  retention_period: 672h

ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules
  rule_path: /tmp
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
  enable_alertmanager_v2: true
Checked in with a colleague who implemented alerting in Loki; neither of us is sure what’s going on here.
Are your rulers restarting at all?
I’m not sure what you mean, but to activate every config change I restart the whole Loki container, just to be sure.
Is there any additional debugging I could do that would help you?
Or, if you mean whether my rules are still working: yes, each new log event still triggers an alert, which is displayed for exactly 5 minutes.
I meant: are your rulers restarting by themselves during the time that you expect this alert to still be active?
Can you please send your alertmanager config as well?
No, the rulers have not restarted by themselves. At least there are no log entries that would hint in that direction.
And my Alertmanager config looks like this:
global:
  smtp_smarthost: postfix:25
  smtp_require_tls: False
  smtp_from: holger.hees@xxx.xxx
  resolve_timeout: 60m

route:
  group_by:
    - group
    - alertname
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 60m
  receiver: email-receiver
  routes:
    - matchers:
        - group="netdata"
      receiver: dummy
    - matchers:
        - group="crony"
      receiver: dummy
    - matchers:
        - group="php"
        - log=~".*oops, unknown child \([0-9]*\) exited with code 127.*"
      receiver: dummy
    - matchers:
        - group="apache"
        - log=~".*/grafana/api/live/ws.*Mobile.*"
      receiver: dummy

receivers:
  - name: dummy
  - name: email-receiver
    email_configs:
      - to: holger.hees@xxx.xxx
        headers:
          Reply-To: 'holger.hees@xxx.xxx'
          subject: '{{ .Status | toUpper }} • {{ .GroupLabels.SortedPairs.Values | join " • " }}{{ if eq .Status "firing" }} • {{ .Alerts.Firing | len }} time(s){{ end }}'
I also get a lot of messages like “msg=“Rule evaluation result discarded” err=unimplemented”, but based on the following post, I ignore all of them:
I’ve got a test alert rule set up, which in the logs shows to be firing, but it is giving me the error msg=“Rule evaluation result discarded” err=unimplemented
No other logs are output to determine what the problem is.
I have now configured a test case where I ran Loki with log.level=debug, but my log file is too big to be included in this post.
If you want I can send it to you via email.
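For reference, debug logging can be enabled either in the server block of the Loki config or via the -log.level=debug command-line flag; a sketch of the config variant:

server:
  http_listen_port: 3100
  log_level: debug   # default is info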
It looks like it is related to the “chunk_idle_period” setting.
If I change chunk_idle_period from 5m to 60m, everything works.
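For anyone running into the same thing, the change is in the ingester block of the config posted above; a sketch of just the relevant line:

ingester:
  chunk_idle_period: 60m   # was 5m; with 5m the rule's instant query returned no result after a few minutes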
Hi @holgerhees
I’m dealing with the same issue.
Can you confirm that setting chunk_idle_period to 60m resolves the problem?
Thanks.
Hi, I have not investigated it any further. I’m currently living with the situation, because I just need an alert trigger that sends an email for every log message.