Alert disapears after 5 minutes

Hello,

I have a loki / altermanager setup running which works almost fine. I have just one tiny issues where I’m not able to fix it. It is the following scenario.

I have a loki alert manager rule like below.

groups:
  - name: alerting
    interval: 60s
    rules:
      - alert: error
        expr: |
            sum by (group,log)(
              count_over_time(
                {level="ERROR"}
                # extract the entire log line as a label
                | regexp `(?P<log>(?s).+)`
                [60m]
              )
            )
            > 0.0
        for: 0s

If now an log entry arrives with “level” = “ERROR”, a new alertmanager event is triggered immediately and is visible in alertmanager. So far, so good.

In my loki logs I can also see that the “total_bytes=” are always bigger then 0B which means the is a result. Now, after 5 minutes, the alert is gone. In my loki logs I can see that “total_bytes=” is zero bytes now, which means there is no result anymore. If I copy the query from loki logs and run it manually in grafana explorer it looks fine too. There is the correct result. My guess is that it is somehow related to “range_type=instant” which is also shown in my loki logs and this is overwriting my range definition “60m” from the query itself.

btw. the resolve_timeout in alertmanager.yml is 60m too.

The main idea is that an log error related alert should be visible in alertmanager at least for one hour.

below, you can fine the related log entries.

Okt 15 13:19:44 loki[1531]: level=info ts=2021-10-15T11:19:44.907635395Z caller=metrics.go:92 org_id=default traceID=01d4f00a1b1c022e latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.096544ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B
Okt 15 13:20:44 loki[1531]: level=info ts=2021-10-15T11:20:44.907727873Z caller=metrics.go:92 org_id=default traceID=47099432b44aeef9 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.163781ms status=200 limit=0 returned_lines=0 throughput=52kB total_bytes=60B
Okt 15 13:21:44 loki[1531]: level=info ts=2021-10-15T11:21:44.908689801Z caller=metrics.go:92 org_id=default traceID=44289286cc3d63a0 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.251504ms status=200 limit=0 returned_lines=0 throughput=48kB total_bytes=60B
Okt 15 13:22:44 loki[1531]: level=info ts=2021-10-15T11:22:44.908729988Z caller=metrics.go:92 org_id=default traceID=1653decfb62e8469 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.309034ms status=200 limit=0 returned_lines=0 throughput=46kB total_bytes=60B
Okt 15 13:23:44 loki[1531]: level=info ts=2021-10-15T11:23:44.90837527Z caller=metrics.go:92 org_id=default traceID=76d6f7b4c662abeb latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.520726ms status=200 limit=0 returned_lines=0 throughput=40kB total_bytes=60B
Okt 15 13:24:44 loki[1531]: level=info ts=2021-10-15T11:24:44.908638487Z caller=metrics.go:92 org_id=default traceID=06cc7835a9fdc881 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.186319ms status=200 limit=0 returned_lines=0 throughput=51kB total_bytes=60B
Okt 15 13:25:44 loki[1531]: level=info ts=2021-10-15T11:25:44.908343946Z caller=metrics.go:92 org_id=default traceID=0d99c5cd76878fd5 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.378795ms status=200 limit=0 returned_lines=0 throughput=58kB total_bytes=80B
Okt 15 13:26:44 loki[1531]: level=info ts=2021-10-15T11:26:44.908551422Z caller=metrics.go:92 org_id=default traceID=6b6e5aa1bb862f34 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.140961ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B

Hi @holgerhees

My guess is that it is somehow related to “range_type=instant” which is also shown in my loki logs and this is overwriting my range definition “60m” from the query itself.

No, that’s not the case; instant queries will take that time range into account, so effectively they will start from this instant and do a count_over_time for the last 60m.

Out of interest, can you run that query over a 1H range in Grafana and show the results please?

Also please send your Loki version and config.

here comes a new scenario where I took the screenshot directly after the alert was gone

Okt 15 14:45:44 loki[1531]: level=info ts=2021-10-15T12:45:44.907757977Z caller=metrics.go:92 org_id=default traceID=7a73fbdba8b0df5e latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.217731ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B
Okt 15 14:46:44 loki[1531]: level=info ts=2021-10-15T12:46:44.908212222Z caller=metrics.go:92 org_id=default traceID=440c1e6c1cc966d5 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.663494ms status=200 limit=0 returned_lines=0 throughput=36kB total_bytes=60B
Okt 15 14:47:44 loki[1531]: level=info ts=2021-10-15T12:47:44.908858766Z caller=metrics.go:92 org_id=default traceID=0bbfb276861be435 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.924627ms status=200 limit=0 returned_lines=0 throughput=31kB total_bytes=60B
Okt 15 14:48:44 loki[1531]: level=info ts=2021-10-15T12:48:44.907779184Z caller=metrics.go:92 org_id=default traceID=11b9146b62304b9f latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.240592ms status=200 limit=0 returned_lines=0 throughput=48kB total_bytes=60B
Okt 15 14:49:44 loki[1531]: level=info ts=2021-10-15T12:49:44.908712989Z caller=metrics.go:92 org_id=default traceID=656626a9277c9e36 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.691607ms status=200 limit=0 returned_lines=0 throughput=36kB total_bytes=60B
Okt 15 14:50:44 loki[1531]: level=info ts=2021-10-15T12:50:44.908702719Z caller=metrics.go:92 org_id=default traceID=0c3e124a2748020e latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.345019ms status=200 limit=0 returned_lines=0 throughput=45kB total_bytes=60B
Okt 15 14:51:44 loki[1531]: level=info ts=2021-10-15T12:51:44.908096427Z caller=metrics.go:92 org_id=default traceID=0e585ea19d0093bc latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.275716ms status=200 limit=0 returned_lines=0 throughput=63kB total_bytes=80B
Okt 15 14:52:44 loki[1531]: level=info ts=2021-10-15T12:52:44.908059088Z caller=metrics.go:92 org_id=default traceID=2f026112f61b7825 latency=fast query="(sum by(group,log)(count_over_time({level=\"ERROR\"} | regexp \"(?P<log>(?s).+)\"[1h])) > 0)" query_type=metric range_type=instant length=0s step=0s duration=1.155692ms status=200 limit=0 returned_lines=0 throughput=0B total_bytes=0B


and my loki version is 2.3.0

and thats my config

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: /loki/index

  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: true
  retention_period: 672h
 
ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules
  rule_path: /tmp
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
  enable_alertmanager_v2: true

Checked in with a colleague who implemented alerting in Loki; neither of us are sure what’s going on here.

Are your rulers restarting at all?

I’m not sure what you mean, but to activate every config change I restart the whole loki container. Just to be sure.

is there any additional debug possible what would help you?

or if you mean, if my rules are still working. yes. each new log event is still triggering an alert which will be displayed for exactly 5 minutes.

I meant: are your rulers restarting by themselves during the time that you expect this alert to still be active.

Can you please send your alertmanager config as well?

no, rulers have not restarted by themselves. At least there are no log entries which should give a hint in this direction.

and my alertmanager config looks like

global: 
  smtp_smarthost: postfix:25
  smtp_require_tls: False
  smtp_from: holger.hees@xxx.xxx
  resolve_timeout: 60m

route:
  group_by: 
    - group
    - alertname
  group_wait: 1m
  group_interval: 5m
  repeat_interval: 60m
  receiver: email-receiver
  routes:
    - matchers:
      - group="netdata"
      receiver: dummy
    - matchers:
      - group="crony"
      receiver: dummy
    - matchers:
      - group="php"
      - log=~".*oops, unknown child \([0-9]*\) exited with code 127.*"
      receiver: dummy
    - matchers:
      - group="apache"
      - log=~".*/grafana/api/live/ws.*Mobile.*"
      receiver: dummy

receivers:
  - name: dummy
  - name: email-receiver
    email_configs:
      - to: holger.hees@xxx.xxx
        headers:
          Reply-To: 'holger.hees@xxx.xxx'
          subject: '{{ .Status | toUpper }} • {{ .GroupLabels.SortedPairs.Values | join " • " }}{{ if eq .Status "firing" }} • {{ .Alerts.Firing | len }} time(s){{ end }}'

I get also a lot of messages like “msg=“Rule evaluation result discarded” err=unimplemented” but related to the following post, I ignore all of them

I configured now a test case where I was running loki with log.level=debug but my log file is to big to be inserted inside this post.

If you want I can send it to you via email.

looks like it is related to the setting “chunk_idle_period”

if I change chunk_idle_period from 5m to 60m everything works.