Ruler evaluation producing inconsistent alerts and status 500 on instant queries

When alerts are deployed to Loki, the ruler produces errors with status=500, and the alert results are completely inconsistent with what the same query returns when run manually in Grafana.

"log": "level=info ts=2024-10-09T19:46:01.141151674Z caller=engine.go:248 component=ruler evaluation_mode=local org_id=fake msg=\"executing query\" type=instant query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986"

"log": "level=info ts=2024-10-09T19:46:01.228229623Z caller=metrics.go:217 component=ruler evaluation_mode=local org_id=fake latency=fast query=\"(sum(count_over_time({namespace=\\\"ai\\\"} |= \\\"check completed\\\"[10m])) < 30)\" query_hash=3077565986 query_type=metric range_type=instant length=0s start_delta=694.286852ms end_delta=694.286992ms step=0s duration=87.000739ms status=500 limit=0 returned_lines=0 throughput=4.1GB total_bytes=357MB total_bytes_structured_metadata=715kB lines_per_second=847349 total_lines=73720 post_filter_lines=48 total_entries=1 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=155.03µs cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=73 ingester_requests=8 ingester_chunk_head_bytes=2.5MB ingester_chunk_compressed_bytes=41MB ingester_chunk_decompressed_bytes=354MB ingester_post_filter_lines=48 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_shard_resolver_duration=0s disable_pipeline_wrappers=false"

Running Loki Helm chart 6.16.0 in SSD mode, deploying alerts to Loki using the grafana/cortex-rules-action GitHub Action.
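For context, the CI step looks roughly like the sketch below. The environment variable names are based on how the action wraps cortextool as far as I recall, so treat them as assumptions and double-check against the action's README; the Loki address is a placeholder.

- name: Deploy Loki alert rules
  uses: grafana/cortex-rules-action@master
  env:
    # Assumed variable names: sync the rule files in RULES_DIR to the Loki ruler
    ACTION: sync
    BACKEND: loki
    RULES_DIR: ./rules/
    CORTEX_ADDRESS: https://<loki_address>/
    CORTEX_TENANT_ID: fake

The rule group being deployed: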

groups:
  - name: AiErrorAlerts
    rules:
      - alert: LowCheckCount
        expr: sum(count_over_time({namespace="ai"} |= `check completed` [10m])) < 30
        for: 10m
        labels:
          route: alerts-ai
          severity: medium
          source: loki
        annotations:
          summary: Low check completion rate.
          details: Checks completed per 10 minutes - `{{ $value }}`. Expected rate - `30`.

Ruler config:

rulerConfig:
    alertmanager_url: "https://<mimir_am>:8080/alertmanager/"
    enable_alertmanager_v2: true
    enable_sharding: true
    remote_write:
      enabled: true
      clients:
        local:
          url: "http://haproxy.prometheus.svc.cluster.local:8080/api/v1/push"
    storage:
      gcs:
        bucket_name: hostinger-loki-ruler
    wal:
      dir: /var/loki/wal

At this point, I am completely lost as to what could be causing this :thinking:, so any help is appreciated.

How big is your Loki cluster?

If you have a lot of logs, you might want to consider switching the ruler to remote evaluation. With local evaluation the ruler runs the query itself, but on a sizable cluster that's usually not enough, and you would want the query frontend to handle the query instead (remote evaluation). See ruler.evaluation in the documentation here.

This is what our ruler evaluation block looks like:

ruler:
  <...>
  evaluation:
    mode: remote
    query_frontend:
      address: <loki_query_frontend_address>

The loki_query_frontend_address should be the internal address of your query frontend (the read target in SSD mode), meaning traffic should go directly from the ruler to the query frontend and not through the gateway. Unfortunately I don’t use the Helm chart for our Loki cluster, so I can’t really tell you where to make that change.


Wow, that solved the issue! Grafana champ for real hahaha. :star_struck:

Unfortunately, the documentation about it is very unclear. It reads as if “remote” mode were meant for evaluation against a separate remote cluster, not simply for pointing the ruler at a different component of the same cluster. But maybe it’s just me.

Also, note that the gRPC port (9095 by default) must be used for this to work:

  rulerConfig:
    evaluation:
      mode: remote
      query_frontend:
        address: loki-read.loki.svc.cluster.local:9095

The cluster itself is not that big, around 800 GB daily with around 4k active streams, but yeah, we are running a bunch of backend pods. Really appreciate it :bowing_man: