Hello,
I ran into some rate limits from Mimir on the ingestion path this week, and took this as an occasion to set up some observability on our Mimir deployment, starting with getting the official “Mimir Overview” dashboard to work.
Getting the dashboard to work has gone horribly wrong: it took down Mimir on both the read and write paths and left it completely inoperable, since the rollout-operator is also down. Everything is failing with “empty ring” errors.
I am running Kubernetes v1.33 in which I have the following Helm charts installed:
- Grafana Alloy: installed in the alloy k8s namespace;
- Grafana: installed in the grafana namespace;
- Grafana mimir-distributed: installed in the mimir namespace.
$ helm list -A | rg '(NAME|alloy|grafana|mimir)'
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
alloy alloy 9 2025-10-21 22:22:43.200045227 +0000 UTC deployed alloy-1.3.0 v1.11.0
grafana grafana 24 2025-10-21 22:22:43.327446643 +0000 UTC deployed grafana-10.1.2 12.2.0
mimir mimir 15 2025-10-31 23:11:32.855614778 +0000 UTC failed mimir-distributed-6.0.0 3.0.0
$
I also have:
- kube-state-metrics running and scraped by Alloy;
- cadvisor running on each of my nodes and scraped by Alloy.
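Both are shipped to Mimir through Alloy's prometheus.remote_write, roughly as follows (a sketch rather than my exact config; the component label is arbitrary, and the address and tenant match the Mimir details given further down):
prometheus.remote_write "mimir" {
  endpoint {
    // Mimir's remote-write (push) endpoint
    url = "http://1.2.3.4:80/api/v1/push"
    headers = {
      "X-Scope-OrgID" = "foocorp"
    }
  }
}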
To fully work, the “Mimir Overview” dashboard requires you to set up some recording rules, so I’ve added the following to Mimir’s small.yaml:
metaMonitoring:
  prometheusRule:
    enabled: true
    mimirRules: true
    namespace: mimir
And the following to Alloy’s config (1.2.3.4 being the right IP for mimir):
mimir.rules.kubernetes "mimir_rules" {
  address   = "http://1.2.3.4:80/"
  tenant_id = "foocorp"
}
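And in case it’s relevant, the activity of that component can be checked on the Alloy side with something like the following (assuming the chart’s default DaemonSet name; adjust if the controller type differs):
$ kubectl -n alloy logs daemonset/alloy | grep -i 'mimir.rules'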
I’ve also installed the prometheus-operator-crds Helm chart with the following values.yaml:
crds:
  annotations: {}
  alertmanagerconfigs:
    enabled: false
  alertmanagers:
    enabled: false
  podmonitors:
    enabled: false
  probes:
    enabled: false
  prometheusagents:
    enabled: false
  prometheuses:
    enabled: false
  prometheusrules:
    enabled: true
  scrapeconfigs:
    enabled: false
  servicemonitors:
    enabled: false
  thanosrulers:
    enabled: false
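That chart provides the PrometheusRule CRD needed above, which can be confirmed with:
$ kubectl get crd prometheusrules.monitoring.coreos.com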
I can see the rules were properly set with:
$ alias km="kubectl -n mimir"
$ km get prometheusrules.monitoring.coreos.com mimir-rules -o yaml
…
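To confirm the rules also made it into Mimir’s ruler (and not just into Kubernetes), the ruler config API can be queried through the same address Alloy uses, with the tenant header; the path below assumes the default /prometheus HTTP prefix:
$ curl -s -H 'X-Scope-OrgID: foocorp' http://1.2.3.4:80/prometheus/config/v1/rules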
Now, I understand the “empty ring” errors indicate something is wrong with service discovery, yet spending a day trying to follow the documentation has led Grot to run in circles, and here we are. If I follow correctly, it’s best to use ring-based discovery, so I have the following in my mimir.config.yaml:
mimir:
  structuredConfig:
    query_scheduler:
      service_discovery_mode: ring
      ring:
        heartbeat_period: 1m
        heartbeat_timeout: 5m
And the following in Mimir’s values.yaml:
querier:
  extraArgs:
    - -query-scheduler.service-discovery-mode=ring
    - -query-scheduler.ring.store=memberlist
    - -memberlist.join=dnssrv+mimir-gossip-ring.mimir.svc.cluster.local:7946
query_frontend:
  extraArgs:
    - -query-scheduler.service-discovery-mode=ring
    - -query-scheduler.ring.store=memberlist
    - -memberlist.join=dnssrv+mimir-gossip-ring.mimir.svc.cluster.local:7946
query_scheduler:
  extraArgs:
    - -query-scheduler.ring.store=memberlist
    - -memberlist.join=dnssrv+mimir-gossip-ring.mimir.svc.cluster.local:7946
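Since this mixes structuredConfig and per-component extraArgs, the configuration a component actually ends up running with can be double-checked through Mimir’s /config endpoint, following the same pattern as the ring checks further down (QUERIER_POD_IP being a placeholder for one querier pod’s IP):
$ QUERIER_POD_IP=…
$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $QUERIER_POD_IP:8080/config | grep -A 8 'query_scheduler:'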
FYI:
$ dig +short -t SRV mimir-gossip-ring.mimir.svc.cluster.local @5.6.7.8 | wc -l
13
$
I’ve also made sure to drop scheduler_address per the docs:
This option should be set only when query-scheduler component is in use and -query-scheduler.service-discovery-mode is set to ‘dns’.
And opened mimir#13281 about that issue.
With all of that in place, we can see that the query-frontend does discover the query-schedulers (and GET /ready on the query-frontend does return “ready”, btw):
ts=2025-10-31T23:45:29.825288203Z caller=frontend_scheduler_worker.go:146 level=info msg="adding connection to query-scheduler" addr=172.24.0.243:9095
ts=2025-10-31T23:45:29.825308043Z caller=mimir.go:1045 level=info msg="Application started"
ts=2025-10-31T23:45:29.825635045Z caller=frontend_scheduler_worker.go:146 level=info msg="adding connection to query-scheduler" addr=172.24.1.176:9095
ts=2025-10-31T23:45:29.884300392Z caller=memberlist_client.go:673 level=info phase=startup msg="joining memberlist cluster succeeded" reached_nodes=13 elapsed_time=59.703313ms
ts=2025-10-31T23:46:50.012096443Z caller=retry.go:51 query="node_filesystem_files_free / node_filesystem_files" query_timestamp=1761954410000 user=foo level=error user=foo msg="error processing request" try=0 err="empty ring"
[…]
ts=2025-10-31T23:46:50.017498755Z caller=handler.go:433 level=info user=foo msg="query stats" component=query-frontend method=POST path=/prometheus/api/v1/query route_name=prometheus_api_v1_query user_agent=Grafana/12.2.0 status_code=500 response_time=9.430336ms response_size_bytes=0 query_wall_time_seconds=0 fetched_series_count=0 fetched_chunk_bytes=0 fetched_chunks_count=0 fetched_index_bytes=0 sharded_queries=0 split_queries=0 spun_off_subqueries=0 estimated_series_count=0 queue_time_seconds=0 encode_time_seconds=0 samples_processed=0 samples_processed_cache_adjusted=0 param_query="node_filesystem_files_free{} / node_filesystem_files{}\n" param_time=2025-10-31T23:46:50Z length=4m59.999s time_since_min_time=5m0.006994939s time_since_max_time=7.994939ms results_cache_hit_bytes=0 results_cache_miss_bytes=0 header_cache_control= status=failed err="empty ring"
ts=2025-10-31T23:46:50.017568356Z caller=logging.go:144 level=warn msg="POST /prometheus/api/v1/query (500) 9.912699ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"errorType\\\":\\\"internal\\\",\\\"error\\\":\\\"empty ring\\\"}\""
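For the querier side, the same kind of scheduler-connection messages can be grepped from the logs (assuming the chart’s default Deployment name for the queriers):
$ km logs deploy/mimir-querier | grep -i scheduler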
I can see that ring-based discovery for the query-scheduler itself just works:
$ QUERY_SCHEDULER_CLUSTER_IP=…
$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $QUERY_SCHEDULER_CLUSTER_IP:8080/query-scheduler/ring
[…]
<tbody>
<tr>
<td>mimir-query-scheduler-5c9ff87cb6-cr2zh</td>
<td></td>
<td>ACTIVE</td>
<td>10.0.0.41:9095</td>
<td>2025-10-31T23:24:07Z</td>
<td></td>
<td></td>
<td>54s ago (00:28:07)</td>
<td>1</td>
<td>19.5%</td>
<td>
<button name="forget" value="mimir-query-scheduler-5c9ff87cb6-cr2zh" type="submit">Forget</button>
</td>
</tr>
<tr bgcolor="#BEBEBE">
<td>mimir-query-scheduler-5c9ff87cb6-zgrrt</td>
<td></td>
<td>ACTIVE</td>
<td>10.0.0.42:9095</td>
<td>2025-10-31T23:24:59Z</td>
<td></td>
<td></td>
<td>2s ago (00:28:59)</td>
<td>1</td>
<td>80.5%</td>
<td>
<button name="forget" value="mimir-query-scheduler-5c9ff87cb6-zgrrt" type="submit">Forget</button>
</td>
</tr>
</tbody>
[…]
And for that matter all the /ring endpoints appear healthy:
Ingesters ring status:
$ INGESTER_CLUSTER_IPS=(a b c)
$ for i in "${INGESTER_CLUSTER_IPS[@]}" ; do km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $i:8080/ingester/ring | grep -c ACTIVE ; done
3
3
3
$
Ruler ring status:
$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s 172.25.169.251:8080/ruler/ring | grep -c ACTIVE
1
$
Alertmanager ring status:
$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $ALERT_MANAGER_CLUSTER_IP:8080/multitenant_alertmanager/ring | grep -c ACTIVE
2
$
Store-gateway ring status:
$ STORE_GATEWAYS=(a b c)
$ for i in "${STORE_GATEWAYS[@]}" ; do km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $i:8080/store-gateway/ring | grep -c ACTIVE ; done
3
3
3
$
Compactor ring status:
$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $COMPACTOR_CLUSTER_IP:8080/compactor/ring | grep -c ACTIVE
1
$
And FWIW: the overrides-exporter hash ring is disabled.
km get pods shows everything as running correctly, except for the rollout-operator.
I am looking for ideas on how I could fix this; let me know if you have any questions or if I missed anything.
Thank you.