Cannot recover Mimir from empty ring errors

Hello,

I ran into some rate limits from Mimir on the ingestion path this week, and took this as an occasion to set up some observability on our Mimir deployment, starting by trying to get the official “Mimir Overview” dashboard to work.

Getting the dashboard to work has gone horribly wrong: it took down Mimir on both the read and write paths, and left it completely inoperable since the rollout-operator is also down. Everything is failing with “empty ring” errors.

I am running Kubernetes v1.33 in which I have the following Helm charts installed:

  • Grafana Alloy: installed in the alloy k8s namespace;
  • Grafana: installed in the grafana namespace;
  • Grafana mimir-distributed: installed in the mimir namespace.
$ helm list -A | rg '(NAME|alloy|grafana|mimir)'
NAME                            NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
alloy                           alloy                   9               2025-10-21 22:22:43.200045227 +0000 UTC deployed        alloy-1.3.0                     v1.11.0    
grafana                         grafana                 24              2025-10-21 22:22:43.327446643 +0000 UTC deployed        grafana-10.1.2                  12.2.0     
mimir                           mimir                   15              2025-10-31 23:11:32.855614778 +0000 UTC failed          mimir-distributed-6.0.0         3.0.0      
$ 

I also have:

  • kube-state-metrics running and scraped by Alloy;
  • cadvisor running on each of my nodes and scraped by Alloy.
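
To double-check that side of the pipeline, I port-forward Alloy and look at its UI, which lists each component and its health (this assumes the chart's default service name alloy and HTTP port 12345, and that Alloy exposes a /-/ready endpoint; adjust to your setup):

$ kubectl -n alloy port-forward svc/alloy 12345:12345 &
$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:12345/-/ready

The UI at http://localhost:12345 then shows whether the prometheus.scrape components for kube-state-metrics and cadvisor are healthy.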

In order to fully work, the “Mimir Overview” dashboard requires you to set up some recording rules, so I’ve added the following to Mimir’s small.yaml:

metaMonitoring:
  prometheusRule:
    enabled: true
    mimirRules: true
    namespace: mimir

And the following to Alloy’s config (1.2.3.4 being the right IP for mimir):

mimir.rules.kubernetes "mimir_rules" {
  address = "http://1.2.3.4:80/"
  tenant_id = "foocorp"
}
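
To confirm the rules actually make it into Mimir’s ruler (and not just into the Kubernetes API), I list them back through the ruler configuration API using the same address and tenant, assuming the gateway at 1.2.3.4 routes the ruler API under the default /prometheus prefix:

$ curl -s -H 'X-Scope-OrgID: foocorp' http://1.2.3.4:80/prometheus/config/v1/rules

If nothing comes back here, Alloy never managed to sync the rule groups into Mimir.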

I’ve also installed the prometheus-operator-crds Helm chart with the following values.yaml:

crds:
  annotations: {}
  alertmanagerconfigs:
    enabled: false
  alertmanagers:
    enabled: false
  podmonitors:
    enabled: false
  probes:
    enabled: false
  prometheusagents:
    enabled: false
  prometheuses:
    enabled: false
  prometheusrules:
    enabled: true
  scrapeconfigs:
    enabled: false
  servicemonitors:
    enabled: false
  thanosrulers:
    enabled: false
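
And a quick check that the PrometheusRule CRD itself is present in the cluster:

$ kubectl get crd prometheusrules.monitoring.coreos.com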

I can see the PrometheusRule object was properly created:

$ alias km="kubectl -n mimir"
$ km get prometheusrules.monitoring.coreos.com mimir-rules -o yaml
…

Now, I understand the “empty ring” errors indicate something is wrong with service discovery, yet a day spent following the documentation has only led Grot to run in circles, and here we are. If I follow correctly, it’s best to use ring-based discovery, so I have the following in my mimir.config.yaml:

mimir:
  structuredConfig:
    query_scheduler:
      service_discovery_mode: ring
      ring:
        heartbeat_period: 1m
        heartbeat_timeout: 5m
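
To make sure this actually reaches the running processes, I check the effective configuration through Mimir’s /config endpoint, going through the gateway pod like the other checks below (mimir-query-frontend being the chart’s default service name for a release called mimir):

$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s mimir-query-frontend:8080/config | grep service_discovery_mode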

And the following in Mimir’s values.yaml:

querier:
  extraArgs:
    - -query-scheduler.service-discovery-mode=ring
    - -query-scheduler.ring.store=memberlist
    - -memberlist.join=dnssrv+mimir-gossip-ring.mimir.svc.cluster.local:7946
query_frontend:
  extraArgs:
    - -query-scheduler.service-discovery-mode=ring
    - -query-scheduler.ring.store=memberlist
    - -memberlist.join=dnssrv+mimir-gossip-ring.mimir.svc.cluster.local:7946
query_scheduler:
  extraArgs:
    - -query-scheduler.ring.store=memberlist
    - -memberlist.join=dnssrv+mimir-gossip-ring.mimir.svc.cluster.local:7946
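
As a sanity check that these extraArgs are actually rendered onto the pods, and not only present in values.yaml, the container arguments can be inspected directly (deployment names below are the chart defaults for a release called mimir):

$ km get deploy mimir-query-frontend -o jsonpath='{.spec.template.spec.containers[0].args}'
$ km get deploy mimir-querier -o jsonpath='{.spec.template.spec.containers[0].args}'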

FYI:

$ dig +short -t SRV mimir-gossip-ring.mimir.svc.cluster.local @5.6.7.8 | wc -l
13
$ 
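
The memberlist view itself can also be inspected through Mimir’s /memberlist admin page, for example from the gateway pod (same assumptions about service names as elsewhere):

$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s mimir-querier:8080/memberlist

which lists the members that instance sees; the count should roughly match the 13 SRV records above.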

I’ve also made sure to drop scheduler_address per the docs:

This option should be set only when query-scheduler component is in use and -query-scheduler.service-discovery-mode is set to ‘dns’.

I also opened mimir#13281 about that.

With all of that in place, we can see that the query-frontend does discover the query-scheduler (and GET /ready on the query-frontend does return “ready”, btw):

ts=2025-10-31T23:45:29.825288203Z caller=frontend_scheduler_worker.go:146 level=info msg="adding connection to query-scheduler" addr=172.24.0.243:9095
ts=2025-10-31T23:45:29.825308043Z caller=mimir.go:1045 level=info msg="Application started"
ts=2025-10-31T23:45:29.825635045Z caller=frontend_scheduler_worker.go:146 level=info msg="adding connection to query-scheduler" addr=172.24.1.176:9095
ts=2025-10-31T23:45:29.884300392Z caller=memberlist_client.go:673 level=info phase=startup msg="joining memberlist cluster succeeded" reached_nodes=13 elapsed_time=59.703313ms
ts=2025-10-31T23:46:50.012096443Z caller=retry.go:51 query="node_filesystem_files_free / node_filesystem_files" query_timestamp=1761954410000 user=foo level=error user=foo msg="error processing request" try=0 err="empty ring"
[…]
ts=2025-10-31T23:46:50.017498755Z caller=handler.go:433 level=info user=foo msg="query stats" component=query-frontend method=POST path=/prometheus/api/v1/query route_name=prometheus_api_v1_query user_agent=Grafana/12.2.0 status_code=500 response_time=9.430336ms response_size_bytes=0 query_wall_time_seconds=0 fetched_series_count=0 fetched_chunk_bytes=0 fetched_chunks_count=0 fetched_index_bytes=0 sharded_queries=0 split_queries=0 spun_off_subqueries=0 estimated_series_count=0 queue_time_seconds=0 encode_time_seconds=0 samples_processed=0 samples_processed_cache_adjusted=0 param_query="node_filesystem_files_free{} / node_filesystem_files{}\n" param_time=2025-10-31T23:46:50Z length=4m59.999s time_since_min_time=5m0.006994939s time_since_max_time=7.994939ms results_cache_hit_bytes=0 results_cache_miss_bytes=0 header_cache_control= status=failed err="empty ring"
ts=2025-10-31T23:46:50.017568356Z caller=logging.go:144 level=warn msg="POST /prometheus/api/v1/query (500) 9.912699ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"errorType\\\":\\\"internal\\\",\\\"error\\\":\\\"empty ring\\\"}\""
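
To see which components are actually emitting the error, I grep the logs per component (deployment names are the chart defaults for a release called mimir):

$ for d in query-frontend querier query-scheduler ruler ; do echo "== $d" ; km logs deploy/mimir-$d --since=15m | grep -c "empty ring" ; done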

I can see that ring-based discovery for the query-scheduler itself just works:

$ QUERY_SCHEDULER_CLUSTER_IP=…
$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $QUERY_SCHEDULER_CLUSTER_IP:8080/query-scheduler/ring
[…]
        <tbody>
        
            
                <tr>
            
            <td>mimir-query-scheduler-5c9ff87cb6-cr2zh</td>
            <td></td>
            <td>ACTIVE</td>
            <td>10.0.0.41:9095</td>
            <td>2025-10-31T23:24:07Z</td>
            
            <td></td>
            <td></td>
            
            <td>54s ago (00:28:07)</td>
            
            <td>1</td>
            <td>19.5%</td>
            
            <td>
                <button name="forget" value="mimir-query-scheduler-5c9ff87cb6-cr2zh" type="submit">Forget</button>
            </td>
            </tr>
        
            
                <tr bgcolor="#BEBEBE">
            
            <td>mimir-query-scheduler-5c9ff87cb6-zgrrt</td>
            <td></td>
            <td>ACTIVE</td>
            <td>10.0.0.42:9095</td>
            <td>2025-10-31T23:24:59Z</td>
            
            <td></td>
            <td></td>
            
            <td>2s ago (00:28:59)</td>
            
            <td>1</td>
            <td>80.5%</td>
            
            <td>
                <button name="forget" value="mimir-query-scheduler-5c9ff87cb6-zgrrt" type="submit">Forget</button>
            </td>
            </tr>
        
        </tbody>
[…]
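
The same one-liner style used below confirms this without wading through the HTML:

$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $QUERY_SCHEDULER_CLUSTER_IP:8080/query-scheduler/ring | grep -c ACTIVE

which should return 2 here, matching the two scheduler instances shown above.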

And for that matter all the /ring endpoints appear healthy:

Ingesters ring status:

$ INGESTER_CLUSTER_IPS=(a b c)
$ for i in "${INGESTER_CLUSTER_IPS[@]}" ; do km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $i:8080/ingester/ring | grep -c ACTIVE ; done
3
3
3
$ 

Ruler ring status:

$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s 172.25.169.251:8080/ruler/ring | grep -c ACTIVE
1
$ 

Alertmanager ring status:

$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $ALERT_MANAGER_CLUSTER_IP:8080/multitenant_alertmanager/ring | grep -c ACTIVE
2
$ 

Store-gateway ring status:

$ STORE_GATEWAYS=(a b c)
$ for i in "${STORE_GATEWAYS[@]}" ; do km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $i:8080/store-gateway/ring | grep -c ACTIVE ; done
3
3
3
$  

Compactor ring status:

$ km exec mimir-gateway-6878bc4d9d-88kh6 -- curl -s $COMPACTOR_CLUSTER_IP:8080/compactor/ring | grep -c ACTIVE
1
$ 

And FWIW, the overrides-exporter hash ring is disabled.

km get pods shows everything as running correctly except for the rollout-operator.
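
For completeness, this is roughly how I have been looking at the rollout-operator side (the label selector is an assumption based on the subchart’s standard labels; adjust if needed):

$ km get pods -l app.kubernetes.io/name=rollout-operator
$ km describe deploy -l app.kubernetes.io/name=rollout-operator | tail -n 20
$ km logs -l app.kubernetes.io/name=rollout-operator --tail=50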

I am looking for ideas on how I could fix this; let me know if you have any questions or if I missed anything.

Thank you.

It turned out that the upgrade to v6.0.0 of the chart happened at the same time, and that we had missed the manual upgrade instructions:

  • [ENHANCEMENT] Upgrade rollout-operator to 0.35.1. Note required actions for upgrading the rollout-operator chart. #12591, #12996

Which leads to:

Starting with v0.33.0 of the rollout-operator chart, the rollout-operator webhooks are enabled. See the grafana/rollout-operator repository on GitHub.

Before upgrading to this version, make sure that the CustomResourceDefinitions (CRDs) in the crds directory are applied to your cluster.

Manually applying these CRDs is only required if upgrading from a chart <= v0.32.0.

See:
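
Applying those CRDs by hand, before upgrading the chart, can be done along these lines (a sketch, assuming the grafana Helm repo is already added and that the standalone rollout-operator chart ships the CRDs in its crds/ directory):

$ helm pull grafana/rollout-operator --untar
$ kubectl apply -f rollout-operator/crds/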

Moreover, extraArgs in values.yaml expects a dict, not a list. I had applied those arguments by hand with kubectl edit last week. The fixed YAML snippet is:

querier:
  extraArgs:
    query-scheduler.service-discovery-mode: ring
    query-scheduler.ring.store: memberlist
    memberlist.join: dnssrv+mimir-gossip-ring.mimir.svc.cluster.local:7946
query_frontend:
  extraArgs:
    query-scheduler.service-discovery-mode: ring
    query-scheduler.ring.store: memberlist
    memberlist.join: dnssrv+mimir-gossip-ring.mimir.svc.cluster.local:7946
query_scheduler:
  extraArgs:
    query-scheduler.ring.store: memberlist
    memberlist.join: dnssrv+mimir-gossip-ring.mimir.svc.cluster.local:7946
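
After fixing the values, a regular helm upgrade re-renders the arguments (the value files below are the ones mentioned above; pass whatever files you normally use), and the jsonpath check from earlier then shows the flags on the containers:

$ helm -n mimir upgrade mimir grafana/mimir-distributed -f small.yaml -f mimir.config.yaml
$ km get deploy mimir-querier -o jsonpath='{.spec.template.spec.containers[0].args}'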

edit: and fwiw we also ran into:

Which we fixed with the following snippet in our values.yaml:

rollout_operator:
  enabled: true
  fullnameOverride: "rollout-operator"

Followed by a manual clean-up of the “misnamed” resources:

kubectl delete service,pod,serviceaccount,deployment,replicaset,role,rolebinding mimir-rollout-operator
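
After that clean-up, the renamed resources can be verified with (names follow the fullnameOverride above):

$ km get deploy,service,serviceaccount rollout-operator
$ km rollout status deployment/rollout-operator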