Ruler deployment crash looping

Hi,

After enabling the Ruler deployment in Loki using Tanka/Jsonnet, the ruler pods are crash looping with this error:

mkdir : no such file or directory\nerror creating index client\ngithub.com/cortexproject/cortex/pkg/chunk/storage.NewStore\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/factory.go:176\ngithub.com/grafana/loki/pkg/loki.(*Loki).initStore\n\t/src/loki/pkg/loki/modules.go:287\ngithub.com/cortexproject/cortex/pkg/util/mod
(output truncated; pod: ruler-84d9c69b5-6nb5l)

Using boltdb shipper and GCS store for Ruler.

Relevant ruler config

    ruler:
        alertmanager_url: http://alertmanager.monitoring.svc:9093
        ring:
            kvstore:
                consul:
                    host: consul-server.consul.svc:8500
                prefix: loki/rulers/
        storage:
            gcs:
                bucket_name: ...

Thanks
Rajat Vig

Can you paste your entire config file (masking any sensitive data)? I believe that error message points to a problem in a different section of the config.

apiVersion: v1
data:
  config.yaml: |
    chunk_store_config:
        chunk_cache_config:
            memcached:
                batch_size: 1024
                parallelism: 100
            memcached_client:
                consistent_hash: true
                host: memcached.loki.svc.cluster.local
                service: memcached-client
        max_look_back_period: 0
    distributor:
        ring:
            kvstore:
                consul:
                    consistent_reads: false
                    host: consul-server.consul.svc:8500
                    http_client_timeout: 20s
                    watch_burst_size: 1
                    watch_rate_limit: 1
                prefix: loki/collectors/
                store: consul
    frontend:
        compress_responses: true
        log_queries_longer_than: 10s
        max_outstanding_per_tenant: 4800
    frontend_worker:
        frontend_address: query-frontend.loki.svc.cluster.local:9095
        grpc_client_config:
            max_send_msg_size: 1.048576e+08
        parallelism: 2
    ingester:
        chunk_block_size: 262144
        chunk_idle_period: 15m
        lifecycler:
            heartbeat_period: 5s
            interface_names:
              - eth0
            join_after: 30s
            num_tokens: 512
            ring:
                heartbeat_timeout: 1m
                kvstore:
                    consul:
                        consistent_reads: true
                        host: consul-server.consul.svc:8500
                        http_client_timeout: 20s
                    prefix: loki/collectors/
                    store: consul
                replication_factor: 3
        max_transfer_retries: 60
    ingester_client:
        grpc_client_config:
            max_recv_msg_size: 6.7108864e+07
        pool_config:
            health_check_ingesters: true
        remote_timeout: 1s
    limits_config:
        enforce_metric_name: false
        ingestion_burst_size_mb: 30
        ingestion_rate_mb: 25
        ingestion_rate_strategy: global
        max_cache_freshness_per_query: 10m
        max_global_streams_per_user: 20000
        max_query_length: 12000h
        max_query_parallelism: 16
        max_streams_per_user: 0
        reject_old_samples: true
        reject_old_samples_max_age: 168h
    querier:
        query_ingesters_within: 2h
    query_range:
        align_queries_with_step: true
        cache_results: true
        max_retries: 5
        parallelise_shardable_queries: true
        results_cache:
            cache:
                memcached_client:
                    consistent_hash: true
                    host: memcached-frontend.loki.svc.cluster.local
                    max_idle_conns: 64
                    service: memcached-client
                    timeout: 500ms
                    update_interval: 1m
        split_queries_by_interval: 30m
    ruler:
        alertmanager_url: http://alertmanager.monitoring.svc:9093
        ring:
            kvstore:
                consul:
                    host: consul-server.consul.svc:8500
                prefix: loki/rulers/
        storage:
            gcs:
                bucket_name: ...
    schema_config:
        configs:
          - from: "2020-10-24"
            index:
                period: 24h
                prefix: loki_index_
            object_store: gcs
            schema: v11
            store: boltdb-shipper
    server:
        graceful_shutdown_timeout: 5s
        grpc_server_max_concurrent_streams: 1000
        grpc_server_max_recv_msg_size: 1.048576e+08
        grpc_server_max_send_msg_size: 1.048576e+08
        http_listen_port: 3100
        http_server_idle_timeout: 120s
        http_server_write_timeout: 1m
    storage_config:
        boltdb_shipper:
            shared_store: gcs
        gcs:
            bucket_name: ...
        index_queries_cache_config:
            memcached:
                batch_size: 1024
                parallelism: 100
            memcached_client:
                consistent_hash: true
                host: ...
                service: memcached-client
    table_manager:
        creation_grace_period: 3h
        poll_interval: 10m
        retention_deletes_enabled: false
        retention_period: 0
kind: ConfigMap
metadata:
  name: loki
  namespace: loki

Everything else is working fine; only the ruler crashes when it is enabled, with the message I posted earlier.

Thank you, this is good to know, because the error you posted is not one I would expect to see for a misconfigured ruler… but here we are :)

The only thing you are missing that would be worth trying is adding rule_path to your ruler config:

    ruler:
        alertmanager_url: http://alertmanager.monitoring.svc:9093
        ring:
            kvstore:
                consul:
                    host: consul-server.consul.svc:8500
                prefix: loki/rulers/
        storage:
            gcs:
                bucket_name: ...
        rule_path: /tmp/loki/rules-temp
        enable_api: true

Loki needs a temporary directory for evaluating rules; it does not need to be persisted.

enable_api is only necessary if you would like to interact with your rules via the API. I added it here to note that it is currently not enabled by default.

My bad, the earlier paste was taken with the ruler disabled, so it did not have all the entries.

apiVersion: v1
data:
  config.yaml: |
    chunk_store_config:
        chunk_cache_config:
            memcached:
                batch_size: 1024
                parallelism: 100
            memcached_client:
                consistent_hash: true
                host: memcached.loki.svc.cluster.local
                service: memcached-client
        max_look_back_period: 0
    distributor:
        ring:
            kvstore:
                consul:
                    consistent_reads: false
                    host: consul-server.consul.svc:8500
                    http_client_timeout: 20s
                    watch_burst_size: 1
                    watch_rate_limit: 1
                prefix: loki/collectors/
                store: consul
    frontend:
        compress_responses: true
        log_queries_longer_than: 10s
        max_outstanding_per_tenant: 4800
    frontend_worker:
        frontend_address: query-frontend.loki.svc.cluster.local:9095
        grpc_client_config:
            max_send_msg_size: 1.048576e+08
        parallelism: 2
    ingester:
        chunk_block_size: 262144
        chunk_idle_period: 15m
        lifecycler:
            heartbeat_period: 5s
            interface_names:
              - eth0
            join_after: 30s
            num_tokens: 512
            ring:
                heartbeat_timeout: 1m
                kvstore:
                    consul:
                        consistent_reads: true
                        host: consul-server.consul.svc:8500
                        http_client_timeout: 20s
                    prefix: loki/collectors/
                    store: consul
                replication_factor: 3
        max_transfer_retries: 60
    ingester_client:
        grpc_client_config:
            max_recv_msg_size: 6.7108864e+07
        pool_config:
            health_check_ingesters: true
        remote_timeout: 1s
    limits_config:
        enforce_metric_name: false
        ingestion_burst_size_mb: 30
        ingestion_rate_mb: 25
        ingestion_rate_strategy: global
        max_cache_freshness_per_query: 10m
        max_global_streams_per_user: 20000
        max_query_length: 12000h
        max_query_parallelism: 16
        max_streams_per_user: 0
        reject_old_samples: true
        reject_old_samples_max_age: 168h
    querier:
        query_ingesters_within: 2h
    query_range:
        align_queries_with_step: true
        cache_results: true
        max_retries: 5
        parallelise_shardable_queries: true
        results_cache:
            cache:
                memcached_client:
                    consistent_hash: true
                    host: memcached-frontend.loki.svc.cluster.local
                    max_idle_conns: 64
                    service: memcached-client
                    timeout: 500ms
                    update_interval: 1m
        split_queries_by_interval: 30m
    ruler:
        alertmanager_url: http://alertmanager.monitoring.svc:9093
        enable_alertmanager_v2: true
        enable_api: true
        enable_sharding: true
        ring:
            kvstore:
                consul:
                    host: consul-server.consul.svc:8500
                prefix: loki/rulers/
                store: consul
        rule_path: /tmp/rules
        storage:
            gcs:
                bucket_name: <ruler bucket>
            type: gcs
    schema_config:
        configs:
          - from: "2020-10-24"
            index:
                period: 24h
                prefix: loki_index_
            object_store: gcs
            schema: v11
            store: boltdb-shipper
    server:
        graceful_shutdown_timeout: 5s
        grpc_server_max_concurrent_streams: 1000
        grpc_server_max_recv_msg_size: 1.048576e+08
        grpc_server_max_send_msg_size: 1.048576e+08
        http_listen_port: 3100
        http_server_idle_timeout: 120s
        http_server_write_timeout: 1m
    storage_config:
        boltdb_shipper:
            shared_store: gcs
        gcs:
            bucket_name: <storage bucket>
        index_queries_cache_config:
            memcached:
                batch_size: 1024
                parallelism: 100
            memcached_client:
                consistent_hash: true
                host: memcached-index-queries.loki.svc.cluster.local
                service: memcached-client
    table_manager:
        creation_grace_period: 3h
        poll_interval: 10m
        retention_deletes_enabled: false
        retention_period: 0
kind: ConfigMap
metadata:
  name: loki
  namespace: loki

The ruler still crashes with the same error as before. From a rough reading of the code, I suspect it fails while initialising the chunk store.

level=error ts=2020-12-10T23:15:12.966824939Z caller=log.go:149 msg="error running loki" err="mkdir : no such file or directory\nerror creating index client\ngithub.com/cortexproject/cortex/pkg/chunk/storage.NewStore\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/factory.go:176\ngithub.com/grafana/loki/pkg/loki.(*Loki).initStore\n\t/src/loki/pkg/loki/modules.go:287\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:103\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:204\nmain.main\n\t/src/loki/cmd/loki/main.go:130\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\nerror initialising module: store\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:105\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:204\nmain.main\n\t/src/loki/cmd/loki/main.go:130\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"

I am also assuming that the chunk store bucket and rules buckets are different. Also, as we are using workload identity, the ruler does not need any permissions over the chunks bucket. Let me try tweaking the permissions a bit.

Think I figured it out.

It isn't the permissions. When using boltdb-shipper, the ruler is not given the boltdb.shipper.active-index-directory or boltdb.shipper.cache-location arguments, which the querier and ingester set and point at a directory on their PVC.
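
For reference, the config-file form of those two flags would be, as far as I know, the active_index_directory and cache_location keys under storage_config.boltdb_shipper. A minimal sketch, assuming /data is a writable directory in the ruler container (both paths are made up for illustration, and I have not confirmed which of the two the ruler actually needs):

```yaml
storage_config:
    boltdb_shipper:
        shared_store: gcs
        # Hypothetical paths: any directory that exists and is writable
        # inside the ruler container should do.
        active_index_directory: /data/boltdb-index
        cache_location: /data/boltdb-cache
```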

The question I have now is: should I configure the ruler with a PVC and set a cache location like the querier?

I am assuming https://github.com/grafana/loki/commit/dcbfecf9e549f264e5c16b1eefbe1b4071e508c1 might also be required.


Rajat

I have it running after:

  1. patching the ruler args to set boltdb.shipper.cache-location to /data/boltdb-cache
  2. using the latest build image grafana/loki:master-3f99a07
  3. mounting an emptyDir volume into the container at /data
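
Put together, the three steps above amount to roughly the following in the ruler Deployment. This is only a sketch: the container name, volume name, and config file path are placeholders, not the values generated by the Jsonnet library.

```yaml
# Relevant part of the ruler Deployment only; everything marked
# "hypothetical" is an illustrative name, not taken from the real manifest.
spec:
  template:
    spec:
      containers:
        - name: ruler                              # hypothetical container name
          image: grafana/loki:master-3f99a07
          args:
            - -config.file=/etc/loki/config.yaml   # hypothetical path
            - -target=ruler
            - -boltdb.shipper.cache-location=/data/boltdb-cache
          volumeMounts:
            - name: ruler-data                     # hypothetical volume name
              mountPath: /data
      volumes:
        - name: ruler-data
          emptyDir: {}                             # contents are lost when the pod is deleted
```

Since the emptyDir is wiped whenever the pod is rescheduled, the cache would have to be rebuilt from the shared store after each restart.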

Though I am a little uncertain about stability and whether an emptyDir mount is valid to use here.

If you want, I can create an issue in GitHub to help track it.

Thanks
Rajat

Thanks so much for all the follow-up @rajatvig, extremely helpful.

If you don't mind, creating an issue would be very helpful; there is work we need to do here to improve this.

Created https://github.com/grafana/loki/issues/3076