Hello. As the title says, our store-gateway and ingester are getting more expensive by the day.
Our write path currently shows the following metrics.
And our read path shows these metrics.
Our scale looks like this:
- ingester
  - replicas: 90
  - cpu: 6
  - mem: 60Gi
  - disk: 100Gi (average usage ~40Gi)
- store_gateway
  - replicas: 24
  - cpu: 14
  - mem: 140Gi
  - disk: 700Gi (average usage ~200Gi)
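In aggregate, that works out to roughly 540 vCPU and 5,400Gi of memory for the ingesters (90 × 6 CPU, 90 × 60Gi) and 336 vCPU and 3,360Gi of memory for the store-gateways (24 × 14 CPU, 24 × 140Gi), plus about 9,000Gi and 16,800Gi of provisioned disk respectively, which is why the cost is becoming a concern.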
Our configuration looks like this:
activity_tracker:
  filepath: /active-query-tracker/activity.log
alertmanager:
  data_dir: /data
  enable_api: true
  external_url: /alertmanager
  fallback_config_file: /configs/alertmanager_fallback_config.yaml
blocks_storage:
  backend: s3
  bucket_store:
    index_header:
      lazy_loading_enabled: true
      verify_on_load: false
    sync_dir: /data/tsdb-sync
  s3:
    bucket_name: <REDACTED>
  tsdb:
    block_ranges_period:
      - 0h30m0s
    dir: /data/tsdb
    head_compaction_interval: 15m
    retention_period: 13h
    wal_compression_enabled: true
    wal_replay_concurrency: 3
common:
  storage:
    backend: s3
    s3:
      bucket_name: ""
      endpoint: s3.ap-northeast-2.amazonaws.com
      region: ap-northeast-2
compactor:
  block_ranges:
    - 0h30m0s
    - 2h0m0s
    - 12h0m0s
    - 24h0m0s
  compaction_interval: 30m
  data_dir: /data
  deletion_delay: 2h
  first_level_compaction_wait_period: 25m
  max_closing_blocks_concurrency: 2
  max_opening_blocks_concurrency: 4
  sharding_ring:
    heartbeat_period: 1m
    heartbeat_timeout: 4m
    wait_stability_min_duration: 1m
  symbols_flushers_concurrency: 4
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      consul:
        host: http://ha-tracker-backend-server:8500
      store: consul
  ring:
    heartbeat_period: 1m
    heartbeat_timeout: 4m
enable_go_runtime_metrics: true
frontend:
  grpc_client_config:
    grpc_compression: snappy
  max_outstanding_per_tenant: 30000
  parallelize_shardable_queries: false
  scheduler_address: mimir-query-scheduler-headless.mimir-production.svc:9095
  scheduler_worker_concurrency: 10
  split_queries_by_interval: 24h
frontend_worker:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 500000000
    max_send_msg_size: 500000000
  query_scheduler_grpc_client_config:
    grpc_compression: snappy
  scheduler_address: mimir-query-scheduler-headless.mimir-production.svc:9095
ingester:
  ring:
    final_sleep: 0s
    heartbeat_period: 1m
    heartbeat_timeout: 4m
    num_tokens: 512
    tokens_file_path: /data/tokens
    unregister_on_shutdown: true
ingester_client:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 500000000
    max_send_msg_size: 500000000
limits:
  max_cache_freshness: 10m
  max_fetched_chunks_per_query: 200000000
  max_label_names_per_series: 100
  max_query_parallelism: 30
  max_total_query_length: 12000h
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7945
  compression_enabled: false
  join_members:
    - dns+mimir-gossip-ring.mimir-production.svc.cluster.local.:7945
querier:
  max_concurrent: 100
  shuffle_sharding_ingesters_enabled: false
  timeout: 5m
query_scheduler:
  grpc_client_config:
    grpc_compression: snappy
  max_outstanding_requests_per_tenant: 30000
ruler:
  alertmanager_url: ""
  enable_api: true
  query_frontend:
    grpc_client_config:
      grpc_compression: snappy
  rule_path: /data
ruler_storage:
  s3:
    bucket_name: <REDACTED>
runtime_config:
  file: /var/mimir/runtime.yaml
server:
  grpc_server_max_connection_age: 2562047h
  grpc_server_max_connection_age_grace: 2562047h
  grpc_server_max_connection_idle: 2562047h
  grpc_server_max_recv_msg_size: 500000000
  grpc_server_max_send_msg_size: 500000000
  http_server_idle_timeout: 3m
  http_server_read_timeout: 5m
  http_server_write_timeout: 5m
store_gateway:
  sharding_ring:
    heartbeat_period: 1m
    heartbeat_timeout: 4m
    replication_factor: 2
    tokens_file_path: /data/tokens
    unregister_on_shutdown: true
    wait_stability_min_duration: 1m
usage_stats:
  installation_mode: helm
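If it matters for any suggestions: per-tenant overrides would be applied through the runtime config file referenced above (/var/mimir/runtime.yaml). A minimal, hypothetical sketch is below; the tenant ID and values are placeholders, not our actual settings.

# /var/mimir/runtime.yaml — hypothetical per-tenant overrides, values are placeholders
overrides:
  example-tenant:                          # placeholder tenant ID
    max_global_series_per_user: 10000000   # example per-tenant in-memory series limit
    ingestion_rate: 500000                 # example ingestion rate limit (samples/s)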
Are there any recommended settings or other changes that would help optimize costs? What should I change to reduce cost while keeping as much performance as possible?
I posted the same question in a discussion but no one responded, so I'm reposting it to the community. Thanks.