Hello. As the title says, our store-gateway and ingester are getting more expensive by the day.
Our write path currently shows the following metrics.
And our read path shows these metrics.
Our scale looks like this:
- ingester
  - replicas: 90
  - cpu: 6
  - mem: 60Gi
  - disk: 100Gi (average usage ~40Gi)
- store_gateway
  - replicas: 24
  - cpu: 14
  - mem: 140Gi
  - disk: 700Gi (average usage ~200Gi)
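In aggregate, that works out to roughly 540 vCPU and 5,400Gi of memory for the ingesters (90 × 6 CPU, 90 × 60Gi) and 336 vCPU and 3,360Gi of memory for the store-gateways (24 × 14 CPU, 24 × 140Gi), plus about 9,000Gi and 16,800Gi of provisioned disk respectively, which is why the cost is becoming a concern.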
Our configuration looks like this:
activity_tracker:
  filepath: /active-query-tracker/activity.log
alertmanager:
  data_dir: /data
  enable_api: true
  external_url: /alertmanager
  fallback_config_file: /configs/alertmanager_fallback_config.yaml
blocks_storage:
  backend: s3
  bucket_store:
    index_header:
      lazy_loading_enabled: true
      verify_on_load: false
    sync_dir: /data/tsdb-sync
  s3:
    bucket_name: <REDACTED>
  tsdb:
    block_ranges_period:
      - 0h30m0s
    dir: /data/tsdb
    head_compaction_interval: 15m
    retention_period: 13h
    wal_compression_enabled: true
    wal_replay_concurrency: 3
common:
  storage:
    backend: s3
    s3:
      bucket_name: ""
      endpoint: s3.ap-northeast-2.amazonaws.com
      region: ap-northeast-2
compactor:
  block_ranges:
    - 0h30m0s
    - 2h0m0s
    - 12h0m0s
    - 24h0m0s
  compaction_interval: 30m
  data_dir: /data
  deletion_delay: 2h
  first_level_compaction_wait_period: 25m
  max_closing_blocks_concurrency: 2
  max_opening_blocks_concurrency: 4
  sharding_ring:
    heartbeat_period: 1m
    heartbeat_timeout: 4m
    wait_stability_min_duration: 1m
  symbols_flushers_concurrency: 4
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      consul:
        host: http://ha-tracker-backend-server:8500
      store: consul
  ring:
    heartbeat_period: 1m
    heartbeat_timeout: 4m
enable_go_runtime_metrics: true
frontend:
  grpc_client_config:
    grpc_compression: snappy
  max_outstanding_per_tenant: 30000
  parallelize_shardable_queries: false
  scheduler_address: mimir-query-scheduler-headless.mimir-production.svc:9095
  scheduler_worker_concurrency: 10
  split_queries_by_interval: 24h
frontend_worker:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 500000000
    max_send_msg_size: 500000000
  query_scheduler_grpc_client_config:
    grpc_compression: snappy
  scheduler_address: mimir-query-scheduler-headless.mimir-production.svc:9095
ingester:
  ring:
    final_sleep: 0s
    heartbeat_period: 1m
    heartbeat_timeout: 4m
    num_tokens: 512
    tokens_file_path: /data/tokens
    unregister_on_shutdown: true
ingester_client:
  grpc_client_config:
    grpc_compression: snappy
    max_recv_msg_size: 500000000
    max_send_msg_size: 500000000
limits:
  max_cache_freshness: 10m
  max_fetched_chunks_per_query: 200000000
  max_label_names_per_series: 100
  max_query_parallelism: 30
  max_total_query_length: 12000h
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7945
  compression_enabled: false
  join_members:
    - dns+mimir-gossip-ring.mimir-production.svc.cluster.local.:7945
querier:
  max_concurrent: 100
  shuffle_sharding_ingesters_enabled: false
  timeout: 5m
query_scheduler:
  grpc_client_config:
    grpc_compression: snappy
  max_outstanding_requests_per_tenant: 30000
ruler:
  alertmanager_url: ""
  enable_api: true
  query_frontend:
    grpc_client_config:
      grpc_compression: snappy
  rule_path: /data
ruler_storage:
  s3:
    bucket_name: <REDACTED>
runtime_config:
  file: /var/mimir/runtime.yaml
server:
  grpc_server_max_connection_age: 2562047h
  grpc_server_max_connection_age_grace: 2562047h
  grpc_server_max_connection_idle: 2562047h
  grpc_server_max_recv_msg_size: 500000000
  grpc_server_max_send_msg_size: 500000000
  http_server_idle_timeout: 3m
  http_server_read_timeout: 5m
  http_server_write_timeout: 5m
store_gateway:
  sharding_ring:
    heartbeat_period: 1m
    heartbeat_timeout: 4m
    replication_factor: 2
    tokens_file_path: /data/tokens
    unregister_on_shutdown: true
    wait_stability_min_duration: 1m
usage_stats:
  installation_mode: helm
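If it matters for any suggestions: per-tenant overrides would be applied through the runtime config file referenced above (/var/mimir/runtime.yaml). A minimal, hypothetical sketch is below; the tenant ID and values are placeholders, not our actual settings.

# /var/mimir/runtime.yaml — hypothetical per-tenant overrides, values are placeholders
overrides:
  example-tenant:                          # placeholder tenant ID
    max_global_series_per_user: 10000000   # example per-tenant in-memory series limit
    ingestion_rate: 500000                 # example ingestion rate limit (samples/s)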
Are there any recommended settings or other changes that would help optimize costs? What should I change to reduce cost while keeping as much performance as possible?
I posted the same question in a discussion but no one responded, so I'm reposting it to the community. Thanks.