Hello team,
We are running a cluster large enough to handle more than 1 TB of uncompressed log data and a large number of metrics.
We are using LID 0002 (Remote Rule Evaluation) to perform rule execution, but it does not seem to be stable, and we are getting the following error:
loki-loki-distributed-ruler-749445c967-5t89k ts=2024-07-28T08:02:14.470716326Z caller=spanlogger.go:86 component=ruler evaluation_mode=remote user=platformpii method=ruler.remoteEvaluation.Query level=warn query_hash=1032185798 query="count_over_time ({container_name=\"pii-vault\"} |= \"latency\"[1m])" instant=2024-07-28T08:02:09.236487818Z response_time=1.797359754s msg="failed to evaluate rule" err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (8393682 vs. 4194304)"
We changed the gRPC settings for ruler_client, query_scheduler.grpc_client_config, ingester_client.grpc_client_config, index_gateway_client.grpc_client_config, frontend_worker.grpc_client_config, and frontend.grpc_client_config to raise the message size limit from the 4 MB default to about 104 MB (104857600 bytes), but we are still getting this error.
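A rough sketch of those overrides, for illustration only. It assumes both max_recv_msg_size and max_send_msg_size were set to 104857600 in every listed block, mirroring the ruler_client values in the ruler configuration further down; the exact settings applied may differ.

```yaml
# Illustrative only: per-client gRPC message limits raised from the 4 MB default.
# The values are assumed to match the ruler_client settings in the ruler block.
query_scheduler:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
index_gateway_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
frontend_worker:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
frontend:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
```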
Can you please help us stabilize our Grafana stack?
Our ruler configuration follows, with sensitive and critical details redacted.
ruler:
  enable_sharding: true
  query_stats_enabled: true
  evaluation_interval: 10s
  evaluation:
    max_jitter: 2s
    mode: remote
    query_frontend:
      address: "dns:///loki-loki-distributed-query-frontend-headless:9095"
  ruler_client:
    grpc_compression: snappy
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
  wal:
    dir: ruler-wal
  wal_cleaner:
    period: 1h
  storage:
    type: s3
    s3:
      bucketnames: redacted
      endpoint: null
      region: region
      access_key_id: Redacted
      secret_access_key: redacted
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  alertmanager_url: http://mimir-alertmanager:8080
  external_url: redacted
  remote_write:
    enabled: true
    clients:
      mimir:
        url: http://mimir-nginx/api/v1/push