Recurring errors when querying Loki data via Grafana

Hello everyone,

We recently installed Promtail on some of our worker servers and are sending all the data to a single Loki instance. Ingestion itself seems to work fine.

When it comes to querying the data, however, we often run into errors like this one in Grafana:

rpc error: code = ResourceExhausted desc = trying to send message larger than max (486002239 vs. 209715200)

I have already increased server.grpc_server_max_recv_msg_size and server.grpc_server_max_send_msg_size to very high values, even though I know that is not a good solution.

Some queries work fine, and then fail again when rerun.

I don’t understand what I can do to avoid needing such high limits in the first place.

Is there maybe a guide on how to set this up properly?

Here’s our current Loki configuration file:

#jinja2:lstrip_blocks: True
---
{{ ansible_managed | comment }}

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_server_max_concurrent_streams: 1024
  grpc_server_max_recv_msg_size: 409715200 # ~400 MB, probably too much, be careful
  grpc_server_max_send_msg_size: 409715200 # ~400 MB, probably too much, be careful
  http_server_write_timeout: 310s
  http_server_read_timeout: 310s

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 409715200  # ~400 MB
    max_send_msg_size: 409715200  # ~400 MB

querier:
  query_timeout: 300s
  engine:
    timeout: 300s
  max_concurrent: 24
          
ingester:
  chunk_encoding: snappy
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 2h
  chunk_target_size: 1536000
  chunk_retain_period: 30s
  max_chunk_age: 2h
  wal:
    dir: "/tmp/wal"

compactor:
  working_directory: /var/lib/loki/compactor
  shared_store: filesystem

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/index
    cache_location: /var/lib/loki/cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /var/lib/loki/chunks

limits_config:
  retention_period: 72h
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  split_queries_by_interval: 15m
  # for big logs tune
  per_stream_rate_limit: 512M
  per_stream_rate_limit_burst: 1024M
  cardinality_limit: 200000
  ingestion_burst_size_mb: 1000
  ingestion_rate_mb: 10000
  max_entries_limit_per_query: 1000000
  max_label_value_length: 20480
  max_label_name_length: 10240
  max_label_names_per_series: 300
  max_query_parallelism: 24

frontend_worker:
  match_max_concurrent: true
  grpc_client_config:
    max_send_msg_size: 409715200

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: true
  retention_period: 336h

Would more Loki instances help? Would storing the logs in S3 help? My real problem is understanding how to set this up so that our developers can use Loki as a reliable tool for querying application logs and metrics.
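
If S3 is the way to go, I imagine the storage part of the config would look roughly like this (bucket name, region and credentials are placeholders, not our actual values):

storage_config:
  aws:
    bucketnames: loki-chunks          # placeholder bucket name
    region: eu-central-1              # placeholder region
    access_key_id: <ACCESS_KEY_ID>    # placeholder, or rely on an IAM role instead
    secret_access_key: <SECRET_KEY>   # placeholder
  boltdb_shipper:
    active_index_directory: /var/lib/loki/index
    cache_location: /var/lib/loki/cache
    cache_ttl: 24h
    shared_store: s3

compactor:
  working_directory: /var/lib/loki/compactor
  shared_store: s3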

Thank you all in advance for your replies!

  1. What is the volume of logs you are sending to Loki?

  2. There is a max_recv_msg_size configuration for frontend_worker as well (see the example snippet after this list).

  3. If you are running a single Loki instance with local file storage, you might consider changing split_queries_by_interval to a longer time frame such as 12h or 24h (also shown below).
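
For points 2 and 3, the relevant config changes would look roughly like this (the receive limit and the 24h split are examples, tune them to your setup):

frontend_worker:
  match_max_concurrent: true
  grpc_client_config:
    max_send_msg_size: 409715200
    max_recv_msg_size: 409715200   # point 2: set the receive limit on the worker as well

limits_config:
  # point 3: fewer, larger splits for a single instance with local file storage
  split_queries_by_interval: 24h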