Configuration for IP aggregation

Hi,

From the access logs received from the load balancers, I would like to aggregate per IP and see, for example, which IP generates the most traffic.
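For reference, the kind of query I'm trying to run looks like this (the label names `job` and `client_ip` and the `bytes_sent` field are illustrative, ours differ slightly):

      topk(10, sum by (client_ip) (
        sum_over_time({job="loadbalancer"} | json | unwrap bytes_sent [5m])
      ))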

Currently I can't make it work in production due to the high number of log lines stored. It does work on a staging cluster, but staging traffic is nearly nothing.

As an example, an instant query issued by the ruler can take up to 30s, for a recording rule that creates a metric from status_code or other non-dynamic labels.
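The recording rule in question is roughly the following (rule group name, metric name, and labels are simplified here, the real ones differ):

      groups:
        - name: loadbalancer
          interval: 1m
          rules:
            - record: loadbalancer:requests:rate5m
              expr: |
                sum by (status_code) (rate({job="loadbalancer"}[5m]))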

I'm not sure whether this is possible at all, or whether it would take many more replicas and splitting the query into smaller periods.
For the moment we are not using Loki for all aggregation requests, but we would love to be able to.
Maybe something is missing in the configuration, or some improvements can be made.

We are using the community Helm charts. Relevant config:

      server:
        grpc_server_max_recv_msg_size: 104857600
        grpc_server_max_send_msg_size: 104857600
        grpc_server_max_concurrent_streams: 2000
        http_server_read_timeout: 5m
        http_server_write_timeout: 5m
      query_range:
        align_queries_with_step: true
        max_retries: 5
        split_queries_by_interval: 15m
        cache_results: true
        results_cache:
          cache:
            memcached_client:
              consistent_hash: true
              host: {{ include "loki.memcachedFrontendFullname" . }}
              max_idle_conns: 16
              service: http
              timeout: 1s
              update_interval: 1m
      frontend:
        log_queries_longer_than: 5s
        compress_responses: true
        tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100
      querier:
        query_timeout: 4m
        max_concurrent: 6
        engine:
          timeout: 4m
      frontend_worker:
        frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095
        grpc_client_config:
          max_send_msg_size: 104857600
        parallelism: 6

Querier replicas and memcached sizing:

  querier:
    replicas: 6
    resources:
      limits:
        memory: 20Gi
      requests:
        cpu: 2
        memory: 20Gi

  queryFrontend:
    replicas: 3
    resources:
      limits:
        memory: 5Gi
      requests:
        cpu: 1
        memory: 5Gi

  memcachedChunks:
    replicas: 9
    extraArgs:
      - -m 29000
      - -I 2m
      - -v
    resources:
      requests:
        memory: 30Gi
      limits:
        memory: 30Gi

The current result is that the querier replicas get OOMKilled and restart after some time (and no results in Grafana), even for a query with a time range of 5m.
I'm not sure whether reducing split_queries_by_interval would help, or whether I should add more querier replicas.