Loki compute sizing and query (topk) performance

Hi all, I have a cluster of 4 x Loki instances running in monolithic mode ( -target=all ).
Each of these compute nodes has the following specification:

  • 2 x CPU cores (arm64, AWS EC2 c6g.large instances, 8 x cores total across all 4 nodes)
  • 4 GB of RAM (16GB total across all 4 nodes)

Loki version is 2.6.1, and is deployed using the precompiled binaries.
We store the logs in AWS S3.

We are currently shipping some access logs to this loki cluster, and we want to pull some metrics data from these logs.
As an example, we want to pull the top 10 IP addresses from these logs, but I am having trouble getting decent performance out of this cluster, so I'm seeking some advice here.
We are investigating replacing our ELK stack with Loki.

The query we are running is a variation of the Top Visitor IPs from the Acquisition and Behaviour section in the Loki/Grafana example dashboard here: Grafana

As an example, I would like to pull the top IPs for the dates 2023-03-06 - 2023-03-09.
We have the following query:

topk(5, sum by (haproxy_ip_address) (count_over_time({component="haproxy",environment="live",platform="web"} |= "ip_address" | json | __error__="" and haproxy_ip_address!="" [$__range])))

If I use this query in Grafana, I can only get the top IP addresses for the last couple of hours at most.
I understand that this operation is expensive, as per this comment: Performance issue - #2 by ewelch

Is there any alternative approach to this type of query?
With this example, I would like to be able to get the top IP addresses for the last 14 days.
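
For reference, here is roughly how I would express the 14-day version as a single instant query via logcli (a sketch only; it assumes logcli's instant-query command copes with a metric query over a range this large, and the --now date is just an example end-of-window):

# sketch: same query as above with [$__range] replaced by [14d];
# --now is the end of the window (example date)
./logcli-linux-arm64 instant-query \
  --now="2023-03-14T00:00:00Z" \
  'topk(5, sum by (haproxy_ip_address) (count_over_time({component="haproxy",environment="live",platform="web"} |= "ip_address" | json | __error__="" and haproxy_ip_address!="" [14d])))'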

An example of what these log lines look like as they are ingested by loki:

{
  "environment": "live",
  "file": "/var/log/haproxy.log",
  "geoip": {
    "city_name": "Sydney",
    "continent_code": "OC",
    "country_code": "AU",
    "country_name": "Australia",
    "latitude": "",
    "longitude": "",
    "metro_code": "",
    "postal_code": "2000",
    "region_code": "NSW",
    "region_name": "New South Wales",
    "timezone": "Australia/Sydney"
  },
  "haproxy": {
    "Tc": "0",
    "Tw": "0",
    "active_request": "0",
    "backend_concurrent_connections": "1",
    "backend_name": "WEB",
    "backend_queue": "0",
    "captured_request_cookie": "-",
    "captured_response_cookie": "-",
    "client_port": "62333",
    "frontend_concurrent_connections": "17",
    "frontend_name": "HTTP",
    "http_method": "GET",
    "http_timestamp": "14/Mar/2023:16:14:13.101",
    "http_version": "HTTP/2.0",
    "ip_address": "8.8.8.8",
    "process_concurrent_connections": "18",
    "referrer": "www.fakewebsite.com",
    "response_time": "0",
    "server_concurrent_connections": "1",
    "server_connection_retries": "0",
    "server_name": "WEB_2",
    "server_queue": "0",
    "size": "287",
    "status_code": "304",
    "termination_state": "--NI",
    "time_to_receive": "0",
    "unix_timestamp": "1678770853",
    "uri": "https://www.fakewebsite.com/siteassets/fake.min.css",
    "user_agent": "like Gecko) Chrome/111.0.0.0 Safari/537.36|2.0||"
  },
  "host": "lb-web-001",
  "message": " 8.8.8.8:62333 1678770853 [14/Mar/2023:16:14:13.101] HTTP WEB/WEB_2 0/0/0/0/0 304 287 - - --NI 18/17/1/1/0 0/0 {www.fakewebsite.com|like Gecko) Chrome/111.0.0.0 Safari/537.36|2.0||} \"GET https://www.fakewebsite.com/siteassets/fake.min.css HTTP/2.0\" 1984",
  "platform": "web",
  "source_type": "file",
  "time": "2023-03-14T05:14:13.101Z"
}

My other question, which ties in to the one above: is the sizing of my Loki cluster large enough?
Or is this a case of needing to throw a lot more compute at the cluster in order to get better performance out of this setup?

Here is some output from logcli hitting the load balancer that sits in front of our Loki cluster, querying for a string (that does not exist) over the same time period as my previous example:

time ./logcli-linux-arm64 query --timezone=UTC --from="2023-03-06T00:00:00Z" --to="2023-03-09T00:00:00Z" '{component="haproxy",environment="live",platform="web"}|= "this-string-does-not-match-anything"' --stats
Ingester.TotalReached            0
Ingester.TotalChunksMatched      0
Ingester.TotalBatches            0
Ingester.TotalLinesSent          0
Ingester.TotalChunksRef          0
Ingester.TotalChunksDownloaded  0
Ingester.ChunksDownloadTime      0s
Ingester.HeadChunkBytes          0 B
Ingester.HeadChunkLines          0
Ingester.DecompressedBytes       0 B
Ingester.DecompressedLines       0
Ingester.CompressedBytes         0 B
Ingester.TotalDuplicates         0
Querier.TotalChunksRef   1277
Querier.TotalChunksDownloaded    1277
Querier.ChunksDownloadTime       35.320173727s
Querier.HeadChunkBytes   0 B
Querier.HeadChunkLines   0
Querier.DecompressedBytes        12 GB
Querier.DecompressedLines        8103575
Querier.CompressedBytes          1.4 GB
Querier.TotalDuplicates          0
Cache.Chunk.Requests             0
Cache.Chunk.EntriesRequested     0
Cache.Chunk.EntriesFound         0
Cache.Chunk.EntriesStored        0
Cache.Chunk.BytesSent            0 B
Cache.Chunk.BytesReceived        0 B
Cache.Index.Requests             0
Cache.Index.EntriesRequested     0
Cache.Index.EntriesFound         0
Cache.Index.EntriesStored        0
Cache.Index.BytesSent            0 B
Cache.Index.BytesReceived        0 B
Cache.Result.Requests            0
Cache.Result.EntriesRequested    0
Cache.Result.EntriesFound        0
Cache.Result.EntriesStored       0
Cache.Result.BytesSent   0 B
Cache.Result.BytesReceived       0 B
Summary.BytesProcessedPerSecond          1.8 GB
Summary.LinesProcessedPerSecond          1160333
Summary.TotalBytesProcessed              12 GB
Summary.TotalLinesProcessed              8103575
Summary.ExecTime                         6.983834629s
Summary.QueueTime                        1.119285312s

real    0m7.019s
user    0m0.021s
sys     0m0.017s

And here is the config that I am currently running:

auth_enabled: false

server:
  http_listen_port: 3100
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
  http_server_idle_timeout: 290s

  grpc_server_max_concurrent_streams: 500
  grpc_server_max_recv_msg_size: 10000000
  grpc_server_max_send_msg_size: 10000000

memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
  - loki_servicediscovery_cluster_name.loki_servicediscovery_endpoint:7946

  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s

frontend:
  max_outstanding_per_tenant: 2048
  compress_responses: false

frontend_worker:
  grpc_client_config:
    max_send_msg_size: 1.048576e+08
  parallelism: 10

querier:
  max_concurrent: 40


ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
    final_sleep: 30s
  chunk_idle_period: 1h
  chunk_retain_period: 0s
  chunk_encoding: snappy
  wal:
    dir: /loki-data/wal

distributor:
  ring:
    kvstore:
      store: memberlist

query_range:
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_bytes: 2GB
        validity: 24h

compactor:
  working_directory: /loki-data/compactor
  shared_store: s3
  retention_enabled: true
  compaction_interval: 30m
  retention_delete_delay: 2h
  retention_delete_worker_count: 150


### Instance and object limit configuration
limits_config:
  split_queries_by_interval: 12h
  enforce_metric_name: true
  reject_old_samples: true
  reject_old_samples_max_age: 168h

  max_global_streams_per_user: 10000
  ingestion_rate_mb: 20
  max_query_series: 100000
  max_query_length: 0
  max_query_parallelism: 32

  
  ### Retention config
  # Global retention period applied if none of the below stream match
  # 31 days
  retention_period: 744h
  retention_stream:

  # 7 days retention for dev logs
  - selector: '{environment="dev"}'
    priority: 1
    period: 168h

  # 14 days retention for ote logs
  - selector: '{environment="ote"}'
    priority: 1
    period: 336h

  # 31 days retention for live logs
  - selector: '{environment="live"}'
    priority: 1
    period: 744h

schema_config:
  configs:
  - from: 2020-05-15
    store: boltdb-shipper
    object_store: s3
    schema: v11
    index:
      prefix: index_
      period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki-data/index
    cache_location: /loki-data/index_cache
    shared_store: s3
    cache_ttl: 168h
    resync_interval: 5m

  max_chunk_batch_size: 300
  aws:
    s3: s3://aws_region/s3_bucket
    region: aws_region

chunk_store_config:
  chunk_cache_config:
    fifocache:
      max_size_bytes: 2GB
      validity: 24h

Many thanks, any advice appreciated.

You have two options: one is to simply lengthen the timeout (if you don’t care about performance), the other is to change how you deploy Loki (at least the reader part). More explanation below.

  1. Look into changing the value of http_server_read_timeout and query_timeout and see if you can make it long enough to finish the query without timing out.

  2. Loki’s read performance comes from distribution. For example, even though you currently have four readers, only one is used for running your query. In order to utilize distribution you need to have a Query Frontend and configure it to either push to the querier instances or have the queriers pull from the QF. You can find some examples of it here: Query Frontend | Grafana Loki documentation

Do note that pull mode is generally recommended. Also you may be able to get QF to work by just adding either frontend_address or downstream_url to the config, but I don’t personally deploy in monolithic mode, so I can’t say for certain.
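
Something like this, roughly (an untested sketch based on the 2.x docs; the addresses are placeholders for your environment, and you would use either 2a or 2b, not both):

server:
  # option 1: give long queries more time before the HTTP server cuts them off
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s

querier:
  # per-query timeout; keep it below the server timeouts above
  query_timeout: 5m

# option 2a (push): the frontend proxies queries to a downstream querier endpoint
frontend:
  downstream_url: http://loki-internal-lb:3100   # placeholder address

# option 2b (pull, generally preferred): queriers pull work from the frontend
frontend_worker:
  frontend_address: loki-query-frontend:9095     # placeholder gRPC address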

Thank you, I got the query-frontend to work in a monolithic deployment.
I added the query-frontend onto the load balancer, and I can certainly see that the queries are distributed to all the cluster members in the ring, so that worked.

I’m still seeing poor performance, but maybe I am expecting too much, considering that there is no indexing (as with ELK) and everything has to be calculated on the fly.
I have seen some crashes when running these queries, guessing they OOM (I have not had time to look into the root cause). I doubled the instance size as well, but same issue (16 cores, 32 GB RAM, split between 4 nodes, EC2 c6g.xlarge).

I guess the only thing I have not tested is to go back to x86_64 and see if the architecture would make a difference.

I will hold off on using Loki until TSDB support is officially announced, and will do some more tests then, I think.

Thank you for your help.

No problem. Curious how much performance gain you’ve gotten.

Couple of other things you might try. For config:

  1. Change compress_responses to true.
  2. Change split_queries_by_interval to a smaller number. We do 30m, but 12h is way too much and is not practical unless most of your queries go for a bigger time window than that.
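
In your config those two changes would look roughly like this (the 30m value is just what works for our query patterns, so treat it as a starting point):

frontend:
  compress_responses: true

limits_config:
  split_queries_by_interval: 30m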

On the deployment side:

  1. Upgrade to Loki 2.7.4 (latest release).
  2. Switch to simple scalable mode for deployment.
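
For context, simple scalable mode is mostly a matter of running the same binary with separate read and write targets instead of -target=all, roughly like this (a sketch; config paths and how many of each you run are up to you):

# write path (ingest); scale for ingest volume
loki -config.file=/etc/loki/config.yml -target=write

# read path (queries); scale these out for query performance
loki -config.file=/etc/loki/config.yml -target=read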

Below is our performance test on 2 query frontend nodes with 5 reader nodes, each with 4 CPU / 8 GB memory (we use simple scalable mode). 25 seconds to process 106 GB of data (this is about 48 hours for us) is honestly pretty good.

Summary.BytesProcessedPerSecond  4.2 GB
Summary.LinesProcessedPerSecond  22707494
Summary.TotalBytesProcessed 	 106 GB
Summary.TotalLinesProcessed 	 568624865
Summary.ExecTime 		 25.041286429s
Summary.QueueTime 		 0s
