How to improve Loki Performance

Hello Loki Team!

I would like some help tweaking the performance of our Loki setup. I’ve read blog posts about blazing speeds but can’t seem to achieve this. The read path is what I am trying to optimize.

The current setup is Loki 2.0 in microservices mode, deployed to AWS with S3 for chunks and DynamoDB for the index. We are using memberlist for the ring. ElastiCache Redis has been added as the cache.

We are running 2 query frontends, 3 distributors, 3 ingesters, 6 queriers and 1 table manager as containers. These all run in ECS Fargate, which may be part of the problem. chunk_retain_period is very low because Fargate does not have persistent storage. My thought is that the queriers have to pull everything from S3, but that would be the case for any query reaching back further than your chunk_retain_period. The benchmark in the above-mentioned blog post covers a 1 hour period; perhaps those chunks had not been flushed yet and were served from the ingesters on persistent storage rather than from S3?

Anyway, please see my benchmarks below, and config further down.

Some benchmarking… (only queriers were scaled out)

6 Queriers over 48h of logs just straight up timed out.

12 Queriers…
time ./logcli query '{Environment="production",Application="test-app",Deployment="production",Service="test-service",LogGroup="test-service-logs"} |~ "(?i)pppppppppppp"' --since=48h --stats

Ingester.TotalReached 12
Ingester.TotalChunksMatched 72
Ingester.TotalBatches 0
Ingester.TotalLinesSent 0
Ingester.HeadChunkBytes 923 kB
Ingester.HeadChunkLines 755
Ingester.DecompressedBytes 85 MB
Ingester.DecompressedLines 63807
Ingester.CompressedBytes 14 MB
Ingester.TotalDuplicates 0
Store.TotalChunksRef 8144
Store.TotalChunksDownloaded 8144
Store.ChunksDownloadTime 8m38.330467263s
Store.HeadChunkBytes 0 B
Store.HeadChunkLines 0
Store.DecompressedBytes 59 GB
Store.DecompressedLines 44850257
Store.CompressedBytes 9.6 GB
Store.TotalDuplicates 0
Summary.BytesProcessedPerSecond 375 MB
Summary.LinesProcessedPerSecond 285792
Summary.TotalBytesProcessed 59 GB
Summary.TotalLinesProcessed 44914819
Summary.ExecTime 2m37.158656521s

real 2m37.267s
user 0m0.119s
sys 0m0.016s

24 Queriers…
Ingester.TotalReached 12
Ingester.TotalChunksMatched 56
Ingester.TotalBatches 0
Ingester.TotalLinesSent 0
Ingester.HeadChunkBytes 1.0 MB
Ingester.HeadChunkLines 841
Ingester.DecompressedBytes 84 MB
Ingester.DecompressedLines 63012
Ingester.CompressedBytes 14 MB
Ingester.TotalDuplicates 0
Store.TotalChunksRef 8122
Store.TotalChunksDownloaded 8122
Store.ChunksDownloadTime 5m31.764491701s
Store.HeadChunkBytes 0 B
Store.HeadChunkLines 0
Store.DecompressedBytes 59 GB
Store.DecompressedLines 44704040
Store.CompressedBytes 9.6 GB
Store.TotalDuplicates 0
Summary.BytesProcessedPerSecond 500 MB
Summary.LinesProcessedPerSecond 380884
Summary.TotalBytesProcessed 59 GB
Summary.TotalLinesProcessed 44767893
Summary.ExecTime 1m57.536630781s

real 1m57.730s
user 0m0.104s
sys 0m0.009s

There are some gains here, but I am still not seeing the performance showcased in the blog. I have the /metrics endpoint being ingested if you would like to see something specific.
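
As a quick sanity check on the two runs above, the reported throughput can be recomputed from the total bytes and the execution time. This is only a sketch: it assumes logcli reports sizes in decimal units (1 GB = 1000 MB), it works from the rounded figures printed in the stats, and the function names are made up for illustration.

```python
def throughput_mb_per_s(total_gb: float, exec_seconds: float) -> float:
    """Bytes processed per second, in MB/s, from a total in GB."""
    return total_gb * 1000.0 / exec_seconds


def speedup(baseline_seconds: float, scaled_seconds: float) -> float:
    """How many times faster the scaled run finished than the baseline."""
    return baseline_seconds / scaled_seconds


# Summary.ExecTime from the two runs above, converted to seconds.
t12 = 2 * 60 + 37.158656521   # 12 queriers
t24 = 1 * 60 + 57.536630781   # 24 queriers

print(f"12 queriers: {throughput_mb_per_s(59, t12):.0f} MB/s")  # 375 MB/s
print(f"24 queriers: {throughput_mb_per_s(59, t24):.0f} MB/s")  # 502 MB/s
print(f"speedup from doubling queriers: {speedup(t12, t24):.2f}x")  # 1.34x
```

Doubling the queriers from 12 to 24 made the query about 1.34x faster rather than 2x, so the querier count is not the only limit in this setup.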

Config below, LOKI_ prefixed vars are replaced on deploy.

auth_enabled: false

server:
  http_listen_address: 0.0.0.0
  http_listen_port: LOKI_HTTP_LISTEN_PORT
  grpc_listen_address: 0.0.0.0
  grpc_listen_port: LOKI_GRPC_LISTEN_PORT
  http_server_read_timeout: 3m
  http_server_write_timeout: 3m
  log_level: LOKI_LOG_LEVEL

memberlist:
  bind_port: LOKI_HTTP_MEMBERLIST_LISTEN_PORT

  join_members:
    - dns+LOKI_QUERIER_RECORD_NAME.LOKI_SERVICE_DISCOVERY_NAMESPACE:LOKI_HTTP_MEMBERLIST_LISTEN_PORT
    - dns+LOKI_DISTRIBUTOR_RECORD_NAME.LOKI_SERVICE_DISCOVERY_NAMESPACE:LOKI_HTTP_MEMBERLIST_LISTEN_PORT
    - dns+LOKI_INGESTER_RECORD_NAME.LOKI_SERVICE_DISCOVERY_NAMESPACE:LOKI_HTTP_MEMBERLIST_LISTEN_PORT

  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s

ingester:
  lifecycler:
    join_after: 60s
    observe_period: 5s
    interface_names:
      - "eth1"

    ring:
      kvstore:
        store: memberlist
      replication_factor: 3

    final_sleep: 0s

  chunk_idle_period: 15m
  max_chunk_age: 1h
  chunk_retain_period: 30s
  max_transfer_retries: 0
  chunk_target_size: 1536000
  chunk_block_size: 262144

querier:
  query_timeout: 2m
  query_ingesters_within: 2h

query_range:
  split_queries_by_interval: 30m
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true

  results_cache:
    cache:
      redis:
        endpoint: LOKI_REDIS_ENDPOINT
        timeout: 1s
        db: 0

schema_config:
  configs:
    - from: 2020-09-01
      store: aws
      object_store: aws
      schema: v11
      index:
        prefix: LOKI_LOG_TABLE_PREFIX

        period: 168h

storage_config:
  aws:
    s3: s3://LOKI_AWS_REGION/LOKI_LOG_BUCKET_NAME

    dynamodb:
      dynamodb_url: dynamodb://LOKI_AWS_REGION

  index_cache_validity: 14m
  index_queries_cache_config:
    redis:
      endpoint: LOKI_REDIS_ENDPOINT
      timeout: 1s
      db: 1

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24
  max_entries_limit_per_query: 50000
  max_query_parallelism: 12

chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: LOKI_REDIS_ENDPOINT
      timeout: 1s
      db: 2

  write_dedupe_cache_config:
    redis:
      endpoint: LOKI_REDIS_ENDPOINT
      timeout: 1s
      db: 3

  cache_lookups_older_than: 36h
  max_look_back_period: 672h

table_manager:
  index_tables_provisioning:
    enable_ondemand_throughput_mode: true
    enable_inactive_throughput_on_demand_mode: true

  retention_deletes_enabled: true
  retention_period: 672h

frontend:
  log_queries_longer_than: 5s
  downstream_url: https://LOKI_QUERIER_RECORD_NAME.LOKI_PRIVATE_ZONE_DOMAIN_NAME
  compress_responses: true
  max_outstanding_per_tenant: 3600

There are a couple settings for increasing parallelism, this is likely what you want to change:

Make sure your queriers are connected to your query frontends via the gRPC worker configuration: define frontend_address in the frontend_worker config of the querier, and do NOT define downstream_url in the frontend config.

The querier worker pool configuration will be capable of more parallelism and better scheduling.

Next you want to look at the parallelism setting in your frontend_worker section, typically setting it to the number of cores your querier pod has access to on the machine it’s running on.

frontend_worker:
    frontend_address: query-frontend.loki-ops.svc.cluster.local:9095
    grpc_client_config:
        max_send_msg_size: 1.048576e+08
    parallelism: 6

Also then look at max_query_parallelism in the limits_config (this is applied to the query frontend)

limits_config:
    max_query_parallelism: 32

Loki limits how much work is scheduled in parallel as a tradeoff, to avoid processing a lot of data that might not be needed. In theory you could set this to the number of queriers * the parallelism setting, so that every query is processed in parallel with all available resources. However, if you are querying for logs, you might find the first 1000 log lines needed to fulfill the request in the first split, and all the other parallel work would have been wasted.

A value of 16 or 32 is probably reasonable here and should somewhat reflect how much work you can even do in parallel based on queriers * parallelism
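
The sizing above can be put into numbers. The following is a simplified model, not how Loki schedules work internally: it ignores query sharding, the function names are invented for illustration, and parallelism=1 per querier is an assumption (the original post does not configure frontend_worker at all).

```python
import math


def subquery_count(range_minutes: int, split_minutes: int) -> int:
    """Number of time-split sub-queries a range query is divided into."""
    return math.ceil(range_minutes / split_minutes)


def worker_slots(queriers: int, parallelism: int) -> int:
    """Upper bound on sub-queries the querier pool can run at once."""
    return queriers * parallelism


def scheduling_waves(subqueries: int, max_query_parallelism: int, slots: int) -> int:
    """Rounds needed when at most min(limit, slots) sub-queries run together."""
    concurrent = min(max_query_parallelism, slots)
    return math.ceil(subqueries / concurrent)


# From the original post: --since=48h with split_queries_by_interval: 30m.
subqueries = subquery_count(48 * 60, 30)

# 12 queriers, max_query_parallelism: 12 (assumed parallelism of 1 each).
print(subqueries, scheduling_waves(subqueries, 12, worker_slots(12, 1)))

# Raising the limit to 32 without adding worker slots changes nothing here.
print(scheduling_waves(subqueries, 32, worker_slots(12, 1)))

# 48 single-worker queriers with the limit raised to match.
print(scheduling_waves(subqueries, 48, worker_slots(48, 1)))
```

In this model the limit and the number of worker slots only help when they are raised together, which is why max_query_parallelism should roughly reflect queriers * parallelism.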

Let us know if this helps!

Hi @ewelch,
I’m actually chasing the same goal here: improving query times. I’m already using your suggested gRPC frontend worker configuration and achieving ~4 GB/s with 16 queriers, so not bad. However, I cannot go beyond that when scaling out to 32 or even 40 querier pods, so I’m looking for pointers on what to tune next.

Here are some test results, all with 3 ingesters.

Query frontend config:

    query_range:
      # make queries more cache-able by aligning them with their step intervals
      align_queries_with_step: true
      max_retries: 5
      # parallelize queries in 5min intervals
      split_queries_by_interval: 5m
      cache_results: true
      results_cache:
        cache:
          # We're going to use the in-process "FIFO" cache
          enable_fifocache: true
          fifocache:
            max_size_bytes: 2GB
            validity: 24h

...
    frontend:
      log_queries_longer_than: 15s
      compress_responses: true

    limits_config:
      max_cache_freshness_per_query: '10m'
      max_query_parallelism: 120

On the queriers I also have:

    frontend_worker:
      frontend_address: loki-query-frontend-svc.default.svc.cluster.local:9095
      grpc_client_config:
        max_send_msg_size: 1.048576e+08
      parallelism: 1 # **not sure about this, I've 1vCPU per pod**

Results:

# 8 queriers
 time logcli query '{host="FIREWALL"}|= "pppppppp" ' --since=8h --stats
2021-03-05 13:41:22.121108 I | proto: duplicate proto type registered: ingester.Series 
https://loki-query-ing/loki/api/v1/query_range?direction=BACKWARD&end=1614948082121988688&limit=30&query=%7Bhost%3D%22FIREWALL%22%7D%7C%3D+%22pppppppp%22+&start=1614919282121988688
Ingester.TotalReached            48                                                                      
Ingester.TotalChunksMatched      821                                                                     
Ingester.TotalBatches            0                                                                       
Ingester.TotalLinesSent          0                                                                       
Ingester.HeadChunkBytes          496 kB                                                                  
Ingester.HeadChunkLines          1690                                                                    
Ingester.DecompressedBytes       422 MB                                                                  
Ingester.DecompressedLines       504281                                                                  
Ingester.CompressedBytes         62 MB                                                                   
Ingester.TotalDuplicates         0                                                                       
Store.TotalChunksRef             8850                                                                    
Store.TotalChunksDownloaded      8850                                                                    
Store.ChunksDownloadTime         1m52.629691823s                                                         
Store.HeadChunkBytes             0 B
Store.HeadChunkLines             0
Store.DecompressedBytes          79 GB
Store.DecompressedLines          94729736
Store.CompressedBytes            11 GB                                                                   
Store.TotalDuplicates            0                                                                       
Summary.BytesProcessedPerSecond          3.3 GB                                                          
Summary.LinesProcessedPerSecond          3911528                                                                                                                                                                   
Summary.TotalBytesProcessed              80 GB                                                           
Summary.TotalLinesProcessed              95235707                                                        
Summary.ExecTime                         24.347441285s  




# 16 queriers
time logcli query '{host="FIREWALL"}|= "pppppppp" ' --since=8h --stats
2021-03-05 13:37:54.302091 I | proto: duplicate proto type registered: ingester.Series
https://loki-query-ing/loki/api/v1/query_range?direction=BACKWARD&end=1614947874302516692&limit=30&query=%7Bhost%3D%22FIREWALL%22%7D%7C%3D+%22pppppppp%22+&start=1614919074302516692
Ingester.TotalReached            48
Ingester.TotalChunksMatched      892
Ingester.TotalBatches            0
Ingester.TotalLinesSent          0
Ingester.HeadChunkBytes          368 kB
Ingester.HeadChunkLines          1283
Ingester.DecompressedBytes       394 MB
Ingester.DecompressedLines       469502
Ingester.CompressedBytes         57 MB
Ingester.TotalDuplicates         0
Store.TotalChunksRef             8893
Store.TotalChunksDownloaded      8893
Store.ChunksDownloadTime         1m36.277325627s
Store.HeadChunkBytes             0 B
Store.HeadChunkLines             0
Store.DecompressedBytes          80 GB
Store.DecompressedLines          95259745
Store.CompressedBytes            11 GB
Store.TotalDuplicates            0
Summary.BytesProcessedPerSecond          4.6 GB
Summary.LinesProcessedPerSecond          5496657
Summary.TotalBytesProcessed              80 GB
Summary.TotalLinesProcessed              95730530
Summary.ExecTime                         17.416136328s



# 32 queriers
time logcli query '{host="FIREWALL"}|= "pppppppp" ' --since=8h --stats
2021-03-05 13:44:53.564897 I | proto: duplicate proto type registered: ingester.Series
https://loki-query-ing/loki/api/v1/query_range?direction=BACKWARD&end=1614948293565357509&limit=30&query=%7Bhost%3D%22FIREWALL%22%7D%7C%3D+%22pppppppp%22+&start=1614919493565357509
Ingester.TotalReached            48
Ingester.TotalChunksMatched      858
Ingester.TotalBatches            0
Ingester.TotalLinesSent          0
Ingester.HeadChunkBytes          1.2 MB
Ingester.HeadChunkLines          2905
Ingester.DecompressedBytes       381 MB
Ingester.DecompressedLines       456183
Ingester.CompressedBytes         55 MB
Ingester.TotalDuplicates         0
Store.TotalChunksRef             8940
Store.TotalChunksDownloaded      8940
Store.ChunksDownloadTime         2m34.24104511s
Store.HeadChunkBytes             0 B
Store.HeadChunkLines             0
Store.DecompressedBytes          80 GB
Store.DecompressedLines          95713844
Store.CompressedBytes            11 GB
Store.TotalDuplicates            0
Summary.BytesProcessedPerSecond          3.4 GB
Summary.LinesProcessedPerSecond          4114070
Summary.TotalBytesProcessed              80 GB
Summary.TotalLinesProcessed              96172932
Summary.ExecTime                         23.376589511s
logcli query '{host="FIREWALL"}|= "pppppppp" ' --since=8h --stats  0.14s user 0.01s system 0% cpu 23.743 total


# 40 queriers
time logcli query '{host="FIREWALL"}|= "pppppppp" ' --since=8h --stats
2021-03-05 13:25:33.245620 I | proto: duplicate proto type registered: ingester.Series
https://loki-query-ing/loki/api/v1/query_range?direction=BACKWARD&end=1614947133246114430&limit=30&query=%7Bhost%3D%22FIREWALL%22%7D%7C%3D+%22pppppppp%22+&start=1614918333246114430
Ingester.TotalReached            48
Ingester.TotalChunksMatched      948
Ingester.TotalBatches            0
Ingester.TotalLinesSent          0
Ingester.HeadChunkBytes          660 kB
Ingester.HeadChunkLines          2036
Ingester.DecompressedBytes       451 MB
Ingester.DecompressedLines       539221
Ingester.CompressedBytes         65 MB
Ingester.TotalDuplicates         0
Store.TotalChunksRef             8819
Store.TotalChunksDownloaded      8819
Store.ChunksDownloadTime         2m13.735993613s
Store.HeadChunkBytes             0 B
Store.HeadChunkLines             0
Store.DecompressedBytes          79 GB
Store.DecompressedLines          94423133
Store.CompressedBytes            11 GB
Store.TotalDuplicates            0
Summary.BytesProcessedPerSecond          4.4 GB
Summary.LinesProcessedPerSecond          5302292
Summary.TotalBytesProcessed              79 GB
Summary.TotalLinesProcessed              94964390
Summary.ExecTime                         17.910063025s
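
To compare runs like the ones above without reading the numbers by eye, the --stats output can be parsed with a few lines of Python. This is a minimal sketch that assumes the "Section.Name   value" layout and Go-style durations shown in the dumps; parse_stats and parse_duration are hypothetical helper names, not part of logcli.

```python
import re

# Go-style duration units as they appear in the dumps, e.g. "1m52.629691823s".
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_STAT_KEY = re.compile(r"^(Ingester|Store|Summary)\.\w+$")


def parse_duration(text: str) -> float:
    """Convert a Go-style duration string to seconds."""
    parts = _DURATION_PART.findall(text)
    if not parts:
        raise ValueError(f"not a duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def parse_stats(output: str) -> dict:
    """Map each statistics line to its raw value string, skipping other lines."""
    stats = {}
    for line in output.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and _STAT_KEY.match(fields[0]):
            stats[fields[0]] = fields[1].strip()
    return stats


# Three lines copied from the 8-querier run above.
sample = """\
Store.TotalChunksDownloaded      8850
Store.ChunksDownloadTime         1m52.629691823s
Summary.ExecTime                 24.347441285s
"""

stats = parse_stats(sample)
download = parse_duration(stats["Store.ChunksDownloadTime"])
wall = parse_duration(stats["Summary.ExecTime"])
print(f"download time / wall time = {download / wall:.1f}")
```

Store.ChunksDownloadTime comes out several times larger than Summary.ExecTime, which suggests it is accumulated across parallel workers rather than measured as wall-clock time.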

I was thinking about moving away from FIFO caching (using ramdisks atm) as the next thing to change.

Make sure your queriers are connected to your query frontends via the gRPC worker configuration: define frontend_address in the frontend_worker config of the querier, and do NOT define downstream_url in the frontend config.

@ewelch What’s the performance implication if I’m doing it the other way around, using downstream_url to point the frontend at the queriers, instead of frontend_address to point the queriers at the frontend?

Also, as far as the value of frontend_address goes, is it OK for that to be a load balancer? Or do the workers need the specific address records of the frontends?

Thanks @ewelch

This has helped a lot and I am seeing some decent gains. Currently I have scaled to 48 queriers with a single vCPU each, so parallelism is set to 1 with max_query_parallelism set to 48 (for benchmarking). This has given me up to 2.7 GB/s in some of my benchmarks, but usually averaging a little lower than that. I’m still looking for some more gains.

I noticed that the Redis metrics show Cache Hit % at 0… I need to investigate this, as I don’t believe the cache is working correctly. I see Redis is flagged as experimental; should I switch to Memcached? Can you see any cache misconfigurations in my original config?

Caching can make a huge difference, especially because with small split_queries_by_interval values there is often a lot of overlap in the chunks processed by different queriers. (A downside of setting this too small is that some duplicate work is done, but caching helps a lot here.)
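
To get a feel for how much overlap a small split interval creates, here is a rough sketch. It assumes every sub-query fetches each chunk that overlaps its time window, and it uses a chunk spanning a full hour (the max_chunk_age: 1h from the original config) as an illustrative worst case. The function name is made up.

```python
import math


def splits_touching_chunk(chunk_start_min: float, chunk_end_min: float,
                          split_minutes: float) -> int:
    """Count split windows [k*split, (k+1)*split) that overlap the chunk."""
    first = math.floor(chunk_start_min / split_minutes)
    last = math.ceil(chunk_end_min / split_minutes)
    return last - first


# A one-hour chunk starting 2 minutes past a split boundary.
print(splits_touching_chunk(2, 62, 5))    # 5m splits
print(splits_touching_chunk(2, 62, 30))   # 30m splits

# The same chunk aligned exactly on split boundaries.
print(splits_touching_chunk(0, 60, 5))
```

With 5m splits a single hour-long chunk can be requested by a dozen or more sub-queries, so without a working chunk cache it may be downloaded from object storage that many times.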

I would make sure you have a chunk cache, results cache, and index read cache.

We haven’t had any opportunity to experiment with Redis. We use Memcached and it works really well; however, we see probably 10% failures related to how Memcached does internal memory slabbing with the large objects (~1MB) we try to put in it. It wasn’t really designed for objects this size. We’d love to see if Redis is a better fit here but haven’t tried it yet.

The type of query affects performance as well. Filter queries are the fastest. When you start parsing, json and logfmt are much faster than regexp, and when you start doing metrics on top of this it adds more computational load and can slow things down a little more. The most effective setting for this is going to be split_queries_by_interval, though you are already at 5m; we have never really tried going lower than this.

Another note is that we currently don’t shard queries sent to ingesters, honestly because of the complexity and it’s just not a problem we’ve tackled yet. So queries to the ingesters for recent data are only split by the split_queries_by_interval setting.

Related to this, it’s good to make sure you only send relevant queries to the ingesters by setting the query_ingesters_within flag to something big enough to make sure all your chunks are flushed (at least bigger than max_chunk_age):

querier:
    query_ingesters_within: 3h

This saves some unnecessary query work being sent to the ingesters.
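
The effect of this setting can be sketched as a simple time comparison. This is a simplified model of the decision, not Loki's actual code, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def needs_ingesters(query_end: datetime, now: datetime,
                    query_ingesters_within: timedelta) -> bool:
    """True if any part of the query falls inside the recent window."""
    return query_end >= now - query_ingesters_within


def window_covers_unflushed_chunks(query_ingesters_within: timedelta,
                                   max_chunk_age: timedelta) -> bool:
    """The window should be larger than max_chunk_age, as advised above."""
    return query_ingesters_within > max_chunk_age


now = datetime(2021, 3, 5, 12, 0, tzinfo=timezone.utc)  # arbitrary fixed time
window = timedelta(hours=3)

# A query ending 1 hour ago may still need data held by the ingesters.
print(needs_ingesters(now - timedelta(hours=1), now, window))

# A query ending 6 hours ago can be answered from the store alone.
print(needs_ingesters(now - timedelta(hours=6), now, window))

# The original config: query_ingesters_within 2h against max_chunk_age 1h.
print(window_covers_unflushed_chunks(timedelta(hours=2), timedelta(hours=1)))
```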

What’s the performance implication if I’m doing it the other way around, using downstream_url to point the frontend at the queriers, instead of frontend_address to point the queriers at the frontend?

Not sure about performance, but in general the downstream_url setting exists because the bulk of the frontend code comes from Cortex, where it was added to support a mode in which Cortex could send queries directly to a Prometheus server. It’s not the desired way to configure Cortex/Loki for proper query parallelization and isn’t something we really test or have experience with. Using the workers configuration has additional benefits like per-tenant queues and better query load sharing in a multi-tenant system.