Help with Loki config - query slowness

Hi there!

I'm unable to achieve decent performance with Loki 2.7.1 (binary install) and need some help troubleshooting. I don't know whether the query slowness I'm facing is due to the Loki config, the hardware requirements, or the query itself.

Here is an example query I use to investigate with logcli (--since=168h) that reveals poor stats (cache disabled for these tests):

count by(isApp) (count_over_time({env="production", artifact="MY_SERVER_NAME", isApp=~"true|false"} [1h]))

And here are the results stats:

Ingester.TotalReached            16
Ingester.TotalChunksMatched      8
Ingester.TotalBatches            15
Ingester.TotalLinesSent          95
Ingester.TotalChunksRef          83
Ingester.TotalChunksDownloaded  83
Ingester.ChunksDownloadTime      4.579255398s
Ingester.HeadChunkBytes          77 B
Ingester.HeadChunkLines          7
Ingester.DecompressedBytes       2.9 kB
Ingester.DecompressedLines       92
Ingester.CompressedBytes         4.0 kB
Ingester.TotalDuplicates         2
Querier.TotalChunksRef   8856
Querier.TotalChunksDownloaded    8856
Querier.ChunksDownloadTime       37.078656774s
Querier.HeadChunkBytes   0 B
Querier.HeadChunkLines   0
Querier.DecompressedBytes        311 kB
Querier.DecompressedLines        9837
Querier.CompressedBytes          434 kB
Querier.TotalDuplicates          0
Summary.BytesProcessedPerSecond          86 kB
Summary.LinesProcessedPerSecond          2710
Summary.TotalBytesProcessed              314 kB
Summary.TotalLinesProcessed              9936
Summary.ExecTime                         3.665225301s
Summary.QueueTime                        0s

First of all, don't you think these stats are poor? 3.6 seconds to process fewer than 10K lines (about 311 kB) is nothing like the numbers I see online, even in threads where slowness is the topic.
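To put numbers on it, the summary lines are at least self-consistent; a quick sanity check using the values quoted above:

```python
# Quick sanity check of the Summary stats quoted above.
total_bytes = 314_000        # Summary.TotalBytesProcessed (314 kB)
total_lines = 9_936          # Summary.TotalLinesProcessed
exec_time_s = 3.665225301    # Summary.ExecTime

bytes_per_s = total_bytes / exec_time_s  # ~86 kB/s, matches Summary.BytesProcessedPerSecond
lines_per_s = total_lines / exec_time_s  # ~2710 lines/s, matches Summary.LinesProcessedPerSecond
print(f"{bytes_per_s:.0f} B/s, {lines_per_s:.0f} lines/s")
```

So the throughput figure really is on the order of 86 kB/s, which seems very low.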

I think I missed something.

Here is my Loki config:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  http_server_read_timeout: 3m
  grpc_server_max_recv_msg_size: 15194304
  grpc_server_max_send_msg_size: 15194304

common:
  path_prefix: /tmp/loki
  storage:
    filesystem: null
    gcs:
      bucket_name: MY_BUCKET_NAME
      service_account: MY_BUCKET_SERVICE_ACCOUNT
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

query_range:
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 512

frontend_worker:
  frontend_address: localhost:9096
  parallelism: 4
  grpc_client_config:
    max_send_msg_size: 1.048576e+08

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: gcs
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/index
    shared_store: gcs
    cache_location: /data/loki/index_cache

As you can see, my logs are stored in a GCS bucket, with no slowness issues on the write path.
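Given that most of the query time seems to be spent downloading chunks from GCS, once I'm done testing with caches disabled I plan to re-enable a chunk cache along these lines (a sketch; the size value is an assumption, not something I've tuned):

```yaml
# Sketch: embedded chunk cache so repeated queries don't
# re-download the same chunks from GCS (max_size_mb is a guess)
chunk_store_config:
  chunk_cache_config:
    embedded_cache:
      enabled: true
      max_size_mb: 1024
```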

My Loki instance (and Grafana) runs on a small machine; do you think that could be the root cause? (CPU usage looks fine during the slow queries.)

  • Ubuntu 20.04 x86/64
  • 50GB SSD
  • GCP Instance type e2-standard-2 (2 vCPU, 8GB Memory)

Edit: I tried on a local machine with an 8-core CPU, still using GCS; results are still slow.

Looks like you have only one instance, is that correct? If so, that may be the cause of the slowness. Loki's performance comes from distribution; you really need more than one reader behind a query frontend to get the sort of performance you want. This is what our test looks like with 2 query frontends and 3 queriers:

Summary.BytesProcessedPerSecond  803 MB
Summary.LinesProcessedPerSecond  3579091
Summary.TotalBytesProcessed      3.4 GB
Summary.TotalLinesProcessed      15189251
Summary.ExecTime                 4.243884866s
Summary.QueueTime                0s
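In case it helps, the rough shape of such a deployment is several Loki processes sharing one config file, each started with a different target (a sketch only, assuming the `-target` flag of Loki 2.x; a real setup also needs a shared ring, e.g. memberlist, instead of the single-node inmemory kvstore):

```shell
# Sketch: split read and write roles across processes (Loki 2.x targets).
# All instances share the same object store (GCS) and schema config.
loki -config.file=loki.yaml -target=write   # ingesters + distributors
loki -config.file=loki.yaml -target=read    # query frontend + queriers
loki -config.file=loki.yaml -target=read    # second reader for parallelism
```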

Yes, I use only one instance. Thanks for the suggestion; I'll try running more, but I don't know how yet and am looking for documentation on it.

But I think there is another cause on the GCS side. The stat that makes me think so (or that I don't understand) is this one:

Querier.ChunksDownloadTime       37.078656774s

It's so weird: how can I have 37s of ChunksDownloadTime with a 3.6s ExecTime? :upside_down_face: