Loki Explore: fetching labels times out and process dies when no recent logs are available

Hello,

I have Loki 3.1 running in monolithic mode on a single instance via Docker. The storage backend is AWS S3. When I go to Grafana's Explore page to query Loki with the time period set to the last 5 minutes and open the dropdown to see the available labels, the label loading just spins until it times out (about a minute), and then the Loki process dies. This only occurs when there have been no logs within the past 5 minutes, even though there were logs a few hours before. If there are logs within the past 5 minutes, the labels dropdown populates as expected.

I would expect the labels either to come back empty immediately, or for Loki to be smart enough to return the labels present in the logs written a few hours earlier. What is the expected Loki behavior here?

A similar issue occurs when actually running a query. If I enter the query directly in the code editor with the time period set to the last 5 minutes, and there are no logs in that window, the query spins until it times out and the process dies shortly after, rather than immediately coming back empty. Similarly, if I set the time period to the last 30 days, it spins until it times out even though there are logs over that period.

Another observation: if I have recent logs, I am able to search them. I can progressively increase the time interval up to a point (say, 7 days) at which it times out. On other attempts, I was able to search all the way back to 30 days without timing out. There are not that many unique log entries to return.

The Loki configuration:

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: warn
  grpc_server_max_concurrent_streams: 1000

common:
  path_prefix: /tmp/loki
  storage:
    s3:
      bucketnames: my-bucket-name
      region: us-west-2
      endpoint: s3.us-west-2.amazonaws.com
      sse:
        type: SSE-S3
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

limits_config:
  retention_period: 2160h # 90 days 

compactor:
  working_directory: /loki/retention
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  delete_request_store: aws

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  aws:
    bucketnames: my-bucket-name
    endpoint: s3.us-west-2.amazonaws.com
    region: us-west-2
    sse:
      type: SSE-S3

schema_config:
  configs:
    - from: 2020-01-01
      store: tsdb
      object_store: aws
      schema: v13
      index:
        prefix: index_
        period: 24h

analytics:
  reporting_enabled: false

The docker command to run Loki:

sudo docker run --detach --name loki --restart unless-stopped --network observability --mount type=bind,src=/etc/loki/loki.yaml,dst=/etc/loki/loki.yaml,readonly -p 3100:3100 --log-driver local grafana/loki:3.1.0 -config.file=/etc/loki/loki.yaml -log-config-reverse-order

The AWS S3 policy:

{ 
    "Version": "2012-10-17", 
    "Statement": [ 
        { 
            "Sid": "LokiStorage", 
            "Effect": "Allow", 
            "Action": [ 
                "s3:ListBucket", 
                "s3:PutObject", 
                "s3:GetObject", 
                "s3:DeleteObject" 
            ], 
            "Resource": [ 
                "arn:aws:s3:::my-bucket-name", 
                "arn:aws:s3:::my-bucket-name/*" 
            ] 
        } 
    ] 
} 

The EC2 instance has an IAM instance profile with a role that has this policy attached.

Loki logs when the label query finally times out (after which the process dies):

RequestCanceled: request context canceled
caused by: context deadline exceeded
error initialising module: store
github.com/grafana/dskit/modules.(*Manager).initModule
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
	/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
	/src/loki/pkg/loki/loki.go:458
main.main
	/src/loki/cmd/loki/main.go:129
runtime.main
	/usr/local/go/src/runtime/proc.go:271
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1695
level=error ts=2024-08-16T20:28:16.957048386Z caller=index_set.go:306 table-name=index_19951 msg="sync failed, retrying it" err="failed to get s3 object: RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
level=error ts=2024-08-16T20:28:16.957266279Z caller=cached_client.go:275 msg="failed to build table cache" table_name=index_19951 err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
level=error ts=2024-08-16T20:28:16.957281769Z caller=index_set.go:306 table-name=index_19951 msg="sync failed, retrying it" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
ts=2024-08-16T20:28:16.957290159Z caller=spanlogger.go:109 table-name=index_19951 method=indexSet.Init level=error msg="failed to initialize table, cleaning it up" table=index_19951 err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
level=error ts=2024-08-16T20:28:16.957709875Z caller=log.go:216 msg="error running loki" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded\nerror initialising module: store\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108\ngithub.com/grafana/loki/v3/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:458\nmain.main\n\t/src/loki/cmd/loki/main.go:129\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"

If I bypass Grafana and do the labels query with logcli instead:

$ ./logcli-linux-amd64 labels
2024/08/16 20:42:43 http://localhost:3100/loki/api/v1/labels?end=1723840963266126671&start=1723837363266126671
2024/08/16 20:43:38 error sending request Get "http://localhost:3100/loki/api/v1/labels?end=1723840963266126671&start=1723837363266126671": EOF
2024/08/16 20:43:38 Error doing request: run out of attempts while querying the server

And then Loki dies.

Is there something wrong with the configuration?

Any help would be greatly appreciated! Thank you!

EDIT: The setup works correctly if I point Loki to the example filesystem-based config instead of S3.
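
For reference, the filesystem variant only swaps the storage block under common; roughly this, going from memory of the example config, so treat the exact paths as placeholders:

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks   # chunks written to local disk instead of S3
      rules_directory: /tmp/loki/rules

Everything else in the config stays the same.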

How many resources do you have available on the EC2 instance where the Loki container is running?

Hi @tonyswumac !

It is running on a t3a.micro: 1 GB of memory, 2 vCPUs burstable to 2.5 GHz, and network bandwidth burstable to 5 Gbps. Looking at the EC2 instance's monitoring graphs, it doesn't appear the CPU was ever taxed enough to use up any CPU credits.

I had tried running on a t3a.medium, but it didn't seem to make a difference, suggesting memory was likely not the issue.

The same box is running containers for a monolithic Mimir, Grafana, and an Nginx server proxying Grafana. Given the logcli results, I suspect neither Grafana nor its Nginx proxy has anything to do with it. Admittedly, I forgot to add --org-id to my logcli call, but I would have expected to see "Error response from server: no org id" (which is what I see now when running in filesystem mode). An org id is specified in the Loki datasource definition in Grafana.
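
As far as I understand, the org id reaches Loki as the X-Scope-OrgID header set on the datasource. A provisioning-style sketch of what that datasource definition looks like (the tenant name and URL here are placeholders, not my real values):

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      httpHeaderName1: X-Scope-OrgID    # custom header carrying the tenant/org id
    secureJsonData:
      httpHeaderValue1: my-tenant       # placeholder tenant name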

To be clear, I do see writes being made to the S3 bucket.

Thanks!

Running everything on one host is probably a bit much. Here are a couple of things I'd try:

  1. Use an even bigger instance, say t3a.large, just for the heck of it.
  2. Try setting ulimits through Docker when you start the Loki container, specifically nofile and nproc.

Hi @tonyswumac

I even tried going up to a t3a.xlarge to get more vCPUs, but unfortunately the result was the same.

I tried passing --ulimit nofile=1048576 --ulimit nproc=1048576 to docker run, but unfortunately there was no change in behavior.

When Loki stores its data in S3, should it be possible to destroy the Docker container, recreate it, and have it still function? Is there some other storage being kept on the local container disk that should be going to S3 but is lost when the container is recreated?
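
For context, these are the paths from my config above that live on the container's local disk (plus, as I understand it, the ingester WAL ends up under path_prefix, so that part is an assumption on my side):

common:
  path_prefix: /tmp/loki                   # WAL and other runtime state live under here

compactor:
  working_directory: /loki/retention       # compactor/retention scratch space

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index    # index being built locally before upload to S3
    cache_location: /loki/index_cache      # locally cached index downloaded from S3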

Thanks!

Interesting. I’ll have to test it out tomorrow, but I am not sure why you are running into this error.

Try setting this and see what happens:

server:
  grpc_server_max_recv_msg_size: 1.048576e+08  # 100 MiB (104857600 bytes)
  grpc_server_max_send_msg_size: 1.048576e+08  # 100 MiB (104857600 bytes)

querier:
  query_ingesters_within: 1m  # only send queries to ingesters for data newer than this

@tonyswumac With those configuration changes, it still exhibits the same problem fetching the labels (whether for the past hour or the past 30 days).

I can see the querier must be working when the ingester has the data (things work if logs were recently written), but once the data is flushed to S3 it seems like it cannot be queried, at least not in a reasonable amount of time. However, if there is recent log data, I can start reading older logs off of S3 by progressively increasing my time range. This makes me think there is some kind of caching going on, because it gets to a point where I can fetch the labels and query logs even up to 30 days old, but restarting the container puts it back into "spinning" mode.

Thanks

@tonyswumac I realized I was mounting some local storage at /tmp, which could have been messing with data between restarts, so I am now bind-mounting the host's /loki directory to the container's /loki directory. I also decided to completely wipe any old data in local storage and S3 and start clean.

I also added some ingester settings to slightly reduce the number of chunks created while idle. The config I'm using now:

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
  grpc_server_max_concurrent_streams: 1000
  grpc_server_max_recv_msg_size: 1.048576e+08
  grpc_server_max_send_msg_size: 1.048576e+08

common:
  path_prefix: /loki
  storage:
    s3:
      bucketnames: bucket-name
      region: us-west-2
      endpoint: s3.us-west-2.amazonaws.com
      sse:
        type: SSE-S3
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

ingester:
  max_chunk_age: 2h       # force-flush chunks once they are 2h old
  chunk_idle_period: 1h   # flush chunks that have received no new data for 1h

querier:
  query_ingesters_within: 1m

limits_config:
  retention_period: 90d

compactor:
  working_directory: /loki/retention
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  delete_request_store: aws

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  aws:
    bucketnames: bucket-name
    endpoint: s3.us-west-2.amazonaws.com
    region: us-west-2
    sse:
      type: SSE-S3

schema_config:
  configs:
    - from: 2020-01-01
      store: tsdb
      object_store: aws
      schema: v13
      index:
        prefix: index_
        period: 24h

analytics:
  reporting_enabled: false

Noticed this behavior on startup:

level=info ts=2024-08-22T21:29:39.94593928Z caller=main.go:126 msg="Starting Loki" version="(version=3.1.0, branch=HEAD, revision=935aee77ed)"
level=info ts=2024-08-22T21:29:39.94750719Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/loki.yaml
level=info ts=2024-08-22T21:29:39.951179768Z caller=server.go:352 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9096
level=info ts=2024-08-22T21:29:39.956527526Z caller=table_manager.go:136 index-store=tsdb-2020-01-01 msg="uploading tables"
level=info ts=2024-08-22T21:29:39.957468528Z caller=table_manager.go:436 index-store=tsdb-2020-01-01 msg="loading local table index_19927"
ts=2024-08-22T21:29:39.959844489Z caller=spanlogger.go:109 table-name=index_19927 user-id=selectivetrust level=info msg="building table names cache"
level=info ts=2024-08-22T21:30:39.957160975Z caller=table_manager.go:136 index-store=tsdb-2020-01-01 msg="uploading tables"
ts=2024-08-22T21:30:39.959416493Z caller=spanlogger.go:109 table-name=index_19927 user-id=selectivetrust level=info msg="table names cache built" duration=59.999561384s
level=error ts=2024-08-22T21:30:39.959445644Z caller=cached_client.go:189 msg="failed to build table names cache" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
ts=2024-08-22T21:30:39.959490744Z caller=spanlogger.go:109 table-name=index_19927 user-id=selectivetrust level=info msg="building table names cache"
ts=2024-08-22T21:30:39.959739138Z caller=spanlogger.go:109 table-name=index_19927 user-id=selectivetrust level=info msg="table names cache built" duration=240.794µs
level=error ts=2024-08-22T21:30:39.959754408Z caller=cached_client.go:189 msg="failed to build table names cache" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
ts=2024-08-22T21:30:39.95993074Z caller=spanlogger.go:109 table-name=index_19927 level=info msg="building table names cache"

It repeatedly fails to build the table names cache due to a timeout while working on table index_19927. It eventually moves on to table index_19928, which also fails with context deadline exceeded. It then moves on to table index_19929, which succeeds, and it finishes all the way through index_19957 (note there is very little log data stored over this period).

4 minutes after starting up, I get this log message:

level=info ts=2024-08-22T21:33:46.624417835Z caller=loki.go:508 msg="Loki started" startup_time=4m6.751672433s

I see now that I must wait until startup is complete before label loading works in Grafana. I just did not anticipate that startup would take 4 minutes when the filesystem store seems to start instantly.

However, at this point a query over the last 3 hours times out. I tried a few more times and each time it timed out. Then I got an EOF error on the query instead of a timeout. I tried searching again and it worked.

At this point, I resized the instance from a t3a.micro back up to a t3a.xlarge to see whether the behavior changed. It still repeatedly timed out on "building table names cache" and took 4 minutes to start up.

This log line is interesting:

level=info ts=2024-08-22T22:33:35.622618707Z caller=table_manager.go:280 index-store=tsdb-2020-01-01 msg="query readiness setup completed" duration=2.99µs distinct_users_len=0 distinct_users=

I have seen queries succeed after that message appears, but it takes a while to show up (5 additional minutes after startup completes on the t3a.xlarge).

I wonder why those "building table names cache" operations are timing out such that it takes 4 minutes to start up, and why I cannot query without first having it time out a few times.

UPDATE: As I was writing this up, I found the GitHub issue "When querier restarts, it always get error `msg="error querying storage" err="context canceled"` on the first query" (grafana/loki#4779), which talks about S3-related timeouts. I tried adding this setting:

s3:
  http_config:
    response_header_timeout: 5s
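
To be explicit about where that lives: it is the http_config of the S3 client, so in my layout it nests under the existing storage blocks, something like this (I believe common.storage.s3 and storage_config.aws accept the same http_config options, but double-check against the docs):

common:
  storage:
    s3:
      # ...existing bucket/region/endpoint/sse settings...
      http_config:
        response_header_timeout: 5s   # fail S3 requests quickly if no response headers arrive

storage_config:
  aws:
    # ...existing bucket/region/endpoint/sse settings...
    http_config:
      response_header_timeout: 5s

My understanding from that issue is that without a response header timeout, a stalled S3 request can sit on the connection until the broader request context hits its deadline, which is why startup and the first queries were hanging.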

Startup time went from 4 minutes down to 11 seconds, and queries worked right after that. Thank you for your help!