Gaps in Grafana with Loki SSD

We’ve recently shifted from single-binary Loki on a persistent volume to Loki SSD (simple scalable deployment) with an S3 bucket. After the migration we started seeing gaps in our logs. Any help?

The gaps on the graph you provided look like the normal spacing between bars in a bar chart. Are you sure you are checking the correct query/graph?

I am sure the gap is real. If you look at the timestamps, the logs are missing for more than 20 minutes. I am seeing the same thing with other services as well.

Here’s a graph from another service.

On the old Loki, we used to see very consistent logs from this service as well.

Please share your Loki configuration.

Do you see gaps in recent logs only, or in all logs? If you query for the past, say, 12 hours, are there still gaps?

Hi Tony,

Yes, the gaps have been there for quite some time. Here’s the Loki config; I am using Loki SSD:

apiVersion: v1
data:
config.yaml: |2

auth_enabled: false
chunk_store_config:
  chunk_cache_config:
    background:
      writeback_buffer: 500000
      writeback_goroutines: 1
      writeback_size_limit: 500MB
    default_validity: 70m
    memcached:
      batch_size: 4
      parallelism: 5
      expiration: 70m
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.loki-ssd.svc
      consistent_hash: true
      max_idle_conns: 72
      timeout: 2000ms
common:
  compactor_address: 'http://loki-backend:3100'
  path_prefix: /var/loki
  replication_factor: 3
  storage:
    s3:
      bucketnames: chunks-store-loki
      insecure: false
      region: <REGION>
      s3: <S3_ENDPOINT>
      s3forcepathstyle: false
compactor:
  compaction_interval: 5m
  delete_request_cancel_period: 24h
  delete_request_store: s3
  retention_delete_delay: 24h
  retention_enabled: true
  working_directory: /var/loki/compactor
frontend:
  scheduler_address: ""
  tail_proxy_url: http://loki-querier.loki-ssd.svc.cluster.local:3100
frontend_worker:
  scheduler_address: ""
index_gateway:
  mode: simple
limits_config:
  max_cache_freshness_per_query: 10m
  query_timeout: 300s
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 90d
  split_queries_by_interval: 15m
  volume_enabled: true
  per_stream_rate_limit: 512M
  per_stream_rate_limit_burst: 1024M
  ingestion_burst_size_mb: 1000
  ingestion_rate_mb: 10000
  max_entries_limit_per_query: 1000000
  max_label_value_length: 20480
  max_label_name_length: 10240
  max_label_names_per_series: 300
memberlist:
  join_members:
  - loki-memberlist
pattern_ingester:
  enabled: false
querier:
  max_concurrent: 50
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      background:
        writeback_buffer: 500000
        writeback_goroutines: 1
        writeback_size_limit: 500MB
      default_validity: 12h
      memcached_client:
        addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.loki-ssd.svc
        consistent_hash: true
        timeout: 500ms
        update_interval: 1m
ruler:
  storage:
    s3:
      bucketnames: ruler-store-loki
      insecure: false
      region: <REGION>
      s3: <S3-ENDPOINT>
      s3forcepathstyle: false
    type: s3
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
  - from: "2023-06-15"
    index:
      period: 24h
      prefix: index_
    object_store: s3
    schema: v13
    store: tsdb
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
  http_server_read_timeout: 2h
  http_server_write_timeout: 2h
storage_config:
  boltdb_shipper:
    index_gateway_client:
      server_address: dns+loki-backend-headless.loki-ssd.svc.cluster.local:9095
  hedging:
    at: 250ms
    max_per_second: 20
    up_to: 3
  tsdb_shipper:
    index_gateway_client:
      server_address: dns+loki-backend-headless.loki-ssd.svc.cluster.local:9095
tracing:
  enabled: false

kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: loki
    meta.helm.sh/release-namespace: loki-ssd
  creationTimestamp: "2024-05-29T12:51:58Z"
  labels:
    app.kubernetes.io/instance: loki
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: loki
    app.kubernetes.io/version: 3.0.0
    helm.sh/chart: loki-6.6.1
  name: loki
  namespace: loki-ssd
  resourceVersion: "809039742"
  uid: 1859d20b-c416-4ac0-aef7-02a79194f694

I didn’t mean whether the gaps have been there for a long time or not. I meant: if you were to query your Loki instance for logs over the intervals [-6h to now] and [-12h to -6h], do you see gaps in both?

I don’t see anything obviously wrong. A couple of questions, in addition to the one above:

  1. How many write / read pods do you have?
  2. If you hit the /ring endpoint on one of the write and read pods, what do you see? (See the example after this list.)
  3. Are you able to replicate this in a dev / test environment?
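For question 2, a minimal sketch of that check, assuming the default pod names from the Helm chart and the HTTP port 3100 from your config (adjust both for your environment):

# Port-forward one of the write pods (pod name is a placeholder)
kubectl -n loki-ssd port-forward pod/loki-write-0 3100:3100

# In another shell: the ring page lists every ingester and its state;
# all members should be ACTIVE, with no stale or unhealthy entries
curl -s http://localhost:3100/ring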

I think the 6-hour and 12-hour views look the same, but the 12-hour view has more bars. I’ve attached screenshots of both.

12h


6h

  1. I have 3 write pods and 3 read pods.
  2. Here’s a screenshot of the /ring endpoint.
  3. We don’t have Loki in dev/test environments.

I don’t see anything obviously wrong. There is a chance there isn’t actually a problem with Loki. Have you confirmed that you are actually losing logs?

Also, can you try this:

  1. Write a simple curl command to push one log line to your Loki cluster. Be sure to specify the timestamp in nanoseconds, and use a test label (see the sketch after this list).
  2. Verify the log is in Loki.
  3. Run the curl command once a minute, and check the results for the past hour.
  4. Run the same curl command once a second, and check the results for the past hour.
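For step 1, a rough sketch of such a push, assuming the write path is reachable at http://loki-gateway.loki-ssd.svc:80 and using a made-up test label job="curl-test" (both are placeholders; adjust the URL and label for your setup):

# Timestamp must be in nanoseconds since the epoch (GNU date shown here)
TS=$(date +%s%N)

# Push a single log line with a dedicated test label
curl -s -H "Content-Type: application/json" \
  -X POST "http://loki-gateway.loki-ssd.svc:80/loki/api/v1/push" \
  --data-raw "{\"streams\":[{\"stream\":{\"job\":\"curl-test\"},\"values\":[[\"${TS}\",\"hello from curl test\"]]}]}"

# For steps 3 and 4, wrap the above in a loop, e.g.:
# while true; do TS=$(date +%s%N); <the curl above>; sleep 60; done

You can then query {job="curl-test"} in Grafana and check whether the test stream shows the same gaps.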

Please provide screenshots if possible.

@tonyswumac I am facing the same issue. I am new to the Loki setup; can you guide me on how to start troubleshooting it?

There is a consistent 5-minute gap in our production Loki logs for every service. We are using Loki SSD with an S3 bucket. I have attached screenshots for 1h, 3h, and 6h respectively.


What is your configuration for query_ingesters_within, and do you see any gaps beyond that? For example, if it’s set to 3h, do you see gaps querying logs older than 3h?
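For context, here is a rough way to query a window that is entirely older than a 3h query_ingesters_within, so it is served purely from object storage; it assumes a port-forward to the read path on localhost:3100, GNU date, and an example selector {job="curl-test"}:

# A 1-hour window ending 4 hours ago, i.e. fully outside query_ingesters_within=3h
START=$(date -d '-5 hours' +%s%N)
END=$(date -d '-4 hours' +%s%N)

curl -s -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="curl-test"}' \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode 'limit=1000'

Comparing a recent window with one beyond that cutoff helps separate ingester/query-path issues from data that never made it to storage.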

Based on Promtail’s logs, it appears that Promtail cannot establish a connection with the Loki gateway in some instances, which could be causing logs to be skipped. Promtail is attempting to connect to <IP>:31065, but the gateway accepts requests on port 8080.

Promtail logs:

error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:23.747677286Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:24.348715053Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:26.224444152Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:30.071605784Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:34.439505848Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:44.864574623Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:52:12.062596727Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:53:11.255311289Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"

Promtail config:

Name:         promtail
Namespace:    loki-stack
Labels:       app=promtail
              app.kubernetes.io/instance=promtail
              app.kubernetes.io/managed-by=Helm
              chart=promtail-2.0.2
              heritage=Helm
              release=promtail
Annotations:  meta.helm.sh/release-name: promtail
              meta.helm.sh/release-namespace: loki-stack

client:
  backoff_config:
    max_period: 5m
    max_retries: 10
    min_period: 500ms
  batchsize: 1048576
  batchwait: 1s
  external_labels: {}
  timeout: 10s
positions:
  filename: /run/promtail/positions.yaml
clients:
  - url: http://<IP>:8080/loki/api/v1/push
server:
  http_listen_port: 3101
target_config:
  sync_period: 10s
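To confirm where the push endpoint actually answers before changing anything, a quick check (with <GATEWAY_IP> as a placeholder for the gateway address):

# Any HTTP status code means the endpoint is reachable; 000 means the TCP connection failed
curl -s -o /dev/null -w "port 8080  -> %{http_code}\n" -X POST "http://<GATEWAY_IP>:8080/loki/api/v1/push"
curl -s -o /dev/null -w "port 31065 -> %{http_code}\n" -X POST "http://<GATEWAY_IP>:31065/loki/api/v1/push"

If the running Promtail pods are still pushing to :31065 even though the clients url above points at :8080, double-check which ConfigMap the pods actually mounted and restart them after correcting it.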