We’ve recently shifted from single binary Loki with a persistent volume to Loki SSD (simple scalable deployment) with an S3 bucket. After the migration we started seeing gaps in our logs. Any help?
Those gaps in the graph you provided look like the normal spacing between bars in a bar chart. Are you sure you are checking the correct query/graph?
I am sure the gap is real. If you look at the timestamps, the logs are missing for more than 20 minutes. I am seeing the same thing with other services as well.
Here’s a graph from another service.
On the old Loki, we used to see very consistent logs from this service as well.
Please share your loki configuration.
Do you see gaps on recent logs only, or all logs? If you query for the past, say, 12 hours, are there still gaps?
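If it helps, one way to check a specific older window is to hit the query_range API directly. This is only a sketch: the <LOKI_QUERY_ENDPOINT> address and the {app="my-service"} selector are placeholders you would need to replace for your setup, and the date commands assume GNU date.

# Sketch only: <LOKI_QUERY_ENDPOINT> and the label selector are placeholders.
# Timestamps are passed as nanoseconds.
START=$(date -d '-12 hours' +%s)000000000
END=$(date -d '-6 hours' +%s)000000000
curl -G -s "http://<LOKI_QUERY_ENDPOINT>/loki/api/v1/query_range" \
  --data-urlencode 'query={app="my-service"}' \
  --data-urlencode "start=${START}" \
  --data-urlencode "end=${END}" \
  --data-urlencode 'limit=100' \
  --data-urlencode 'direction=forward'

If the gaps show up here as well, the data really is missing from storage rather than being a dashboard artifact.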
Hi Tony,
Yes, the gaps have been there for quite some time. Here’s the Loki config; I am using Loki SSD:
apiVersion: v1
data:
  config.yaml: |2
    auth_enabled: false
    chunk_store_config:
      chunk_cache_config:
        background:
          writeback_buffer: 500000
          writeback_goroutines: 1
          writeback_size_limit: 500MB
        default_validity: 70m
        memcached:
          batch_size: 4
          parallelism: 5
          expiration: 70m
        memcached_client:
          addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.loki-ssd.svc
          consistent_hash: true
          max_idle_conns: 72
          timeout: 2000ms
    common:
      compactor_address: 'http://loki-backend:3100'
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        s3:
          bucketnames: chunks-store-loki
          insecure: false
          region: <REGION>
          s3: <S3_ENDPOINT>
          s3forcepathstyle: false
    compactor:
      compaction_interval: 5m
      delete_request_cancel_period: 24h
      delete_request_store: s3
      retention_delete_delay: 24h
      retention_enabled: true
      working_directory: /var/loki/compactor
    frontend:
      scheduler_address: ""
      tail_proxy_url: http://loki-querier.loki-ssd.svc.cluster.local:3100
    frontend_worker:
      scheduler_address: ""
    index_gateway:
      mode: simple
    limits_config:
      max_cache_freshness_per_query: 10m
      query_timeout: 300s
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      retention_period: 90d
      split_queries_by_interval: 15m
      volume_enabled: true
      per_stream_rate_limit: 512M
      per_stream_rate_limit_burst: 1024M
      ingestion_burst_size_mb: 1000
      ingestion_rate_mb: 10000
      max_entries_limit_per_query: 1000000
      max_label_value_length: 20480
      max_label_name_length: 10240
      max_label_names_per_series: 300
    memberlist:
      join_members:
      - loki-memberlist
    pattern_ingester:
      enabled: false
    querier:
      max_concurrent: 50
    query_range:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          background:
            writeback_buffer: 500000
            writeback_goroutines: 1
            writeback_size_limit: 500MB
          default_validity: 12h
          memcached_client:
            addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.loki-ssd.svc
            consistent_hash: true
            timeout: 500ms
            update_interval: 1m
    ruler:
      storage:
        s3:
          bucketnames: ruler-store-loki
          insecure: false
          region: <REGION>
          s3: <S3-ENDPOINT>
          s3forcepathstyle: false
        type: s3
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - from: "2023-06-15"
        index:
          period: 24h
          prefix: index_
        object_store: s3
        schema: v13
        store: tsdb
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100
      http_server_read_timeout: 2h
      http_server_write_timeout: 2h
    storage_config:
      boltdb_shipper:
        index_gateway_client:
          server_address: dns+loki-backend-headless.loki-ssd.svc.cluster.local:9095
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      tsdb_shipper:
        index_gateway_client:
          server_address: dns+loki-backend-headless.loki-ssd.svc.cluster.local:9095
    tracing:
      enabled: false
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: loki
    meta.helm.sh/release-namespace: loki-ssd
  creationTimestamp: "2024-05-29T12:51:58Z"
  labels:
    app.kubernetes.io/instance: loki
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: loki
    app.kubernetes.io/version: 3.0.0
    helm.sh/chart: loki-6.6.1
  name: loki
  namespace: loki-ssd
  resourceVersion: "809039742"
  uid: 1859d20b-c416-4ac0-aef7-02a79194f694
I didn’t mean whether the gaps have been there for a long time. I meant: if you query your Loki instance for logs in the intervals [-6h to now] and [-12h to -6h], do you see gaps in both?
I don’t see anything obviously wrong. A couple of questions, other than the one above:
- How many write / read pods do you have?
- If you hit the /ring endpoint on one of the write and read pods, what do you see? (One way to check is sketched below.)
- Are you able to replicate this in a dev / test environment?
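A rough sketch of checking the ring; the pod names below are placeholders for your actual write/read pods.

# Sketch only: adjust pod names to your deployment.
kubectl -n loki-ssd port-forward pod/loki-write-0 3100:3100 &
curl -s http://localhost:3100/ring
# repeat on a read pod:
kubectl -n loki-ssd port-forward pod/loki-read-0 3101:3100 &
curl -s http://localhost:3101/ring

Every ingester in the ring should show as ACTIVE; entries stuck in LEAVING or UNHEALTHY would be worth investigating.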
I think the 6-hour and 12-hour views look the same, but the 12-hour graph has more bars. I’ve attached the screenshots for both.
12h
6h
- I have 3 writers and 3 read pods.
- Here’s the screenshot of the /ring endpoint.
- We don’t have Loki in dev/test environments.
Don’t see anything obviously wrong. There is a chance that there isn’t actually a problem with Loki. Have you confirmed you are actually losing logs?
Also, can you try this:
- Write a simple curl command to write one log line to your Loki cluster. Be sure to specify the timestamp in nanoseconds, and use a dedicated test label (a sketch is included below).
- Verify the log is in Loki.
- Run the curl command once a minute, then look at the results for the past hour.
- Run the same curl command once a second, then look at the results for the past hour.
Please provide screenshots if possible.
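Something along these lines should work; <LOKI_PUSH_ENDPOINT>, <LOKI_QUERY_ENDPOINT>, and the job="gap-test" label are placeholders for your setup, and the date commands assume GNU date.

# Sketch only: push one log line with a nanosecond timestamp and a test label.
NOW_NS=$(($(date +%s) * 1000000000))
curl -s -H 'Content-Type: application/json' \
  -X POST "http://<LOKI_PUSH_ENDPOINT>/loki/api/v1/push" \
  --data-raw "{\"streams\":[{\"stream\":{\"job\":\"gap-test\"},\"values\":[[\"${NOW_NS}\",\"gap test line\"]]}]}"

# To send one line per minute (or per second), wrap the push in a loop:
# while true; do <push curl with a fresh NOW_NS>; sleep 60; done

# Then query the test label for the past hour and look for gaps:
curl -G -s "http://<LOKI_QUERY_ENDPOINT>/loki/api/v1/query_range" \
  --data-urlencode 'query={job="gap-test"}' \
  --data-urlencode "start=$(date -d '-1 hour' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000"

If the test stream comes back with no gaps while your service logs still have them, that points at the log shipper rather than Loki itself.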
@tonyswumac I am facing the same issue. I am new to the Loki setup; can you guide me on how to start troubleshooting this issue?
There is a consistent 5-minute gap in the Loki prod logs for every service. We are using Loki SSD with an S3 bucket. I have attached the logs for 1h, 3h, and 6h respectively.
What is your configuration for query_ingesters_within, and do you see any gaps beyond that? For example, if it’s set to 3h, do you see gaps querying logs older than 3h?
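For reference, that setting lives under the querier block; the value below is only an example (3h is the default in recent Loki versions, if I remember correctly).

querier:
  query_ingesters_within: 3h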
Based on Promtail’s logs, it appears that Promtail cannot establish a connection with the Loki gateway in some instances, which could be causing logs to be skipped. Promtail is attempting to connect to <IP>:31065, but the gateway accepts requests on port 8080.
Promtail logs-
error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:23.747677286Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:24.348715053Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:26.224444152Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:30.071605784Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:34.439505848Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:51:44.864574623Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:52:12.062596727Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
level=warn ts=2024-12-20T11:53:11.255311289Z caller=client.go:419 component=client host=<IP_ADDRESS>:31065 msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://<IP_ADDRESS>:31065/loki/api/v1/push\": dial tcp <IP_ADDRESS>:31065: connect: connection refused"
Promtail config-
Name:         promtail
Namespace:    loki-stack
Labels:       app=promtail
              app.kubernetes.io/instance=promtail
              app.kubernetes.io/managed-by=Helm
              chart=promtail-2.0.2
              heritage=Helm
              release=promtail
Annotations:  meta.helm.sh/release-name: promtail
              meta.helm.sh/release-namespace: loki-stack

client:
  backoff_config:
    max_period: 5m
    max_retries: 10
    min_period: 500ms
  batchsize: 1048576
  batchwait: 1s
  external_labels: {}
  timeout: 10s
positions:
  filename: /run/promtail/positions.yaml
clients:
- url: http://<IP>:8080/loki/api/v1/push
server:
  http_listen_port: 3101
target_config:
  sync_period: 10s
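So presumably the fix is to make the Promtail clients URL match the address and port where the gateway actually accepts pushes: the connection-refused errors show the running Promtail pushing to <IP>:31065 (which looks like a NodePort), while the config above points at port 8080. A sketch only, assuming an in-cluster gateway Service; the service name below is a placeholder, so verify the real name and port with kubectl get svc in the namespace where the gateway runs.

# Sketch, not a drop-in fix: <LOKI_GATEWAY_SERVICE> is a placeholder.
clients:
- url: http://<LOKI_GATEWAY_SERVICE>:8080/loki/api/v1/push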