Loki and CloudWatch log line count mismatch (missing logs)

We were originally using CloudWatch Logs to collect our logs.

We recently adopted Loki. Before we move our logging over completely, we need to check whether we are missing any logs, so we are currently running Loki and CloudWatch Logs side by side.

I am using Fluent Bit as the client, with outputs configured for both CloudWatch Logs and Loki.
However, when I compare the number of log lines per hour between CloudWatch and Loki, there is a discrepancy.
Some hours match perfectly, while in other hours a significant number of logs are missing from Loki, about 40,000 log lines.
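
For reference, a minimal sketch of this kind of dual-output Fluent Bit setup (the match pattern, region, log group, and labels below are placeholders, not our actual values):

    [OUTPUT]
        Name                cloudwatch_logs
        Match               kube.*
        region              ap-northeast-2
        log_group_name      /eks/test-cluster/application
        log_stream_prefix   fluentbit-
        auto_create_group   On

    [OUTPUT]
        Name                loki
        Match               kube.*
        host                loki-gateway
        port                80
        labels              cluster=test-cluster, job=fluent-bit
        line_format         json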

Worse, we haven't found any error logs from any of the Loki components, including Fluent Bit (client side) and the write component.
I can't see what's causing the problem or how to fix it, but the missing logs are so obvious that I don't know what to do next.

[version info]

  • Loki version: 2.8.2
  • Helm chart version: 5.8.8
  • Fluent Bit (client) version: 1.9

[query for log line count]

  • Loki
    sum(count_over_time({cluster="test-cluster"} [$__range]))
    
  • CloudWatch
    fields @timestamp, @message, @logStream, @log
    | sort @timestamp desc
    | stats count(@message) as log_line_count
    

Without knowing more it's rather hard to say. A couple of things to try:

  1. What percentage of your total volume would 40,000 log lines be?
  2. What deployment mode are you using for Loki? Please also share your configuration.
  3. I'd try narrowing the time range to maybe 5 minutes or so when logs are going missing, printing out the log lines from both CloudWatch and Loki, and diffing them to see if you can identify any characteristics of the missing logs or whether they are completely random (a rough command-line sketch of this is below).
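
A rough command-line sketch of that diff (the time window, log group name, and label selector here are just examples, and depending on how each output formats the record you may need to normalize the lines first):

    # Loki side: dump raw log lines for a 5-minute window with logcli
    # (LOKI_ADDR should point at your Loki gateway)
    logcli query '{cluster="test-cluster"}' \
      --from="2023-07-20T01:00:00Z" --to="2023-07-20T01:05:00Z" \
      --limit=1000000 --output=raw | sort > loki.txt

    # CloudWatch side: same window (start/end are epoch milliseconds)
    aws logs filter-log-events --log-group-name "/eks/test-cluster/application" \
      --start-time 1689814800000 --end-time 1689815100000 \
      --query 'events[*].[message]' --output text | sort > cloudwatch.txt

    # Lines present in CloudWatch but missing from Loki
    comm -13 loki.txt cloudwatch.txt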

Thank you for your response.

1. Number of log lines in the same time window (1 hour)

To be precise, the difference in log line count was 38,666.

  • Approximately 11.7% of log lines are missing (38,666 lines)
  • Total log lines in CloudWatch: 328,479
  • Total log lines in Loki: 289,813

2. Loki Deployment Mode

I don't know exactly which deployment mode of Loki we are running; it is probably microservices mode or simple scalable deployment mode.

I'll list the pods and the components each of them runs below.

  • gateway
  • read (query frontend, querier)
  • write (distributor, ingester)
  • backend (ruler, compactor,…)
  • loki-canary

3. To characterize the missing logs, we compared them on a minute-by-minute basis.

I compared the logs manually, one by one, and it does not look like any particular client is losing logs; I saw the same thing across multiple clients (see also the per-minute count sketch after the numbers below).

  • Time: a randomly selected window of roughly 1 minute
  • Total log lines in CloudWatch: 2,129
  • Total log lines in Loki: 1,717
  • Difference: 412
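
To find the affected minutes without comparing lines by hand, counts can also be bucketed per minute on both sides and overlaid, for example with the same selector and log group as above:

    # Loki (range query over the hour in question)
    sum(count_over_time({cluster="test-cluster"}[1m]))

    # CloudWatch Logs Insights
    stats count(@message) as log_line_count by bin(1m)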

What else can we look at?

There are no error logs in the write component (just warnings), but there are in the other components.

Around the time we lost the 38,666 log lines, we had reduced the write pods from 6 to 4.
(One thing to note, though: if the problem were caused by scaling down the write pods, we shouldn't have seen it before that, and we see inconsistent log line counts at different times on different days.)

From what I can see, there isn't much to go on other than the change in the number of write and read pods and the error logs being generated.

Could you look at the logs below to see what the problem is?

I’ve removed as many duplicates as possible and made sure to hide as much internal information as possible.

canary log

“error running metric query test: expected only a single series result in the metric test query vector, instead received 2”,
“Loki returned an error code: 500, waiting 1.394788912s before next query.error querying loki: error response from server: empty ring”,
“error reading websocket, will retry in 10 seconds: websocket: close 1011 (internal server error): all ingesters closed the connection”,
“error reading websocket, will retry in 10 seconds: websocket: close 1011 (internal server error): too many unhealthy instances in the ring”,
“error reading websocket, will retry in 10 seconds: websocket: close 1006 (abnormal closure): unexpected EOF”

read log

“level=error caller=http.go:316 org_id=fake msg="Error from client" err="websocket: close 1006 (abnormal closure): unexpected EOF"”,
“level=error caller=scheduler_processor.go:158 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=xx.xx.169.172:9095”,
“level=error caller=scheduler_processor.go:106 msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=xx.xx.169.172:9095”,
“level=error caller=retry.go:73 org_id=fake msg="error processing request" try=0 query="{container=\"loki\"} != \"\" !~ \"info\"" err="context canceled"”,
“level=error caller=tail.go:224 org_id=fake msg="Error receiving response from grpc tail client" err=EOF”,
“level=error caller=tail.go:224 org_id=fake msg="Error receiving response from grpc tail client" err="rpc error: code = Unknown desc = Ingester is stopping"”,
“level=error caller=tail.go:90 msg="Error reconnecting to disconnected ingesters" err="failed to connect with one or more ingester(s) during tailing: context canceled"”,
“level=warn caller=scheduler_processor.go:98 msg="error contacting scheduler" err="rpc error: code = Canceled desc = context canceled" addr=xx.xx.169.152:9095”,
“level=warn caller=pool.go:193 msg="removing frontend failing healthcheck" addr=xx.xx.128.217:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"”,
“caller=spanlogger.go:85 middleware=QueryShard.astMapperware org_id=fake org_id=fake level=warn msg="failed mapping AST" err="rpc error: code = Code(500) desc = empty ring\n" query="{stream=\"stdout\",pod=\"loki-canary-k249m\"}"”,
“level=error caller=retry.go:73 org_id=fake msg="error processing request" try=4 query="{stream=\"stdout\", pod=\"loki-canary-k249m\"}" err="rpc error: code = Code(500) desc = empty ring\n"”,
“level=error caller=http.go:347 org_id=fake msg="Error from iterator" err="all ingesters closed the connection"”,
“level=error caller=tail.go:140 msg="Error reconnecting to ingesters" err="failed to connect with one or more ingester(s) during tailing: empty ring"”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.136.82:9095 err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.136.82:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.136.82:9095: connect: no route to host\""”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.136.82:9095: connect: no route to host\"" addr=xx.xx.136.82:9095”,
“level=error caller=frontend_scheduler_worker.go:248 msg="error sending requests to scheduler" err="scheduler is shutting down" addr=xx.xx.136.82:9095”,
“level=error caller=scheduler_processor.go:106 msg="error processing requests from scheduler" err="rpc error: code = Unknown desc = queue is stopped" addr=xx.xx.136.82:9095”,
“level=warn caller=pool.go:193 msg="removing frontend failing healthcheck" addr=xx.xx.138.78:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.173.74:9095 err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.173.74:9095: connect: no route to host\"" addr=xx.xx.173.74:9095”,
“level=warn caller=scheduler_processor.go:98 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.173.74:9095: connect: no route to host\"" addr=xx.xx.173.74:9095”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.173.74:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.173.74:9095: connect: no route to host\""”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.173.74:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.173.74:9095: connect: connection refused\""”,
“level=error caller=frontend_scheduler_worker.go:248 msg="error sending requests to scheduler" err="scheduler is shutting down" addr=xx.xx.173.74:9095”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.173.74:9095: connect: connection refused\"" addr=xx.xx.173.74:9095”,
“level=warn caller=pool.go:193 msg="removing frontend failing healthcheck" addr=xx.xx.167.79:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"”,
“level=error caller=scheduler_processor.go:106 msg="error processing requests from scheduler" err="rpc error: code = Unknown desc = queue is stopped" addr=xx.xx.173.74:9095”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.139.105:9095: connect: no route to host\"" addr=xx.xx.139.105:9095”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.170.13:9095: connect: connection refused\"" addr=xx.xx.170.13:9095”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.143.63:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.143.63:9095: connect: connection refused\""”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.170.13:9095 err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.139.105:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.139.105:9095: connect: connection refused\""”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.143.63:9095 err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.170.13:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.170.13:9095: connect: no route to host\""”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.170.13:9095: connect: no route to host\"" addr=xx.xx.170.13:9095”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Canceled desc = context canceled" addr=xx.xx.170.13:9095”,
“level=warn caller=scheduler_processor.go:78 msg="failed to notify querier shutdown to query-scheduler" address=xx.xx.143.63:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.143.63:9095: connect: no route to host\""”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.143.63:9095: connect: no route to host\"" addr=xx.xx.143.63:9095”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.143.63:9095: connect: connection refused\"" addr=xx.xx.143.63:9095”,
“level=error caller=frontend_scheduler_worker.go:237 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp xx.xx.139.105:9095: connect: connection refused\"" addr=xx.xx.139.105:9095”,
“level=error caller=frontend_scheduler_worker.go:248 msg="error sending requests to scheduler" err="scheduler is shutting down" addr=xx.xx.143.63:9095”,
“level=error caller=frontend_scheduler_worker.go:248 msg="error sending requests to scheduler" err="scheduler is shutting down" addr=xx.xx.170.13:9095”,
“level=error caller=retry.go:73 org_id=fake msg="error processing request" try=0 query="{stream=\"stdout\", pod=\"loki-canary-lv2j8\"}" err="rpc error: code = Code(500) desc = failed to enqueue request"”,
“level=error caller=frontend_scheduler_worker.go:248 msg="error sending requests to scheduler" err="scheduler is shutting down" addr=xx.xx.139.105:9095”

backend log

We normally see a lot of logs saying that the index schema cannot be found, but I've only included one example because they seem redundant…
“level=error caller=compactor.go:507 msg="skipping compaction since we can’t find schema for table" table=loki_index_19556”

write log

Here I see the error message "POST /loki/api/v1/push (500)", but it's important to note that my application log agent is Fluent Bit, and none of these entries show Fluent Bit as the user agent (they show promtail and GrafanaAgent)…
“level=warn caller=logging.go:86 traceID=0e69042c3c34cb77 orgID=fake msg="POST /loki/api/v1/push (500) 1.789064ms Response: \"rpc error: code = Unknown desc = Ingester is shutting down\\n\" ws: false; Connection: close; Content-Length: 6480; Content-Type: application/x-protobuf; User-Agent: promtail/2.8.2; "”,
“level=warn caller=logging.go:86 traceID=007a2d82f74c8f7d orgID=fake msg="POST /loki/api/v1/push (500) 1.786022ms Response: \"rpc error: code = Unknown desc = Ingester is shutting down\\n\" ws: false; Connection: close; Content-Length: 5971; Content-Type: application/x-protobuf; User-Agent: GrafanaAgent/; "”,
“level=warn caller=logging.go:86 traceID=3ff892d7dd53efb7 orgID=fake msg="POST /loki/api/v1/push (500) 1.860704ms Response: \"rpc error: code = Unknown desc = Ingester is shutting down\\n\" ws: false; Connection: close; Content-Length: 4379; Content-Type: application/x-protobuf; User-Agent: promtail/2.8.2; "”,
“level=warn caller=logging.go:86 traceID=3320abf78a9ce2a3 orgID=fake msg="POST /loki/api/v1/push (500) 1.698046ms Response: \"rpc error: code = Unknown desc = Ingester is shutting down\\n\" ws: false; Connection: close; Content-Length: 4300; Content-Type: application/x-protobuf; User-Agent: GrafanaAgent/; "”,

gateway log

“[error] 9#9: *57 connect() failed (111: Connection refused) while connecting to upstream, client: xx.xx.141.65, server: , request: "POST /loki/api/v1/push HTTP/1.1", upstream: "http://xx.xx.238.68:3100/loki/api/v1/push\”, host: "loki-gateway"“,
“[error] 10#10: *8314 connect() failed (111: Connection refused) while connecting to upstream, client: xx.xx.129.12, server: , request: "POST /loki/api/v1/push HTTP/1.1", upstream: "http://xx.xx.238.68:3100/loki/api/v1/push\”, host: "loki.domain.com"”,
“[error] 13#13: *8291 connect() failed (111: Connection refused) while connecting to upstream, client: xx.xx.164.246, server: , request: "POST /loki/api/v1/push HTTP/1.1", upstream: "http://xx.xx.238.68:3100/loki/api/v1/push\”, host: "loki-gateway.loki.svc.cluster.local""

client log

“level=warn caller=client.go:419 component=client host=loki-gateway msg="error sending batch, will retry" status=502 tenant= error="server returned HTTP status 502 Bad Gateway (502): "”,
“level=warn caller=client.go:419 component=client host=loki-gateway msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://loki-gateway/loki/api/v1/push\\\”: context deadline exceeded"“,
“level=warn caller=client.go:419 component=client host=loki-gateway msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://loki-gateway/loki/api/v1/push\\\”: dial tcp xx.xx.27.64:80: connect: connection refused"”,
“level=warn caller=client.go:419 component=client host=loki-gateway msg="error sending batch, will retry" status=500 tenant= error="server returned HTTP status 500 Internal Server Error (500): rpc error: code = Unknown desc = Ingester is shutting down"”

logs

“caller=client.go:419 level=warn component=logs logs_config=loki/loki component=client host=loki-gateway.loki.svc.cluster.local msg="error sending batch, will retry" status=502 tenant= error="server returned HTTP status 502 Bad Gateway (502): "”,

“caller=client.go:419 level=warn component=logs logs_config=loki/loki component=client host=loki-gateway.loki.svc.cluster.local msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://loki-gateway.loki.svc.cluster.local/loki/api/v1/push\\\”: context deadline exceeded"",

“caller=client.go:419 level=warn component=logs logs_config=loki/loki component=client host=loki-gateway.loki.svc.cluster.local msg="error sending batch, will retry" status=500 tenant= error="server returned HTTP status 500 Internal Server Error (500): rpc error: code = Unknown desc = Ingester is shutting down"”,

“caller=client.go:419 level=warn component=logs logs_config=loki/loki component=client host=loki-gateway.loki.svc.cluster.local msg="error sending batch, will retry" status=-1 tenant= error="Post \"http://loki-gateway.loki.svc.cluster.local/loki/api/v1/push\\\”: dial tcp xx.xx.27.64:80: connect: connection refused"",

I see a lot of EOF errors; I am going to assume those are because of restarts. Other than that I don't see a whole lot. I would recommend trying a couple of things:

  1. Double check your ring membership by hitting the /ring endpoint (see the sketch after this list). Perhaps post a screenshot of that as well.

  2. I am not sure if you already have the infrastructure for this, but in fluentd you can add a flow counter and have it forwarded to a metrics platform. It might be worth adding a counter on both the Loki and CloudWatch outputs and comparing them.

  3. If you are using multiple workers per fluentd agent, try adding the worker ID as a label when sending logs to Loki. The idea is to keep the log streams from each worker within each agent unique.

  4. As a last resort, Grafana Cloud offers 50GB of free storage and is, in my opinion, an excellent playground. You could try sending your logs there for a short period of time and compare results. You'll of course have to pay for egress from your cloud provider, so please exercise discretion.
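
For item 1, a minimal way to check the ring in a simple scalable deployment (the service and namespace names here are assumptions based on the default Helm chart):

    # Port-forward to the write target (run in one terminal)
    kubectl -n loki port-forward svc/loki-write 3100:3100
    # ...then open the ingester ring page in another
    curl -s http://localhost:3100/ring
    # Look for instances stuck in LEAVING or UNHEALTHY state, or stale heartbeats.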


For now, the issue is resolved.
I just thought you might be interested to know.

The log line counts in CloudWatch and Loki were obviously different, and I was struggling because there were no error logs anywhere.

The lack of error logs in Loki led me to suspect a network issue, and while reviewing the network architecture again, I found two weak points:

  1. There was only one pod in Istio acting as ingress for the front end.
  2. There was only one loki-gateway pod.

Both 1 and 2 could cause network-related issues, so we raised the minimum replica count for each to 3.
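
On the Loki side this amounts to something like the following in the Helm chart values (a sketch assuming the default grafana/loki chart layout; exact keys may differ in your chart version, and the Istio ingress replica count is set separately in whatever manages that gateway):

    gateway:
      replicas: 3
      # or keep a floor while autoscaling:
      # autoscaling:
      #   enabled: true
      #   minReplicas: 3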

Since then, I've continued to check the log line counts, and there are no further missing logs.

Thank you so much for your attention and help.
Thanks to you looking at the logs with me, I realized it was a network issue and not a problem with Loki.