Hi,
I’m facing a weird issue: after my query-scheduler node is restarted, query requests start to fail in the query-frontend ([query-frontend] failed mapping AST).
What I noticed is that the restarted scheduler doesn’t receive queries anymore, and the loki_query_scheduler_enqueue_count metric is no longer emitted.
But from what I can see in the logs and metrics, the scheduler and frontend are still able to connect.
Initially it is a bit bumpy (connected/disconnected):
2025-05-14 12:46:12.333 [query-scheduler] scheduler is JOINING in the ring
2025-05-14 12:46:12.333 [query-scheduler] CAS attempt failed
2025-05-14 12:46:12.553 [query-scheduler] frontend connected
2025-05-14 12:46:12.553 [query-scheduler] frontend disconnected
2025-05-14 12:46:12.554 [query-frontend] error sending requests to scheduler
2025-05-14 12:46:12.565 [query-scheduler] frontend connected
2025-05-14 12:46:12.565 [query-scheduler] frontend disconnected
2025-05-14 12:46:12.567 [query-frontend] error sending requests to scheduler
2025-05-14 12:46:12.651 [query-scheduler] frontend connected
2025-05-14 12:46:12.651 [query-scheduler] frontend disconnected
2025-05-14 12:46:12.652 [query-frontend] error sending requests to scheduler
2025-05-14 12:46:12.807 [query-scheduler] frontend connected
2025-05-14 12:46:12.807 [query-scheduler] frontend disconnected
2025-05-14 12:46:12.808 [query-frontend] error sending requests to scheduler
2025-05-14 12:46:12.906 [query-scheduler] frontend connected
2025-05-14 12:46:12.906 [query-scheduler] frontend disconnected
2025-05-14 12:46:12.907 [query-frontend] error sending requests to scheduler
2025-05-14 12:46:12.956 [query-scheduler] Failed to join <redacted>:23637: dial tcp <redacted>:23637: connect: connection refused
2025-05-14 12:46:12.966 [query-scheduler] Initiating push/pull sync with: <redacted>:30474
2025-05-14 12:46:12.969 [query-frontend] Stream connection from=<redacted>:44954
2025-05-14 12:46:12.989 ....
2025-05-14 12:46:13.183 [query-scheduler] joining memberlist cluster succeeded
2025-05-14 12:46:13.334 [query-scheduler] waiting until scheduler is ACTIVE in the ring
2025-05-14 12:46:13.334 [query-scheduler] scheduler is ACTIVE in the ring
2025-05-14 12:46:13.334 [query-scheduler] module waiting for initialization
2025-05-14 12:46:13.334 [query-scheduler] starting
2025-05-14 12:46:13.334 [query-scheduler] Loki started
2025-05-14 12:46:14.352 [query-frontend] GET /loki/api/v1/tail? .... (404)
But then the connections also look “stable”:
2025-05-14 12:46:14.817 [query-scheduler] frontend connected
2025-05-14 12:46:15.034 [query-scheduler] frontend connected
2025-05-14 12:46:15.038 [query-scheduler] frontend connected
2025-05-14 12:46:15.267 [query-scheduler] frontend connected
2025-05-14 12:46:15.576 [query-scheduler] frontend connected
2025-05-14 12:46:15.790 [query-scheduler] querier connected
2025-05-14 12:46:15.884 [query-scheduler] querier connected
2025-05-14 12:46:15.926 [query-scheduler] querier connected
2025-05-14 12:46:15.931 [query-scheduler] querier connected
2025-05-14 12:46:16.267 [query-scheduler] querier connected
2025-05-14 12:46:16.334 [query-scheduler] this scheduler is in the ReplicationSet, will now accept requests.
2025-05-14 12:46:16.464 [query-scheduler] querier connected
But incoming query requests to the frontend are not forwarded.
After I restart the query-frontend it starts to work again.
Is there any setting I missed? Is a special graceful shutdown protocol required between the frontend and the scheduler?
Loki: 3.5.0 on Nomad
Thanks in advance!