From the Loki Canary doc:
Loki Canary is a standalone app that audits the log-capturing performance of a Grafana Loki cluster.
For a Loki cluster, how many Loki Canary instances need to be deployed? Is it necessary to run Loki Canary on every client node in an environment? E.g., in an environment with 50+ nodes, do we need to run Loki Canary on all 50+ nodes, or is it fine to run it on 3-5 nodes to collect the necessary Loki performance data?
The doc says:
The defaults of spot-check-max means that after 4 hours of running the canary will have a list of 16 entries it will query every minute (default spot-check-query-rate interval is 1m), so be aware of the query load this can put on Loki if you have a lot of canaries.
I will assume it is not a good idea to run too many canaries.
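To get a feel for the query load the doc is warning about, here is a rough back-of-the-envelope sketch. The 4h and 1m values come from the quoted passage; the 15m spot-check interval is an assumption inferred from "16 entries after 4 hours", and the function name is made up for illustration. It also assumes each tracked entry costs one query per round, which may not match the canary's actual behavior exactly.

```python
from datetime import timedelta


def spot_check_queries_per_minute(
    canaries: int,
    spot_check_max: timedelta = timedelta(hours=4),
    # Assumed value: 4h / 16 entries implies one new entry every 15m.
    spot_check_interval: timedelta = timedelta(minutes=15),
    spot_check_query_rate: timedelta = timedelta(minutes=1),
) -> float:
    """Estimate steady-state spot-check queries per minute across all canaries."""
    # Entries each canary tracks once it has been running for spot_check_max.
    entries_per_canary = spot_check_max // spot_check_interval
    # How many query rounds each canary runs per minute.
    rounds_per_minute = timedelta(minutes=1) / spot_check_query_rate
    # Assumption: one query per tracked entry per round.
    return canaries * entries_per_canary * rounds_per_minute


if __name__ == "__main__":
    for n in (1, 5, 50):
        print(n, "canaries ->", spot_check_queries_per_minute(n), "queries/min")
```

Under these assumptions, the estimate grows linearly with the number of canaries, so 3-5 canaries and 50+ canaries differ by an order of magnitude in spot-check queries alone.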
In my opinion, Canary is something like an end-to-end test for the Loki cluster and the environment your logs come from. We deploy exactly one canary in each of our VPCs. In our development and staging environments we sometimes disable the canary after initial testing; in upper environments we leave it running.
This is what I thought too: we don't really need to run a canary on every node, just one (or a few) per environment or per VPC to collect metrics on how the Grafana Loki cluster is performing. However, the doc doesn't really mention anything beyond the one comment saying that many canaries will add load to Loki. Even the diagram seems to show many canaries.
It didn't bother me too much. It could be clearer, but it is also technically correct: we are running many canaries (maybe about 20), even though they are in different environments. And it's probably not the doc's responsibility to prescribe how the canary should be deployed.
You have 20 canaries against one Loki cluster? Did you see any short spikes on the Loki queriers?
Not really. Compared to real log traffic, canary logs are trivial. We still took some precautions:
We kept the default size (which is small, 100 bytes).
We set the interval higher. The default is 1s, but I think we run 60s in lower environments and 30s in production.
We use a dedicated tenant (with a 7-day retention override) in Loki for canaries. This only applies if you run Loki in multi-tenant mode.
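To put a rough number on "trivial", here is a small sketch of the write volume the canaries generate. The 100-byte size and the 1s, 30s, and 60s intervals are the values mentioned above; the function name is hypothetical, and the estimate counts only the log line payload, ignoring labels and any protocol or storage overhead.

```python
def canary_bytes_per_day(
    canaries: int,
    interval_seconds: float = 1.0,  # default write interval mentioned above
    size_bytes: int = 100,          # default line size mentioned above
) -> float:
    """Estimate raw canary log payload written per day, in bytes."""
    seconds_per_day = 86_400
    lines_per_canary = seconds_per_day / interval_seconds
    return canaries * lines_per_canary * size_bytes


if __name__ == "__main__":
    for interval in (1, 30, 60):
        megabytes = canary_bytes_per_day(20, interval_seconds=interval) / 1_000_000
        print(f"20 canaries at {interval}s interval: {megabytes:.2f} MB/day")
```

With 20 canaries this works out to roughly 173 MB/day at the 1s default and under 6 MB/day at a 30s interval, which is small next to typical application log volume.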