We’re currently checking if the error rate of our logs is above 5% with this query:
sum(rate({log_kubernetes_namespace="production",log_level="error"}[1m]))
/
sum(rate({log_kubernetes_namespace="production"}[1m])) > 0.05
However, I can't seem to turn this into a percentage over time. If we treat any interval with an error rate above 5% as downtime, I'd like to get the uptime percentage over the past few weeks. Wrapping the query in an average over time, for example, doesn't work the way I expected.
Does this require a recording rule or is there some way to get the average of a boolean value over time?
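To make concrete what I mean by averaging a boolean over time, here is a minimal Python sketch of the arithmetic only, not a LogQL solution. It assumes the per-minute error ratios have already been exported as a plain list. The function name `uptime_percentage` and the sample values are hypothetical, made up for illustration.

```python
def uptime_percentage(error_ratios, threshold=0.05):
    """Percentage of samples whose error ratio is NOT above the threshold."""
    if not error_ratios:
        raise ValueError("need at least one sample")
    # Each sample is "up" when its error ratio is at or below the threshold,
    # mirroring the `> 0.05` comparison in the query above.
    up_count = sum(1 for ratio in error_ratios if ratio <= threshold)
    return 100.0 * up_count / len(error_ratios)


# Hypothetical per-minute error ratios (errors / all log lines), not real data.
samples = [0.01, 0.02, 0.08, 0.03, 0.10, 0.00, 0.04, 0.01, 0.02, 0.03]
print(uptime_percentage(samples))  # 2 of 10 samples exceed 5%, so this prints 80.0
```

In other words, I want the query to turn each evaluation step into 0 or 1 and then average those values over the whole range.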