I’m using Grafana alerting with Loki as the data source.
In my LogQL query, I apply topk() over a metric extracted using unwrap.
Now I want to filter the result of topk to exclude any series where the label status = "start".
Is there a way to filter out those series after topk(), either in a Grafana alert rule or using a recording rule?
If so, how can I implement that?
My goal is to monitor the final state of containers and only alert on those that have ended with a DIE or OOM status. So I'm using this query to extract and filter only the relevant failure cases.
My LogQL metric query looks like this:
topk(1,
  last_over_time(
    {container_name=~"docker-events-logger.*"}
      | json
      | Type = `container`
      | status =~ `die|oom|start`
      | unwrap timeNano [123s]
  ) by (Actor_Attributes_name, Actor_Attributes_image, host_name, host_ip, status)
) by (Actor_Attributes_name, Actor_Attributes_image, host_name, host_ip)
A bit confused: normally if you want to filter something out you'd do it before any other calculation, to avoid performing unnecessary work. Is there any reason not to just remove start from the status matcher?
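If that's acceptable, a minimal sketch (untested) would simply drop start from the regex so those series never reach topk():

topk(1,
  last_over_time(
    {container_name=~"docker-events-logger.*"}
      | json
      | Type = `container`
      | status =~ `die|oom`
      | unwrap timeNano [123s]
  ) by (Actor_Attributes_name, Actor_Attributes_image, host_name, host_ip, status)
) by (Actor_Attributes_name, Actor_Attributes_image, host_name, host_ip)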
Thanks for the reply.
Actually I want to capture each Docker container's last status among DIE, START, and OOM.
by (Actor_Attributes_name, Actor_Attributes_image, host_name, host_ip, status)
→ the last event for each container and status
) by (Actor_Attributes_name, Actor_Attributes_image, host_name, host_ip)
→ the last status for each container
So overall the query does capture each container's last status, but I don't want to get an alert when that status is "start".
That's why I'm struggling to find a way to do this with a log query + metric query…
Your objective is to be alerted if your containers aren’t healthy, yes? If so, it may be a better idea to implement some sort of poke test, or monitor the containers directly via something like cAdvisor and alert from there.
If you have to do it from logs, then your only real option is to alert when a container is die or oom. With any container platform you should reasonably expect containers to restart themselves, which is to say start should always follow die or oom, so you can simply ignore the start status altogether (unless for some reason this isn't the case for you). You also shouldn't treat the start status as an indication that a container is healthy, because it could fail to actually start. This is why monitoring the containers directly is a better idea.
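For example, if you happen to scrape cAdvisor with Prometheus, a rough sketch of a direct check could look like this (container_last_seen is cAdvisor's standard metric; the my-app.* name pattern is just a placeholder for your own containers):

# fires for containers that cAdvisor hasn't seen for more than 60s
time() - container_last_seen{name=~"my-app.*"} > 60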
If you really want to do what you originally set out to do, then you can use last_over_time to determine the last/latest value of an entry. The problem is it only works with a metric value, so you’d have to first convert your status into some sort of number. Something like this (not tested):
sum by (Actor_Attributes_name, Actor_Attributes_image, host_name, host_ip) (
  last_over_time(
    {container_name=~"docker-events-logger.*"}
      | json
      | Type = `container`
      | status =~ `die|oom|start`
      | label_format status_code=`{{ if eq .status "die" }}1{{ else if eq .status "oom" }}2{{ else }}0{{ end }}`
      | unwrap status_code
      [123s]
  )
)
Formatting the label would be great, but I couldn't find a way to convert a label value conditionally within the metric query / log query.
Maybe I should add code to the container agent (promtail or cAdvisor) instead.