I have an alert set up to attempt to catch processes that fail to run. I’m trying to ensure that at least 2 processes ran within the past 5 minutes. I have this alert:
But I’m seeing the alert transition into Pending unexpectedly, and I need help making sense of it. It looks like it’s going into Pending because the A and B values are 1? I’m not sure where these numbers come from or what they represent.
I guess you have a non-zero Pending period in your alert config. Set it to 0 so that no Pending state is used.
I’m trying to understand why it’s going into Pending in the first place; that doesn’t make sense to me. You can see in the graph that the green line (what the alert is based on) never falls below 2, which is my alert threshold value.
The reason is that the last value returned by your query (the value of the Reduce expression A) is 1, and 1 is below your threshold of 2.
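As a rough illustration of that point, here is a minimal Python sketch of what a "Last" reduction followed by a threshold check amounts to. The series values, the `reduce_last` helper, and the "below 2" condition are assumptions made up for the example; this is not Grafana's actual implementation.

```python
# Hypothetical rows returned by the alert query: (bucket start, count).
# These values are invented for illustration, not taken from the thread.
series = [("12:40", 4), ("12:45", 1)]

THRESHOLD = 2  # assumed condition: alert when the reduced value is below 2


def reduce_last(points):
    """Return the value of the final point, as a 'Last' reduction would."""
    return points[-1][1]


last_value = reduce_last(series)
is_breaching = last_value < THRESHOLD
print(last_value, is_breaching)
```

Only the final row matters here: the earlier bucket holds 4, but the reduced value is whatever the last row contains.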
What you’re saying makes sense, but I don’t see that happening in my graph. Maybe I don’t understand the Reduce → Last function. I understand it to always look at the latest value. If you look at my graph, though, the green line never falls below 10.
Could it be because my time grouping is 5 minutes, which is the same as the length of time the alert covers (now-5m to now)?
Hard to know, we can’t see the rest of the query. My suggestion would be to run the query interactively (outside Grafana) and check manually what the latest value is.
I would like to see state history for this alert.
You can see that the result from A is flapping between 1 and 4 or 5. It looks like the data in your DB is “delayed” by 5 minutes, so I would use the last 10 minutes as the time range for the alert query, to avoid the case where there is no data for the last 5 minutes.
OK, I’ll give that a shot.
@jangaraj nope, still regularly going into Pending
I wouldn’t use the graph as a reference, instead use state history as @jangaraj mentioned. The points on the graph are aligned to the start of the 5th minute, but the alert rule is not guaranteed to be evaluated at that exact second.
If I had to guess, the query isn’t correct when it runs in between 5-minute offsets (e.g. at 12:48). Perhaps you could share the query?
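To make that guess concrete, here is a small Python sketch of how a 5-minute lookback evaluated at 12:48 could be split across two 5-minute groups. It assumes the grouping floors each timestamp to a fixed 5-minute boundary (which is how `$__timeGroup` is understood to behave), and the run times and helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def bucket_start(ts, minutes=5):
    """Floor a timestamp to the start of its bucket (minutes must divide 60)."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def grouped_counts(run_times, eval_time, window=timedelta(minutes=5)):
    """Count runs per 5-minute bucket within [eval_time - window, eval_time]."""
    start = eval_time - window
    in_window = [t for t in run_times if start <= t <= eval_time]
    counts = Counter(bucket_start(t) for t in in_window)
    return sorted(counts.items())


def at(minute, second=0):
    """Hypothetical helper: a timestamp at 12:MM:SS on an arbitrary day."""
    return datetime(2024, 1, 1, 12, minute, second)


# Hypothetical run times; four runs fall inside the 12:43 to 12:48 window.
runs = [at(43, 10), at(44, 0), at(44, 40), at(47, 5)]
rows = grouped_counts(runs, eval_time=at(48))
for start, count in rows:
    print(f"{start:%H:%M} -> {count}")
```

In this made-up case the window contains four runs, yet the last row (the 12:45 group) holds only one of them, which is the value a "Last" reduction would see.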
Sure, I can share some of it. I’m using SQL to loop through databases, execute a query in each, and store the results in a temporary table. The query that executes for each database has this bit in the WHERE clause:
and accessed_timeIn AT TIME ZONE ''Central Standard Time'' AT TIME ZONE ''UTC'' <= ''' + $__timeTo() + '''
and accessed_timeIn AT TIME ZONE ''Central Standard Time'' AT TIME ZONE ''UTC'' >= ''' + $__timeFrom() + '''
Here is the final query that is behind the graph and alert:
SELECT
$__timeGroup(accessed_timeIn, '5m') as time,
count(*) as 'All Runs'
FROM
#CombinedResults
group BY
$__timeGroup(accessed_timeIn, '5m')
order by
time;
How did you configure “Configure no data and error handling”?
Use “Alert state if no data or all values are null: No Data”; this can be causing the alerting. Make sure your query returns some non-null data every time.
Hmm, I’m not sure I want that. If I don’t get any data back, then I want to be alerted. That would mean the processes I’m trying to monitor aren’t being executed, and I want to alert on that.
Also, if I were getting nulls, wouldn’t that show up in the alert’s state history?
You want to investigate why the alert is going into the Pending state; that’s the task for now.
Later you can, of course, configure the alert based on your needs.
The rule is evaluated once per minute. That means the time ranges that are queried are (for example):
(09:00UTC, -5m) = 08:55 to 09:00
(09:01UTC, -5m) = 08:56 to 09:01
(09:02UTC, -5m) = 08:57 to 09:02
(09:03UTC, -5m) = 08:58 to 09:03
(09:04UTC, -5m) = 08:59 to 09:04
(09:05UTC, -5m) = 09:00 to 09:05
When are the processes expected to run? Are there potential windows here where just one process has run?
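One way to explore that question is to replay the sliding windows listed above against known run times. The sketch below is only an illustration: the run times are invented, and the `evaluation_windows` helper and the half-open window boundaries are assumptions rather than how Grafana itself evaluates the rule.

```python
from datetime import datetime, timedelta


def evaluation_windows(first_eval, evaluations,
                       every=timedelta(minutes=1),
                       lookback=timedelta(minutes=5)):
    """Yield (start, end) for each evaluation of a rule with a fixed lookback."""
    for i in range(evaluations):
        end = first_eval + i * every
        yield end - lookback, end


# Six evaluations from 09:00 to 09:05, matching the list above.
windows = list(evaluation_windows(datetime(2024, 1, 1, 9, 0), 6))

# Hypothetical process run times, invented for this example.
runs = [
    datetime(2024, 1, 1, 8, 56, 30),
    datetime(2024, 1, 1, 8, 59, 30),
    datetime(2024, 1, 1, 9, 3, 30),
]

# Number of runs that fall inside each window (start inclusive, end exclusive).
counts = [sum(start <= t < end for t in runs) for start, end in windows]

for (start, end), n in zip(windows, counts):
    print(f"{start:%H:%M} to {end:%H:%M}: {n} run(s)")
```

With these invented run times, several of the windows contain only one run, so a "fewer than 2" condition would trip in those evaluations even though the processes ran on schedule.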