Multi-Dimensional Alerting Is Hot Garbage Or Maybe I Am Dumb

Hi, I am trying to set up multi-dimensional (MD) alerts in Grafana Version 10.0.1 (commit: 5a30620b85, branch: HEAD). I set up a simple shell script as an input to Telegraf to simulate router interface traffic, so I could modify the script to make traffic increase and decrease and (hopefully) make alerts fire. Here is my alert config:

I think what I am seeing looks correct, but the “multi-dimensional” aspect isn’t working. I have to make all 3 “interfaces” exceed 50% else no alert fires. This reminds me of the bad old days of panel-based alerting with Grafana. What am I doing wrong?

Hi! :wave:

I have to make all 3 “interfaces” exceed 50% else no alert fires.

I’m not sure I follow, as the screenshot above shows just one interface (eth0) exceeding 50%, and so the alert with instance=eth0 is firing, but not the others.

Well, the thing is, that “et0” instance shows “firing” but there is no alert in my Slack. Only when I cause all of them to be “firing” do I get a Slack alert. I included the screenshot to illustrate how I have the alert logic set up, in the hopes that someone would point out something I am doing wrong :slight_smile:

If you open the Alert list page do you see it firing? If the answer is yes, can we see your notification policies?

Sorry, “Alerts List Page”? Is that this one?

Here are my notification policies:

FWIW the rules indeed show “firing” in the Alert Rules “overview” page, and alerts are indeed sent to Slack, once I make all the monitored interfaces exceed 50%. I am not having a problem with alert delivery, at least IMHO.

One thing I notice is that in my Alert rule I can see that there are 3 series in each of my queries/expressions. However when I look at the Alerts overview and I expand the “Matching Instances” under this particular rule, I don’t see 3 series. I don’t see anything.

That’s the one! Looking at the screenshots, you should get a notification for eth0 (for example) if the alert (1) is above 0.5 for at least 20 seconds, because For is set to 20s, and (2) remains above the threshold for at least another 30 seconds once it’s firing, because the default group_wait is 30 seconds. That is a total of 50 seconds.
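As a rough sketch of that timing, using only the two values mentioned above (a For of 20s and the default group_wait of 30s). It ignores the rule's evaluation interval, which can add some delay, so treat the result as a lower bound rather than an exact figure:

```shell
# Values taken from this thread; adjust to match your own rule and policy.
for_seconds=20          # the rule's "For" duration
group_wait_seconds=30   # default group_wait of the notification policy

# Earliest point at which a notification can be expected after the
# threshold is first exceeded (evaluation interval not included).
total_seconds=$((for_seconds + group_wait_seconds))
echo "Earliest expected notification after ${total_seconds}s"
```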

If you leave eth0 firing for 1 minute without interruption do you get a notification?

Nope…it has been firing all night, in fact :slight_smile:

Can you show me (1) the Health column from the Contact points page for mm-network-automation-alerts and (2) the state history for the alert? You can see the button for that next to Silence on the Alert rules page.

The “Alerting” indication on the Alert History page is from yesterday evening, when I triggered the alert by pushing all 3 interfaces above 50%.

Alert:

Clear:

Those duplicate normal alerts in the first screenshot are odd. There should be just one normal alert per state change (i.e. Normal to Pending, Pending to Alerting and Alerting to Normal). There should not be Normal to Normal. I can see that those are not due to changes made to the rule, as it would then have said Normal (Updated).

The second interesting observation is you said you need to make all three interfaces be > 50%, but there is just one alert in the notification, and the instance name (i.e. eth0) isn’t mentioned in the labels, instead there is just one firing alert for instance name test_bm_in_2.

Soooooo… :slight_smile: what do I do here, @georgerobinson ? Not to put too fine a point on it, but Grafana is sort of useless to my team WRT alerts. We’re going to look for some other “alerting platform” to hook up to influxdb at this point, or write our own :frowning:

Is there any other support avenue available to us besides these forums?

@jangaraj do you have anything to add?

I think it would help me understand what’s happening if you can turn on debug logging in Grafana and share logs matching logger=ngalert.state and logger=alertmanager. For example:

logger=ngalert.state.manager rule_uid=a654dd16-b476-4564-a828-7cd7e07bb365 org_id=1 t=2023-07-05T10:27:00.025492+03:00 level=debug msg="State manager processing evaluation results" resultCount=1
logger=ngalert.state.manager rule_uid=a654dd16-b476-4564-a828-7cd7e07bb365 org_id=1 instance= t=2023-07-05T10:27:00.025547+03:00 level=debug msg="Setting next state" handler=resultAlerting
logger=ngalert.state.manager rule_uid=a654dd16-b476-4564-a828-7cd7e07bb365 org_id=1 instance= t=2023-07-05T10:27:00.025576+03:00 level=debug msg="Keeping state" state=Alerting
logger=ngalert.state.manager rule_uid=a654dd16-b476-4564-a828-7cd7e07bb365 org_id=1 t=2023-07-05T10:27:00.025631+03:00 level=debug msg="Saving alert states" count=1
logger=alertmanager org=1 t=2023-07-05T10:27:00.029786+03:00 level=debug component=alertmanager orgID=1 component=dispatcher msg="Received alert" alert=Test[1da2cb0][active]
logger=alertmanager org=1 t=2023-07-05T10:27:00.029989+03:00 level=debug component=alertmanager orgID=1 component=dispatcher aggrGroup="{}:{alertname=\"Test\", grafana_folder=\"Test Folder\"}" msg=flushing alerts=[Test[1da2cb0][active]]
logger=alertmanager org=1 t=2023-07-05T10:27:00.28467+03:00 level=debug component=alertmanager orgID=1 component=dispatcher receiver=grafana-default-email integration=email[0] msg="Notify success" attempts=1

This will help me understand what’s happening from when Grafana queries InfluxDB to when it sends a notification.

You can enable debug logs following the instructions here.
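Once debug logging is on, something like the following could pull out just the two loggers mentioned above. This is only a sketch: the function name is made up for illustration, and the log path in the usage comment is a hypothetical placeholder that depends on how Grafana is installed.

```shell
# Print only the lines from the alerting state manager and the
# embedded Alertmanager. $1 is the path to the Grafana log file.
filter_alerting_logs() {
  grep -E 'logger=(ngalert\.state|alertmanager)' "$1"
}

# Usage (hypothetical path, adjust for your installation):
#   filter_alerting_logs /path/to/grafana.log > alerting-debug.log
```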

Thank you @georgerobinson . I am trying to figure out how to attach a file which is not an image.

The log I will eventually upload contains these important times:

  • 16:51:15 - et1 exceeds threshold
  • 16:54:43 - et2 exceeds threshold
  • 16:59:00 - et3 exceeds threshold
  • 17:00:?? - alert is received in slack
  • 17:02:09 - et3 returns to normal
  • 17:05:?? - clear is received in slack
  • 17:06:07 - et2 returns to normal
  • 17:09:57 - et1 returns to normal

Please let me know what else you need, I appreciate your looking into this.

Here is the log excerpt you requested:

Hi! :wave: I think I found the issue :slight_smile:

You appear to be unintentionally overriding the name label that comes from your InfluxDB query with a custom label, also called name, that you’ve added to your rule:

logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 t=2023-07-05T16:51:42.507692895Z level=warn msg="Evaluation result contains either reserved labels or labels declared in the rules. Those labels from the result will be ignored" labels="name=et1"

What is happening here is that the et1, et2 and et3 results are overwriting each other in turn, because the custom label is unintentionally coalescing them into the same alert when they should be three separate alerts:

logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 t=2023-07-05T16:51:42.507692895Z level=warn msg="Evaluation result contains either reserved labels or labels declared in the rules. Those labels from the result will be ignored" labels="name=et1"
logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 instance="name=et1" t=2023-07-05T16:51:42.507711624Z level=debug msg="Setting next state" handler=resultAlerting
logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 instance="name=et1" t=2023-07-05T16:51:42.50771971Z level=debug msg="Changing state" previous_state=Normal next_state=Pending
logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 t=2023-07-05T16:51:42.507774886Z level=warn msg="Evaluation result contains either reserved labels or labels declared in the rules. Those labels from the result will be ignored" labels="name=et2"
logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 instance="name=et2" t=2023-07-05T16:51:42.507788157Z level=debug msg="Setting next state" handler=resultNormal
logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 instance="name=et2" t=2023-07-05T16:51:42.50779526Z level=debug msg="Changing state" previous_state=Pending next_state=Normal
logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 t=2023-07-05T16:51:42.507835853Z level=warn msg="Evaluation result contains either reserved labels or labels declared in the rules. Those labels from the result will be ignored" labels="name=et3"
logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 instance="name=et3" t=2023-07-05T16:51:42.507848811Z level=debug msg="Setting next state" handler=resultNormal
logger=ngalert.state.manager rule_uid=a3361bd9-7dc9-4ee6-913e-26b4482cada7 org_id=1 instance="name=et3" t=2023-07-05T16:51:42.50785583Z level=debug msg="Keeping state" state=Normal

Removing the name custom label from the rule should fix the issue :slight_smile:
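To illustrate the idea, here is a toy model of the coalescing described above. It is not Grafana's implementation, just a sketch that assumes an alert instance is identified by its final label set and that a rule-level label replaces a query label with the same key. The function name and the value custom-value are made up for the example.

```shell
# Reads one label set per line on stdin and prints the distinct
# label sets that remain, i.e. the distinct alert instances.
# $1 is an optional (hypothetical) rule-level value for "name";
# when given, it replaces the "name" label coming from the query.
distinct_instances() {
  if [ -n "$1" ]; then
    sed "s/name=[^,]*/name=$1/" | sort -u
  else
    sort -u
  fi
}

# Without a conflicting rule label, three series stay three instances.
printf 'name=et1\nname=et2\nname=et3\n' | distinct_instances ""

# With a rule label also called "name", all three collapse into one.
printf 'name=et1\nname=et2\nname=et3\n' | distinct_instances "custom-value"
```

In this model the second call leaves a single label set, which would match the symptom of each series overwriting the previous one's state.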

Hi @georgerobinson TL;DR you have helped me fix it, so thank you…but I don’t think it is as you explained. Please help me understand.

This is my script on the telegraf side:

echo "acro_test,test_type=alert_test_simple,source=router1,name=et1,description=isp-test1 in_pct=5,out_pct=5"
echo "acro_test,test_type=alert_test_simple,source=router1,name=et2,description=isp-test2 in_pct=10,out_pct=10"
echo "acro_test,test_type=alert_test_simple,source=router1,name=et3,description=isp-test3 in_pct=15,out_pct=15"

As you can see, this script feeds measurements to Telegraf via InfluxDB line protocol. I attempt to cause alerts by changing the “in_pct” field in one or more of the lines in the script. I am using “source” and “name” as tags because those are the ones used by the Telegraf input plugin that we primarily use (gnmi).

So what I did was add another section to the script, like so:

echo "acro_test,test_type=alert_test_noname,source=router1,interface=et1,description=isp-test1 in_pct=5,out_pct=5"
echo "acro_test,test_type=alert_test_noname,source=router1,interface=et2,description=isp-test2 in_pct=10,out_pct=10"
echo "acro_test,test_type=alert_test_noname,source=router1,interface=et3,description=isp-test3 in_pct=15,out_pct=15"

As you can see, “name” is now “interface”. Et voila, multi-dimensional alerts work like they should. So great news there.
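For anyone wanting to reproduce this test, a parametrized variant of the same script might look like the sketch below. The emit_interface function and the ET1_IN_PCT, ET2_IN_PCT and ET3_IN_PCT environment variables are hypothetical names added for illustration; the emitted lines match the ones above when no overrides are set.

```shell
# Emit one line of line protocol for a simulated interface.
# $1 = interface tag, $2 = description tag, $3 = in_pct, $4 = out_pct
emit_interface() {
  echo "acro_test,test_type=alert_test_noname,source=router1,interface=$1,description=$2 in_pct=$3,out_pct=$4"
}

# Hypothetical env overrides make it easy to push a single interface
# over the threshold without editing the script.
emit_interface et1 isp-test1 "${ET1_IN_PCT:-5}" 5
emit_interface et2 isp-test2 "${ET2_IN_PCT:-10}" 10
emit_interface et3 isp-test3 "${ET3_IN_PCT:-15}" 15
```

Setting, say, ET2_IN_PCT=75 in the environment Telegraf runs the script in would then raise only et2, which is a quick way to check that each interface alerts on its own.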

HOWEVER

I don’t think I was doing what you indicated, or perhaps I misunderstand. I had no such

custom label that you’ve added to the your rule that is also called name

The only mention of name in my nonworking rule occurs in the query itself:

SELECT
mean("in_pct") as "IN_PCT"
FROM
"autogen"."acro_test"
WHERE
"test_type"::tag = 'alert_test_simple'
AND $timeFilter
GROUP BY
time($__interval)
,"name"::tag

(I was also using format as time series and alias by $tag_source $tag_name.) So I don’t think I was overriding anything so much as I was just using unfortunately-named tags to begin with. Essentially, I think Grafana alerting doesn’t like queries that GROUP BY a tag called name that is present in the raw measurement data. Do I have that right?

Hi! :wave:

What I mean is looking at the logs and screenshots I think you had both a label called name from the InfluxDB query and a custom label called name at the same time.

I’ve highlighted in blue the custom label called name that would have been conflicting with the name label from the query.

You can delete this using the Trash can icon under Custom Labels:

Hi @georgerobinson … I just don’t believe that is the case. In order to get the alerts to work, all I changed was the tag I was grouping by in the first query of the alert rule. I did not change anything else.

As you see, I use the “name” label in the alert rule simply to route the alert…and I am still using that label. The alerts work fine even if I have that label.