Alerting in Grafana 8

Hi!

  • What Grafana version and what operating system are you using?
    I am using Grafana 8.0.4 on Ubuntu 20.04.2 LTS, running on an Odroid C2 (a single-board computer, similar to a Raspberry Pi).

  • What are you trying to achieve?
    I am trying to make Grafana send alerts to my Signal messenger account.

  • How are you trying to achieve it?

  • enabled ngAlerts (seems to be working)
  • configured a contact point: Webhook.
  • set up a default notification policy
    There, I set up custom timings (see the sketch after this list):
    Group wait = 1s
    Group interval = 1s
    Repeat interval = 1d
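
For reference, this is how I understand those timings map onto the Alertmanager-style route settings that the new alerting uses internally (just an illustrative Python dict, not an actual config file; "signal-webhook" is a placeholder for my Webhook contact point):

# Illustrative only: the default notification policy expressed with
# Alertmanager-style keys; "signal-webhook" is a placeholder.
default_policy = {
    "receiver": "signal-webhook",
    "group_wait": "1s",       # wait before the first notification for a new alert group
    "group_interval": "1s",   # minimum time between notifications for the same group
    "repeat_interval": "1d",  # re-send a still-firing alert only after this long
}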

The webhook that forwards the message to signal-cli is based on:

I added a schema for the new JSON format of the ngAlert messages.

  • What happened?

I successfully received alert messages on my Signal account. However, a message for one alert is sent by Grafana every ~30 s, unless I mute the alert.

  • What did you expect to happen?

As the repeat interval is set to 1d, I was expecting to receive a message only after another day, or in case the alert condition was not fulfilled for a while and then fulfilled again (which was not the case).

  • Can you copy/paste the configuration(s) that you are having problems with?

I guess I don’t really understand how the webhook works / what the format of the JSON object is. Also, I was not able to find any proper documentation for it.

Maybe the webhook contact point expects a confirmation of the sent message?

I tried to build a schema from an alert JSON message sent by Grafana. The schema looks like this:

# Imports needed for these pydantic schemas
from datetime import datetime
from typing import Any, Dict, List

from pydantic import BaseModel


class NGALabels(BaseModel):
    alertname: str
    rule_uid: str


class NGAAnnotations(BaseModel):
    message: str


class NGAlert(BaseModel):
    status: str
    labels: NGALabels
    annotations: NGAAnnotations
    startsAt: str
    endsAt: str
    generatorURL: str
    fingerprint: str
    silenceURL: str
    dashboardURL: str
    panelURL: str
    valueString: str


class NGAGrafanaOutgoing(BaseModel):
    receiver: str
    status: str
    alerts: List[NGAlert]
    groupLabels: Dict[str, Any]
    commonLabels: Dict[str, Any]
    commonAnnotations: Dict[str, Any]
    externalURL: str
    version: str
    groupKey: str
    truncatedAlerts: int
    title: str
    state: str
    message: str


class MessageSentNGAGrafana(NGAGrafanaOutgoing):
    timestamp: datetime
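
To check that the schema matches what Grafana sends, I validate a hand-written payload against it (just a sketch; all values are made-up placeholders based on my freezer alert, not a real captured message):

# Made-up example payload, only meant to exercise the schema above.
sample = {
    "receiver": "signal-webhook",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "Inside Freezer alert", "rule_uid": "xxxx"},
        "annotations": {"message": "Alert! Freezer too warm"},
        "startsAt": "2021-07-01T12:00:00Z",
        "endsAt": "0001-01-01T00:00:00Z",
        "generatorURL": "http://localhost:3000/",
        "fingerprint": "abcdef",
        "silenceURL": "http://localhost:3000/alerting/silence/new",
        "dashboardURL": "",
        "panelURL": "",
        "valueString": "value=-3",
    }],
    "groupLabels": {},
    "commonLabels": {},
    "commonAnnotations": {},
    "externalURL": "http://localhost:3000/",
    "version": "1",
    "groupKey": "{}:{}",
    "truncatedAlerts": 0,
    "title": "[FIRING:1] Inside Freezer alert",
    "state": "alerting",
    "message": "Alert! Freezer too warm",
}

parsed = NGAGrafanaOutgoing(**sample)
print(parsed.alerts[0].labels.alertname)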

The message is generated here:

# Excerpt from signal_cli_rest_api/api/messages.py; helpers such as quote(),
# BackgroundTasks and run_signal_cli_command come from the existing module.
@router.post("{number}/ngagrafana/", response_model=MessageSentNGAGrafana, status_code=201)
async def send_message(
    message: NGAGrafanaOutgoing, number: str, background_tasks: BackgroundTasks,
    receiver: str, group: bool = True,
) -> Any:
    """
    Build one Signal message from all alerts in the Grafana payload and send it via signal-cli.
    """
    message_string = "numebr of alerts: "+ str(len(message.alerts)) + "\n"
    
    i=0
    #print ( "numebr of alerts: ", len(message.alerts))
    for alert in message.alerts:
        i=i+1
        
        message_string += "#" + str(i) + ":\n"
        message_string += alert.status + "\n"
        message_string += alert.labels.alertname + "\n"
        message_string += alert.annotations.message + "\n"
        message_string += alert.silenceURL + "\n"
        message_string += alert.valueString + "\n\n"
        
        """
        if message.evalMatches is not None:
            for match in message.evalMatches:
                metrics += "\n" + match.metric + ": " + str(match.value) 
    
        message_string = message.title + "\n" + "State: " +  message.ruleName + "\n" + "Message: " + message.message + "\n" + "URL: " + message.ruleUrl + "\n\n" + "Metrics: " + metrics
        """

    cmd = ["-u", quote(number), "send", "-m", quote(message_string)]
    
    receivers = []
    if group:
        cmd.append("-g")
        cmd.append(quote(receiver))
    else:
        receivers.append(receiver)
        cmd += list(map(quote, receivers))

    response = await run_signal_cli_command(cmd)
    
    print ("response: ", response)

    return MessageSentNGAGrafana(**message.dict(), timestamp=response.split("\n")[0])

At the very end, I think the confirmation is sent back to Grafana. If I understand it correctly, that is just the same message, extended by the timestamp of the execution.
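
For manual testing, I POST the sample payload from above to the endpoint with requests (a sketch; host, port, path prefix, phone number and receiver are placeholders, and the actual path depends on how the router is mounted):

import requests

# Placeholders: adjust host/port, router prefix, number and receiver to your setup.
url = "http://localhost:8000/+491234567890/ngagrafana/"
resp = requests.post(url, json=sample, params={"receiver": "my-signal-group-id", "group": "true"})
print(resp.status_code, resp.json())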

The modified files are:

signal_cli_rest_api/schemas.py
signal_cli_rest_api/api/messages.py

Can I just have one message sent once the alert condition is fulfilled?
Or did I understand the concept of the new alerting incorrectly?
Should I use a different approach?

Best regards
Christian

Have you found a solution to this?
I am experiencing something similar, but I am not sure if it’s related: I get resolved/firing alerts at an interval that seems to match the group interval, whereas I would expect to get the alert only once when it triggers.
Do you get resolved messages while the alert is active at all?

I have not found a solution yet. Currently, I have disabled resolve messages. Also, I manually change the alert rule to generate test messages. I am not sure whether there is a resolve message at all if the condition is no longer fulfilled because the rule changed rather than because the value went back to normal. By the way, I find it super annoying that there is no proper way to send a test message… Or am I just too stupid to find it?

What I have seen while monitoring the TCP connection is that the webhook does not seem to send a reply once a message from Grafana is received (weirdly, if I use the FastAPI web page, it does send an OK).
So I wrote a new webhook using Flask (see the sketch below). This one does return an OK message. Unfortunately, Grafana is still bombarding me with alerts.
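
A minimal sketch of that Flask webhook (the actual forwarding to signal-cli is left out here, and the route and port are placeholders):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/ngagrafana", methods=["POST"])
def ngagrafana():
    payload = request.get_json(force=True)
    # ... forward the alerts to signal-cli here, as in the FastAPI version ...
    for alert in payload.get("alerts", []):
        print(alert.get("status"), alert.get("labels", {}).get("alertname"))
    # Explicitly return a 200 OK so Grafana sees the delivery as successful.
    return jsonify({"status": "ok"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)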

What I did not try yet:

  1. A colleague of mine said he had the impression that Grafana re-sends an alert after an internal timeout occurs, and that for him it helped to increase the interval at which the alert is evaluated. Maybe this can help you too?
  2. Prometheus seems to use the same (or a similar?) backend. Maybe studying its documentation will help to understand what is going on?
    Configuration | Prometheus

Anyway, I am not really sure if I still want to use Grafana alerts. There are some sensors where my alert condition is supposed to be if (a < b): alert = True, i.e. a comparison between two measured values.
I don’t see how this is even possible with the alerting concept. Maybe with Flux (my data is stored in InfluxDB). However, Flux does not seem to be fully supported by Grafana yet. Also, I killed my whole system when trying to enable Flux.

So, I am now thinking about writing a Python script to check my alerts (see the sketch below).
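
A rough idea of what such a script could look like (just a sketch; read_sensor_a, read_sensor_b and send_signal_message are hypothetical stand-ins for the InfluxDB queries and the signal-cli call):

import time


def read_sensor_a() -> float:
    # Hypothetical helper: query the latest value of sensor "a" from InfluxDB.
    raise NotImplementedError


def read_sensor_b() -> float:
    # Hypothetical helper: query the latest value of sensor "b" from InfluxDB.
    raise NotImplementedError


def send_signal_message(text: str) -> None:
    # Hypothetical helper: forward the text via signal-cli (or my webhook).
    raise NotImplementedError


def check_once(already_alerted: bool) -> bool:
    """Evaluate a < b once and alert only on the transition into the alarm state."""
    a, b = read_sensor_a(), read_sensor_b()
    if a < b:
        if not already_alerted:
            send_signal_message(f"alert: a={a} < b={b}")
        return True
    return False  # condition cleared, allow a new alert next time


if __name__ == "__main__":
    alerted = False
    while True:
        alerted = check_once(alerted)
        time.sleep(300)  # evaluate every 5 minutes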

@christianho Can you paste the JSON for the configuration of your alerting rule?

I don’t think I found the right way to export the correct JSON.

I don’t see any way to do it in the alert manager. The JSON from the corresponding panel seems incomplete: how often the rule is checked seems to be missing.

Here is what I found there:

"alert": {
        "alertRuleTags": {},
        "conditions": [
          {
            "evaluator": {
              "params": [
                -5
              ],
              "type": "gt"
            },
            "operator": {
              "type": "and"
            },
            "query": {
              "params": [
                "A",
                "5m",
                "now"
              ]
            },
            "reducer": {
              "params": [],
              "type": "avg"
            },
            "type": "query"
          },
          {
            "evaluator": {
              "params": [
                -10
              ],
              "type": "gt"
            },
            "operator": {
              "type": "or"
            },
            "query": {
              "params": [
                "B",
                "5m",
                "now"
              ]
            },
            "reducer": {
              "params": [],
              "type": "avg"
            },
            "type": "query"
          }
        ],
        "executionErrorState": "alerting",
        "for": "1m",
        "frequency": "5m",
        "handler": 1,
        "message": "Alert! Freezer too warm",
        "name": "Inside Freezer alert",
        "noDataState": "alerting",
        "notifications": [
          {
            "uid": "******"
          }
        ]
      }

This looks like an alerting rule for the legacy alerting system. Are you sure you’ve got the feature toggle enabled for the new Alerting system?

Here’s an alert I have that checks disk usage on a personal server:

{
	"name": "asdfgdfg",
	"interval": "1m",
	"rules": [{
		"grafana_alert": {
			"title": "asdfgdfg",
			"condition": "C",
			"no_data_state": "NoData",
			"exec_err_state": "Alerting",
			"data": [{
				"refId": "A",
				"queryType": "",
				"relativeTimeRange": {
					"from": 600,
					"to": 0
				},
				"datasourceUid": "******",
				"model": {
					"exemplar": true,
					"expr": "avg(100 - ((node_filesystem_avail_bytes{mountpoint=~\"/some|/thing|/cool\"} * 100) / node_filesystem_size_bytes{mountpoint=~\"/some|/thing|/cool\"})) by (mountpoint)",
					"hide": false,
					"interval": "",
					"intervalMs": 1000,
					"legendFormat": "",
					"maxDataPoints": 43200,
					"refId": "A"
				}
			}, {
				"refId": "B",
				"queryType": "",
				"relativeTimeRange": {
					"from": 0,
					"to": 0
				},
				"datasourceUid": "-100",
				"model": {
					"conditions": [{
						"evaluator": {
							"params": [0, 0],
							"type": "gt"
						},
						"operator": {
							"type": "and"
						},
						"query": {
							"params": []
						},
						"reducer": {
							"params": [],
							"type": "avg"
						},
						"type": "query"
					}],
					"datasource": "__expr__",
					"expression": "A",
					"hide": false,
					"intervalMs": 1000,
					"maxDataPoints": 43200,
					"reducer": "mean",
					"refId": "B",
					"type": "reduce"
				}
			}, {
				"refId": "C",
				"queryType": "",
				"relativeTimeRange": {
					"from": 0,
					"to": 0
				},
				"datasourceUid": "-100",
				"model": {
					"conditions": [{
						"evaluator": {
							"params": [0, 0],
							"type": "gt"
						},
						"operator": {
							"type": "and"
						},
						"query": {
							"params": []
						},
						"reducer": {
							"params": [],
							"type": "avg"
						},
						"type": "query"
					}],
					"datasource": "__expr__",
					"expression": "$B > 90",
					"hide": false,
					"intervalMs": 1000,
					"maxDataPoints": 43200,
					"refId": "C",
					"type": "math"
				}
			}],
			"uid": "******"
		},
		"for": "5m",
		"annotations": {},
		"labels": {}
	}]
}

Thanks for the hint! Indeed, this rule was created with the legacy alerting system. I was under the impression that the rules would automatically be migrated when enabling ngalert. I will try to delete all the rules and recreate them.

From my grafana.ini:

[feature_toggles]
# enable features, separated by spaces
enable = ngalert

This is why I think ngalert is enabled. If there is more to do, I am not aware of it.


Hmm, that should work. You can also try enabling it via an environment variable, which is what I do for my local dev setup: GF_FEATURE_TOGGLES_ENABLE=ngalert. Once you enable that, all of your alerting rules should be migrated into the new format. Could you double-check for me by going to the alerting UI? Click the little bell icon in the left-hand sidebar. If you see things like “Contact points” as a tab in the UI, then you should have things turned on correctly.


@davidparrott Thanks! I think you gave the relevant hint!

I deleted the rule and recreated it. One alert was sent, but no further alert after almost three hours. So it seems like the (incorrectly?) migrated alert rule was the problem.
