Custom templated Alert Message for MSTeams contains default templating?

Hi there.

I’ve been on a templating adventure with Grafana for the last couple of days (Grafana v9.4) and MSTeams.

This is the message for a single alert:
[screenshot of the rendered Teams message]

The result is pretty readable now, but there is an additional title (“Alert:”) in each Teams post that I can’t for the life of me figure out where it’s coming from. My guess is that it’s part of the default template seeping through from somewhere, somehow, since the line “Alert:” does not appear twice or additionally in any of the templates I’ve made (I have templates for the message, the alerts, and the title of the message).

Can I override this? It’s a minor annoyance compared to what the messages used to look like, so if it can’t be done, I’m still pretty happy.

Since I’m posting anyway: I’ve also tried to get headers (markdown ##Header) to work in the MSTeams template. It isn’t parsed (it shows up escaped instead of being formatted as a header). I have tried ##Header, ##Header##, ## Header and ## Header ## to no avail. Conversely, I’ve got bold text, bullet points and italics to work fine. Are headers supported at all for MSTeams as a contact point?

Hi! I think it would be easiest if you could share your templates in a reply so we can take a look at what the issue might be.

Hello, thank you for responding!
I should have included those, sorry about that.

These are my templates:

[screenshots of the templates]

And this is the contact point configuration for MSTeams:
[screenshot of the MSTeams contact point configuration]

Here are the custom annotations:

[screenshot of the custom annotations]

Hi! I think there is an error in the alert_test template on Line 3. You have {{.Labels. }} which is invalid. I think you want {{ .Labels }} without the second dot?

I blurred out a label that goes there. Think of it as being .Labels.node-name
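As an aside: if the real label name contains a hyphen (like node-name), {{ .Labels.node-name }} won’t parse either, because Go templates read the hyphen as a minus sign. A one-line sketch of the usual workaround, with node-name standing in for your blurred label:

{{ index .Labels "node-name" }}

The index function looks the key up in the label map by string, which sidesteps the parsing problem.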

Oh! OK. To answer the question of where the “Alert:” comes from, I can see it here in this screenshot :slight_smile:

[screenshot highlighting the “Alert:” line in one of the shared templates]

Oh my,
I can’t believe I missed that.

Thank you so much! It’s always great to get a second pair of eyes on things. I almost didn’t see it even though you pointed it out.

Do you have any idea about the markdown headers?

Grafana uses Microsoft Teams Adaptive Cards, with text blocks for both the header and the message (https://github.com/grafana/alerting/blob/main/receivers/teams/teams.go#L257-L267). I’m afraid that, looking at their docs (Text Features - Adaptive Cards | Microsoft Learn), headers are not supported.
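If it helps as a workaround, the markdown that Adaptive Cards does render (bold, italics, bullet points) can stand in for a header. A minimal sketch, assuming made-up template and annotation names:

{{ define "teams.message" }}
**Firing alerts**
{{ range .Alerts.Firing }}
- *{{ .Labels.alertname }}*: {{ .Annotations.summary }}
{{ end }}
{{ end }}

Bold text on its own line isn’t a real heading, but it reads close enough in a Teams post.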

I see!
I’m new to MSTeams and I was not aware of Adaptive Cards.
I’ve been looking at the wrong markdown settings, it turns out.

Thank you so much for the quick responses and explanations!

Hey Guys,

I am totally new to Grafana alerting customization. Can someone please share a working template guide?
The guides below do not help me understand where the data comes from or how it maps to the Grafana dashboard or its variables:

[links to two Grafana docs pages on templating notifications]

They are too generic and too high level. Please guide me.

Thank you

@georgerobinson Could you please help here?

Hi! :wave: The two pages you linked are excellent introductions to templating notifications. What do you need help with exactly?

We are working on designing meaningful alert templates that provide clear, detailed information about the specific issue detected, beyond just stating the high-level problem. For instance, rather than simply alerting that “500 errors are above threshold,” the alert should include relevant query strings, endpoints, and other specifics about the source of the errors.

Ideally, the alert should arm our responding engineer with enough context to begin troubleshooting the root cause before they even log in. To continue the example, the 500 error alert could state:

"More than 50 500 errors have occurred in the last 5 minutes on the GET /api/v1/reports endpoint, specifically from the following user queries:

User 123 querying report 4567

User 456 querying report 7890

This points to a potential issue with the reporting API under heavy load. Please investigate query performance and infrastructure health around the /api/v1/reports endpoint immediately."

By providing this level of granular detail (the specific endpoint, sample queries, and plausible root cause), the alert gives engineers a head start in diagnosing the problem quickly.

Does this make sense? These articles do not help me do any of this. In fact, it’s not even clear how we can get the query details into our alerts.

Hello! :wave:

It sounds to me like the documentation you need right now is not how to template notifications, but how multi-dimensional alerts work? As far as I understood from your reply, your immediate problem is actually how to get information from the query into the alert, which is a much different problem from templating notifications. Is that correct?

Well, you are partly right. But what is the point of seeking that information if I don’t know where I will use it? :wink:
I simply provided you with the big picture. I am hoping you have a better idea than I do about how we can achieve our goal.
What I am hoping to hear back from you is: if getting information from the query is not the right way forward, then what are our alternatives for configuring meaningful alerts?

Hope that clarifies my perspective.

Thanks

The information I think you need is the number of errors in the last 5 minutes grouped by endpoint, and possibly grouped by the user and report too. However, I would consider ignoring both the user and report as you may find your alerts have very high cardinality, unless you know the number of users and reports is bounded.

For example, suppose you have 10 users, querying 10 reports each. However, because the website is down, all HTTP requests are returning HTTP 500 Internal Server Error. In Grafana, that would mean 10 x 10 = 100 firing alerts. Is that really actionable for your responding engineer, or will they be overwhelmed?
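To make the cardinality point concrete, this is roughly what the user/report grouping would look like in PromQL (the metric and label names here are assumptions):

sum by (endpoint, user, report) (rate(http_requests_total{status="500"}[5m]))

Every user/report combination that sees an error becomes its own series, and potentially its own firing alert.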

I believe there is a mechanism in OnCall to silence alerts for a time frame, right? My plan was to train the engineers to use OnCall and Grafana Incident workflows to manage the alerts in such a scenario.

> I believe there is a mechanism in OnCall to silence alerts for a time frame, right?

Yes there is, but I would question the usefulness of having so many alerts in the first place. Instead, what I would recommend (again, this is just a recommendation, so you don’t have to follow it) is to alert only on the per-endpoint HTTP 500 error rate. Then, either in your runbook or via the DashboardUID and PanelID annotations, link to a Grafana dashboard showing the breakdown of which users are experiencing HTTP 500 errors and for which reports.
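As a sketch of the linking part: notification templates expose per-alert DashboardURL and PanelURL fields, which Grafana fills in from those annotations, so you can print the link straight into the message (the template name here is made up):

{{ define "alerts.with_links" }}
{{ range .Alerts.Firing }}
- {{ .Labels.alertname }}: {{ .Annotations.summary }} ({{ .DashboardURL }})
{{ end }}
{{ end }}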

That means in the alert rule you only need to query the number of HTTP 500 errors per endpoint over the last 5 minutes. In Prometheus, that could look something like this:

sum by (endpoint) (increase(http_request_failed_total{status="500"}[5m])) > 50

And then you would have one alert per endpoint that is exceeding 50 HTTP 500 errors in the last 5 minutes.

You would then group these alerts together in your notification policy, and that is where notification templates come in.
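For example, if the notification policy groups these alerts by alertname, a message template over the whole group could look something like this (the names are illustrative):

{{ define "endpoint.500s.message" }}
{{ len .Alerts.Firing }} endpoint(s) are over the HTTP 500 threshold:
{{ range .Alerts.Firing }}
- {{ .Labels.endpoint }}
{{ end }}
{{ end }}

One notification then summarises every affected endpoint instead of sending a separate post per endpoint.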

What do you mean? Would it not be a single alert monitoring for 500 errors over a time period? I think we misunderstood each other somewhere.
Could you please help me trace back where you got the impression that this will create duplicate alerts?