Struggling with Alert Fatigue – Looking for Best Practices and Tool Recommendations

Hi everyone,
We’re currently facing alert fatigue in our monitoring setup. Too many alerts are firing—many of them are noisy or not actionable, and it’s becoming hard to identify the truly critical ones.

Our current stack:

  • Prometheus + Alertmanager
  • Grafana dashboards

We’ve also tried basic alert grouping and silencing in Alertmanager.

More recently, we started using Skedler to generate scheduled reports from Grafana dashboards. That shifts some of the focus to digest-style reporting and cuts down on noise, but the volume of real-time alerts is still a lot to handle.

I’m looking for suggestions on:

  1. Any tools or workflows that helped your team reduce alert noise
  2. How you report on alerts/metrics without overwhelming the team
  3. Any tips, playbooks, or resources you’ve found helpful

Thanks in advance

Hi, this might not solve your case, but here are some tips from my experience:

  • if an alert fires and nothing is actually broken, either increase the threshold or the pending period, or review whether the alert makes sense at all (there’s a small rule sketch right after this list)
  • leverage different alert severities - some alerts are not meant to be resolved right away, so they can go to a separate notification channel that most (if not all) people can keep silenced and look at whenever they get the chance. Another approach is to send notifications only during designated hours, or have people check for new alerts periodically or at set times.
  • you can use Grafana OnCall or a dedicated Grafana dashboard to see which alerts are currently active - again, not every alert has to be resolved right away, and some might not even need a notification; resolve them when you have the chance.
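To make the first two bullets concrete, here’s a minimal sketch of a Prometheus alerting rule with a raised threshold, a pending period (`for:`), and a `severity` label that Alertmanager can route on. The metric, threshold, and label values are all made up - tune them for your own setup:

```yaml
# rules/latency.yml (hypothetical file) - threshold, pending period, severity label
groups:
  - name: example-alerts
    rules:
      - alert: HighRequestLatency
        # Fire only if p95 latency over the last 5 minutes is above 500ms...
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        # ...and stays there for 10 minutes (the pending period), so short blips never notify.
        for: 10m
        labels:
          # Non-critical severity - Alertmanager can send this to a muted channel.
          severity: warning
        annotations:
          summary: "p95 request latency above 500ms for 10 minutes"
```

Raising the `for:` duration is usually the cheapest way to quiet a flapping alert without losing it entirely.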

I think the most important thing is to review the alerts that fire most often - maybe they are too sensitive for your case?
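One way to find them, if Prometheus is doing the alert evaluation: it exposes a built-in `ALERTS` series for every firing alert, so you can roughly rank rules by how long they have been firing over the last week. A sketch to run in Grafana Explore or the Prometheus UI:

```promql
# Roughly rank alert rules by total firing time over the last 7 days
# (this counts evaluation samples, so it is a relative ranking, not an exact incident count).
sort_desc(sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d])))
```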

This is an important topic indeed. A company once told me that too many false or unimportant alerts at night were the number one reason why people on their team quit their jobs.

From my experience, there are several aspects to this.

1. Get critical alerts
I have heard of people who place their phones close to the shower so they do not miss an alert, or who can hardly sleep because they are afraid of missing the “beep”.
Make sure that critical alerts reach the responsible people (and only them). Make sure these alerts are loud and clear, e.g. by using different notification channels such as app push, SMS, and voice calls. If an alert is not acknowledged in time, escalate automatically.
Use on-call schedules to automatically route alerts to the right people at the right time.
My favorite tip is to use a fitness band to wake you softly at night, and if you miss that gentle alarm, to configure a loud phone call as a fallback - but that wakes up the whole family.
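For the Prometheus + Alertmanager stack from the original post, the routing part of this is usually done with severity-based routes in Alertmanager. A minimal sketch below, with made-up receiver names - note that Alertmanager itself only routes, groups, and repeats; acknowledgement, escalation chains, and on-call schedules live in whatever the paging receiver points at (Grafana OnCall, PagerDuty, Opsgenie, etc.):

```yaml
# alertmanager.yml (sketch) - route on the severity label set by the alert rules
route:
  receiver: team-slack            # default: everything else goes to the normal channel
  group_by: [alertname, service]
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: oncall-pager      # hypothetical receiver that actually pages someone
      group_wait: 30s
      repeat_interval: 1h         # keep re-notifying until the alert is handled
    - matchers:
        - 'severity="info"'
      receiver: low-prio-digest   # muted channel people check at designated hours
      repeat_interval: 24h

receivers:
  - name: team-slack
    slack_configs:
      - channel: "#alerts"        # slack_api_url under global: omitted here
  - name: oncall-pager
    webhook_configs:
      - url: "https://example.com/oncall-webhook"   # e.g. a Grafana OnCall integration URL
  - name: low-prio-digest
    slack_configs:
      - channel: "#alerts-low-priority"
```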

2. Reduce false and unimportant alerts
Wake the on-call engineer up at night if it is critical, but let them sleep otherwise.
Use smart filtering to cut through the noise. If an issue is only temporary, e.g. resolved by auto-recovery or caused by a flickering sensor, delay the alert and only fire the notification if the issue is still present after a certain period of time.
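On the Prometheus/Alertmanager side this maps to two knobs: the `for:` pending period on the rule (as in the sketch further up) holds back anything that auto-recovers within a few minutes, and inhibition rules stop lower-severity alerts from notifying while a related critical alert is already firing. A sketch of the latter, assuming your rules carry `severity`, `service`, and `instance` labels:

```yaml
# alertmanager.yml (sketch, continued) - suppress warnings while a related critical alert fires
inhibit_rules:
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    # Only inhibit when both alerts carry the same service and instance labels.
    equal: [service, instance]
```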

3. Get all relevant information
Alert messages like “Email not working” or “Error X439217” at night can be frustrating. Provide all the information an engineer needs to resolve the issue: clear and concise incident details, plus additional context such as colors to indicate severity or priority, icons, images, and links to documentation or knowledge-base articles. Ideally, you also explain how to resolve the issue, or even offer a remote action (e.g. restarting a server) directly from the phone.
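With Prometheus rules, most of that context can travel with the alert itself via labels and annotations, which Alertmanager then passes into the notification templates. A sketch in the same style as the rule above - the URLs and label names are placeholders:

```yaml
# Same style of rule entry as above, now carrying the context an engineer needs at 3 a.m.
- alert: HighRequestLatency
  expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "p95 latency above 500ms on {{ $labels.service }}"
    description: "Current p95: {{ $value | humanize }}s for service {{ $labels.service }}."
    runbook_url: "https://wiki.example.com/runbooks/high-latency"   # placeholder
    dashboard: "https://grafana.example.com/d/abc123/latency"       # placeholder
```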

4. Offer help
On-call engineers are the heroes who keep your business running - often alone at night. Offer them help, e.g. experts they can call and clear procedures for really critical situations.

All of this is a process, and it is teamwork. For example, fine-tuning the filtering and defining the procedures required some effort - but it is worth it. Working on it as a team and acknowledging the problem is already a great way to show respect and to help the on-call engineers.

There is a helpful article available here: