Hi team,
I'm using Grafana with Prometheus.
Scenario: monitoring server resources (CPU, RAM, disk space, and instance up/down status).
Issues I'm facing:
- Sometimes the "Health" status of an alert rule shows "NoData". Why does this happen? Is the alert rule incorrect, or is it something else?
- Could you suggest an alert message template that includes the alert message, the instance IP, and the job name, so the notifications are easy to understand for my scenario?
- Is there anything I need to change in the alert rules or the alert message template?
- Even after changing an alert rule's summary and description, the notifications still show the old values, and the summary/description render "<no value>", as below:
[[FIRING x 3 | ] || DatasourceNoData](http://grafana.staged-by-discourse.com/alerting/list)
**Summary**: High RAM usage on <no value> (<no value>)
**Description**: RAM usage on <no value> (<no value>) is above 90%. ( For the last 1 Minute)
**Instance Details**: ``
**Job Details**: ``
Labels:
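For context, the summary and description annotations on the RAM rule are set up roughly like this (reconstructed from the rendered message above, so the exact wording may differ; `<no value>` appears wherever the `$labels` references fail to resolve):

```
Summary:     High RAM usage on {{ $labels.instance }} ({{ $labels.job }})
Description: RAM usage on {{ $labels.instance }} ({{ $labels.job }}) is above 90%. (For the last 1 minute)
```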
I have used the following alert rules:
- CPU
(A)
100 - (avg by (instance, job) (irate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
(B)
Reduce
Function - Last
Input - A
Mode - Strict
(C)
Threshold
Input - B
IS ABOVE - 70
- Disk Space
(A)
100 - (avg(node_filesystem_avail_bytes{job=~".+", instance=~".+"}) by (instance) / avg(node_filesystem_size_bytes{job=~".+", instance=~".+"}) by (instance)) * 100 > 80
(B)
Reduce
Function - Last
Input - A
Mode - Strict
(C)
Threshold
Input - B
IS ABOVE - 80
- RAM
(A)
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 90
(B)
Reduce
Function - Last
Input - A
Mode - Strict
(C)
Threshold
Input - B
IS ABOVE - 90
- Instance Down status
(A)
sum_over_time(up{job=~".+", instance=~".+"}[10s]) < count(up{job=~".+", instance=~".+"})
(B)
Reduce
Function - Last
Input - A
Mode - Strict
(C)
Threshold
Input - B
IS ABOVE - 0
For all four rules above, the Alert Condition is set to (C).
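To make the A/B/C pipeline concrete, this is roughly how I understand the data flowing through the RAM rule (the instance and job values below are only illustrative):

```
# A (PromQL query) - one value per series:
#   {instance="10.0.0.1:9100", job="node"}  ->  93.4
# B (Reduce: Last, Strict) - still one value per series:
#   {instance="10.0.0.1:9100", job="node"}  ->  93.4
# C (Threshold: B IS ABOVE 90) - 1 = firing, 0 = normal:
#   {instance="10.0.0.1:9100", job="node"}  ->  1
```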
The notification channel is set to Slack.
This is the alert message template I'm using:
{{ define "alert_severity_prefix_emoji" -}}
{{- if ne .Status "firing" -}}
[OK]
{{- else if eq .CommonLabels.severity "critical" -}}
[CRITICAL]
{{- else if eq .CommonLabels.severity "warning" -}}
[WARNING]
{{- end -}}
{{- end -}}
{{ define "slack.title" -}}
{{ template "alert_severity_prefix_emoji" . }}
[{{- .Status | toUpper -}}{{- if eq .Status "firing" }} x {{ .Alerts.Firing | len -}}{{- end }} | {{ .CommonLabels.env | toUpper -}} ] || {{ .CommonLabels.alertname -}}
{{- end -}}
{{- define "slack.text" -}}
{{- range .Alerts -}}
{{ if gt (len .Annotations) 0 }}
*Summary*: {{ .Annotations.summary }}
*Description*: {{ .Annotations.description }}
*Instance Details*: `{{ .Labels.instance }}`
*Job Details*: `{{ .Labels.job }}`
Labels:
{{ range .Labels.SortedPairs }}{{ if or (eq .Name "env") (eq .Name "instance") (eq .Name "job") }}• {{ .Name }}: `{{ .Value }}`
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{ end }}
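For completeness, I reference these templates in the Slack notification settings roughly like this (the Title and Text Body fields under the optional Slack settings; exact field names may differ between Grafana versions):

```
Title:     {{ template "slack.title" . }}
Text Body: {{ template "slack.text" . }}
```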