Hi All,
We have this monitoring alert, which fires every 3 hours:
[FIRING:2] InstanceDown (: opsgenie)
Alert: Instance :9113 - web-xyz down
Description: :9113 - web-xyz of job nginx-exporter-stage has been down for more than 1 minutes.
Details:
alertname: InstanceDown
Environment: stage
instance: :9113
instance_id: i-0a2s3d4fg56yhj78k9
instance_name: web-xyz
job: nginx-exporter-stage
prometheus_host: .ec2.internal
severity: opsgenie
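For reference, the alerting rule itself is not pasted here. Assuming it follows the common `InstanceDown` pattern that the alert text suggests (fire once `up` has been 0 for 1 minute), a sketch of such a rule could look like the one below. The group name is hypothetical. Note that `up` is 0 whenever Prometheus fails to scrape the target, so a rule like this fires when the exporter endpoint is unreachable, even if the EC2 host itself is running.

```yaml
# Hypothetical reconstruction of the InstanceDown rule; the real rule file is not shown in this post.
groups:
  - name: instance-alerts  # hypothetical group name
    rules:
      - alert: InstanceDown
        # up is 1 when the last scrape of the target succeeded and 0 when it failed
        expr: up == 0
        # matches "has been down for more than 1 minutes" in the alert text
        for: 1m
        labels:
          severity: opsgenie
        annotations:
          summary: "Instance {{ $labels.instance }} - {{ $labels.instance_name }} down"
          description: "{{ $labels.instance }} - {{ $labels.instance_name }} of job {{ $labels.job }} has been down for more than 1 minutes."
```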
When I check (query) it in Grafana, the result shows it has been down for a long time now:
up{job="node-exporters"} == 1
The thing is, the instance never went down.
It has been up and running the whole time, but Grafana (Prometheus) shows otherwise.
Port 9113 is never actually open (9100 is open; that's the port Prometheus uses, right?)
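One thing worth noting: the alert comes from job `nginx-exporter-stage`, which scrapes port 9113 (the default listen port of the NGINX Prometheus exporter), while the Grafana query above checks job `node-exporters`; port 9100 is the default for node_exporter. If nothing is listening on 9113, every scrape of the NGINX job fails and its `up` series is 0, regardless of what the node exporter job reports. A minimal sketch of keeping the two exporters in separate jobs, each pointed at its own port, might look like this. Only the us-east-1 staging filter is shown, and the `node-exporters` job definition is an assumption, since it is not part of the posted snippet.

```yaml
scrape_configs:
  # node_exporter job; its definition is assumed here, since it is not in the posted snippet
  - job_name: node-exporters
    ec2_sd_configs:
      - region: us-east-1
        port: 9100  # node_exporter default port
        filters:
          - name: tag:Role
            values:
              - web-staging-ec2
  # NGINX exporter job, as in the posted snippet (other regions omitted)
  - job_name: nginx-exporter-stage
    ec2_sd_configs:
      - region: us-east-1
        port: 9113  # nginx-prometheus-exporter default port; something must actually listen here
        filters:
          - name: tag:Role
            values:
              - web-staging-ec2
```

With this split, a host that answers on 9100 but not on 9113 would show `up == 1` under `node-exporters` and `up == 0` under `nginx-exporter-stage`.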
- What Grafana version and what operating system are you using?
Grafana v8.0.3 (cae5c5e46b), Ubuntu
- What are you trying to achieve?
Figure out a fix for this.
- How are you trying to achieve it?
I'm checking whether something is wrong with prometheus.yml.
- What happened?
Grafana/Prometheus is throwing false alerts frequently.
- What did you expect to happen?
We expect to receive this alert only when a server is actually down.
- Can you copy/paste the configuration(s) that you are having problems with?
snippet:
###########
- ec2_sd_configs:
  - filters:
    - name: tag:Role
      values:
      - web-staging-ec2
    port: 9113
    region: us-east-1
  - filters:
    - name: tag:Role
      values:
      - web-prod
    port: 9113
    region: us-west-2
  - filters:
    - name: tag:Role
      values:
      - web-prod
    port: 9113
    region: ca-central-1
  - filters:
    - name: tag:Role
      values:
      - web-prod
    port: 9113
    region: ap-southeast-2
  - filters:
    - name: tag:Role
      values:
      - web-prod
    port: 9113
    region: eu-west-1
  job_name: nginx-exporter-stage
  relabel_configs:
  - action: keep
    regex: .*
    source_labels:
    - __meta_ec2_tag_Name
  - source_labels:
    - __meta_ec2_tag_Name
    target_label: instance_name
  - source_labels:
    - __meta_ec2_instance_id
    target_label: instance_id
  - source_labels:
    - __meta_ec2_tag_Environment
    target_label: Environment
###############
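One detail in the snippet: the `keep` rule uses `regex: .*` on `__meta_ec2_tag_Name`, and since Prometheus relabel regexes are fully anchored, `.*` matches any value (including an empty one), so every discovered target is kept. If the goal were to scrape only instances that are currently running (an assumption; the post does not say stopped instances are involved), a sketch of a more selective rule could use the `__meta_ec2_instance_state` label that EC2 service discovery provides:

```yaml
relabel_configs:
  # Keep only targets whose EC2 instance state is exactly "running"; other states are dropped.
  - source_labels:
      - __meta_ec2_instance_state
    regex: running
    action: keep
  # Copy the Name tag into instance_name, as in the original snippet.
  - source_labels:
      - __meta_ec2_tag_Name
    target_label: instance_name
```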
- Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
No UI errors.
- Did you follow any online instructions? If so, what is the URL?
No.
Any tips on troubleshooting or finding a resolution?