Hello!
I am monitoring the status (up or down) of my services using the default up metric. I understand the value of this metric is 1 if the last scrape was successful, else it is 0. My metrics are available at my-domain.com/service-name/prometheus, which is only down when the whole service is down.
I have the following queries:
up{instance="my-dev-domain.com", service_name=~".*service"}
up{instance="my-prod-domain.com", service_name=~".*service"}
The names of the services all end in service.
Problem
When I bring down any service in the dev environment, the second query also returns 0 for that service. This is an issue because it triggers alerting and a change in my Grafana visuals, even though the service was never down in prod to begin with (I would have noticed that). I don’t understand why Prometheus evaluates the second query to be 0, when the instance label is clearly different.
What I expected to happen
The second query should be independent from the first one and not return 0 for a given service when the first one does.
What I tried
I did a sanity check of the DNS records of the two domains and they are set up correctly. So Prometheus should be scraping different data for the two environments. I also tried adding a new label, environment=dev or environment=prod to each target, hoping that would create less ambiguity between queries (even though instance was already there), but it didn’t make a difference. It seems to me that Prometheus is mistaking the two jobs for each other somehow.
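For reference, the extra label was attached per target group, roughly like the following. This is a sketch of the approach using the standard labels field of static_configs, shown for the dev job only; the prod job would use environment: prod.

```yaml
# Sketch: attaching an environment label to every target in the group.
# Only the labels block is new compared to the config further down.
static_configs:
  - targets:
      - name-of-my-first-service
      - name-of-my-second-service
    labels:
      environment: dev
```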
Configuration
I deploy Prometheus to my Kubernetes cluster using Helm. Here’s a snippet from my config, which uses a bit of relabeling magic to keep the config DRY; otherwise I would have to repeat the targets for each service:
prometheus.yml:
scrape_configs:
  - job_name: dev-services
    scheme: https
    scrape_interval: 15s
    basic_auth:
      username_file: /etc/prometheus/secrets/basicauth/username
      password_file: /etc/prometheus/secrets/basicauth/password
    static_configs:
      - targets:
          - name-of-my-first-service
          - name-of-my-second-service
    relabel_configs:
      - source_labels: [ __address__ ]
        target_label: service_name
      - source_labels: [ __address__ ]
        target_label: __metrics_path__
        replacement: /service/$1/prometheus
      - target_label: __address__
        replacement: my-dev-domain.com
  - job_name: prod-services
    scheme: https
    scrape_interval: 15s
    basic_auth:
      username_file: /etc/prometheus/secrets/basicauth/username
      password_file: /etc/prometheus/secrets/basicauth/password
    static_configs:
      - targets:
          - name-of-my-first-service
          - name-of-my-second-service
    relabel_configs:
      - source_labels: [ __address__ ]
        target_label: service_name
      - source_labels: [ __address__ ]
        target_label: __metrics_path__
        replacement: /service/$1/prometheus
      - target_label: __address__
        replacement: my-prod-domain.com
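To spell out what I expect the relabeling to do, here is my understanding of the final labels for the first target of the dev job. I worked this out by hand from the rules above, so treat it as a sketch rather than output copied from Prometheus:

```yaml
# Hand-derived labels for target "name-of-my-first-service" in job "dev-services"
# (illustration only, not real Prometheus output).
__address__: my-dev-domain.com        # overwritten by the last relabel rule
__metrics_path__: /service/name-of-my-first-service/prometheus  # $1 is the original __address__
job: dev-services
service_name: name-of-my-first-service  # copied from the original __address__
instance: my-dev-domain.com             # defaults to __address__ after relabeling
```

If that is right, the scrape URL for this target is https://my-dev-domain.com/service/name-of-my-first-service/prometheus, and the matching prod target should differ only in its job, instance, and host.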
I am running Prometheus 3.5.0. Thanks, and let me know if something’s unclear. I’m a bit in the dark here because I don’t understand what’s happening, so I hope I provided enough details.