DatasourceNoData on Prometheus datasource

Hello guys! Maybe somebody can help me out with this.

I have a VM running Grafana and Prometheus (both on the same machine) and I keep on getting DatasourceNoData throughout the day.

The alert triggers and then recovers a minute later, and I have no idea what is causing it.

My alert condition is quite simple, nothing fancy or complicated.

This is the query that determines when to trigger the alert:
aspnetcore_healthcheck_status{job="hangfire-prd", name="Process"}

Prometheus is scraping an ASP.NET Core application and retrieving this health-check metric every 5 seconds.
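
Roughly, the scrape job in prometheus.yml looks like this (the target address below is just a placeholder; only the job name and the 5-second interval match my actual setup):

# prometheus.yml (sketch) - target address is a placeholder
scrape_configs:
  - job_name: "hangfire-prd"
    scrape_interval: 5s                    # one sample every 5 seconds
    static_configs:
      - targets: ["localhost:5000"]        # placeholder for the ASP.NET Core app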

The system route is:
Grafana → Prometheus → dotnet core app

The network was checked, and everything seems to be fine.

Would somebody happen to know what I can do in order to figure out what is causing the DatasourceNoData issue?

I’m using Grafana version 10.3.1

Thanks in advance!

Hi,

What’s the group interval on your alert rule? Can you share the config of your alert: the query and expressions, the group interval and for settings, and also the settings for no data and error handling?

Prometheus mentions in its docs that it’s a distributed system and latency is unavoidable. Therefore, if your alert is too restrictive, the data might not be available at the moment it is evaluated. The easy workaround would be to set Alert state if no data or all values are null to Keep Last State / Normal / Alerting (whatever suits your case best) instead of the default No Data, which fires an alert and a notification the moment the first No Data appears.

Thanks for the quick reply!

Well, I changed the alert’s “For” value from 1 minute to 5 minutes, hoping that the alert will only trigger if the problem lasts for 5 minutes. Hopefully this will do the trick.

What do you think?

If not, I’ll post all the configuration here for you.

Thanks once again!

If the setting Alert state if no data or all values are null is still set to No Data, it doesn’t really matter how the for setting is configured: the alert will still fire (you’d notice that the notification has No Data somewhere in its title).

Well, you’re right, it indeed triggered anyway.

Here is the export of the rule:

apiVersion: 1
groups:
    - orgId: 1
      name: PRD
      folder: Hangfire Alerts
      interval: 1m
      rules:
        - uid: e048fa82-9feb-4bb6-bb5c-1ce7f770cbd1
          title: Hangfire Offline (PRD)
          condition: C
          data:
            - refId: A
              relativeTimeRange:
                from: 60
                to: 0
              datasourceUid: c8297b88-26b8-4e2f-9ab9-df3579c75a35
              model:
                disableTextWrap: false
                editorMode: builder
                expr: aspnetcore_healthcheck_status{job="hangfire-prd", name="Process"}
                fullMetaSearch: false
                includeNullMetadata: true
                instant: true
                intervalMs: 30000
                legendFormat: __auto
                maxDataPoints: 43200
                range: false
                refId: A
                useBackend: false
            - refId: B
              relativeTimeRange:
                from: 60
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params: []
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - B
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: A
                intervalMs: 1000
                maxDataPoints: 43200
                reducer: last
                refId: B
                type: reduce
            - refId: C
              relativeTimeRange:
                from: 60
                to: 0
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 1
                        type: lt
                      operator:
                        type: and
                      query:
                        params:
                            - C
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: B
                intervalMs: 1000
                maxDataPoints: 43200
                refId: C
                type: threshold
          noDataState: NoData
          execErrState: Error
          for: 5m
          annotations: {}
          labels:
            env_to_notify: prd
          isPaused: false

The rule uses an instant query. Prometheus returns the most recent point within the interval now - lookback_period; this is configured in Prometheus via --query.lookback-delta, which is 5 minutes by default.
If the 5-second scrape interval is correct, there should be more than enough points in that interval. Try the metric in Explore and see whether points are returned.
To mitigate the problem until you figure out what is happening, try using Keep Last State. I do not remember when it was re-introduced, so perhaps you need to upgrade to use it. Otherwise, you can map No Data to Normal.
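
For example, in the exported rule above, mapping No Data to Normal is a one-line change (Keep Last State, if your Grafana version supports it, is set through the same field):

          # sketch of the tail of the same rule - only noDataState changes;
          # OK sends the instance to Normal when the query returns no data,
          # while the threshold condition (B < 1) still fires when the metric is present and 0
          noDataState: OK
          execErrState: Error
          for: 5m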

Thank you so much for your reply.

But by doing that I wouldn’t know if the service behind the scenes stopped or crashed, correct?

If so, that’s exactly what I’m trying to alert on. If the service stops communicating for any reason, the alert needs to be triggered.

Also, when you say “check the points”, what exactly should I do?

Thanks in advance!

Run the query in Range mode over the range from X - 1m to X, where X is the timestamp when the NoData alert instance was created. The timestamp can be taken from the logs or from the Rule history.
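
A complementary check in Explore: with the query type set to Instant and the time range ending at X, a range selector shows the raw scraped samples from that last minute (with a 5-second scrape you should see around 12 points):

# raw samples from the final 60 seconds of the selected time range
aspnetcore_healthcheck_status{job="hangfire-prd", name="Process"}[1m]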