Grafana "could not find data source" for Prometheus

Hi there, I’m running Grafana 4.3.1 inside a Docker container on an AWS instance. It connects to a Prometheus instance using a ‘proxy’ connection. I have set up alerts, and I can often see them firing with the error message “Could not find datasource”.

However, if I go to the command line and repeatedly fire curl requests against the Prometheus endpoint, I never have any problems, so I’m not sure what the issue is here. I also noticed that I cannot connect to the same backend using a ‘direct’ connection. Would that have any bearing on this? Do alerts not work with a ‘direct’ connection type?
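For reference, the check I’m running is along these lines (just a sketch; the hostname and port are placeholders for our actual Prometheus endpoint, and the path is the standard Prometheus query API), and it always comes back successfully:

curl -s 'http://<prometheus-host>:9090/api/v1/query?query=up'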

The errors I see in the logs look like the following:
{"log":"t=2017-09-22T13:31:16+0000 lvl=dbug msg="Job Execution completed" logger=alerting.engine timeMs=72.716 alertId=1207 name="Total CPU Limits vs Requests alert" firing=false\n","stream":"stdout","time":"2017-09-22T13:31:16.197509513Z"}
{"log":"t=2017-09-22T13:32:08+0000 lvl=dbug msg="Scheduler: Putting job on to exec queue" logger=alerting.scheduler name="Total CPU Limits vs Requests alert" id=1207\n","stream":"stdout","time":"2017-09-22T13:32:08.123768657Z"}
{"log":"t=2017-09-22T13:32:08+0000 lvl=dbug msg="Scheduler: Putting job on to exec queue" logger=alerting.scheduler name="Total CPU Limits vs Requests alert" id=1206\n","stream":"stdout","time":"2017-09-22T13:32:08.124005873Z"}
{"log":"t=2017-09-22T13:32:08+0000 lvl=eror msg="Alert Rule Result Error" logger=alerting.evalHandler ruleId=1207 name="Total CPU Limits vs Requests alert" error="Could not find datasource" changing state to=alerting\n","stream":"stdout","time":"2017-09-22T13:32:08.153675027Z"}
{"log":"t=2017-09-22T13:32:08+0000 lvl=eror msg="Alert Rule Result Error" logger=alerting.evalHandler ruleId=1206 name="Total CPU Limits vs Requests alert" error="Could not find datasource" changing state to=alerting\n","stream":"stdout","time":"2017-09-22T13:32:08.157802517Z"}
{"log":"t=2017-09-22T13:32:09+0000 lvl=dbug msg="Job Execution completed" logger=alerting.engine timeMs=27.401 alertId=1207 name="Total CPU Limits vs Requests alert" firing=true\n","stream":"stdout","time":"2017-09-22T13:32:09.617459283Z"}

Any ideas on what the issue might be would be appreciated, thanks.

Make sure your panel is not using the mixed data source and is using Prometheus directly. Also, yes, you should use proxy mode; the Grafana server needs to be able to access Prometheus.
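For example (just a sketch; adjust the Grafana URL, credentials, data source name, and Prometheus URL to your setup), a proxy-mode Prometheus data source can be created through the Grafana HTTP API like this:

curl -s -u admin:admin -X POST -H 'Content-Type: application/json' \
  http://localhost:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus:9090","access":"proxy"}'

The relevant field is "access":"proxy", which routes queries through grafana-server rather than the browser.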

There are multiple Prometheus data sources defined in that Grafana installation, as well as a Graphite data source. Each panel with one of these alerts uses a single Prometheus data source; however, the dashboard itself has multiple panels, some using Graphite and others Prometheus. Could this be a source of the problem?

Also, at the moment, I’m using a ‘direct’ connection and my alerts DO work, so I’m not sure why I need to use a proxy. Is there more documentation on why it needs to be that way? Would a ‘direct’ connection cause ‘datasource not found’ errors, and why?

Direct will work as long as Prometheus is accessible from both the browser and the grafana-server.
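A quick way to verify the grafana-server side (assuming the Grafana container is named grafana, curl is available in the image, and Prometheus is reachable at that URL from inside it) is to query Prometheus from within the container:

docker exec grafana curl -s 'http://prometheus:9090/api/v1/query?query=up'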

For the alert panel that is causing the error, can you show the panel JSON?

I’m seeing the same error with Grafana 4.6.0.

It might be an unstable SQL connection that causes this error.
MySQL is currently in another data center.

Panel JSON
{
  "alert": {
    "conditions": [
      {
        "evaluator": {
          "params": [
            1
          ],
          "type": "gt"
        },
        "operator": {
          "type": "and"
        },
        "query": {
          "params": [
            "A",
            "10s",
            "now"
          ]
        },
        "reducer": {
          "params": [],
          "type": "avg"
        },
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "frequency": "60s",
    "handler": 1,
    "message": "A rancher service is unhealty.",
    "name": "Service Health",
    "noDataState": "no_data",
    "notifications": [
      {
        "id": 1
      }
    ]
  },
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "datasource": "Prometheus Services",
  "decimals": null,
  "fill": 5,
  "id": 21,
  "legend": {
    "alignAsTable": true,
    "avg": true,
    "current": true,
    "hideEmpty": false,
    "hideZero": false,
    "max": false,
    "min": false,
    "rightSide": true,
    "show": true,
    "total": false,
    "values": true
  },
  "lines": true,
  "linewidth": 1,
  "links": [],
  "nullPointMode": "null",
  "percentage": false,
  "pointradius": 5,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [],
  "spaceLength": 10,
  "span": 8,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "expr": "sum by (stack_name, service_name) (rancher_service_health_status{health_state=\"unhealthy\"})",
      "format": "time_series",
      "intervalFactor": 2,
      "legendFormat": "{{stack_name}}/{{service_name}}",
      "refId": "A",
      "step": 30
    }
  ],
  "thresholds": [
    {
      "value": 1,
      "op": "gt",
      "fill": true,
      "line": true,
      "colorMode": "critical"
    }
  ],
  "timeFrom": null,
  "timeShift": null,
  "title": "Service Health",
  "tooltip": {
    "shared": true,
    "sort": 0,
    "value_type": "individual"
  },
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  },
  "yaxes": [
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": "0",
      "show": true
    },
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    }
  ]
}