Problem with SQL server monitoring dashboard

Hi, I have a problem with SQL server monitoring dashboard. I’ve set everything up, but can’t seem to visualize the data. It states n/a in every graph. I’m new to grafana and this probably is a configuration issue. I checked that I get a response from prometheus.

The dashboard: https://grafana.com/dashboards/1883

I’m using the following stack:

  • Telegraf for Microsoft SQL metrics
  • Prometheus-client plugin
  • Prometheus

I have the following configurations for telegraf, they work but I think this might need change to make it working.

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_buffer_limit = 1000
  flush_buffer_when_full = true
  collection_jitter = "0s"

  flush_interval = "5s"
  flush_jitter = "0s"
  debug = false
  quiet = false


###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.prometheus_client]]
  # Address to listen on
  listen = ":9273"

  # Path to publish the metrics on, defaults to /metrics
  path = "/metrics"   

  # Expiration interval for each metric. 0 == no expiration
  expiration_interval = "60s"

###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Windows Performance Counters plugin.
# These are the recommended method of monitoring system metrics on windows,
# as the regular system plugins (inputs.cpu, inputs.mem, etc.) rely on WMI,
# which utilize more system resources.
#
# See more configuration examples at:
#   https://github.com/influxdata/telegraf/tree/master/plugins/inputs/win_perf_counters

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
        # Processor usage, alternative to native, reports on a per core.
        ObjectName = "Processor"
        Instances = ["*"]
        Counters = [
          "% Idle Time",
          "% Interrupt Time",
          "% Privileged Time",
          "% User Time",
          "% Processor Time",
          "% DPC Time",
        ]
        Measurement = "win_cpu"
        # Set to true to include _Total instance when querying for all (*).
        IncludeTotal=true

  [[inputs.win_perf_counters.object]]
        # Disk times and queues
        ObjectName = "LogicalDisk"
        Instances = ["*"]
        Counters = [
          "% Idle Time",
          "% Disk Time",
          "% Disk Read Time",
          "% Disk Write Time",
          "Current Disk Queue Length",
          "% Free Space",
          "Free Megabytes",
        ]
        Measurement = "win_disk"
        # Set to true to include _Total instance when querying for all (*).
        #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
        ObjectName = "PhysicalDisk"
        Instances = ["*"]
        Counters = [
          "Disk Read Bytes/sec",
          "Disk Write Bytes/sec",
          "Current Disk Queue Length",
          "Disk Reads/sec",
          "Disk Writes/sec",
          "% Disk Time",
          "% Disk Read Time",
          "% Disk Write Time",
        ]
        Measurement = "win_diskio"

  [[inputs.win_perf_counters.object]]
        ObjectName = "Network Interface"
        Instances = ["*"]
        Counters = [
          "Bytes Received/sec",
          "Bytes Sent/sec",
          "Packets Received/sec",
          "Packets Sent/sec",
          "Packets Received Discarded",
          "Packets Outbound Discarded",
          "Packets Received Errors",
          "Packets Outbound Errors",
        ]
        Measurement = "win_net"

  [[inputs.win_perf_counters.object]]
        ObjectName = "System"
        Counters = [
          "Context Switches/sec",
          "System Calls/sec",
          "Processor Queue Length",
          "System Up Time",
        ]
        Instances = ["------"]
        Measurement = "win_system"
        # Set to true to include _Total instance when querying for all (*).
        #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
        # Example query where the Instance portion must be removed to get data back,
        # such as from the Memory object.
        ObjectName = "Memory"
        Counters = [
          "Available Bytes",
          "Cache Faults/sec",
          "Demand Zero Faults/sec",
          "Page Faults/sec",
          "Pages/sec",
          "Transition Faults/sec",
          "Pool Nonpaged Bytes",
          "Pool Paged Bytes",
          "Standby Cache Reserve Bytes",
          "Standby Cache Normal Priority Bytes",
          "Standby Cache Core Bytes",

        ]
        # Use 6 x - to remove the Instance bit from the query.
        Instances = ["------"]
        Measurement = "win_mem"
        # Set to true to include _Total instance when querying for all (*).
        #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
        # Example query where the Instance portion must be removed to get data back,
        # such as from the Paging File object.
        ObjectName = "Paging File"
        Counters = [
          "% Usage",
        ]
        Instances = ["_Total"]
        Measurement = "win_swap"

  [[inputs.win_perf_counters.object]]
        ObjectName = "Network Interface"
        Instances = ["*"]
        Counters = [
          "Bytes Sent/sec",
          "Bytes Received/sec",
          "Packets Sent/sec",
          "Packets Received/sec",
          "Packets Received Discarded",
          "Packets Received Errors",
          "Packets Outbound Discarded",
          "Packets Outbound Errors",
        ]




# Read metrics from Microsoft SQL Server
[[inputs.sqlserver]]
  # Specify instances to monitor with a list of connection strings.
  # All connection parameters are optional. 
  # By default, the host is localhost, listening on default port (TCP/1433) 
  #    for Windows, the user is the currently running AD user (SSO).
  #    See https://github.com/denisenkom/go-mssqldb for detailed connection parameters.

  servers = [
        "Server=10.xxx.x.6;Port=1433;User Id=xxx;Password=xxxx;app name=telegraf;log=1;"
        ]

I scrape everything with prometheus with this config:

    # my global config
global:
  scrape_interval:     5s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  #- "resources.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: prometheus

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["worker:9090"]

  - job_name: node

    static_configs:
      - targets: ["worker:9100","esb:9100","plus:9100"]

  - job_name: 'win-exporter'

    static_configs:
      - targets: ['sql:9182']

  - job_name: 'telegraf'

    static_configs:
      - targets: ['sql:9273']

I have working dashboards connected to this prometheus instance. Logs are clear, prometheus can connect to the client and is receiving data. But the dashboard looks like this:

What can I do to fix this issue?

Hi,/m

I’m a bit unsure of your prometheus scrape config. If you naviate to http://sql:9273/metrics, do you see any metrics output there and are there any sql server metrics?

If you navigate to http://<prometheus url>/targets, usually http://localhost:9090/targets, can you include a screenshot of what you see?

Are your Default datasource your configured prometheus datasource?

Maybe telegraf have changed something lately. Haven’t used this dashboard for a while.

Marcus

Hi, thx alot for your help!

I’m behind a esb so had to work around with a curl command: but this is the output. While you look at this, I’ll get the metrics for you as well.

The datasource is also working:

Metrics are to big to include so a screenshot in stead. I can see sql metrics like the Active_cursors___TSQL_Local_Cursor, …

EDIT:
Alse would like to add, that when I select the “production” datasource, the result is the same. Just forgot to switch that one when I took the screenshot in my initial question.

Hi,

Thanks.

Please make sure that after you’ve selected the production datasource, save the dashboard. Then hard reload in the browser (ctrl+r in chrome). Just to make sure there are no cached stuff.

Then go to settings, template variables (variables in Grafana 5.0) and click on the $instance variable. Make sure that your production datasource are selected in the Data source field. Click on Update button. Save the dashboard and hard reload. Instance drop down populated now?

As a side note, I can see that you haven’t specified any scrape interval in edit data source which means that 15s are used per default. However if I understand your prometheus config correctly you’re using a scrape interval of 5 seconds. These two settings must correlate and will help you from getting weird results in Grafana.

Marcus

Sorrh that doesnt change anything. The datasource was already correct. Just in case I repeated the process a couple of times but to no avail.

Marcus Efraimsson grafana@discoursemail.com schreef op 9 januari 2018 22:50:38 CET:

Oh right. Then I suggest that you change the value of the Query field from label_values(State_OFFLINE_total, servername) to some other metric. For example

label_values(Active_cursors___TSQL_Local_Cursor, servername) that you referenced above and that was included in the output from http://sql:9273/metrics. You can basically use whatever metric name here as long as it is a sql server metric.

If that doesn’t work I suggest you to head over to prometheus web ui and insert Active_cursors___TSQL_Local_Cursor or other sql server metric as expression just to verify that you can see that data have been stored in Prometheus. Would be nice if you can include a screenshot of the console tab with the result of your expression. I expect a label servername to be present.

Marcus

Thx a million for sharing this dashboard by the way. It is very nice and detailed! Love it!

Well, it’s actually based on this dashboard, I made some changes so that it works with prometheus as datasource.

I hope you saw my earlier answer at 11:53 PM.

Marcus

one reply didn’t get through I see. I still have a little problem:

Wow thx! Starting to look good, instance is showing signs of life, most of the graphs are reporting. Thx for the tip about scraping. I’ll adjust that first thing in the morning. Could you please take a look at the first single stats. How come they are not reporting? For instance the State_OFFLINE_total is showing in the metrics with the servername field present.

Hi,

Cool :+1:

Not sure why those not are working. Can be related to scrape interval settings. Maybe something has changed with those in Telegraf since I built the dashboard?

Marcus

Yes it is strange. Seems it should work. I can see them in the output. I just adjusted the scrape intervals, but no change. This is a screenshot of the first query, perhaps you can see somsthing wrong with it?

It looks valid. If you click on the link to prometheus, see screenshot below:

image

How does the result look in prometheus ui? Maybe you need to change the expression to State_OFFLINE_total leaving out the {servername=~“^”} part.

Marcus

I can’t reach the prometheus ui because I’m behind an ESB. But when I change the query to this, it seems to work…

It is strange, if I first copy the exact string from the metrics, then it starts to work, then I can change it with the $instance variable. On first sight I don’t see any difference, when I just remove strange characters.

Okay.

So first you have

State_OFFLINE_total{servername=~"^$instance"}

Then you changed it to

State_OFFLINE_total{servername="$instance"}

And then it works?

I spot another difference and that is that the Data source field in your latest screenshot is set explicitly to Production, before it was a variable. You sure it wasn’t this change that made it work?

Marcus

I’m sure, I changed it back. But now I see what has changed. The metric name itself changed total to Total, that is why it could not find it, I think. What do the ~ and ^ do? Do I need it in my queries? Think I can change it back if need be.

What :smile:

Usually when you have template variable with possibility to select multiple values in a drop down you need to use prometheus regex =~"^$variable" for Grafana to be able to create a correct query using a regex that will match those multiple values. In this case you should be good with ="$variable" since you only can select one Instance at a time.

I’ll keep that in mind. Thx for the help. My issues are resolved. Realy love this dashboard. Nice of you to port it to prometheus and to help me figure this out. It realy helped my understanding of how everything works together.

1 Like

Relpace all “total” to “Total” in json, and Dashboard will work properly.