Using a metric as a boolean (solved)

In Grafana 9.1.6, I am troubleshooting someone else’s alert rule that is supposed to calculate mysql “seconds behind master” for replicas.

The original query is:

mysql_slave_status_master_server_id > 0 and ON (instance)
  (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay)
  > 30

I am testing the query in the Explore panel. The first half before the “and” is returning the value of the master server id (in my example, one set of servers is 101 and the other set of servers is 201). The second half is usually somewhere close to zero (-1 or -2 on lightly loaded servers, almost always 0 on heavily loaded servers). I read this query overall as “if a mysql server has a master, then calculate number of seconds lag”.

Unexpectedly, the graphed value of the above is a straight line of the first half of the query, before the “and”. The query ends up comparing the server_id to the limit of “30”. I expected it to use the first half as a boolean true/false, and then calculate/compare the second half to the limit. I now feel like this has probably never worked.

I have been experimenting, and I can get something closer to what I expected if I adjust/hack the formula like this:

clamp_max(mysql_slave_status_master_server_id > 0, 1) * ON (instance)
  (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay)
  > 30

The theory is that clamp_max(metric, 1) results in effectively converting “does it have a master server id defined” to true/false (1 or 0), which I then multiply times the actual computed values. It seems to be giving me the expected graphs.

Is this a ridiculous way to do it? Is there a better way to convert a metric like this to a boolean? (I googled a bunch but the overloading of the terms “metrics” “boolean” “convert”, etc resulted in a lot of blogs and documentation that didn’t seem to attack to what I am trying to solve … but I would be thrilled to be proven wrong.)

This query comes from the awesome prometheus rules collection:

I think I may have stumbled across something that works, though it’s purely by accident. I reasoned if it’s graphing the first half, what happens if I switch the order of the two? The query now looks like this:

( (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) and 
  ON (instance) mysql_slave_status_master_server_id > 0
) > 30

It’s resulting in data points that look accurate (screencap at end of message where I look at any value > 0).

Is the above query correct? I suspect yes because it looks right, but not because I understand why.

What part of “and” behavior or “ON” behavior am I not understanding ? Why does changing the order change what is being graphed? Is it generally true that the order of “and” groupings is important as to which item(s) you want it to actually graph?

Rubber ducky for the win.

I was explaining what I was working through to a co-worker, showing each piece, how reversing the order changes the output, and referring back to the documentation in Querying → Operators → Logical Operators. Then all of a sudden, the answer revealed itself in the docs (emphasis mine) :

vector1 and vector2 results in a vector consisting of the elements of vector1 for which there are elements in vector2 with exactly matching label sets. … The metric name and values are carried over from the left-hand side vector.

So yes, the behavior I was seeing (order of the two parts) matters. Values come from the left hand side of the logical operator.

Sorry for polluting your timelines.

2 Likes