In Grafana 9.1.6, I am troubleshooting someone else’s alert rule that is supposed to calculate mysql “seconds behind master” for replicas.
The original query is:
mysql_slave_status_master_server_id > 0 and ON (instance)
(mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay)
> 30
I am testing the query in the Explore panel. The first half before the “and” is returning the value of the master server id (in my example, one set of servers is 101 and the other set of servers is 201). The second half is usually somewhere close to zero (-1 or -2 on lightly loaded servers, almost always 0 on heavily loaded servers). I read this query overall as “if a mysql server has a master, then calculate number of seconds lag”.
Unexpectedly, the graphed value of the above is a straight line of the first half of the query, before the “and”. The query ends up comparing the server_id to the limit of “30”. I expected it to use the first half as a boolean true/false, and then calculate/compare the second half to the limit. I now feel like this has probably never worked.
I have been experimenting, and I can get something closer to what I expected if I adjust/hack the formula like this:
clamp_max(mysql_slave_status_master_server_id > 0, 1) * ON (instance)
(mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay)
> 30
The theory is that clamp_max(metric, 1)
results in effectively converting “does it have a master server id defined” to true/false (1 or 0), which I then multiply times the actual computed values. It seems to be giving me the expected graphs.
Is this a ridiculous way to do it? Is there a better way to convert a metric like this to a boolean? (I googled a bunch but the overloading of the terms “metrics” “boolean” “convert”, etc resulted in a lot of blogs and documentation that didn’t seem to attack to what I am trying to solve … but I would be thrilled to be proven wrong.)
This query comes from the awesome prometheus rules collection: