I’m setting up Grafana as a monitoring tool for a fleet of robotic systems.
Each robot runs Alloy as a telemetry forwarder, sending data to a Grafana Cloud instance. Metrics are sent via OTLP, and Alloy is also configured to tail logs.
The robotic system includes a state machine that I want to monitor over time. Right now, I’m parsing these states from logs using LogQL, but this approach frequently exceeds the fair use query limits, which makes me think I’m doing it the wrong way.
Would it make more sense to expose the system state as a metric, perhaps casting the states as an enum?
My concern with that approach is versioning: if I add or modify states later, I’d have to update the metric schema and adjust visualizations to handle multiple “versions” of the state machine. Also, I have already defined the states as an enum in my code, so it feels odd to have to define them in two places. It all seems a little messy, which makes me wonder whether this is a good route either.
Has anyone found a clean way to handle evolving state machines in Grafana while keeping dashboards maintainable and query costs reasonable? Is there something like a “string” metric?
Can you share what your logs look like and what query you are using for alerts, please?
Normally a metrics query incurs less data usage than a log query. You can definitely consider moving the state monitoring from logs to metrics, and you don’t necessarily need to pre-define an enum for the state type. For example, say you have the states “fail” and “success”; you can produce metrics like so:
# if success
machine_state{state="success",machine_id="123"} 1
# if fail
machine_state{state="fail",machine_id="123"} 1
Then you can match the state at query time, to check for either success or fail. This also makes changing the state easier, because you don’t need to touch the machines, you just need to adjust your alerts.
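As a rough sketch of how the samples above could be generated from an enum that already exists in the robot's code (so the states are only defined once), something like the following might work. The SystemState members other than IDLE, the function name render_state_metric, and the choice to also emit 0 for inactive states are assumptions for illustration, not something taken from this thread or from any Grafana API.

```python
from enum import Enum


class SystemState(Enum):
    # Hypothetical states; only IDLE appears in the thread.
    IDLE = "IDLE"
    RUNNING = "RUNNING"
    FAULT = "FAULT"


def render_state_metric(current: SystemState, machine_id: str) -> list:
    """Build one text-format sample per enum member.

    The current state gets value 1 and every other state gets 0
    (an assumed variation on the example above, which only shows
    the active state).
    """
    lines = []
    for state in SystemState:
        value = 1 if state is current else 0
        lines.append(
            f'machine_state{{state="{state.value}",machine_id="{machine_id}"}} {value}'
        )
    return lines


if __name__ == "__main__":
    print("\n".join(render_state_metric(SystemState.IDLE, "123")))
```

Because the label values come straight from the enum, adding a state in code adds a new series without a separate schema to maintain; queries that match on the state label pick it up at query time.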
Hi @tonyswumac thanks for answering!
I am not using an alert; I am using a LogQL query directly to generate the input for the visualization, with a line limit of 1, since I thought that might decrease the incurred query size.
{hostname="system-hostname"} |= `publishing system state` | pattern `<timestamp> [<node>] <misc>: <mis1>: <system_state>e[0me[0m`
which I use to parse a log line that looks like this:
1761027275.4510579 [system_monitor_node-52] [INFO] [1761027275.442901531] [system_monitoring_node]: publishing system state: IDLE
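For illustration, the same extraction could be done outside LogQL, for example in a small helper that turns such a line into a state value before it is exported as a metric. This is only a sketch: the names extract_state, ANSI_RE, and STATE_RE are hypothetical, and it assumes the state is a single word following "publishing system state:" and that any trailing colour codes are standard ANSI escape sequences.

```python
import re
from typing import Optional

# Matches ANSI colour/reset sequences such as ESC[0m (assumed to be what trails the line).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# Assumes the state is one word right after the fixed message text.
STATE_RE = re.compile(r"publishing system state:\s*(\w+)")

SAMPLE = (
    "1761027275.4510579 [system_monitor_node-52] [INFO] "
    "[1761027275.442901531] [system_monitoring_node]: "
    "publishing system state: IDLE"
)


def extract_state(line: str) -> Optional[str]:
    """Return the state name from a log line, or None if the line does not match."""
    clean = ANSI_RE.sub("", line)
    match = STATE_RE.search(clean)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_state(SAMPLE))
```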
According to Grafana Explore “This query will process approximately 8.4 MiB”, but this is also probably because the logs are rather verbose.
With the solution you suggest, I would be creating one active series per state, which I guess is fine, but is it then still possible to create a state timeline?