How do you track state machine states: metrics or logs?

I’m setting up Grafana as a monitoring tool for a fleet of robotic systems.

Each robot runs Alloy as a telemetry forwarder, sending data to a Grafana Cloud instance. Metrics are sent via OTLP, and Alloy is also configured to tail logs.

The robotic system includes a state machine that I want to monitor over time. Right now, I’m parsing these states from logs using LogQL, but this approach frequently exceeds the fair use query limits, which makes me think I’m doing it the wrong way.

Would it make more sense to expose the system state as a metric, perhaps casting the states as an enum?

My concern with that approach is versioning: if I add or modify states later, I’d have to update the metric schema and adjust visualizations to handle multiple “versions” of the state machine. I have also already defined the states as an enum in my code, so it feels odd to define them in two places. It all seems a little messy, which makes me wonder whether this is a good route either.

Has anyone found a clean way to handle evolving state machines in Grafana while keeping dashboards maintainable and query costs reasonable? Something like a “string” metric?

Can you share what your logs look like and what query you are using for alerts, please?

Normally a metrics query incurs less data usage than a log query. You can definitely consider changing the state monitoring from logs to metrics, and you don’t necessarily need to pre-define an enum for the state type. For example, let’s say you have the states “fail” and “success”; you can produce metrics like so:

# if success
machine_state{state="success",machine_id="123"} 1

# if fail
machine_state{state="fail",machine_id="123"} 1

Then you can match the state at query time to check for either success or fail. This also makes changing the states easier, because you don’t need to touch the machines; you just need to adjust your alerts.
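For example (untested, and using the example metric and label names from the snippet above), a PromQL query like this would match machines currently in the fail state:

machine_state{state="fail"} == 1

and you could count how many machines are failing at once with:

count(machine_state{state="fail"} == 1)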

Hi @tonyswumac thanks for answering!

I am not using an alert; I am using a LogQL query directly to generate the input for the visualization, with a line limit of one, since I thought that might decrease the incurred query size.

{hostname="system-hostname"} |= `publishing system state` | pattern `<timestamp> [<node>] <misc>: <mis1>: <system_state>e[0me[0m`

which I use to parse a log line that looks like this:

1761027275.4510579 [system_monitor_node-52] [INFO] [1761027275.442901531] [system_monitoring_node]: publishing system state: IDLE

According to Grafana Explore “This query will process approximately 8.4 MiB”, but this is also probably because the logs are rather verbose.

With the solution that you suggest I would be creating an active series per state, which I guess is fine, but is it then still possible to create a state timeline?

I see, didn’t realize you have logs in Loki already.

Let’s say we have the following log line:

1761027275.4510579 [system_monitor_node-52] [INFO] [1761027275.442901531] [system_monitoring_node]: publishing system state: IDLE

What is the goal you want to achieve?

I want to be able to track the state that the system is in, and be able to see it within a resolution of 1 DPM (one data point per minute). From this I think I can then also derive other metrics, such as setting up alerts that will let me know when a system has stayed in a certain state for too long, or whether the system has failed a mission.

Do you think that this is the wrong way of handling it? I guess it is also possible to do the state tracking on the device and only output a metric if it is over a certain threshold.

I am surprised by how many hurdles I need to overcome to make this work, leading me to believe that the design of Grafana is actively discouraging me from doing so.

The “fair use query limit” is not a design limit of Grafana; it is a Grafana Cloud limit.
Use your own metric/log storage and you can query it nonstop without reaching the fair use query limit. You just need to manage your own storage (e.g. Loki if you still want to use logs).

As Jan mentioned above, you are using a free version of Grafana Cloud, so limitations are expected. If you go over them, the only thing you can really do is reduce the log volume or make the logs less verbose.

To address your other questions: in order to create a time series graph you will need to map each state to some sort of numeric value. I don’t know what possible states you have, but just as an example let’s say:

IDLE = 0
ACTIVE = 1
FAILED = -1
UNKNOWN = -2

Then you’d need to craft a query like this (not tested)

{hostname="system-hostname"}
  |= `publishing system state`
  | pattern `<timestamp> [<node>] <misc>: <mis1>: <system_state>e[0me[0m`
  | label_format system_state_int=`{{ if eq .system_state "IDLE" }}0{{ else if eq .system_state "ACTIVE" }}1{{ else if eq .system_state "FAILED" }}-1{{ else }}-2{{ end }}`
  | unwrap "system_state_int"

And then you can wrap metric functions such as sum_over_time or avg_over_time around that entire thing, now that it’s producing a number.
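For instance, the whole query wrapped in avg_over_time (also untested; the 1m range is just picked to line up with your 1 DPM goal) would look roughly like:

avg_over_time(
  {hostname="system-hostname"}
    |= `publishing system state`
    | pattern `<timestamp> [<node>] <misc>: <mis1>: <system_state>e[0me[0m`
    | label_format system_state_int=`{{ if eq .system_state "IDLE" }}0{{ else if eq .system_state "ACTIVE" }}1{{ else if eq .system_state "FAILED" }}-1{{ else }}-2{{ end }}`
    | unwrap system_state_int [1m]
)

which produces a numeric series you can feed into a time series or state timeline panel.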

If you are doing alerts then you don’t need to be quite as elaborate; you could just compare the state string and alert based on your desired alerting conditions.
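For example, a Loki alert rule condition along these lines (again not tested; the 5m window and the FAILED state are just placeholders for whatever your real states are) would fire whenever a FAILED state line shows up:

count_over_time(
  {hostname="system-hostname"} |= `publishing system state: FAILED` [5m]
) > 0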

I am hesitant to host my own metric/log storage, since it will be another thing to maintain.

@tonyswumac regardless of the Grafana Cloud tier that I am using, the fair use policy applies (link).

I think that I will go the route of making a metric per system state, since even with 10 states I have more than enough room within the 10k active series that are part of the free tier.

I am currently thinking of naming the state metrics by prefixing them with “system_state.”. I can then query Prometheus with the following:

{__name__=~"system_state.*"} 

This will return a series per state.
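Just to illustrate what I expect back (the state names are made up, and I am assuming each state metric is 1 while the system is in that state and 0 otherwise):

system_state.idle{hostname="system-hostname"}   1
system_state.active{hostname="system-hostname"} 0
system_state.failed{hostname="system-hostname"} 0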

I can then apply three transformations:

This is showing some promise.

What do you guys think of this approach?

Yes, I think that’s a fine solution. And if you are approaching the limit, it helps that, generally speaking, metrics take up less volume than logs.