Building a Dashboard for Bots - status and alerts

Hey guys,
I am looking for some help on how to build a dash board that can give our development team some metrics about their in production bots.

We currently have a table that holds all data about how many tickets are in the queue for each bot and what we essentially want to do is to set alerts based on the percentage difference between each instance of ticket count for each both

(for example if the queue has raised over the last 3 hours than something must be wrong because it indicates the bot is down).

Our table is in MS sql and I’ve been fighting grafana, time series, and various queries for the last week or so. I’d appreciate any help or any ideas you guys have.

Your description so far doesn’t really give us:

a) any idea of what the data is that you’re starting with

b) much of an idea about the visualisation you want to end up with.

I suggest:

  1. Tell us the structure of the database table/s you’re starting from - what
    fields do they have, and where it isn’t obvious, what does the data mean

  2. Tell us what information / statistics you’re trying to get Grafana to
    display from this data - single values, tables, graphs, histograms…

  3. Show us some manual queries you can perform on the database (nothing to do
    with Grafana now) in order to get the sort of information you’re looking for
    out of the data you’ve got.

Then we should have a better idea of where you’re starting from and where
you’re trying to get to and we might be able to help.

(To be honest, it’s not even clear to me what you mean by “bot” in the
description, but I suspect that’s not really material to getting Grafana to do
what you want.)

Regards,

Antony.

Hey, Thanks for your reply.
So we have a few automation bots that essentially does some kind of action. Most of them will close tickets to indicate they are running and live.
The structure of the database tables we have is basically a special logs table that counts for each bot how many tickets it has in queue from last run and an attribute if there were any tickets closed from the last run.
The information / statistics we’re trying to get is basically a green / red status that can tell us if the bot is up or down including alerts which i’ve already set. So basically what i’m having an hard time figuring out here is how to set an alert if the “ticket count” value stays the same for a few data points. For example we have a queue now which is stuck on 17 but my alert shows the bot is OK because it’s trying to look if there has been any percent difference over 50% therefore I think the rule does not fire.

The queries you can do to check if the bot is alive from example is
SELECT COUNT(*) FROM
SELECT * FROM bots_table
WHERE CAST(Datetime as DATETIME2) >= DATEADD(HOUR, -1, GETDATE())
AND BOT_Name = ‘SomeBotName’ and Tickets_In_Q = ‘0’

this should be the output:
image
since the number of tickets in queue only goes down over time, it means the bot is alive.

I’d like to ask a more basic question.

Isn’t there some way of actually checking the bots themselves to see whether
they’re alive?

It seems to me that you’re planning to wait until some number of unprocessed
tickets have built up over some period of time, and then you declare “whatever
is supposed to be processing these clearly isn’t”, so you raise an alert.

I would far prefer to run a continuous check on the status of the bots
themselves, so that you know as soon as they fail, and can react to that,
instead of waiting until you have a build-up of unprocessed tickets.

Antony.

Yes, we have another logs table for each process and I can just try and look if there have been any errors or successful messages in the last ~24 hours.
like it should be

SELECT * FROM Logs WHERE Error = ‘’
AND Date >= DATEADD(HOUR, -24, GETDATE())

Okay, sorry for having to ask this, but what’s the problem with doing exactly
that?

Personally I would probably use “select count(0)” instead of “select *” so
that you get the number of errors (which you want to be zero) rather than a
whole load of (presumably) textual data, bit other than that this looks liek a
good solution to me.

What’s not working with it?

Antony.