New alternative to scripted dashboards

Hi!
My team at Telenor Norway loves Grafana and uses it a lot. Thanks for making it. We would love to be a part of making Grafana better for ourselves and others in the future. With that in mind, we have some needs that Grafana cannot meet right now, and we would like to discuss them with all of you.

We were big fans of the scripted JS dashboards and were sorry when they got deprecated. We liked that approach because of the flexibility and control it gave us by letting us run code as part of the dashboards.

As a way of gaining the same advantages, we suggest introducing hooks for running scripts at different stages of dashboard loading. For instance, there could be a hook that allows running code between the initialisation of the template variables and the fetching of graph data from the data source. This could be a great way of handling template variables that are not user input (or taken from the URL), but are instead calculated or the result of a query.
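To make the idea a bit more concrete, here is a minimal sketch of what such a hook registry could look like. Everything in it is hypothetical: the stage names, `HookRegistry`, and `DashboardContext` are made up for illustration and are not part of Grafana. The hooks are synchronous to keep the sketch short; a real version would probably need to await asynchronous hooks.

```typescript
// Hypothetical hook stages; none of these names exist in Grafana today.
type Stage = "afterVariablesInit" | "afterDataFetch";

// Hypothetical, simplified view of the dashboard state passed to a hook.
interface DashboardContext {
  variables: Record<string, string>;
}

type Hook = (ctx: DashboardContext) => void;

class HookRegistry {
  private hooks = new Map<Stage, Hook[]>();

  register(stage: Stage, hook: Hook): void {
    const list = this.hooks.get(stage) ?? [];
    list.push(hook);
    this.hooks.set(stage, list);
  }

  // Runs the hooks for one stage in registration order.
  run(stage: Stage, ctx: DashboardContext): void {
    for (const hook of this.hooks.get(stage) ?? []) {
      hook(ctx);
    }
  }
}

// Example: derive a variable that is neither user input nor part of the URL.
// The router names are placeholders.
const registry = new HookRegistry();
registry.register("afterVariablesInit", (ctx) => {
  ctx.variables["sibling"] =
    ctx.variables["router"] === "core-a" ? "core-b" : "core-a";
});
```

The dashboard loader would call `registry.run("afterVariablesInit", ctx)` once the template variables are initialised and before any panel queries are sent.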

Another example where this would be useful is a hook between fetching graph data from the data source and presenting it. This could be a great way of introducing functionality that currently depends on the query language, such as time-shift.
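As a rough sketch of the time-shift case, a post-fetch hook could simply move the timestamps of an older series forward so it overlays the current range. The `Point` shape and the `timeShift` name are assumptions made for this example, not an existing Grafana data format or function.

```typescript
// Assumed, simplified series format: time is in milliseconds since the epoch.
interface Point {
  time: number;
  value: number;
}

// Returns a copy of the series with every timestamp moved forward by shiftMs,
// so that older data lines up with the currently displayed time range.
function timeShift(series: Point[], shiftMs: number): Point[] {
  return series.map((p) => ({ time: p.time + shiftMs, value: p.value }));
}

const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// Placeholder data standing in for yesterday's query result.
const yesterday: Point[] = [
  { time: 0, value: 10 },
  { time: 60_000, value: 12 },
];
const overlay = timeShift(yesterday, ONE_DAY_MS);
```

Because the shift happens after the query, it would work the same way regardless of whether the data source's query language supports it.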

I’m sure there are a ton of use cases for this that we have not realised yet. It could be a way of providing advanced functionality without having to cater for every need that advanced users might have.

What do you think? Could it be a viable way forward?

Hi, I’m a colleague of Jostein and feel like adding “some” context.

Jostein and I are part of a team of 5-6 people who work exclusively on developing and maintaining our monitoring and alerting stacks. We currently handle between 10,000 and 100,000 devices at several thousand geographical locations, in addition to passive infrastructure. These are typically routers, switches, microwave links, CMTSes, DSLAMs, OLTs, WDM equipment, etc., and we’re now also introducing data from CPEs, which will mean several hundred thousand more devices. But this is a long-term project. We are gradually moving everything to Grafana. Today only a tiny fraction of our use cases are covered by Grafana, though we do have around 200 active everyday users, ranging from developers to on-site technicians.

I mention all this not because it somehow makes our needs more important, but to emphasize that we are willing to do the work. We are not asking for a hand-out here, but for help on how best to contribute. I would have preferred to talk face to face and throw some ideas back and forth, but that seems unrealistic in the immediate future.

Instead, I’ll present some actual scenarios we’re facing.

Topology-lookups

We have a topology database, and a set of rules for how we build our network. E.g.: All core routers come in pairs that are located at different geographical locations. Everything that is connected to a core router is also connected to its sibling.

As such, we need dashboards that can reflect this. When viewing a core router, it is very useful to also see its sibling. If there is or was a traffic spike on one link, it should show up as an inverse spike on the sibling. E.g.: If a fiber cable is cut, the traffic will move to the other core router.

We have been able to hack this together with template variables, but it’s not pretty. And different rules apply to different parts of the infrastructure. Further out in the network, the topology could be “half-rings”, where a number of routers are connected in series, but both ends are connected to core. Each node will then have redundant uplinks, but you need to see more than just the immediate neighbors to get the full picture. (E.g.: With sites connected core-A-B-C-D-E-core, when looking at C you need more than just C-D and C-B to understand the state of the “half ring”.)
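To illustrate the half-ring case, here is a small sketch of the lookup we would like to run from a script. It assumes the ring is already available as an ordered list of node names from the topology database; `pathsToCore` and the `core-1`/`core-2` labels are made up for the example.

```typescript
// Returns the two paths from `node` to the core routers at either end of a
// half ring. `ring` is the ordered list of nodes, with core at both ends.
function pathsToCore(ring: string[], node: string): [string[], string[]] {
  const i = ring.indexOf(node);
  if (i <= 0 || i >= ring.length - 1) {
    throw new Error(`${node} is not an interior node of the ring`);
  }
  const towardsFirst = ring.slice(0, i + 1).reverse(); // node ... first core
  const towardsSecond = ring.slice(i); // node ... second core
  return [towardsFirst, towardsSecond];
}

const halfRing = ["core-1", "A", "B", "C", "D", "E", "core-2"];
const [viaB, viaD] = pathsToCore(halfRing, "C");
// viaB is ["C", "B", "A", "core-1"]
// viaD is ["C", "D", "E", "core-2"]
```

A dashboard for C could then graph every hop on both paths instead of only the C-B and C-D links.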

Retention periods

We store data at different resolutions in Influx. When viewing data that is less than 2 days old, you get real-time data; less than 14 days old gives you 5-minute data, and so forth. We’ve solved this by using template variables with a lookup in Influx, but it has two drawbacks. First, these template variables have to be copied manually into every dashboard, and they seem pretty magical unless you really dig into them. Second, retention periods, which are an internal technical detail, suddenly become part of the URL. This is not a huge deal until we consider that we’re probably going to be making several hundred dashboards that will be used daily.
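The selection logic itself is small, which is why we would prefer to keep it in a script rather than in template variables. A minimal sketch follows; the 2-day and 14-day limits are the ones mentioned above, while the policy names and the fallback for older data are placeholders, not our actual configuration.

```typescript
interface Tier {
  maxAgeDays: number;
  policy: string;
}

// The age limits come from the description above; the policy names and the
// fallback are invented for illustration.
const TIERS: Tier[] = [
  { maxAgeDays: 2, policy: "realtime" },
  { maxAgeDays: 14, policy: "rollup_5m" },
];
const FALLBACK_POLICY = "rollup_coarse";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Picks the finest tier that still covers the oldest point in the range.
function pickPolicy(rangeStartMs: number, nowMs: number): string {
  const ageDays = (nowMs - rangeStartMs) / MS_PER_DAY;
  const tier = TIERS.find((t) => ageDays < t.maxAgeDays);
  return tier ? tier.policy : FALLBACK_POLICY;
}
```

Run from a hook before the queries are sent, this would keep the retention policy out of both the dashboard definition and the URL.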

Grouping of interfaces

Many of our routers have several hundred links, and being able to group them is a requirement. The simplest type of grouping is doing a database lookup on which interfaces are links to core, or to “more central” routers, and sorting them first. Having a separate panel/table for this is fine, but Influx doesn’t give us this information. We need to correlate with a traditional SQL database.
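As a sketch of that correlation step, assume the interface names come from the Influx query and the set of core-facing interfaces comes from a separate SQL lookup. The function name and the interface names below are illustrative only.

```typescript
// Puts core-facing interfaces first and keeps the original order otherwise
// (Array.prototype.sort is stable in ES2019 and later).
function coreLinksFirst(names: string[], coreFacing: Set<string>): string[] {
  return [...names].sort(
    (a, b) => Number(coreFacing.has(b)) - Number(coreFacing.has(a)),
  );
}

// Placeholder data: names as returned by the time-series query, and the
// result of a hypothetical topology lookup in the SQL database.
const fromInflux = ["ge-0/0/1", "xe-1/0/0", "ge-0/0/2", "xe-1/0/1"];
const fromSql = new Set(["xe-1/0/0", "xe-1/0/1"]);
const ordered = coreLinksFirst(fromInflux, fromSql);
// ordered is ["xe-1/0/0", "xe-1/0/1", "ge-0/0/1", "ge-0/0/2"]
```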

Another example is big customers, where we might want to do the inverse. E.g.: Look up the customer, find all the routers and interfaces they use, and make some fancy graphs. That would require some really nasty template variables today.

Tables

This one seems almost silly…

We generate a table of interfaces that contains the interface name, the calculated bandwidth usage and utilization, and the description for each interface. But because Influx doesn’t support joins (unless you use Flux, and that’s a different story), we end up with a monstrosity of a query. I am inclined to blame this on Influx, but the end result is the same: We get a table that takes MUCH longer to load than the actual traffic stats.

Part of the challenge is that an interface isn’t necessarily just an interface. Traffic stats tend to be per physical link, but a logical interface might have multiple physical links and it’s the logical one we’re after.
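As an illustration of the kind of post-processing we have in mind, here is a sketch that sums per-physical-link traffic into logical interfaces. It assumes the membership map is fetched separately (for example from the topology database); the types, names, and numbers are made up for the example.

```typescript
interface PhysicalStat {
  link: string; // physical link name
  bps: number; // traffic in bits per second
}

// Sums physical link traffic per logical interface. `memberOf` maps a
// physical link to the logical interface it belongs to.
function perLogical(
  stats: PhysicalStat[],
  memberOf: Record<string, string>,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const s of stats) {
    // Links without an entry in the map count as their own interface.
    const logical = memberOf[s.link] ?? s.link;
    totals[logical] = (totals[logical] ?? 0) + s.bps;
  }
  return totals;
}

const totals = perLogical(
  [
    { link: "xe-0/0/0", bps: 400 },
    { link: "xe-0/0/1", bps: 600 },
    { link: "ge-1/0/0", bps: 50 },
  ],
  { "xe-0/0/0": "ae0", "xe-0/0/1": "ae0" },
);
// totals is { ae0: 1000, "ge-1/0/0": 50 }
```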

I’m not sure I see a clear solution to this - we’re probably going to be experimenting with table panels in the near future, but I can imagine being able to post-process data would give us far more options.

Conclusion

I suppose what I really want to see first is the ability to populate “internal” template variables with a script we can provide per dashboard.

My suggestion is we start working on a prototype, but first I’d very much like some thoughts from someone who’s more familiar with the future plans for Grafana. Maybe some big-picture takes. And ideally some communication during the prototyping so it isn’t just us guessing at what would be best for Grafana and praying that after 12 months of work, it would be accepted upstream.

Would love to explore an evolution and re-imagining of scripted dashboards. This is on our roadmap for next year. I realize this is not a good timeline for you so let’s see what we can do short term.

Retention periods are a big problem for InfluxDB, as it does not have a transparent way to query different rollups / retentions. I’m not sure whether it’s a Grafana problem or something that should primarily be fixed in Influx.

Great to hear from you, and thanks for the reply. What are your initial thoughts on our “script hooks” proposal in this topic?

Also, where do we go from here? What would be the best way for us to discuss and collaborate going forward?

I have posted the same question to the “Grafana developers” mail group with some code examples and a screenshot. I hope we can continue this conversation either here or through the mail group. If this is the best place for it I will post the code examples and screenshots here as well.