Hi, I’m a colleague of Jostein and feel like adding “some” context.
Jostein and I are part of a team of 5-6 people that works exclusively on developing and maintaining our monitoring and alerting stacks. We currently handle between 10 and 100 thousand (10 000-100 000) devices at several thousand geographical locations, in addition to passive infrastructure. This is typically routers, switches, microwave links, CMTSes, DSLAMs, OLTs, WDM gear, etc., and we’re now also introducing data from CPEs, which will mean several hundred thousand more devices, though that is a long-term project. We are gradually moving everything to Grafana, but today only a tiny fraction of our use cases is covered by Grafana. Even so, we have around 200 active everyday users, ranging from developers to on-site technicians.
I mention all this not because it makes our needs more important somehow, but to emphasize that we are willing to do the work. We are not asking for a handout here, but for help on how best to contribute. I would have preferred a face-to-face talk to throw some ideas back and forth, but that seems unrealistic in the immediate future.
Instead, I’ll present some actual scenarios we’re facing.
Topology lookups
We have a topology database, and a set of rules for how we build our network. E.g.: All core routers come in pairs that are located at different geographical locations. Everything that is connected to a core router is also connected to its sibling.
As such, we need dashboards that can reflect this. When viewing a core router, it is very useful to also see its sibling. If there is or was a traffic spike on one link, it should show up as an inverse traffic spike on the sibling. E.g.: if a fiber cable is cut, the traffic will move to the other core router.
We have been able to hack this together with template variables, but it’s not pretty, and different rules apply to different parts of the infrastructure. Further out in the network, the topology could be “half-rings”, where a number of routers are connected in series but both ends are connected to core. Each node will then have redundant uplinks, but you need to see more than just the immediate neighbors to get the full picture (e.g. with sites connected core-A-B-C-D-E-core, when looking at C you need more than just C-B and C-D to understand the state of the “half-ring”).
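To make this concrete, the kind of lookup we do today is roughly the following. This is just a sketch; the table and column names, and the DB-API backend, are invented for the example:

```python
# Hypothetical lookup against our topology database (table and column
# names are made up): given a core router, find its redundant sibling so
# panels for both can be shown on the same dashboard.
import sqlite3  # stand-in for whichever SQL backend the topology DB uses

def sibling_of(conn: sqlite3.Connection, router: str) -> "str | None":
    row = conn.execute(
        "SELECT sibling FROM core_pairs WHERE router = ?", (router,)
    ).fetchone()
    return row[0] if row else None

# e.g. sibling_of(conn, "core-oslo-1") might return "core-bergen-1"
```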
Retention periods
We store data at different resolutions in Influx. When viewing data that is less than 2 days old, you get real-time resolution; less than 14 days gives you 5-minute data, and so forth. We’ve solved this with template variables that do a lookup in Influx, but it means two things. First, these template variables have to be manually copied into every dashboard, and they seem pretty magical unless you really dig into them. Second, retention periods, which are an internal technical detail, suddenly become part of the URL. This is not a huge deal, until we consider that we’re probably going to be making several hundred dashboards that will be used daily.
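For illustration, the logic we effectively encode in those template variables is something like this (the retention policy names and cut-offs here are examples, not our real configuration):

```python
from datetime import timedelta

# Illustrative mapping from "how far back are we looking" to which
# InfluxDB retention policy holds data at the right resolution.
# The names and cut-offs are examples only.
RETENTION_POLICIES = [
    (timedelta(days=2), "realtime"),      # raw data
    (timedelta(days=14), "five_minute"),  # 5 minute rollups
    (timedelta(days=365), "one_hour"),    # hourly rollups
]

def retention_policy_for(window: timedelta) -> str:
    """Pick the retention policy a dashboard query should read from."""
    for max_age, rp in RETENTION_POLICIES:
        if window <= max_age:
            return rp
    return RETENTION_POLICIES[-1][1]
```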
Grouping of interfaces
Many of our routers have several hundred links, and being able to group them is a requirement. The simplest type of grouping is doing a database lookup on which interfaces are links to core, or “more central”, routers and sorting those first. Having a separate panel/table for this is fine, but Influx doesn’t give us this information; we need to correlate with a traditional SQL database.
Another example: for big customers, we might want to do the inverse. E.g.: look up the customer, find all routers and interfaces they use and make some fancy graphs. That would require some really nasty template variables today.
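As a sketch of the correlation we’re after (table and column names invented, any DB-API connection assumed), it’s essentially this kind of query against the inventory database, with the customer case being the same query from the other direction:

```python
# Sketch of the correlation we need from the inventory database (table
# and column names invented): which interfaces on a router are uplinks
# towards core, so they can be grouped and sorted first.
def core_uplinks(conn, router: str) -> "list[str]":
    rows = conn.execute(
        """
        SELECT ifname FROM interfaces
        WHERE router = ? AND neighbor_role = 'core'
        ORDER BY ifname
        """,
        (router,),
    ).fetchall()
    return [r[0] for r in rows]

# The customer case is the same idea inverted: SELECT router, ifname
# FROM interfaces WHERE customer = ?, then graph whatever comes back.
```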
Tables
This one seems almost silly…
We generate a table of interfaces that contains the interface name, calculated bandwidth usage and utilization, and the description for each interface. But because Influx doesn’t support joins (unless you use Flux, and that’s a different story), we end up with a monstrosity of a query. I am inclined to blame this on Influx, but the end result is the same: we get a table that takes MUCH longer to load than the actual traffic stats.
Part of the challenge is that an interface isn’t necessarily just an interface. Traffic stats tend to be per physical link, but a logical interface might have multiple physical links and it’s the logical one we’re after.
I’m not sure I see a clear solution to this - we’re probably going to be experimenting with table panels in the near future, but I can imagine that being able to post-process data would give us far more options.
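What I mean by post-processing is roughly this: fetch traffic rates and interface metadata as two cheap queries and join them in code, including rolling physical member links up into their logical interface. A minimal sketch, with all names invented:

```python
from collections import defaultdict

# Illustration of post-processing instead of one giant InfluxQL query:
# join traffic rates (from Influx) with descriptions (from SQL) in code,
# and roll physical member links up into their logical interface.
def build_table(rates: dict, members: dict, descriptions: dict) -> list:
    """rates: physical ifname -> bps, members: physical -> logical ifname,
    descriptions: logical ifname -> description text."""
    per_logical = defaultdict(float)
    for phys, bps in rates.items():
        per_logical[members.get(phys, phys)] += bps
    return [
        {"interface": name, "bps": bps, "description": descriptions.get(name, "")}
        for name, bps in sorted(per_logical.items())
    ]
```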
Conclusion
I suppose what I really want to see first is the ability to populate “internal” template variables with a script we can provide per dashboard.
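To make that concrete, what I picture is something in this spirit - a small script shipped alongside the dashboard that Grafana runs to fill in variables before rendering. The interface below is purely imagined, just to show the shape of the idea:

```python
# Imagined per-dashboard hook: given the current dashboard context
# (selected router, time window, ...), return the "internal" variables
# the panels need. The interface is invented; it only shows the shape.
import json
import sys

def lookup_sibling(router: str) -> str:
    # placeholder; in reality this would query the topology database
    return router + "-sibling"

def populate(context: dict) -> dict:
    router = context["router"]
    window_days = context["window_days"]
    return {
        "sibling": lookup_sibling(router),
        "rp": "realtime" if window_days <= 2 else "five_minute",
    }

if __name__ == "__main__":
    print(json.dumps(populate(json.load(sys.stdin))))
```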
My suggestion is that we start working on a prototype, but first I’d very much like some thoughts from someone who is more familiar with the future plans for Grafana - maybe some big-picture takes. And ideally some communication during the prototyping, so it isn’t just us guessing at what would be best for Grafana and praying that, after 12 months of work, it will be accepted upstream.