Best practice for monitoring a SaaS estate

I’m trying to create a single pane of glass that gives us an overview of a SaaS setup our firm offers to the market. It runs in Azure across multiple resource groups, generally one resource group per tenant of our service.

I’m looking for advice on how best to do this. Please point me at any case studies or documentation on this sort of topic if they exist (I wasn’t able to find any by searching, hence this post).

What I’m looking to do is create a dashboard that gives us an overall view of the health of the key metrics across our estate. I want it to be easy to extend as we bring on new clients, so ideally without having to add a new series for each tenant as they come online (it looks like repeating over resource groups isn’t supported for Azure, so I’m hoping the Grafana API can help here).

I would like to create different panels for different metrics, each showing only the hottest series across all the resources being monitored. For example, across all our PaaS DBs, show the top 5 by CPU and hide all the others, giving us a view of the systems under the most load. The theory is that if resource usage is lower, those systems are just running as expected and don’t need to be highlighted.

Is this something that is possible?

Thanks in advance!

So after some more investigation I think I might have hit on a solution. Sharing it here in case it helps others, and if anyone wants to suggest why this isn’t a good idea, feel free!

The breakthrough was realising that Azure Monitor isn’t a good source of data if your resources are spread across multiple resource groups, as it can’t easily be queried across them.

I am now feeding my different SQL and App Service instances into a new central Log Analytics workspace, and this is what I have configured Grafana to query.

Then, by writing Kusto queries like the following, I am able to list, for instance, the length of my HTTP request queue, but only for the top five worst offenders:

// Pick the five resource groups with the largest (approximate) summed Maximum
let Top_5 = AzureMetrics
| where $__timeFilter(TimeGenerated)
| where ResourceProvider == "MICROSOFT.WEB"
| where MetricName == "RequestsInApplicationQueue"
| top-hitters 5 of ResourceGroup by Maximum
| project ResourceGroup;
// Return the time series for those five groups only
AzureMetrics
| where $__timeFilter(TimeGenerated)
| where ResourceProvider == "MICROSOFT.WEB"
| where MetricName == "RequestsInApplicationQueue"
| where ResourceGroup in (Top_5)
| summarize by ResourceGroup, Maximum, TimeGenerated
| order by TimeGenerated asc

So I can now get a visualization showing only the most pressing problems. As and when new resource groups are configured to feed my central Log Analytics workspace, they will get picked up by this query, which means zero overhead on the monitoring, which is a requirement.
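
In case it helps to see the selection logic outside Kusto, here is a minimal Python sketch of what I understand the query above to be doing: rank resource groups by the sum of their Maximum values, then keep only the rows for the top few. The tenant names and values are made up for illustration, and the sketch uses an exact sum where top-hitters returns an approximation, so treat it as a rough model rather than a faithful reimplementation.

```python
from collections import defaultdict


def top_n_groups(rows, n=5):
    """Return the n resource group names with the largest summed Maximum."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["ResourceGroup"]] += row["Maximum"]
    ranked = sorted(totals, key=lambda group: totals[group], reverse=True)
    return ranked[:n]


def hottest_series(rows, n=5):
    """Keep only rows from the top n groups, ordered by time."""
    keep = set(top_n_groups(rows, n))
    kept = [row for row in rows if row["ResourceGroup"] in keep]
    return sorted(kept, key=lambda row: row["TimeGenerated"])


# Hypothetical sample data: one resource group per tenant, two samples each.
rows = [
    {"ResourceGroup": "tenant-a", "TimeGenerated": 1, "Maximum": 2.0},
    {"ResourceGroup": "tenant-a", "TimeGenerated": 2, "Maximum": 3.0},
    {"ResourceGroup": "tenant-b", "TimeGenerated": 1, "Maximum": 7.0},
    {"ResourceGroup": "tenant-b", "TimeGenerated": 2, "Maximum": 5.0},
    {"ResourceGroup": "tenant-c", "TimeGenerated": 1, "Maximum": 0.5},
    {"ResourceGroup": "tenant-c", "TimeGenerated": 2, "Maximum": 0.5},
]

print(top_n_groups(rows, n=2))  # ['tenant-b', 'tenant-a']

# A new tenant that starts reporting is picked up with no code change.
rows.append({"ResourceGroup": "tenant-d", "TimeGenerated": 1, "Maximum": 40.0})
print(top_n_groups(rows, n=2))  # ['tenant-d', 'tenant-b']
```

One thing this made clearer to me: ranking by a sum favours groups that are busy for the whole time range over ones with a single short spike, so if you care about peaks you may want to rank on a max instead.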