We currently use Azure Monitor (Log Analytics + Metrics) for monitoring and alerting across our Azure estate.
We’re considering a change in architecture:
- Retain Azure Monitor for:
  - Log ingestion (Log Analytics workspace)
  - Metrics ingestion
  - Data storage & retention
- Introduce self-hosted Grafana for:
  - Dashboards / visualisation
  - Alerting
  - Jira integration
This is driven by dashboard flexibility and alerting requirements, but we need to justify the operational complexity.
Proposed Architecture
- Azure Monitor continues ingesting logs/metrics
- Grafana connects via Azure Monitor / Log Analytics data source
- Grafana self-hosted (not Grafana Cloud)
- Multiple Grafana instances behind a Load Balancer
- Shared database backend for HA
- 5-second dashboard refresh requirement
Key Questions
- What are the practical pros/cons of using Grafana for visualisation while keeping Azure Monitor for ingestion?
- What advantages does Grafana Alerting provide over Azure Monitor Alerts in real-world use?
- Is self-hosting Grafana (HA behind LB) worth the operational overhead vs staying fully in Azure Monitor?
- How mature is Jira integration in Grafana vs Azure Monitor Action Groups?
- Are there hidden performance or cost considerations when querying Log Analytics from Grafana at 5-second refresh intervals?
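To put a rough number on that last question, here is a back-of-envelope sketch of the query volume a 5-second refresh implies. The panel and viewer counts are placeholder assumptions, not measurements from our estate, and the estimate assumes each panel issues one query per refresh with no caching in between.

```python
SECONDS_PER_DAY = 86_400


def estimate_query_load(panels, open_dashboards, refresh_seconds=5):
    """Return (queries_per_second, queries_per_day) for auto-refreshing dashboards.

    Assumes one data source query per panel per refresh and no query caching.
    """
    queries_per_refresh = panels * open_dashboards
    per_second = queries_per_refresh / refresh_seconds
    per_day = per_second * SECONDS_PER_DAY
    return per_second, per_day


# Hypothetical numbers: 10 panels per dashboard, 5 dashboards open at once.
per_second, per_day = estimate_query_load(panels=10, open_dashboards=5)
print(f"{per_second:.0f} queries/s, {per_day:,.0f} queries/day")
```

With those placeholder numbers this works out to 10 queries per second, or 864,000 per day, which is the kind of figure I'd want to sanity-check against Log Analytics query limits and cost.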
Known Constraints
- Grafana must be self-hosted (cost reasons; no SaaS).
- High availability required.
- Need Jira ticket creation from alerts.
- Need rich dashboards beyond what Azure Workbooks currently provide.
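For the Jira requirement, one possible approach would be a small webhook shim between Grafana and Jira. The sketch below only shows the mapping step, as a pure function with no network calls. It assumes the alert arrives as a Grafana webhook-style payload with an `alerts` list whose entries carry `status`, `labels` and `annotations`, and it builds a Jira REST v2-style issue body with a plain-text description. The project key `OPS`, the issue type `Task` and the `severity` label are hypothetical names, not our real configuration.

```python
def grafana_alert_to_jira_issue(payload, project_key="OPS", issue_type="Task"):
    """Map a Grafana webhook-style payload to a Jira issue-create body.

    Returns None when nothing in the payload is firing, so resolved
    notifications do not open new tickets.
    """
    firing = [a for a in payload.get("alerts", []) if a.get("status") == "firing"]
    if not firing:
        return None

    first = firing[0]
    labels = first.get("labels", {})
    annotations = first.get("annotations", {})

    # "severity" is an assumed label; fall back to "unknown" if it is absent.
    severity = labels.get("severity", "unknown")
    alert_name = labels.get("alertname", "Grafana alert")

    description_lines = [annotations.get("summary", "")]
    description_lines += [f"{key}={value}" for key, value in sorted(labels.items())]
    description = "\n".join(line for line in description_lines if line)

    return {
        "fields": {
            "project": {"key": project_key},
            "summary": f"[{severity}] {alert_name}",
            "description": description,
            "issuetype": {"name": issue_type},
        }
    }


# Hypothetical sample payload, trimmed to the fields the function reads.
sample = {
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighCPU", "severity": "critical"},
            "annotations": {"summary": "CPU above 90% for 5 minutes"},
        }
    ]
}
issue = grafana_alert_to_jira_issue(sample)
```

Whether a shim like this is needed at all, or whether a built-in contact point covers it, is part of what I'm asking.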
What I’m Looking For
I’m not looking for marketing comparisons — I need architectural tradeoffs, operational realities, and real-world experience from teams that have:
- Replaced Azure dashboards with Grafana
- Or run both side by side
- Or reverted to Azure Monitor
Especially interested in:
- Alert noise handling differences
- Multi-dimensional alerting
- RBAC complexity
- Maintenance overhead
- Scaling implications
If anyone has implemented this pattern (Azure Monitor ingestion + Grafana visualisation/alerting), I’d appreciate insight into whether the complexity is justified.