Can we switch monitoring between prod and DR environments?

Hello Community,

We are currently using Grafana Cloud as our primary observability and monitoring solution for our banking workloads. We are looking for advice on the best way to handle monitoring transitions during Disaster Recovery (DR) events.

Our Scenario:

  • Primary Environment: OCI-prod (Active)

  • DR Environment: GCP-dr (Standby)

  • Workloads: Kubernetes clusters, PostgreSQL on VMs, and various applications on Linux/Windows.

  • Current Tagging Strategy:

    • provider: OCI / GCP

    • env: prod / dr

    • os: linux / windows

    • workload-type: k8s / vm

The Challenge:
During a failover, our entire workload shifts from OCI to GCP. We need a streamlined solution to switch our monitoring views, alerts, and dashboards from the production environment to the DR environment and back again (failback) without manual reconfiguration.

Specific Questions:

  1. Dashboarding: What is the best way to build dashboards that can dynamically toggle between OCI and GCP data based on which environment is currently “Active”?

  2. Alerting: How can we prevent “false positive” alerts from the standby (DR) environment while ensuring that alerts automatically activate once the workload shifts?

  3. Cross-Cloud Correlation: Are there recommended ways to use Grafana Cloud’s synthetic monitoring or global labels to maintain a single “Golden Signal” view during the transition?

  4. Data Persistence: How do you recommend handling the continuity of metrics so that historical OCI data and new GCP data appear as a single timeline?

We would love to hear how others have architected this, especially regarding Alertmanager silencing or Variable-driven dashboards.

Thanks

Hi @skhamitkar

Good challenge… I will try to give some hints that may fit your use case and your processes.

Dashboards:

What I understand is that your metrics / logs carry provider and env labels that identify the source of the information. If you can publish a metric (let’s say current_env{env=<prod|dr>} or current_provider{provider=<oci|gcp>}) that states which site is currently active (prod or dr), then you can adjust your queries to join your metrics against that new metric with group_left, so that only data from the active env is selected. The important part is that the new metric uses the same label names as your standard metrics (env and provider).
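
A minimal sketch of that first option, under stated assumptions: the Python below renders such a marker metric in Prometheus text exposition format and wraps a PromQL expression so that it is joined against the marker on the env label. The helper names `render_marker` and `gate_by_active_env`, the convention that the active env has value 1 and the standby has value 0, and the `up{job="node"}` example are assumptions for illustration. Only `current_env` and the `env` label come from the suggestion above. How the marker gets published (a small exporter, a file scraped by Alloy, and so on) is left open.

```python
def render_marker(active_env: str, envs=("prod", "dr"), name: str = "current_env") -> str:
    """Render one gauge sample per environment in Prometheus text exposition format.

    Assumed convention: 1 means "currently active", 0 means "standby".
    """
    lines = [f"# TYPE {name} gauge"]
    for env in envs:
        value = 1 if env == active_env else 0
        lines.append(f'{name}{{env="{env}"}} {value}')
    return "\n".join(lines) + "\n"


def gate_by_active_env(expr: str, name: str = "current_env") -> str:
    """Wrap a PromQL expression so only series from the active env survive.

    `max by (env)` collapses duplicate marker series to one per env, `== 1`
    drops the standby env, and multiplying by 1 leaves the values unchanged.
    `group_left()` is written with explicit empty parentheses so the marker
    expression is not parsed as its label list.
    """
    marker = f"(max by (env) ({name}) == 1)"
    return f"({expr}) * on (env) group_left() {marker}"


if __name__ == "__main__":
    print(render_marker("dr"))
    print(gate_by_active_env('up{job="node"}'))
```

Because the gated query keeps the left-hand labels, a panel built this way still shows env and provider, so you can see which site the data came from.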

Another way is to have two queries in each panel, one per environment. There is a catch: if the time range selected in the dashboard is wide enough to cover a failover and a failback (OCI was active, then switched to GCP, and then back to OCI), you will have data from both envs in the same panel.
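
For comparison, the two-queries-per-panel approach could be sketched like this. The `selector` and `panel_queries` helpers and the `node_load1` metric are only examples, and the label values are taken from the tagging strategy in the question. Each panel gets one query pinned to each site.

```python
def selector(metric: str, **labels: str) -> str:
    """Render a PromQL instant-vector selector with exact-match label matchers."""
    matchers = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{matchers}}}"


def panel_queries(metric: str) -> dict:
    """Return one query per site, keyed by the panel query ref (A and B)."""
    return {
        "A": selector(metric, env="prod", provider="OCI"),
        "B": selector(metric, env="dr", provider="GCP"),
    }


if __name__ == "__main__":
    for ref, query in panel_queries("node_load1").items():
        print(ref, query)
```

Nothing here knows which site is active, which is exactly why a wide time range shows both.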

Alerting:

The first option for dashboards above will also solve the issue for alerting.

Similarly, you could duplicate your alerts so that you have one set for OCI and one for GCP, but then there is a chance of false positives, because the alert rule cannot know whether it is evaluating the active site or not. The first option seems a better fit and safer.
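
To illustrate how the marker join could carry over to alerting, here is a hedged sketch that builds a Prometheus-style alerting rule as a plain dictionary and prints it as JSON. The function name `gated_alert_rule`, the metric `cpu_usage_ratio`, the threshold, and the `severity` label are all hypothetical. The rule fields (`alert`, `expr`, `for`, `labels`, `annotations`) follow the Prometheus rule format, so adapt them to however your alert rules are actually managed.

```python
import json

# Join against the assumed current_env marker: only the env whose marker is 1 survives.
MARKER_JOIN = "* on (env) group_left() (max by (env) (current_env) == 1)"


def gated_alert_rule(name: str, expr: str, threshold: float, hold: str = "5m") -> dict:
    """Build a rule that can only fire for the environment that is currently active."""
    return {
        "alert": name,
        "expr": f"(({expr}) {MARKER_JOIN}) > {threshold}",
        "for": hold,
        "labels": {"severity": "critical"},  # hypothetical routing label
        "annotations": {"summary": name + " on active env {{ $labels.env }}"},
    }


if __name__ == "__main__":
    rule = gated_alert_rule("HighCpu", "avg by (env) (cpu_usage_ratio)", 0.9)
    print(json.dumps(rule, indent=2))
```

With a single gated rule there is nothing to silence or re-enable at failover: the standby site simply produces no series for the rule to evaluate.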

Cross-Cloud Correlation:

Can you explain a bit more on what you want to achieve?

Data persistence:

Based on your retention policies, the above “option 1” scenarios will keep data from both environments continuously.

Hope this helps a bit and sheds some light for more ideas. If you have further questions or things to discuss, please continue in the thread.

Suleyman Kutlu (a.k.a. SNK)

And just saw this was your first question. Welcome to the Grafana community :tada:

Hi @snk, thanks for responding to my query. I appreciate your response.

Since I’m new to Grafana Cloud, I’m still a bit confused about how to apply your suggestion.

Coming from Dynatrace, we’re used to a very straightforward setup with two entirely separate environments for Prod and DR—each with its own dashboards and alerts. Can we achieve a similar separation in Grafana?

If your suggested approach is the best way forward, could you walk me through the steps to implement it across two VMs and two Kubernetes clusters (one for Prod, one for DR)? Any guidance would be hugely appreciated!

Hi @skhamitkar

It depends on how you want to implement observability for your Prod and DR environments, independent of the tool.

  1. Do you have active telemetry generation from your DR in a steady / normal state (where Prod is active and running normally)?

  2. Do you want to have a separate observability tool environment for each of Prod and DR, such that:

  • Prod environment sends telemetry to the Prod observability tool instance
  • DR environment sends telemetry to the DR observability tool instance

Now, if the observability tool is “Grafana Cloud”, then yes, you can have two instances (or more) in your Grafana Cloud account. You just need to pay attention to which cloud provider / region you select for your DR instance (better not the same as your Prod instance :slight_smile: )

Assuming those two questions are applicable for your case (DR environment generates telemetry data in steady state and you prefer to have two Grafana Cloud instances) then what you need to do is:

  • configure observability agents (Alloy for VMs and k8s-monitoring-helm for Kubernetes clusters) to send telemetry to the respective instance
  • have two sets of dashboards / alerts: one deployed to the Prod instance and one to the DR instance, with proper data source adjustments

The side effect of this approach: it may increase your telemetry cost, depending on how much telemetry data DR generates during steady state.

My personal opinion: this approach also (kind of) implements DR for your SaaS observability tool, Grafana Cloud, which I think is “overkill”, if I may say. It is SaaS, and we should trust the provider’s ability to maintain continuity of service. We need to focus on our own DR plans.

I would go with single Grafana Cloud instance approach.

Please think about this and share your thoughts. Then we can move on to the further bits and pieces.

  1. Do you have active telemetry generation from your DR in a steady / normal state (where Prod is active and running normally)? Yes, we are running the Alloy collector on both prod and DR.

  2. Do you want to have a separate observability tool environment to collect telemetry from both Prod and DR environment as such → No. Why separate observability if Grafana can collect both?

Simple requirement:
Run Alloy on both envs

Collect telemetry data only for prod while prod is active

Collect telemetry data only for DR while DR is active. That’s it. This is for cost optimisation.

If you have Alloy running on the DR env and there is nothing running on DR (standby DR), then nothing will be sent. When your DR is activated, it will start sending data.

Do you want to have a separate observability tool environment to collect telemetry from both Prod and DR environment as such → No. Why separate observability if Grafana can collect both?

From your initial message about Dynatrace, I understood that you wanted to keep a similar setup. My bad, assumptions are not good at all :slight_smile:

You definitely don’t need two instances. For cost optimization, it is on you to make sure nothing from the DR site sends data while DR is not active…

All your alerts / dashboards will work as is, whether DR is active or not.

Yes, alerts and dashboards will work for both.

I need step-by-step implementation details.

Well it is not easy to give you step by step details, as no one here is aware of your setup, your infra, your applications, your alert setup, your dashboard setup, …

If you can provide details of your alerting and dashboard setup, along with some sample queries, the community can guide you on how to make it work for both envs, but “step by step implementation details” is quite a big ask on its own :slight_smile: