Best practices for alerting with Kubernetes and Grafana Cloud LGTM stack

Hello :wave:

For our workloads/apps running on Kubernetes, we use Grafana Agent to export metrics & logs to Grafana Cloud. We use both the Grafana Agent Operator and its custom resources such as GrafanaAgent, MetricsInstance, etc.

How do you manage microservice alerting with IaC? What are the best practices? The only things I can find online are about Terraform or ClickOps.

We deploy our services with Helm charts. Ideally, I would like an easy way to add the alerting configuration to the Helm chart, and to have a global config in the cloud alerting that says:
check the app label of the alert, and route the alert to the Slack channel #alerts-<$app>. Or use a team label to route alerts per team.
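
To make the intended routing rule concrete, here is a minimal sketch of the label-to-channel mapping. The function name, the `route_by` parameter and the `#alerts-unrouted` fallback are hypothetical and only illustrate the idea; in practice this logic would live in the alerting configuration rather than in application code.

```python
def slack_channel(labels, route_by="app"):
    """Return the Slack channel an alert should go to, based on one label.

    `route_by` is the label used for routing ("app" or "team").
    The fallback channel name is a placeholder, not an existing convention.
    """
    value = labels.get(route_by)
    if not value:
        # No usable label on the alert: send it to a catch-all channel.
        return "#alerts-unrouted"
    return f"#alerts-{value}"


if __name__ == "__main__":
    alert_labels = {"app": "checkout", "team": "payments"}
    print(slack_channel(alert_labels))                   # routed per app
    print(slack_channel(alert_labels, route_by="team"))  # routed per team
```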

In a previous setup, I shipped PrometheusRules alongside each microservice to keep them close to the app. Each alert carried a label with the app name and the team to alert. Prometheus & AlertManager were deployed inside the cluster itself.

But with Grafana Agent, PrometheusRule resources are not supported, and I don’t want to “self-host” anything as we use & pay for the Cloud offering.

How do you deal with alerting in this case?

  • Do you have a separate Terraform config that handles only Grafana Cloud alerting? (I would like to keep it K8s native, to avoid splitting the tooling and to keep the alert config close to the app)

  • Deploy your own AlertManager & Prometheus instance? (Then we lose the benefit of a managed Prometheus, and we don’t leverage the Cloud offering)

  • Only do ClickOps through the cloud UI? (I want to keep everything IaC, so it is not an option)

Am I missing a clear and obvious way of dealing with alerting?

Ahoi!

Prometheus/Mimir evaluates rules server-side. As we cannot “look into” your cluster, PrometheusRule resources that live there are not picked up in the setup you described.

The only way to install recording/alerting rules into Grafana Cloud is by talking to the ruler API directly. As you have correctly discovered, this can be done with Grafana or Terraform. Neither works well for your use case, as you’ve already explained.

But there is another way! Using mimirtool, you can interact with the ruler by specifying your rule definitions as YAML files. That way, you can keep all the code in one place and don’t have to onboard people to Terraform. To apply this, the only additional step is to add the mimirtool rules sync command to your deployment pipeline.
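
As a rough illustration of that pipeline step, the sketch below renders a rule file for one service and builds the mimirtool command that would sync it. The service name, team, metric expression and file path are placeholders, and the helper names are hypothetical. It assumes the ruler address, tenant ID and API key are provided to mimirtool through its MIMIR_ADDRESS, MIMIR_TENANT_ID and MIMIR_API_KEY environment variables.

```python
from string import Template

# Placeholder rule file in the namespace/groups layout that mimirtool reads.
# The alert name, expression and severity are illustrative only.
RULE_TEMPLATE = Template("""\
namespace: $app
groups:
  - name: $app-alerts
    rules:
      - alert: TargetDown
        expr: up{app="$app"} == 0
        for: 5m
        labels:
          app: $app
          team: $team
          severity: critical
        annotations:
          summary: "$app has no healthy targets"
""")


def render_rule_file(app, team):
    """Return the YAML text of the rule file for one service."""
    return RULE_TEMPLATE.substitute(app=app, team=team)


def sync_command(rule_path):
    """Return the argv for syncing one rule file with the ruler.

    Credentials are expected in the environment, so they stay out of argv.
    """
    return ["mimirtool", "rules", "sync", rule_path]


if __name__ == "__main__":
    print(render_rule_file("checkout", "payments"))
    # In a pipeline, the command could be run with subprocess.run(cmd, check=True).
    print(" ".join(sync_command("rules/checkout.yaml")))
```

The app and team labels on each rule are what a label-based notification route would match on. Since sync reconciles the ruler with the files it is given, it is worth checking how it treats rule namespaces that are not part of the run before wiring it into a per-service pipeline.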

Here’s more Cloud-specific documentation on this topic: https://grafana.com/docs/grafana/v10.0/alerting/set-up/set-up-cloud/

Hope this helps!

Hey,

Thanks for the detailed answer!

Do you know if, in the near future, Grafana Agent in flow mode will be able to read PrometheusRule resources from the Prometheus CRDs and upload them to the ruler directly?

As the agent is mostly focused on telemetry data, support for this use case won’t be implemented there.

We’re currently looking into offering a way to support this use case, but it’ll probably be a separate service/operator taking care of this - I’ll keep you posted!

Thanks, looking forward to it :slight_smile: