# App ELB Healthy Host Count Alert - Critical
rule {
  name = "App ELB Healthy Host Count Critical"
  condition = "B"
  annotations = {
    summary = "App ELB has critically low healthy host count"
    description = "**Load Balancer:** {{ $labels.LoadBalancer }}\n**Target Group:** {{ $labels.TargetGroup }}\n\nCritical: No healthy instances available. Immediate investigation required."
    runbook_url = local.runbook_url
  }
  labels = {
    severity = "critical"
    service = "app-elb"
    environment = var.environment
    team = "platform"
  }
  # Query A: CloudWatch HealthyHostCount for every load balancer and target group
  data {
    ref_id = "A"
    datasource_uid = local.cloudwatch_uid
    relative_time_range {
      from = 600 # 10 minutes
      to = 0
    }
    model = jsonencode({
      expression = ""
      id = ""
      matchExact = false
      metricName = "HealthyHostCount"
      namespace = "AWS/ApplicationELB"
      period = "300"
      refId = "A"
      region = var.aws_region
      statistics = ["Average"]
      dimensions = {
        LoadBalancer = "*"
        TargetGroup = "*"
      }
      returnData = true
      maxDataPoints = 100
    })
  }
  # Expression B: classic condition on the last value of A
  data {
    ref_id = "B"
    datasource_uid = "__expr__"
    relative_time_range {
      from = 0
      to = 0
    }
    model = jsonencode({
      conditions = [
        {
          evaluator = {
            params = [local.app_elb_healthy_host_critical_threshold]
            type = "lt"
          }
          operator = {
            type = "and"
          }
          query = {
            params = ["A"]
          }
          reducer = {
            params = []
            type = "last"
          }
          type = "query"
        }
      ]
      datasource = {
        name = "Expression"
        type = "__expr__"
        uid = "__expr__"
      }
      expression = ""
      hide = false
      intervalMs = 1000
      maxDataPoints = 43200
      reducer = "last"
      refId = "B"
      type = "classic_conditions"
    })
  }
  no_data_state = "NoData"
  exec_err_state = "Alerting"
}
I use a wildcard dimension for both Load Balancer and Target Group. The problem is that the label values are not displayed at all in the alert. I am on Grafana 10.4.1 (Amazon Managed Grafana) and manage the alert rules through Terraform.
Is there a way to get the exact values for Load Balancer and Target Group? Could someone send me links to documentation on how to improve the alerting and labels? Thank you.
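One way to check which labels actually reach the alert is to print all of them in an annotation. The snippet below is only a sketch: `debug_annotations` is a hypothetical local name, and it assumes the standard Go template `range` action over Grafana's `$labels` map in annotation templates.

```hcl
locals {
  # Hypothetical helper, not part of the original rule.
  # The description lists every label attached to the alert instance,
  # which shows whether LoadBalancer and TargetGroup are present at all.
  debug_annotations = {
    summary     = "App ELB has critically low healthy host count"
    description = "Labels on this alert instance: {{ range $k, $v := $labels }}{{ $k }}={{ $v }} {{ end }}"
  }
}
```

Inside the rule block this would be used as `annotations = local.debug_annotations`. If the output lists no `LoadBalancer` or `TargetGroup` entry, the labels are being lost before the annotation template is rendered, not inside it.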
I am trying to change the classic condition to a reduce expression, as suggested by Grot. I will update here if that resolves the issue.
Thank you, Jangaraj, I will try that.
I created the alert rule manually in the Grafana UI and exported it as HCL, but when I apply the same exported code with Terraform I get some errors. Is there a documentation link that can help with the Terraform side? Thank you.
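For context, a classic condition evaluates all series together and produces a single alert instance, so per-series labels such as `LoadBalancer` and `TargetGroup` do not reach the templates. A reduce expression followed by a threshold expression keeps one alert instance per series. The fragment below is a minimal sketch of those two expressions, assuming the reduced model fields shown are enough for the provider; it belongs inside the `rule` block after query `A`, with `condition = "C"`.

```hcl
# Fragment only: place inside the rule block, after the CloudWatch query "A".

# B: reduce each series returned by A to its last value (labels are kept).
data {
  ref_id         = "B"
  datasource_uid = "__expr__"
  relative_time_range {
    from = 0
    to   = 0
  }
  model = jsonencode({
    type       = "reduce"
    expression = "A"
    reducer    = "last"
    refId      = "B"
  })
}

# C: fire for every series whose reduced value is below the threshold.
data {
  ref_id         = "C"
  datasource_uid = "__expr__"
  relative_time_range {
    from = 0
    to   = 0
  }
  model = jsonencode({
    type       = "threshold"
    expression = "B"
    refId      = "C"
    conditions = [
      {
        evaluator = {
          type   = "lt"
          params = [local.app_elb_healthy_host_critical_threshold]
        }
      }
    ]
  })
}
```

With this layout, `{{ $labels.LoadBalancer }}` and `{{ $values.B.Value }}` refer to the specific series that crossed the threshold.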
vanditaakaul:
I get some errors,
Always be specific: what exactly are those errors, what does the alert query you developed look like in the UI, and what do you expect…
Apologies for not being specific. The code exported from the Grafana UI needs a few more fields before Terraform accepts it:
# App ELB Healthy Host Count Alert - Critical
rule {
  name = "App ELB Healthy Host Count Critical"
  condition = "C"
  for = "5m"
  annotations = {
    summary = "App ELB has critically low healthy host count"
    description = "**Load Balancer:** {{ $labels.LoadBalancer }}\n**Target Group:** {{ $labels.TargetGroup }}\n**Current Healthy Host Count:** {{ $values.B.Value }}\n\nCritical: No healthy instances available. Immediate investigation required."
    runbook_url = local.runbook_url
  }
  labels = {
    severity = "critical"
    service = "app-elb"
    environment = var.environment
    team = "platform"
  }
  # Query A: CloudWatch HealthyHostCount for every load balancer and target group
  data {
    ref_id = "A"
    datasource_uid = local.cloudwatch_uid
    relative_time_range {
      from = 300 # 5 minutes
      to = 0
    }
    model = jsonencode({
      expression = ""
      id = ""
      matchExact = false
      metricName = "HealthyHostCount"
      namespace = "AWS/ApplicationELB"
      period = "300"
      refId = "A"
      region = var.aws_region
      statistics = ["Minimum"]
      dimensions = {
        LoadBalancer = "*"
        TargetGroup = "*"
      }
      intervalMs = 1000
      maxDataPoints = 43200
      metricEditorMode = 0
      metricQueryType = 0
      queryMode = "Metrics"
    })
  }
  # Expression B: reduce each series from A to its last value
  data {
    ref_id = "B"
    datasource_uid = "__expr__"
    relative_time_range {
      from = 0
      to = 0
    }
    model = jsonencode({
      conditions = []
      datasource = {
        name = "Expression"
        type = "__expr__"
        uid = "__expr__"
      }
      expression = "A"
      hide = false
      intervalMs = 1000
      maxDataPoints = 43200
      refId = "B"
      type = "reduce"
      reducer = "last"
      settings = {
        # HCL rejects a repeated key, so exactly one mode is set here.
        # "dropNN" drops non-numeric values; "replaceNN" together with
        # replaceWithValue would substitute a value instead.
        mode = "dropNN"
      }
    })
  }
  # Expression C: threshold on the reduced value from B
  data {
    ref_id = "C"
    datasource_uid = "__expr__"
    relative_time_range {
      from = 300 # 5 minutes
      to = 0
    }
    model = jsonencode({
      conditions = [
        {
          evaluator = {
            params = [local.app_elb_healthy_host_critical_threshold]
            type = "lt"
          }
          operator = {
            type = "and"
          }
          query = {
            params = ["B"]
          }
          type = "query"
        }
      ]
      datasource = {
        name = "Expression"
        type = "__expr__"
        uid = "__expr__"
      }
      expression = "B"
      hide = false
      intervalMs = 1000
      maxDataPoints = 43200
      refId = "C"
      type = "threshold"
    })
  }
  no_data_state = "NoData"
  exec_err_state = "Alerting"
}
}
When I run terraform apply, I still see errors in Grafana.
As I said, develop the alert in the UI first and then move it to Terraform.
Thanks for your assistance. The Terraform code in my previous post is what the Grafana UI generated, but I had to make a few changes to it. As I mentioned, the Grafana provider for Terraform could be the cause of this. I will figure out a way. Thanks again.
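For reference, a `rule` block is not a top-level Terraform construct; in the Grafana provider it sits inside a `grafana_rule_group` resource, which carries its own required fields. The outline below is a sketch under assumptions: the folder title, group name, resource labels, and the 60-second interval are hypothetical, and the `data` blocks are elided, so it will not validate until queries A, B and C are pasted in.

```hcl
# Hypothetical folder that holds the rule group.
resource "grafana_folder" "alerts" {
  title = "Platform Alerts"
}

resource "grafana_rule_group" "app_elb" {
  name             = "app-elb-alerts"           # hypothetical group name
  folder_uid       = grafana_folder.alerts.uid  # rule groups live in a folder
  interval_seconds = 60                         # assumed evaluation interval

  rule {
    name           = "App ELB Healthy Host Count Critical"
    condition      = "C"
    for            = "5m"
    no_data_state  = "NoData"
    exec_err_state = "Alerting"

    # The data blocks for A (CloudWatch query), B (reduce) and
    # C (threshold) go here; at least one data block is required.
  }
}
```

Comparing this outline with the HCL exported from the UI can help spot which group-level field is missing when `terraform apply` reports an error.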