I have a set of applications which are being monitored by Prometheus. Below are sample time series of the metrics:
fastapi_responses_total{app_name="orion-gen-proj_managed-cus-test-1_checkout-api", job="app-b", method="GET", path="/metrics", status_code="200"}
fastapi_responses_total{app_name="orion-gen-proj_managed-cus-test-1_registration-api", job="app-a", method="GET", path="/metrics", status_code="200"}
These metrics are captured from two different applications. The application names are represented by the substrings checkout-api and registration-api. The applications run on an organisation entity, which is represented by the substring managed-cus-test-1. The name of the organisation that an application belongs to always starts with the string managed- but can have any value after it, e.g. managed-cus-test-1, managed-cus-test-2, managed-cus-test-3.
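To make the naming convention concrete, here is a minimal Python sketch of how I read the app_name label. The helper `split_app_name` and the field name `prefix` are my own hypothetical names, and the sketch assumes the three parts are always separated by underscores and never contain an underscore themselves.

```python
from typing import NamedTuple


class AppName(NamedTuple):
    prefix: str  # hypothetical field name for the leading part, e.g. "orion-gen-proj"
    org: str     # organisation, always starts with "managed-"
    app: str     # application name, e.g. "checkout-api"


def split_app_name(app_name: str) -> AppName:
    """Split '<prefix>_<org>_<app>' into its three parts."""
    parts = app_name.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected app_name layout: {app_name!r}")
    prefix, org, app = parts
    if not org.startswith("managed-"):
        raise ValueError(f"organisation does not start with 'managed-': {org!r}")
    return AppName(prefix, org, app)


if __name__ == "__main__":
    for name in (
        "orion-gen-proj_managed-cus-test-1_checkout-api",
        "orion-gen-proj_managed-cus-test-1_registration-api",
    ):
        print(split_app_name(name))
```

Both sample series resolve to the same organisation, managed-cus-test-1, which is the value I want to group on.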
To calculate the availability SLO for these applications I have prepared the following set of recording rules:
groups:
  - name: registration-availability
    rules:
      - record: slo:sli_error:ratio_rate5m
        expr: |-
          (avg_over_time( ( (
            (sum(rate(
              fastapi_responses_total{
                app_name=~".*registration-api.*",
                status_code!~"5.."}[5m]
            ))
            )
            / on(app_name) group_right()
            (sum(rate(
              fastapi_responses_total{
                app_name=~".*registration-api.*"}[5m])
            ))
          ) OR on() vector(0))[5m:60s])
          )
        labels:
          slo_id: registration-availability
          slo_service: registration
          slo: availability
  - name: checkout-availability
    rules:
      - record: slo:sli_error:ratio_rate5m
        expr: |-
          (avg_over_time( ( (
            (sum(rate(
              fastapi_responses_total{
                app_name=~".*checkout-api.*",
                status_code!~"5.."}[5m]
            ))
            )
            / on(app_name) group_right()
            (sum(rate(
              fastapi_responses_total{
                app_name=~".*checkout-api.*"}[5m])
            ))
          ) OR on() vector(0))[5m:60s])
          )
        labels:
          slo_id: checkout-availability
          slo_service: checkout
          slo: availability
The recording rules are evaluating correctly and they return two different SLO values, one for each of the applications. I have a requirement to calculate the overall SLO of these two applications. This overall SLO should be based on the organisation to which an application belongs.
For example, because the applications checkout-api and registration-api belong to the same organisation, the SLO calculation should return one consolidated value.
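As a sketch of the arithmetic I have in mind, the Python below combines per-application rates into one value per organisation. The numbers are made up, and it assumes "consolidated" means the ratio of summed non-5xx requests to summed total requests, rather than an average of the per-application ratios. The function name `availability_by_org` is hypothetical.

```python
from collections import defaultdict

# Hypothetical per-application request rates (requests/second), not real measurements.
# "good" stands for non-5xx responses, "total" for all responses.
rates = [
    {"org": "managed-cus-test-1", "app": "checkout-api", "good": 9.0, "total": 10.0},
    {"org": "managed-cus-test-1", "app": "registration-api", "good": 4.0, "total": 5.0},
]


def availability_by_org(rows):
    """Return {org: sum(good) / sum(total)} across all applications of each org."""
    good = defaultdict(float)
    total = defaultdict(float)
    for row in rows:
        good[row["org"]] += row["good"]
        total[row["org"]] += row["total"]
    # Skip organisations without traffic to avoid dividing by zero.
    return {org: good[org] / total[org] for org in total if total[org] > 0}


if __name__ == "__main__":
    print(availability_by_org(rates))
```

With these made-up numbers the two applications collapse into a single entry for managed-cus-test-1 with the value 13/15, roughly 0.867.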
What I want is a label_replace that adds a new label “org” and then does grouping by “org”.
The label_replace should add the new label and preserve the existing filter based on app_name, not replace it.
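To show the label handling I am after, here is a rough Python emulation of label_replace applied to a single label set. It is only a sketch: the Python function is my own stand-in, it supports only the `$1` placeholder, and it uses `re.fullmatch` on the assumption that Prometheus anchors the pattern at both ends. The pattern is the one I use in my attempt.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Rough stand-in for PromQL label_replace on one label set (supports only $1)."""
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        # No match: the labels are returned unchanged and no new label is added.
        return dict(labels)
    out = dict(labels)  # keep every existing label, including the source label
    out[dst] = match.expand(replacement.replace("$1", r"\g<1>"))
    return out


if __name__ == "__main__":
    series = {
        "app_name": "orion-gen-proj_managed-cus-test-1_registration-api",
        "job": "app-a",
    }
    print(label_replace(series, "org", "$1", "app_name", r"[^_]*_([^_]*).*"))
```

The result keeps app_name as it was and gains org="managed-cus-test-1", so the existing app_name filter can stay in place while the grouping happens on org.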
This is my attempt at that, but it's not working as expected:
groups:
  - name: registration-availability
    rules:
      - record: slo:sli_error:ratio_rate5m
        expr: |-
          avg_over_time(
            (
              sum by (org) (
                label_replace(
                  sum(rate(fastapi_responses_total{app_name=~".*registration-api.*", status_code!~"5.."}[5m])),
                  "org", "$1", "app_name", "[^_]*_([^_]*).*"
                )
                / on(org) group_right()
                sum(rate(fastapi_responses_total{app_name=~".*registration-api.*"}[5m]))
              )
              OR on() vector(0)
            )[5m:5m]
          )
        labels:
          slo_id: registration-availability
          slo_service: registration
          slo: availability
          version: "7"
      - record: aps:slo:info
        expr: vector(1)
        labels:
          slo_id: registration-availability
          slo_service: registration
          slo: availability
  - name: checkout-availability
    rules:
      - record: slo:sli_error:ratio_rate5m
        expr: |-
          avg_over_time(
            (
              sum by (org) (
                label_replace(
                  sum(rate(fastapi_responses_total{app_name=~".*checkout-api.*", status_code!~"5.."}[5m])),
                  "org", "$1", "app_name", "[^_]*_([^_]*).*"
                )
                / on(org) group_right()
                sum(rate(fastapi_responses_total{app_name=~".*checkout-api.*"}[5m]))
              )
              OR on() vector(0)
            )[5m:5m]
          )
        labels:
          slo_id: checkout-availability
          slo_service: checkout
          slo: availability
          version: "7"
      - record: aps:slo:info
        expr: vector(1)
        labels:
          slo_id: checkout-availability
          slo_service: checkout
          slo: availability
With this change I would expect to see one value in the StatPanel, but the panel is repeating. I have executed a couple of requests to both applications. The recording rules appear to be aggregating by organisation, but I am not sure why the panel repeats.
Is the label_replace fitting in correctly with the rest of the logic?
My expectation is to get something similar to this question
The dashboard is a simple StatPanel with Repeat not enabled. By default it should render just one panel for the aggregated value as per the organisation. The dashboard is designed as below: