-
What Grafana version and what operating system are you using?
Currently running Beyla - 1.8.8
Prometheus - 2.55.1
-
What are you trying to achieve?
We enabled Beyla in our development environment as a proof of concept. The goal is to have Beyla, and eventually Tempo, running in this environment to showcase the observability story and then build it out across all of our clusters and AWS accounts.
-
How are you trying to achieve it?
By installing with the available Helm charts.
-
What happened?
Our Prometheus pod, installed via the prometheus-community Helm chart, typically sits around 3.6 GB of memory utilization in this environment. When Beyla gets enabled, the cardinality on some metrics spikes dramatically over a short period (minutes) and memory consumption rises until the pod gets OOM killed. Looking at the “tsdb-status” page, the cardinality top 10 shows one label (server_port) with 30k+ values while all the rest are at 5k or below. Looking at the job itself I do not see that label on the job, which makes it seem like it sits directly on the metrics themselves. I tried dropping it with a metric relabel rule, but that did not seem to work.
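For reference, a plain PromQL query along these lines (a sketch; the only thing assumed is the server_port label name taken from the tsdb-status page) shows which metric names carry that label and how many distinct values each one has:
# top 10 metric names by number of distinct server_port values
topk(10, count by (__name__) (count by (__name__, server_port) ({server_port!=""})))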
-
What did you expect to happen?
We expected SOME increase in cardinality and memory consumption, but not more than 24 GB.
-
Can you copy/paste the configuration(s) that you are having problems with?
Not sure what is needed here. I tried adding the following settings to Prometheus, which did not help the situation at all:
max-block-duration: 15m
# min-block-duration: 15m
# head-chunks-write-queue-size: 500
# max-series-per-shard: 2500
# max-series: 10000
# max-bytes-to-drop: 5368709120
# wal-compression: true
# head-chunks-limit: 50000
# out-of-order-time-window: 5m
# max-exemplars: 50
# no-lockfile: true
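For context, flags like the block-duration and WAL-compression settings above would normally be passed through the chart values, roughly as in the sketch below (assuming the plain prometheus-community/prometheus chart; the server.extraFlags / server.extraArgs keys and the exact flag names should be double-checked against the chart and Prometheus version in use):
server:
  extraFlags:
    - storage.tsdb.wal-compression          # boolean flags
  extraArgs:
    storage.tsdb.min-block-duration: 15m    # value-carrying flags
    storage.tsdb.max-block-duration: 15m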
-
Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
No errors, just the pods getting OOM killed.
-
Did you follow any online instructions? If so, what is the URL?
A co-worker went through some AI suggestions and the Beyla documentation before doing the initial setup.
So the ask is to help drop the server_port label properly and/or reduce the overall memory footprint, so that Beyla does not cause Prometheus to become unstable.
Thanks for the response, but that document points to only a few attributes enabled by default. Specifically:
" By default, only the following attributes are reported: k8s.src.owner.name
, k8s.src.namespace
, k8s.dst.owner.name
, k8s.dst.namespace
, and k8s.cluster.name
."
None of these is server_port, which is the label with the high cardinality. This is, again, applied at the metric level and not at the pod level coming from the kubernetes-pods job in Prometheus.
So, if I am reading the provided documentation correctly, the label should already be excluded as an attribute from the metrics Beyla provides, but it isn't? What steps can be taken here to verify and correct this?
I read through that document and can see that the network.flow.bytes metric is the one that has the server_port attribute. I have tried adding drop rules both on the scrape job and on the export side:
prometheus_export:
  port: 9090
  path: /metrics
  metric_relabel_configs:
    - action: labelkeep
      regex: ^(?!server_port$).*$
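      # note: this pattern relies on a negative lookahead, which the RE2 engine used
      # for Prometheus-style relabeling does not support, so it cannot match as written;
      # an action: labeldrop with regex: server_port is the usual way to express
      # "remove just this one label"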
and in Prometheus:
- honor_labels: true
  job_name: kubernetes-pods
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - action: drop
      regex: 'network_flow_bytes'
      source_labels: ['__name__']
    - action: labeldrop
      regex: 'server_port'
    - action: keep
      regex: true
      source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
Neither of which prevented the cardinality spike and the OOM kills.
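One thing I am not sure about is whether the labeldrop has to live in metric_relabel_configs instead, since relabel_configs appears to operate on target/service-discovery labels before the scrape while server_port only exists on the scraped metrics. A minimal sketch of that variant under the same job would be:
  metric_relabel_configs:
    - action: labeldrop
      regex: server_port
      # caveat: if two series differ only by server_port, dropping the label
      # makes them collide and Prometheus may reject them as duplicates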
Also, apologies for the forum parsing making the configs look weird; just assume they are properly indented and have the correct number of underscore characters.
If I am missing something obvious, feel free to point it out. I am just trying to get this to be stable and not consume 30 GB of RAM in one pod.
Why are you dropping on the Prometheus side when you can specify which attributes are generated on the Beyla side? From the already linked Beyla doc:
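If I remember the doc correctly, it is the attributes / select section, roughly like this (a sketch only; whether the metric key needs the beyla_ prefix or the dotted form, and the exact attribute names, should be verified against the doc and the source for your Beyla version):
attributes:
  select:
    beyla_network_flow_bytes:
      include:
        - k8s.src.owner.name
        - k8s.src.namespace
        - k8s.dst.owner.name
        - k8s.dst.namespace
        - k8s.cluster.name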
Sorry for the delayed reply; it was a holiday in the US. With that said, the reason we were trying to drop it on the Prometheus side (and I apologize for not being more clear) is that the above configuration had already been attempted in a couple of different ways unsuccessfully: attempting to turn it off in Beyla, in the exporter, and also in Prometheus itself.
However, in the interest of leaving no stone unturned, we used the above verbatim and re-created the Beyla pods/infrastructure.
The result was that server_port STILL shows up on the metrics. From the documentation I only saw it associated with network_flow_bytes, but in the metric output I also see it on http_client_request_body_size_bytes_bucket.
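If that is the case, presumably the select section would have to cover the HTTP metrics as well, something like the following (untested; I do not know whether an exclude list is accepted here or whether the metric key needs a prefix or suffix, so treat it purely as a sketch):
attributes:
  select:
    http_client_request_body_size:
      exclude:
        - server.port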
If we take a step back, there are really two issues at play:
1. Beyla causes high cardinality and extreme memory usage in its default configuration
2. This one metric in particular has exceedingly high cardinality
I have made an assumption that 2. is the cause of 1.
It is possible they have very little to do with each other.
So we should do a couple of things in this thread, if possible:
- Identify the amount of memory that would be appropriate or optimal for running a Prometheus pod collecting Beyla metrics
- Identify potential causes and solutions for tuning the memory utilization down to a normal level, so that the infrastructure can be run successfully regardless of which environment I am running it in.
It is an experimental feature - it can be changed anytime:
Don't trust the doc alone - check the source code of your version when you want to be 100% sure.
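For example, something like this is enough to see where a given release defines the attribute (a sketch; the v1.8.8 tag name is an assumption based on the version you mentioned, and the grep pattern is only a starting point since the attribute naming in the source may differ):
# clone the exact release and search for the attribute definition
git clone --depth 1 --branch v1.8.8 https://github.com/grafana/beyla
grep -rn --include='*.go' 'server.port' beyla | head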
1. Beyla causes high cardinality and extreme memory usage in its default configuration
2. This one metric in particular has exceedingly high cardinality
But that’s mentioned in the doc:
And you still have the option to tweak the Beyla config for your needs.
Another option is to tweak your TSDB storage so it doesn't have a problem with high cardinality: