Log Processing with k8s-monitoring Helm Chart: Configuration Issues and Best Practices

Following up on this community discussion, which worked well for me, I’m now trying to implement a similar configuration with the new version of the k8s-monitoring Helm chart, deployed via OpenTofu.

I’m encountering some issues with my current approach and would appreciate guidance on the best way to implement this configuration.

Current Implementation

OpenTofu Resource Configuration
# All values here: https://github.com/grafana/k8s-monitoring-helm/blob/main/charts/k8s-monitoring/values.yaml
resource "helm_release" "grafana-k8s-monitoring" {
  name       = "grafana-k8s-monitoring"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "k8s-monitoring"
  # same namespace for the time being
  atomic    = true
  timeout   = 300
  version   = var.versions.k8s_monitoring

  # alloy-logs configuration
  set {
    name  = "alloy-logs.enabled"
    value = "true"
  }

  # Logs destination (Loki)
  set {
    name  = "destinations[1].name"
    value = "logs-service"
  }

  set {
    name  = "destinations[1].type"
    value = "loki"
  }

  set {
    name  = "destinations[1].url"
    value = "${local.apis.loki.host}/loki/api/v1/push"
  }

  set {
    name  = "destinations[1].auth.type"
    value = "basic"
  }

  set_sensitive {
    name  = "destinations[1].auth.username"
    value = local.apis.loki.username
  }

  set_sensitive {
    name  = "destinations[1].auth.password"
    value = local.apis.loki.password
  }


  # alloy-singleton configuration
  set {
    name  = "alloy-singleton.enabled"
    value = "true"
  }

  # Load the Alloy extraConfig from a file
  set {
    name  = "alloy-logs.extraConfig"
    value = file("${path.module}/files/task-analysis-alloy.tpl")
  }

  # Enable node logs
  set {
    name  = "nodeLogs.enabled"
    value = "true"
  }

  # Enable cluster events logs
  set {
    name  = "clusterEvents.enabled"
    value = "true"
  }

  # Enable pod logs
  set {
    name  = "podLogs.enabled"
    value = "true"
  }
}
Alloy Configuration Template (task-analysis-alloy.tpl)
// Extract container name from __meta_docker_container_name label and add as label
discovery.relabel "task_analysis" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    regex         = "engine-task-controller-.*"
    action        = "keep"
  }

  // Ensure the "task_analysis" label is added if it doesn't exist
  rule {
    action        = "replace"
    target_label  = "action"
    replacement   = "task_analysis"
  }

  // Ensure the "cluster" label is added if it doesn't exist
  rule {
    action        = "replace"
    target_label  = "cluster"
    replacement   = env("CLUSTER_NAME")
  }
}

loki.source.kubernetes "engine_task_controller_analysis" {
  targets    = discovery.relabel.task_analysis.output
  forward_to = [loki.process.task_analysis_json_extraction.receiver]
}

loki.process "task_analysis_json_extraction" {

    // Parse the JSON first
    stage.json {
        expressions = {
            app = "",
            app_version = "",
            cloud_provider = "",
            duration = "",
            graphical = "",
            namespace = "",
            reconcile_id = "",
            state = "",
            task_id = "",
            region = "",
            ressource_type = "",
        }
    }

    stage.labels {
        values = {
            has_duration = "duration",
            namespace = "namespace",
            app_version = "app_version",
            graphical = "graphical",
            state = "state",
            circle_app = "app",
            cloud_provider = "cloud_provider",
            ressource_type = "ressource_type",
        }
    }

    stage.structured_metadata {
        values = {
          reconcile_id = "reconcile_id",
          task_id = "task_id",
          duration = "duration",
          region = "region",
        }
    }

    // Drop logs where the duration label is empty
    stage.match {
        selector = "{has_duration=\"\"}"
        action = "drop"
        drop_counter_reason = "missing_duration_field"
    }

    stage.label_drop {
        values = [ "has_duration", "duration" ]
    }

    // For logs where the label is empty, add a default value
    stage.match {
        selector = "{cloud_provider=\"\"}"

        // Inside this stage, we only process logs that matched the selector
        stage.static_labels {
            values = {
                "cloud_provider" = "Not assigned",
            }
        }
    }

    stage.timestamp {
          source = "ts"
          format = "RFC3339"
    }

    forward_to = [loki.write.logs_service.receiver]

}

Issues Encountered

When applying this configuration, I’m getting the following error:

[...]
, key " \"duration\" ]\n    }\n\n    // For logs where the label is empty" has no value (cannot end with ,)

The error seems to be related to syntax in the Alloy configuration, but I’m having trouble pinpointing the exact cause. I suspect it might be related to escaping or formatting issues when passing the configuration through OpenTofu.

Questions

  1. Syntax Error: How can I resolve the parsing error? Is there a specific escaping requirement when using alloy-logs.extraConfig with a template file?
  2. Best Practice: Would it be more appropriate to use podLogs.extraLogProcessingStages instead of alloy-logs.extraConfig for this type of log processing configuration? (I’ve sketched what I have in mind right after this list.)
  3. Configuration Approach: Is there a recommended pattern for implementing complex log processing pipelines with the k8s-monitoring helm chart?
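
For question 2, this is roughly the shape I imagine the podLogs.extraLogProcessingStages variant would take. Untested sketch on my side: I’m assuming the value is a raw block of Alloy stage.* snippets that the chart appends to its generated pod-logs pipeline, so the pod filtering would have to happen via a stage.match selector rather than a dedicated discovery.relabel component:

podLogs:
  enabled: true
  extraLogProcessingStages: |
    // Only process logs from the engine-task-controller pods
    stage.match {
      selector = "{pod=~\"engine-task-controller-.*\"}"
      stage.json {
        expressions = {
          duration = "",
          state = "",
        }
      }
      stage.labels {
        values = {
          state = "state",
        }
      }
    }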

Additional Context

  • Using the latest version of the k8s-monitoring helm chart
  • Deploying with OpenTofu (Terraform fork)
  • Goal: Process logs from specific pods, extract JSON fields, and drop logs missing certain fields
  • The previous approach worked, but relied on a hand-written ConfigMap

Any guidance on resolving the syntax error and best practices for this type of configuration would be greatly appreciated! :blush:

I don’t see any obvious error. Can you share the actual configuration mounted into your Alloy pod (or the ConfigMap)?

Thanks @tonyswumac for the suggestion!

You were absolutely right - checking the actual configuration mounted in the pod revealed the issue. Here’s what I discovered:

:magnifying_glass_tilted_left: Problem Diagnosis

Current Observed Behavior

Case 1: With stage.match blocks present

  • Helm chart deployment fails with this error:
key " \"duration\" ]\n    }\n\n    // For logs where the label is empty" has no value (cannot end with ,)

Case 2: Without stage.match blocks

  • Chart deploys successfully
  • BUT pods fail to start with multiple syntax errors:
Error: /etc/alloy/config.alloy:369:21: missing ',' in field list
Error: /etc/alloy/config.alloy:371:8: expected =, got .
Error: /etc/alloy/config.alloy:371:9: cannot use a block as an expression
[... more syntax errors ...]

Generated Configmap Truncation

The generated ConfigMap shows severe truncation of the Alloy configuration. The file abruptly cuts off mid-block:

# What should be a complete stage.json block:
stage.json {
    expressions = {
        app = ""
        app_version = ""
        cloud_provider = ""
        duration = ""
        # ... more fields should be here
    }
}

# Instead, it gets truncated to:
stage.json {
    expressions = {
        app = ""
# Destination: logs-service (loki)  # ← Jumps directly to destination config!
otelcol.exporter.loki "logs_service" {
    forward_to = [loki.write.logs_service.receiver]
}
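
For anyone hitting the same thing, this is how I pulled the rendered configuration for inspection (ConfigMap name and namespace taken from my release, adjust as needed):

kubectl -n default get configmap grafana-k8s-monitoring-alloy-logs \
  -o jsonpath='{.data.config\.alloy}' > rendered-config.alloy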

:wrench: Technical Details

Complete "working" (ie generating a not working either configmap) configuration (without stage.match blocks)
// Extract container name from __meta_docker_container_name label and add as label
discovery.relabel "task_analysis" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    regex         = "engine-task-controller-.*"
    action        = "keep"
  }

  // Ensure the "task_analysis" label is added if it doesn't exist
  rule {
    action        = "replace"
    target_label  = "action"
    replacement   = "task_analysis"
  }

  // Ensure the "cluster" label is added if it doesn't exist
  rule {
    action        = "replace"
    target_label  = "cluster"
    replacement   = env("CLUSTER_NAME")
  }
}

loki.source.kubernetes "engine_task_controller_analysis" {
  targets    = discovery.relabel.task_analysis.output
  forward_to = [loki.process.task_analysis_json_extraction.receiver]
}

loki.process "task_analysis_json_extraction" {

    // Parse the JSON first
    stage.json {
        expressions = {
            app = "",
            app_version = "",
            cloud_provider = "",
            duration = "",
            graphical = "",
            namespace = "",
            reconcile_id = "",
            state = "",
            task_id = "",
            region = "",
            ressource_type = "",
        }
    }

    stage.labels {
        values = {
            has_duration = "duration",
            namespace = "namespace",
            app_version = "app_version",
            graphical = "graphical",
            state = "state",
            circle_app = "app",
            cloud_provider = "cloud_provider",
            ressource_type = "ressource_type",
        }
    }

    stage.structured_metadata {
        values = {
          reconcile_id = "reconcile_id",
          task_id = "task_id",
          duration = "duration",
          region = "region",
        }
    }


    stage.label_drop {
        values = [ "has_duration", "duration" ]
    }

    stage.timestamp {
          source = "ts"
          format = "RFC3339"
    }

    forward_to = [loki.write.logs_service.receiver]

}

Full truncated ConfigMap output
apiVersion: v1
data:
  config.alloy: |-
    // Feature: Node Logs
    declare "node_logs" {
      argument "logs_destinations" {
        comment = "Must be a list of log destinations where collected logs should be forwarded to"
      }

      loki.relabel "journal" {

        // copy all journal labels and make the available to the pipeline stages as labels, there is a label
        // keep defined to filter out unwanted labels, these pipeline labels can be set as structured metadata
        // as well, the following labels are available:
        // - boot_id
        // - cap_effective
        // - cmdline
        // - comm
        // - exe
        // - gid
        // - hostname
        // - machine_id
        // - pid
        // - stream_id
        // - systemd_cgroup
        // - systemd_invocation_id
        // - systemd_slice
        // - systemd_unit
        // - transport
        // - uid
        //
        // More Info: https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html
        rule {
          action = "labelmap"
          regex = "__journal__(.+)"
        }

        rule {
          action = "replace"
          source_labels = ["__journal__systemd_unit"]
          replacement = "$1"
          target_label = "unit"
        }

        // the service_name label will be set automatically in loki if not set, and the unit label
        // will not allow service_name to be set automatically.
        rule {
          action = "replace"
          source_labels = ["__journal__systemd_unit"]
          replacement = "$1"
          target_label = "service_name"
        }

        forward_to = [] // No forward_to is used in this component, the defined rules are used in the loki.source.journal component
      }

      loki.source.journal "worker" {
        path = "/var/log/journal"
        format_as_json = false
        max_age = "8h"
        relabel_rules = loki.relabel.journal.rules
        labels = {
          job = "integrations/kubernetes/journal",
          instance = sys.env("HOSTNAME"),
        }
        forward_to = [loki.process.journal_logs.receiver]
      }

      loki.process "journal_logs" {
        stage.static_labels {
          values = {
            // add a static source label to the logs so they can be differentiated / restricted if necessary
            "source" = "journal",
            // default level to unknown
            level = "unknown",
          }
        }

        // Attempt to determine the log level, most k8s workers are either in logfmt or klog formats
        // check to see if the log line matches the klog format (https://github.com/kubernetes/klog)
        stage.match {
          // unescaped regex: ([IWED][0-9]{4}\s+[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+)
          selector = "{level=\"unknown\"} |~ \"([IWED][0-9]{4}\\\\s+[0-9]{2}:[0-9]{2}:[0-9]{2}\\\\.[0-9]+)\""

          // extract log level, klog uses a single letter code for the level followed by the month and day i.e. I0119
          stage.regex {
            expression = "((?P<level>[A-Z])[0-9])"
          }

          // if the extracted level is I set INFO
          stage.replace {
            source = "level"
            expression = "(I)"
            replace = "INFO"
          }

          // if the extracted level is W set WARN
          stage.replace {
            source = "level"
            expression = "(W)"
            replace = "WARN"
          }

          // if the extracted level is E set ERROR
          stage.replace {
            source = "level"
            expression = "(E)"
            replace = "ERROR"
          }

          // if the extracted level is I set INFO
          stage.replace {
            source = "level"
            expression = "(D)"
            replace = "DEBUG"
          }

          // set the extracted level to be a label
          stage.labels {
            values = {
              level = "",
            }
          }
        }

        // if the level is still unknown, do one last attempt at detecting it based on common levels
        stage.match {
          selector = "{level=\"unknown\"}"

          // unescaped regex: (?i)(?:"(?:level|loglevel|levelname|lvl|levelText|SeverityText)":\s*"|\s*(?:level|loglevel|levelText|lvl)="?|\s+\[?)(?P<level>(DEBUG?|DBG|INFO?(RMATION)?|WA?RN(ING)?|ERR(OR)?|CRI?T(ICAL)?|FATAL|FTL|NOTICE|TRACE|TRC|PANIC|PNC|ALERT|EMERGENCY))("|\s+|-|\s*\])
          stage.regex {
            expression = "(?i)(?:\"(?:level|loglevel|levelname|lvl|levelText|SeverityText)\":\\s*\"|\\s*(?:level|loglevel|levelText|lvl)=\"?|\\s+\\[?)(?P<level>(DEBUG?|DBG|INFO?(RMATION)?|WA?RN(ING)?|ERR(OR)?|CRI?T(ICAL)?|FATAL|FTL|NOTICE|TRACE|TRC|PANIC|PNC|ALERT|EMERGENCY))(\"|\\s+|-|\\s*\\])"
          }

          // set the extracted level to be a label
          stage.labels {
            values = {
              level = "",
            }
          }
        }

        // Only keep the labels that are defined in the `keepLabels` list.
        stage.label_keep {
          values = ["instance","job","level","name","unit","service_name","source"]
        }

        forward_to = argument.logs_destinations.value
      }
    }
    node_logs "feature" {
      logs_destinations = [
        loki.write.logs_service.receiver,
      ]
    }
    // Feature: Pod Logs
    declare "pod_logs" {
      argument "logs_destinations" {
        comment = "Must be a list of log destinations where collected logs should be forwarded to"
      }

      discovery.relabel "filtered_pods" {
        targets = discovery.kubernetes.pods.targets
        rule {
          source_labels = ["__meta_kubernetes_namespace"]
          action = "replace"
          target_label = "namespace"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_name"]
          action = "replace"
          target_label = "pod"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_container_name"]
          action = "replace"
          target_label = "container"
        }
        rule {
          source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_container_name"]
          separator = "/"
          action = "replace"
          replacement = "$1"
          target_label = "job"
        }

        // set the container runtime as a label
        rule {
          action = "replace"
          source_labels = ["__meta_kubernetes_pod_container_id"]
          regex = "^(\\S+):\\/\\/.+$"
          replacement = "$1"
          target_label = "tmp_container_runtime"
        }

        // make all labels on the pod available to the pipeline as labels,
        // they are omitted before write to loki via stage.label_keep unless explicitly set
        rule {
          action = "labelmap"
          regex = "__meta_kubernetes_pod_label_(.+)"
        }

        // make all annotations on the pod available to the pipeline as labels,
        // they are omitted before write to loki via stage.label_keep unless explicitly set
        rule {
          action = "labelmap"
          regex = "__meta_kubernetes_pod_annotation_(.+)"
        }

        // explicitly set service_name. if not set, loki will automatically try to populate a default.
        // see https://grafana.com/docs/loki/latest/get-started/labels/#default-labels-for-all-users
        //
        // choose the first value found from the following ordered list:
        // - pod.annotation[resource.opentelemetry.io/service.name]
        // - pod.label[app.kubernetes.io/name]
        // - k8s.pod.name
        // - k8s.container.name
        rule {
          action = "replace"
          source_labels = [
            "__meta_kubernetes_pod_annotation_resource_opentelemetry_io_service_name",
            "__meta_kubernetes_pod_label_app_kubernetes_io_name",
            "__meta_kubernetes_pod_container_name",
          ]
          separator = ";"
          regex = "^(?:;*)?([^;]+).*$"
          replacement = "$1"
          target_label = "service_name"
        }

        // set resource attributes
        rule {
          action = "labelmap"
          regex = "__meta_kubernetes_pod_annotation_resource_opentelemetry_io_(.+)"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_annotation_k8s_grafana_com_logs_job"]
          regex = "(.+)"
          target_label = "job"
        }
        rule {
          source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
          regex = "(.+)"
          target_label = "app_kubernetes_io_name"
        }
      }

      discovery.kubernetes "pods" {
        role = "pod"
        selectors {
          role = "pod"
          field = "spec.nodeName=" + sys.env("HOSTNAME")
        }
      }

      discovery.relabel "filtered_pods_with_paths" {
        targets = discovery.relabel.filtered_pods.output

        rule {
          source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
          separator = "/"
          action = "replace"
          replacement = "/var/log/pods/*$1/*.log"
          target_label = "__path__"
        }
      }

      local.file_match "pod_logs" {
        path_targets = discovery.relabel.filtered_pods_with_paths.output
      }

      loki.source.file "pod_logs" {
        targets    = local.file_match.pod_logs.targets
        forward_to = [loki.process.pod_logs.receiver]
      }

      loki.process "pod_logs" {
        stage.match {
          selector = "{tmp_container_runtime=~\"containerd|cri-o\"}"
          // the cri processing stage extracts the following k/v pairs: log, stream, time, flags
          stage.cri {}

          // Set the extract flags and stream values as labels
          stage.labels {
            values = {
              flags  = "",
              stream  = "",
            }
          }
        }

        stage.match {
          selector = "{tmp_container_runtime=\"docker\"}"
          // the docker processing stage extracts the following k/v pairs: log, stream, time
          stage.docker {}

          // Set the extract stream value as a label
          stage.labels {
            values = {
              stream  = "",
            }
          }
        }

        // Drop the filename label, since it's not really useful in the context of Kubernetes, where we already have cluster,
        // namespace, pod, and container labels. Drop any structured metadata. Also drop the temporary
        // container runtime label as it is no longer needed.
        stage.label_drop {
          values = [
            "filename",
            "tmp_container_runtime",
          ]
        }
        stage.structured_metadata {
          values = {
            "k8s_pod_name" = "k8s_pod_name",
            "pod" = "pod",
          }
        }

        // Only keep the labels that are defined in the `keepLabels` list.
        stage.label_keep {
          values = ["app_kubernetes_io_name","container","instance","job","level","namespace","service_name","service_namespace","deployment_environment","deployment_environment_name","k8s_namespace_name","k8s_deployment_name","k8s_statefulset_name","k8s_daemonset_name","k8s_cronjob_name","k8s_job_name","k8s_node_name"]
        }

        forward_to = argument.logs_destinations.value
      }
    }
    pod_logs "feature" {
      logs_destinations = [
        loki.write.logs_service.receiver,
      ]
    }



    // Extract container name from __meta_docker_container_name label and add as label
    discovery.relabel "task_analysis" {
      targets = discovery.kubernetes.pods.targets

      rule {
        source_labels = ["__meta_kubernetes_pod_name"]
        regex         = "engine-task-controller-.*"
        action        = "keep"
      }

      // Ensure the "task_analysis" label is added if it doesn't exist
      rule {
        action        = "replace"
        target_label  = "action"
        replacement   = "task_analysis"
      }

      // Ensure the "cluster" label is added if it doesn't exist
      rule {
        action        = "replace"
        target_label  = "cluster"
        replacement   = env("CLUSTER_NAME")
      }
    }

    loki.source.kubernetes "engine_task_controller_analysis" {
      targets    = discovery.relabel.task_analysis.output
      forward_to = [loki.process.task_analysis_json_extraction.receiver]
    }

    loki.process "task_analysis_json_extraction" {

        // Parse the JSON first
        stage.json {
            expressions = {
                app = ""
    // Destination: logs-service (loki)
    otelcol.exporter.loki "logs_service" {
      forward_to = [loki.write.logs_service.receiver]
    }

    loki.write "logs_service" {
      endpoint {
        url = "https://loki-prod.orchestrator.circledental.cloud/loki/api/v1/push"
        basic_auth {
          username = convert.nonsensitive(remote.kubernetes.secret.logs_service.data["username"])
          password = remote.kubernetes.secret.logs_service.data["password"]
        }
        tls_config {
          insecure_skip_verify = false
        }
        min_backoff_period = "500ms"
        max_backoff_period = "5m"
        max_backoff_retries = "10"
      }
      external_labels = {
        "cluster" = "XXX",
        "k8s_cluster_name" = "XXX",
      }
    }

    remote.kubernetes.secret "logs_service" {
      name      = "logs-service-grafana-k8s-monitoring"
      namespace = "default"
    }
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: grafana-k8s-monitoring
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2025-06-02T14:40:27Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: grafana-k8s-monitoring-alloy-logs
  namespace: default
  resourceVersion: "19925"
  uid: 03b128f9-e1af-4a30-9b71-e9e3baf53cee

:thought_balloon: Analysis & Questions

The configuration is being corrupted during the template processing pipeline:

  • OpenTofu’s file() function → Helm’s --set mechanism → ConfigMap generation

Possible causes:

  1. String escaping issues in the Helm template processing (see the untested sketch right after this list)
  2. Size limitations in Helm’s --set parameter handling
  3. Special character conflicts (quotes, newlines, etc.) between OpenTofu and Helm
  4. YAML parsing issues when complex multi-line strings are processed
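
If causes 1 and 3 are on the right track, my understanding is that the set value ends up going through Helm’s --set parser, which splits on unescaped commas. One untested idea, while staying with set blocks, would be to escape the commas before handing the value over:

  set {
    name  = "alloy-logs.extraConfig"
    # Untested idea: escape every comma so the --set-style parser does not
    # treat it as a key/value separator
    value = replace(file("${path.module}/files/task-analysis-alloy.tpl"), ",", "\\,")
  }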

:red_question_mark: Request for Help

Has anyone encountered similar truncation issues when passing complex Alloy configurations through Helm charts?

Specific questions:

  • Is there a better way to inject large configuration blocks into Helm charts?
  • How can I get this configuration working without stage.match first, and then reintroduce the stage.match blocks to complete my workflow?
  • Should I use ConfigMap files instead of --set parameters?
  • Are there known limitations with OpenTofu’s file() function and Helm integration?

Any insights on the proper way to handle this template → Helm → ConfigMap workflow would be greatly appreciated!

:white_check_mark: SOLVED: Configuration Truncation Fixed!

TL;DR: The issue was with using --set parameters for large configurations. Solution: Use Helm values files with templatefile() instead.


:wrench: Working Solution

Instead of passing the complex Alloy configuration through --set parameters, I switched to using a values file approach:

resource "helm_release" "grafana-k8s-monitoring" {
  name       = "grafana-k8s-monitoring"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "k8s-monitoring"
  namespace  = "default"
  atomic     = true
  timeout    = 300
  version    = var.versions.k8s_monitoring

  values = [
    templatefile("${path.module}/templates/grafana-k8s-monitoring-values.yaml.tpl", {
      docker_registry_root        = local.docker_registry_root
      cluster_suffix              = var.cluster_suffix
      apis                        = local.apis
      alloy_metrics_enabled       = var.alloy_metrics_enabled
      cluster_metrics_enabled     = var.cluster_metrics_enabled
      task_analysis_alloy_content = file("${path.module}/files/task-analysis-alloy.tpl")
      worker_node_label           = var.worker_node_label
    })
  ]
}

Key Configuration Block

In the values template file, I use this block for the extraConfig:

alloy-logs:
  enabled: true
  extraConfig: |
    ${indent(4, task_analysis_alloy_content)}
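
One detail worth noting: OpenTofu’s indent() leaves the first line untouched, so the four literal spaces in front of the interpolation line up with the four spaces indent() adds to every following line, and the whole block lands correctly under extraConfig. For completeness, the rest of my values template simply mirrors the keys I was previously passing via set blocks; a trimmed sketch (metrics destination, registry and node-selector settings omitted; credentials rendered this way no longer go through set_sensitive):

# templates/grafana-k8s-monitoring-values.yaml.tpl (trimmed sketch)
destinations:
  # ... metrics destination omitted ...
  - name: logs-service
    type: loki
    url: "${apis.loki.host}/loki/api/v1/push"
    auth:
      type: basic
      username: "${apis.loki.username}"
      password: "${apis.loki.password}"

clusterEvents:
  enabled: true
nodeLogs:
  enabled: true
podLogs:
  enabled: true

alloy-singleton:
  enabled: true
alloy-logs:
  enabled: true
  extraConfig: |
    ${indent(4, task_analysis_alloy_content)}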

:tada: Results

  • :white_check_mark: No more truncation - The complete configuration is now properly rendered
  • :white_check_mark: Proper formatting - All blocks are correctly indented and structured
  • :white_check_mark: ConfigMap generation works as expected

:bug: Current Status

The configuration is now well-formatted, but I’m debugging a scope issue:

Error: /etc/alloy/config.alloy:336:13: component "discovery.kubernetes.pods.targets" does not exist or is out of scope

335 | discovery.relabel "task_analysis" {
336 |   targets = discovery.kubernetes.pods.targets
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
337 |

This is clearly a reference issue in my Alloy config that needs updating for the current deployment method. I’ll work on fixing this separately since it’s no longer related to the truncation problem.
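
For the record, my working theory based on the generated config above: the chart declares discovery.kubernetes "pods" inside its declare "pod_logs" block, so that component isn’t visible to extraConfig at the top level. The plan is to declare a pod discovery of my own in the template, along these lines (untested sketch, the component name is mine):

// Own pod discovery, since the chart's discovery.kubernetes "pods" lives
// inside declare "pod_logs" and is out of scope for extraConfig (untested)
discovery.kubernetes "task_analysis_pods" {
  role = "pod"
}

discovery.relabel "task_analysis" {
  targets = discovery.kubernetes.task_analysis_pods.targets
  // ... same keep/replace rules as before ...
}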


:light_bulb: Key Takeaway

For complex configurations: Use Helm values files with templatefile() instead of --set parameters to avoid truncation and parsing issues.

Thanks again @tonyswumac for pointing me in the right direction! :folded_hands:

I’ll leave this issue open until I’ve fully resolved the remaining scope issues, but I believe we’re on the right track now.