Traces and Logs intermittently disappearing from Tempo and Loki

Introduction

I am deploying Loki, Promtail, Tempo (single binary mode), Prometheus, and Grafana to a 3-node K3s cluster where I intend to run other applications (exposing web services) that I want to instrument using these tools.

I’m deploying all of this with Helm, and I made my own chart of charts so I can deploy the whole monitoring/observability stack together. These are my chart dependencies, which show exactly what I’m deploying and which Helm charts are being used:

dependencies:
  - name: prometheus  # includes: prometheus server, alertmanager, and pushgateway
    version: 15.1.1
    repository: https://prometheus-community.github.io/helm-charts
  - name: loki-stack   # includes: loki and promtail
    version: 2.5.1
    repository: https://grafana.github.io/helm-charts
  - name: tempo
    version: 0.13.0
    repository: https://grafana.github.io/helm-charts
  - name: grafana
    version: 6.21.0
    repository: https://grafana.github.io/helm-charts
  - name: opentelemetry-collector
    version: 0.8.1
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts
    alias: otel-collector
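
Since each dependency’s values live under its chart name (or alias) in the umbrella chart, the values files quoted later in this post are all keyed at the top level like this:

# Skeleton of the umbrella chart's values.yaml: every sub-chart is configured
# under its dependency name (or its alias, in the collector's case)
prometheus: {}
loki-stack: {}
tempo: {}
grafana: {}
otel-collector: {}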

The actual web service applications I’m deploying are .NET 6 applications using the OpenTelemetry .NET library, so I’m using the OpenTelemetry Collector as an agent to receive trace data from my .NET apps and forward it to Tempo. Logs are collected from the cluster pods by Promtail and sent to Loki. I have also set up a “derived field” on the Loki data source in Grafana that grabs the OpenTelemetry trace ID from my logs so I can link over to the corresponding Tempo traces.
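
On the application side there isn’t much to it: each .NET deployment just points its OTLP exporter at the collector’s Service using the standard OpenTelemetry environment variable. A rough sketch (the service name below is a placeholder; use whatever Service the opentelemetry-collector chart creates in your release):

# Illustrative env block for a .NET deployment; obs-otel-collector.obs is a placeholder
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT            # standard OTel variable read by the OTLP exporter
    value: "http://obs-otel-collector.obs:4317"  # the collector's OTLP gRPC receiver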

My Grafana chart values file looks like this:

grafana:
  grafana.ini:
    server:
      domain: my-cluster-hostname.mydomain.org
      root_url: "%(protocol)s://%(domain)s/grafana"
      serve_from_sub_path: true

  ingress:
    enabled: true
    ingressClassName: nginx
    path: /grafana
    pathType: Prefix
    hosts:
      - my-cluster-hostname.mydomain.org
    tls:
      - secretName: cluster-tls-cert
        hosts:
          - my-cluster-hostname.mydomain.org

  admin:
    existingSecret: grafana-admin-credentials
    userKey: admin-user
    passwordKey: admin-password

  ## Configure grafana datasources
  ## ref: http://docs.grafana.org/administration/provisioning/#datasources
  ##
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus
        access: proxy
        url: http://obs-prometheus-server.obs:9090
        version: 1
      - name: Loki
        type: loki
        uid: loki
        access: proxy
        url: http://obs-loki.obs:3100
        version: 1
        jsonData:
          derivedFields:
          - datasourceUid: tempo
            matcherRegex: .*\,\"TraceId\":\"(\w+)\"
            url: '$${__value.raw}'
            name: TraceID
      - name: Tempo
        type: tempo
        uid: tempo
        access: proxy
        url: http://obs-tempo.obs:3100
        version: 1

There you can see my data sources: Prometheus, Loki, and Tempo. The URLs point to my K8s services, which are named obs-prometheus-server, obs-loki, and obs-tempo and live in the obs namespace. AFAIK the ports are correct as well, based on examples I’ve found online (though they’re not always the same across guides/articles, which is confusing).
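
For reference, here is the same derived field again with annotations, since it took me a while to work out what each piece does (the example trace ID is made up):

# Annotated copy of the Loki derived field from the values above
derivedFields:
  - name: TraceID
    datasourceUid: tempo                        # link target: the uid of the Tempo data source
    matcherRegex: .*\,\"TraceId\":\"(\w+)\"     # captures the hex ID from JSON log lines containing
                                                # e.g. "TraceId":"0af7651916cd43dd8448eb211c80319c"
    url: '$${__value.raw}'                      # the extra $ escapes interpolation in the provisioned
                                                # file, so Grafana sees ${__value.raw}, i.e. the captured ID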

The problem

Now, for a while after I start running my application and it is producing logs and traces, I see data in both Loki and Tempo inside Grafana (I go to Explore and pick the corresponding data source). I can even jump from logs in Loki to traces in Tempo. But then suddenly I cannot find any logs or traces that were there as little as an hour ago. If I restart Promtail, Loki, and Tempo (which run as DaemonSets or StatefulSets), I magically start getting logs and traces again, but it’s extremely frustrating that it just stops working even though all my pods are running without obvious issues.

I’m hoping it’s just something wrong with my deployment approach and that someone can provide feedback. I’ve already shared my Grafana chart values above; these are the other chart settings:

loki-stack chart values

loki-stack:
  loki:
    enabled: true
    replicas: 3
    persistence:
      enabled: true
      accessModes:
      - ReadWriteOnce
      size: 10Gi
      # storageClassName: local-path
      annotations: {}
    config:
      limits_config:
        retention_period: 336h
      compactor:
        compaction_interval: 10m
        retention_enabled: true
        retention_delete_delay: 2h

tempo chart values
tempo:
  replicas: 3
  tempo:
    retention: 24h
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"
    resources:
      requests:
        cpu: 500m
        memory: 500Mi
      limits:
        cpu: 1000m
        memory: 1Gi

  persistence:
    enabled: true
    # storageClassName: local-path
    accessModes:
      - ReadWriteOnce
    size: 10Gi

opentelemetry-collector chart values

otel-collector:
  config:
    exporters:
      otlp:
        endpoint: obs-tempo.obs:4317
        tls:
          insecure: true
    extensions:
      health_check: {}
      memory_ballast: {}
    processors:
      batch: {}
      # If set to null, will be overridden with values based on k8s resource limits
      memory_limiter: null
    receivers:
      jaeger: null
      prometheus: null
      zipkin: null
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    service:
      extensions:
        - health_check
        - memory_ballast
      pipelines:
        logs: null
        metrics: null
        traces:
          exporters:
            - otlp
          processors:
            - memory_limiter
            - batch
          receivers:
            - otlp
  
  agentCollector:
    enabled: false

  standaloneCollector:
    enabled: true
    replicaCount: 3
    resources:
      requests:
        cpu: 500m
        memory: 500Mi
      limits:
        cpu: 1000m
        memory: 1Gi

The request

I don’t see any errors in my OpenTelemetry Collector pods, and my .NET app appears to be reporting traces to the collector successfully (I enabled the collector’s debug output and I can see it claiming to send traces to Tempo).
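
In case anyone wants to reproduce that debug output, one way to get it is to add the collector’s logging exporter to the traces pipeline, roughly like this (option names may vary slightly between collector versions):

otel-collector:
  config:
    exporters:
      logging:
        loglevel: debug        # print exported spans to the collector's stdout
    service:
      pipelines:
        traces:
          exporters:
            - otlp             # keep sending to Tempo
            - logging          # ...and also log what is being exported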

I also don’t see any obvious errors in Promtail, Grafana, Loki, or Tempo pod logs…

Am I doing something wrong with my deployment here? I have 3 Loki pods running, 3 Tempo pods, 3 Promtail pods, and 1 Grafana pod. It sometimes works but then stops working (i.e. it stops showing traces and logs).

Here is an example of what I am seeing: I managed to enable the search beta, so I can see some spans showing up with it. However, if I click the “Trace ID” link on the left, Grafana tries to open the trace details on the right… but it cannot find the trace.

Thanks for all the detail! Let’s take a look at this issue and see if we can’t resolve it. The first thing I’m noticing is that you say you are operating in single binary mode, but I’m seeing replicas: 3 in both your Loki and Tempo Helm values.

In single binary mode, Tempo (and Loki) is designed to run as a standalone component. If you are running 3 replicas, it’s possible that your search query went to one replica and your trace-by-id query went to a different one. I’d change those values to 1 and see if things start working.

I didn’t realize we exposed replicas as a parameter in the single binary helm chart. We should remove that :).

Thanks Joe, that is one thing I was considering… I will try that out, but it also raises the question: if I want my tracing and logging to be highly available, do I have to use the tempo-distributed and loki-distributed charts?

Currently, yes. We do support a scalable single binary mode, but I don’t think the Helm chart supports it directly. You may have to monkey with the settings some.

https://grafana.com/docs/tempo/latest/operations/deployment/#scalable-single-binary
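
Very roughly, it means running every replica with the scalable-single-binary target, letting the replicas find each other over memberlist, and using a shared object storage backend instead of per-pod local disk. An untested sketch of the Tempo config (the gossip service name and S3 details are placeholders):

target: scalable-single-binary   # every replica runs all Tempo components
memberlist:
  join_members:
    - tempo-gossip-ring          # placeholder: headless service resolving to all replicas
storage:
  trace:
    backend: s3                  # shared object storage so any replica can read any block
    s3:
      bucket: tempo-traces       # placeholder bucket
      endpoint: s3.example.com   # placeholder endpoint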

OK, I will dig around in the documentation some more. BTW, reducing the replicas to 1 for Loki and Tempo seems to have resolved the issue so far… your hypothesis seems to be correct 🙂
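
For anyone who finds this thread later, the change amounted to just this in my umbrella values:

# What resolved it: one replica each while staying on the single binary charts
loki-stack:
  loki:
    replicas: 1
tempo:
  replicas: 1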
