DatasourceError alerts triggered from different alert rule names

Our production base cluster keeps triggering alerts named “DatasourceError,” but the alerts behind them vary from day to day: the generic alertname is always “DatasourceError,” while the underlying rules are specific ones such as “LokiRequestErrors” and “AlloyPodCrashLooping.” Sometimes it self-heals. During my research I found that the issue might be related to Prometheus/Thanos being unavailable, although no alert specifically indicates this, and I need to identify why Thanos is sometimes not found. My assumption might be incorrect, so could someone help me troubleshoot this? I will add my alert description below. Note that in the alert I couldn’t find the label values either!

FIRING:1 | DatasourceError |
1x Alerts Firing
Summary: Loki is experiencing request errors.
Description: [no value] [no value] in cluster [no value] is experiencing a high number of errors.
alertname: DatasourceError
grafana_folder: Observability Enablement
ref_id: A
rulename: LokiRequestErrors
sc_component: loki
sc_env: production
sc_provider: k8s
sc_system: oe
severity: P1

What does this documentation mean? I can get different alerts under the same DatasourceError, for example:
Summary: A container is using more than 90% of its memory limit.

Description: The memory usage for container [no value] in pod: [no value], namespace: [no value], cluster: [no value] is higher than 90% for over 15 minutes.

Labels:

  • alertname: DatasourceError
  • datasource_uid: XXXXXXXXXXXXXXX
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: ContainerCriticalMemoryUsage
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

You asked for troubleshooting, so configure state history and provide logs from there. There should be more details about why you got the DatasourceError state. There can be billions and billions of reasons, e.g. query timeout, unreachable datasource, …
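A minimal sketch for enabling the Loki-backed state history in grafana.ini (assuming a recent Grafana version; the Loki URL is a placeholder for your own instance, and some versions additionally require feature toggles):

[unified_alerting.state_history]
; record alert state transitions, including the error behind a DatasourceError
enabled = true
backend = loki
; placeholder - point this at your own Loki instance
loki_remote_url = http://loki-gateway.loki.svc:3100

With the default annotations backend you can also just open the alert rule in Grafana and look at its state history view; the error message recorded there is the root cause you are after.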

If I follow this documentation, will my datasource error be resolved? Moreover, I have many alerts firing under the DatasourceError name, not only from Loki but also from Thanos, etc.

No, because you haven’t identified and fixed the root cause yet. DatasourceError is just a consequence of another problem.

That is exactly the problem I am facing now. I tried, but I couldn’t find the root cause.

My assumption: I can see two specific errors occurring repeatedly during a certain time slot, which is when the alerts trigger. The logs I’m adding below show that one error is caused by the Thanos Querier timing out when reaching the Thanos sidecar, and the other involves a Thanos component (likely the Querier or Compactor) failing to connect to the Store Gateway at the specified address. I believe these two issues are the root cause of the datasource error.

Error - 1

{"address":"euw-i1-contenthub-sb-aks-01-sidecar-1.sitecorecloud.app:443","caller":"endpointset.go:459","component":"endpointset","duplicates":"2","extLset":"{cluster=\"euw-i1-contenthub-sb-aks-01\", location=\"westeurope\", prometheus=\"oe-prometheus/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-1\"}","level":"warn","msg":"found duplicate storeEndpoints producer (sidecar or ruler). This is not advised as it will malform data in the same bucket","ts":"2025-06-13T04:50:13.558035888Z"}

Error - 2

{"address":"euw-i1-contenthub-aks-01-store-gateway.sitecorecloud.app:443","caller":"endpointset.go:471","component":"endpointset","err":"getting metadata: rpc error: code = DeadlineExceeded desc = latest balancer error: name resolver error: produced zero addresses","level":"warn","msg":"update of endpoint failed","ts":"2025-06-13T04:50:13.58148948Z"}

Do you have any ideas?

You provided some logs, but they don’t look like Grafana logs. Sorry, you are misleading yourself here. Please provide Grafana alerting logs with the error message about the root cause of the DatasourceError state.

This is a real example of the kind of log (nicely formatted for humans) that you can get:

{
    "schemaVersion": 1,
    "previous": "Normal",
    "current": "Error",
    "error": "[sse.dataQueryError] failed to execute query [A]: unexpected response with status code 500: {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"the bucket index is too old. It was last updated at ---, which exceeds the maximum allowed staleness period of 1h0m0s (err-mimir-bucket-index-too-old)\"}",
    "values": {},
    "condition": "D",
    "dashboardUID": "---",
    "panelID": 123,
    "fingerprint": "---",
    "ruleTitle": "alert-rule-name",
    "ruleID": 123,
    "ruleUID": "---",
    "labels": {
        "alertname": "alert-rule-name",
        "rule_uid": "---"
    }
}

From this log it is obvious why the alert went from Normal to the DatasourceError state:

[sse.dataQueryError] failed to execute query [A]: unexpected response with status code 500: {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"the bucket index is too old. It was last updated at ---, which exceeds the maximum allowed staleness period of 1h0m0s (err-mimir-bucket-index-too-old)\"}

You have the root cause, so the user/admin can dig into what they can do with this error.

Of course, I’m not saying this is exactly your problem. I’m trying to explain: provide the proper logs and then you will know the root cause. With your current approach you are just guessing that some warnings/errors from Thanos might be the problem. They may be, but they may also not be the root cause. That’s just a guess.
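And if your state history is written to Loki, you can pull exactly these Error transitions with a LogQL query along these lines (just a sketch; from="state-history" is the stream label the Loki backend uses, but double-check the labels in your own setup):

{from="state-history"} | json | current="Error"

You can narrow it down further using the ruleUID or labels fields from the entry if you only care about one rule.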

These are the alerts firing in our team channel, continuously but only at specific times; after about 20 minutes they self-heal.
FIRING:6 | DatasourceError |

6x Alerts Firing

Summary: A container is using more than 90% of its memory limit.

Description: The memory usage for container [no value] in pod: [no value], namespace: [no value], cluster: [no value] is higher than 90% for over 15 minutes.

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: ContainerCriticalMemoryUsage
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: A daemonset does not meet its expected number of pods.

Description: Not all of the desired Pods of DaemonSet [no value]/[no value] in cluster [no value] are scheduled and ready.

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubeDaemonSetRolloutStuck
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: Node memory pressure (instance [no value])

Description: Node [no value] in cluster [no value] has Memory Pressure condition.

Explore: https://grafana-prod-euw.sitecorecloud.app/alerting/list?search=DatasourceError

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubeNodeMemoryPressure
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: A persistent volume has consistently been in an error state.

Description: The persistent volume [no value] in cluster [no value] has status [no value].

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubePersistentVolumeErrors
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: A persistent volume is running out of space.

Description: The PersistentVolume claimed by [no value] in Namespace [no value] and cluster [no value] is less than 10% free.

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubePersistentVolumeUsageCritical
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: A statefulset has version mismatches.

Description: StatefulSet generation for [no value]/[no value] in cluster [no value] does not match, this indicates that the StatefulSet has failed but has not been rolled back.

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubeStatefulSetGenerationMismatch
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Let me know your contact point so I can share more details!