DatasourceError alerts triggered from different alert rule names

Our production base cluster keeps triggering alerts named “DatasourceError,” but the alerts behind them vary from day to day: the generic alertname is always “DatasourceError,” while the underlying rules are specific ones such as “LokiRequestErrors” and “AlloyPodCrashLooping.” Sometimes it self-heals. During my research I found that the issue might be related to Prometheus/Thanos being unavailable, although no alert specifically indicates this, and I need to identify why Thanos is sometimes not found. My assumption might be incorrect, so could someone help me troubleshoot this? I will add my alert description below. Note that in the alert I couldn’t find the label values either!

FIRING:1 | DatasourceError |
1x Alerts Firing
Summary: Loki is experiencing request errors.
Description: [no value] [no value] in cluster [no value] is experiencing a high number of errors.
alertname: DatasourceError
grafana_folder: Observability Enablement
ref_id: A
rulename: LokiRequestErrors
sc_component: loki
sc_env: production
sc_provider: k8s
sc_system: oe
severity: P1

What does this documentation mean? I can get different alerts under the same DatasourceError, for example:
Summary: A container is using more than 90% of its memory limit.

Description: The memory usage for container [no value] in pod: [no value], namespace: [no value], cluster: [no value] is higher than 90% for over 15 minutes.

Labels:

  • alertname: DatasourceError
  • datasource_uid: XXXXXXXXXXXXXXX
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: ContainerCriticalMemoryUsage
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

You asked for troubleshooting, so configure state history and provide logs from there. There should be more details about why you got the DatasourceError state. There can be billions and billions of reasons, e.g. query timeout, unreachable datasource, …
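A minimal sketch for enabling the Loki-backed state history in grafana.ini (assuming a recent Grafana version; the Loki URL is a placeholder for your own instance, and some versions additionally require feature toggles):

[unified_alerting.state_history]
; record alert state transitions, including the error behind a DatasourceError
enabled = true
backend = loki
; placeholder - point this at your own Loki instance
loki_remote_url = http://loki-gateway.loki.svc:3100

With the default annotations backend you can also just open the alert rule in Grafana and look at its state history view; the error message recorded there is the root cause you are after.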

If I follow this documentation, will my datasource error be resolved? Moreover, I have many alerts firing under the DatasourceError name, not only from Loki but also from Thanos, etc.

No, because you haven’t identified and fixed the root cause yet. DatasourceError is just a consequence of another problem.

That is exactly the problem I am facing now. I tried, but I couldn’t find the root cause.

My assumption: I can see two specific errors occurring repeatedly during a certain time slot, which is when the alerts trigger. The logs I’m adding below show that one error is caused by the Thanos Querier timing out when reaching the Thanos sidecar, and the other involves a Thanos component (likely the Querier or Compactor) failing to connect to the Store Gateway at the specified address. I believe these two issues are the root cause of the datasource error.

Error - 1

{"address":"euw-i1-contenthub-sb-aks-01-sidecar-1.sitecorecloud.app:443","caller":"endpointset.go:459","component":"endpointset","duplicates":"2","extLset":"{cluster=\"euw-i1-contenthub-sb-aks-01\", location=\"westeurope\", prometheus=\"oe-prometheus/kube-prometheus-stack-prometheus\", prometheus_replica=\"prometheus-kube-prometheus-stack-prometheus-1\"}","level":"warn","msg":"found duplicate storeEndpoints producer (sidecar or ruler). This is not advised as it will malform data in the same bucket","ts":"2025-06-13T04:50:13.558035888Z"}

Error - 2

{"address":"euw-i1-contenthub-aks-01-store-gateway.sitecorecloud.app:443","caller":"endpointset.go:471","component":"endpointset","err":"getting metadata: rpc error: code = DeadlineExceeded desc = latest balancer error: name resolver error: produced zero addresses","level":"warn","msg":"update of endpoint failed","ts":"2025-06-13T04:50:13.58148948Z"}

Do you have any ideas?

You provided some logs, but they don’t look like Grafana logs. Sorry, you are misleading yourself here. Please provide Grafana alerting logs with the error message about the root cause of the DatasourceError state.

This is a real example of the kind of log (nicely formatted for humans) that you can get:

{
    "schemaVersion": 1,
    "previous": "Normal",
    "current": "Error",
    "error": "[sse.dataQueryError] failed to execute query [A]: unexpected response with status code 500: {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"the bucket index is too old. It was last updated at ---, which exceeds the maximum allowed staleness period of 1h0m0s (err-mimir-bucket-index-too-old)\"}",
    "values": {},
    "condition": "D",
    "dashboardUID": "---",
    "panelID": 123,
    "fingerprint": "---",
    "ruleTitle": "alert-rule-name",
    "ruleID": 123,
    "ruleUID": "---",
    "labels": {
        "alertname": "alert-rule-name",
        "rule_uid": "---"
    }
}

From this log it is obvious why the alert went from Normal to the DatasourceError state:

[sse.dataQueryError] failed to execute query [A]: unexpected response with status code 500: {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"the bucket index is too old. It was last updated at ---, which exceeds the maximum allowed staleness period of 1h0m0s (err-mimir-bucket-index-too-old)\"}

You have the root cause, so the user/admin can dig into what they can do with this error.

Of course, I’m not saying this is exactly your problem. I’m trying to explain: provide the proper logs and then you will know the root cause. With your current approach you are just guessing that some warnings/errors from Thanos might be the problem. They may be, but they may also not be the root cause. That’s just a guess.
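And if your state history is written to Loki, you can pull exactly these Error transitions with a LogQL query along these lines (just a sketch; from="state-history" is the stream label the Loki backend uses, but double-check the labels in your own setup):

{from="state-history"} | json | current="Error"

You can narrow it down further using the ruleUID or labels fields from the entry if you only care about one rule.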

These are the alerts firing in our team channel, continuously but only at specific times; after about 20 minutes they self-heal.
FIRING:6 | DatasourceError |

6x Alerts Firing

Summary: A container is using more than 90% of its memory limit.

Description: The memory usage for container [no value] in pod: [no value], namespace: [no value], cluster: [no value] is higher than 90% for over 15 minutes.

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: ContainerCriticalMemoryUsage
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: A daemonset does not meet its expected number of pods.

Description: Not all of the desired Pods of DaemonSet [no value]/[no value] in cluster [no value] are scheduled and ready.

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubeDaemonSetRolloutStuck
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: Node memory pressure (instance [no value])

Description: Node [no value] in cluster [no value] has Memory Pressure condition.

Explore: https://grafana-prod-euw.sitecorecloud.app/alerting/list?search=DatasourceError

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubeNodeMemoryPressure
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: A persistent volume has consistently been in an error state.

Description: The persistent volume [no value] in cluster [no value] has status [no value].

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubePersistentVolumeErrors
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: A persistent volume is running out of space.

Description: The PersistentVolume claimed by [no value] in Namespace [no value] and cluster [no value] is less than 10% free.

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubePersistentVolumeUsageCritical
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Summary: A statefulset has version mismatches.

Description: StatefulSet generation for [no value]/[no value] in cluster [no value] does not match, this indicates that the StatefulSet has failed but has not been rolled back.

Labels:

  • alertname: DatasourceError
  • datasource_uid: P9421284538FAC2F5
  • grafana_folder: Observability Enablement
  • ref_id: A
  • rulename: KubeStatefulSetGenerationMismatch
  • sc_component: kubernetes
  • sc_env: production
  • sc_provider: k8s
  • sc_system: oe
  • severity: P1

Let me know your contact point so I can share more details!