Loki memcached chunks out-of-memory errors

I have been seeing a problem in our Loki install (Kubernetes, using the loki-distributed Helm chart) where the memcachedChunks container will quickly run out of memory, which causes logs to fail to be ingested.

A set operation results in either:
level=error ts=2021-10-04T15:35:46.721335152Z caller=memcached.go:235 msg="failed to put to memcached" name=chunks err="server=192.168.191.8:11211: memcache: unexpected response line from \"set\": \"SERVER_ERROR out of memory storing object\\r\\n\""

or

level=error ts=2021-10-04T15:35:58.30206562Z caller=memcached.go:235 msg="failed to put to memcached" name=chunks err="server=192.168.191.8:11211: memcache: unexpected response line from \"set\": \"SERVER_ERROR Out of memory during read\\r\\n\""

Currently we have the following settings:
memcachedChunks:
  resources:
    requests:
      cpu: 500m
      memory: 19073Mi
  enabled: true
  extraArgs:
    - -m 18000
    - -I 32m

All memcached config settings are set to:
batch_size: 100
parallelism: 100
expiration: 30m

split_queries_by_interval is set to 15m and align_queries_with_step is set to true.

chunk_target_size is 1536000
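
As a rough sanity check on the numbers above, this Python sketch works out how many target-sized chunks fit into the -m budget and how much headroom the pod request leaves above it. It assumes every cached chunk is exactly chunk_target_size bytes and ignores memcached's per-item and slab overhead, so real capacity would be lower. The function name is only illustrative.

```python
MIB = 1024 * 1024

def cache_budget(mem_limit_mb, request_mib, chunk_target_bytes):
    """Return (chunks that fit in the -m budget, headroom in MiB above -m)."""
    # memcached's -m value is given in megabytes and covers item storage only.
    chunks_that_fit = (mem_limit_mb * MIB) // chunk_target_bytes
    # Whatever the pod requests beyond -m is left for connections and overhead.
    headroom_mib = request_mib - mem_limit_mb
    return chunks_that_fit, headroom_mib

chunks, headroom = cache_budget(18000, 19073, 1536000)
print(chunks, headroom)  # 12288 1073
```

Under those assumptions the cache holds about 12,288 chunks, and the pod has about 1 GiB of memory beyond what memcached may use for items.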

It looks to me like chunks are filling up the memcached server but are never purged. This also stops logs from being ingested, which seems odd to me, since I thought this cache was a query cache meant to speed up retrieval. Why does an out-of-memory memcached server cause log ingestion to fail, and what would you recommend here? Also, I was under the impression that memcached evicts older entries when a new set would push it past its memory limit. Is that not the case?
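
One way to check whether eviction is actually happening is to send memcached's `stats` command to the chunks instance and look at the `evictions`, `bytes` and `limit_maxbytes` counters. The sketch below does that with a plain socket. The helper names are made up for illustration, and it assumes the memcached port is reachable from wherever the script runs.

```python
import socket

def parse_stats(raw):
    """Parse the text reply of memcached's `stats` command into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(" ", 2)
        # Each counter arrives as "STAT <name> <value>"; the reply ends with "END".
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

def fetch_stats(host, port=11211, timeout=5.0):
    """Send `stats` to a memcached server and return the parsed counters."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_stats(data.decode("ascii", errors="replace"))

def summarize(stats):
    """Return (fraction of item memory in use, eviction count)."""
    used = int(stats["bytes"]) / int(stats["limit_maxbytes"])
    return used, int(stats["evictions"])
```

For example, `summarize(fetch_stats("192.168.191.8"))` would query the server named in the error messages above. An `evictions` count that stays flat while sets keep failing would suggest that memcached is rejecting writes rather than evicting older entries.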

Any help is appreciated.

Thanks.

Hi @stdiluted, we have also deployed the loki-distributed Helm chart (including memcached) in our Kubernetes clusters and are seeing these error messages.

Actually, the distributed Helm chart defines multiple memcached instances, and the one the ingesters write to is not the one used for the queries.

In our setup I see 3 different memcached pods:

loki-memcached-chunks-0 3/3 Running 0 17h
loki-memcached-frontend-0 3/3 Running 0 17h
loki-memcached-index-queries-0 3/3 Running 0 17h

The first one is responsible for caching chunks that are uploaded to wherever your logs are stored. It is not the query results cache.

Also seeing the same problem.

We use the distributed helm chart.

distributor:
  replicas: 4
  resources:
    limits:
      cpu: 500m
      memory: 256Mi
    requests:
      cpu: 500m
      memory: 256Mi
  nodeSelector:
    lifecycle: spot

ingester:
  persistence:
    enabled: true
    storageClass: gp2
    size: 40G
  resources:
    limits:
      cpu: 2000m
      memory: 14Gi
    requests:
      cpu: 2000m
      memory: 14Gi
  replicas: 6
  nodeSelector:
    lifecycle: spot

memcachedChunks:
  enabled: true
  replicas: 8
  extraArgs:
    - -m 19000
    - -I 10m
    - -vvv
  resources:
    requests:
      cpu: 1000m
      memory: 20Gi
    limits:
      cpu: 1000m
      memory: 20Gi
  nodeSelector:
    lifecycle: spot

memcachedIndexQueries:
  replicas: 4
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: 500m
      memory: 2Gi
  enabled: true
  nodeSelector:
    lifecycle: spot
memcachedExporter:
  enabled: true
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
    requests:
      cpu: 100m
      memory: 50Mi

queryFrontend:
  replicas: 3
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 200m
      memory: 256Mi
  nodeSelector:
    lifecycle: spot

querier:
  resources:
    requests:
      cpu: 2000m
      memory: 2Gi
    limits:
      cpu: 7000m
      memory: 10Gi
  replicas: 24
  nodeSelector:
    lifecycle: spot

memcachedFrontend:
  nodeSelector:
    lifecycle: spot
  serviceMonitor:
    enabled: true
    labels:
      release: kube-prometheus-stack
    interval: 30s

Seeing loads of errors in the logs:

loki-distributed-memcached-chunks-1 memcached >29 SERVER_ERROR Out of memory during read
loki-distributed-ingester-1 ingester level=error ts=2022-02-22T12:03:00.672944713Z caller=memcached.go:224 msg="failed to put to memcached" name=chunks err="server=redacted:11211: memcache: unexpected response line from \"set\": \"SERVER_ERROR Out of memory during read\\r\\n\""
loki-distributed-ingester-1 ingester level=error ts=2022-02-22T12:03:00.842328067Z caller=memcached.go:224 msg="failed to put to memcached" name=chunks err="server=redacted:11211: memcache: unexpected response line from \"set\": \"SERVER_ERROR Out of memory during read\\r\\n\""
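
To see whether the failures are concentrated on one memcached server or one error type, the ingester lines can be tallied. This is a minimal sketch that assumes the log format shown above; the function name and the regular expression are illustrative and not part of Loki.

```python
import re
from collections import Counter

# Matches the server address and the SERVER_ERROR text inside the err="..." field.
PUT_FAILURE = re.compile(
    r'msg="failed to put to memcached".*?server=(?P<server>[^\s]+?): '
    r'.*?(?P<error>SERVER_ERROR [^\\"]+)'
)

def tally_put_failures(lines):
    """Count failed memcached puts per (server, error message)."""
    counts = Counter()
    for line in lines:
        match = PUT_FAILURE.search(line)
        if match:
            counts[(match.group("server"), match.group("error"))] += 1
    return counts
```

Feeding it the ingester log lines (for instance, read from a saved log file) gives one count per server and error pair, which makes it easier to tell a single overloaded memcached replica from a problem that affects all of them.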

Could it be that your uploads for the logs are too slow? Then the chunks would pile up faster than they can be removed and thus create memory issues in memcached.

I’m still receiving the memcached OOM errors. Additionally, all of the ingesters filled up their PVs. How can I find out why this happened?

The logs don’t really give me any indication.