Memcached Config in K8S Distributed model

Wondering if someone can sanity-check my config and possibly shed some light on i/o timeouts to memcached.

I’m using the Helm distributed deployment in K8S (1 compactor, 1 distributor, 3 ingesters, 2 gateways, 3 queriers, 3 frontends, 1 table manager, 1 memcached chunks, 1 memcached index queries, 1 memcached frontend).

When I look at the pod logs, only the querier pods reference memcached, and they are the ones logging i/o timeouts, but the IP they use belongs to the chunk memcached instance. Is that right?

storage_config:
  aws:
    s3: s3://lokibucket.s3.us-east-1
    bucketnames: lokibucket

  boltdb_shipper:
    shared_store: s3
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h

  index_queries_cache_config:
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      host: loki-loki-distributed-memcached-index-queries.loki.svc.cluster.local
      service: http

chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      host: loki-loki-distributed-memcached-chunks.loki.svc.cluster.local
      service: http
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m
  cache_results: true
  results_cache:
    cache:
      memcached:
        batch_size: 100
        parallelism: 100
      memcached_client:
        host: loki-loki-distributed-memcached-frontend.loki.svc.cluster.local
        service: http

If I run echo stats, say on the chunk instance, I see connections (all 3 memcached instances show current connections):

echo stats | nc 127.0.0.1 11211
STAT pid 1
STAT uptime 82280
STAT time 1629219205
STAT version 1.6.10
STAT libevent 2.1.12-stable
STAT pointer_size 64
STAT rusage_user 7.186096
STAT rusage_system 13.927203
STAT max_connections 1024
STAT curr_connections 33
STAT total_connections 8563
STAT rejected_connections 0
STAT connection_structures 52
STAT response_obj_oom 0
STAT response_obj_count 1
STAT response_obj_bytes 65536
STAT read_buf_count 33
STAT read_buf_bytes 540672
STAT read_buf_bytes_free 458752
STAT read_buf_oom 0
STAT reserved_fds 20
STAT cmd_get 38988
STAT cmd_set 26965
STAT cmd_flush 0
STAT cmd_touch 0
STAT cmd_meta 0
STAT get_hits 12235
STAT get_misses 26753
STAT get_expired 0
STAT get_flushed 0
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 9828550092
STAT bytes_written 3137411488
STAT limit_maxbytes 67108864
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT time_in_listen_disabled_us 0
STAT threads 4
STAT conn_yields 0
STAT hash_power_level 16
STAT hash_bytes 524288
STAT hash_is_expanding 0
STAT slab_reassign_rescues 20
STAT slab_reassign_chunk_rescues 0
STAT slab_reassign_evictions_nomem 123
STAT slab_reassign_inline_reclaim 1
STAT slab_reassign_busy_items 6
STAT slab_reassign_busy_deletes 0
STAT slab_reassign_running 0
STAT slabs_moved 73
STAT lru_crawler_running 0
STAT lru_crawler_starts 52
STAT lru_maintainer_juggles 193726
STAT malloc_fails 0
STAT log_worker_dropped 0
STAT log_worker_written 0
STAT log_watcher_skipped 0
STAT log_watcher_sent 0
STAT unexpected_napi_ids 0
STAT round_robin_fallback 0
STAT bytes 45260110
STAT curr_items 200
STAT total_items 26985
STAT slab_global_page_pool 0
STAT expired_unfetched 0
STAT evicted_unfetched 16715
STAT evicted_active 42
STAT evictions 19772
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 44
STAT lrutail_reflocked 2751
STAT moves_to_cold 26787
STAT moves_to_warm 2645
STAT moves_within_lru 2352
STAT direct_reclaims 20281
STAT lru_bumps_dropped 0
END
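
To make the numbers above easier to compare across the three memcached instances, here is a small sketch that condenses the stats output into a hit ratio, an eviction count and the memory limit. It assumes awk is available wherever the stats are collected; memcached_summary is just a name I made up for illustration, not part of memcached or Loki.

```shell
# Hypothetical helper: summarise memcached "stats" output read from stdin.
memcached_summary() {
  awk '
    # Each stat line looks like "STAT <name> <value>" and ends in CRLF.
    $1 == "STAT" { sub(/\r$/, "", $3); v[$2] = $3 }
    END {
      # Hit ratio = get_hits / cmd_get, skipped if no gets were issued.
      if (v["cmd_get"] + 0 > 0)
        printf "hit_ratio=%.1f%%\n", 100 * v["get_hits"] / v["cmd_get"]
      printf "evictions=%d\n", v["evictions"]
      # limit_maxbytes is reported in bytes; convert to MiB.
      printf "limit_mib=%d\n", v["limit_maxbytes"] / 1048576
    }'
}
```

Used as echo stats | nc 127.0.0.1 11211 | memcached_summary. For the chunk instance output above this works out to a hit ratio of about 31.4%, 19772 evictions and a 64 MiB limit.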

I see the following in the querier pods:

level=error ts=2021-08-17T16:26:45.218691428Z caller=memcached.go:235 msg="failed to put to memcached" name=chunks err="server=10.42.5.10:11211: write tcp 10.42.4.64:49594->10.42.5.10:11211: i/o timeout"

ts=2021-08-17T16:26:45.717102364Z caller=spanlogger.go:87 org_id=fake traceID=3b5b799d0f27974d method=Memcache.GetMulti level=error msg="Failed to get keys from memcached" err="read tcp 10.42.4.64:48388->10.42.5.10:11211: i/o timeout"

The frontend doesn’t mention memcached at all.
Neither does the ingester, but it does log: level=warn ts=2021-08-17T16:08:55.650619791Z caller=experimental.go:19 msg="experimental feature in use" feature="In-memory (FIFO) cache"

Any thoughts on why it may not be working or how I can troubleshoot?

My search performance for any range longer than 3 hours is horrible, so I’m trying to get better performance with caching.

Thanks!

I have the same issue:

ts=2021-09-01T19:46:41.830245033Z caller=spanlogger.go:87 org_id=1 traceID=4132e74cbcc9207e method=Memcache.GetMulti level=error msg="Failed to get keys from memcached" err="read tcp 1.2.3.4:60292->1.2.3.4:11211: i/o timeout"

Load tests against memcached pass fine. I have tried different configurations, pod sizes, and replica counts.

I have the same problem: everything seems fine on the Memcached side, but I get i/o timeout errors.