Wondering if someone can sanity-check my config and possibly shed some light on I/O timeouts to memcached.
I'm using the loki-distributed Helm chart on Kubernetes (1 compactor, 1 distributor, 3 ingesters, 2 gateways, 3 queriers, 3 query frontends, 1 table manager, and one memcached instance each for chunks, index queries, and the frontend).
When I look at the pod logs, only the querier pods reference memcached, and they are the ones logging the I/O timeouts. The IP in those errors belongs to the chunks memcached instance. Is that expected?
storage_config:
  aws:
    s3: s3://lokibucket.s3.us-east-1
    bucketnames: lokibucket
  boltdb_shipper:
    shared_store: s3
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h
  index_queries_cache_config:
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      host: loki-loki-distributed-memcached-index-queries.loki.svc.cluster.local
      service: http
chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      host: loki-loki-distributed-memcached-chunks.loki.svc.cluster.local
      service: http
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m
  cache_results: true
  results_cache:
    cache:
      memcached:
        batch_size: 100
        parallelism: 100
      memcached_client:
        host: loki-loki-distributed-memcached-frontend.loki.svc.cluster.local
        service: http
If I run stats against, say, the chunks instance, I see active connections (all 3 memcached instances show current connections):
echo stats | nc 127.0.0.1 11211
STAT pid 1
STAT uptime 82280
STAT time 1629219205
STAT version 1.6.10
STAT libevent 2.1.12-stable
STAT pointer_size 64
STAT rusage_user 7.186096
STAT rusage_system 13.927203
STAT max_connections 1024
STAT curr_connections 33
STAT total_connections 8563
STAT rejected_connections 0
STAT connection_structures 52
STAT response_obj_oom 0
STAT response_obj_count 1
STAT response_obj_bytes 65536
STAT read_buf_count 33
STAT read_buf_bytes 540672
STAT read_buf_bytes_free 458752
STAT read_buf_oom 0
STAT reserved_fds 20
STAT cmd_get 38988
STAT cmd_set 26965
STAT cmd_flush 0
STAT cmd_touch 0
STAT cmd_meta 0
STAT get_hits 12235
STAT get_misses 26753
STAT get_expired 0
STAT get_flushed 0
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 9828550092
STAT bytes_written 3137411488
STAT limit_maxbytes 67108864
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT time_in_listen_disabled_us 0
STAT threads 4
STAT conn_yields 0
STAT hash_power_level 16
STAT hash_bytes 524288
STAT hash_is_expanding 0
STAT slab_reassign_rescues 20
STAT slab_reassign_chunk_rescues 0
STAT slab_reassign_evictions_nomem 123
STAT slab_reassign_inline_reclaim 1
STAT slab_reassign_busy_items 6
STAT slab_reassign_busy_deletes 0
STAT slab_reassign_running 0
STAT slabs_moved 73
STAT lru_crawler_running 0
STAT lru_crawler_starts 52
STAT lru_maintainer_juggles 193726
STAT malloc_fails 0
STAT log_worker_dropped 0
STAT log_worker_written 0
STAT log_watcher_skipped 0
STAT log_watcher_sent 0
STAT unexpected_napi_ids 0
STAT round_robin_fallback 0
STAT bytes 45260110
STAT curr_items 200
STAT total_items 26985
STAT slab_global_page_pool 0
STAT expired_unfetched 0
STAT evicted_unfetched 16715
STAT evicted_active 42
STAT evictions 19772
STAT reclaimed 0
STAT crawler_reclaimed 0
STAT crawler_items_checked 44
STAT lrutail_reflocked 2751
STAT moves_to_cold 26787
STAT moves_to_warm 2645
STAT moves_within_lru 2352
STAT direct_reclaims 20281
STAT lru_bumps_dropped 0
END
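To make that output easier to read, here is a minimal Python sketch that pulls out a few numbers I think are relevant: hit ratio, how full the cache is, and the eviction count. The function names and the trimmed SAMPLE text are my own, not part of memcached or Loki; the values are copied from the stats above.

```python
def parse_stats(text):
    """Parse 'STAT name value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def summarize(stats):
    """Derive hit ratio, memory fill and eviction count from parsed stats."""
    gets = int(stats["cmd_get"])
    hits = int(stats["get_hits"])
    limit = int(stats["limit_maxbytes"])
    return {
        "hit_ratio": hits / gets if gets else 0.0,
        "fill_ratio": int(stats["bytes"]) / limit,
        "limit_mib": limit / (1024 * 1024),
        "evictions": int(stats["evictions"]),
    }


# Trimmed copy of the chunks instance output (only the lines used here).
SAMPLE = """\
STAT cmd_get 38988
STAT get_hits 12235
STAT bytes 45260110
STAT limit_maxbytes 67108864
STAT evictions 19772
END
"""

if __name__ == "__main__":
    for key, value in summarize(parse_stats(SAMPLE)).items():
        print(key, value)
```

On these numbers the hit ratio is roughly 31%, the memory limit is 64 MiB, and there have been 19772 evictions, so the chunks cache may simply be too small for what is being pushed into it. That is only my reading of the stats, not something I have confirmed.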
I see the following in the querier pods:
level=error ts=2021-08-17T16:26:45.218691428Z caller=memcached.go:235 msg="failed to put to memcached" name=chunks err="server=10.42.5.10:11211: write tcp 10.42.4.64:49594->10.42.5.10:11211: i/o timeout"
ts=2021-08-17T16:26:45.717102364Z caller=spanlogger.go:87 org_id=fake traceID=3b5b799d0f27974d method=Memcache.GetMulti level=error msg="Failed to get keys from memcached" err="read tcp 10.42.4.64:48388->10.42.5.10:11211: i/o timeout"
The query frontend logs don't mention memcached at all.
Neither do the ingester logs, but they do contain this: level=warn ts=2021-08-17T16:08:55.650619791Z caller=experimental.go:19 msg="experimental feature in use" feature="In-memory (FIFO) cache"
Any thoughts on why it may not be working or how I can troubleshoot?
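One check I was thinking of running is a timed round trip to memcached from the same network location as the querier, to separate network latency from a client timeout that is too tight. Below is a rough Python sketch of that idea. It assumes Python is available wherever it runs (for example a debug container, which is my assumption and not something the chart provides), and `probe` is just a hypothetical helper name. It sends the memcached text-protocol `version` command and measures how long the reply takes.

```python
import socket
import sys
import time


def probe(host, port=11211, timeout=1.0):
    """Send the memcached 'version' command and return (reply, seconds)."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"version\r\n")
        reply = conn.recv(1024)  # raises a timeout error if no reply arrives in time
    return reply.decode().strip(), time.monotonic() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    # The target host is passed on the command line; nothing is hardcoded.
    for _ in range(5):
        try:
            reply, elapsed = probe(sys.argv[1])
            print(f"{reply} in {elapsed * 1000:.1f} ms")
        except OSError as exc:  # covers timeouts, DNS failures and refused connections
            print(f"probe failed: {exc}")
```

Saved under a name of your choice, it would be invoked with the service name from the config as its only argument, e.g. loki-loki-distributed-memcached-chunks.loki.svc.cluster.local. A tiny `version` request only measures the basic round trip, so it would not reproduce slowness caused by large chunk writes.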
My search performance for any time range longer than 3 hours is horrible, so I'm trying to improve it with caching.
Thanks!