Loki EKS IPv6 - Too many Colons

Hey,

I am trying to run Loki on EKS with IPv6. After following a number of GitHub issues (in particular "Loki unable to start on IPv6 EKS cluster", Issue #6251 in grafana/loki), I was able to get Loki to start. However, when I try to write logs, I get the following errors related to the ingester:

level=warn ts=2023-07-04T12:05:27.754911816Z caller=pool.go:193 msg="removing ingester failing healthcheck" addr=2600:1f11:908:b900:f342::a:9095 reason="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: address 2600:1f11:908:b900:f342::a:9095: too many colons in address\""
level=warn ts=2023-07-04T12:05:27.754946634Z caller=pool.go:193 msg="removing ingester failing healthcheck" addr=2600:1f11:908:b901:de7::11:9095 reason="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: address 2600:1f11:908:b901:de7::11:9095: too many colons in address\""
level=warn ts=2023-07-04T12:05:38.964732582Z caller=logging.go:86 traceID=2e07840a3b835c4a orgID=fake msg="POST /loki/api/v1/push (500) 3.529494ms Response: \"rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp: address 2600:1f11:908:b901:de7::11:9095: too many colons in address\\\"\\n\" ws: false; Connection: close; Content-Length: 152470; Content-Type: application/x-protobuf; User-Agent: GrafanaAgent/; "
level=warn ts=2023-07-04T12:05:42.755648798Z caller=pool.go:193 msg="removing ingester failing healthcheck" addr=2600:1f11:908:b901:de7::11:9095 reason="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: address 2600:1f11:908:b901:de7::11:9095: too many colons in address\""
level=warn ts=2023-07-04T12:05:42.755649919Z caller=pool.go:193 msg="removing ingester failing healthcheck" addr=2600:1f11:908:b900:f342::a:9095 reason="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: address 2600:1f11:908:b900:f342::a:9095: too many colons in address\""
level=warn ts=2023-07-04T12:05:59.539257414Z caller=logging.go:86 traceID=5301968a412a8937 orgID=fake msg="POST /api/prom/push (500) 3.698641ms Response: \"rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp: address 2600:1f11:908:b901:de7::11:9095: too many colons in address\\\"\\n\" ws: false; Content-Length: 168152; Content-Type: application/x-protobuf; User-Agent: promtail/; X-Scope-Orgid: \"\"; "
level=warn ts=2023-07-04T12:05:59.774633519Z caller=logging.go:86 traceID=3983451bd44bc58e orgID=fake msg="POST /api/prom/push (500) 31.088504ms Response: \"rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp: address 2600:1f11:908:b901:de7::11:9095: too many colons in address\\\"\\n\" ws: false; Content-Length: 212772; Content-Type: application/x-protobuf; User-Agent: promtail/; X-Scope-Orgid: \"\"; "
level=warn ts=2023-07-04T12:06:00.698643077Z caller=logging.go:86 traceID=5f291bdf0cd34693 orgID=fake msg="POST /api/prom/push (500) 5.670154ms Response: \"rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp: address 2600:1f11:908:b901:de7::11:9095: too many colons in address\\\"\\n\" ws: false; Content-Length: 212772; Content-Type: application/x-protobuf; User-Agent: promtail/; X-Scope-Orgid: \"\"; "
level=warn ts=2023-07-04T12:06:02.359378802Z caller=logging.go:86 traceID=6324c90323d18b4a orgID=fake msg="POST /api/prom/push (500) 4.75841ms Response: \"rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp: address 2600:1f11:908:b901:de7::11:9095: too many colons in address\\\"\\n\" ws: false; Content-Length: 212772; Content-Type: application/x-protobuf; User-Agent: promtail/; X-Scope-Orgid: \"\"; "
level=warn ts=2023-07-04T12:06:03.705543294Z caller=logging.go:86 traceID=612a7729ca525377 orgID=fake msg="POST /loki/api/v1/push (500) 1.056221ms Response: \"rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp: address 2600:1f11:908:b900:f342::a:9095: too many colons in address\\\"\\n\" ws: false; Connection: close; Content-Length: 31398; Content-Type: application/x-protobuf; User-Agent: GrafanaAgent/; "
level=warn ts=2023-07-04T12:06:06.095292975Z caller=logging.go:86 traceID=6539e2f39d446d32 orgID=fake msg="POST /api/prom/push (500) 10.102688ms Response: \"rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp: address 2600:1f11:908:b901:de7::11:9095: too many colons in address\\\"\\n\" ws: false; Content-Length: 212772; Content-Type: application/x-protobuf; User-Agent: promtail/; X-Scope-Orgid: \"\"; "
level=warn ts=2023-07-04T12:06:12.471044607Z caller=logging.go:86 traceID=6437e80eac365c40 orgID=fake msg="POST /api/prom/push (500) 5.533771ms Response: \"rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp: address 2600:1f11:908:b900:f342::a:9095: too many colons in address\\\"\\n\" ws: false; Content-Length: 212772; Content-Type: application/x-protobuf; User-Agent: promtail/; X-Scope-Orgid: \"\"; "
level=info ts=2023-07-04T12:06:12.743950433Z caller=table_manager.go:134 msg="uploading tables"
level=info ts=2023-07-04T12:06:12.746168582Z caller=table_manager.go:166 msg="handing over indexes to shipper"
level=warn ts=2023-07-04T12:06:12.754793371Z caller=pool.go:193 msg="removing ingester failing healthcheck" addr=2600:1f11:908:b900:f342::a:9095 reason="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: address 2600:1f11:908:b900:f342::a:9095: too many colons in address\""
level=warn ts=2023-07-04T12:06:12.754813794Z caller=pool.go:193 msg="removing ingester failing healthcheck" addr=2600:1f11:908:b901:de7::11:9095 reason="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: address 2600:1f11:908:b901:de7::11:9095: too many colons in address\""
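
For anyone wondering where the message comes from: "too many colons in address" is the error Go's standard `net` package returns when it is asked to split a host:port string whose host part is an unbracketed IPv6 literal. The sketch below reproduces that with one of the addresses from the log. `naiveJoin` and `parses` are hypothetical helpers made up for illustration, and the sketch says nothing about where Loki itself assembles the address.

```go
package main

import (
	"fmt"
	"net"
)

// naiveJoin glues host and port together with a colon. This is only safe for
// hostnames and IPv4 literals. (Hypothetical helper, for illustration only.)
func naiveJoin(host, port string) string {
	return host + ":" + port
}

// parses reports whether addr can be split into host and port, which Go has
// to do before it can dial the address.
func parses(addr string) bool {
	_, _, err := net.SplitHostPort(addr)
	return err == nil
}

func main() {
	host, port := "2600:1f11:908:b900:f342::a", "9095"

	// Unbracketed: the parser cannot tell which colon separates the port.
	bad := naiveJoin(host, port)
	_, _, err := net.SplitHostPort(bad)
	fmt.Println(bad, "->", err)

	// net.JoinHostPort wraps IPv6 literals in square brackets.
	good := net.JoinHostPort(host, port)
	fmt.Println(good, "-> parses:", parses(good))
}
```

`net.JoinHostPort` only adds the brackets when the host contains a colon, so the same call is also safe for IPv4 addresses and hostnames.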

I have also tried adding brackets around the IPv6 pod IP, as outlined in "simple scalable deployment: rpc error: code = Unimplemented desc = unknown service logproto.Querier" (Issue #5578 in grafana/loki), but when I do that the pods fail to start.

Has anyone seen this behaviour?

Thanks

Hey,

I am currently seeing the same behaviour in an IPv6-only EKS cluster (k8s version 1.25). After many hours of debugging, I switched to singleBinary mode first. With singleBinary mode and an inmemory ring, it works:

    common:
      instance_addr: "[${MY_POD_IP}]"
      ring:
        kvstore:
          store: inmemory
        instance_addr: "[${MY_POD_IP}]"
    memberlist:
      join_members:
      - loki-memberlist
      bind_addr:
      - ${MY_POD_IP}

Nevertheless, I would prefer to run Loki in microservice mode.

Best

Hey affirm0449,

FYI, after a lot of trial and error I was able to get the square brackets in the right place to get this working in microservice mode, although with a single gateway. I ended up explicitly defining config.yaml so I could overwrite the scheduler address for frontend/frontend_worker. It did not work correctly when I tried to do it in values.yaml.

This is the config I’m using now, and it is working correctly. It also works with an S3 backend using IRSA, which was another issue I couldn’t find a straightforward answer for in the documentation or GitHub issues.


loki:
  config: |
    auth_enabled: false
    common:
      compactor_address: 'loki-backend'
      instance_addr: ${MY_POD_IP}
      path_prefix: /var/loki
      replication_factor: 3
      ring:
        instance_addr: '[${MY_POD_IP}]'
        kvstore:
          store: memberlist
      storage:
        s3:
          bucketnames: {{ .Values.loki_bucket_name }}
          insecure: false
          region: {{ .Values.loki_bucket_region }}
          s3forcepathstyle: false

    frontend_worker:
      frontend_address: '[${MY_POD_IP}]:9095'

    index_gateway:
      mode: simple
    limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      split_queries_by_interval: 15m
    memberlist:
      join_members:
      - loki-memberlist
    query_range:
      align_queries_with_step: true
    ruler:
      storage:
        s3:
          bucketnames: {{ .Values.loki_bucket_name }}
          insecure: false
          region: {{ .Values.loki_bucket_region }}
          s3forcepathstyle: false
        type: s3
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - from: "2022-01-11"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: boltdb-shipper
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100
    storage_config:
      aws:
        bucketnames: {{ .Values.loki_bucket_name }}
        insecure: false
        region: ca-central-1
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0
      
enable: "true"
serviceAccount:
  create: false
  name: loki-sa

query_scheduler:
  use_scheduler_ring: false

write:
  replicas: 2
  extraArgs:
    - -config.expand-env=true
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
read:
  replicas: 1
  extraArgs:
    - -config.expand-env=true
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP

backend:
  replicas: 1
  extraArgs:
    - -config.expand-env=true
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP

gateway:
  replicas: 1
  extraArgs:
    - -config.expand-env=true
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/ssl-redirect: "443"    
      alb.ingress.kubernetes.io/backend-protocol: HTTP
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/ip-address-type: dualstack
      alb.ingress.kubernetes.io/target-type: ip            
    hosts:
      - host: loki.{{ .Values.env }}.{{ .Values.domain }}
        paths:
          - path: /
            pathType: Prefix
    tls:
      - hosts:
          - loki.{{ .Values.env }}.{{ .Values.domain }}
singleBinary:
  extraArgs:
    - -config.expand-env=true
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP

minio:
    enabled: false
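
To see why the bracket placement matters once `-config.expand-env=true` is set, here is a small Go sketch of what a templated address turns into after the pod IP is substituted. It uses `os.Expand` purely as a stand-in for Loki's own environment expansion (an assumption; Loki's implementation may differ in details), a made-up pod IP, and the hypothetical helpers `expand` and `hostOf`.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// expand substitutes ${NAME} references in tmpl from env. os.Expand stands in
// for Loki's own expansion here; this is an assumption for illustration.
func expand(tmpl string, env map[string]string) string {
	return os.Expand(tmpl, func(name string) string { return env[name] })
}

// hostOf returns the host part of an expanded host:port value, or the parse
// error if the value is not a valid host:port string.
func hostOf(addr string) (string, error) {
	host, _, err := net.SplitHostPort(addr)
	return host, err
}

func main() {
	// Made-up pod IP in the same style as the addresses in the logs.
	env := map[string]string{"MY_POD_IP": "2600:1f11:908:b901:de7::11"}

	for _, tmpl := range []string{"${MY_POD_IP}:9095", "[${MY_POD_IP}]:9095"} {
		addr := expand(tmpl, env)
		host, err := hostOf(addr)
		fmt.Printf("template=%s\n  expanded=%s\n  host=%q err=%v\n", tmpl, addr, host, err)
	}
}
```

Only the bracketed template expands to something that splits cleanly into host and port; the unbracketed one produces the same "too many colons" error seen in the original logs.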
  

Hey kevinsagle,

Thanks for that hint! I will try this next week.

It seems the Grafana Loki devs are already working on better IPv6 handling. I do not know when the next Loki release will come out, but I hope it will improve this.

Best

Hey,

Is there any update on IPv6 handling in Loki? I cannot find it in any new release.