  1. OpenShift Bugs
  2. OCPBUGS-76530

intermittent etcd peer communication failures

    • Bug
    • Resolution: Unresolved
    • Normal
    • 4.20
    • HyperShift

      Description of problem:

      etcd peer connections are intermittently rejected because the etcd-client headless service adds a second PTR record for each etcd pod IP.

      Version-Release number of selected component (if applicable):

      How reproducible:

      Reproducible on a standard bare-metal (BM) hosted control plane (HCP) deployment.

      Steps to Reproduce:

      In any deployment on the bare-metal platform, etcd pods continuously log warnings like this:

      {"level":"warn","ts":"...","caller":"embed/config_logging.go:168","msg":"rejected connection on peer endpoint","remote-addr":"10.133.3.83:36982","server-name":"etcd-0.etcd-discovery.hosted-nested.svc","ip-addresses":[],"dns-names":["*.etcd-discovery.hosted-nested.svc","*.etcd-discovery.hosted-nested.svc.cluster.local","127.0.0.1","::1"],"error":"tls: \"10.133.3.83\" does not match any of DNSNames [\"*.etcd-discovery.hosted-nested.svc\" \"*.etcd-discovery.hosted-nested.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}
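      To see how often each peer is rejected, the warning can be decoded and the peer IP pulled out of "remote-addr". The following is a minimal sketch, not part of etcd: the rejection type, parseRejection, and sampleLine are hypothetical names, and sampleLine is the warning above shortened to the fields used here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// rejection holds the fields of interest from the etcd warning.
// (Hypothetical helper type for this sketch.)
type rejection struct {
	Msg        string `json:"msg"`
	RemoteAddr string `json:"remote-addr"`
	ServerName string `json:"server-name"`
}

// parseRejection decodes one JSON log line and returns the entry
// together with the peer IP (the port is stripped from remote-addr).
func parseRejection(line string) (rejection, string, error) {
	var r rejection
	if err := json.Unmarshal([]byte(line), &r); err != nil {
		return r, "", err
	}
	ip, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		return r, "", err
	}
	return r, ip, nil
}

// sampleLine is the warning from this report, shortened to the fields used here.
const sampleLine = `{"level":"warn","msg":"rejected connection on peer endpoint","remote-addr":"10.133.3.83:36982","server-name":"etcd-0.etcd-discovery.hosted-nested.svc"}`

func main() {
	r, ip, err := parseRejection(sampleLine)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s: peer %s dialed %s\n", r.Msg, ip, r.ServerName)
}
```

      Feeding each warning line from the etcd pod logs through parseRejection and counting by IP would show whether every peer is affected or only some.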

      1. Two headless services in the HCP namespace select the same etcd pods:
      etcd-discovery (ClusterIP: None, selector: app=etcd, ports: 2380, 2379)
      etcd-client (ClusterIP: None, selector: app=etcd, ports: 2379, 2381)

      2. CoreDNS creates two PTR records per etcd pod IP:
      10.130.4.10 → etcd-2.etcd-discovery.{ns}.svc.cluster.local
      10.130.4.10 → 10-130-4-10.etcd-client.{ns}.svc.cluster.local

      3. The peer certificate SANs only cover etcd-discovery:
      DNS:*.etcd-discovery.{ns}.svc
      DNS:*.etcd-discovery.{ns}.svc.cluster.local
      DNS:127.0.0.1
      DNS:::1

      4. When a peer connects, etcd's checkSAN() (in client/pkg/transport/listener_tls.go) calls isHostInDNS(), which performs a reverse DNS lookup via net.DefaultResolver.LookupAddr(). Go uses getnameinfo(), which returns only one PTR record per call.
      Which PTR record is returned is non-deterministic. When getnameinfo() returns the etcd-client record, the wildcard match against *.etcd-discovery... fails and the connection is rejected.
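
      The dependence on which PTR record comes back can be illustrated with a small simulation. This is a sketch under stated assumptions, not etcd's actual code: lookupFunc and peerAllowed are hypothetical names, the reverse lookup is replaced by stubs so that the returned record is controlled, the namespace is written as the placeholder "ns", and wildcard SANs are compared with Go's path.Match, which may differ in detail from etcd's matcher.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// lookupFunc stands in for a reverse DNS lookup (hypothetical; it lets the
// sketch control which PTR names are returned for an IP).
type lookupFunc func(ip string) ([]string, error)

// sans mirrors the peer certificate SANs from step 3, with "ns" as a placeholder namespace.
var sans = []string{
	"*.etcd-discovery.ns.svc",
	"*.etcd-discovery.ns.svc.cluster.local",
	"127.0.0.1",
	"::1",
}

// Stub resolvers: each returns a fixed set of PTR names for any IP.
func onlyDiscovery(string) ([]string, error) {
	return []string{"etcd-2.etcd-discovery.ns.svc.cluster.local."}, nil
}

func onlyClient(string) ([]string, error) {
	return []string{"10-130-4-10.etcd-client.ns.svc.cluster.local."}, nil
}

func bothRecords(string) ([]string, error) {
	return []string{
		"10-130-4-10.etcd-client.ns.svc.cluster.local.",
		"etcd-2.etcd-discovery.ns.svc.cluster.local.",
	}, nil
}

// peerAllowed reports whether any PTR name returned for ip matches one of
// the wildcard SANs. Non-wildcard SANs are ignored in this sketch.
func peerAllowed(lookup lookupFunc, ip string, sans []string) bool {
	names, err := lookup(ip)
	if err != nil {
		return false
	}
	for _, name := range names {
		name = strings.TrimSuffix(name, ".") // PTR names end with a dot
		for _, san := range sans {
			if !strings.Contains(san, "*") {
				continue
			}
			if ok, _ := path.Match(san, name); ok {
				return true
			}
		}
	}
	return false
}

func main() {
	ip := "10.130.4.10"
	fmt.Println(peerAllowed(onlyDiscovery, ip, sans)) // true
	fmt.Println(peerAllowed(onlyClient, ip, sans))    // false
	fmt.Println(peerAllowed(bothRecords, ip, sans))   // true
}
```

      In this sketch the same peer IP is accepted or rejected purely according to which record the stub returns, and a resolver that returned both records would always be accepted.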

              Unassigned
              mskrivan@redhat.com Michal Skrivanek
              Yu Li