Bug
Resolution: Unresolved
Normal
None
4.20
Description of problem:
etcd peer connections are intermittently rejected because the etcd-client headless service adds a second PTR record for each etcd pod IP, alongside the one from etcd-discovery.
Version-Release number of selected component (if applicable):
How reproducible:
Reproducible in a standard bare metal (BM) HCP deployment.
Steps to Reproduce:
In any deployment on the bare metal platform, etcd pods continuously log warnings like:
{"level":"warn","ts":"...","caller":"embed/config_logging.go:168","msg":"rejected connection on peer endpoint","remote-addr":"10.133.3.83:36982","server-name":"etcd-0.etcd-discovery.hosted-nested.svc","ip-addresses":[],"dns-names":["*.etcd-discovery.hosted-nested.svc","*.etcd-discovery.hosted-nested.svc.cluster.local","127.0.0.1","::1"],"error":"tls: \"10.133.3.83\" does not match any of DNSNames [\"*.etcd-discovery.hosted-nested.svc\" \"*.etcd-discovery.hosted-nested.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}
1. Two headless services in the HCP namespace select the same etcd pods:
etcd-discovery (ClusterIP: None, selector: app=etcd, ports: 2380, 2379)
etcd-client (ClusterIP: None, selector: app=etcd, ports: 2379, 2381)
2. CoreDNS creates two PTR records per etcd pod IP:
10.130.4.10 → etcd-2.etcd-discovery.{ns}.svc.cluster.local
10.130.4.10 → 10-130-4-10.etcd-client.{ns}.svc.cluster.local
3. The peer certificate SANs only cover etcd-discovery:
DNS:*.etcd-discovery.{ns}.svc
DNS:*.etcd-discovery.{ns}.svc.cluster.local
DNS:127.0.0.1
DNS:::1
4. When a peer connects, etcd's checkSAN() (in client/pkg/transport/listener_tls.go) calls isHostInDNS(), which does a reverse DNS lookup via net.DefaultResolver.LookupAddr(). When Go uses the host C library resolver, this lookup goes through getnameinfo(), which returns at most one PTR name per call.
Which of the two PTR records is returned is non-deterministic. When getnameinfo() returns the etcd-client record, the wildcard match against *.etcd-discovery... fails, and the connection is rejected.
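The failure mode in step 4 can be illustrated with a small Go sketch. This is a simplified model of the wildcard comparison, not etcd's actual code: the function name ptrMatchesWildcardSAN is hypothetical, the namespace hosted-nested is taken from the log line above, and the pod name and IP are the examples from step 2. It only shows why one of the two PTR names passes the check while the other cannot.

```go
package main

import (
	"fmt"
	"strings"
)

// ptrMatchesWildcardSAN is a hypothetical helper that models the check
// described in step 4: it reports whether a PTR name returned by a reverse
// lookup falls under any wildcard DNS SAN (for example "*.etcd-discovery.ns.svc").
// Non-wildcard SANs are ignored here to keep the sketch small.
func ptrMatchesWildcardSAN(ptrName string, sans []string) bool {
	// PTR answers are usually fully qualified and end with a trailing dot.
	name := strings.TrimSuffix(ptrName, ".")
	for _, san := range sans {
		if !strings.HasPrefix(san, "*.") {
			continue
		}
		// san[1:] drops the "*" and keeps the leading ".", e.g. ".etcd-discovery.ns.svc".
		if strings.HasSuffix(name, san[1:]) {
			return true
		}
	}
	return false
}

func main() {
	// SANs as listed in step 3, with {ns} assumed to be "hosted-nested".
	sans := []string{
		"*.etcd-discovery.hosted-nested.svc",
		"*.etcd-discovery.hosted-nested.svc.cluster.local",
		"127.0.0.1",
		"::1",
	}

	// The two PTR names CoreDNS serves for the same pod IP (step 2).
	discovery := "etcd-2.etcd-discovery.hosted-nested.svc.cluster.local."
	client := "10-130-4-10.etcd-client.hosted-nested.svc.cluster.local."

	fmt.Println(discovery, "->", ptrMatchesWildcardSAN(discovery, sans)) // true
	fmt.Println(client, "->", ptrMatchesWildcardSAN(client, sans))       // false
}
```

Because the reverse lookup hands back only one of the two names, the outcome of this comparison depends on which name is returned, which matches the intermittent rejections in the log.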