OpenShift Bugs / OCPBUGS-17446

Wrong advertise address is used in hosted control plane etcd

Details

    • Bug
    • Resolution: Done-Errata
    • Undefined
    • 4.14.0
    • 4.14
    • HyperShift
    • None
    • Moderate
    • No
    • False

    Description

      Description of problem:

      The advertise address configured for the members of our hosted control plane (HCP) etcd clusters (e.g. etcd-0.etcd-client.namespace.svc:2379) is not resolvable via DNS. This breaks etcd tooling that expects to reach each member at its advertise address.
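
      A minimal sketch for checking whether the host in an advertise address resolves. The helper names `url_host` and `resolves` are hypothetical, and the sketch assumes `getent` is available wherever it runs, which may not be the case inside the etcd container.

```shell
# Extract the host part from a client URL such as https://host:2379.
# Hypothetical helper; does not handle bracketed IPv6 literals.
url_host() {
  local url="$1"
  url="${url#*://}"          # drop the scheme
  url="${url%%/*}"           # drop any path
  printf '%s\n' "${url%:*}"  # drop the port, if present
}

# Succeed if the name resolves through the system resolver.
# Assumes getent is installed.
resolves() {
  getent hosts "$1" >/dev/null 2>&1
}

# Example usage (not executed here):
#   host="$(url_host "https://etcd-0.etcd-client.clusters-test-cluster.svc:2379")"
#   if resolves "$host"; then echo "$host resolves"; else echo "$host does not resolve"; fi
```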

      Version-Release number of selected component (if applicable):

      4.14 (and earlier)

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create a HostedCluster and wait for it to come up.
      2. Exec into an etcd pod and query cluster endpoint health:
         $ oc rsh etcd-0
         $ etcdctl --cacert /etc/etcd/tls/etcd-ca/ca.crt \
                   --cert /etc/etcd/tls/server/server.crt \
                   --key /etc/etcd/tls/server/server.key \
                   --endpoints https://localhost:2379 \
                   endpoint health --cluster -w table
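
      To see which client URLs each member advertises, the output of `etcdctl member list` can be filtered. This is a sketch, assuming etcdctl 3.4 or later and its default output format, where each member is one line of comma-separated fields and the client URLs are the fifth field. The function name `member_client_urls` is hypothetical.

```shell
# Print the advertised client URLs, one per line, from
# `etcdctl member list` output read on stdin.
# Assumes the default "simple" format:
#   ID, status, name, peer URLs, client URLs, isLearner
member_client_urls() {
  awk -F', ' '{
    # A member may advertise several client URLs, joined by commas.
    n = split($5, urls, ",")
    for (i = 1; i <= n; i++) print urls[i]
  }'
}

# Example usage inside the etcd pod (same TLS flags as in the step above):
#   etcdctl --cacert /etc/etcd/tls/etcd-ca/ca.crt \
#           --cert /etc/etcd/tls/server/server.crt \
#           --key /etc/etcd/tls/server/server.key \
#           --endpoints https://localhost:2379 \
#           member list | member_client_urls
```

      `endpoint health --cluster` dials these URLs, so any entry whose host is not resolvable produces the failure described in this report.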
      

      Actual results:

      An error is returned similar to:
      {"level":"warn","ts":"2023-08-07T20:40:49.890254Z","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000378fc0/etcd-0.etcd-client.clusters-test-cluster.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp: lookup etcd-0.etcd-client.clusters-test-cluster.svc on 172.30.0.10:53: no such host\""}
      

      Expected results:

      Actual cluster health is returned:
      +--------------------------------------------------------------+--------+-------------+-------+
      |                           ENDPOINT                           | HEALTH |    TOOK     | ERROR |
      +--------------------------------------------------------------+--------+-------------+-------+
      | https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2379 |   true |  9.372168ms |       |
      | https://etcd-2.etcd-discovery.clusters-cewong-guest.svc:2379 |   true | 12.269226ms |       |
      | https://etcd-1.etcd-discovery.clusters-cewong-guest.svc:2379 |   true | 12.291392ms |       |
      +--------------------------------------------------------------+--------+-------------+-------+

      Additional info:

      The etcd StatefulSet is created with spec.serviceName set to `etcd-discovery`. As a result, pods in the StatefulSet get their subdomain set to `etcd-discovery`, and names like etcd-0.etcd-discovery.[ns].svc are resolvable. The same is not true for the etcd-client service: etcd-0.etcd-client.[ns].svc is not resolvable. The fix would be to change the advertise address of each member to a resolvable name (i.e. etcd-0.etcd-discovery.[ns].svc) and to adjust the server certificate to allow those names as well.
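
      The proposed naming can be sketched as follows. This only illustrates how a resolvable advertise URL is composed from the pod name, the StatefulSet's governing service, and the namespace. It is not the actual HyperShift change, and the function name `advertise_client_url` is hypothetical.

```shell
# Compose a per-member client URL that resolves through the StatefulSet's
# governing (headless) service. Hypothetical helper for illustration only.
#   $1: pod name, e.g. etcd-0
#   $2: namespace of the hosted control plane
#   $3: governing service name (defaults to etcd-discovery, per this report)
advertise_client_url() {
  local pod="$1" namespace="$2" service="${3:-etcd-discovery}"
  printf 'https://%s.%s.%s.svc:2379\n' "$pod" "$service" "$namespace"
}

# Example usage:
#   advertise_client_url etcd-0 clusters-test-cluster
```

      The governing service name can be confirmed with `oc get statefulset etcd -n [ns] -o jsonpath='{.spec.serviceName}'`. Any name used as an advertise address also has to appear in the subject alternative names of the server certificate.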

          People

            cewong@redhat.com Cesar Wong
            Jie Zhao