Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-31512

Cluster-network-operator doesn't use node local kube-apiserver loadbalancer when templating in cluster resources

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • 4.16.0
    • 4.14, 4.15, 4.16
    • Networking / multus
    • None
    • Important
    • No
    • Rejected
    • False
    • Hide

      None

      Show
      None

    Description

      Description of problem:

      The cluster-network-operator in hypershift when templating in cluster resources does not use the node local address of the client side haproxy load balancer that runs on all nodes. This bypasses a level of health checks for the backend redundant apiserver addresses that is performed by the local kube-apiserver-proxy pods that run on every node in a hypershift environment. In environments where the backend api servers are not fronted through an additional cloud load balancer: this leads to a percentage of request failures from the in cluster components occuring when a control plane endpoint goes down even if other endpoints are available. 

      Version-Release number of selected component (if applicable):

        4.16 4.15 4.14

      How reproducible:

          100%

      Steps to Reproduce:

          1. Setup a hypershift cluster in a baremetal/non cloud environment where there are redundant API servers behind a DNS that point directly to the node IPs.
          2. Power down one of the control plane nodes
          3. Schedule workload into cluster that depends on kube-proxy and/or multus to setup networking configuration
          4. You will see errors like the following 
      ```
      add): Multus: [openshiftai/moe-8b-cmisale-master-0/9c1fd369-94f5-481c-a0de-ba81a3ee3583]: error getting pod: Get "https://[p9d81ad32fcdb92dbb598-6b64a6ccc9c596bf59a86625d8fa2202-c000.us-east.satellite.appdomain.cloud]:30026/api/v1/namespaces/openshiftai/pods/moe-8b-cmisale-master-0?timeout=1m0s": dial tcp 192.168.98.203:30026: connect: timeout
      ```
          

      Actual results:

          When a control plane node fails intermittent timeouts occur when kube-proxy/multus resolve the dns and a failed control plane node ip is returned

      Expected results:

          No requests fail (which will occur if all traffic is routed through the node local load balancer instance

      Additional info:

          Additionally: control plane components in the management cluster that live next to the apiserver are adding uneeded dependencies by using an external DNS entry to talk to the kube-apiserver when it can use the local kube-apiserver address to have it all go over cluster local networking

      Attachments

        Issue Links

          Activity

            People

              lisowskiibm Tyler Lisowski
              lisowskiibm Tyler Lisowski
              Weibin Liang Weibin Liang
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: