OCPBUGS-18336 (OpenShift Bugs)

HA konnectivity server causes connectivity issues from kas to worker kubelets

    Description

      Description of problem:

      
      On August 24th, a bugfix for OCPBUGS-16813 was merged into the hypershift repo (https://github.com/openshift/hypershift/pull/2942). It changed the konnectivity server deployment within the HCP namespace: instead of a single konnectivity server, there are now multiple servers when HA HCPs are in use.
      
      The konnectivity agents on the HCP worker nodes connect to the server through a route. The agents are supposed to discover all of the HA konnectivity servers through round-robin load balancing on that route: in theory, if an agent connects to the route endpoint enough times, it should eventually discover every server.
      
      With the kubevirt platform, the agents on the worker nodes discover only a single konnectivity server, so the kas on the HCP cannot reliably contact the kubelets on the worker nodes.
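
The discovery behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the server names, the replica count of three, and the `pinned` picker are hypothetical and are not taken from the konnectivity or router code. The `pinned` picker models the observed symptom (every connection lands on the same backend) and is not a diagnosis of the cause.

```python
import itertools


def discover_servers(pick_backend, attempts):
    """Return the set of server IDs an agent has seen after `attempts` connections."""
    seen = set()
    for _ in range(attempts):
        seen.add(pick_backend())
    return seen


def round_robin(servers):
    """Build a picker that hands out backends in strict rotation."""
    rotation = itertools.cycle(servers)
    return lambda: next(rotation)


def pinned(servers):
    """Build a picker that always returns the first backend (models the symptom)."""
    return lambda: servers[0]


# Hypothetical names for three HA konnectivity server replicas.
servers = ["konnectivity-server-0", "konnectivity-server-1", "konnectivity-server-2"]

# Expected behaviour: one full rotation is enough to see every server.
print(sorted(discover_servers(round_robin(servers), attempts=len(servers))))

# Observed behaviour on kubevirt: no number of retries reveals the other servers.
print(sorted(discover_servers(pinned(servers), attempts=100)))
```

With strict rotation, the agent sees all servers after as many connections as there are replicas. If every connection lands on the same backend, retrying never helps.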
      
      As a result, webhooks (and other connections that require the kas (API server) in the HCP to contact worker nodes) fail the majority of the time.
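
A rough calculation shows why "the majority of the time" is plausible. It assumes, purely for illustration, that each kas -> kubelet request is routed through one konnectivity server chosen uniformly at random and that there are three replicas. Neither assumption is confirmed in this report.

```python
from fractions import Fraction


def tunnel_success_rate(total_servers, servers_with_agent):
    """Fraction of requests that reach a server holding a tunnel to the agent.

    Assumes each request picks one server uniformly at random (an assumption,
    not documented behaviour).
    """
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if not 0 <= servers_with_agent <= total_servers:
        raise ValueError("servers_with_agent must be between 0 and total_servers")
    return Fraction(servers_with_agent, total_servers)


# Hypothetical HA setup: 3 servers, agent connected to only 1 of them.
rate = tunnel_success_rate(total_servers=3, servers_with_agent=1)
print(f"success: {rate}, failure: {1 - rate}")  # success: 1/3, failure: 2/3
```

Under these assumptions roughly two out of three requests would fail, which matches the reported symptom.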
      
      

      Version-Release number of selected component (if applicable):

      
      

      How reproducible:

      
      Create a kubevirt platform HCP using the `hcp` CLI tool. This defaults to HA mode, and the cluster never fully rolls out. The ingress, monitoring, and console clusteroperators flap back and forth between failing and succeeding. Usually we see an error about webhook connectivity failing.
      
      During this time, any `oc` command that tunnels a connection through the kas to the kubelets fails the majority of the time. This affects `oc logs`, `oc exec`, etc.
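
To put a number on "the majority of the time", a command can be run repeatedly and its failures counted. The helper below is a minimal sketch. The function name `failure_rate` and its defaults are hypothetical, and it treats any non-zero exit code or timeout as a failure.

```python
import subprocess


def failure_rate(command, attempts=20, timeout=30):
    """Run `command` `attempts` times and return the fraction of runs that failed."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            result = subprocess.run(
                command,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            # A hung tunnel counts as a failed attempt.
            failures += 1
    return failures / attempts
```

For example, `failure_rate(["oc", "logs", "-n", "<namespace>", "<pod>", "--tail=1"])` would exercise the kas -> kubelet path. The namespace and pod are placeholders to fill in for the cluster under test.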
      
      
      Actual results:
      
      kas -> kubelet connections are unreliable
      
      

      Expected results:

      
      kas -> kubelet connections are reliable
      
      

      Additional info:

      
      

          People

            sjenning Seth Jennings
            rhn-engineering-dvossel David Vossel
            Liangquan Li Liangquan Li