OpenShift Bugs / OCPBUGS-44291

Limited Live Migration is stuck because of a connectivity issue between OVN and SDN


    • Critical

      Description of problem:
      The Limited Live Migration can get stuck during the MachineConfig rollout because a PodDisruptionBudget prevents an OpenShift Container Platform 4 - Node from being drained.

      Inspecting the PodDisruptionBudget shows that connectivity between pods of the same service fails when some of the pods run on OpenShift Container Platform 4 - Node(s) that were already moved to OVN and the others run on Node(s) still using OpenShiftSDN.

      Applications such as distributed caches therefore fail to rejoin their cluster, leaving the service degraded and the PodDisruptionBudget in a state that permits no further disruptions. The migration won't proceed until the problem is solved, and that is not possible without forcefully breaking the PodDisruptionBudget.
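To narrow down which PodDisruptionBudget objects are holding up the drain, the following is a minimal sketch, not part of the original report. It assumes the PodDisruptionBudget list has already been exported as JSON (for example with `oc get pdb --all-namespaces -o json`) and relies only on the standard `policy/v1` status fields `disruptionsAllowed`, `currentHealthy` and `desiredHealthy`. The function name `blocking_pdbs` and the sample namespaces are hypothetical.

```python
import json


def blocking_pdbs(pdb_list):
    """Return (namespace, name, currentHealthy, desiredHealthy) for every
    PodDisruptionBudget that currently allows no disruptions."""
    blocked = []
    for item in pdb_list.get("items", []):
        status = item.get("status", {})
        # A missing disruptionsAllowed field is treated as 0, i.e. blocking.
        if status.get("disruptionsAllowed", 0) == 0:
            meta = item["metadata"]
            blocked.append((
                meta["namespace"],
                meta["name"],
                status.get("currentHealthy", 0),
                status.get("desiredHealthy", 0),
            ))
    return blocked


# Hypothetical sample shaped like the JSON list returned by the API.
sample_json = """
{"items": [
  {"metadata": {"namespace": "cache", "name": "cache-pdb"},
   "status": {"disruptionsAllowed": 0, "currentHealthy": 2, "desiredHealthy": 3}},
  {"metadata": {"namespace": "web", "name": "web-pdb"},
   "status": {"disruptionsAllowed": 1, "currentHealthy": 3, "desiredHealthy": 2}}
]}
"""

for ns, name, healthy, desired in blocking_pdbs(json.loads(sample_json)):
    print(f"{ns}/{name}: currentHealthy={healthy} desiredHealthy={desired}")
```

A PodDisruptionBudget listed here with `currentHealthy` below `desiredHealthy` matches the situation described above: the application pods did not become healthy again, so the eviction during the drain is refused.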

      On closer inspection, both communication through Kubernetes services and direct pod-to-pod communication appear to fail, unless the services are running on the same OpenShift Container Platform 4 - Node.
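To check whether failures really follow the same-node versus cross-node pattern described above, probe results can be grouped by where the two endpoints run. The sketch below is illustrative only: the node names, the node-to-network-plugin mapping and the probe tuples are hypothetical inputs that would have to be collected separately, and the functions `classify_probe` and `failure_summary` are not part of any OpenShift tooling.

```python
from collections import Counter


def classify_probe(src_node, dst_node, node_cni):
    """Label a pod-to-pod probe by the placement of its two endpoints."""
    if src_node == dst_node:
        return "same node"
    if node_cni[src_node] == node_cni[dst_node]:
        return "cross-node, same plugin"
    return "cross-node, mixed plugins"


def failure_summary(probes, node_cni):
    """Return {label: (failed, total)} for an iterable of
    (src_node, dst_node, succeeded) probe results."""
    total, failed = Counter(), Counter()
    for src, dst, succeeded in probes:
        label = classify_probe(src, dst, node_cni)
        total[label] += 1
        if not succeeded:
            failed[label] += 1
    return {label: (failed[label], total[label]) for label in total}


# Hypothetical mid-migration state: two nodes already on OVN, one on SDN.
node_cni = {
    "worker-0": "OVNKubernetes",
    "worker-1": "OpenShiftSDN",
    "worker-2": "OVNKubernetes",
}
probes = [
    ("worker-0", "worker-0", True),
    ("worker-0", "worker-1", False),
    ("worker-1", "worker-2", False),
    ("worker-0", "worker-2", True),
]

for label, (bad, count) in failure_summary(probes, node_cni).items():
    print(f"{label}: {bad}/{count} probes failed")
```

If failures show up only in the mixed-plugin group, that points at the path between OVN and OpenShiftSDN nodes. Failures in the same-plugin group as well would suggest a broader cross-node problem.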

      The cause of the issue is unknown at this point, as everything seems to be set up and configured correctly.

      Importantly, such behavior is not observed during regular OpenShift Container Platform 4 updates, which always complete without issue. It is therefore not related to a problematic state of the OpenShift Container Platform 4 - Cluster but rather to the migration itself and the fact that communication suddenly fails.

      Version-Release number of selected component (if applicable):
      OpenShift Container Platform 4.16.16 and OpenShift Container Platform 4.16.18

      How reproducible:
      Random

      Steps to Reproduce:
      1. The steps from https://docs.openshift.com/container-platform/4.16/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html#nw-ovn-kubernetes-live-migration-about_migrate-from-openshift-sdn are run, and then the problem suddenly occurs. The cause is not clear: the procedure worked on other OpenShift Container Platform 4 - Cluster(s) but is failing on the most recent ones.

      Actual results:
      The Limited Live Migration is stuck, unable to proceed because PodDisruptionBudget(s) are blocking the draining of OpenShift Container Platform 4 - Node(s). Importantly, this state arises because applications fail to work due to the connectivity issue, which puts the PodDisruptionBudget into a state where it blocks the MachineConfig rollout.

      Expected results:
      Connectivity between OVN and OpenShiftSDN is expected to work during the entire Limited Live Migration procedure. PodDisruptionBudget(s) should therefore not block OpenShift Container Platform 4 - Node(s) from draining, as application pods should be healthy after restart and hence allow the MachineConfig rollout to proceed.

      Additional info:

      Affected Platforms:
      OpenStack and AWS

              pliurh Peng Liu
              rhn-support-sreber Simon Reber
              Weibin Liang Weibin Liang