OCPBUGS-48358: Solution for OCPBUGS-43740 is not enough for upgrades

    • Priority: Critical

      Description of problem:

      In OCPBUGS-43740, it was necessary to add the {{--cluster-manager-v4-transit-switch-subnet}} and {{--cluster-manager-v6-transit-switch-subnet}} startup options to the ovnkube-node daemonset, because without them the pods fail subnet overlap validation when the default transit switch subnet overlaps an existing cluster or service subnet.

      The problem is that the fix was backported to 4.14.z in a way that only works for the usual multizone OVN-K deployment (the standard one with interconnect enabled), not for the temporary singlezone deployment used during the upgrade. More concretely, PR#2607 only introduced the change in the startup scripts of the ovnkube-script-lib configmap (which are used by the multizone daemonsets), but the singlezone daemonset has its startup command embedded and does not use the library configmap, so it did not pick up the fix.
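
      A quick way to see this asymmetry on a live cluster is to grep for one of the flags in the script library and in the rendered daemonset. This is only a sketch, assuming the configmap and daemonset names mentioned above (during the upgrade, the singlezone daemonset is the one missing the flag):

      $ # the multizone startup scripts carry the flag
      $ oc -n openshift-ovn-kubernetes get cm/ovnkube-script-lib -o yaml | grep cluster-manager-v4-transit-switch-subnet
      $ # the singlezone daemonset's embedded startup command does not
      $ oc -n openshift-ovn-kubernetes get ds/ovnkube-node -o yaml | grep cluster-manager-v4-transit-switch-subnet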

      This is easy to see by following the reproduction steps below.

      Version-Release number of selected component (if applicable):

      4.13.z --> 4.14.44 upgrade

      How reproducible:

      Always, as long as one of the cluster's subnets overlaps the default transit switch subnet.

      Steps to Reproduce:

      1. Install a 4.13 cluster whose service network (or pod network) overlaps with the default transit switch subnet 100.88.0.0/16.

      2. Start the upgrade.

      3. Specify the custom transit switch subnet in the network.operator/cluster object (see the sketch below).
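
      For step 3, this is a minimal sketch of how the custom subnet could be set. The field path ({{spec.defaultNetwork.ovnKubernetesConfig.ipv4.internalTransitSwitchSubnet}}) and the example value 100.69.0.0/16 are assumptions here, so check the Network operator API of the target release for the exact field:

      $ oc patch network.operator/cluster --type merge --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipv4":{"internalTransitSwitchSubnet":"100.69.0.0/16"}}}}}'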

      Actual results:

      The upgrade stalls because one or more ovnkube-node pods are in CrashLoopBackOff with an error like this:

      illegal network configuration: transit switch subnet "100.88.0.0/16" overlaps cluster subnet "100.x.x.x/12"
      

      Expected results:

      The upgrade completes properly when the right custom transit switch subnet is specified.

      Additional info:

      A workaround is to force the direct deployment of the multizone daemonset by patching the ovn-interconnect configmap like this:

      $ oc -n openshift-ovn-kubernetes patch cm/ovn-interconnect-configuration --type merge --patch '{"data":{"zone-mode":"multizone","fast-forward-to-multizone":""}}'
      

      However, this cannot be considered a definitive solution, because it causes an expected outage in the pod network.
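
      After applying the workaround, a quick way to confirm that the multizone daemonset finished rolling out and that the previously crashlooping pods recovered (the daemonset name and label selector below are assumptions, not taken from this report):

      $ oc -n openshift-ovn-kubernetes rollout status ds/ovnkube-node
      $ oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node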

      More information in the comments.

              bbennett@redhat.com Ben Bennett
              rhn-support-palonsor Pablo Alonso Rodriguez
              Zhanqi Zhao Zhanqi Zhao