  1. Use OCPBUGS now
  2. GRPA-4594

OCP4.16.z: It is possible to accidentally trigger an upgrade to 4.17 during a migration from SDN --> OVN leading to soft-lock state


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Blocker
    • 5
    • False
    • False
    • Release Note Not Required
    • CORENET Sprint 273, CORENET Sprint 274
    • Critical
    • 10

      Description of problem:

      • On a cluster that uses OpenShift SDN, an upgrade from 4.16 to 4.17 is blocked by a handler in the Cluster Network Operator: https://github.com/openshift/cluster-network-operator/blob/2dc3099a8689a5df9797fe9c14257d7b06886741/pkg/controller/statusmanager/status_manager.go#L403C3-L422C4
      • However, the detection and block condition is based solely on `Spec.DefaultNetwork.Type` being set to `OpenShiftSDN`.
      • If a customer requests an upgrade, we block the rollout, and the customer then (without clearing the pending upgrade request) migrates to OVN using the limited live (or offline) migration method, `Spec.DefaultNetwork.Type` changes to `OVNKubernetes` mid-migration.
      • This change to the spec removes the safeguard that was blocking the upgrade, and the cluster begins upgrading to 4.17.
      • The cluster upgrades every component except the Network Operator, because the migration's restarts, machine-config rollout, and network teardown take longer than the upgrade tasks do.
      • This leaves OVN-Kubernetes up and defined alongside OpenShift SDN, and the new 4.17 build of the Network Operator cannot complete the migration because the APIs are removed (soft-locked).
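
      The failure mode described above can be modelled in a few lines of Go. This is a minimal sketch, not the Cluster Network Operator code: `ClusterState`, `upgradeBlockedTypeOnly`, and `upgradeStarts` are hypothetical names invented for illustration, and the only behaviour taken from the report is that the block depends solely on the network type being `OpenShiftSDN`.

```go
package main

// NetworkType holds the two values the report quotes for
// Spec.DefaultNetwork.Type. The Go type itself is hypothetical.
type NetworkType string

const (
	OpenShiftSDN  NetworkType = "OpenShiftSDN"
	OVNKubernetes NetworkType = "OVNKubernetes"
)

// ClusterState is a hypothetical, simplified view of the cluster,
// not a real OpenShift API type.
type ClusterState struct {
	DefaultNetworkType NetworkType
	SDNTornDown        bool // true only once OpenShift SDN is fully removed
	UpgradeRequested   bool // a 4.17 upgrade request is still pending
}

// upgradeBlockedTypeOnly models the gate as the report describes it:
// it looks at the network type and nothing else.
func upgradeBlockedTypeOnly(s ClusterState) bool {
	return s.DefaultNetworkType == OpenShiftSDN
}

// upgradeStarts reports whether a pending upgrade request would proceed.
// Note that SDNTornDown is never consulted, which is the gap.
func upgradeStarts(s ClusterState) bool {
	return s.UpgradeRequested && !upgradeBlockedTypeOnly(s)
}
```

      With a pending request and type `OpenShiftSDN`, `upgradeStarts` returns false. Once the migration flips the type to `OVNKubernetes`, it returns true even while `SDNTornDown` is still false, which matches the sequence in the report.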

       

      Version-Release number of selected component (if applicable):

      4.16 --> 4.17

      How reproducible:

      • Not yet reproduced in the lab, but from reading the code I expect it to reproduce easily:
        • Deploy a cluster on 4.16 using SDN
        • Request an upgrade to 4.17
        • Observe the denial caused by the blocker code detecting `OpenShiftSDN` as `Spec.DefaultNetwork.Type`
        • Proceed with the limited live migration
        • Observe `Spec.DefaultNetwork.Type` change to `OVNKubernetes`
        • Observe the cluster upgrade begin even though the migration is not yet complete

      Actual results:

      • Cluster degraded

      Expected results:

      • The cluster should not be allowed to upgrade until OVN-Kubernetes is in place and OpenShift SDN is fully torn down (block on migration status as well).

      Additional info:

      • I think this should be fairly easy to fix by adding a spec check to the blocker so that an upgrade is not allowed while the cluster is in a migration state. The check is needed specifically on this version, to ensure the migration has FINISHED before the cluster can move up.
      •  
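
      One possible shape for that additional check, as a hedged sketch only. `GateInput` and `upgradeBlocked` are hypothetical names, and `MigrationInProgress` is an assumed input: the report does not say which field or condition would indicate an unfinished migration, so that lookup is left out.

```go
package main

// GateInput is a hypothetical summary of what the upgrade gate would inspect.
type GateInput struct {
	// DefaultNetworkType is the value of Spec.DefaultNetwork.Type.
	DefaultNetworkType string
	// MigrationInProgress is an assumption: some signal that the
	// SDN to OVN migration has started but not finished.
	MigrationInProgress bool
}

// upgradeBlocked returns whether the upgrade should be blocked and why.
// The first case is the check the report says already exists; the second
// is the additional check the report proposes.
func upgradeBlocked(in GateInput) (bool, string) {
	switch {
	case in.DefaultNetworkType == "OpenShiftSDN":
		return true, "cluster still uses OpenShiftSDN"
	case in.MigrationInProgress:
		return true, "SDN to OVN migration has not finished"
	default:
		return false, ""
	}
}
```

      With this ordering, a cluster whose type has already flipped to `OVNKubernetes` stays blocked until the migration signal clears, which is the behaviour the expected results ask for.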

       

              Unassigned Unassigned
              jluhrsen Jamo Luhrsen
              Courtney Ruhm
              Anurag Saxena Anurag Saxena