OpenShift Core Networking / CORENET-6483

Impact: ovn-kubernetes control plane crashes when OVN bridge mapping is changed


    • Type: Spike
    • Resolution: Done
    • Priority: Critical
    • Incidents & Support

      Impact statement for the OCPBUGS-62475 series:

      Which 4.y.z to 4.y'.z' updates increase vulnerability?

      Any updates to 4.19.13 and later. Updates within the exposed versions (e.g. 4.19.13 to 4.19.14) do not make things worse.
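
      The rule above can be sketched as a small check. This is an illustration only: the function names are hypothetical, and because this statement does not name the fixed releases, the sketch treats every 4.19.z with z >= 13 and every 4.20.z as exposed. It also only accepts plain x.y.z version strings.

```python
import re


def parse_ocp(version: str) -> tuple:
    """Split a plain 'x.y.z' OpenShift version into integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_exposed(version: str) -> bool:
    """Assumption: 4.19.13+ and all 4.20.z are exposed (fixed releases unknown)."""
    major, minor, patch = parse_ocp(version)
    if (major, minor) == (4, 19):
        return patch >= 13
    return (major, minor) == (4, 20)


def update_increases_exposure(source: str, target: str) -> bool:
    """An update only makes things worse when it moves from an
    unexposed release into an exposed one."""
    return not is_exposed(source) and is_exposed(target)


if __name__ == "__main__":
    for src, dst in [("4.19.12", "4.19.13"), ("4.19.13", "4.19.14")]:
        print(src, "->", dst, update_increases_exposure(src, dst))
```

      Once fixed releases are known, `is_exposed` would need an upper bound per minor version.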

      Which types of clusters?

      All cluster types using OVN-Kubernetes networking that perform OVS bridge mapping changes via nmstate are vulnerable. Specifically:

      • Clusters using kubernetes-nmstate to manage network configuration
      • Clusters performing OVN bridge mapping modifications (adding/removing localnet mappings)
      • In particular, CNV (Container Native Virtualization) deployments that dynamically manage network bridges

      Exposure is hard to determine precisely with PromQL. At the cluster level, if your platform is None, BareMetal, OpenStack, or VSphere and you have the OpenShift Virtualization operator (kubevirt-hyperconverged-operator) installed, you might be exposed; other platforms are not exposed. Clusters on exposed platforms without the OpenShift Virtualization operator installed are only likely to be exposed if they have the ovn.bridge-mappings feature applied. The following query approximates that check:

      topk by (_id) (1,
        group by (_id, type) (cluster_infrastructure_provider{_id="",type=~"None|BareMetal|OpenStack|VSphere"})
        or on (_id)
        0 * group by (_id, type) (cluster_infrastructure_provider{_id="",type!~"None|BareMetal|OpenStack|VSphere"})
      )
      * on (_id) group_left (name)
      (
        group by (_id, name) (csv_succeeded{_id="", name=~"kubevirt-hyperconverged-operator[.].*"})
        or on (_id)
        group by (_id, name) (kubernetes_nmstate_features_applied{_id="", name="ovn.bridge-mappings"} > 0)
        or on (_id)
        0 * label_replace(group by (_id) (csv_succeeded{_id=""} + on (_id) group_left () group by (_id) (kubernetes_nmstate_features_applied{_id=""})), "name", "not hyperconverged", "", "")
        or on (_id)
        0 * label_replace(group by (_id) (cluster_infrastructure_provider{_id="",type!~"None|BareMetal|OpenStack|VSphere"}), "name", "not sure about hyperconverged or ovn.bridge-mappings, but the whole platform is safe", "", "")
      )
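
      The decision the query encodes can be restated as plain logic, which may help when checking a cluster by hand. This is a minimal sketch: the function and parameter names are hypothetical, and the inputs are assumed to have been gathered separately (for example from the cluster's infrastructure platform type, the installed operators, and the applied nmstate features). Like the query, it returns 1 for "likely exposed" and 0 otherwise.

```python
# Platform types the statement above lists as potentially exposed.
EXPOSED_PLATFORMS = frozenset({"None", "BareMetal", "OpenStack", "VSphere"})


def likely_exposed(
    platform: str,
    has_virtualization_operator: bool,
    bridge_mappings_applied: bool,
) -> int:
    """Return 1 if the cluster is likely exposed, 0 otherwise."""
    if platform not in EXPOSED_PLATFORMS:
        # The whole platform is considered safe, whatever else is installed.
        return 0
    if has_virtualization_operator or bridge_mappings_applied:
        return 1
    return 0


if __name__ == "__main__":
    print(likely_exposed("BareMetal", True, False))
    print(likely_exposed("AWS", True, True))
```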
      

      What is the impact? Is it serious enough to warrant removing update recommendations?

      A bug in NetworkManager deletes ovs-ports, which causes the Node to lose OVN connectivity. This breaks Nodes as they update into the exposed RHCOS. MachineConfigPools will notice the updated Node failing to return to Ready=True, and will stop updating further Nodes, either because of their own maxUnavailable configuration or because further Node drains are blocked by PodDisruptionBudgets guarding workloads.

      How involved is remediation?

      Updating to a fixed release (once one ships) should smoothly recover a cluster, although any affected Nodes may need to be rebuilt or replaced.

      Alternatively, a custom MachineConfig might recover the MachineConfigPool, although again, you might need manual work to recover any already impacted Nodes:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: master # or whichever roles you want to target
        name: os-layer-custom
      spec:
        osImageURL: quay.io/...  # an alternative RHCOS image, for example the one you had been using pre-update, or the one from 4.19.12
      

      Is this a regression?

      Yes. The breaking change landed in NetworkManager 1.52.0-7 and was fixed in 1.52.0-8. OCP 4.19.13 bumped NetworkManager from 1:1.52.0-5.el9_6 to the broken 1:1.52.0-7.el9_6. OCP 4.18 uses RHEL 9.4, and its NetworkManager 1.46.0 is not exposed. 4.20 releases are exposed, but there are no supported updates from 4.19.(z<13) to 4.20; only 4.19.15 and later have supported paths to 4.20.
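
      To check a given NetworkManager package string against the builds named above, a sketch along these lines may be useful. The helper names are hypothetical, and the parsing is a simplification rather than full RPM version comparison: it assumes an optional epoch, a dotted numeric version, and a release that starts with an integer. It flags only 1.52.0-7, the one build this statement identifies as broken.

```python
import re

# Optional "epoch:", dotted version, "-", numeric release, optional ".dist" suffix.
_NM_PATTERN = re.compile(r"(?:(\d+):)?(\d+(?:\.\d+)*)-(\d+)(?:\.(.+))?")

BROKEN_VERSION = (1, 52, 0)
BROKEN_RELEASE = 7  # fixed in release 8, according to the statement above


def parse_nm(package_version: str) -> tuple:
    """Return ((major, minor, patch, ...), release) for a NetworkManager build string."""
    match = _NM_PATTERN.fullmatch(package_version)
    if match is None:
        raise ValueError(f"unsupported package version: {package_version!r}")
    version = tuple(int(part) for part in match.group(2).split("."))
    return version, int(match.group(3))


def is_broken_build(package_version: str) -> bool:
    """True only for the build named as broken in this statement."""
    version, release = parse_nm(package_version)
    return version == BROKEN_VERSION and release == BROKEN_RELEASE


if __name__ == "__main__":
    for build in ("1:1.52.0-5.el9_6", "1:1.52.0-7.el9_6"):
        print(build, is_broken_build(build))
```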

              mkowalsk@redhat.com Mat Kowalski
              trking W. Trevor King