OCPBUGS-43454

ovnkube-control-plane pods crash when upgrading from 4.16 to 4.17 with localnet topology networks without subnets

      * Previously, when you attempted to use the Cluster Network Operator (CNO) to upgrade a cluster with existing `localnet` networks, `ovnkube-control-plane` pods would fail to run. This happened because the `ovnkube-cluster-manager` container could not process an OVN-Kubernetes `localnet` topology network that did not have subnets defined. With this release, a fix ensures that the `ovnkube-cluster-manager` container can process an OVN-Kubernetes `localnet` topology network that does not have subnets defined. (link:https://issues.redhat.com/browse/OCPBUGS-43454[*OCPBUGS-43454*])
    • Bug Fix
    • In Progress

      Description of problem:

      Customers using OVN-K localnet topology networks for virtualization often do not define a "subnets" field in their NetworkAttachmentDefinitions. Examples in the virtualization section of the OCP documentation do not include that field either.
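
      For illustration, a minimal sketch of such a NAD, assuming the documented OVN-K localnet format; the name "vlan10", the "default" namespace, and the VLAN ID are hypothetical placeholders rather than values taken from the affected cluster:

          apiVersion: k8s.cni.cncf.io/v1
          kind: NetworkAttachmentDefinition
          metadata:
            name: vlan10              # hypothetical NAD name
            namespace: default        # hypothetical namespace
          spec:
            # The JSON config below intentionally omits the "subnets" key
            config: |
              {
                "cniVersion": "0.3.1",
                "name": "vlan10",
                "type": "ovn-k8s-cni-overlay",
                "topology": "localnet",
                "netAttachDefName": "default/vlan10",
                "vlanID": 10
              }

      Note that the config deliberately has no "subnets" key. Adding one (for example, "subnets": "192.0.2.0/24", an illustrative range) is the workaround applied in steps 7-8 below.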

      When a cluster with such NADs is upgraded from 4.16 to 4.17, the ovnkube-control-plane pods crash once the CNO is upgraded, and the upgrade hangs in a failing state. Once in the failing state, the cluster upgrade can be recovered by adding a subnets field to the localnet NADs.

      Version-Release number of selected component (if applicable): upgrading from 4.16.15 to 4.17.1

      How reproducible:

      Start with an OCP 4.16 cluster with OVN-K localnet NADs configured per the OpenShift Virtualization documentation and attempt to upgrade the cluster to 4.17.1.

      Steps to Reproduce:

      1. Deploy an OCP 4.16.15 cluster. The platform type should not matter, but all testing has been done on bare metal (SNO and HA topologies)

      2. Configure an OVS bridge with localnet bridge mappings (a sketch of the mapping follows these steps) and create one or more NetworkAttachmentDefinitions using the localnet topology without configuring the "subnets" field

      3. Observe that this is a working configuration in 4.16, although error-level log messages appear in the ovnkube-control-plane pod (see OCPBUGS-37561)

      4. Delete the ovnkube-control-plane pod on 4.16 and observe that the log messages do not prevent you from starting ovnkube on 4.16

      5. Trigger an upgrade to 4.17.1

      6. Once ovnkube-control-plane is restarted as part of the upgrade, observe that the ovnkube-cluster-manager container crashes with the following message, where "vlan10" is the name of a NetworkAttachmentDefinition created earlier:

      failed to run ovnkube: failed to start cluster manager: initial sync failed: failed to sync network vlan10: [cluster-manager network manager]: failed to create network vlan10: no cluster network controller to manage topology

      7. Edit all NetworkAttachmentDefinitions to include a subnets field

      8. Wait, or delete the ovnkube-control-plane pods, and observe that the pods come up and the upgrade resumes and completes normally
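
      For reference from step 2, a minimal sketch of the localnet bridge mapping, assuming it is configured through an NMState NodeNetworkConfigurationPolicy; the policy name, node selector, and bridge name are hypothetical placeholders:

          apiVersion: nmstate.io/v1
          kind: NodeNetworkConfigurationPolicy
          metadata:
            name: vlan10-mapping              # hypothetical policy name
          spec:
            nodeSelector:
              node-role.kubernetes.io/worker: ""
            desiredState:
              ovn:
                bridge-mappings:
                  # "localnet" must match the "name" field in the NAD config
                  - localnet: vlan10
                    bridge: br-ex             # OVS bridge carrying the physical network
                    state: present

      To restart the control plane as in steps 4 and 8, deleting the pods with something like "oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-control-plane" should work (the label selector here is an assumption, not taken from the report).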

      Actual results: The upgrade fails and ovnkube-control-plane is left in a crashing state

      Expected results: The upgrade succeeds and ovnkube-control-plane is running

      Additional info:

      Affected Platforms: Tested on bare metal, but all platforms using OVN-K localnet networks should be impacted

      Is it an:

      1. internal CI failure
      2. customer issue / SD (Case 03960269)
      3. internal RedHat testing failure: reproduction steps are based on internal testing, as the customer environment has been repaired with the workaround

      If it is an internal RedHat testing failure:

      • Kubeconfig for an internet-reachable cluster currently in the failed state is available upon request from Andrew Austin Byrum until at least 25 October 2024

       
