- Bug
- Resolution: Unresolved
- Critical
- None
- 4.17
Description of problem:
Customers using OVN-K localnet topology networks for virtualization often do not define a "subnets" field in their NetworkAttachmentDefinitions (NADs); the examples in the virtualization section of the OCP documentation do not include that field either.
When a cluster with such NADs is upgraded from 4.16 to 4.17, the ovnkube-control-plane pods crash once the Cluster Network Operator (CNO) is upgraded, and the upgrade hangs in a failing state. Once in the failing state, the cluster upgrade can be recovered by adding a "subnets" field to the localnet NADs.
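For reference, a localnet NAD of the kind described above, with no "subnets" field, looks roughly like the following sketch. The name "vlan10" matches the NAD referenced in the error message below; the namespace, VLAN ID, and CNI version are illustrative and not taken from the affected cluster:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan10
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "vlan10",
      "type": "ovn-k8s-cni-overlay",
      "topology": "localnet",
      "netAttachDefName": "default/vlan10",
      "vlanID": 10
    }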
Version-Release number of selected component (if applicable): upgrading from 4.16.15 to 4.17.1
How reproducible:
Start with an OCP 4.16 cluster with OVN-K localnet NADs configured per the OpenShift Virtualization documentation and attempt to upgrade the cluster to 4.17.1.
Steps to Reproduce:
1. Deploy an OCP 4.16.15 cluster; the platform type shouldn't matter, but all testing has been done on bare metal (SNO and HA topologies)
2. Configure an OVS bridge with localnet bridge mappings and create one or more NetworkAttachmentDefinitions using the localnet topology without configuring the "subnets" field (as in the example NAD in the description; see also the sketches after these steps)
3. Observe that this is a working configuration in 4.16, although error-level log messages appear in the ovnkube-control-plane pod (see OCPBUGS-37561)
4. Delete the ovnkube-control-plane pod on 4.16 and observe that the log messages do not prevent you from starting ovnkube on 4.16
5. Trigger an upgrade to 4.17.1
6. Once ovnkube-control-plane is restarted as part of the upgrade, observe that the ovnkube-cluster-manager container is crashing with the following message, where "vlan10" is the name of a NetworkAttachmentDefinition created earlier:
failed to run ovnkube: failed to start cluster manager: initial sync failed: failed to sync network vlan10: [cluster-manager network manager]: failed to create network vlan10: no cluster network controller to manage topology
7. Edit all NetworkAttachmentDefinitions to include a "subnets" field (see the sketch after these steps)
8. Wait, or delete the ovnkube-control-plane pods to force a restart, and observe that the pods come up and the upgrade resumes and completes normally
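For step 2, a sketch of the bridge mapping configured via a NodeNetworkConfigurationPolicy (this requires the NMState Operator; the policy name, node selector, and bridge are illustrative, and the "localnet" value must match the "name" in the NAD config above):

apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: localnet-vlan10-mapping
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ''
  desiredState:
    ovn:
      bridge-mappings:
      - localnet: vlan10
        bridge: br-ex
        state: present

For step 7, the same NAD as in the description with a "subnets" field added; the subnet value here is purely illustrative and should be replaced with a range appropriate for the network:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vlan10
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "vlan10",
      "type": "ovn-k8s-cni-overlay",
      "topology": "localnet",
      "netAttachDefName": "default/vlan10",
      "vlanID": 10,
      "subnets": "192.0.2.0/24"
    }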
Actual results: The upgrade fails and ovnkube-control-plane is left in a crashing state
Expected results: The upgrade succeeds and ovnkube-control-plane is running
Additional info:
Affected Platforms: Tested on bare metal, but all platforms using OVN-K localnet networks should be impacted
Is it an internal CI failure, customer issue / SD, or internal Red Hat testing failure?
- Customer issue / SD (Case 03960269)
- Internal Red Hat testing failure: reproduction steps are based on internal testing, as the customer environment has been repaired with the workaround
If it is an internal Red Hat testing failure:
- Kubeconfig for an internet-reachable cluster currently in the failed state is available upon request from Andrew Austin Byrum until at least 25 October 2024
- depends on: OCPBUGS-44195 ovnkube-control-plane pods crash when upgrading from 4.16 to 4.17 with localnet topology networks without subnets (Verified)
- is cloned by: OCPBUGS-44195 ovnkube-control-plane pods crash when upgrading from 4.16 to 4.17 with localnet topology networks without subnets (Verified)
- is related to: SDN-5485 Impact ovnkube-control-plane pods crash when upgrading from 4.16 to 4.17 with localnet topology networks without subnets (In Progress)