- Bug
- Resolution: Done-Errata
- Critical
- 4.16.z
- Quality / Stability / Reliability
- False
- Approved
- CORENET Sprint 271
- 1
- Customer Escalated
- Done
- Bug Fix
Environment: OpenShift 4.16.32 (production cluster, AWS, OVN limited live migration)
Description
During a Limited Live Migration from OpenShift SDN to OVN-Kubernetes, the migration process stalled. Nodes remained in mixed CNI states (SDN + OVN), leading to:
- Application unavailability
- DNS resolution failures
- ImagePullBackOff errors
- Broken service-to-service communication across the cluster
This behavior appears to contradict the expected automatic and non-disruptive migration flow defined in the official documentation.
Issue Summary
- Migration was blocked because the MachineConfigPool (MCP) could not drain nodes.
- Cause: Pod Disruption Budgets (PDBs) prevented pod eviction during the drain.
- The network bridge between SDN and OVN failed mid-rollout when new worker nodes joined during the second phase.
- All master nodes completed the second reboot cycle and are now on the OVN-Kubernetes network.
- 6 out of 8 worker nodes remained in SDN state; only 2 completed OVN configuration.
- DNS queries sent to CoreDNS from a cluster-image-registry pod running on a master node were seen in the PCAP, but no responses were received.
- Connections to CoreDNS and to the default Kubernetes Service failed from every pod tested on the nodes that had completed the second reboot.
- Conntrack entries remained UNREPLIED; ICMP reported port unreachable.
- A detailed analysis is given in the first comment of this bug.
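The UNREPLIED conntrack observation above can be quantified from a saved `conntrack -L` dump. The following is a minimal sketch, not the analysis used in the case: the function name `unreplied_by_destination` and the sample lines are hypothetical, and the parser assumes the usual `conntrack -L` text layout (protocol name first, original-direction tuple before the `[UNREPLIED]` flag).

```python
import re
from collections import Counter

# Matches the first (original-direction) tuple on a `conntrack -L` line.
TUPLE_RE = re.compile(r"src=(\S+) dst=(\S+) sport=(\d+) dport=(\d+)")


def unreplied_by_destination(conntrack_text):
    """Count [UNREPLIED] entries per (protocol, destination IP, destination port)."""
    counts = Counter()
    for line in conntrack_text.splitlines():
        if "[UNREPLIED]" not in line:
            continue
        fields = line.split()
        match = TUPLE_RE.search(line)
        if not fields or match is None:
            continue  # e.g. ICMP entries, which carry no ports
        proto = fields[0]
        _src, dst, _sport, dport = match.groups()
        counts[(proto, dst, int(dport))] += 1
    return counts


# Hypothetical sample lines, NOT taken from the case attachments.
SAMPLE = """\
udp      17 25 src=10.128.2.15 dst=172.30.0.10 sport=45112 dport=53 [UNREPLIED] src=10.129.0.7 dst=10.128.2.15 sport=5353 dport=45112 mark=0 use=1
udp      17 28 src=10.128.2.15 dst=172.30.0.10 sport=51890 dport=53 [UNREPLIED] src=10.129.0.7 dst=10.128.2.15 sport=5353 dport=51890 mark=0 use=1
tcp      6 431990 ESTABLISHED src=10.128.2.15 dst=172.30.0.1 sport=40122 dport=443 src=10.0.12.4 dst=10.128.2.15 sport=6443 dport=40122 [ASSURED] mark=0 use=1
"""

if __name__ == "__main__":
    for (proto, dst, dport), n in unreplied_by_destination(SAMPLE).most_common():
        print(f"{proto} {dst}:{dport} unreplied={n}")
```

A high count against the DNS service address on port 53 would be consistent with the missing CoreDNS responses seen in the PCAP, though it does not by itself identify where the replies were dropped.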
Customer Impact
- The production application went down
- The customer could not proceed with further migrations
- Significant delay in infrastructure expansion plans
- Risk of repeated failure in upcoming cluster migrations (April 28th)
Business value: USD 5M account
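Given the risk of the same stall in upcoming migrations, a pre-flight check for PDBs that currently permit no evictions may be useful. The sketch below assumes its input is the JSON produced by `oc get pdb -A -o json` and relies on the `status.disruptionsAllowed` and `status.expectedPods` fields of the PodDisruptionBudget API; the helper name `blocking_pdbs` and the sample objects are hypothetical.

```python
import json


def blocking_pdbs(pdb_list):
    """Return (namespace, name) for PDBs that cover pods but allow zero disruptions."""
    blocked = []
    for item in pdb_list.get("items", []):
        status = item.get("status", {})
        covers_pods = status.get("expectedPods", 0) > 0
        no_budget = status.get("disruptionsAllowed", 0) == 0
        if covers_pods and no_budget:
            meta = item["metadata"]
            blocked.append((meta["namespace"], meta["name"]))
    return blocked


# Hypothetical sample shaped like `oc get pdb -A -o json` output.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"namespace": "shop", "name": "web-pdb"},
     "status": {"expectedPods": 1, "currentHealthy": 1,
                "desiredHealthy": 1, "disruptionsAllowed": 0}},
    {"metadata": {"namespace": "shop", "name": "api-pdb"},
     "status": {"expectedPods": 3, "currentHealthy": 3,
                "desiredHealthy": 2, "disruptionsAllowed": 1}}
  ]
}
""")

if __name__ == "__main__":
    for namespace, name in blocking_pdbs(SAMPLE):
        print(f"PDB {namespace}/{name} allows 0 disruptions; node drain may stall")
```

Any PDB reported here would need attention (for example, scaling the workload up or relaxing the budget) before the MCP rollout, since a drain cannot evict pods that such a budget protects.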
Expected Behavior
- SDN/OVN bridge should persist and allow traffic between nodes in the transitional state.
Actual Behavior
- Nodes stay in inconsistent SDN/OVN states, breaking internal networking and service discovery
References
- [Official OVN Migration Docs](https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/networking/ovn-kubernetes-network-plugin#how-the-live-migration-process-works_migrate-from-openshift-sdn)
Attachments (in case 04111495)
- must-gather reports: https://attachments.access.redhat.com/hydra/rest/cases/04111495/attachments/51c7d5d1-6efa-485e-ba1a-ec6816edcf16?usePresignedUrl=true
- PCAP files / OVS flows / conntrack entries: https://attachments.access.redhat.com/hydra/rest/cases/04111495/attachments/1fc8804f-5a3c-42b0-b4ac-84bf30d3f9c0?usePresignedUrl=true
- SOS reports from Master Nodes:
Issue Links
- is duplicated by: OCPBUGS-59222 OpenShift API Server unstable during SDN -> OVN migration (Closed)
- links to: RHSA-2025:9765 OpenShift Container Platform 4.16.43 bug fix and security update