OpenShift Bugs / OCPBUGS-55282

OpenShift CNI Live Migration blocked due to Pod Disruption Budget and network bridge failure between SDN and OVN



      Environment: OpenShift 4.16.32 (production cluster, AWS, OVN limited live migration)

      Description

      During a Limited Live Migration from OpenShift SDN to OVN-Kubernetes, the process stalled. Nodes remained in mixed CNI states (SDN + OVN), leading to:

      • Application unavailability
      • DNS resolution failures
      • ImagePullBackOff errors
      • Broken service-to-service communication across the cluster

      This behavior appears to contradict the expected automatic and non-disruptive migration flow defined in the official documentation.
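
      For context, the following is a minimal, read-only sketch of how the relevant migration state can be inspected, using the Python kubernetes client. It assumes the documented limited live migration trigger (the network.openshift.io/network-type-migration annotation together with spec.networkType on Network.config.openshift.io/cluster) and a kubeconfig with cluster-read access; it is illustrative only, not the exact steps used in this environment.

        from kubernetes import client, config

        # Load credentials from the local kubeconfig.
        config.load_kube_config()
        custom = client.CustomObjectsApi()

        # Read the cluster network configuration that drives the migration.
        net = custom.get_cluster_custom_object(
            group="config.openshift.io", version="v1",
            plural="networks", name="cluster",
        )
        print("spec.networkType:   ", net["spec"].get("networkType"))
        print("status.networkType: ", net.get("status", {}).get("networkType"))
        print("migration annotation:",
              net["metadata"].get("annotations", {}).get(
                  "network.openshift.io/network-type-migration"))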

      Issue Summary

      • Migration was blocked due to MachineConfigPool (MCP) not draining nodes.
      • Cause: Pod Disruption Budgets (PDBs) prevented eviction during the migration (see the PDB check sketch after this list).
      • The network bridge between SDN and OVN failed mid-rollout when new worker nodes joined during the second phase.
      • All master nodes have completed the second reboot cycle and are now on the OVN-Kubernetes network.
      • 6 out of 8 worker nodes remained on SDN; only 2 completed the OVN configuration.
      • DNS queries sent to CoreDNS from a cluster-image-registry pod running on a master node were visible in the packet capture (PCAP), but no responses were received (see the DNS resolution sketch after this list).
      • Connections to CoreDNS and to the default Kubernetes service from any pod running on a node that had completed the second reboot failed.
      • Conntrack entries remained UNREPLIED; ICMP responses showed port unreachable (see the conntrack sketch after this list).
      • A detailed analysis is given in the first comment of this bug.
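
      PDB check sketch referenced above: a minimal, read-only example (assuming the Python kubernetes client and a kubeconfig with cluster-read access; not taken from the case data) that lists the PodDisruptionBudgets currently allowing zero disruptions, which is the condition that blocks pod eviction and therefore node drain during the migration reboots.

        from kubernetes import client, config

        config.load_kube_config()
        policy = client.PolicyV1Api()

        # A PDB whose status reports zero allowed disruptions will block the
        # eviction API, and with it the MCP-driven node drain.
        for pdb in policy.list_pod_disruption_budget_for_all_namespaces().items:
            st = pdb.status
            if st and st.disruptions_allowed == 0:
                print(f"{pdb.metadata.namespace}/{pdb.metadata.name}: "
                      f"disruptionsAllowed=0, currentHealthy={st.current_healthy}, "
                      f"desiredHealthy={st.desired_healthy}")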
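
      DNS resolution sketch referenced above: a small reproduction that can be run from a Python shell inside any pod on a node that has completed the second reboot. The service names are the standard in-cluster ones (an assumption, not copied from the case); a resolution failure here matches the "query sent, no response" symptom seen in the PCAP.

        import socket

        # Resolution goes through the pod's /etc/resolv.conf, i.e. CoreDNS.
        for name in ("kubernetes.default.svc.cluster.local",
                     "image-registry.openshift-image-registry.svc.cluster.local"):
            try:
                addrs = sorted({ai[4][0] for ai in socket.getaddrinfo(name, None)})
                print(f"{name} -> {addrs}")
            except socket.gaierror as err:
                print(f"{name} -> FAILED ({err})")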
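
      Conntrack sketch referenced above: a minimal check, run as root directly on an affected node, that prints DNS conntrack entries still marked UNREPLIED. It assumes the /proc/net/nf_conntrack interface is available on the node (not confirmed from the case data).

        # Print conntrack entries for DNS traffic that never received a reply.
        with open("/proc/net/nf_conntrack") as f:
            for line in f:
                if "UNREPLIED" in line and "dport=53" in line:
                    print(line.strip())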

      Customer Impact

      • The production application went down
      • The customer could not proceed with further migrations
      • Significant delay in infrastructure expansion plans
      • Risk of repeated failure in upcoming cluster migrations (April 28th)

      Business value: USD 5M account

      Expected Behavior

      • SDN/OVN bridge should persist and allow traffic between nodes in the transitional state.

      Actual Behavior

      • Nodes stay in inconsistent SDN/OVN states, breaking internal networking and service discovery (see the node-to-CNI mapping sketch below).
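
      Node-to-CNI mapping sketch referenced above: a read-only example (assuming the Python kubernetes client) that maps each node to the CNI pods running on it, which makes the inconsistent SDN/OVN split visible during the transitional state.

        from collections import defaultdict
        from kubernetes import client, config

        config.load_kube_config()
        core = client.CoreV1Api()

        # During the limited live migration both CNI namespaces run DaemonSet
        # pods; grouping them per node shows which nodes are still on SDN.
        cni_by_node = defaultdict(set)
        for ns, label in (("openshift-sdn", "SDN"),
                          ("openshift-ovn-kubernetes", "OVN")):
            for pod in core.list_namespaced_pod(ns).items:
                if pod.spec.node_name:
                    cni_by_node[pod.spec.node_name].add(label)

        for node, cnis in sorted(cni_by_node.items()):
            print(f"{node}: {'+'.join(sorted(cnis))}")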

      References

      Attachments (in case 04111495)

      • SOS reports from Master Nodes:

      https://attachments.access.redhat.com/hydra/rest/cases/04111495/attachments/e935f622-b4cf-4a96-b5c3-41dd0953a04c?usePresignedUrl=true

      https://attachments.access.redhat.com/hydra/rest/cases/04111495/attachments/c20a9abf-73c5-4b99-aa2f-5b631c74ef20?usePresignedUrl=true

      https://attachments.access.redhat.com/hydra/rest/cases/04111495/attachments/5736ddc3-933a-4c20-82ef-c2d01d9ebb8d?usePresignedUrl=true
