- Bug
- Resolution: Unresolved
- Major
- None
- 4.14.z
- Important
- None
- False
Description of problem:
- Customer migrating 8 separate clusters running 4.14.37 in protected/disconnected network space.
- Steps to migrate from SDN to OVN were completed as per documentation: https://docs.openshift.com/container-platform/4.14/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html#nw-ovn-kubernetes-migration_migrate-from-openshift-sdn
- After completing the migration, step 11 was reached: nodes were rebooted sequentially, waiting for all hosts to return to Ready before rebooting the next node (it is unclear which nodes were rebooted first).
- Observed that OAuth connectivity was failing, as was the console's ability to connect to peer nodes, and router pods could not contact all pods.
- Suspected an issue with Geneve port access, so validated throughput from DNS pod to DNS pod on a neighboring node and observed inconsistent connections: some pods could reach only some other hosts, while some hosts could reach all other hosts.
- Checked the routing table on a node that had been rebooted again for testing, and on a node still in the problematic state, and observed discrepant routing table entries:
Route table results for cmp5 and cmp7:

Starting pod/cmp5 ...  ## PROBLEM NODE
To use host binaries, run `chroot /host`
default via 160.xx.xx.254 dev br-ex proto static metric 48
10.254.0.0/16 dev tun0 scope link  ##<<------------------------------!!
10.xx.xx.0/24 dev ovn-k8s-mp0 proto kernel scope link src 10.xx.xx.2
160.xx.xx.0/24 dev br-ex proto kernel scope link src 160.xx.xx.25 metric 48
169.254.169.0/29 dev br-ex proto kernel scope link src 169.xx.xx.2
169.254.169.1 dev br-ex src 160.xx.xx.25
169.254.169.3 via 10.254.2.1 dev ovn-k8s-mp0
172.30.0.0/16 dev tun0  ##<<---------------------------------------!!

Starting pod/cmp7 ...  ## WORKING NODE
To use host binaries, run `chroot /host`
default via 160.xx.xx.254 dev br-ex proto static metric 48
10.xx.0.0/16 via 10.xx.xx.1 dev ovn-k8s-mp0
10.xx.3.0/24 dev ovn-k8s-mp0 proto kernel scope link src 10.xx.xx.2
160.xx.xx.0/24 dev br-ex proto kernel scope link src 160.xx.xx.27 metric 48
169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2
169.254.169.1 dev br-ex src 160.xx.xx.27
169.254.169.3 via 10.254.3.1 dev ovn-k8s-mp0
172.30.0.0/16 via 169.254.169.4 dev br-ex mtu 1400
- Rebooting the problem node again allowed it to come online successfully, with a routing table matching its correctly provisioned peers.
- Because all nodes should have rolled over properly on the first manual reboot, as outlined in the documentation, this is classified as a bug: additional rescue steps were necessary to stabilize the cluster and ensure nodes were provisioned with the desired OVN-only routing table entries.
- The old tun0 entries should have been torn down and replaced with the ovn-k8s-mp0 gateway during the reboot process in step 11.
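The mixed routing table above can be detected mechanically when deciding whether a node needs another reboot. A minimal sketch, not part of any product tooling: the `check_routes` helper and the abbreviated sample input are illustrative; only the `tun0` and `ovn-k8s-mp0` device names come from the output above.

```shell
# Print a one-word verdict for an `ip route` dump passed as $1:
#   MIXED - stale OpenShift SDN (tun0) routes coexist with
#           OVN-Kubernetes (ovn-k8s-mp0) routes, the broken cmp5 state
#   SDN   - the node still routes through tun0 only
#   OK    - OVN-only routing table, as on cmp7
check_routes() {
    tun0=$(printf '%s\n' "$1" | grep -c 'dev tun0' || true)
    ovn=$(printf '%s\n' "$1" | grep -c 'dev ovn-k8s-mp0' || true)
    if [ "$tun0" -gt 0 ] && [ "$ovn" -gt 0 ]; then
        echo MIXED
    elif [ "$tun0" -gt 0 ]; then
        echo SDN
    else
        echo OK
    fi
}

# Abbreviated cmp5 routes from above: tun0 and ovn-k8s-mp0 coexist.
sample='10.254.0.0/16 dev tun0 scope link
10.254.2.0/24 dev ovn-k8s-mp0 proto kernel scope link
172.30.0.0/16 dev tun0'
check_routes "$sample"    # prints MIXED
```

On a live node the dump could be collected with, for example, `oc debug node/<node> -- chroot /host ip route`.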
Version-Release number of selected component (if applicable):
- 4.14.37
- bare-metal on HP hardware/hypervisor setup
How reproducible:
- 2 of the 8 clusters impacted so far; migration of the remaining 6 clusters is paused until the nature of the problem is validated.
Steps to Reproduce:
1. Proceed to migrate to OVN
2. Reboot nodes as outlined in step 11 (slow method, waiting for each node to come back to ready before rebooting next peer to avoid bringing the whole cluster down at once)
3. Observe that nodes can't consistently communicate with peers. Observe that some nodes (or all nodes in case of secondary cluster attempted) have a partially updated route table that still lists tun0.
4. Reboot nodes again to clear the errant entries from the routing table, stabilize operators, and allow traffic to flow.
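The inconsistent pairwise connectivity observed in step 3 can be summarized from raw probe results. A minimal sketch under an assumed `src dst ok|fail` result format; the `partial_nodes` helper and the probe results below are hypothetical, with node names mirroring the cmp5/cmp7 hosts above.

```shell
# List nodes whose probes reached only some peers - the partial
# connectivity pattern from step 3. Input lines: "src dst ok|fail".
partial_nodes() {
    printf '%s\n' "$1" | awk '
        { total[$1]++; if ($3 == "ok") good[$1]++ }
        END {
            for (n in total)
                if (good[n] > 0 && good[n] < total[n])
                    print n " partial (" good[n] "/" total[n] ")"
        }'
}

# Hypothetical probe results: cmp5 reaches cmp7 but not cmp6,
# while cmp7 reaches every peer it probed.
results='cmp5 cmp6 fail
cmp5 cmp7 ok
cmp7 cmp5 ok
cmp7 cmp6 ok'
partial_nodes "$results"    # prints: cmp5 partial (1/2)
```

A node flagged as partial is a candidate for the routing-table check and second reboot described above.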
Actual results:
- Cluster is destabilized: some nodes retain stale tun0 routes and cannot consistently reach their peers.
Expected results:
- Cluster should come online successfully in an OVN-only state, with correct routing tables, once the migration steps conclude.
Additional info:
- sosreports, must-gathers, and linked case details in comments to follow.