- Bug
- Resolution: Unresolved
- Major
- None
- 4.14.z
- Important
- None
- False
Description of problem:
- Customer migrating 8 separate clusters running 4.14.37 in protected/disconnected network space.
- Steps to migrate from SDN to OVN were completed as per documentation: https://docs.openshift.com/container-platform/4.14/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html#nw-ovn-kubernetes-migration_migrate-from-openshift-sdn
- After completing the migration, step 11 was reached: nodes were rebooted sequentially, waiting for all hosts to return to Ready before rebooting the next node (it is unclear which nodes were rebooted first).
- Observed that OAuth connectivity was failing, as was the console's ability to connect to peer nodes, and router pods could not contact all pods.
- Suspected an issue with Geneve port access, so validated throughput from DNS pod to DNS pod on a neighboring node and observed inconsistent connections: some pods could reach only some other hosts, while some hosts could reach all other hosts.
- Checked the routing table on a node that had been rebooted again for testing, and on a node still in the problematic state, and observed discrepant routing table entries:
Route table results for cmp5 and cmp7:

Starting pod/cmp5 ...  ## PROBLEM NODE
To use host binaries, run `chroot /host`
default via 160.xx.xx.254 dev br-ex proto static metric 48
10.254.0.0/16 dev tun0 scope link  ##<<------------------------------!!
10.xx.xx.0/24 dev ovn-k8s-mp0 proto kernel scope link src 10.xx.xx.2
160.xx.xx.0/24 dev br-ex proto kernel scope link src 160.xx.xx.25 metric 48
169.254.169.0/29 dev br-ex proto kernel scope link src 169.xx.xx.2
169.254.169.1 dev br-ex src 160.xx.xx.25
169.254.169.3 via 10.254.2.1 dev ovn-k8s-mp0
172.30.0.0/16 dev tun0  ##<<---------------------------------------!!

Starting pod/cmp7 ...  ## WORKING NODE
To use host binaries, run `chroot /host`
default via 160.xx.xx.254 dev br-ex proto static metric 48
10.xx.0.0/16 via 10.xx.xx.1 dev ovn-k8s-mp0
10.xx.3.0/24 dev ovn-k8s-mp0 proto kernel scope link src 10.xx.xx.2
160.xx.xx.0/24 dev br-ex proto kernel scope link src 160.xx.xx.27 metric 48
169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2
169.254.169.1 dev br-ex src 160.xx.xx.27
169.254.169.3 via 10.254.3.1 dev ovn-k8s-mp0
172.30.0.0/16 via 169.254.169.4 dev br-ex mtu 1400
- Rebooting the problem node again allowed it to come online successfully, with a routing table matching its correctly provisioned peers.
- Because all nodes should have rolled over properly on the first manual reboot, as outlined in the documentation, this is classified as a bug: additional rescue steps were necessary to stabilize the cluster and ensure nodes were provisioned with the desired OVN-only routing table entries.
- The old tun0 entries should have been torn down and replaced with the ovn-k8s-mp0 gateway during the reboot process in step 11.
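The mixed routing table above can be detected mechanically when deciding whether a node needs another reboot. A minimal sketch, not part of any product tooling: the `check_routes` helper and the abbreviated sample input are illustrative; only the `tun0` and `ovn-k8s-mp0` device names come from the output above.

```shell
# Print a one-word verdict for an `ip route` dump passed as $1:
#   MIXED - stale OpenShift SDN (tun0) routes coexist with
#           OVN-Kubernetes (ovn-k8s-mp0) routes, the broken cmp5 state
#   SDN   - the node still routes through tun0 only
#   OK    - OVN-only routing table, as on cmp7
check_routes() {
    tun0=$(printf '%s\n' "$1" | grep -c 'dev tun0' || true)
    ovn=$(printf '%s\n' "$1" | grep -c 'dev ovn-k8s-mp0' || true)
    if [ "$tun0" -gt 0 ] && [ "$ovn" -gt 0 ]; then
        echo MIXED
    elif [ "$tun0" -gt 0 ]; then
        echo SDN
    else
        echo OK
    fi
}

# Abbreviated cmp5 routes from above: tun0 and ovn-k8s-mp0 coexist.
sample='10.254.0.0/16 dev tun0 scope link
10.254.2.0/24 dev ovn-k8s-mp0 proto kernel scope link
172.30.0.0/16 dev tun0'
check_routes "$sample"    # prints MIXED
```

On a live node the dump could be collected with, for example, `oc debug node/<node> -- chroot /host ip route`.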
Version-Release number of selected component (if applicable):
- 4.14.37
- bare-metal on HP hardware/hypervisor setup
How reproducible:
- 2 of the 8 clusters impacted so far; migration of the remaining 6 clusters is paused until the nature of the problem is validated.
Steps to Reproduce:
1. Proceed to migrate to OVN
2. Reboot nodes as outlined in step 11 (slow method, waiting for each node to come back to ready before rebooting next peer to avoid bringing the whole cluster down at once)
3. Observe that nodes can't consistently communicate with peers. Observe that some nodes (or all nodes in case of secondary cluster attempted) have a partially updated route table that still lists tun0.
4. Reboot nodes again to clear the errant entries from the routing table, stabilize operators, and allow traffic to flow.
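The inconsistent pairwise connectivity observed in step 3 can be summarized from raw probe results. A minimal sketch under an assumed `src dst ok|fail` result format; the `partial_nodes` helper and the probe results below are hypothetical, with node names mirroring the cmp5/cmp7 hosts above.

```shell
# List nodes whose probes reached only some peers - the partial
# connectivity pattern from step 3. Input lines: "src dst ok|fail".
partial_nodes() {
    printf '%s\n' "$1" | awk '
        { total[$1]++; if ($3 == "ok") good[$1]++ }
        END {
            for (n in total)
                if (good[n] > 0 && good[n] < total[n])
                    print n " partial (" good[n] "/" total[n] ")"
        }'
}

# Hypothetical probe results: cmp5 reaches cmp7 but not cmp6,
# while cmp7 reaches every peer it probed.
results='cmp5 cmp6 fail
cmp5 cmp7 ok
cmp7 cmp5 ok
cmp7 cmp6 ok'
partial_nodes "$results"    # prints: cmp5 partial (1/2)
```

A node flagged as partial is a candidate for the routing-table check and second reboot described above.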
Actual results:
- Cluster is destabilized: some nodes retain stale tun0 routes and cannot consistently reach their peers.
Expected results:
- Cluster should come online successfully in an OVN-only state, with correct routing tables, once the migration steps conclude.
Additional info:
- sosreports, must-gathers, and linked case details in comments to follow.