OpenShift Bugs · OCPBUGS-48399

OCP 4.14.37: SDN to OVN migration failed on step 11 --> routing table did not successfully update, leaving references to tun0 on all nodes after manual reboots completed


      Description of problem:

      • Customer is migrating 8 separate clusters running 4.14.37 in a protected/disconnected network space.
      • The steps to migrate from SDN to OVN were completed per the documentation: https://docs.openshift.com/container-platform/4.14/networking/ovn_kubernetes_network_provider/migrate-from-openshift-sdn.html#nw-ovn-kubernetes-migration_migrate-from-openshift-sdn
      • After completing the migration, step 11 was reached: nodes were rebooted sequentially, waiting for all hosts to return to Ready before rebooting the next node. (It is unclear which nodes were rebooted first.)
      • Observed that OAuth connectivity was failing, the console could not connect to peer nodes, and router pods could not contact all pods.
      • Suspected an issue with Geneve port access, so validated throughput from DNS pod to DNS pod on a neighboring node and observed inconsistent connectivity: some pods could reach only some other hosts, while some hosts could reach all other hosts.
      • Checked the routing table on a node that had been rebooted again as a test, and on a node still in the problematic state, and observed discrepant routing table entries:
      Route table results for cmp5 and cmp7:
      Starting pod/cmp5 ... ##PROBLEM NODE
      To use host binaries, run `chroot /host` 
      default via 160.xx.xx.254 dev br-ex proto static metric 48 
      10.254.0.0/16 dev tun0 scope link ##<<------------------------------!! 
      10.xx.xx.0/24 dev ovn-k8s-mp0 proto kernel scope link src 10.xx.xx.2 
      160.xx.xx.0/24 dev br-ex proto kernel scope link src 160.xx.xx.25 metric 48 
      169.254.169.0/29 dev br-ex proto kernel scope link src 169.xx.xx.2 
      169.254.169.1 dev br-ex src 160.xx.xx.25 
      169.254.169.3 via 10.254.2.1 dev ovn-k8s-mp0 
      172.30.0.0/16 dev tun0 ##<<---------------------------------------!! 
      
      cmp7 ##WORKING NODE
      Starting pod/cmp7 ... 
      To use host binaries, run `chroot /host` 
      default via 160.xx.xx.254 dev br-ex proto static metric 48 
      10.xx.0.0/16 via 10.xx.xx.1 dev ovn-k8s-mp0 
      10.xx.3.0/24 dev ovn-k8s-mp0 proto kernel scope link src 10.xx.xx.2 
      160.xx.xx.0/24 dev br-ex proto kernel scope link src 160.xx.xx.27 metric 48 
      169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2 
      169.254.169.1 dev br-ex src 160.xx.xx.27 
      169.254.169.3 via 10.254.3.1 dev ovn-k8s-mp0 
      172.30.0.0/16 via 169.254.169.4 dev br-ex mtu 1400 
      • Subsequently rebooting this node allowed the host to come online successfully and provision to match its working node peers, which had been provisioned appropriately.
      • Because all nodes SHOULD have rolled over properly on the FIRST manual reboot, as outlined by the documentation, this is classified as a bug: additional rescue steps were necessary to stabilize the cluster and ensure nodes were provisioned with the desired OVN-only routing table entries.
      • The old tun0 entries should have been torn down and replaced with the ovn-k8s-mp0 gateway during the reboot process in step 11.
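The stale-route condition above can be checked mechanically. A minimal sketch, assuming only that a leftover "dev tun0" route marks an incompletely migrated node; the sample route lines are taken from the cmp5/cmp7 dumps in this report, and `has_stale_tun0` is a hypothetical helper, not part of any shipped tooling. On a live cluster the input would come from `oc debug node/<name> -- chroot /host ip route show`:

```shell
# has_stale_tun0: succeed if a routing-table dump still references the
# legacy SDN tun0 interface (hypothetical helper for illustration only).
has_stale_tun0() {
  printf '%s\n' "$1" | grep -q 'dev tun0'
}

# Route lines taken from this report: cmp5 (problem node) vs cmp7 (working).
cmp5_routes='10.254.0.0/16 dev tun0 scope link
172.30.0.0/16 dev tun0'
cmp7_routes='172.30.0.0/16 via 169.254.169.4 dev br-ex mtu 1400'

has_stale_tun0 "$cmp5_routes" && echo 'cmp5: stale tun0 routes, reboot again'
has_stale_tun0 "$cmp7_routes" || echo 'cmp7: routing table is OVN-only'
```

Running such a check against every node before declaring the migration done would have surfaced the partially updated tables without waiting for OAuth/console/router symptoms.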

       

      Version-Release number of selected component (if applicable):

      • 4.14.37
      • bare-metal on HP hardware/hypervisor setup

      How reproducible:

      • 2 clusters impacted so far; the remaining 6 clusters are paused until the nature of the problem is validated.

      Steps to Reproduce:

      1. Proceed to migrate to OVN.

      2. Reboot nodes as outlined in step 11 (slow method, waiting for each node to come back to Ready before rebooting the next peer, to avoid bringing the whole cluster down at once).

      3. Observe that nodes cannot consistently communicate with peers. Observe that some nodes (or all nodes, in the case of the second cluster attempted) have a partially updated route table that still lists tun0.

      4. Reboot the nodes again to clear the errant entries from the table, stabilize operators, and allow traffic to flow.
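The per-node reboot-and-wait discipline in steps 2 and 4 can be sketched as a loop. This is an illustrative assumption, not the documented procedure verbatim; `OC` is made overridable only so the flow can be exercised without a cluster, and the node names in the example are hypothetical:

```shell
# Illustrative sketch of the sequential reboot from steps 2 and 4.
# OC defaults to the real oc CLI; point it at a stub for dry runs.
OC=${OC:-oc}

reboot_node_and_wait() {  # usage: reboot_node_and_wait <node-name>
  local node=$1
  # Reboot the host from a debug pod (assumes cluster-admin access).
  "$OC" debug node/"$node" -- chroot /host systemctl reboot
  # Block until this node reports Ready before touching the next peer.
  "$OC" wait node/"$node" --for=condition=Ready --timeout=15m
}

# Example (node names illustrative):
#   for node in cmp5 cmp7; do reboot_node_and_wait "$node"; done
```

Note that `oc wait --for=condition=Ready` can return immediately if the node has not yet transitioned to NotReady after the reboot command, so a production version would also confirm the node actually went down first.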

       

      Actual results:

      • cluster is destabilized

      Expected results:

      • cluster should come online successfully in OVN state with correct routing table after migration steps concluded

      Additional info:

      • sosreports/must-gathers/linked case details in comments to follow

              pliurh Peng Liu
              rhn-support-wrussell Will Russell
              Anurag Saxena