OpenShift Bugs: OCPBUGS-8246

OVN-Kubernetes control plane redeployment is necessary after deleting the nodes in the etcd restore process


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version: 4.11
    • Fix Version: 4.11.0
    • Component: Documentation / etcd
    • Sprint: OSDOCS Sprint 233, OSDOCS Sprint 234

      Description of problem:

      Between step 12 and 13 of https://docs.openshift.com/container-platform/4.11/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html , it is necessary to specify that the user must wait up to several minutes (sometimes more than five) for the CNO to redeploy the OVN-Kubernetes control plane once the node objects associated with the wrong master nodes have been deleted.
      
      If this wait is not added, step 13 may cause the OVN databases to be bootstrapped again as clustered and to stay that way.
      
      The best way to check is to confirm that the ovnkube-master daemon set no longer contains any reference to the wrong master IPs, for example with this command:
      
      oc -n openshift-ovn-kubernetes get ds/ovnkube-master -o yaml | grep -E "${WRONG_MASTER_IP_1}|${WRONG_MASTER_IP_2}"
      
      The user should wait until the command above returns an empty result (or piping it to wc -l shows 0).
      

      Version-Release number of selected component (if applicable):

      4.11.13
      

      How reproducible:

      Often, during a cluster restore when OVN-Kubernetes is in use and the nodes have not been deleted automatically but are also not ready (because there is no cloud provider or the broken masters are powered off).
      

      Steps to Reproduce:

      1. Follow the documented restore steps and be unlucky.
      

      Actual results:

      OVN-Kubernetes tries to start clustered databases.
      

      Expected results:

      OVN-Kubernetes works after step 13, so that the machine-api can work during step 14 (if required in the environment) and, in general, so that the procedure can safely continue. A quick health check is sketched below.
      

      Additional info:

      Not relevant for writing the documentation fix, but just for the record, the reasons why it takes several minutes for the CNO to re-bootstrap OVN-Kubernetes after the node deletions likely are (or include) the following; see the sketch after this list for one way to observe the leader election:
      - If the network-operator pod was running on one of the dead masters, it can take a long time until the new pod spawned on the surviving master acquires the leader lease and becomes active.
      - The CNO introduces an intentional delay when reconciling OVN-Kubernetes if the number of masters is smaller than 3, because it assumes it may need to wait for other masters to be installed (which is not the case in this procedure).
      

              Tami Love (rhn-support-tlove)
              Pablo Alonso Rodriguez (rhn-support-palonsor)
              Ge Liu