OpenShift Bugs / OCPBUGS-20194

Stop other control plane components on non-recovery hosts during etcd restore procedure


    • Important
    • Sprint: OSDOCS Sprint 247, OSDOCS Sprint 248, OSDOCS Sprint 249
    • Release Note Not Required

      Documentation URL: https://docs.openshift.com/container-platform/4.14/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

      Under some circumstances, if kube-scheduler or kube-controller-manager keeps running on the non-recovery hosts while the procedure switches from the normal three-node control plane to the temporary single-node one, those components can malfunction. The most visible consequence is that the ovnkube-control-plane pods can remain stuck in Pending indefinitely at step 12 because of the kube-scheduler malfunction.
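
      For reference, one way to observe that symptom is the following check (a minimal sketch; the openshift-ovn-kubernetes namespace and the ovnkube-control-plane pod name prefix reflect the default OVN-Kubernetes deployment and are not taken from the documented procedure):

      # List the OVN-Kubernetes control plane pods and look for a Pending status
      oc -n openshift-ovn-kubernetes get pods | grep ovnkube-control-plane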

      Solution: Do not run kube-scheduler or kube-controller-manager on non-recovery hosts (cluster-restore.sh restarts them properly, so we don't have to worry about them on the recovery host).

      So what we would need:

      • In step 4, after the current sub-step e and before sub-step f, add the following sub-steps (with the same formatting as sub-steps d and e); a consolidated sketch of these commands follows this list:
        • Move the existing Kubernetes Controller Manager pod file out of the kubelet manifest directory: sudo mv /etc/kubernetes/manifests/kube-controller-manager-pod.yaml /tmp
        • Verify that the Kubernetes Controller Manager containers are stopped: sudo crictl ps | grep kube-controller-manager | egrep -v "operator|guard". The output of this command should be empty. If it is not empty, wait a few minutes and check again.
        • Move the existing Kubernetes Scheduler pod file out of the kubelet manifest directory: sudo mv /etc/kubernetes/manifests/kube-scheduler-pod.yaml /tmp
        • Verify that the Kubernetes Scheduler containers are stopped: sudo crictl ps | grep kube-scheduler | egrep -v "operator|guard". The output of this command should be empty. If it is not empty, wait a few minutes and check again.
      • On step 7, add a paragraph or note at the bottom telling the user that the output of cluster-restore.sh must show that the etcd, kube-apiserver, kube-controller-manager, and kube-scheduler pods are stopped and that all of them are later started (see also the illustrative check after this list).
      • On step 4d), where it says "Verify that the Kubernetes API server pods are stopped.", it should say "Verify that the Kubernetes API server containers are stopped.", because the command checks the containers of a single pod. This is not strictly related to the issue, but fixing it would keep the wording consistent with the sub-steps proposed above.
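
      As referenced in the proposed step 4 sub-steps above, a consolidated sketch of the commands to run on each non-recovery control plane host (the first four commands are the ones listed above; the final check is illustrative only and not part of the documented procedure):

      # On each non-recovery control plane host, move the static pod manifests
      # out of the kubelet manifest directory so the components stop running
      sudo mv /etc/kubernetes/manifests/kube-controller-manager-pod.yaml /tmp
      sudo mv /etc/kubernetes/manifests/kube-scheduler-pod.yaml /tmp

      # Verify that the corresponding containers are stopped; both commands should
      # eventually print nothing (if not, wait a few minutes and check again)
      sudo crictl ps | grep kube-controller-manager | egrep -v "operator|guard"
      sudo crictl ps | grep kube-scheduler | egrep -v "operator|guard"

      # After cluster-restore.sh completes on the recovery host (step 7), an illustrative
      # way to confirm that the etcd, kube-apiserver, kube-controller-manager and
      # kube-scheduler containers are running again on that host:
      sudo crictl ps | egrep "etcd|kube-apiserver|kube-controller-manager|kube-scheduler" | egrep -v "operator|guard"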

            People: Neal Alhadeff (rhn-support-nalhadef), Pablo Alonso Rodriguez (rhn-support-palonsor), ge liu