OpenShift Bugs / OCPBUGS-60700

The steps for etcd restore procedure for on-premise HCP cluster need to be revisited


    • Quality / Stability / Reliability
    • False
    • 2
    • Moderate
    • OSDOCS Sprint 276
    • 1
    • In Progress
    • Bug Fix
    • *Cause*: What actions or circumstances cause this bug to present.
      *Consequence*: What happens when the bug presents.
      *Fix*: What was done to fix the bug.
      *Result*: Bug doesn’t present anymore.

      Description of problem:

      The etcd restore procedure in the following documentation appears to be incomplete.
      
      https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html-single/hosted_control_planes/index#hcp-backup-restore-on-premise
      
      The control plane pods do not roll out automatically after all four documented steps are followed. The following additional steps are required to get all the HCP control plane pods running.
      
      Manually trigger a rollout of the hostedcluster:
      
      oc annotate hostedcluster -n <hostedcluster-namespace> <hostedcluster-name> hypershift.openshift.io/restart-date=$(date --iso-8601=seconds)
      
      After this rollout, the multus admission controller and network node identity pods still do not start.
      
      Delete the pods for the second and third etcd members along with their PVCs:
      
      oc delete -n $CONTROL_PLANE_NAMESPACE pvc/data-etcd-1 pod/etcd-1 --wait=false
      oc delete -n $CONTROL_PLANE_NAMESPACE pvc/data-etcd-2 pod/etcd-2 --wait=false
      
      Manually trigger a rollout of the hostedcluster again:
      
      oc annotate hostedcluster -n <hostedcluster-namespace> <hostedcluster-name> hypershift.openshift.io/restart-date=$(date --iso-8601=seconds) --overwrite
      
      After some time, all the control plane pods start running.
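
The additional steps above could be collected into a small script along the following lines. This is only a sketch of the workaround as reported here, not a documented procedure. The variable names HC_NAMESPACE and HC_NAME, the DRY_RUN switch, and the helper functions run, restart_hosted_cluster, and reset_etcd_followers are illustrative assumptions, and the default for CONTROL_PLANE_NAMESPACE assumes the namespace follows the `<hostedcluster-namespace>-<hostedcluster-name>` pattern. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the extra post-restore steps from this report.
# All names below except the oc commands themselves are illustrative.
set -eu

HC_NAMESPACE="${HC_NAMESPACE:-<hostedcluster-namespace>}"
HC_NAME="${HC_NAME:-<hostedcluster-name>}"
# Assumption: the control plane namespace is <hostedcluster-namespace>-<hostedcluster-name>.
CONTROL_PLANE_NAMESPACE="${CONTROL_PLANE_NAMESPACE:-${HC_NAMESPACE}-${HC_NAME}}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Trigger a rollout by setting the restart-date annotation.
# --overwrite makes the call safe to repeat.
restart_hosted_cluster() {
  run oc annotate hostedcluster -n "$HC_NAMESPACE" "$HC_NAME" \
    "hypershift.openshift.io/restart-date=$(date --iso-8601=seconds)" --overwrite
}

# Delete the second and third etcd members together with their PVCs.
reset_etcd_followers() {
  for member in 1 2; do
    run oc delete -n "$CONTROL_PLANE_NAMESPACE" \
      "pvc/data-etcd-$member" "pod/etcd-$member" --wait=false
  done
}

# Order as reported: rollout, delete etcd-1 and etcd-2, rollout again.
# The report does not state how long to wait between the steps.
restart_hosted_cluster
reset_etcd_followers
restart_hosted_cluster
```

To execute the commands instead of printing them, set DRY_RUN=0 together with HC_NAMESPACE, HC_NAME, and, if the namespace assumption does not hold, CONTROL_PLANE_NAMESPACE.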

       

      Version-Release number of selected component (if applicable):

          4.18.19

      How reproducible:

          100% in the customer environment

      Steps to Reproduce:

      Follow the documented procedure on a bare-metal HCP cluster.
      
      https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html-single/hosted_control_planes/index#hcp-backup-restore-on-premise  

      Actual results:

      Some steps appear to be missing: the control plane pods do not start correctly when only the documented procedure is followed.

      Expected results:

      The missing steps are added to the documentation.

      Additional info:

      These steps were tested in the customer environment. The additional steps described above are required every time etcd is restored with the manual method.

              rhn-support-lahinson Laura Hinson
              rhn-support-alosingh Alok Singh
              None
              None
              Martin Gencur
              None