Data Foundation Bugs / DFBUGS-2680

OCP 4.18 OSD worker node replacement does not recreate OSD pod


    • Bug
    • Resolution: Unresolved
    • odf-4.18.5
    • rook
    • Critical
    • Proposed

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      When trying to replace an OSD worker node (rather than just the OSD device) following the official doc steps, the OSD pod for the replaced OSD worker node is not recreated.

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      • Bare metal, installed via the Assisted Installer (libvirt-simulated BM)

       

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      • Internal-Attached (LSO)
      • 3 control plane / 3 OSD worker nodes, 1 OSD disk per worker node (cluster + ODF/LSO installed and set up by the Assisted Installer; see the command sketch below)
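      A hedged sketch of how this layout can be confirmed; the commands below are based on standard ODF/LSO resources and are assumptions, not commands taken from this report:

          oc get storagecluster -n openshift-storage
          oc get pv | grep localblock-sc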

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      • OCP 4.18.15
      • Operator CSVs

       

      cephcsi-operator.v4.18.4-rhodf           CephCSI operator                   4.18.4-rhodf   cephcsi-operator.v4.18.3-rhodf          Succeeded
      mcg-operator.v4.18.4-rhodf               NooBaa Operator                    4.18.4-rhodf   mcg-operator.v4.18.3-rhodf              Succeeded
      ocs-client-operator.v4.18.4-rhodf        OpenShift Data Foundation Client   4.18.4-rhodf   ocs-client-operator.v4.18.3-rhodf       Succeeded
      ocs-operator.v4.18.4-rhodf               OpenShift Container Storage        4.18.4-rhodf   ocs-operator.v4.18.3-rhodf              Succeeded
      odf-csi-addons-operator.v4.18.4-rhodf    CSI Addons                         4.18.4-rhodf   odf-csi-addons-operator.v4.18.3-rhodf   Succeeded
      odf-dependencies.v4.18.4-rhodf           Data Foundation Dependencies       4.18.4-rhodf   odf-dependencies.v4.18.3-rhodf          Succeeded
      odf-node-recovery-operator.v1.1.0-rc.7   ODF Node Recovery Operator         1.1.0-rc.7                                             Succeeded
      odf-operator.v4.18.4-rhodf               OpenShift Data Foundation          4.18.4-rhodf   odf-operator.v4.18.3-rhodf              Succeeded
      odf-prometheus-operator.v4.18.4-rhodf    Prometheus Operator                4.18.4-rhodf   odf-prometheus-operator.v4.18.3-rhodf   Succeeded
      recipe.v4.18.4-rhodf                     Recipe                             4.18.4-rhodf   recipe.v4.18.3-rhodf                    Succeeded
      rook-ceph-operator.v4.18.4-rhodf         Rook-Ceph                          4.18.4-rhodf   rook-ceph-operator.v4.18.3-rhodf        Succeeded 
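      (For reference, a CSV listing like the one above is typically produced with a command along these lines; the exact command used was not captured in this report:)

          oc get csv -n openshift-storage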

       

       

       

      Does this issue impact your ability to continue to work with the product?

      • Yes

       

      Is there any workaround available to the best of your knowledge?

      • No

       

      Can this issue be reproduced? If so, please provide the hit rate

      • Yes - 100%

       

      Can this issue be reproduced from the UI?

      • Not sure - the official doc steps being followed are CLI (kube API) based

       

      If this is a regression, please provide more details to justify this:

      • Possibly, as I recall this being successful on earlier OCP versions (e.g. 4.14), although I need to retest

      Steps to Reproduce:

      1. Deploy a bare metal OCP 4.18 cluster using the Assisted Installer. It should have 3 control plane nodes and 3 OSD worker nodes, each with 1 extra disk for an OSD. The Assisted Installer installs and sets up the ODF/LSO operators at install time.

      2. Power off an OSD worker node in order to simulate a node failure
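      In this libvirt-simulated environment the power-off could, for example, be done from the hypervisor host; the domain name below is a placeholder, not taken from this report:

          virsh destroy <worker-vm-domain>    # forces an immediate power-off of the VM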

      3. Follow the steps to replace the OSD worker node using the Assisted Installer. The new OSD worker node will have the same hostname/node name as the original OSD worker, and the replacement OSD disk will be blank.
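      A minimal sketch of the cluster-side part of this step, assuming the usual node-replacement flow; the node name is taken from the Released PV's HOSTNAME shown later and is assumed to be the replaced worker, and the exact commands were not captured in this report:

          oc delete node ocptest-worker-1    # remove the failed node object before the reinstalled host re-registers under the same name
          oc get nodes                       # confirm the replacement node has joined and is Ready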

      4. Follow the official documented steps for replacing storage devices 

       

      The exact date and time when the issue was observed, including timezone details:

      June 2, 10:02 AM Eastern Time (ET)

      Actual results:

      The OSD pod does not get recreated and Ceph remains in an unhealthy state

          NAME                               READY   STATUS    RESTARTS   AGE
          rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
          rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h
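      The unhealthy state can be confirmed from the rook-ceph-tools pod, e.g. with something like the following (assumes the tools deployment is enabled in openshift-storage; the exact command was not captured in this report):

          TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
          oc -n openshift-storage exec -it ${TOOLS_POD} -- ceph status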

       

      Expected results:

      The OSD pod is recreated and Ceph eventually returns to a healthy state

       

      Additional info:

      Here are the detailed steps taken, with output:

      - OSD worker node is powered off to simulate a node failure
      - The OSD pod goes into Pending state
      NAME                               READY   STATUS    RESTARTS   AGE
      rook-ceph-osd-0-8978cd9b6-44r74    0/2     Pending   0          7m5s
      rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
      rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h
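      (Note: the OSD pod listings in this section were presumably gathered with a command along these lines; the label selector is the standard rook OSD label and is an assumption, since the exact command was not captured:)
          oc get pods -n openshift-storage -l app=rook-ceph-osd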
      - Worker node is replaced using assisted installer (deleted and reinstalled)
          - It will have the same node name as the node it is replacing
          - The new node's OSD disk is a new empty disk
      - Node is replaced and eventually the OSD pod goes into CrashLoopBackOff (CLBO)
          NAME                               READY   STATUS                  RESTARTS      AGE
          rook-ceph-osd-0-8978cd9b6-44r74    0/2     Init:CrashLoopBackOff   3 (42s ago)   10m
          rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running                 2             3d16h
          rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running                 2             3d16h
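          (To see why the init container is crash-looping, the pod events and container logs could be inspected with something like the following; the pod name is the one shown above:)
          oc -n openshift-storage describe pod rook-ceph-osd-0-8978cd9b6-44r74
          oc -n openshift-storage logs rook-ceph-osd-0-8978cd9b6-44r74 --all-containers --tail=100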
      - Original PVs do not change
          LSO PVCs:
          NAMESPACE           NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
          openshift-storage   db-noobaa-db-pg-0             Bound    pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            ocs-storagecluster-ceph-rbd   <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-0lb7wj   Bound    local-pv-cf2c16ed                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-1jlgcs   Bound    local-pv-ad0b1b68                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-2ls4wb   Bound    local-pv-85501e5f                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
      - CLBO OSD pod is scaled down per docs, leaving the 2 remaining OSD pods
          oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
          NAME                               READY   STATUS    RESTARTS   AGE
          rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
          rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h    
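          (In the scale-down command above, ${osd_id_to_remove} is assumed to have been set to the failed OSD's ID beforehand, e.g.:)
          osd_id_to_remove=0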
          
      - OSD removal job is run and completed (as per docs)
          oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
          oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
          oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
          NAME                        READY   STATUS      RESTARTS   AGE
          ocs-osd-removal-job-h244k   0/1     Completed   0          41s
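          (Per the same documented flow, the completed removal job would then normally be cleaned up before proceeding, e.g.:)
          oc delete -n openshift-storage job ocs-osd-removal-job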
          
      - OSD PV is Released for a moment
          NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                           STORAGECLASS                  VOLUMEATTRIBUTESCLASS   REASON   AGE     HOSTNAME
          local-pv-85501e5f                          100Gi      RWO            Delete           Released   openshift-storage/ocs-deviceset-0-data-2ls4wb   localblock-sc                 <unset>                          3d16h   ocptest-worker-1
          local-pv-ad0b1b68                          100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-0-data-1jlgcs   localblock-sc                 <unset>                          3d16h   ocptest-worker-2
          local-pv-cf2c16ed                          100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-0-data-0lb7wj   localblock-sc                 <unset>                          3d16h   ocptest-worker-3
          pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            Delete           Bound      openshift-storage/db-noobaa-db-pg-0             ocs-storagecluster-ceph-rbd   <unset>                          3d16h
                              
      - Then it immediately moves to Available and then automatically gets rebound
          NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                           STORAGECLASS                  VOLUMEATTRIBUTESCLASS   REASON   AGE     HOSTNAME
          local-pv-85501e5f                          100Gi      RWO            Delete           Available                                                   localblock-sc                 <unset>                           4s      ocptest-worker-1
          local-pv-ad0b1b68                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-0-data-1jlgcs   localblock-sc                 <unset>                           3d16h   ocptest-worker-2
          local-pv-cf2c16ed                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-0-data-0lb7wj   localblock-sc                 <unset>                           3d16h   ocptest-worker-3
          pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            Delete           Bound       openshift-storage/db-noobaa-db-pg-0             ocs-storagecluster-ceph-rbd   <unset>                           3d16h
          # The PV is then automatically rebound to a new PVC:
          NAMESPACE           NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
          openshift-storage   db-noobaa-db-pg-0             Bound    pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            ocs-storagecluster-ceph-rbd   <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-0lb7wj   Bound    local-pv-cf2c16ed                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-1jlgcs   Bound    local-pv-ad0b1b68                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-2jzj4m   Bound    local-pv-85501e5f                          100Gi      RWO            localblock-sc                 <unset>                 7s
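      (At this point the rook operator would be expected to create a new OSD prepare job for the rebound PVC. Whether it did, and why not, could be checked with something like the following; the prepare-job naming and operator deployment name are standard rook/ODF conventions assumed here, not taken from the report:)
          oc get pods -n openshift-storage | grep rook-ceph-osd-prepare
          oc -n openshift-storage logs deploy/rook-ceph-operator --tail=200 | grep -i osd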
      - I have also tried deleting the PV once it moved to the Released state, but it is auto-recreated and bound, with the same outcome
      - At this point the cluster only has the remaining 2 OSD pods running:
          NAME                               READY   STATUS    RESTARTS   AGE
          rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
          rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h
      - If I simulate an OSD disk failure and replacement (as opposed to replacing the entire OSD node) and follow the documented steps, then the OSD pod is recreated and Ceph moves back into a healthy state

       

       

              sapillai (Santosh Pillai)
              chadcrum (Chad Crum)
              Elad Ben Aharon
