Data Foundation Bugs / DFBUGS-2680

OCP 4.18 OSD worker node replacement does not recreate OSD pod


    • Bug
    • Resolution: Unresolved
    • odf-4.18.5
    • rook
    • Critical
    • Proposed

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      When trying to replace an OSD worker node (rather than just the OSD device) following the official doc steps, the OSD pod for the replaced OSD worker node is not recreated.

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      • Bare metal, installed via the Assisted Installer (libvirt-simulated BM)

       

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      • Internal-Attached (LSO)
      • 3 control plane / 3 OSD worker nodes, 1 OSD disk per worker node (cluster + ODF/LSO installed and set up by the Assisted Installer; see the command sketch below)
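      A hedged sketch of how this layout can be confirmed; the commands below are based on standard ODF/LSO resources and are assumptions, not commands taken from this report:

          oc get storagecluster -n openshift-storage
          oc get pv | grep localblock-sc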

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      • OCP 4.18.15
      • Operator CSVs

       

      cephcsi-operator.v4.18.4-rhodf           CephCSI operator                   4.18.4-rhodf   cephcsi-operator.v4.18.3-rhodf          Succeeded
      mcg-operator.v4.18.4-rhodf               NooBaa Operator                    4.18.4-rhodf   mcg-operator.v4.18.3-rhodf              Succeeded
      ocs-client-operator.v4.18.4-rhodf        OpenShift Data Foundation Client   4.18.4-rhodf   ocs-client-operator.v4.18.3-rhodf       Succeeded
      ocs-operator.v4.18.4-rhodf               OpenShift Container Storage        4.18.4-rhodf   ocs-operator.v4.18.3-rhodf              Succeeded
      odf-csi-addons-operator.v4.18.4-rhodf    CSI Addons                         4.18.4-rhodf   odf-csi-addons-operator.v4.18.3-rhodf   Succeeded
      odf-dependencies.v4.18.4-rhodf           Data Foundation Dependencies       4.18.4-rhodf   odf-dependencies.v4.18.3-rhodf          Succeeded
      odf-node-recovery-operator.v1.1.0-rc.7   ODF Node Recovery Operator         1.1.0-rc.7                                             Succeeded
      odf-operator.v4.18.4-rhodf               OpenShift Data Foundation          4.18.4-rhodf   odf-operator.v4.18.3-rhodf              Succeeded
      odf-prometheus-operator.v4.18.4-rhodf    Prometheus Operator                4.18.4-rhodf   odf-prometheus-operator.v4.18.3-rhodf   Succeeded
      recipe.v4.18.4-rhodf                     Recipe                             4.18.4-rhodf   recipe.v4.18.3-rhodf                    Succeeded
      rook-ceph-operator.v4.18.4-rhodf         Rook-Ceph                          4.18.4-rhodf   rook-ceph-operator.v4.18.3-rhodf        Succeeded 
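      (For reference, a CSV listing like the one above is typically produced with a command along these lines; the exact command used was not captured in this report:)

          oc get csv -n openshift-storage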

       

       

       

      Does this issue impact your ability to continue to work with the product?

      • Yes

       

      Is there any workaround available to the best of your knowledge?

      • No

       

      Can this issue be reproduced? If so, please provide the hit rate

      • Yes - 100%

       

      Can this issue be reproduced from the UI?

      • Not sure - the official doc steps being followed are CLI (kube API) based

       

      If this is a regression, please provide more details to justify this:

      • Possibly, as I recall this being successful on earlier OCP versions (e.g. 4.14), although I need to retest

      Steps to Reproduce:

      1. Deploy a bare metal OCP 4.18 cluster using the Assisted Installer. It should have 3 control plane nodes and 3 OSD worker nodes, each with 1 extra disk for an OSD. The Assisted Installer installs and sets up the ODF/LSO operators at install time.

      2. Power off an OSD worker node in order to simulate a node failure
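      In this libvirt-simulated environment the power-off could, for example, be done from the hypervisor host; the domain name below is a placeholder, not taken from this report:

          virsh destroy <worker-vm-domain>    # forces an immediate power-off of the VM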

      3. Follow the steps to replace the OSD worker node using the Assisted Installer. The new OSD worker node will have the same hostname/node name as the original OSD worker, and the replacement OSD disk will be blank.
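      A minimal sketch of the cluster-side part of this step, assuming the usual node-replacement flow; the node name is taken from the Released PV's HOSTNAME shown later and is assumed to be the replaced worker, and the exact commands were not captured in this report:

          oc delete node ocptest-worker-1    # remove the failed node object before the reinstalled host re-registers under the same name
          oc get nodes                       # confirm the replacement node has joined and is Ready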

      4. Follow the official documented steps for replacing storage devices 

       

      The exact date and time when the issue was observed, including timezone details:

      June 2, 10:02 AM Eastern Time (ET)

      Actual results:

      The OSD pod does not get recreated and Ceph remains in an unhealthy state

          NAME                               READY   STATUS    RESTARTS   AGE
          rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
          rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h
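      The unhealthy state can be confirmed from the rook-ceph-tools pod, e.g. with something like the following (assumes the tools deployment is enabled in openshift-storage; the exact command was not captured in this report):

          TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
          oc -n openshift-storage exec -it ${TOOLS_POD} -- ceph status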

       

      Expected results:

      The OSD pod is recreated and Ceph eventually returns to a healthy state

       

      Additional info:

      Here are the detailed steps taken, with output:

      - OSD worker node is powered off to simulate a node failure
      - The OSD pod goes into Pending state
      NAME                               READY   STATUS    RESTARTS   AGE
      rook-ceph-osd-0-8978cd9b6-44r74    0/2     Pending   0          7m5s
      rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
      rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h
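      (Note: the OSD pod listings in this section were presumably gathered with a command along these lines; the label selector is the standard rook OSD label and is an assumption, since the exact command was not captured:)
          oc get pods -n openshift-storage -l app=rook-ceph-osd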
      - Worker node is replaced using assisted installer (deleted and reinstalled)
          - It will have the same node name as the node it is replacing
          - The new node's OSD disk is a new empty disk
      - Node is replaced and eventually the OSD pod goes into CrashLoopBackOff (CLBO)
          NAME                               READY   STATUS                  RESTARTS      AGE
          rook-ceph-osd-0-8978cd9b6-44r74    0/2     Init:CrashLoopBackOff   3 (42s ago)   10m
          rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running                 2             3d16h
          rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running                 2             3d16h
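          (To see why the init container is crash-looping, the pod events and container logs could be inspected with something like the following; the pod name is the one shown above:)
          oc -n openshift-storage describe pod rook-ceph-osd-0-8978cd9b6-44r74
          oc -n openshift-storage logs rook-ceph-osd-0-8978cd9b6-44r74 --all-containers --tail=100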
      - Original PVs do not change
          LSO PVCs:
          NAMESPACE           NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
          openshift-storage   db-noobaa-db-pg-0             Bound    pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            ocs-storagecluster-ceph-rbd   <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-0lb7wj   Bound    local-pv-cf2c16ed                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-1jlgcs   Bound    local-pv-ad0b1b68                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-2ls4wb   Bound    local-pv-85501e5f                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
      - CLBO OSD pod is scaled down per docs, leaving the 2 remaining OSD pods
          oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
          NAME                               READY   STATUS    RESTARTS   AGE
          rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
          rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h    
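          (In the scale-down command above, ${osd_id_to_remove} is assumed to have been set to the failed OSD's ID beforehand, e.g.:)
          osd_id_to_remove=0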
          
      - OSD removal job is run and completed (as per docs)
          oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
          oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
          oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
          NAME                        READY   STATUS      RESTARTS   AGE
          ocs-osd-removal-job-h244k   0/1     Completed   0          41s
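          (Per the same documented flow, the completed removal job would then normally be cleaned up before proceeding, e.g.:)
          oc delete -n openshift-storage job ocs-osd-removal-job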
          
      - OSD PV is Released for a moment
          NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                           STORAGECLASS                  VOLUMEATTRIBUTESCLASS   REASON   AGE     HOSTNAME
          local-pv-85501e5f                          100Gi      RWO            Delete           Released   openshift-storage/ocs-deviceset-0-data-2ls4wb   localblock-sc                 <unset>                          3d16h   ocptest-worker-1
          local-pv-ad0b1b68                          100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-0-data-1jlgcs   localblock-sc                 <unset>                          3d16h   ocptest-worker-2
          local-pv-cf2c16ed                          100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-0-data-0lb7wj   localblock-sc                 <unset>                          3d16h   ocptest-worker-3
          pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            Delete           Bound      openshift-storage/db-noobaa-db-pg-0             ocs-storagecluster-ceph-rbd   <unset>                          3d16h
                              
      - Then it immediately moves to Available and then automatically gets rebound
          NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                           STORAGECLASS                  VOLUMEATTRIBUTESCLASS   REASON   AGE     HOSTNAME
          local-pv-85501e5f                          100Gi      RWO            Delete           Available                                                   localblock-sc                 <unset>                           4s      ocptest-worker-1
          local-pv-ad0b1b68                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-0-data-1jlgcs   localblock-sc                 <unset>                           3d16h   ocptest-worker-2
          local-pv-cf2c16ed                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-0-data-0lb7wj   localblock-sc                 <unset>                           3d16h   ocptest-worker-3
          pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            Delete           Bound       openshift-storage/db-noobaa-db-pg-0             ocs-storagecluster-ceph-rbd   <unset>                           3d16h
          # The PV is then automatically rebound to a new PVC:
          NAMESPACE           NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
          openshift-storage   db-noobaa-db-pg-0             Bound    pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            ocs-storagecluster-ceph-rbd   <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-0lb7wj   Bound    local-pv-cf2c16ed                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-1jlgcs   Bound    local-pv-ad0b1b68                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
          openshift-storage   ocs-deviceset-0-data-2jzj4m   Bound    local-pv-85501e5f                          100Gi      RWO            localblock-sc                 <unset>                 7s
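      (At this point the rook operator would be expected to create a new OSD prepare job for the rebound PVC. Whether it did, and why not, could be checked with something like the following; the prepare-job naming and operator deployment name are standard rook/ODF conventions assumed here, not taken from the report:)
          oc get pods -n openshift-storage | grep rook-ceph-osd-prepare
          oc -n openshift-storage logs deploy/rook-ceph-operator --tail=200 | grep -i osd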
      - I have also tried deleting the PV once it moved to the Released state, but it is auto-recreated and bound, with the same outcome
      - At this point the cluster only has the remaining 2 OSD pods running:
          NAME                               READY   STATUS    RESTARTS   AGE
          rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
          rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h
      - If I simulate an OSD disk failure and replacement (as opposed to replacing the entire OSD node) and follow the documented steps, then the OSD pod is recreated and Ceph moves back into a healthy state

       

       

              sapillai (Santosh Pillai)
              chadcrum (Chad Crum)
              Elad Ben Aharon
