- Bug
- Resolution: Unresolved
- Undefined
- None
- odf-4.18.5
Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:
When replacing an OSD worker node (rather than just the OSD device) following the official doc steps, the OSD pod for the replaced OSD worker node is not recreated.
The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):
- Bare metal, installed via the Assisted Installer (libvirt-simulated BM)
The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):
- Internal-Attached (LSO)
- 3 control plane / 3 OSD worker nodes - 1 OSD disk per worker node (Cluster + ODF/LSO installed/setup using Assisted Installer)
The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):
- OCP 4.18.15
- Operator CSVs
NAME                                      DISPLAY                            VERSION        REPLACES                                 PHASE
cephcsi-operator.v4.18.4-rhodf            CephCSI operator                   4.18.4-rhodf   cephcsi-operator.v4.18.3-rhodf           Succeeded
mcg-operator.v4.18.4-rhodf                NooBaa Operator                    4.18.4-rhodf   mcg-operator.v4.18.3-rhodf               Succeeded
ocs-client-operator.v4.18.4-rhodf         OpenShift Data Foundation Client   4.18.4-rhodf   ocs-client-operator.v4.18.3-rhodf        Succeeded
ocs-operator.v4.18.4-rhodf                OpenShift Container Storage        4.18.4-rhodf   ocs-operator.v4.18.3-rhodf               Succeeded
odf-csi-addons-operator.v4.18.4-rhodf     CSI Addons                         4.18.4-rhodf   odf-csi-addons-operator.v4.18.3-rhodf    Succeeded
odf-dependencies.v4.18.4-rhodf            Data Foundation Dependencies       4.18.4-rhodf   odf-dependencies.v4.18.3-rhodf           Succeeded
odf-node-recovery-operator.v1.1.0-rc.7    ODF Node Recovery Operator         1.1.0-rc.7                                              Succeeded
odf-operator.v4.18.4-rhodf                OpenShift Data Foundation          4.18.4-rhodf   odf-operator.v4.18.3-rhodf               Succeeded
odf-prometheus-operator.v4.18.4-rhodf     Prometheus Operator                4.18.4-rhodf   odf-prometheus-operator.v4.18.3-rhodf    Succeeded
recipe.v4.18.4-rhodf                      Recipe                             4.18.4-rhodf   recipe.v4.18.3-rhodf                     Succeeded
rook-ceph-operator.v4.18.4-rhodf          Rook-Ceph                          4.18.4-rhodf   rook-ceph-operator.v4.18.3-rhodf         Succeeded
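(For reference, the listing above was presumably gathered with something like the following; the namespace is an assumption based on a default internal-mode ODF install.)

oc get csv -n openshift-storage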
Does this issue impact your ability to continue to work with the product?
- Yes
Is there any workaround available to the best of your knowledge?
- No
Can this issue be reproduced? If so, please provide the hit rate
- Yes - 100%
Can this issue be reproduced from the UI?
- Not sure; the official doc steps are performed from the CLI (kube API) rather than the UI.
If this is a regression, please provide more details to justify this:
- Possibly; I recall this being successful on earlier OCP versions (e.g., 4.14), although I need to retest.
Steps to Reproduce:
1. Deploy a bare metal OCP 4.18 cluster using the Assisted Installer. It should have 3 control plane nodes and 3 OSD worker nodes, each with 1 extra disk for OSD use. The Assisted Installer installs and sets up the ODF/LSO operators at install time.
2. Power off an OSD worker node to simulate a node failure.
3. Follow the steps to replace the OSD worker node using the Assisted Installer. The new OSD worker node will have the same hostname/node name as the original OSD worker, and the replacement OSD disk will be blank.
4. Follow the official documented steps for replacing storage devices (a command-level sketch of that flow is shown after this list).
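For reference, the documented device-replacement flow referred to in step 4 reduces to roughly the following commands (a sketch based on the commands used later in this report, assuming the failed OSD id is 0):

osd_id_to_remove=0
oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'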
The exact date and time when the issue was observed, including timezone details:
June 2, 10:02 AM Eastern Time
Actual results:
The OSD pod does not get recreated and Ceph remains in a bad state:
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h
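Ceph health at this point can be checked from the toolbox pod (a sketch, assuming the rook-ceph-tools deployment is enabled in the openshift-storage namespace):

TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
oc -n openshift-storage rsh ${TOOLS_POD} ceph status
oc -n openshift-storage rsh ${TOOLS_POD} ceph osd tree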
Expected results:
The OSD pod is recreated and Ceph eventually returns to a good state.
Additional info:
Here are the detailed steps taken, with output:
- The OSD worker node is powered off to simulate a node failure. The OSD pod goes into Pending state:

  NAME                               READY   STATUS    RESTARTS   AGE
  rook-ceph-osd-0-8978cd9b6-44r74    0/2     Pending   0          7m5s
  rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
  rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h

- The worker node is replaced using the Assisted Installer (deleted and reinstalled). It has the same node name as the node it replaces, and its OSD disk is a new empty disk. Once the node is replaced, the OSD pod eventually goes into CrashLoopBackOff:

  NAME                               READY   STATUS                  RESTARTS      AGE
  rook-ceph-osd-0-8978cd9b6-44r74    0/2     Init:CrashLoopBackOff   3 (42s ago)   10m
  rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running                 2             3d16h
  rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running                 2             3d16h

- The original PVs do not change. LSO PVCs:

  NAMESPACE           NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
  openshift-storage   db-noobaa-db-pg-0             Bound    pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            ocs-storagecluster-ceph-rbd   <unset>                 3d16h
  openshift-storage   ocs-deviceset-0-data-0lb7wj   Bound    local-pv-cf2c16ed                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
  openshift-storage   ocs-deviceset-0-data-1jlgcs   Bound    local-pv-ad0b1b68                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
  openshift-storage   ocs-deviceset-0-data-2ls4wb   Bound    local-pv-85501e5f                          100Gi      RWO            localblock-sc                 <unset>                 3d16h

- The CrashLoopBackOff OSD pod is scaled down per the docs, leaving the 2 remaining OSD pods:

  oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

  NAME                               READY   STATUS    RESTARTS   AGE
  rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
  rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h

- The OSD removal job is run and completes (as per the docs):

  oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
  oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
  oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

  NAME                        READY   STATUS      RESTARTS   AGE
  ocs-osd-removal-job-h244k   0/1     Completed   0          41s

- The OSD PV is Released for a moment:

  NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                           STORAGECLASS                  VOLUMEATTRIBUTESCLASS   REASON   AGE     HOSTNAME
  local-pv-85501e5f                          100Gi      RWO            Delete           Released   openshift-storage/ocs-deviceset-0-data-2ls4wb   localblock-sc                 <unset>                          3d16h   ocptest-worker-1
  local-pv-ad0b1b68                          100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-0-data-1jlgcs   localblock-sc                 <unset>                          3d16h   ocptest-worker-2
  local-pv-cf2c16ed                          100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-0-data-0lb7wj   localblock-sc                 <unset>                          3d16h   ocptest-worker-3
  pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            Delete           Bound      openshift-storage/db-noobaa-db-pg-0             ocs-storagecluster-ceph-rbd   <unset>                          3d16h

- Then it immediately moves to Available and then automatically gets rebound:

  NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                           STORAGECLASS                  VOLUMEATTRIBUTESCLASS   REASON   AGE     HOSTNAME
  local-pv-85501e5f                          100Gi      RWO            Delete           Available                                                   localblock-sc                 <unset>                          4s      ocptest-worker-1
  local-pv-ad0b1b68                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-0-data-1jlgcs   localblock-sc                 <unset>                          3d16h   ocptest-worker-2
  local-pv-cf2c16ed                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-0-data-0lb7wj   localblock-sc                 <unset>                          3d16h   ocptest-worker-3
  pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            Delete           Bound       openshift-storage/db-noobaa-db-pg-0             ocs-storagecluster-ceph-rbd   <unset>                          3d16h

  # MOVES to Bound
  NAMESPACE           NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  VOLUMEATTRIBUTESCLASS   AGE
  openshift-storage   db-noobaa-db-pg-0             Bound    pvc-35b8fbc6-309f-4e79-9cdf-a96062ba8256   50Gi       RWO            ocs-storagecluster-ceph-rbd   <unset>                 3d16h
  openshift-storage   ocs-deviceset-0-data-0lb7wj   Bound    local-pv-cf2c16ed                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
  openshift-storage   ocs-deviceset-0-data-1jlgcs   Bound    local-pv-ad0b1b68                          100Gi      RWO            localblock-sc                 <unset>                 3d16h
  openshift-storage   ocs-deviceset-0-data-2jzj4m   Bound    local-pv-85501e5f                          100Gi      RWO            localblock-sc                 <unset>                 7s

- I have also deleted the PV once it moved to the "Released" state, but it is auto-recreated and bound, with the same outcome.

- At this point the cluster only has the remaining 2 OSD pods running:

  NAME                               READY   STATUS    RESTARTS   AGE
  rook-ceph-osd-1-f789d7d4f-wpx5j    2/2     Running   2          3d16h
  rook-ceph-osd-2-6f6748b4c6-xt9qh   2/2     Running   2          3d16h

- If I simulate an OSD disk failure and replacement (as opposed to the entire OSD node) and follow the documented steps, then the OSD pod is recreated and Ceph moves back into a good state.
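For completeness, a sketch of additional checks that could help narrow down why the OSD deployment is never recreated; the labels and resource names below are assumptions based on a standard Rook/ODF deployment and are not part of the documented procedure:

# Check whether the operator creates a new OSD prepare job for the replaced node
oc -n openshift-storage get jobs,pods -l app=rook-ceph-osd-prepare
# Look for OSD-related reconcile errors in the Rook operator log
oc -n openshift-storage logs deploy/rook-ceph-operator | grep -i -E 'osd|prepare' | tail -n 50
# Restart the operator to force a fresh reconcile of the CephCluster
oc -n openshift-storage delete pod -l app=rook-ceph-operator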
- blocks: FLPATH-2285 Test ODF node recovery metrics and multiple OSD failure (Closed)