Type: Bug
Resolution: Done
Priority: Blocker
Description of the problem:
I simulated replacing a Ceph disk and applied the node recovery CR, and it got stuck at "ForceDeleteRookCephOSDPods".
The node recovery controller logs continually repeat a service account authorization error:

oc logs -f odf-node-recovery-operator-controller-manager-5dbc86974-v24k6
I0314 15:24:38.482404 1 leaderelection.go:254] attempting to acquire leader lease openshift-operators/9ef0f34a.openshift.io...
I0314 15:24:54.751741 1 leaderelection.go:268] successfully acquired lease openshift-operators/9ef0f34a.openshift.io
W0314 17:40:58.738069 1 reflector.go:561] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: failed to list *v1.Job: jobs.batch is forbidden: User "system:serviceaccount:openshift-operators:odf-node-recovery-operator-controller-manager" cannot list resource "jobs" in API group "batch" at the cluster scope
E0314 17:40:58.738281 1 reflector.go:158] "Unhandled Error" err="sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: Failed to watch *v1.Job: failed to list *v1.Job: jobs.batch is forbidden: User \"system:serviceaccount:openshift-operators:odf-node-recovery-operator-controller-manager\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope" logger="UnhandledError"
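The forbidden errors indicate the operator's service account lacks cluster-scoped list/watch on jobs in the batch API group. Below is a minimal sketch for confirming the gap and working around it locally; the ClusterRole/ClusterRoleBinding names are hypothetical and not part of the shipped operator bundle:

# Confirm the permission gap the reflector is reporting
oc auth can-i list jobs.batch \
  --as=system:serviceaccount:openshift-operators:odf-node-recovery-operator-controller-manager

# Hypothetical workaround: grant the SA cluster-scoped read access to batch/jobs
cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: odf-node-recovery-jobs-reader        # illustrative name
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: odf-node-recovery-jobs-reader        # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: odf-node-recovery-jobs-reader
subjects:
- kind: ServiceAccount
  name: odf-node-recovery-operator-controller-manager
  namespace: openshift-operators
EOF

The proper fix presumably belongs in the operator's CSV/bundle cluster permissions rather than a manually applied ClusterRole.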
Node recovery status:

oc get noderecovery                               chadworklaptop: Fri Mar 14 13:50:26 2025

NAME           CREATED AT             COMPLETED AT   PHASE     STATE
noderecovery   2025-03-14T17:40:33Z                  Running   ForceDeleteRookCephOSDPods
Ceph status (from the ceph tools pod):

  cluster:
    id:     bc65606e-c89e-491f-b7a3-1682a73a99f0
    health: HEALTH_WARN
            1 osds down
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
            1 host (1 osds) down
            Degraded data redundancy: 549/1647 objects degraded (33.333%), 82 pgs degraded, 169 pgs undersized
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 20h)
    mgr: a(active, since 20h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 2 up (since 12m), 3 in (since 20h)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 549 objects, 398 MiB
    usage:   1.9 GiB used, 298 GiB / 300 GiB avail
    pgs:     549/1647 objects degraded (33.333%)
             87 active+undersized
             82 active+undersized+degraded

  io:
    client: 3.4 KiB/s rd, 13 KiB/s wr, 5 op/s rd, 1 op/s wr
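For reproducibility, the status above can be pulled from the rook-ceph tools pod; a sketch, assuming the default ODF namespace and the standard rook-ceph-tools deployment label:

oc -n openshift-storage rsh $(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name) ceph status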
How reproducible:
100%
Steps to reproduce:
1. Deploy a cluster with 3 master and 3 worker nodes using the Assisted Installer cloud service, including the ODF operator (ensure one extra 100G disk is attached to each worker node)
2. Detach the extra Ceph disk from one worker node (i.e. vdb), format it, and reattach it as vdc (see the sketch after this list)
3. Apply the node recovery CR
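Since the vdb/vdc device names suggest virtio disks, here is a minimal sketch of step 2 for a libvirt-backed test environment; the domain name and image path are hypothetical and the whole block is an assumption about the setup, not part of the original report:

# Detach the existing Ceph disk (vdb) from the worker VM
virsh detach-disk worker-0 vdb --persistent

# Create a fresh (blank) 100G image and attach it as vdc
qemu-img create -f qcow2 /var/lib/libvirt/images/worker-0-ceph-new.qcow2 100G
virsh attach-disk worker-0 /var/lib/libvirt/images/worker-0-ceph-new.qcow2 vdc \
  --driver qemu --subdriver qcow2 --persistent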
Actual results:
Node recovery is stuck at the "ForceDeleteRookCephOSDPods" step.
Expected results:
Node recovery performs the Ceph recovery steps using the newly attached disk.