- Bug
- Resolution: Done
- Blocker
Description of the problem:
I simulated replacing a Ceph disk and applied the node recovery CR; the recovery got stuck at "ForceDeleteRookCephOSDPods".
The node recovery controller logs show a service account authorization issue continually scrolling:
oc logs -f odf-node-recovery-operator-controller-manager-5dbc86974-v24k6
I0314 15:24:38.482404 1 leaderelection.go:254] attempting to acquire leader lease openshift-operators/9ef0f34a.openshift.io...
I0314 15:24:54.751741 1 leaderelection.go:268] successfully acquired lease openshift-operators/9ef0f34a.openshift.io
W0314 17:40:58.738069 1 reflector.go:561] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: failed to list *v1.Job: jobs.batch is forbidden: User "system:serviceaccount:openshift-operators:odf-node-recovery-operator-controller-manager" cannot list resource "jobs" in API group "batch" at the cluster scope
E0314 17:40:58.738281 1 reflector.go:158] "Unhandled Error" err="sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: Failed to watch *v1.Job: failed to list *v1.Job: jobs.batch is forbidden: User \"system:serviceaccount:openshift-operators:odf-node-recovery-operator-controller-manager\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope" logger="UnhandledError"
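The message points at a missing cluster-scoped list/watch permission on jobs.batch for the operator's service account. This can be confirmed with impersonation and, until the operator bundle ships the correct RBAC, worked around by granting the permission manually; a minimal sketch (the ClusterRole/ClusterRoleBinding name "odf-node-recovery-jobs-workaround" is made up for illustration, the service account and namespace are taken from the log above):

oc auth can-i list jobs.batch --all-namespaces \
  --as=system:serviceaccount:openshift-operators:odf-node-recovery-operator-controller-manager

cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: odf-node-recovery-jobs-workaround   # hypothetical name
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: odf-node-recovery-jobs-workaround   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: odf-node-recovery-jobs-workaround
subjects:
- kind: ServiceAccount
  # service account and namespace as reported in the forbidden error
  name: odf-node-recovery-operator-controller-manager
  namespace: openshift-operators
EOF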
Node recovery status:

oc get noderecovery
chadworklaptop: Fri Mar 14 13:50:26 2025

NAME           CREATED AT             COMPLETED AT   PHASE     STATE
noderecovery   2025-03-14T17:40:33Z                  Running   ForceDeleteRookCephOSDPods
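For more detail than the one-line listing, the CR's full status and conditions can be dumped (resource type and object name taken from the output above):

oc get noderecovery noderecovery -o yaml
oc describe noderecovery noderecovery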
Ceph status (from the ceph tools pod):

  cluster:
    id:     bc65606e-c89e-491f-b7a3-1682a73a99f0
    health: HEALTH_WARN
            1 osds down
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
            1 host (1 osds) down
            Degraded data redundancy: 549/1647 objects degraded (33.333%), 82 pgs degraded, 169 pgs undersized
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 20h)
    mgr: a(active, since 20h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 2 up (since 12m), 3 in (since 20h)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 549 objects, 398 MiB
    usage:   1.9 GiB used, 298 GiB / 300 GiB avail
    pgs:     549/1647 objects degraded (33.333%)
             87 active+undersized
             82 active+undersized+degraded

  io:
    client: 3.4 KiB/s rd, 13 KiB/s wr, 5 op/s rd, 1 op/s wr
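For reference, a Ceph status like the above can be pulled from the ODF toolbox pod; a sketch assuming the usual rook-ceph-tools deployment in openshift-storage (adjust the namespace/deployment if the toolbox runs elsewhere):

oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd tree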
How reproducible:
100%
Steps to reproduce:
1. Deploy a cluster with 3 master nodes and 3 worker nodes using the Assisted Installer cloud service, including the ODF operator (ensure one extra 100G disk is attached to each worker node)
2. On one worker, detach the extra Ceph disk (i.e. vdb), format/replace it, and reattach it as vdc (see the libvirt sketch after this list)
3. Apply the node recovery CR
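A rough libvirt sketch of step 2, assuming the workers are KVM guests (the domain name and image path are placeholders; attaching a fresh blank image is one way to simulate a formatted replacement disk):

virsh detach-disk <worker-domain> vdb --persistent
qemu-img create -f qcow2 /var/lib/libvirt/images/<worker>-osd-new.qcow2 100G
virsh attach-disk <worker-domain> /var/lib/libvirt/images/<worker>-osd-new.qcow2 vdc --persistent --subdriver qcow2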
Actual results:
Node recovery is stuck at "ForceDeleteRookCephOSDPods".
Expected results:
Node recovery performs the Ceph recovery steps using the newly attached disk.