FLPATH-2173: [odf node recovery] noderecovery stuck at "ForceDeleteRookCephOSDPods"


      Description of the problem:

      I simulated replacing a Ceph disk, applied the node recovery CR, and it got stuck at "ForceDeleteRookCephOSDPods".

      The node recovery controller logs show a service account authorization issue continually scrolling:

       

      oc logs -f odf-node-recovery-operator-controller-manager-5dbc86974-v24k6
      I0314 15:24:38.482404       1 leaderelection.go:254] attempting to acquire leader lease openshift-operators/9ef0f34a.openshift.io...
      I0314 15:24:54.751741       1 leaderelection.go:268] successfully acquired lease openshift-operators/9ef0f34a.openshift.io
      W0314 17:40:58.738069       1 reflector.go:561] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: failed to list *v1.Job: jobs.batch is forbidden: User "system:serviceaccount:openshift-operators:odf-node-recovery-operator-controller-manager" cannot list resource "jobs" in API group "batch" at the cluster scope
      E0314 17:40:58.738281       1 reflector.go:158] "Unhandled Error" err="sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: Failed to watch *v1.Job: failed to list *v1.Job: jobs.batch is forbidden: User \"system:serviceaccount:openshift-operators:odf-node-recovery-operator-controller-manager\" cannot list resource \"jobs\" in API group \"batch\" at the cluster scope" logger="UnhandledError"
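
      The missing permission can be confirmed with impersonation (a standard check, not specific to this operator):

      oc auth can-i list jobs.batch \
        --as=system:serviceaccount:openshift-operators:odf-node-recovery-operator-controller-manager \
        --all-namespaces
      # expected to return "no", matching the forbidden error above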
       

       

       

      node recovery status:

       

       oc get noderecovery
       NAME           CREATED AT             COMPLETED AT   PHASE     STATE
       noderecovery   2025-03-14T17:40:33Z                  Running   ForceDeleteRookCephOSDPods
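
      The CR's status conditions and the OSD pods the operator is acting on can be inspected with standard commands (assuming the default openshift-storage namespace and stock rook-ceph labels):

      # conditions/events on the stuck CR
      oc describe noderecovery noderecovery
      # OSD pods; the pod for the failed OSD is what the ForceDeleteRookCephOSDPods step is expected to remove
      oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide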
       

      Ceph status (from the ceph tools pod):

       

       

        cluster:
          id:     bc65606e-c89e-491f-b7a3-1682a73a99f0
          health: HEALTH_WARN
                  1 osds down
                  1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
                  1 host (1 osds) down
                  Degraded data redundancy: 549/1647 objects degraded (33.333%), 82 pgs degraded, 169 pgs undersized
                  1 daemons have recently crashed

        services:
          mon: 3 daemons, quorum a,b,c (age 20h)
          mgr: a(active, since 20h)
          mds: 1/1 daemons up, 1 hot standby
          osd: 3 osds: 2 up (since 12m), 3 in (since 20h)
          rgw: 1 daemon active (1 hosts, 1 zones)

        data:
          volumes: 1/1 healthy
          pools:   12 pools, 169 pgs
          objects: 549 objects, 398 MiB
          usage:   1.9 GiB used, 298 GiB / 300 GiB avail
          pgs:     549/1647 objects degraded (33.333%)
                   87 active+undersized
                   82 active+undersized+degraded

        io:
          client:   3.4 KiB/s rd, 13 KiB/s wr, 5 op/s rd, 1 op/s wr
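
      From the same tools pod, ceph osd tree identifies which OSD is down (assuming the toolbox deployment is named rook-ceph-tools, the ODF default when the toolbox is enabled):

      oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd tree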
       

       

       

      How reproducible:

      100%

      Steps to reproduce:

      1. Deploy a cluster with 3 master and 3 worker nodes using the Assisted Installer (cloud), including the ODF operator (ensure 1 extra 100G disk is attached to each worker node)

      2. Detach the extra Ceph disk from one worker (i.e. vdb), then format it and reattach it as vdc (see the virsh sketch below)

      3. Apply the node recovery CR (see the discovery sketch below)
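
      For step 2, a minimal sketch of the disk swap, assuming the workers are libvirt/KVM VMs (the domain name worker-0 and the image path are placeholders, not taken from this environment):

      # detach the current ceph disk (vdb) from the worker VM
      virsh detach-disk worker-0 vdb --persistent
      # create a fresh 100G image and attach it as vdc
      qemu-img create -f qcow2 /var/lib/libvirt/images/worker-0-ceph-new.qcow2 100G
      virsh attach-disk worker-0 /var/lib/libvirt/images/worker-0-ceph-new.qcow2 vdc \
        --driver qemu --subdriver qcow2 --persistent

      For step 3, the CR contents are operator-specific and not reproduced here; the group/version and schema served by the installed operator can be discovered before authoring it:

      oc api-resources | grep -i noderecovery
      oc explain noderecovery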

       

      Actual results:

      Node recovery is stuck at "ForceDeleteRookCephOSDPods"

      Expected results:

      Node recovery performs the Ceph recovery steps using the newly attached disk
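
      Additional info: the forbidden error in the controller log suggests the operator's bundled RBAC is missing cluster-scope read access to Jobs. As an untested workaround sketch (the ClusterRole name below is arbitrary, and the operator may need more than this), granting that access manually should let the informer recover on its next retry:

      # arbitrary name; grants cluster-scope get/list/watch on jobs.batch to the controller service account
      oc create clusterrole odf-node-recovery-jobs-read --verb=get,list,watch --resource=jobs.batch
      oc create clusterrolebinding odf-node-recovery-jobs-read \
        --clusterrole=odf-node-recovery-jobs-read \
        --serviceaccount=openshift-operators:odf-node-recovery-operator-controller-manager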

              Assignee: Jordi Gil (jgil@redhat.com)
              Reporter: Chad Crum (chadcrum)