FlightPath / FLPATH-2188

[odf node recovery] Node recovery stuck at "WaitForOSDPodsStabilize" after node replacement



      Description of the problem:
      On an assisted installer cluster, I replaced one of the OSD worker nodes (reusing the same hostname) and created the ODF node recovery resource after the new node joined the cluster. The rook-ceph OSD pod goes into the expected CrashLoopBackOff, but node recovery stays stuck at WaitForOSDPodsStabilize.
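      The pod and node state can be checked with standard oc commands (a minimal sketch; openshift-storage is assumed as the storage namespace, and app=rook-ceph-osd is the label rook applies to OSD pods):

      # Confirm the replacement worker has rejoined and is Ready
      oc get nodes

      # Watch the OSD pods in the storage namespace
      oc get pods -n openshift-storage -l app=rook-ceph-osd -w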

       

      Node recovery status:

       

      Status:
        Conditions:
          Last Probe Time:       2025-03-20T20:55:27Z
          Last Transition Time:  2025-03-20T20:55:27Z
          Status:                False
          Type:                  EnableCephToolsPod
          Last Probe Time:       2025-03-20T20:55:37Z
          Last Transition Time:  2025-03-20T20:55:27Z
          Status:                False
          Type:                  WaitForCephToolsPodRunning
          Last Probe Time:       2025-03-20T20:59:28Z
          Last Transition Time:  2025-03-20T20:55:37Z
          Message:               OSD pods still in initializing status: pod rook-ceph-osd-1-7dfdb846f5-78xlg: container expand-bluefs waiting in PodInitializing:
      pod rook-ceph-osd-1-7dfdb846f5-78xlg: container chown-container-data-dir waiting in PodInitializing:
      pod rook-ceph-osd-1-7dfdb846f5-78xlg: container log-collector waiting in PodInitializing:
      pod rook-ceph-osd-1-7dfdb846f5-78xlg: container osd waiting in PodInitializing:
          Reason:    WaitingForPodsToInitialize
          Status:    True
          Type:      WaitForOSDPodsStabilize
        Phase:       Running
        Start Time:  2025-03-20T20:55:27Z
      Events:        <none>
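      The status above can be re-dumped at any time (sketch; "noderecovery" as a resource name is an assumption; check oc api-resources for the exact name registered by the operator):

      # Full recovery CR, including .status.conditions
      oc get noderecovery -o yaml

      # Or just the conditions in a readable form
      oc describe noderecovery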
       

      The OSD pod for the replaced node is in Init:CrashLoopBackOff:

      rook-ceph-osd-0-856fb46576-n6sjc                                  2/2     Running                 2               29h
      rook-ceph-osd-1-7dfdb846f5-78xlg                                  0/2     Init:CrashLoopBackOff   6 (2m30s ago)   26m
      rook-ceph-osd-2-d5dd6fd8-pddgl                                    2/2     Running                 2               29h
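      Which init container is crashing (and why) can be seen with the usual pod inspection (sketch; --previous assumes the container has already crashed at least once):

      # Init container states and recent events for the CLBO pod
      oc -n openshift-storage describe pod rook-ceph-osd-1-7dfdb846f5-78xlg

      # Logs from the last failed run of a given init container, e.g. expand-bluefs
      oc -n openshift-storage logs rook-ceph-osd-1-7dfdb846f5-78xlg -c expand-bluefs --previous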
      How reproducible: 100% so far

      Steps to reproduce:

      1. Deploy a 3-master, 3-worker assisted installer cluster with the ODF operator and one extra 100G disk on each worker

      2. Destroy one of the OSD workers

      3. Create a new OSD worker via day-2 worker addition in the assisted installer, reusing the same hostname

      4. Run node recovery once the worker joins the cluster (see the sketch below)
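      For steps 3-4, a rough sketch of the checks between adding the day-2 worker and creating the recovery resource (whether CSR approval is needed here is an assumption; the assisted installer flow may handle it automatically):

      # Day-2 nodes can surface pending CSRs that need approval
      oc get csr
      oc adm certificate approve <csr-name>

      # Wait for the replacement worker to report Ready, then create the recovery resource
      oc get nodes -w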

      Actual results:

      Node recovery is stuck at WaitForOSDPodsStabilize

       

      Expected results:

      Node recovery recovers the new node

       

