Data Foundation Bugs / DFBUGS-2363

[Tracker][HCI only] RDR: After CephFS app enrollment for RDR, all three compute nodes on secondary site ag5 go into NotReady state one by one. Before app enrollment, all nodes were in Ready state (node reboot)

    • https://issues.redhat.com/browse/RHEL-92887
    • 4.19.1-1.konflux
    • Committed
    • Release Note Not Required
    • Critical
    • Approved

       

      Description of problem: RDR: After CephFS app enrollment for RDR, all three compute nodes on the secondary site ag5 go into NotReady state one by one. Before app enrollment, all nodes were in Ready state.

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      6-node OCP 4.18 HCI cluster

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Provider RDR

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

      OCP 4.18
      ODF 4.18.2
      ACM 2.13
      Seen on racks: rackm03, rackm14, ag5
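      A minimal sketch of how these component versions are typically read back from the clusters (hedged; standard OpenShift/OLM queries, not necessarily the exact commands used here):

      # Hedged sketch: standard queries for the component versions listed above.
      oc get clusterversion                      # OCP version
      oc get csv -n openshift-storage            # ODF operator CSVs and versions
      oc get subscription -n openshift-storage   # subscribed channels/versions
      oc get multiclusterhub -A                  # ACM hub operator (hub cluster); version is in its status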
       

      Does this issue impact your ability to continue to work with the product?

      Yes

       

      Is there any workaround available to the best of your knowledge?

      No

       

      Can this issue be reproduced? If so, please provide the hit rate

      Yes

       

      Can this issue be reproduced from the UI?

      Yes

      If this is a regression, please provide more details to justify this:

       

      Steps to Reproduce:

      1. Deployed apps on the ag4 cluster (cephfs-tarun-1) via GitOps, using the CephFS storage class.

      2. Enrolled the above app for DR (see the sketch after this list).

      3. After the CephFS app enrollment for RDR, a node on the secondary site ag5 went into NotReady state. Before app enrollment, all nodes were in Ready state.

      4. Initially only one compute node was in NotReady state; later another compute node also went into NotReady.

      The same was also observed today for RBD:
      1. Enrolled a discovered app on ag4.
      2. Failed over to site2 (ag5).
      3. Observed that on the primary site ag4, one node went into NotReady state.
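      For reference, a minimal sketch of how the DR enrollment can be checked and a failover triggered from the hub cluster CLI. This is hedged and assumes Ramen-style DRPlacementControl objects; the namespace and DRPC name below are hypothetical placeholders, not the exact objects from this report.

      # Verify the app is enrolled for DR: a DRPlacementControl (drpc) should exist.
      oc get drpc -n cephfs-tarun-1 -o wide

      # Trigger a failover of the DR-protected workload to the secondary cluster (ag5).
      # (Hypothetical DRPC name; spec.action/spec.failoverCluster per the Ramen DRPC API.)
      oc patch drpc cephfs-tarun-1-placement-drpc -n cephfs-tarun-1 --type merge \
        -p '{"spec":{"action":"Failover","failoverCluster":"ag5"}}'

      # Watch failover progress.
      oc get drpc -n cephfs-tarun-1 -o wide -w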

       
       
       

       

      NAME                                 STATUS     ROLES                  AGE   VERSION
      compute-1-ru5.rackag4.mydomain.com   Ready      worker                 40d   v1.31.5
      compute-1-ru6.rackag4.mydomain.com   Ready      worker                 40d   v1.31.5
      compute-1-ru7.rackag4.mydomain.com   NotReady   worker                 40d   v1.31.5
      control-1-ru2.rackag4.mydomain.com   Ready      control-plane,master   47d   v1.31.5
      control-1-ru3.rackag4.mydomain.com   Ready      control-plane,master   47d   v1.31.5
      control-1-ru4.rackag4.mydomain.com   Ready      control-plane,master   47d   v1.31.5
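      The timestamped snapshots below (secondary site ag5) were captured by listing the nodes in a loop. A minimal sketch of that kind of loop follows; the 5-second interval is an assumption inferred from the timestamps, not a confirmed detail.

      # Hedged sketch: periodically list nodes with a timestamp banner,
      # producing output like the snapshots below.
      while true; do
        echo "----- $(date) -----"
        oc get nodes
        sleep 5
      done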

       

      ----- Sat Apr 26 03:03:55 EDT 2025 -----
      NAME                                 STATUS     ROLES                  AGE   VERSION
      compute-1-ru5.rackag5.mydomain.com   Ready      worker                 17d   v1.31.6
      compute-1-ru6.rackag5.mydomain.com   NotReady   worker                 16d   v1.31.6
      compute-1-ru7.rackag5.mydomain.com   Ready      worker                 17d   v1.31.6
      control-1-ru2.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
      control-1-ru3.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
      control-1-ru4.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6

      ----- Sat Apr 26 03:04:00 EDT 2025 -----
      NAME                                 STATUS     ROLES                  AGE   VERSION
      compute-1-ru5.rackag5.mydomain.com   NotReady   worker                 17d   v1.31.6
      compute-1-ru6.rackag5.mydomain.com   NotReady   worker                 16d   v1.31.6
      compute-1-ru7.rackag5.mydomain.com   Ready      worker                 17d   v1.31.6
      control-1-ru2.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
      control-1-ru3.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6
      control-1-ru4.rackag5.mydomain.com   Ready      control-plane,master   17d   v1.31.6

       

      Later, node compute-1-ru7.rackag5.mydomain.com was also observed going into NotReady state:

      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
      compute-1-ru7.rackag5.mydomain.com NotReady worker 17d v1.31.6
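      To see why a node is reported NotReady (typically the kubelet has stopped posting status), its conditions can be inspected. A hedged sketch, using a node name from this report:

      # Node conditions as reported by the API server.
      oc describe node compute-1-ru7.rackag5.mydomain.com | grep -A 10 'Conditions:'

      # The same conditions in machine-readable form (type / status / reason).
      oc get node compute-1-ru7.rackag5.mydomain.com \
        -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'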

      The exact date and time when the issue was observed, including timezone details:

       

      Actual results:
      Nodes get rebooted, one after another, once a DR-protected CephFS workload has been running for some time.
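      Since the symptom is an unexpected node reboot, the reboot can be confirmed from the node's own journal once it is reachable again. A hedged sketch (debug-pod access and the node name are assumptions):

      # List recorded boots; an unexpected extra boot confirms the reboot.
      oc debug node/compute-1-ru7.rackag5.mydomain.com -- chroot /host journalctl --list-boots

      # Errors from the previous boot, to look for what triggered it.
      oc debug node/compute-1-ru7.rackag5.mydomain.com -- chroot /host journalctl -b -1 -p err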
       

       

      Expected results:
      Nodes should not get rebooted.
       

      Logs collected and log location:
      rackm03- http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/amrita/bz/racks/rackm03/30-04-2025_15-07-58/30-04-2025_15-07-58-cl/
      rackm14- http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/amrita/bz/racks/rackm14/30-04-2025_15-08-19/30-04-2025_15-08-19-api-rackm14-mydomain-com/

       

      Additional info:
      Slack discussion

       
       
