  1. Red Hat Advanced Cluster Management
  2. ACM-20317

[RDR] Lighthouse EndpointSlice condition not marked ready on import cluster despite hub slice and the import cluster showing ready causing sync to stop for a single pvc


    • Quality / Stability / Reliability
    • Important

      Description of problem:

      On a Regional DR setup, 4 clusters named amagrawa-419-2/3/4/5 are imported to ACM. Clusters 2 and 3 are tied to clusterset myclusterset-1, clusters 4 and 5 are tied to myclusterset-2, and the clusters are connected via Submariner. After DR configuration, 2 RBD workloads (1 appset and 1 subscription) are deployed on cluster2 and 2 CephFS workloads (1 appset and 1 subscription) are deployed on cluster4, which makes those clusters the primary clusters for their workloads.

      Using DR, data is then replicated to its peer cluster, i.e. data from cluster2 syncs to cluster3 and data from cluster4 syncs to cluster5.

      After running IOs for 2 days without any DR operation, we see that data sync for the workload in namespace busybox-workloads-4 running on cluster amagrawa-419-4 is impacted for 1 of the 4 PVCs in that namespace.

      Please note that these clusters aren't heavily loaded with IOs and it's not an infrastructure issue.

      bmekhiss helped us bring this issue to the Submariner team, and it was discussed with rh-ee-vthapar and tpanteli here: https://redhat-internal.slack.com/archives/C0134E73VH6/p1745591703720299?thread_ts=1743492550.465549&cid=C0134E73VH6

      The issue didn't recover on its own, so after discussion with them and on their recommendation, `lighthouse-agent` was restarted, which fixed the issue. See the thread for more details.
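When comparing the slices on the hub, export and import clusters, a small script can make a stale readiness condition easier to spot. The following is only a minimal sketch, not part of the original investigation: it assumes the EndpointSlices have been dumped as a JSON list (for example with `kubectl get endpointslices -A -o json`), it relies only on the standard `discovery.k8s.io/v1` fields `endpoints[].conditions.ready` and `endpoints[].addresses`, and the function names `not_ready_endpoints` and `report` are hypothetical. It does not explain why a slice is not ready; it only lists the affected endpoints.

```python
import json


def not_ready_endpoints(endpoint_slices: dict) -> list:
    """Return (namespace, slice name, addresses) for every endpoint whose
    conditions.ready is explicitly false in an EndpointSlice list dump."""
    findings = []
    for item in endpoint_slices.get("items", []):
        meta = item.get("metadata", {})
        for endpoint in item.get("endpoints") or []:
            ready = (endpoint.get("conditions") or {}).get("ready")
            # A missing or null `ready` means "unknown", which consumers
            # normally interpret as ready, so only an explicit false counts.
            if ready is False:
                findings.append(
                    (
                        meta.get("namespace", ""),
                        meta.get("name", ""),
                        endpoint.get("addresses", []),
                    )
                )
    return findings


def report(json_text: str) -> list:
    """Parse a JSON dump and return one human-readable line per finding."""
    data = json.loads(json_text)
    return [
        f"{namespace}/{name}: not ready, addresses={addresses}"
        for namespace, name, addresses in not_ready_endpoints(data)
    ]
```

Running `report` on the dump from each cluster and comparing the results would show whether only the import cluster carries the not-ready condition, which is the symptom described above.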

       

      Logs from the import cluster: https://drive.google.com/drive/folders/1foqGvKHSTN7fbiZYL2dicsINPvAOebvP?usp=sharing
      Logs from the export cluster: https://drive.google.com/drive/folders/1qcX6w2UHVO5RMKmzdUxdUFVRa5-oFWTM?usp=sharing

      Version-Release number of selected component (if applicable):

      ODF 4.19.0-46.konflux

      OCP 4.19.0-0.nightly-2025-04-17-154552

      GitOps 1.16.0

      Submariner 0.20.0 GA'ed

      ACM 2.13.2 GA'ed

      How reproducible: We have hit this issue multiple times, and a few of these occurrences were discussed in the same Slack thread.

      Steps to Reproduce:

      1. On an RDR setup, import 4 clusters to ACM and configure 2 peer DR relationships between them using 2 clustersets and Submariner.
      2. Run IOs for a few days and monitor data sync.

      Actual results: The Lighthouse EndpointSlice condition is not marked ready on the import cluster, despite the hub slice and the import cluster showing ready, which causes sync to stop for a single PVC.

      Expected results: Sync should not be impacted for any PVC during continuous IOs on an RDR setup.

      Additional info:

              tpanteli Thomas Pantelis
              amagrawa@redhat.com Aman Agrawal
              Prachi Yadav Prachi Yadav