Red Hat Advanced Cluster Management / ACM-20458

Nodes in Destination Cluster Transition to NotReady During Submariner-Driven Sync Operations


    • Sprint: Submariner Sprint 2025-38

      Whenever we initiate a sync from one cluster to another, some nodes in the destination cluster transition into a `NotReady` status. Once the sync operation completes, the cluster stabilizes. Over the past few days, we have investigated across the entire stack without identifying a clear root cause. However, a consistent pattern we observe is that when the sync starts, several pods lose connectivity to the apiserver, and some eventually restart. Once the sync completes, the environment stabilizes again.
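
      For anyone reproducing this, the node transitions and pod restarts can be watched with standard oc commands along the lines of the sketch below; the submariner-operator namespace is an assumption about where the Submariner pods run in this deployment.

      # Watch node readiness transitions on the destination cluster while the sync runs
      oc get nodes -w

      # List NodeNotReady events across the cluster, newest last
      oc get events -A --field-selector reason=NodeNotReady --sort-by=.lastTimestamp

      # Check restart counts of the Submariner pods after the sync window
      # (assumes the Submariner addon runs in the submariner-operator namespace)
      oc get pods -n submariner-operator -o wide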

      As a comparison, we ran an experiment where we deployed an equivalent number of pods on both clusters without using Submariner. In this case, the sync between the clusters proceeded without any issues for several hours. In contrast, when using Submariner, the problem typically appears within a few minutes.

      It's important to note that the issue consistently affects the destination cluster where the ServiceExport is created.
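
      For reference, the ServiceExport in question is the standard Submariner (MCS API) object; a minimal sketch of how it is created is below, with placeholder service name and namespace rather than the actual workload.

      # Either via subctl (service name and namespace are placeholders) ...
      subctl export service my-synced-service --namespace my-app-namespace

      # ... or by applying the equivalent MCS-API object directly:
      cat <<EOF | oc apply -f -
      apiVersion: multicluster.x-k8s.io/v1alpha1
      kind: ServiceExport
      metadata:
        name: my-synced-service        # placeholder: name of the exported Service
        namespace: my-app-namespace    # placeholder: namespace of the synced workload
      EOF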

      The load on the nodes is insignificant during these events. Here is a snapshot taken just before the issue starts:

      oc adm top nodes
      NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
      compute-1-ru5.rackm03.mydomain.com   6425m        10%    31887Mi         12%       
      compute-1-ru6.rackm03.mydomain.com   2963m        4%     19110Mi         7%        
      compute-1-ru7.rackm03.mydomain.com   1972m        3%     19600Mi         7%        
      control-1-ru2.rackm03.mydomain.com   3634m        5%     22197Mi         8%        
      control-1-ru3.rackm03.mydomain.com   3233m        5%     32523Mi         12%       
      control-1-ru4.rackm03.mydomain.com   5805m        9%     24199Mi         9%    
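
      While the nodes are flapping, the Submariner dataplane state can be captured at the same time; a quick sketch using standard subctl subcommands, run against the destination cluster's kubeconfig:

      # Gateway-to-gateway tunnel status on the destination cluster
      subctl show connections

      # Broader CNI, firewall, and connectivity checks
      subctl diagnose all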
      

      You can find the subctl gather logs from the destination cluster here: https://drive.google.com/drive/folders/1MdZIVc2fR4hlzkCjbUi4K48eBTOO_x2a?usp=sharing
      These logs were collected during or right after the observed issues.

      We also ran subctl verify; the log output is here: https://drive.google.com/file/d/1ROmeLCks3Xbj-4CX8tlynUVkcpBdQxe5/view?usp=sharing
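
      For anyone reproducing the data collection, the commands are roughly of the following form; the kubeconfig path and context names are placeholders, and the exact flags may differ between subctl versions.

      # Collect diagnostics from the destination cluster
      subctl gather --kubeconfig /path/to/destination-kubeconfig

      # Cross-cluster verification between the two clusters
      subctl verify --context cluster1 --tocontext cluster2 --only connectivity,service-discovery --verbose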

      Attachments:
        1. compute-1-ru5-rackm03.logs (515 kB, Pratik Surve)
        2. compute-1-ru6-rackm1.logs (516 kB, Pratik Surve)
        3. compute-1-ru6-rackm1-1.logs (516 kB, Pratik Surve)
        4. compute-1-ru5-rackm03-1.logs (515 kB, Pratik Surve)
        5. compute-1-ru6-rackm03.logs (585 kB, Pratik Surve)
        6. compute-1-ru7-rackm14.logs (847 kB, Pratik Surve)
        7. compute-1-ru6-rackm03-forceUDPEncaps-true.logs (1.31 MB, Pratik Surve)
        8. compute-1-ru7-rackm14-forceUDPEncaps-true.logs (960 kB, Pratik Surve)
        9. compute-1-ru5-rackm14-forceUDPEncaps-true-post-reboot.logs (57 kB, Pratik Surve)
        10. compute-1-ru7-rackm03-forceUDPEncaps-true-post-reboot.logs (670 kB, Pratik Surve)

              Yossi Boaron (yboaron)
              Benamar Mekhissi (bmekhiss)
              Prachi Yadav