Red Hat Advanced Cluster Management / ACM-20458

Nodes in Destination Cluster Transition to NotReady During Submariner-Driven Sync Operations


    • Sprint: Submariner Sprint 2025-38

      Whenever we initiate a sync from one cluster to another, some nodes in the destination cluster transition into a `NotReady` status. Once the sync operation completes, the cluster stabilizes. Over the past few days, we have investigated across the entire stack without identifying a clear root cause. However, a consistent pattern we observe is that when the sync starts, several pods lose connectivity to the apiserver, and some eventually restart. Once the sync completes, the environment stabilizes again.
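
      For anyone reproducing this, the node transitions and pod restarts can be watched with standard oc commands along the lines of the sketch below; the submariner-operator namespace is an assumption about where the Submariner pods run in this deployment.

      # Watch node readiness transitions on the destination cluster while the sync runs
      oc get nodes -w

      # List NodeNotReady events across the cluster, newest last
      oc get events -A --field-selector reason=NodeNotReady --sort-by=.lastTimestamp

      # Check restart counts of the Submariner pods after the sync window
      # (assumes the Submariner addon runs in the submariner-operator namespace)
      oc get pods -n submariner-operator -o wide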

      As a comparison, we ran an experiment where we deployed an equivalent number of pods on both clusters without using Submariner. In this case, the sync between the clusters proceeded without any issues for several hours. In contrast, when using Submariner, the problem typically appears within a few minutes.

      It's important to note that the issue consistently affects the destination cluster where the ServiceExport is created.
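
      For reference, the ServiceExport in question is the standard Submariner (MCS API) object; a minimal sketch of how it is created is below, with placeholder service name and namespace rather than the actual workload.

      # Either via subctl (service name and namespace are placeholders) ...
      subctl export service my-synced-service --namespace my-app-namespace

      # ... or by applying the equivalent MCS-API object directly:
      cat <<EOF | oc apply -f -
      apiVersion: multicluster.x-k8s.io/v1alpha1
      kind: ServiceExport
      metadata:
        name: my-synced-service        # placeholder: name of the exported Service
        namespace: my-app-namespace    # placeholder: namespace of the synced workload
      EOF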

      The load on the nodes is insignificant during these events. Here is a snapshot taken just before the issue starts:

      oc adm top nodes
      NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
      compute-1-ru5.rackm03.mydomain.com   6425m        10%    31887Mi         12%       
      compute-1-ru6.rackm03.mydomain.com   2963m        4%     19110Mi         7%        
      compute-1-ru7.rackm03.mydomain.com   1972m        3%     19600Mi         7%        
      control-1-ru2.rackm03.mydomain.com   3634m        5%     22197Mi         8%        
      control-1-ru3.rackm03.mydomain.com   3233m        5%     32523Mi         12%       
      control-1-ru4.rackm03.mydomain.com   5805m        9%     24199Mi         9%    
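
      While the nodes are flapping, the Submariner dataplane state can be captured at the same time; a quick sketch using standard subctl subcommands, run against the destination cluster's kubeconfig:

      # Gateway-to-gateway tunnel status on the destination cluster
      subctl show connections

      # Broader CNI, firewall, and connectivity checks
      subctl diagnose all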
      

      You can find the subctl gather logs from the destination cluster here: https://drive.google.com/drive/folders/1MdZIVc2fR4hlzkCjbUi4K48eBTOO_x2a?usp=sharing
      These logs were collected during or right after the observed issues.

      We also ran subctl verify; the log output is here: https://drive.google.com/file/d/1ROmeLCks3Xbj-4CX8tlynUVkcpBdQxe5/view?usp=sharing
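
      For anyone reproducing the data collection, the commands are roughly of the following form; the kubeconfig path and context names are placeholders, and the exact flags may differ between subctl versions.

      # Collect diagnostics from the destination cluster
      subctl gather --kubeconfig /path/to/destination-kubeconfig

      # Cross-cluster verification between the two clusters
      subctl verify --context cluster1 --tocontext cluster2 --only connectivity,service-discovery --verbose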

      Attachments:
        1. compute-1-ru5-rackm03.logs (515 kB, Pratik Surve)
        2. compute-1-ru6-rackm1.logs (516 kB, Pratik Surve)
        3. compute-1-ru6-rackm1-1.logs (516 kB, Pratik Surve)
        4. compute-1-ru5-rackm03-1.logs (515 kB, Pratik Surve)
        5. compute-1-ru6-rackm03.logs (585 kB, Pratik Surve)
        6. compute-1-ru7-rackm14.logs (847 kB, Pratik Surve)
        7. compute-1-ru6-rackm03-forceUDPEncaps-true.logs (1.31 MB, Pratik Surve)
        8. compute-1-ru7-rackm14-forceUDPEncaps-true.logs (960 kB, Pratik Surve)
        9. compute-1-ru5-rackm14-forceUDPEncaps-true-post-reboot.logs (57 kB, Pratik Surve)
        10. compute-1-ru7-rackm03-forceUDPEncaps-true-post-reboot.logs (670 kB, Pratik Surve)

              Yossi Boaron (yboaron)
              Benamar Mekhissi (bmekhiss)
              Prachi Yadav