Type: Bug
Resolution: Not a Bug
Priority: Blocker
Version: ACM 2.13.2
Sprint: Submariner Sprint 2025-38
Whenever we initiate a sync from one cluster to another, some nodes in the destination cluster transition to `NotReady`. Over the past few days we have investigated across the entire stack without identifying a clear root cause. The one consistent pattern we observe is that when the sync starts, several pods lose connectivity to the apiserver and some eventually restart; once the sync completes, the environment stabilizes again.
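As a rough sketch of how the symptom can be observed from the destination cluster while a sync is in progress (standard `oc` commands, not taken from the attached logs):

```
# Watch node readiness flip to NotReady while the sync runs
oc get nodes -w

# Pods sorted by restart count (ascending; the last entries have restarted the most)
oc get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -n 20

# Recent cluster events showing the NodeNotReady transitions
oc get events -A --sort-by='.lastTimestamp' | grep -i notready
```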
As a comparison, we ran an experiment where we deployed an equivalent number of pods on both clusters without using Submariner. In this case, the sync between the clusters proceeded without any issues for several hours. In contrast, when using Submariner, the problem typically appears within a few minutes.
It's important to note that the issue consistently affects the destination cluster where the ServiceExport is created.
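For context, the exports are plain MCS `ServiceExport` objects created on that cluster. A minimal example is below; the service name and namespace are placeholders, not our actual workload:

```
# Placeholder name/namespace; substitute the exported service
cat <<EOF | oc apply -f -
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-sync-service
  namespace: sync-workload
EOF
```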
The load on the nodes is insignificant during these events. Here is a snapshot taken just before the issue starts:
```
oc adm top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
compute-1-ru5.rackm03.mydomain.com   6425m        10%    31887Mi         12%
compute-1-ru6.rackm03.mydomain.com   2963m        4%     19110Mi         7%
compute-1-ru7.rackm03.mydomain.com   1972m        3%     19600Mi         7%
control-1-ru2.rackm03.mydomain.com   3634m        5%     22197Mi         8%
control-1-ru3.rackm03.mydomain.com   3233m        5%     32523Mi         12%
control-1-ru4.rackm03.mydomain.com   5805m        9%     24199Mi         9%
```
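A simple way to capture the same snapshot repeatedly while the sync runs (the 30-second interval is arbitrary):

```
# Record node load every 30 seconds for later comparison
while true; do
  date
  oc adm top nodes
  sleep 30
done
```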
You can find `subctl gather` logs from the destination cluster here: https://drive.google.com/drive/folders/1MdZIVc2fR4hlzkCjbUi4K48eBTOO_x2a?usp=sharing
These logs were collected during or right after the observed issues.
We also ran `subctl verify`; the log output is here: https://drive.google.com/file/d/1ROmeLCks3Xbj-4CX8tlynUVkcpBdQxe5/view?usp=sharing
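A sketch of the `subctl` invocations that produce these artifacts; the kubeconfig path and context names below are placeholders, and the exact flags depend on the subctl version in use:

```
# Diagnostics from the destination cluster, run during or right after the incident
export KUBECONFIG=/path/to/destination-kubeconfig   # placeholder path
subctl gather

# End-to-end verification between the two clusters
subctl verify --context origin --tocontext destination --only connectivity,service-discovery
```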