-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.16, 4.17
-
Important
-
None
-
Rejected
-
False
-
Description of problem:
StorageClient fails to connect to a NodePort service on the bare-metal node that hosts its own hosted node in an HCP environment. The issue is intermittent: the connection succeeds only when the StorageClient pod runs on a different hosted node or connects to a different NodePort IP.

Root cause from the DF side: the issue is with HCP networking, and we now have a consistent reproducer for this bug. When the reporter/client-op runs on a VM (i.e., a hosted node, which in turn runs on a BM node) and the StorageClient CR uses the node IP of that same BM node, the network connection is not established even though ping works.

The Ceph client on the hosted cluster also connects to the Ceph server on the management cluster over the NodePort range. Because it is fault tolerant, the Ceph client tries another node, unlike the gRPC client.

For example:

node     ip
-----    ---
bm-n1    ip1
bm-n2    ip2
bm-n3    ip3

If ht-n1 (hosted node) is running on bm-n1:
- reporter/client-op is scheduled on ht-n1 and the endpoint in the StorageClient CR is ip1:31659 (NodePort of bm-n1): doesn't work
- reporter/client-op is scheduled on ht-n1 and the endpoint in the StorageClient CR is anything except ip1:31659: works
- reporter/client-op is scheduled on anything except ht-n1 and the endpoint in the StorageClient CR is ip1:31659 (NodePort of bm-n1): works
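The failing combination described above can be written as a small predicate. This is only an illustrative sketch of the reported behavior, not code from any operator: the `HOSTED_ON` and `BM_IP` maps and the `is_affected` function are hypothetical names, and `ip1`, `ip2`, `ip3` are the placeholders from the example rather than real addresses.

```python
# Hypothetical placement map: hosted node (VM) -> bare-metal node it runs on.
HOSTED_ON = {"ht-n1": "bm-n1"}

# Bare-metal node -> node IP (placeholder values from the example).
BM_IP = {"bm-n1": "ip1", "bm-n2": "ip2", "bm-n3": "ip3"}


def is_affected(client_hosted_node: str, endpoint: str) -> bool:
    """Return True when the endpoint IP belongs to the bare-metal node
    that hosts the VM the client pod is scheduled on."""
    endpoint_ip = endpoint.rsplit(":", 1)[0]  # strip the NodePort
    bm_node = HOSTED_ON.get(client_hosted_node)
    return bm_node is not None and BM_IP.get(bm_node) == endpoint_ip


if __name__ == "__main__":
    for node, ep in [("ht-n1", "ip1:31659"),
                     ("ht-n1", "ip2:31659"),
                     ("ht-n2", "ip1:31659")]:
        print(node, ep, "affected" if is_affected(node, ep) else "ok")
```

Only the first combination (client on ht-n1, endpoint on bm-n1) matches the reported failure; the other two correspond to the working cases.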
Version-Release number of selected component (if applicable):
4.16, 4.17 (across all available versions)
How reproducible:
see description
Steps to Reproduce:
1. 2. 3.
Actual results:
- The StorageClient fails to connect to the management cluster (Storage Provider) when the StorageClient pod is scheduled on the same bare-metal node as the hosted cluster's virtual machine (VM).
- During node operations, such as an OCP upgrade, the StorageClient loses communication if the pod is relocated to a hosted node on the same bare-metal node as the management cluster's NodePort service, triggering connection failures.
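Since ping succeeds while the client connection does not, a plain TCP connect test from inside the affected pod may help separate ICMP reachability from NodePort reachability. The following is a minimal sketch using only the Python standard library; `tcp_probe` is a hypothetical helper, and the host and port passed to it are whatever endpoint is set in the StorageClient CR.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established
    within the timeout, False on refusal, timeout, or any socket error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_probe("<bm-n1 node IP>", 31659)` run from a pod on ht-n1 would be expected to return False in the failing case, and True when run from a pod on a different hosted node, assuming the behavior described in this report.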
Expected results:
no communication issues
Additional info:
Latest Slack conversations:
- https://ibm-systems-storage.slack.com/archives/C06EPQRBM36/p1726047883407299
- https://redhat.enterprise.slack.com/archives/C019X3PEF2B/p1726051047584979?thread_ts=1726051047.584979&cid=C019X3PEF2B
- https://redhat.enterprise.slack.com/archives/C02UVQRJG83/p1726055560238719?thread_ts=1726055560.238719&cid=C02UVQRJG83

Related bugs:
- Client heartbeat missing on provider after upgrading to 4.17 - https://bugzilla.redhat.com/show_bug.cgi?id=2311357
- [Provider mode] StorageClient connection fails. Failed to create a new provider client: failed to dial. - https://bugzilla.redhat.com/show_bug.cgi?id=2281536

HyperShift dump - https://drive.google.com/file/d/1NCFB2f2kOifgOiNsOFmkaJJysuSwtxG_/view?usp=sharing
OCP mg - https://drive.google.com/file/d/1oy8Jy_v849UPzm6L6Z6jjKwRTwtuG7JH/view?usp=sharing
OCS mg - https://drive.google.com/file/d/1kc-eIsSfi5QF8yY2RsDcpGRp5lh4NgQu/view?usp=sharing