Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.16, 4.17
Component/s: Networking / ovn-kubernetes
Labels:
- SDN:OVNK:NetworkPolicy

Severity:
Important
Regression:
None
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

StorageClient fails to connect to a NodePort service on the hosted node's corresponding bare-metal node in an HCP environment. The issue is intermittent, and the connection only succeeds when the StorageClient pod runs on a different hosted node or connects to a different NodePort IP.

Root cause from DF side:

The issue is with HCP networking, and we now have a consistent reproducer. When the reporter/client-op runs on a VM (that is, a hosted node, which in turn runs on a BM node) and the StorageClient CR uses the node IP of that same BM node, the network connection is not established, even though ping works.
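
Because ping succeeds while the connection does not, a TCP-level check is more telling than ICMP. A minimal sketch of such a check, assuming Python 3 is available on the hosted node or in a debug pod; the function name `tcp_reachable` is illustrative only and is not part of any component named in this bug:

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within `timeout` seconds."""
    try:
        # create_connection performs the full TCP connect; ICMP (ping) is not involved.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable-network errors alike.
        return False
```

Calling `tcp_reachable(<node IP of the BM node>, <NodePort>)` from a pod on the affected hosted node would be expected to return `False` in the failing case, while the same call from a different hosted node would be expected to return `True`.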

The Ceph client on the hosted cluster connects to the Ceph server on the management cluster over the NodePort range. Because the Ceph client is fault tolerant, it tries another node, unlike the gRPC client.

For example

node  ip
----  ---
bm-n1 ip1
bm-n2 ip2
bm-n3 ip3

Case 1: ht-n1 (hosted node) runs on bm-n1 and reporter/client-op is scheduled on ht-n1
 endpoint in StorageClient CR is ip1:31659 (NodePort of bm-n1)   <---- doesn't work
 endpoint in StorageClient CR is anything except ip1:31659       <---- works

Case 2: reporter/client-op is scheduled on any hosted node except ht-n1
 endpoint in StorageClient CR is ip1:31659 (NodePort of bm-n1)   <---- works
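
The pattern above reduces to one rule: the connection fails when the endpoint IP belongs to the same BM node that hosts the client's own hosted node. A small sketch of that rule, for predicting which placements hit the bug. Here `ht-n2` and its placement on `bm-n2` are hypothetical additions for illustration, `hits_bug` is an illustrative name, and `ip1`..`ip3` are the placeholders from the table above:

```python
# BM node -> node IP, using the placeholders from the table above.
BM_NODE_IP = {"bm-n1": "ip1", "bm-n2": "ip2", "bm-n3": "ip3"}

# Hosted node (VM) -> BM node it runs on. ht-n2 is a hypothetical extra node.
HOSTED_ON = {"ht-n1": "bm-n1", "ht-n2": "bm-n2"}


def hits_bug(client_hosted_node: str, endpoint: str) -> bool:
    """True if the endpoint targets the BM node that hosts the client's own hosted node."""
    endpoint_ip = endpoint.rsplit(":", 1)[0]  # strip the ":<nodeport>" suffix
    return BM_NODE_IP[HOSTED_ON[client_hosted_node]] == endpoint_ip


print(hits_bug("ht-n1", "ip1:31659"))  # True  (doesn't work)
print(hits_bug("ht-n1", "ip2:31659"))  # False (works)
print(hits_bug("ht-n2", "ip1:31659"))  # False (works)
```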

Version-Release number of selected component (if applicable):

    4.16, 4.17 (across all available versions)

How reproducible:

    see description

Steps to Reproduce:

    1.
    2.
    3.

Actual results:

- The StorageClient fails to connect to the management cluster (Storage Provider) when the StorageClient pod is scheduled on the same bare-metal node as the hosted cluster's virtual machine (VM).
- During node operations, such as an OCP upgrade, the StorageClient loses communication if the pod is relocated to a hosted node on the same bare-metal node as the management cluster's NodePort service, triggering connection failures.

Expected results:

    no communication issues

Additional info:

    Latest Slack conversations:

https://ibm-systems-storage.slack.com/archives/C06EPQRBM36/p1726047883407299
https://redhat.enterprise.slack.com/archives/C019X3PEF2B/p1726051047584979?thread_ts=1726051047.584979&cid=C019X3PEF2B
https://redhat.enterprise.slack.com/archives/C02UVQRJG83/p1726055560238719?thread_ts=1726055560.238719&cid=C02UVQRJG83

Related bugs:
Client heartbeat missing on provider after upgrading to 4.17 - https://bugzilla.redhat.com/show_bug.cgi?id=2311357
[Provider mode] StorageClient connection fails. Failed to create a new provider client: failed to dial. - https://bugzilla.redhat.com/show_bug.cgi?id=2281536

HyperShift dump - https://drive.google.com/file/d/1NCFB2f2kOifgOiNsOFmkaJJysuSwtxG_/view?usp=sharing

OCP mg - https://drive.google.com/file/d/1oy8Jy_v849UPzm6L6Z6jjKwRTwtuG7JH/view?usp=sharing

OCS mg - https://drive.google.com/file/d/1kc-eIsSfi5QF8yY2RsDcpGRp5lh4NgQu/view?usp=sharing

Assignee: sdn-team bot

Reporter: Daniel Osypenko

QA Contact: Jie Zhao

Votes: 0

Watchers: 8

Created: 2024/09/12 8:12 AM

Updated: 2024/09/18 2:52 PM
