OCPBUGS-41844

Inconsistent Network routing. StorageClient Fails to connect to NodePort on Hosted Node


    • Important
    • None
    • Rejected
    • False

      Description of problem:

      StorageClient fails to connect to a NodePort service on the hosted node's corresponding bare-metal node in an HCP environment. The issue is intermittent, and the connection only succeeds when the StorageClient pod runs on a different hosted node or connects to a different NodePort IP.
      
      Root cause from DF side:
      
      The issue lies in HCP networking, and we now have a consistent reproducer. The reporter/client-op pod runs on a VM (a hosted node), which in turn runs on a bare-metal (BM) node. When the StorageClient CR uses the node IP of that same BM node as its endpoint, the network connection is not established, even though ping to that IP works.
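
Because ping succeeds while the connection does not, a TCP-level probe is a more telling check than ICMP. The following is a minimal sketch of such a probe; the function name `tcp_reachable` and the default timeout are illustrative assumptions, not part of any OpenShift or ODF tooling.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within timeout.

    Hypothetical helper for illustration only. A successful ping does not
    imply this returns True, which is the symptom described in this bug.
    """
    try:
        # create_connection performs the full TCP three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable routes.
        return False


# Example call from a pod on the hosted node (placeholder address and port
# taken from the example in this ticket):
#   tcp_reachable("ip1", 31659)
```

Running the probe from a pod on ht-n1 and again from a pod on another hosted node, against the same endpoint, should show the asymmetry described here.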
      
      The Ceph client on the hosted cluster connects to the Ceph server on the management cluster over the NodePort range. Because the Ceph client is fault tolerant, it tries another node when one is unreachable, unlike the gRPC client.
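
The behavioural difference can be sketched as follows. This is not how the Ceph client or the gRPC client is actually implemented; it only illustrates, under that assumption, why a client that walks a list of endpoints hides the problem while a client pinned to a single endpoint does not. The name `first_reachable` is hypothetical.

```python
import socket
from typing import Iterable, Optional, Tuple

Endpoint = Tuple[str, int]


def first_reachable(endpoints: Iterable[Endpoint],
                    timeout: float = 3.0) -> Optional[Endpoint]:
    """Return the first endpoint that accepts a TCP connection, else None.

    Illustrative only: a fault-tolerant client behaves roughly like a call
    with several endpoints, a single-endpoint client like a call with one.
    """
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            # This endpoint is unreachable; fall through to the next one.
            continue
    return None
```

With the single endpoint ip1:31659 from a pod on ht-n1, this sketch would return None, which matches the failure seen with the StorageClient.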
      
      For example
      
      node  ip
      ----  ---
      bm-n1 ip1
      bm-n2 ip2
      bm-n3 ip3
      
      if ht-n1 (hosted node) running on bm-n1
      ---
      and
       reporter/client-op is scheduled to run on ht-n1
      and
       if endpoint in storageclient cr is ip1:31659 (nodeport of bm-n1) <---- doesn't work
      but
       if endpoint in storageclient cr is anything except ip1:31659 <-------- works
      ---
      or
       reporter/client-op is scheduled on anything except ht-n1
      with
       endpoint in storageclient cr is ip1:31659 (nodeport of bm-n1) <----------- works
      ---
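
The observed pattern above reduces to one rule: the connection fails only when the endpoint IP belongs to the BM node that hosts the pod's VM. A small sketch of that rule, using the placeholder names from the table; the function `observed_to_connect` is hypothetical, and it assumes the caller already knows which BM node hosts the pod's hosted node.

```python
# Placeholder node-to-IP table copied from the example above.
NODE_IP = {"bm-n1": "ip1", "bm-n2": "ip2", "bm-n3": "ip3"}


def observed_to_connect(pod_bm_node: str, endpoint_ip: str) -> bool:
    """Model of the behaviour reported in this ticket (not a diagnosis).

    pod_bm_node: BM node hosting the VM on which reporter/client-op runs.
    endpoint_ip: IP part of the endpoint in the StorageClient CR.
    """
    # Fails only when the pod's own BM host is the NodePort target.
    return NODE_IP[pod_bm_node] != endpoint_ip


print(observed_to_connect("bm-n1", "ip1"))  # pod on ht-n1, endpoint ip1:31659
print(observed_to_connect("bm-n1", "ip2"))  # pod on ht-n1, another endpoint
print(observed_to_connect("bm-n2", "ip1"))  # pod hosted elsewhere, endpoint ip1:31659
```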

      Version-Release number of selected component (if applicable):

          4.16, 4.17 (across all available versions)

      How reproducible:

          see description

      Steps to Reproduce:

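          1. Schedule the reporter/client-op pod on a hosted node (for example ht-n1) that runs on bare-metal node bm-n1.
          2. Set the endpoint in the StorageClient CR to the NodePort address of that same bare-metal node (ip1:31659).
          3. Check whether the StorageClient connects to the provider.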
          

      Actual results:

      - The StorageClient fails to connect to the management cluster (Storage Provider) when the StorageClient pod is scheduled on a hosted node (VM) that runs on the same bare-metal node whose NodePort address is used as the endpoint.
      - During node operations, such as an OCP upgrade, the StorageClient loses communication if the pod is relocated to a hosted node on the same bare-metal node as the management cluster's NodePort service, triggering connection failures.

      Expected results:

          No communication issues.

      Additional info:

          Latest Slack conversations:
      
      https://ibm-systems-storage.slack.com/archives/C06EPQRBM36/p1726047883407299
      https://redhat.enterprise.slack.com/archives/C019X3PEF2B/p1726051047584979?thread_ts=1726051047.584979&cid=C019X3PEF2B
      https://redhat.enterprise.slack.com/archives/C02UVQRJG83/p1726055560238719?thread_ts=1726055560.238719&cid=C02UVQRJG83
      
      Related bugs:
      Client heartbeat missing on provider after upgrading to 4.17 - https://bugzilla.redhat.com/show_bug.cgi?id=2311357
      [Provider mode] StorageClient connection fails. Failed to create a new provider client: failed to dial. - https://bugzilla.redhat.com/show_bug.cgi?id=2281536
      
      HyperShift dump - https://drive.google.com/file/d/1NCFB2f2kOifgOiNsOFmkaJJysuSwtxG_/view?usp=sharing
      
      OCP mg - https://drive.google.com/file/d/1oy8Jy_v849UPzm6L6Z6jjKwRTwtuG7JH/view?usp=sharing
      
      OCS mg - https://drive.google.com/file/d/1kc-eIsSfi5QF8yY2RsDcpGRp5lh4NgQu/view?usp=sharing

              sdn-team-bot sdn-team bot
              rh-ee-dosypenk Daniel Osypenko
              Jie Zhao Jie Zhao