OpenShift Bugs / OCPBUGS-63360

Unable to run RDMA workloads


    • Quality / Stability / Reliability
    • Important
    • CNF Network Sprint 279

      AI-created summary of the problem from the customer case, based on the comments. The case number is 04258382.

       
      The case moved through three primary phases of configuration challenges:

          1. Initial Third-Party Operator Configuration (NVIDIA Network Operator)

      The first configuration blocker involved the necessary setup components provided by NVIDIA for the InfiniBand hardware.

      • *NVIDIA Network Operator:* The customer initially struggled with the configuration of the *NVIDIA Network Operator*.
      • *NIC Cluster Policy:* The primary symptom was that the *NIC Cluster Policy was in a not ready state*.
      • *OFED Component:* A specific internal component of the NVIDIA Network Operator, the *OFED component*, was reported to be in a *notReady* state (a few status checks for this phase are sketched after this list).
      • *Resolution:* Red Hat advised that issues with the NVIDIA operator documentation and components needed to be addressed with NVIDIA support. The customer later reported that *the problems related to NVIDIA were solved*.
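
      For reference, a minimal set of status checks for this phase might look like the following sketch. The policy name `nic-cluster-policy` and the namespace `nvidia-network-operator` are assumptions based on the NVIDIA Network Operator's documented defaults, not details from the case.

      ```sh
      # Overall state of the NicClusterPolicy; per-component states (including the
      # OFED/driver component) appear in the object's status and should reach "ready".
      oc get nicclusterpolicy
      oc describe nicclusterpolicy nic-cluster-policy

      # Pods deployed by the NVIDIA Network Operator; look for ImagePullBackOff or
      # CrashLoopBackOff on the driver pods and check their logs for build errors.
      oc get pods -n nvidia-network-operator
      ```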
          2. SR-IOV Network Configuration Mismatches

      After resolving the NVIDIA issues, the focus shifted to the Red Hat SR-IOV Network Operator configuration:

      • *Namespace Mismatch:* There was an initial problem where the test pod (`testpod1`) was requested in the `openshift-sriov-network-operator` namespace, but the `SriovIBNetwork` resource (or `SriovNetwork`) was configured to target the `default` namespace. This mismatch prevented the network attachment definition (NAD) from being found.
      • *Resolution:* The customer was able to successfully resolve this by aligning the pod and the network definition, confirming they could *create a pod and assign the VF to it* in the `default` namespace (a minimal sketch of the aligned objects follows this list).
      • *Physical Link State:* Even with the configuration corrected, a physical or low-level network configuration issue remained: the assigned InfiniBand interface appeared to be utilized, but its link state was *"INIT" and "DOWN"* when communication was attempted. Red Hat later suggested checking an external configuration, namely whether `virt_enabled 2` was set in OpenSM (possibly via an NVIDIA operator ConfigMap), although this was deemed possibly outside Red Hat's scope.
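
      A minimal sketch of the aligned configuration is shown below, assuming an InfiniBand VF resource exposed by an existing `SriovNetworkNodePolicy`; the network name, resource name, and IPAM range are illustrative and not taken from the case. The operator renders the network attachment definition into `spec.networkNamespace`, so the pod must be created in that same namespace for the annotation to resolve.

      ```sh
      # Hypothetical names for illustration: example-ibnetwork, example_ib_resource.
      cat <<'EOF' | oc apply -f -
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: example-ibnetwork
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: example_ib_resource   # must match an existing SriovNetworkNodePolicy
        networkNamespace: default           # the NAD is created in this namespace
        ipam: |
          { "type": "whereabouts", "range": "192.168.100.0/24" }
      EOF

      cat <<'EOF' | oc apply -f -
      apiVersion: v1
      kind: Pod
      metadata:
        name: testpod1
        namespace: default                  # same namespace as networkNamespace above
        annotations:
          k8s.v1.cni.cncf.io/networks: example-ibnetwork
      spec:
        containers:
        - name: test
          image: registry.access.redhat.com/ubi9/ubi:latest
          command: ["sleep", "infinity"]
      EOF
      ```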
          3. Missing RDMA CNI MetaPlugin Configuration

      The final and most persistent configuration issue centered on fully enabling RDMA functionality, despite the VF being attached to the pod:

      • *Missing `rdma` MetaPlugin:* Red Hat engineers noticed that the core problem preventing RDMA workloads was likely that the `SriovNetwork` object did not include the `rdma` metaPlugin. This specific configuration is crucial for enabling the SR-IOV RDMA CNI (Container Network Interface). The necessary configuration was detailed in the OpenShift documentation for configuring the SR-IOV RDMA CNI (see the sketch after this list).
      • *Configuration After Adding RDMA MetaPlugin:* The customer updated the configuration by adding the RDMA setting to a new network resource (`sriov-ibnetwork-rdma`). This change led to a new error state: the pod was scheduled, but *the network was not allocated*, and the relevant *node went into an unschedulable state* after some time.
      • *Current Blocker (SR-IOV Device Plugin):* Most recently, the investigation indicated that the `sriov-device-plugin` still seemed unable to recognize the RDMA devices. This suggests that while the required resources (like the `SriovNetworkNodePolicy` for the `rdma` resource) might be present, and the network object might include the `rdma` metaPlugin, the device plugin responsible for exposing the RDMA capability to the pods is not functioning correctly.
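
      A minimal sketch of the two objects the RDMA path depends on is shown below, following the pattern in the OpenShift SR-IOV RDMA CNI documentation; the policy/network names, PF name, and IPAM range are illustrative and not taken from the case. The node policy must expose an RDMA-capable resource (`isRdma: true`), and the network object appends the `rdma` CNI through `metaPlugins`.

      ```sh
      cat <<'EOF' | oc apply -f -
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: policy-rdma                    # illustrative name
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: rdma_resource          # advertised as openshift.io/rdma_resource
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        numVfs: 4
        nicSelector:
          pfNames: ["ibs2f0"]                # illustrative PF name; match the real IB PF
        deviceType: netdevice
        isRdma: true                         # required so the device plugin exposes RDMA devices
        # for an InfiniBand PF, linkType: ib may also be required
      EOF

      cat <<'EOF' | oc apply -f -
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetwork
      metadata:
        name: sriov-network-rdma             # illustrative name
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: rdma_resource
        networkNamespace: default
        ipam: |
          { "type": "whereabouts", "range": "192.168.101.0/24" }
        metaPlugins: |
          { "type": "rdma" }
      EOF
      ```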

      In summary, the key configurations at issue spanned three areas: *NVIDIA/OFED readiness*, *SR-IOV network and pod namespace alignment*, and *the explicit definition and recognition of RDMA capabilities via the `rdma` metaPlugin* within the SR-IOV operator objects.

      The specific phrase `iberror: failed: discover: failed` is *not explicitly present* within the provided source materials (Case 04258382 excerpts).

      However, the sources detail a sequence of troubleshooting steps related to InfiniBand (IB) networking that led to failures in device recognition and communication, which would contextually relate to the type of discovery failure indicated by this error phrase.

      Based on the context of the case, this type of error would likely relate to the failure of InfiniBand tools or components to properly discover and map the network topology, despite configuration efforts.

      Here is an explication of the relevant InfiniBand (IB) discovery and networking failures documented in the case that align with a "discover failed" error:

      1. *InfiniBand Link State and Communication Failure (Post-VF Assignment):* After the customer successfully configured and assigned Virtual Functions (VFs) from the InfiniBand interfaces to a pod, they reported a subsequent problem where communication failed. The interface appeared to be utilized, but the link state was reported as *"INIT" and "DOWN"* when communication was attempted. This "down" state indicates that the network path or device itself failed to achieve an active state necessary for data transfer, which can prevent successful discovery of remote IB devices.
      2. *Topology Discovery Attempts:* The customer later executed InfiniBand diagnostic commands, such as `ibswitches` and `ibnetdiscover`.

      • The `ibnetdiscover` command successfully listed the switch and all connected Host Channel Adapters (HCAs) on nodes `openshift-ai-04`, `openshift-ai-05`, and `node001`. The fact that these hosts and links were successfully mapped and assigned Local Identifiers (LIDs) suggests that the basic link layer and the Subnet Manager (SM), which was confirmed to be running on the switch, were functional enough to discover the physical topology.
      3. *Specific Port Failures:* Despite the successful overall topology discovery (`ibnetdiscover`), executing `iblinkinfo` revealed that *"several SRIOV ports are shown as down"* (e.g., `CA 'mlx5_67'`, `CA 'mlx5_68'`, `CA 'mlx5_69'`). This shows that while the topology could be mapped, specific SR-IOV ports were failing to come up, which represents a persistent discovery/connectivity failure for those particular VFs.
      4. *SR-IOV Device Plugin Failure (RDMA Capability):* The root cause of the continuing inability to run RDMA workloads was determined to be a failure in the software layer responsible for exposing the device capabilities: the `sriov-device-plugin` still seemed unable to recognize the RDMA devices. If the device plugin cannot recognize or expose the RDMA capabilities (the necessary resources for the pod), any RDMA-specific "discovery" operations initiated by the pod workload (`iberror: failed: discover: failed`) would fail, even if the physical link is technically "up" for basic InfiniBand networking. This failure persisted despite attempts to add the `rdma` metaPlugin to the `SriovNetwork` object (a few verification commands are sketched below).
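
      A few verification commands for this blocker are sketched below; the daemonset label, pod and namespace names, and the presence of InfiniBand diagnostic tools inside the test image are assumptions, not details from the case.

      ```sh
      # 1) Is the RDMA resource advertised on the node at all?
      oc get node <node-name> -o jsonpath='{.status.allocatable}' | tr ',' '\n' | grep openshift.io

      # 2) What did the SR-IOV device plugin discover? (pod label assumed)
      oc -n openshift-sriov-network-operator logs -l app=sriov-device-plugin --tail=100

      # 3) What does the operator record for the node's SR-IOV state (VFs, drivers, link type)?
      oc -n openshift-sriov-network-operator get sriovnetworknodestates <node-name> -o yaml

      # 4) From inside the workload pod, confirm the IB port state and that an RDMA
      #    device was injected (requires infiniband-diags / rdma-core in the image).
      oc exec -n default testpod1 -- ibstat
      oc exec -n default testpod1 -- rdma link show
      ```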

      Therefore, while the exact error string is missing, the case clearly documents ongoing failures in InfiniBand connectivity and RDMA device discovery within the cluster environment.

              apanatto@redhat.com Andrea Panattoni
              rhn-support-rcegan Radek Cegan
              Zhiqiang Fang Zhiqiang Fang