OpenShift Bugs / OCPBUGS-63360

Unable to run RDMA workloads


    • Quality / Stability / Reliability
    • Important
    • CNF Network Sprint 279

      AI-created summary of the problem from the customer case, based on the comments. The case number is 04258382.

       
      The case moved through three primary phases of configuration challenges:

          1. Initial Third-Party Operator Configuration (NVIDIA Network Operator)

      The first configuration blocker involved the necessary setup components provided by NVIDIA for the InfiniBand hardware.

      • *NVIDIA Network Operator:* The customer initially struggled with the configuration of the *NVIDIA Network Operator*.
      • *NIC Cluster Policy:* The primary symptom was that the *NIC Cluster Policy was in a not ready state*.
      • *OFED Component:* A specific internal component of the NVIDIA Network Operator, the *OFED component*, was reported to be in a *notReady* state (a few status checks for this phase are sketched after this list).
      • *Resolution:* Red Hat advised that issues with the NVIDIA operator documentation and components needed to be addressed with NVIDIA support. The customer later reported that *the problems related to NVIDIA were solved*.
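
      For reference, a minimal set of status checks for this phase might look like the following sketch. The policy name `nic-cluster-policy` and the namespace `nvidia-network-operator` are assumptions based on the NVIDIA Network Operator's documented defaults, not details from the case.

      ```sh
      # Overall state of the NicClusterPolicy; per-component states (including the
      # OFED/driver component) appear in the object's status and should reach "ready".
      oc get nicclusterpolicy
      oc describe nicclusterpolicy nic-cluster-policy

      # Pods deployed by the NVIDIA Network Operator; look for ImagePullBackOff or
      # CrashLoopBackOff on the driver pods and check their logs for build errors.
      oc get pods -n nvidia-network-operator
      ```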
          2. SR-IOV Network Configuration Mismatches

      After resolving the NVIDIA issues, the focus shifted to the Red Hat SR-IOV Network Operator configuration:

      • *Namespace Mismatch:* There was an initial problem where the test pod (`testpod1`) was requested in the `openshift-sriov-network-operator` namespace, but the `SriovIBNetwork` resource (or `SriovNetwork`) was configured to target the `default` namespace. This mismatch prevented the network attachment definition (NAD) from being found.
      • *Resolution:* The customer was able to successfully resolve this by aligning the pod and the network definition, confirming they could *create a pod and assign the VF to it* in the `default` namespace (a minimal sketch of the aligned objects follows this list).
      • *Physical Link State:* Even with the configuration corrected, a physical or low-level network configuration issue remained: the assigned InfiniBand interface appeared to be utilized, but its link state was *"INIT" and "DOWN"* when communication was attempted. Red Hat later suggested checking an external configuration, namely whether `virt_enabled 2` was set in OpenSM (possibly via an NVIDIA operator ConfigMap), although this was deemed possibly outside Red Hat's scope.
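
      A minimal sketch of the aligned configuration is shown below, assuming an InfiniBand VF resource exposed by an existing `SriovNetworkNodePolicy`; the network name, resource name, and IPAM range are illustrative and not taken from the case. The operator renders the network attachment definition into `spec.networkNamespace`, so the pod must be created in that same namespace for the annotation to resolve.

      ```sh
      # Hypothetical names for illustration: example-ibnetwork, example_ib_resource.
      cat <<'EOF' | oc apply -f -
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: example-ibnetwork
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: example_ib_resource   # must match an existing SriovNetworkNodePolicy
        networkNamespace: default           # the NAD is created in this namespace
        ipam: |
          { "type": "whereabouts", "range": "192.168.100.0/24" }
      EOF

      cat <<'EOF' | oc apply -f -
      apiVersion: v1
      kind: Pod
      metadata:
        name: testpod1
        namespace: default                  # same namespace as networkNamespace above
        annotations:
          k8s.v1.cni.cncf.io/networks: example-ibnetwork
      spec:
        containers:
        - name: test
          image: registry.access.redhat.com/ubi9/ubi:latest
          command: ["sleep", "infinity"]
      EOF
      ```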
          3. Missing RDMA CNI MetaPlugin Configuration

      The final and most persistent configuration issue centered on fully enabling RDMA functionality, despite the VF being attached to the pod:

      • *Missing `rdma` MetaPlugin:* Red Hat engineers noticed that the core problem preventing RDMA workloads was likely that the `SriovNetwork` object did not include the `rdma` metaPlugin. This specific configuration is crucial for enabling the SR-IOV RDMA CNI (Container Network Interface). The necessary configuration was detailed in the OpenShift documentation for configuring the SR-IOV RDMA CNI (see the sketch after this list).
      • *Configuration After Adding RDMA MetaPlugin:* The customer updated the configuration by adding the RDMA setting to a new network resource (`sriov-ibnetwork-rdma`). This change led to a new error state: the pod was scheduled, but *the network was not allocated*, and the relevant *node went into an unschedulable state* after some time.
      • *Current Blocker (SR-IOV Device Plugin):* Most recently, the investigation indicated that the `sriov-device-plugin` still seemed unable to recognize the RDMA devices. This suggests that while the required resources (like the `SriovNetworkNodePolicy` for the `rdma` resource) might be present, and the network object might include the `rdma` metaPlugin, the device plugin responsible for exposing the RDMA capability to the pods is not functioning correctly.
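
      A minimal sketch of the two objects the RDMA path depends on is shown below, following the pattern in the OpenShift SR-IOV RDMA CNI documentation; the policy/network names, PF name, and IPAM range are illustrative and not taken from the case. The node policy must expose an RDMA-capable resource (`isRdma: true`), and the network object appends the `rdma` CNI through `metaPlugins`.

      ```sh
      cat <<'EOF' | oc apply -f -
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: policy-rdma                    # illustrative name
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: rdma_resource          # advertised as openshift.io/rdma_resource
        nodeSelector:
          feature.node.kubernetes.io/network-sriov.capable: "true"
        numVfs: 4
        nicSelector:
          pfNames: ["ibs2f0"]                # illustrative PF name; match the real IB PF
        deviceType: netdevice
        isRdma: true                         # required so the device plugin exposes RDMA devices
        # for an InfiniBand PF, linkType: ib may also be required
      EOF

      cat <<'EOF' | oc apply -f -
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetwork
      metadata:
        name: sriov-network-rdma             # illustrative name
        namespace: openshift-sriov-network-operator
      spec:
        resourceName: rdma_resource
        networkNamespace: default
        ipam: |
          { "type": "whereabouts", "range": "192.168.101.0/24" }
        metaPlugins: |
          { "type": "rdma" }
      EOF
      ```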

      In summary, the key configurations at issue spanned three areas: *NVIDIA/OFED readiness*, *SR-IOV network and pod namespace alignment*, and *the explicit definition and recognition of RDMA capabilities via the `rdma` metaPlugin* within the SR-IOV operator objects.

      The specific phrase `iberror: failed: discover: failed` is *not explicitly present* within the provided source materials (Case 04258382 excerpts).

      However, the sources detail a sequence of troubleshooting steps related to InfiniBand (IB) networking that led to failures in device recognition and communication, which would contextually relate to the type of discovery failure indicated by this error phrase.

      Based on the context of the case, this type of error would likely relate to the failure of InfiniBand tools or components to properly discover and map the network topology, despite configuration efforts.

      Here is an explication of the relevant InfiniBand (IB) discovery and networking failures documented in the case that align with a "discover failed" error:

      1. *InfiniBand Link State and Communication Failure (Post-VF Assignment):* After the customer successfully configured and assigned Virtual Functions (VFs) from the InfiniBand interfaces to a pod, they reported a subsequent problem where communication failed. The interface appeared to be utilized, but the link state was reported as *"INIT" and "DOWN"* when communication was attempted. This "down" state indicates that the network path or device itself failed to achieve an active state necessary for data transfer, which can prevent successful discovery of remote IB devices.
      2. *Topology Discovery Attempts:* The customer later executed InfiniBand diagnostic commands, such as `ibswitches` and `ibnetdiscover`.

      • The `ibnetdiscover` command successfully listed the switch and all connected Host Channel Adapters (HCAs) on nodes `openshift-ai-04`, `openshift-ai-05`, and `node001`. The fact that these hosts and links were successfully mapped and assigned Local Identifiers (LIDs) suggests that the basic link layer and the Subnet Manager (SM), which was confirmed to be running on the switch, were functional enough to discover the physical topology.
      3. *Specific Port Failures:* Despite the successful overall topology discovery (`ibnetdiscover`), executing `iblinkinfo` revealed that *"several SRIOV ports are shown as down"* (e.g., `CA 'mlx5_67'`, `CA 'mlx5_68'`, `CA 'mlx5_69'`). This shows that while the topology could be mapped, specific SR-IOV ports were failing to come up, which represents a persistent discovery/connectivity failure for those particular VFs.
      4. *SR-IOV Device Plugin Failure (RDMA Capability):* The root cause of the continuing inability to run RDMA workloads was determined to be a failure in the software layer responsible for exposing the device capabilities: the `sriov-device-plugin` still seemed unable to recognize the RDMA devices. If the device plugin cannot recognize or expose the RDMA capabilities (the necessary resources for the pod), any RDMA-specific "discovery" operations initiated by the pod workload (`iberror: failed: discover: failed`) would fail, even if the physical link is technically "up" for basic InfiniBand networking. This failure persisted despite attempts to add the `rdma` metaPlugin to the `SriovNetwork` object (a few verification commands are sketched below).
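
      A few verification commands for this blocker are sketched below; the daemonset label, pod and namespace names, and the presence of InfiniBand diagnostic tools inside the test image are assumptions, not details from the case.

      ```sh
      # 1) Is the RDMA resource advertised on the node at all?
      oc get node <node-name> -o jsonpath='{.status.allocatable}' | tr ',' '\n' | grep openshift.io

      # 2) What did the SR-IOV device plugin discover? (pod label assumed)
      oc -n openshift-sriov-network-operator logs -l app=sriov-device-plugin --tail=100

      # 3) What does the operator record for the node's SR-IOV state (VFs, drivers, link type)?
      oc -n openshift-sriov-network-operator get sriovnetworknodestates <node-name> -o yaml

      # 4) From inside the workload pod, confirm the IB port state and that an RDMA
      #    device was injected (requires infiniband-diags / rdma-core in the image).
      oc exec -n default testpod1 -- ibstat
      oc exec -n default testpod1 -- rdma link show
      ```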

      Therefore, while the exact error string is missing, the case clearly documents ongoing failures in InfiniBand connectivity and RDMA device discovery within the cluster environment.

              apanatto@redhat.com Andrea Panattoni
              rhn-support-rcegan Radek Cegan
              Zhiqiang Fang Zhiqiang Fang