Bug
Resolution: Unresolved
Critical
4.20
CNF Network Sprint 285
Summary
_______
SR-IOV Network Operator fails to configure InfiniBand Virtual Functions (VFs) due to netlink "message too long" errors, even after increasing kernel netlink buffers to 16MB. The root cause is that the SR-IOV daemon sets its own SO_RCVBUF socket buffer size which overrides kernel sysctl settings.
Environment
___________
OpenShift | Server Version: 4.20.8 | Kubernetes Version: v1.33.6
SR-IOV Operator | Package: sriov-network-operator.v4.20.0-202601120340 | Version: 4.20.0-202601120340
Node Information
___________________
Node: ocp-poc26704-13779
OS: Red Hat Enterprise Linux CoreOS 9.6.20251205-0 (Plow)
Kernel: 5.14.0-570.73.1.el9_6.x86_64
Hardware
___________________
Device: Mellanox Technologies MT2910 Family [ConnectX-7]
PCI Address: 0000:18:00.0
Interface: ibp24s0
Link Type: InfiniBand
SR-IOV Total VFs: 16
Steps to Reproduce
__________________
Apply the following configuration to an OpenShift cluster with Mellanox ConnectX-7 InfiniBand NICs:
SriovNetworkNodePolicy for ibp24s0 on node ocp-poc26704-13779:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ocp-poc26704-13779-ibp24s0
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: true
  linkType: ib
  nicSelector:
    pfNames:
    - ibp24s0
  nodeSelector:
    kubernetes.io/hostname: ocp-poc26704-13779
  numVfs: 16
  priority: 99
  resourceName: ibp24s0rdma
SriovIBNetwork for ibp24s0 on node ocp-poc26704-13779:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: ibp24s0-network
  namespace: openshift-sriov-network-operator
spec:
  ipam: |-
    {
      "type": "whereabouts",
      "range": "10.0.107.0/24",
      "routes": [
        { "dst": "192.168.81.0/24" }
      ],
      "device-info": Unknown macro: { "type"}
    }
  linkState: enable
  networkNamespace: default
  resourceName: ibp24s0rdma
Expected Behavior
_________________
1. SR-IOV config daemon writes 16 to /sys/class/net/ibp24s0/device/sriov_numvfs
2. Kernel creates 16 InfiniBand VFs successfully
3. Daemon queries PF link information via netlink to configure VF details
4. VFs remain configured and available as Kubernetes resources (openshift.io/ibp24s0rdma: "16")
Actual Behavior
_______________
1. [OK] Config daemon successfully writes 16 to /sys/class/net/ibp24s0/device/sriov_numvfs
2. [OK] VFs are created successfully in kernel (ib0-ib15, renamed to ibp24s0v0-ibp24s0v15)
3. [FAIL] Netlink query for PF link information fails with "message too long" error
4. [FAIL] Daemon treats this as configuration failure and rolls back
5. [FAIL] Daemon writes 0 to /sys/class/net/ibp24s0/device/sriov_numvfs, destroying all VFs
6. [FAIL] Resources remain unavailable (allocatable: 0)
Complete Error Sequence from Daemon Logs
__________________________________________
2026-01-29T07:59:48.421229338Z LEVEL(-2) sriov/sriov.go:1061 SetSriovNumVfs(): set NumVfs {"device": "0000:18:00.0", "numVfs": 16}
[6 second delay while kernel creates VFs]
2026-01-29T07:59:54.805253081Z ERROR sriov/sriov.go:587 configSriovVFDevices(): unable to get PF link for device {"device": {"pciAddress":"0000:18:00.0","numVfs":16,"name":"ibp24s0","linkType":"ib","vfGroups":[{"resourceName":"ibp24s0rdma","deviceType":"netdevice","vfRange":"0-15","policyName":"policy-ocp-poc26704-13779-ibp24s0","isRdma":true}]}, "error": "message too long"}
2026-01-29T07:59:54.805286435Z ERROR sriov/sriov.go:615 configSriovInterfaces(): fail to configure sriov interface. resetting interface. {"address": "0000:18:00.0", "error": "message too long"}
2026-01-29T07:59:54.805299301Z LEVEL(-2) sriov/sriov.go:115 SetSriovNumVfs(): set NumVfs {"device": "0000:18:00.0", "numVfs": 0}
2026-01-29T08:00:04.728019316Z ERROR generic/generic_plugin.go:229 cannot configure sriov interfaces {"error": "message too long"}
Timeline:
___________________
07:59:48 - VF creation initiated
07:59:54 - ERROR: "message too long" when querying PF link
07:59:54 - Rollback initiated, VFs set to 0
08:00:04 - Configuration marked as failed
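The intervals in this timeline can be recomputed from the daemon log timestamps. The following is a minimal Python sketch, assuming the RFC 3339 timestamps shown in the daemon logs; the helper name `parse_ts` is illustrative only, and nanoseconds are truncated to microseconds because `datetime` does not carry finer precision.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a UTC timestamp such as 2026-01-29T07:59:48.421229338Z."""
    date_part, frac = ts.rstrip("Z").split(".")
    micros = frac[:6].ljust(6, "0")  # datetime keeps microseconds only
    parsed = datetime.strptime(f"{date_part}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


# Timestamps copied from the daemon log entries in this report.
set_vfs = parse_ts("2026-01-29T07:59:48.421229338Z")
link_error = parse_ts("2026-01-29T07:59:54.805253081Z")
marked_failed = parse_ts("2026-01-29T08:00:04.728019316Z")

create_to_error = (link_error - set_vfs).total_seconds()
error_to_failed = (marked_failed - link_error).total_seconds()

print(f"VF creation to netlink error: {create_to_error:.3f} s")
print(f"netlink error to failure:     {error_to_failed:.3f} s")
```

This gives roughly 6.4 seconds between the sriov_numvfs write and the netlink error, and roughly 9.9 seconds between the error and the configuration being marked as failed.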
Root Cause: Daemon SO_RCVBUF Overrides Kernel Sysctl Settings
_______________________________________________________________
Kernel Buffer Increase Did NOT Help
_______________________________________________________
I increased the kernel socket buffer defaults and limits, which also apply to netlink sockets, to 16 MB via MachineConfig:
# Kernel sysctl values on node ocp-poc26704-13779 (verified 2026-01-29)
$ oc debug node/ocp-poc26704-13779 -- chroot /host sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max
net.core.rmem_default = 16777216    # 16 MB
net.core.rmem_max = 16777216        # 16 MB
net.core.wmem_default = 16777216    # 16 MB
net.core.wmem_max = 16777216        # 16 MB
Despite this 16MB increase, the "message too long" errors persist.
This indicates that:
1. [FAIL] The kernel sysctl settings alone are insufficient
2. [OK] The SR-IOV daemon sets its own socket buffer via SO_RCVBUF
3. [OK] The daemon's SO_RCVBUF setting overrides the kernel defaults
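How an explicit SO_RCVBUF relates to the sysctl values can be checked directly on a socket. The following is a minimal Python sketch, not the daemon's code: `effective_rcvbuf` and `open_probe_socket` are hypothetical helpers that report the receive buffer size before and after an explicit `setsockopt(SO_RCVBUF)`. On Linux, a socket that never calls `setsockopt` starts at net.core.rmem_default, a requested size is capped at net.core.rmem_max (unless a privileged process uses SO_RCVBUFFORCE), and the kernel reports double the accepted value.

```python
import socket


def effective_rcvbuf(sock, requested=None):
    """Return (before, after) SO_RCVBUF values in bytes for an open socket."""
    before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    if requested is not None:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    after = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return before, after


def open_probe_socket():
    """Open a NETLINK_ROUTE socket where available, otherwise a UDP socket."""
    if hasattr(socket, "AF_NETLINK"):
        try:
            # Protocol 0 is NETLINK_ROUTE.
            return socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, 0)
        except OSError:
            pass  # fall back when netlink sockets are not permitted
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


with open_probe_socket() as s:
    before, after = effective_rcvbuf(s, requested=1 << 20)
    print(f"default SO_RCVBUF: {before} bytes")
    print(f"after requesting 1 MiB: {after} bytes")
```

Run on the affected node, the first value shows what a socket inherits from the sysctl default, and the second shows what an explicit request actually yields under the current net.core.rmem_max.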
Technical Analysis
__________________
The error occurs when the SR-IOV config daemon calls netlink to retrieve PF link information after creating VFs:
File: sriov/sriov.go
Function: configSriovVFDevices()
Line: 587
The daemon uses the vishvananda/netlink Go library for netlink operations. This library creates netlink sockets with a hardcoded buffer size via setsockopt(SO_RCVBUF), which overrides the kernel's net.core.rmem_default setting.
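To check the failing query outside the daemon, a similar request can be issued by hand. The sketch below assumes the PF link lookup is an RTM_GETLINK request carrying the RTEXT_FILTER_VF extension mask (the mask that asks the kernel to include per-VF information); that assumption is not confirmed by this report. The numeric constants come from the Linux headers linux/netlink.h, linux/rtnetlink.h, and linux/if_link.h. The helper names `build_getlink_request`, `decode_reply`, and `query_link` are hypothetical.

```python
import errno
import socket
import struct

RTM_GETLINK = 18          # linux/rtnetlink.h
NLMSG_ERROR = 2           # linux/netlink.h
NLM_F_REQUEST = 0x1       # linux/netlink.h
IFLA_EXT_MASK = 29        # linux/if_link.h
RTEXT_FILTER_VF = 1 << 0  # linux/rtnetlink.h
AF_UNSPEC = 0
NLMSG_HDRLEN = 16


def build_getlink_request(ifindex, ext_mask=RTEXT_FILTER_VF, seq=1):
    """Build an RTM_GETLINK request for one interface index."""
    # struct ifinfomsg: family, pad, type, index, flags, change
    ifinfo = struct.pack("=BxHiII", AF_UNSPEC, 0, ifindex, 0, 0)
    # struct rtattr (len, type) followed by a u32 payload
    attr = struct.pack("=HHI", 8, IFLA_EXT_MASK, ext_mask)
    payload = ifinfo + attr
    # struct nlmsghdr: len, type, flags, seq, pid
    header = struct.pack("=IHHII", NLMSG_HDRLEN + len(payload),
                         RTM_GETLINK, NLM_F_REQUEST, seq, 0)
    return header + payload


def decode_reply(data):
    """Return (msg_type, error_name) for the first message in a reply.

    error_name is None unless the reply is an NLMSG_ERROR with a
    non-zero error code.
    """
    _length, msg_type, _flags, _seq, _pid = struct.unpack_from("=IHHII", data, 0)
    if msg_type == NLMSG_ERROR:
        (code,) = struct.unpack_from("=i", data, NLMSG_HDRLEN)
        if code != 0:
            return msg_type, errno.errorcode.get(-code, str(-code))
    return msg_type, None


def query_link(ifname, recv_size=1 << 20):
    """Send the request for ifname and decode the first reply (Linux only)."""
    ifindex = socket.if_nametoindex(ifname)
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, 0) as s:
        s.bind((0, 0))  # (pid, groups): let the kernel assign the port id
        s.send(build_getlink_request(ifindex))
        return decode_reply(s.recv(recv_size))
```

Calling `query_link("ibp24s0")` on the node with 16 VFs configured would show whether the kernel itself answers with an error. An NLMSG_ERROR reply carrying EMSGSIZE would correspond to the "message too long" string in the daemon log, since that is the error text for EMSGSIZE.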
Impact:
______
Cannot deploy InfiniBand SR-IOV for containerized workloads
RDMA-intensive applications blocked (AI/ML training, HPC, high-performance storage)
Blocks entire product category for HPC/AI workloads on OpenShift
Proposed fix: increase the daemon's netlink socket buffer.
Workarounds
___________
No effective workaround is currently available within the operator.
I attempted to work around this by increasing kernel netlink buffers to 16MB:
Applied via MachineConfig:
net.core.rmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_default = 16777216
net.core.wmem_max = 16777216
The errors persist because the daemon sets its own SO_RCVBUF.
Manual VF Creation (Bypass Operator)
The only current workaround is to manually create VFs, which bypasses the SR-IOV operator entirely:
# SSH to node
ssh core@ocp-poc26704-13779
# Manually create VFs (writing to sysfs requires root)
echo 16 | sudo tee /sys/class/net/ibp24s0/device/sriov_numvfs
# Verify VFs created
ls -l /sys/class/net/ibp24s0v*
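The verification step can also be scripted. The following is a small Python sketch, assuming the standard sysfs layout in which each VF appears as a `virtfnN` symlink under the PF's PCI device directory; `count_vfs` and its `sysfs_root` parameter are hypothetical names, with the parameter added so the helper can be pointed at a test directory.

```python
from pathlib import Path


def count_vfs(pf_name, sysfs_root="/sys"):
    """Return (configured, present) VF counts for a physical function.

    configured: value read from sriov_numvfs
    present:    number of virtfn* entries under the PF device directory
    """
    device_dir = Path(sysfs_root) / "class" / "net" / pf_name / "device"
    configured = int((device_dir / "sriov_numvfs").read_text().strip())
    present = sum(1 for _ in device_dir.glob("virtfn*"))
    return configured, present
```

On the node, `count_vfs("ibp24s0")` returning two equal non-zero values would indicate that the VFs still exist; both values dropping to 0 shortly after the operator reconciles would match the rollback described in this report.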
VF Creation Works, Rollback is the Problem
NetworkManager logs prove that VFs are created successfully:
Jan 29 07:59:49 ocp-poc26704-13779 NetworkManager[2121]: <info> manager: (ib0): new InfiniBand device (carrier: OFF, driver: 'mlx5_core', ifindex: 1712)
Jan 29 07:59:49 ocp-poc26704-13779 NetworkManager[2121]: <info> manager: (ib1): new InfiniBand device (carrier: OFF, driver: 'mlx5_core', ifindex: 1713)
...
Jan 29 07:59:49 ocp-poc26704-13779 NetworkManager[2121]: <info> manager: (ib15): new InfiniBand device (carrier: OFF, driver: 'mlx5_core', ifindex: 1727)
Jan 29 07:59:54 ocp-poc26704-13779 NetworkManager[2121]: <info> device (ib0): interface index 1712 renamed iface from 'ib0' to 'ibp24s0v0'
...
Jan 29 07:59:54 ocp-poc26704-13779 NetworkManager[2121]: <info> device (ib15): interface index 1727 renamed iface from 'ib15' to 'ibp24s0v15'
All 16 VFs are created and renamed successfully. The operator then destroys them due to the netlink query failure.
duplicates
OCPBUGS-74889 failed to create large number of infiniband vfs
New