OpenShift Bugs / OCPBUGS-74637

AI inference InfiniBand blocker | "message too long" Netlink Error

    • Bug
    • Resolution: Unresolved
    • Critical
    • 4.20
    • Networking / SR-IOV
    • CNF Network Sprint 285

      Summary
      _______
      SR-IOV Network Operator fails to configure InfiniBand Virtual Functions (VFs) due to netlink "message too long" errors, even after increasing kernel netlink buffers to 16MB. The root cause is that the SR-IOV daemon sets its own SO_RCVBUF socket buffer size which overrides kernel sysctl settings.

      Environment
      ___________

      OpenShift | Server Version: 4.20.8 | Kubernetes Version: v1.33.6

      SR-IOV Operator Version | sriov-network-operator.v4.20.0-202601120340 | Version: 4.20.0-202601120340

      Node Information
      ___________________

      Node: ocp-poc26704-13779
      OS: Red Hat Enterprise Linux CoreOS 9.6.20251205-0 (Plow)
      Kernel: 5.14.0-570.73.1.el9_6.x86_64

      Hardware
      ___________________

      Device: Mellanox Technologies MT2910 Family [ConnectX-7]
      PCI Address: 0000:18:00.0
      Interface: ibp24s0
      Link Type: InfiniBand
      SR-IOV Total VFs: 16

      Steps to Reproduce
      __________________

      Apply the following configuration to an OpenShift cluster with Mellanox ConnectX-7 InfiniBand NICs:

      SriovNetworkNodePolicy for ibp24s0 on node ocp-poc26704-13779

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: policy-ocp-poc26704-13779-ibp24s0
        namespace: openshift-sriov-network-operator
      spec:
        deviceType: netdevice
        isRdma: true
        linkType: ib
        nicSelector:
          pfNames:
          - ibp24s0
        nodeSelector:
          kubernetes.io/hostname: ocp-poc26704-13779
        numVfs: 16
        priority: 99
        resourceName: ibp24s0rdma

      SriovIBNetwork for ibp24s0 on node ocp-poc26704-13779

      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovIBNetwork
      metadata:
        name: ibp24s0-network
        namespace: openshift-sriov-network-operator
      spec:
        ipam: |-
          {
            "type": "whereabouts",
            "range": "10.0.107.0/24",
            "routes": [
              { "dst": "192.168.81.0/24" }
            ],
            "device-info": [value garbled in the issue text: Unknown macro: { "type"}]
          }
        linkState: enable
        networkNamespace: default
        resourceName: ibp24s0rdma

      Expected Behavior
      _________________

      1. SR-IOV config daemon writes 16 to /sys/class/net/ibp24s0/device/sriov_numvfs
      2. Kernel creates 16 InfiniBand VFs successfully
      3. Daemon queries PF link information via netlink to configure VF details
      4. VFs remain configured and available as Kubernetes resources (openshift.io/ibp24s0rdma: "16")
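      Steps 1 and 2 above can be checked directly on the node by reading the standard SR-IOV sysfs attributes. The following is a minimal sketch, not part of the operator: the helper names and the `sysfs_root` parameter are hypothetical and exist only so the check can be pointed at a test directory.

```python
from pathlib import Path


def read_vf_counts(pf_name, sysfs_root="/sys/class/net"):
    """Return (configured, supported) VF counts for a physical function.

    Reads the sriov_numvfs and sriov_totalvfs attributes found under
    <sysfs_root>/<pf_name>/device/.
    """
    device_dir = Path(sysfs_root) / pf_name / "device"
    configured = int((device_dir / "sriov_numvfs").read_text().strip())
    supported = int((device_dir / "sriov_totalvfs").read_text().strip())
    return configured, supported


def vfs_match_policy(pf_name, expected_vfs, sysfs_root="/sys/class/net"):
    """True when the kernel currently reports exactly expected_vfs VFs."""
    configured, supported = read_vf_counts(pf_name, sysfs_root)
    return expected_vfs <= supported and configured == expected_vfs


if __name__ == "__main__":
    # Interface name and VF count taken from the policy in this report.
    print(vfs_match_policy("ibp24s0", 16))
```

      Run on the affected node after the daemon has finished, this would be expected to report a mismatch, since the rollback described under Actual Behavior writes 0 back to sriov_numvfs.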

      Actual Behavior
      _______________

      1. [OK] Config daemon successfully writes 16 to /sys/class/net/ibp24s0/device/sriov_numvfs
      2. [OK] VFs are created successfully in kernel (ib0-ib15, renamed to ibp24s0v0-ibp24s0v15)
      3. [FAIL] Netlink query for PF link information fails with "message too long" error
      4. [FAIL] Daemon treats this as configuration failure and rolls back
      5. [FAIL] Daemon writes 0 to /sys/class/net/ibp24s0/device/sriov_numvfs, destroying all VFs
      6. [FAIL] Resources remain unavailable (allocatable: 0)

      Complete Error Sequence from Daemon Logs
      __________________________________________

      2026-01-29T07:59:48.421229338Z LEVEL(-2) sriov/sriov.go:1061 SetSriovNumVfs(): set NumVfs {"device": "0000:18:00.0", "numVfs": 16}

      [6 second delay while kernel creates VFs]

      2026-01-29T07:59:54.805253081Z ERROR sriov/sriov.go:587 configSriovVFDevices(): unable to get PF link for device {"device": {"pciAddress":"0000:18:00.0","numVfs":16,"name":"ibp24s0","linkType":"ib","vfGroups":[{"resourceName":"ibp24s0rdma","deviceType":"netdevice","vfRange":"0-15","policyName":"policy-ocp-poc26704-13779-ibp24s0","isRdma":true}]}, "error": "message too long"}

      2026-01-29T07:59:54.805286435Z ERROR sriov/sriov.go:615 configSriovInterfaces(): fail to configure sriov interface. resetting interface. {"address": "0000:18:00.0", "error": "message too long"}

      2026-01-29T07:59:54.805299301Z LEVEL(-2) sriov/sriov.go:115 SetSriovNumVfs(): set NumVfs {"device": "0000:18:00.0", "numVfs": 0}

      2026-01-29T08:00:04.728019316Z ERROR generic/generic_plugin.go:229 cannot configure sriov interfaces {"error": "message too long"}
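      The "error" field in the entries above carries the text that Go's syscall package uses for the EMSGSIZE errno. The sketch below maps that string back to the errno name when triaging log payloads. It assumes the daemon surfaces the raw errno text unchanged, which this report does not confirm; the table and function name are hypothetical.

```python
import errno
import json

# Lowercase error text as rendered by Go's syscall package, mapped to errno.
# Only the string seen in this report is listed.
GO_ERRNO_TEXT = {"message too long": errno.EMSGSIZE}


def errno_from_log_payload(payload):
    """Return the errno name for the "error" field of a JSON payload, or None."""
    message = json.loads(payload).get("error", "")
    code = GO_ERRNO_TEXT.get(message)
    return errno.errorcode[code] if code is not None else None


if __name__ == "__main__":
    sample = '{"address": "0000:18:00.0", "error": "message too long"}'
    print(errno_from_log_payload(sample))
```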

       

      Timeline:
      ___________________

      07:59:48 - VFs creation initiated
      07:59:54 - ERROR: "message too long" when querying PF link
      07:59:54 - Rollback initiated, VFs set to 0
      08:00:04 - Configuration marked as failed
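
      The intervals in this timeline can be recomputed from the daemon log timestamps. A small sketch follows; the timestamps are copied from the log entries in this report, and the nanosecond fraction is truncated to microseconds because Python's datetime does not store nanoseconds.

```python
from datetime import datetime


def parse_log_time(stamp):
    """Parse an RFC 3339 UTC timestamp, truncating nanoseconds to microseconds."""
    base, frac = stamp.rstrip("Z").split(".")
    return datetime.strptime(f"{base}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    return (parse_log_time(end) - parse_log_time(start)).total_seconds()


# Time from "set NumVfs 16" to the first netlink error.
vf_creation = seconds_between(
    "2026-01-29T07:59:48.421229338Z", "2026-01-29T07:59:54.805253081Z"
)
# Time from "set NumVfs 0" to the final failure entry.
rollback = seconds_between(
    "2026-01-29T07:59:54.805299301Z", "2026-01-29T08:00:04.728019316Z"
)

if __name__ == "__main__":
    print(f"VF creation to error: {vf_creation:.3f} s")
    print(f"Rollback to failure:  {rollback:.3f} s")
```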

      Root Cause: Daemon SO_RCVBUF Overrides Kernel Sysctl Settings
      _______________________________________________________________

      Kernel Buffer Increase Did NOT Help
      _______________________________________________________

      I increased the kernel netlink buffer sizes to 16MB via MachineConfig:

      # Kernel sysctl values on node ocp-poc26704-13779 (verified 2026-01-29)
      $ oc debug node/ocp-poc26704-13779 -- chroot /host sysctl net.core.rmem_default net.core.rmem_max net.core.wmem_default net.core.wmem_max

      net.core.rmem_default = 16777216 # 16 MB
      net.core.rmem_max = 16777216 # 16 MB
      net.core.wmem_default = 16777216 # 16 MB
      net.core.wmem_max = 16777216 # 16 MB

      Despite this 16MB increase, the "message too long" errors persist.

      This proves that:
      1. [FAIL] The kernel sysctl settings alone are insufficient
      2. [OK] The SR-IOV daemon sets its own socket buffer via SO_RCVBUF
      3. [OK] The daemon's SO_RCVBUF setting overrides the kernel defaults
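
      To confirm that the MachineConfig actually took effect, the sysctl output can be parsed and compared against the intended 16 MiB. This is a minimal sketch; the function names are hypothetical and the sample text is the output quoted in this report.

```python
MIB = 1024 * 1024


def parse_sysctl(output):
    """Parse 'key = value' lines from sysctl output into a dict of ints.

    Trailing '# ...' annotations and lines without '=' are ignored.
    """
    values = {}
    for line in output.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = int(value.strip())
    return values


def all_at_least(values, minimum_bytes):
    """True when every parsed sysctl value is >= minimum_bytes."""
    return all(v >= minimum_bytes for v in values.values())


if __name__ == "__main__":
    sample = """
    net.core.rmem_default = 16777216 # 16 MB
    net.core.rmem_max = 16777216 # 16 MB
    net.core.wmem_default = 16777216 # 16 MB
    net.core.wmem_max = 16777216 # 16 MB
    """
    print(all_at_least(parse_sysctl(sample), 16 * MIB))
```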

      Technical Analysis
      __________________

      The error occurs when the SR-IOV config daemon calls netlink to retrieve PF link information after creating VFs:

      File: sriov/sriov.go
      Function: configSriovVFDevices()
      Line: 587

      The daemon uses the vishvananda/netlink Go library for netlink operations. This library creates netlink sockets with a hardcoded buffer size via setsockopt(SO_RCVBUF), which overrides the kernel's net.core.rmem_default setting.
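
      The interaction between setsockopt(SO_RCVBUF) and the sysctl defaults can be observed on any Linux host. The sketch below opens a NETLINK_ROUTE socket and reads back its receive buffer size, first untouched and then after an explicit request. It does not reproduce the daemon or the Go library; the 65536-byte request is a hypothetical value chosen for illustration. On Linux, the kernel doubles the requested value for bookkeeping overhead and limits unprivileged requests to net.core.rmem_max.

```python
import socket


def netlink_rcvbuf(requested=None):
    """Return the effective SO_RCVBUF (bytes) of a fresh NETLINK_ROUTE socket.

    With requested=None the socket keeps the kernel default
    (net.core.rmem_default). Otherwise the value is applied with
    setsockopt(SO_RCVBUF) before it is read back.
    """
    with socket.socket(
        socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE
    ) as sock:
        if requested is not None:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    print("kernel default:", netlink_rcvbuf())
    # 65536 is a hypothetical explicit size, not a value taken from the daemon.
    print("after setsockopt(65536):", netlink_rcvbuf(65536))
```

      If the two printed values differ on the affected node, a process that sets its own buffer size is not governed by net.core.rmem_default, which is consistent with the behavior this report describes.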

      Impact:
      ______

      Cannot deploy InfiniBand SR-IOV for containerized workloads
      RDMA-intensive applications blocked (AI/ML training, HPC, high-performance storage)
      Blocks entire product category for HPC/AI workloads on OpenShift

      We should increase the daemon's netlink socket buffer.

      Workarounds
      ___________

      No effective workaround is currently available through the operator.

      I attempted to work around this by increasing kernel netlink buffers to 16MB:

      Applied via MachineConfig:
      net.core.rmem_default = 16777216
      net.core.rmem_max = 16777216
      net.core.wmem_default = 16777216
      net.core.wmem_max = 16777216

      The errors persist because the daemon sets its own SO_RCVBUF.

      Manual VF Creation (Bypass Operator)

      The only current workaround is to manually create VFs, which bypasses the SR-IOV operator entirely:

      # SSH to node
      ssh core@ocp-poc26704-13779

      # Manually create VFs (writing to sysfs requires root)
      echo 16 | sudo tee /sys/class/net/ibp24s0/device/sriov_numvfs

      # Verify VFs created
      ls -l /sys/class/net/ibp24s0v*

      VF Creation Works, Rollback is the Problem

      NetworkManager logs prove that VFs are created successfully:

      Jan 29 07:59:49 ocp-poc26704-13779 NetworkManager[2121]: <info> manager: (ib0): new InfiniBand device (carrier: OFF, driver: 'mlx5_core', ifindex: 1712)
      Jan 29 07:59:49 ocp-poc26704-13779 NetworkManager[2121]: <info> manager: (ib1): new InfiniBand device (carrier: OFF, driver: 'mlx5_core', ifindex: 1713)
      ...
      Jan 29 07:59:49 ocp-poc26704-13779 NetworkManager[2121]: <info> manager: (ib15): new InfiniBand device (carrier: OFF, driver: 'mlx5_core', ifindex: 1727)

      Jan 29 07:59:54 ocp-poc26704-13779 NetworkManager[2121]: <info> device (ib0): interface index 1712 renamed iface from 'ib0' to 'ibp24s0v0'
      ...
      Jan 29 07:59:54 ocp-poc26704-13779 NetworkManager[2121]: <info> device (ib15): interface index 1727 renamed iface from 'ib15' to 'ibp24s0v15'

      All 16 VFs are created and renamed successfully. The operator then destroys them due to the netlink query failure.
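
      The same conclusion can be checked mechanically by counting the creation and rename events in the NetworkManager journal. The following sketch assumes the message wording shown in the excerpts above; the regular expressions and function name are hypothetical.

```python
import re

# Patterns based on the NetworkManager messages quoted in this report.
NEW_DEVICE = re.compile(r"\((ib\d+)\): new InfiniBand device")
RENAMED = re.compile(r"renamed iface from '(ib\d+)' to '([^']+)'")


def summarize_vf_events(log_text):
    """Return (created, renamed): a set of new ibN names and an old->new name map."""
    created = set(NEW_DEVICE.findall(log_text))
    renamed = dict(RENAMED.findall(log_text))
    return created, renamed


if __name__ == "__main__":
    sample = (
        "NetworkManager[2121]: <info> manager: (ib0): new InfiniBand device "
        "(carrier: OFF, driver: 'mlx5_core', ifindex: 1712)\n"
        "NetworkManager[2121]: <info> device (ib0): interface index 1712 "
        "renamed iface from 'ib0' to 'ibp24s0v0'\n"
    )
    created, renamed = summarize_vf_events(sample)
    print(len(created), "created;", len(renamed), "renamed")
```

      Fed the full journal for the 07:59:49 to 07:59:54 window, both counts should equal the policy's numVfs if every VF was created and renamed before the rollback.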

              sscheink@redhat.com Sebastian Scheinkman
              bbenshab@redhat.com Boaz Ben Shabat
              Zhiqiang Fang