- Bug
- Resolution: Not a Bug
- Major
- 4.18
- Quality / Stability / Reliability
- False
- Important
Description of problem:
Customer is reporting that during testing of bonding failover on a Mellanox ConnectX-6 Lx 2x25G SFP28 NIC using 'ip link set down dev' (bonding mode = active-backup), packet loss is observed for ~5–6 seconds when the active interface is brought down.
Customer Environment:
- Hardware: Mellanox ConnectX-6 Lx 2x25G SFP28
- OS: RHEL 9.4 (Kernel v5.14.0-427.72.1.el9_4.x86_64)
- OpenShift: 4.18.17
- Bonding mode: active-backup (mode=1)
- Tools: ip link set down dev, ping
How reproducible (environment matrix):
| NIC Vendor | CPU Platform | Result | Notes |
|---|---|---|---|
| Intel NIC | Intel CPU | ✅ No issue | Failover transparent |
| Intel NIC | AMD CPU | ❓ Not tested | |
| Mellanox NIC | Intel CPU | ❌ Issue occurs (5–7s packet drop) | Always reproducible |
| Mellanox NIC | AMD CPU | ❌ Issue occurs (5–7s packet drop) | Always reproducible |
Steps to Reproduce:
1. Log in to the OCP node (Node A)
2. Verify the Bond configuration and the Active interface:
cat /proc/net/bonding/bond0 | grep 'Currently Active Slave'
# Example Output: Currently Active Slave: ens14f0np0
3. Note the name of the currently active interface (e.g., ens14f0np0).
4. From Node A, start a continuous ping to a separate node (Node B). Use an interval of 0.1 seconds for a more granular packet loss measurement:
ping -D -i 0.1 <NODE_B_IP>
5. On the OCP node (Node A), run the following command to shut down the active interface identified in Step 2:
date +'%Y-%m-%d %H:%M:%S.%N'; ip link set down dev ens14f0np0; date +'%Y-%m-%d %H:%M:%S.%N'
6. Record the time delta from the output of the command in Step 5. This measures how long it took for the command to execute.
7. Observe the ping output and note the number of missed pings and the total duration of the interruption.
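To make Step 7 less error-prone, the gap in the ping output can be extracted with a small helper. The following is only a sketch: the function name `ping_gaps` and the default 1-second threshold are assumptions made for illustration, and it expects `ping -D` output (timestamped reply lines) on stdin, for example from a saved log file.

```shell
# ping_gaps [THRESHOLD_SECONDS]
# Hypothetical helper: reads `ping -D` output on stdin and prints one line
# for every pause between consecutive replies longer than the threshold.
ping_gaps() {
    threshold="${1:-1}"   # assumed default of 1 second
    awk -v thr="$threshold" '
        # Reply lines look like:
        # [1755494816.475491] 64 bytes from 172.25.239.62: icmp_seq=5030 ttl=64 time=0.084 ms
        /^\[[0-9]+\.[0-9]+\]/ && /icmp_seq=/ {
            ts = substr($1, 2, length($1) - 2)      # strip the surrounding [ ]
            split(ts, part, ".")                    # seconds / microseconds
            sec = part[1] + 0; usec = part[2] + 0
            seq = -1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^icmp_seq=/) seq = substr($i, 10) + 0
            if (have_prev) {
                # Integer microsecond arithmetic avoids floating-point rounding.
                delta = (sec - prev_sec) * 1000000 + (usec - prev_usec)
                if (delta > thr * 1000000) {
                    lost = (seq > prev_seq + 1) ? sprintf("%d-%d", prev_seq + 1, seq - 1) : "none"
                    printf "gap=%d.%06ds lost_seq=%s\n", int(delta / 1000000), delta % 1000000, lost
                }
            }
            prev_sec = sec; prev_usec = usec; prev_seq = seq; have_prev = 1
        }
    '
}
```

If the ping output was saved to a file (the name `ping.log` is hypothetical), `ping_gaps 1 < ping.log` lists each interruption longer than one second together with the range of missing sequence numbers.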
Actual results: The entire failover process, from command execution to bond state update, took approximately 4-5 seconds. During this window, continuous ping monitoring recorded complete packet loss.
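The window from command execution to bond state update could also be timed directly rather than read off by hand. The sketch below is an assumption-laden illustration, not the procedure the customer used: the function names `elapsed_ms` and `time_failover` are hypothetical, it relies on GNU `date` (`%N`) and fractional `sleep`, it must run as root on the node, and it polls the bonding driver's sysfs attribute `/sys/class/net/<bond>/bonding/active_slave`.

```shell
# elapsed_ms START_NS END_NS
# Print the difference between two nanosecond timestamps in whole milliseconds.
elapsed_ms() {
    echo $(( ($2 - $1) / 1000000 ))
}

# time_failover BOND IFACE
# Hypothetical helper: bring IFACE down and report how long the bond takes
# to stop listing it as the active interface. Requires root.
time_failover() {
    bond="$1"
    iface="$2"
    state="/sys/class/net/$bond/bonding/active_slave"

    start="$(date +%s%N)"
    ip link set down dev "$iface"
    # Poll until the bond reports a different (or no) active interface.
    while [ "$(cat "$state")" = "$iface" ]; do
        sleep 0.01
    done
    end="$(date +%s%N)"

    echo "active interface changed after $(elapsed_ms "$start" "$end") ms"
}
```

With the names from the steps above, the call would be `time_failover bond0 ens14f0np0`. The 10 ms polling interval bounds the resolution of the result, so this is only suitable for delays on the order of seconds, as reported here.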
Expected results: Failover in active-backup bonding mode should be transparent with no or minimal packet loss.
Additional info:
Key Customer Findings:
- The issue is reproducible on multiple platforms, but only when using Mellanox NICs.
- The delay only occurs when the link is brought down via a software command (ip link set down); a physical cable disconnect works instantly.
- The customer reports that they first started seeing this behavior in OCP 4.16.
- The issue is switch-independent: Samsung's testing confirms that various internal teams have reproduced the behavior across different switches from multiple vendors.
Customer test results: For detailed information, refer to the Support Case.
[root@worker02 ~]# ping -D -i 0.1 172.25.239.62
[1755494816.475491] 64 bytes from 172.25.239.62: icmp_seq=5030 ttl=64 time=0.084 ms
[1755494816.579496] 64 bytes from 172.25.239.62: icmp_seq=5031 ttl=64 time=0.089 ms
[1755494816.683496] 64 bytes from 172.25.239.62: icmp_seq=5032 ttl=64 time=0.089 ms
[1755494816.787515] 64 bytes from 172.25.239.62: icmp_seq=5033 ttl=64 time=0.098 ms
[1755494822.403554] 64 bytes from 172.25.239.62: icmp_seq=5087 ttl=64 time=0.132 ms
# Packet dropped (seq 5034 to 5086). Time difference = 5.616039 s
[1755494822.507515] 64 bytes from 172.25.239.62: icmp_seq=5088 ttl=64 time=0.104 ms
[1755494822.611501] 64 bytes from 172.25.239.62: icmp_seq=5089 ttl=64 time=0.091 ms
[1755494822.715512] 64 bytes from 172.25.239.62: icmp_seq=5090 ttl=64 time=0.101 ms
[1755494822.819497] 64 bytes from 172.25.239.62: icmp_seq=5091 ttl=64 time=0.089 ms
[1755494822.923498] 64 bytes from 172.25.239.62: icmp_seq=5092 ttl=64 time=0.089 ms
In the first comment, I'll share my observations from the reproduction scenario.