Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: 4.18
Component/s: Networking / ovn-kubernetes
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The customer reports packet loss lasting ~5–6 seconds when testing bond failover (bonding mode = active-backup) on a Mellanox ConnectX-6 Lx 2x25G SFP28 NIC. The loss occurs when the active interface is brought down with 'ip link set down dev <interface>'.

Customer Environment:

Hardware: Mellanox ConnectX-6 Lx 2x25G SFP28
OS: RHEL 9.4 (Kernel v5.14.0-427.72.1.el9_4.x86_64)
OpenShift: 4.18.17
Bonding mode: active-backup (mode=1)
Tools: ip link set down dev, ping

How reproducible (environment matrix):

NIC Vendor	CPU Platform	Result	Notes
Intel NIC	Intel CPU	✅ No issue	Failover transparent
Intel NIC	AMD CPU	❓ Not tested
Mellanox NIC	Intel CPU	❌ Issue occurs (5–7s packet drop)	Always reproducible
Mellanox NIC	AMD CPU	❌ Issue occurs (5–7s packet drop)	Always reproducible

Steps to Reproduce:

1. Log in to the OCP node (Node A)

2. Verify the bond configuration and identify the active interface:

grep 'Currently Active Slave' /proc/net/bonding/bond0
# Example Output: Currently Active Slave: ens14f0np0

3. Note the name of the currently active interface (e.g., ens14f0np0).

4. From Node A, start a continuous ping to a separate node (Node B). Use an interval of 0.1 seconds for a more granular packet loss measurement:

ping -D -i 0.1 <NODE_B_IP>

5. On the OCP node (Node A), run the following command to shut down the currently active interface identified in Step 2:

date +'%Y-%m-%d %H:%M:%S.%N'; ip link set down dev ens14f0np0; date +'%Y-%m-%d %H:%M:%S.%N'

6. Record the time delta from the output of the command in Step 5. This measures how long it took for the command to execute.

7. Observe the ping output and note the number of missed pings and the total duration of the interruption.
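Steps 2, 5 and 6 can be scripted so that the interface name and the command duration are captured without manual arithmetic. The following is a minimal sketch, not part of the customer's procedure: the function names `active_slave` and `elapsed` are hypothetical helpers, and it uses `date +%s.%N` (epoch seconds) instead of the formatted timestamp in Step 5 so the two values can be subtracted directly.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative and not part of any tool.

# Print the active interface named in a bonding status file.
# Defaults to /proc/net/bonding/bond0, the file read in Step 2.
active_slave() {
    awk -F': ' '/^Currently Active Slave/ { print $2 }' "${1:-/proc/net/bonding/bond0}"
}

# Print END - START in seconds (millisecond precision).
# Both arguments are epoch timestamps as printed by `date +%s.%N`.
elapsed() {
    awk -v s="$1" -v e="$2" 'BEGIN { printf "%.3f\n", e - s }'
}

# Usage on Node A (requires root and a bond named bond0; not executed here):
#   iface=$(active_slave)
#   t0=$(date +%s.%N); ip link set down dev "$iface"; t1=$(date +%s.%N)
#   echo "ip link set down on $iface took $(elapsed "$t0" "$t1") s"
```

This only measures how long the `ip link set down` command itself takes to return; the traffic interruption still has to be read from the ping output in Step 7.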

Actual results: The entire failover process, from command execution to bond state update, took approximately 4-5 seconds. During this window, continuous ping monitoring recorded complete packet loss.

Expected results: Failover in active-backup bonding mode should be transparent, with little or no packet loss.

Additional info:

Key Customer Findings:

The issue is reproducible on multiple platforms, but only when using Mellanox NICs.
The delay only occurs when the link is brought down via a software command (ip link set down); a physical cable disconnect works instantly.
The customer reports that they first started seeing this behavior in OCP 4.16.
The issue is switch-independent: Samsung's testing confirms that various internal teams have reproduced the behavior across different switches from multiple vendors.
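To compare the software link-down case with a cable pull, it may help to timestamp when the bond itself registers the slave as down. The sketch below assumes the usual `/proc/net/bonding/<bond>` layout, where each `Slave Interface:` line is followed by that slave's `MII Status:` line; the function name `slave_states` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: prints "<slave> <mii-status>" for each slave in a bonding status file.
# The bond-level "MII Status" line appears before any "Slave Interface" line,
# so it is skipped because no slave name has been seen yet.
slave_states() {
    awk -F': ' '
        /^Slave Interface/           { slave = $2 }
        /^MII Status/ && slave != "" { print slave, $2; slave = "" }
    ' "${1:-/proc/net/bonding/bond0}"
}

# Usage on Node A (not executed here): sample the state every 100 ms
# while running the ip link set down test in another terminal.
#   while sleep 0.1; do
#       printf '%s %s\n' "$(date +%s.%N)" "$(slave_states | tr '\n' ' ')"
#   done
```

Lining these samples up against the `ping -D` timestamps would show whether the packet loss ends when the bond marks the slave down or at some later point; the sketch itself makes no claim about which.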

Customer test results: For detailed information, refer to the Support Case.

[root@worker02 ~]# ping -D -i 0.1 172.25.239.62
[1755494816.475491] 64 bytes from 172.25.239.62: icmp_seq=5030 ttl=64 time=0.084 ms
[1755494816.579496] 64 bytes from 172.25.239.62: icmp_seq=5031 ttl=64 time=0.089 ms
[1755494816.683496] 64 bytes from 172.25.239.62: icmp_seq=5032 ttl=64 time=0.089 ms
[1755494816.787515] 64 bytes from 172.25.239.62: icmp_seq=5033 ttl=64 time=0.098 ms  

[1755494822.403554] 64 bytes from 172.25.239.62: icmp_seq=5087 ttl=64 time=0.132 ms   # Packets dropped (seq 5034 to 5086). Time difference = 5.616039 s
[1755494822.507515] 64 bytes from 172.25.239.62: icmp_seq=5088 ttl=64 time=0.104 ms
[1755494822.611501] 64 bytes from 172.25.239.62: icmp_seq=5089 ttl=64 time=0.091 ms
[1755494822.715512] 64 bytes from 172.25.239.62: icmp_seq=5090 ttl=64 time=0.101 ms
[1755494822.819497] 64 bytes from 172.25.239.62: icmp_seq=5091 ttl=64 time=0.089 ms
[1755494822.923498] 64 bytes from 172.25.239.62: icmp_seq=5092 ttl=64 time=0.089 ms
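The gap annotated in the log above can be derived mechanically from `ping -D` output by watching for jumps in `icmp_seq` and subtracting the bracketed timestamps. This is a minimal sketch; the function name `ping_gaps` is hypothetical, and it assumes the `[epoch] ... icmp_seq=N ...` line format shown in the customer's output.

```shell
#!/usr/bin/env bash
# Sketch only: read `ping -D` output on stdin and report every gap in icmp_seq.
ping_gaps() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # "icmp_seq=" is 9 characters; the digits follow it.
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            # First field is "[<epoch>]"; strip the surrounding brackets.
            ts = substr($1, 2, length($1) - 2) + 0
            if (seen && seq > prev_seq + 1) {
                printf "lost seq %d-%d (%d packets), gap %.6f s\n",
                       prev_seq + 1, seq - 1, seq - prev_seq - 1, ts - prev_ts
            }
            prev_seq = seq; prev_ts = ts; seen = 1
        }
    '
}

# Usage (not executed here):
#   ping -D -i 0.1 <NODE_B_IP> | ping_gaps
```

The reported gap is the time between the last reply before the loss and the first reply after it, so it overstates the outage by up to one ping interval on each side.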

In the first comment, I'll share my observations from the reproduction scenario.

Assignee: Ben Bennett

Reporter: Stephanie Sierra

Need Info From: None

Contributors: None

QA Contact: Anurag Saxena

Doc Contact: None

Votes: 0

Watchers: 4

Created: 2025/10/31 9:47 PM

Updated: 2025/11/03 5:24 PM

Resolved: 2025/11/03 3:09 PM
