Fast Datapath Product / FDP-1550

OVS performance degradation when deleting and creating a large number of Pods

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • rhel-9
    • openvswitch3.5
    • Given openvswitch-3.4 on a 3-node worker pool,
      When 360 pods (each with OVN veth + macvlan + VF) are deleted within 30 s and the same number are recreated immediately,
      Then ovs-vswitchd does not stall and the worker nodes remain Ready for the duration of the test.
    • rhel-9
    • rhel-net-ovs-dpdk
    • OVS/DPDK - Sprint 10 - East

       Problem Description: Clearly explain the issue.

      OVS performance degradation when deleting and creating a large number of Pods.

      ZTE (a telco customer) has a use case in which one of their telco applications needs to switch over between primary and backup. This application is restricted to run on only 3 particular worker nodes. The problem arises when they delete 360 Pods across these 3 worker nodes.

       Impact Assessment: Describe the severity and impact (e.g., network down, availability of a workaround, etc.).

      The impact includes, but is not limited to:

      1. Worker nodes became NotReady.
      2. Pod creation failed because ovs-vsctl timed out and was killed by SIGALRM (see the responsiveness-probe sketch after the log excerpts below):

      error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[cs-ci-8/xlbagent-74fc87ddc4-gvpx4 a9c7953f3f0aec22a6e896bd0fb385390c68c05aea203ea982f069ea93c883d6 network default NAD default] [cs-ci-8/xlbagent-74fc87ddc4-gvpx4 a9c7953f3f0aec22a6e896bd0fb385390c68c05aea203ea982f069ea93c883d6 network default NAD default] failed to configure pod interface: failure in plugging pod interface: failed to run 'ovs-vsctl --timeout=30 --may-exist add-port br-int a9c7953f3f0aec2 other_config:transient=true -- set interface a9c7953f3f0aec2 external_ids:attached_mac=0a:58:0a:80:28:e9 external_ids:iface-id=cs-ci-8_xlbagent-74fc87ddc4-gvpx4 external_ids:iface-id-ver=33ee516d-91a2-4b58-8fee-129d6df85029 external_ids:sandbox=a9c7953f3f0aec22a6e896bd0fb385390c68c05aea203ea982f069ea93c883d6 external_ids:ip_addresses=10.128.40.233/23 -- --if-exists remove interface a9c7953f3f0aec2 external_ids k8s.ovn.org/network -- --if-exists remove interface a9c7953f3f0aec2 external_ids k8s.ovn.org/nad': signal: alarm clock
        "2025-06-28T02:34:06Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n"

      3. unregister_netdevice timeouts in dmesg:

       

      [613772.965792] unregister_netdevice: waiting for 6e231484949c493 to become free. Usage count = 2
      [613774.221807] unregister_netdevice: waiting for b67ff05f351e104 to become free. Usage count = 2
      [613776.885834] unregister_netdevice: waiting for cf74709a9016742 to become free. Usage count = 2
      [613776.885836] unregister_netdevice: waiting for 31360d1a7b9ba9e to become free. Usage count = 2
      [613777.709844] unregister_netdevice: waiting for a7d43f24fb32f23 to become free. Usage count = 2
      [613777.709849] unregister_netdevice: waiting for 1005a74bcb081d6 to become free. Usage count = 2
      [613777.957848] unregister_netdevice: waiting for 28f1fd61ac32130 to become free. Usage count = 2
      [613786.533934] unregister_netdevice: waiting for 85f8595937b52de to become free. Usage count = 2
      [613786.533940] unregister_netdevice: waiting for 911fcfdac984572 to become free. Usage count = 2
      [613786.845938] unregister_netdevice: waiting for ca1f43c210fd57c to become free. Usage count = 2
      [613786.845938] unregister_netdevice: waiting for 8a4e2eb781dbe1c to become free. Usage count = 2
      [613786.845939] unregister_netdevice: waiting for b67ff05f351e104 to become free. Usage count = 2
      [613786.845941] unregister_netdevice: waiting for 6e231484949c493 to become free. Usage count = 2
      [613788.341961] unregister_netdevice: waiting for 170be6a90ad092e to become free. Usage count = 2
      

      4. The CephFS client got evicted.
      5. ovs-vswitchd is blocked for a long time:

      2025-06-25T03:29:33.289Z|00434|ovs_rcu(urcu6)|WARN|blocked 256008 ms waiting for main to quiesce
      2025-06-25T03:29:33.351Z|00033|timeval(handler406)|WARN|Unreasonably long 48181ms poll interval (0ms user, 0ms system)
      2025-06-25T03:29:33.351Z|00136|timeval(handler438)|WARN|Unreasonably long 48116ms poll interval (1ms user, 0ms system)
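
      The ovs-vsctl timeout in item 2 and the blocked main loop in item 5 can be probed together from a worker node. As a sketch (my suggestion, not something that was run at the customer site), timing a no-op ovs-appctl call against ovs-vswitchd and a read-only ovsdb query in a loop during the churn shows whether it is the vswitchd main loop or ovsdb-server that stops responding:

      # Hypothetical responsiveness probe (not from the case data); run on a worker
      # while the Pods are being deleted and recreated.
      while true; do
          ts=$(date -u +%FT%TZ)

          # ovs-appctl is answered by the ovs-vswitchd main loop, so it hangs if main is blocked.
          t0=$(date +%s%N)
          timeout 10 ovs-appctl version >/dev/null 2>&1; appctl_rc=$?
          t1=$(date +%s%N)

          # A read-only ovs-vsctl command only needs ovsdb-server.
          t2=$(date +%s%N)
          timeout 10 ovs-vsctl list-br >/dev/null 2>&1; vsctl_rc=$?
          t3=$(date +%s%N)

          echo "$ts appctl_rc=$appctl_rc appctl_ms=$(( (t1 - t0) / 1000000 ))" \
               "vsctl_rc=$vsctl_rc vsctl_ms=$(( (t3 - t2) / 1000000 ))"
          sleep 5
      done | tee ovs-responsiveness.log

      If ovs-appctl starts hitting its timeout while list-br stays fast, that points at the blocked ovs-vswitchd main loop (matching the "waiting for main to quiesce" warnings) rather than at ovsdb-server.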

       Software Versions: Specify the exact versions in use (e.g., openvswitch3.1-3.1.0-147.el8fdp).

      openvswitch3.4-3.4.0-48.el9fdp.x86_64
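
      If it helps, the exact package set could be recorded from each of the three workers with a loop like the one below; the node names and the core login are placeholders for a typical RHCOS setup, not values from this case.

      # Placeholder node names; adjust to the three workers in question.
      for node in worker1 worker2 worker3; do
          echo "== $node =="
          ssh core@"$node" "rpm -qa 'openvswitch*' 'ovn*' kernel; uname -r"
      done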

        Issue Type: Indicate whether this is a new issue or a regression (if a regression, state the last known working version).

      New Issue

       Reproducibility: Confirm if the issue can be reproduced consistently. If not, describe how often it occurs.

      This issue can be reproduced at the customer's site if they delete a large number of Pods at the same time, each Pod having an OVN interface, a macvlan interface, and an SR-IOV interface.

       Reproduction Steps: Provide detailed steps or scripts to replicate the issue.

      In my local lab, when I delete 170 Pods, each with 1 OVN interface and 1 macvlan interface, I only see ovs-vswitchd logging "Unreasonably long XXX ms poll interval" and "blocked XXX ms waiting for main to quiesce" warnings. I do not see 200+ second blocks or 40+ second poll intervals as the customer does, and I do not see br-ex affected at all.
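
      For reference, a rough outline of how that local-lab reproduction could be scripted is below. The churn-test namespace and Deployment names are placeholders, and the pod template is assumed to already attach a macvlan NetworkAttachmentDefinition on top of the default OVN interface; none of this comes from the customer environment.

      # Hypothetical reproducer outline with placeholder names.
      NS=churn-test
      REPLICAS=170

      # Scale up and wait until every pod is Running.
      kubectl -n "$NS" scale deployment/churn-test --replicas="$REPLICAS"
      kubectl -n "$NS" rollout status deployment/churn-test

      # Delete all pods at once; the Deployment recreates them immediately, which
      # drives the mass veth/macvlan teardown and re-plug on the worker nodes.
      kubectl -n "$NS" delete pods --all --wait=false

      # On a worker, watch ovs-vswitchd for the long-poll / quiesce warnings:
      #   tail -F /var/log/openvswitch/ovs-vswitchd.log | grep -E 'Unreasonably long|quiesce'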

       Expected Behavior: Describe what should happen under normal circumstances.

      The customer wants to identify the bottleneck so it can be addressed; deleting and recreating a large number of Pods should not stall ovs-vswitchd or make the worker nodes NotReady.

       Observed Behavior: Explain what actually happens.

      It appears to me that ovs-vswitchd is blocked in this situation, which even impacts br-ex upcalls: the customer reported that the nodes became NotReady and the CephFS client got evicted, both of which indicate that br-ex was unable to handle network traffic. However, CPU utilization is quite low during the restart. As https://redhat-internal.slack.com/archives/C04L7QWC9CZ/p1751463356352129 pointed out, the RTNL mutex or perhaps the OVS mutex could be the bottleneck in this scenario.
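
      To check the RTNL theory more directly, something like the bpftrace sketch below could be run on an affected worker during the churn. bpftrace is not among the packages installed in the debug container below, and probing rtnl_lock() on the ovs-vswitchd main thread is my assumption about where the contention would show up, not data from this case.

      # Histogram of how long the ovs-vswitchd main thread waits in rtnl_lock().
      # Only the main thread keeps the "ovs-vswitchd" comm; handler/revalidator
      # threads have their own names and are not captured here.
      bpftrace -e '
      kprobe:rtnl_lock /comm == "ovs-vswitchd"/ { @start[tid] = nsecs; }
      kretprobe:rtnl_lock /@start[tid]/ {
          @rtnl_wait_ms = hist((nsecs - @start[tid]) / 1000000);
          delete(@start[tid]);
      }
      interval:s:60 { exit(); }'

      kernel_delay.py (step 1 below) already measures where ovs-vswitchd spends its time in the kernel, so this would only serve as a targeted cross-check of the RTNL hypothesis.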

       Troubleshooting Actions: Outline the steps taken to diagnose or resolve the issue so far.

      I am not sure whether the following data is sufficient. If anything else is needed, please let me know and I will ask the customer to collect it.

       

      $ podman run -it --rm \
         -e PS1='[(DEBUG)\u@\h \W]\$ ' \
         --privileged --network=host --pid=host \
         -v /lib/modules:/lib/modules:ro \
         -v /sys/kernel/debug:/sys/kernel/debug \
         -v /proc:/proc \
         -v /:/mnt/rootdir \
         quay.io/fedora/fedora:38-x86_64
      [(DEBUG)root@worker2 /]# dnf install -y bcc-tools perl-interpreter python3-pytz  python3-psutil bison elfutils-libelf-devel flex gcc make openssl-devel trace-cmd procps-ng perf
      [(DEBUG)root@worker2 /]# rpm -i \
          openvswitch2.17-debuginfo-2.17.0-67.el8fdp.x86_64.rpm \
          openvswitch2.17-debugsource-2.17.0-67.el8fdp.x86_64.rpm \
          kernel-devel-4.18.0-372.41.1.el8_6.x86_64.rpm
      
      1. kernel_delay.py
       
      https://github.com/chaudron/ovs/blob/dev/kernel_delay/utilities/usdt-scripts/kernel_delay.py
      [(DEBUG)root@worker2 /]# ./kernel_delay.py --sample-time 30 --sample-count 10 | tee delay_results.txt
      2. syscall
      [(DEBUG)root@worker2 /]# for ((;;)) do date; \
        find /proc/$(pidof -s ovs-vswitchd)/task -maxdepth 1 \
          -type d \( ! -name . \) -exec \
        bash -c "echo '{}' && cat '{}'/comm && cat '{}'/stack" \; ; \
        sleep 30; \
      done | tee syscall_log
      3. trace-cmd
      [(DEBUG)root@worker2 /]# trace-cmd record -p function_graph -g __sys_sendmsg -P $(pidof ovs-vswitchd) -c
      4. perf
      [(DEBUG)root@worker2 /]# perf sched record -a --output perf-sched.data -- sleep 120
      5. tar czvf ovs-debug.tar.gz trace.dat perf-sched.data delay_results.txt syscall_log
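
      In addition to the traces above, it may be worth snapshotting ovs-vswitchd's own counters while the churn runs, so the long poll intervals can be correlated with what the daemon was doing. This is a suggestion on my side, not data that has already been collected.

      # Periodic ovs-vswitchd counter snapshots; the timeout guards against the
      # control socket itself hanging while the main loop is blocked.
      while true; do
          date -u +%FT%TZ
          timeout 20 ovs-appctl coverage/show
          timeout 20 ovs-appctl upcall/show
          echo '---'
          sleep 30
      done | tee ovs-appctl-snapshots.log
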
       Logs: If you collected logs, please provide them (e.g. sos report, /var/log/openvswitch/*, testpmd console)
      

       

              amorenoz@redhat.com Adrian Moreno
              rhn-support-cchen Chen Chen
              Stanislas Faye