Bug
Resolution: Done
Critical
rhel-8
None
13
False
False
openvswitch3.1-3.1.0-147.el8fdp
rhel-sst-network-fastdatapath
ssg_networking
Unfortunately, multiple versions of Libreswan from 4.6 to 4.15 (we did not test newer releases, but it seems that all versions after 4.5 have the same bug) can hang at scale while running ipsec auto --start. It looks like this:
root       149  0.0  0.0  30660 23604 ?  S  16:06  0:00 /usr/libexec/platform-python /usr/share/openvswitch/scripts/ovs-monitor-ipsec --pidfile=/var/run/openvswitch/ovs-monitor-ipsec.pid --ike-daemon=libreswan --no-restart-ike-daemon --ipsec-d /var/lib/ipsec/
root       159  0.0  0.0  33176 19732 ?  S  16:06  0:01 /usr/libexec/platform-python /usr/share/openvswitch/scripts/ovs-monitor-ipsec --pidfile=/var/run/openvswitch/ovs-monitor-ipsec.pid --ike-daemon=libreswan --no-restart-ike-daemon --ipsec-d /var/lib/ipsec/
root     14847  0.0  0.0   4528  3468 ?  S  16:28  0:00 /usr/bin/sh /usr/libexec/ipsec/auto --config /etc/ipsec.conf --ctlsocket /run/pluto/pluto.ctl --start --asynchronous ovn-a953f7-0-out-1
root     14865  0.0  0.0   2772  1132 ?  S  16:28  0:00 /usr/libexec/ipsec/whack --ctlsocket /run/pluto/pluto.ctl --asynchronous --name ovn-a953f7-0-out-1 --initiate
Basically:
- ovs-monitor-ipsec calls ipsec auto --start --asynchronous for a new connection.
- ipsec auto calls ipsec whack --initiate.
- ipsec whack connects to pluto and asks it to initiate the connection.
- pluto never replies.
- ipsec whack waits forever.
- Every other process up the chain waits with it.
This task is to make ovs-monitor-ipsec more resilient by killing the ipsec auto process if it stays stuck for too long. This way the monitor will be able to proceed and start other connections.
In practice, it seems that pluto does actually initiate the connection but just doesn't reply for some reason (Libreswan folks are investigating), so simply killing the process may be fine. However, reconciliation logic is still required to verify that the connection was actually loaded in the end. A rough sketch of this approach follows below.
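A minimal sketch of the timeout-and-kill idea (not the actual ovs-monitor-ipsec patch), assuming Python's subprocess module; the 30-second timeout value, the function names, and the "ipsec status"-based reconciliation check are illustrative assumptions:

import subprocess

IPSEC_AUTO_TIMEOUT = 30  # seconds; hypothetical value

def start_connection(conn_name):
    # Ask pluto to start a connection the same way ovs-monitor-ipsec does,
    # but bound the call with a timeout so a hung "ipsec auto" cannot block
    # the monitor from starting other connections.
    cmd = ["ipsec", "auto", "--config", "/etc/ipsec.conf",
           "--start", "--asynchronous", conn_name]
    try:
        subprocess.run(cmd, timeout=IPSEC_AUTO_TIMEOUT, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the timed-out child; the connection may still
        # have been initiated by pluto, so leave verification to reconciliation.
        print("ipsec auto --start timed out for %s, will reconcile later" % conn_name)
    except subprocess.CalledProcessError as err:
        print("ipsec auto --start failed for %s: %s" % (conn_name, err))

def connection_is_loaded(conn_name):
    # Hypothetical reconciliation check: ask pluto which connections it knows
    # about and look for ours in the output.
    out = subprocess.run(["ipsec", "status"], stdout=subprocess.PIPE,
                         universal_newlines=True, timeout=IPSEC_AUTO_TIMEOUT)
    return conn_name in out.stdout

The key point is that the kill only unblocks the monitor; a later reconciliation pass still has to confirm whether pluto loaded and initiated the connection, and retry if it did not.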
The Libreswan issue appears mostly at scale. We can reproduce it by deploying a 500-node OpenShift cluster with IPsec (the issue can sometimes be reproduced at 120-node scale, but it doesn't always appear; 500 nodes is much more reliable). Smaller reproducers are not known at this time.
- is related to: OCPBUGS-41551 Nodes to Node and subsequently pod to pod communication are repeatedly degrading despite multiple OVN DB rebuilds to fix the issue (Closed)
- links to: RHBA-2024:140512 openvswitch3.1 bug fix and enhancement update