Type: Sub-task
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: None
Affects Version/s: rhel-9
Component/s: openvswitch3.3
Labels:
None

Story Points:
0
Blocked:
False
Blocked Reason:
None
Ready:
False
Fixed in Build:
openvswitch3.3-3.3.0-58.el9fdp
AssignedTeam:
rhel-net-ovs-dpdk
Intelligence Requested:
Market:
Sub-System Group:

ssg_networking

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Unfortunately, multiple versions of Libreswan, from 4.6 to 4.15, can hang at scale while running ipsec auto --start (we did not test newer releases, but all versions after 4.5 appear to have the same bug). The stuck processes look like this:

root         149  0.0  0.0  30660 23604 ?        S    16:06   0:00 /usr/libexec/platform-python /usr/share/openvswitch/scripts/ovs-monitor-ipsec --pidfile=/var/run/openvswitch/ovs-monitor-ipsec.pid --ike-daemon=libreswan --no-restart-ike-daemon --ipsec-d /var/lib/ipsec/
root         159  0.0  0.0  33176 19732 ?        S    16:06   0:01 /usr/libexec/platform-python /usr/share/openvswitch/scripts/ovs-monitor-ipsec --pidfile=/var/run/openvswitch/ovs-monitor-ipsec.pid --ike-daemon=libreswan --no-restart-ike-daemon --ipsec-d /var/lib/ipsec/
root       14847  0.0  0.0   4528  3468 ?        S    16:28   0:00 /usr/bin/sh /usr/libexec/ipsec/auto --config /etc/ipsec.conf --ctlsocket /run/pluto/pluto.ctl --start --asynchronous ovn-a953f7-0-out-1
root       14865  0.0  0.0   2772  1132 ?        S    16:28   0:00 /usr/libexec/ipsec/whack --ctlsocket /run/pluto/pluto.ctl --asynchronous --name ovn-a953f7-0-out-1 --initiate

Basically:

ovs-monitor-ipsec calls ipsec auto --start --asynchronous for a new connection.
ipsec auto calls ipsec whack --initiate.
ipsec whack connects to pluto and asks it to initiate the connection.
pluto never replies.
ipsec whack waits forever.
Every other process up the chain waits with it.

This task is to make ovs-monitor-ipsec more resilient by killing the ipsec auto process if it gets stuck for a long time. This way the monitor will be able to proceed and start other connections.
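The idea can be sketched roughly as follows. This is not the actual ovs-monitor-ipsec change, only a minimal illustration: the helper name run_with_timeout and the timeout value are hypothetical. Because ipsec auto is a shell script that spawns ipsec whack, the sketch starts the command in its own process group and kills the whole group, so that the stuck whack child does not outlive its parent.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; kill its whole process group if it exceeds timeout_s.

    Hypothetical helper, not taken from ovs-monitor-ipsec.
    Returns (returncode, stdout, stderr, timed_out).
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        # New session => new process group whose id equals proc.pid,
        # so children of the command can be killed together with it.
        start_new_session=True,
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # The group exited between the timeout and the kill.
        # Reap the process and collect whatever output it produced.
        out, err = proc.communicate()
        return proc.returncode, out, err, True
```

With such a helper, the monitor could call something like run_with_timeout(["ipsec", "auto", "--start", "--asynchronous", name], 120), log the timeout, and move on to the next connection. The 120 second value is an arbitrary placeholder.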

In practice, it seems that pluto does actually initiate the connection but just doesn't reply for some reason (Libreswan folks are investigating), so just killing the process may be fine. However, reconciliation logic is still required to verify that the connection was actually loaded in the end and to retry it if it was not.
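A reconciliation pass could be as simple as comparing the set of connections the monitor wants with the set pluto reports as loaded. The sketch below is an assumption about the general shape, not the implemented logic: the function name reconcile is hypothetical, and how the loaded set is obtained (for example by querying pluto) is deliberately left out.

```python
def reconcile(desired, loaded):
    """Compare desired connection names against those actually loaded.

    Hypothetical helper. Returns (missing, stale):
      missing - desired connections that are not loaded and need a retry
      stale   - loaded connections that are no longer desired
    """
    desired, loaded = set(desired), set(loaded)
    missing = sorted(desired - loaded)
    stale = sorted(loaded - desired)
    return missing, stale
```

After a start attempt is killed on timeout, the connection name would show up in missing only if pluto did not in fact load it, so the monitor retries only where needed.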

The Libreswan issue appears mostly at scale. We can reproduce it by deploying a 500-node OpenShift cluster with IPsec. The issue can sometimes be reproduced at 120-node scale, but not consistently; 500 nodes is much more reliable. Smaller reproducers are not known at this time.

links to

RHBA-2024:140515 openvswitch3.3 bug fix and enhancement update

Assignee: Ilya Maximets

Reporter: Ilya Maximets

QA Contact: Qijun Ding

Votes: 0

Watchers: 4

Created: 2024/11/04 4:11 PM

Updated: 2024/12/10 12:11 AM

Resolved: 2024/12/10 12:11 AM
