Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: None
Affects Version/s: rhel-8
Component/s: openvswitch3.1
Labels:
None

Story Points:
13
Blocked:
False
Blocked Reason:

None
Ready:
False
Fixed in Build:
openvswitch3.1-3.1.0-147.el8fdp
Pool Team:

rhel-sst-network-fastdatapath
Intelligence Requested:
Market:
Sub-System Group:

ssg_networking

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Unfortunately, multiple versions of Libreswan, from 4.6 to 4.15, can hang at scale while running ipsec auto --start. We did not test newer releases, but all versions after 4.5 appear to have the same bug. The process list on an affected node looks like this:

root         149  0.0  0.0  30660 23604 ?        S    16:06   0:00 /usr/libexec/platform-python /usr/share/openvswitch/scripts/ovs-monitor-ipsec --pidfile=/var/run/openvswitch/ovs-monitor-ipsec.pid --ike-daemon=libreswan --no-restart-ike-daemon --ipsec-d /var/lib/ipsec/
root         159  0.0  0.0  33176 19732 ?        S    16:06   0:01 /usr/libexec/platform-python /usr/share/openvswitch/scripts/ovs-monitor-ipsec --pidfile=/var/run/openvswitch/ovs-monitor-ipsec.pid --ike-daemon=libreswan --no-restart-ike-daemon --ipsec-d /var/lib/ipsec/
root       14847  0.0  0.0   4528  3468 ?        S    16:28   0:00 /usr/bin/sh /usr/libexec/ipsec/auto --config /etc/ipsec.conf --ctlsocket /run/pluto/pluto.ctl --start --asynchronous ovn-a953f7-0-out-1
root       14865  0.0  0.0   2772  1132 ?        S    16:28   0:00 /usr/libexec/ipsec/whack --ctlsocket /run/pluto/pluto.ctl --asynchronous --name ovn-a953f7-0-out-1 --initiate

Basically:

ovs-monitor-ipsec calls ipsec auto --start --asynchronous for a new connection.
ipsec auto calls ipsec whack --initiate.
ipsec whack connects to pluto and asks it to initiate the connection.
pluto never replies.
ipsec whack waits forever.
Every other process up the chain waits with it.

This task is to make ovs-monitor-ipsec more resilient by killing the ipsec auto process if it gets stuck for a long time. This way the monitor will be able to proceed and start other connections.
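The kill-on-timeout idea can be sketched as follows. This is a minimal illustration, not the change that actually went into ovs-monitor-ipsec: the helper name run_with_timeout and the timeout value are assumptions made for the example. Since ipsec auto is a shell script that spawns ipsec whack, the sketch kills the whole process group rather than only the direct child.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout):
    """Run cmd; kill it if it does not finish within `timeout` seconds.

    Hypothetical helper for illustration only.
    Returns (returncode, stdout, stderr, timed_out).
    """
    # start_new_session=True puts the child in its own process group, so
    # anything it spawns (e.g. whack started by the 'ipsec auto' script)
    # can be killed together with it.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, start_new_session=True)
    try:
        out, err = proc.communicate(timeout=timeout)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        try:
            # Kill the whole group: killing only the shell would leave
            # whack alive and still holding the output pipes open.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # The process exited just as the timeout fired.
        out, err = proc.communicate()
        return proc.returncode, out, err, True
```

A caller could then invoke something like run_with_timeout(["ipsec", "auto", "--start", "--asynchronous", name], timeout=120), where 120 seconds is an arbitrary placeholder, and move on to the next connection when timed_out is true.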

In practice, it seems that pluto does actually initiate the connection but just doesn't reply for some reason (Libreswan folks are investigating), so just killing the process may be fine. However, reconciliation logic is still required to check that the connection was actually loaded in the end.
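The reconciliation check could look roughly like this. It is only a sketch: the function name connections_to_retry is hypothetical, and how the set of loaded connections is obtained from pluto is deliberately left to the caller, since the issue does not describe it.

```python
def connections_to_retry(desired, loaded):
    """Return the desired connection names that are not loaded yet.

    Hypothetical helper: `desired` is what the monitor wants to have,
    `loaded` is whatever the caller found to be loaded in pluto.
    """
    return sorted(set(desired) - set(loaded))


# 'ovn-a953f7-0-out-1' is taken from the process list above;
# 'ovn-a953f7-0-in-1' is a made-up name for the example.
desired = ["ovn-a953f7-0-out-1", "ovn-a953f7-0-in-1"]
loaded = ["ovn-a953f7-0-in-1"]
print(connections_to_retry(desired, loaded))
```

Running the check after a killed ipsec auto call lets the monitor retry only the connections that really failed to load, instead of assuming either outcome.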

The Libreswan issue appears mostly at scale. We can reproduce it by deploying a 500-node OpenShift cluster with IPsec. The issue can sometimes be reproduced at 120-node scale, but it doesn't always appear there; 500 nodes is much more reliable. Smaller reproducers are not known at this time.

is related to

OCPBUGS-41551 Nodes to Node and subsequently pod to pod communication are repeatedly degrading despite multiple OVN DB rebuilds to fix the issue

Closed

links to

openshift/ovn-kubernetes#2387: OCPBUGS-45951: bump OVS version to 3.4.0-18

RHBA-2024:140512 openvswitch3.1 bug fix and enhancement update

1.	ovs-monitor-ipsec can't proceed if 'ipsec auto' process is stuck	Closed	Ilya Maximets
2.	ovs-monitor-ipsec can't proceed if 'ipsec auto' process is stuck	Closed	Ilya Maximets
3.	ovs-monitor-ipsec can't proceed if 'ipsec auto' process is stuck	Closed	Ilya Maximets
4.	[RHEL-9 OVS-3.2] ovs-monitor-ipsec can't proceed if 'ipsec auto' process is stuck	Verified	Ilya Maximets
5.	[ OVS-3.1] ovs-monitor-ipsec can't proceed if 'ipsec auto' process is stuck	Closed	Ilya Maximets

Assignee: Ilya Maximets

Reporter: Ilya Maximets

QA Contact: Qijun Ding

Votes: 0

Watchers: 5

Created: 2024/10/07 7:36 PM

Updated: 2024/12/10 8:02 AM

Resolved: 2024/12/10 12:10 AM
