- Bug
- Resolution: Unresolved
- Major
- rhel-9.4
- Important
- rhel-sst-security-crypto
- ssg_security
- x86_64
What were you trying to do that didn't work?
Two nodes are configured with IPsec connections in transport mode for UDP port 6081 (Geneve), and all traffic works fine. However, when both nodes are rebooted at almost the same time, both come up together, start the pluto service, parse /etc/ipsec.d/openshift.conf, and try to re-establish the IPsec connections. Establishment keeps failing for more than two minutes, until a script runs ipsec auto --start <conn-name>. It looks like libreswan's IKE SA establishment needs fixing for the case where both peers initiate at the same time. This is a major problem for node reboot scenarios in OCP clusters.
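A script that restarts connections needs the connection names pluto loads from /etc/ipsec.d/openshift.conf. The following is a minimal Python sketch, assuming the usual ipsec.conf layout in which each section begins with conn <name> at the start of a line and its parameters are indented. The function name connection_names is hypothetical; it is not part of libreswan or ovs-monitor-ipsec.

```python
import re
from typing import List

# Hypothetical helper: matches a section header such as "conn <name>" at column 0.
# Indented lines are section parameters and are not matched.
_CONN_RE = re.compile(r"^conn\s+(\S+)")


def connection_names(config_text: str) -> List[str]:
    """Return connection names defined in ipsec.conf-style text, in file order.

    Comment lines and the special "%default" section are skipped.
    """
    names: List[str] = []
    for line in config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented-out line, ignore it
        match = _CONN_RE.match(line)
        if match and match.group(1) != "%default":
            names.append(match.group(1))
    return names
```

For example, reading /etc/ipsec.d/openshift.conf and passing its contents to connection_names would list the connections in the order they appear in the file.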
What is the impact of this issue to you?
OCP east-west traffic is broken for a period of time, which causes major disruptive events and breaks IPsec CI in OpenShift.
Please provide the package NVR for which the bug is seen:
sh-5.1# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="416.94.202412112347-0"
VERSION_ID="4.16"
VARIANT="CoreOS"
VARIANT_ID=coreos
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 416.94.202412112347-0"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.okd.io/latest/welcome/index.html"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.16"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.16"
OPENSHIFT_VERSION="4.16"
RHEL_VERSION=9.4
OSTREE_VERSION="416.94.202412112347-0"
sh-5.1# rpm -q libreswan
libreswan-4.6-3.el9_0.3.x86_64
How reproducible is this bug?:
Always
Steps to reproduce
- Bring up two nodes and configure IPsec connections as shown in the attached logs.
- Ensure IPsec connectivity works between the nodes.
- Reboot both nodes at the same time.
Expected results
The IPsec connections should be established as soon as the pluto service is running on both peers.
Actual results
The connections are not established until the ovs-monitor-ipsec script runs ipsec auto --start <conn-name>: https://github.com/openvswitch/ovs/blob/main/ipsec/ovs-monitor-ipsec.in
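The workaround that currently restores traffic, re-running ipsec auto --start <conn-name>, can be sketched as a retry loop. This is an illustrative Python sketch, not the ovs-monitor-ipsec code. The name start_connections, the attempt count and the delay are hypothetical, and it assumes that ipsec auto --start exits with status 0 once the connection has been brought up. The command runner and sleep function are parameters so the retry logic can be exercised without libreswan installed.

```python
import subprocess
import time
from typing import Callable, Dict, Iterable, Sequence

# Hypothetical workaround sketch: retry "ipsec auto --start <conn>" until it succeeds.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit status (assumed: 0 means success)."""
    return subprocess.run(list(argv), check=False).returncode


def start_connections(
    conns: Iterable[str],
    runner: Runner = run_command,
    attempts: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bool]:
    """Try "ipsec auto --start" for each connection, retrying on failure.

    Returns a mapping of connection name to whether any attempt succeeded.
    """
    results: Dict[str, bool] = {}
    for conn in conns:
        ok = False
        for attempt in range(attempts):
            if runner(["ipsec", "auto", "--start", conn]) == 0:
                ok = True
                break
            if attempt < attempts - 1:
                sleep(delay)  # wait before the next attempt, not after the last one
        results[conn] = ok
    return results
```

For example, start_connections(["<conn-name>"]) would retry one connection up to five times, two seconds apart. Such a loop only masks the problem: the expected behavior is that pluto establishes the IKE SA by itself when both peers start at the same time.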
Logs:
Attached.