-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.18
-
Quality / Stability / Reliability
-
False
Description of problem:
The `ovn-ipsec-host` pod on a node is in a `CrashLoopBackOff` state because `ipsec.service` fails to start due to a corrupted `openshift.conf` file. The corruption could be caused, for example, by a power outage that occurs while the file is being written, resulting in invalid syntax. The `ovn-ipsec` container has an `ExecStartPre` command that checks the config file, which prevents it from starting if the file is broken. The system has no automated self-healing mechanism to detect and repair the corrupted file, so manual intervention is required to delete the file and restart the services. This affects node stability and the security of pod-to-pod communication.
Version-Release number of selected component (if applicable):
This was observed in OpenShift Container Platform 4.18.20, but more OpenShift versions are probably affected.
How reproducible:
Always, once the configuration file is in a corrupted state. The initial corruption can be reproduced by truncating the file.
Steps to Reproduce:
1. Enable IPsec on an OpenShift cluster.
2. Cause corruption of the file `/etc/ipsec.d/openshift.conf` by truncating the file in the middle of a line. For example:
$ cat /etc/ipsec.d/openshift.conf
n-1
    left=<redacted>
    right=<redacted>
    leftid=@<redacted>
    rightid=@<redacted>
    leftcert="ovs_certkey_<redacted>"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
conn <redacted>
    auto=start
    left=<redacted>
    right=<redacted>
    leftid=@<redacted>
    rightid=@<redacted>
    leftcert="ovs_certkey_<redacted>"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
[...]
3. Reboot the node.
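The corruption in step 2 can be scripted for a disposable test node. The following is only a sketch: `corrupt_conf` is a hypothetical helper name, and dropping leading bytes is an assumption chosen because the sample file above begins partway through a `conn` line. It saves the original as `<file>.orig` before rewriting it.

```shell
#!/bin/sh
# Hypothetical helper, intended for a throwaway test node only.
# Drops the first N bytes of a config file so that it begins mid-line,
# similar to the corrupted sample shown in step 2.
corrupt_conf() {
    conf="$1"
    skip="${2:-5}"                  # number of leading bytes to drop (assumed default)
    cp -p "$conf" "$conf.orig"      # keep the original for later restore
    # tail -c +K prints from byte K onward, so K = skip + 1
    tail -c "+$((skip + 1))" "$conf.orig" > "$conf"
}
```

Run as root on the node, for example `corrupt_conf /etc/ipsec.d/openshift.conf 5`, then continue with step 3.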
Actual results:
* The `ovn-ipsec-host` pod on that node fails to start and enters a `CrashLoopBackOff` state.
* The `ipsec.service` on that node fails to start with a syntax error in `/etc/ipsec.d/openshift.conf`.
* The `ovn-ipsec` container's pre-start check fails, preventing it from starting.
* Manual intervention is required to delete the corrupted configuration file and restart the services.
Expected results:
* The `ovn-ipsec-host` pod and `ipsec.service` should be resilient to file corruption.
* In the event of a corrupted `openshift.conf` file, the `ovn-ipsec-host` pod should have a self-healing mechanism to detect the invalid syntax, delete the broken file, and regenerate a new, valid one.
* The `clusteroperator/network` should not report a degraded status.
Additional info:
The issue was identified from a customer support case.
The issue was seen in the `ipsec.service` logs and the `ovn-ipsec-host` pod logs. The `ipsec.service` fails to start because of a syntax error in the `/etc/ipsec.d/openshift.conf` file. The file corruption caused by a power outage prevents the `ovs-monitor-ipsec` daemon from running correctly, which in turn prevents the `ovn-ipsec-host` pod from starting. The `ovn-ipsec-host` pod is configured to check the validity of the IPsec configuration as a part of its startup process, creating a loop where it cannot start because the config is broken, and the config cannot be fixed because the pod that fixes it cannot start.
Here are the relevant log snippets from the case:
`systemctl` status for ipsec.service:
$ systemctl status ipsec.service
● ipsec.service - Internet Key Exchange (IKE) Protocol Daemon for IPsec
     Loaded: loaded (/usr/lib/systemd/system/ipsec.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/ipsec.service.d
             └─01-after-configure-ovs.conf
     Active: deactivating (stop-post) (Result: exit-code)
       Docs: man:ipsec(8)
             man:pluto(8)
             man:ipsec.conf(5)
    Process: 248754 ExecStartPre=/usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig (code=exited, status=3)
    Process: 248913 ExecStopPost=/bin/bash -c if test "$EXIT_STATUS" != "12"; then /sbin/ip xfrm policy flush; /sbin/ip xfrm state flush; fi (code=exited, status=0/SUCCESS)
  Cntrl PID: 248935 (ipsec)
      Tasks: 6 (limit: 816509)
     Memory: 2.7M
        CPU: 10.611s
     CGroup: /system.slice/ipsec.service
             ├─248935 /usr/bin/sh /usr/sbin/ipsec --stopnflog
             ├─248936 /usr/bin/sh /usr/sbin/ipsec --stopnflog
             ├─248937 /usr/libexec/ipsec/addconn --ctlsocket /run/pluto/pluto.ctl --configsetup
             ├─248938 grep -v #
             ├─248939 grep nflog
             └─248940 sed -e "s/^.*=//" -e "s/'//g"

Aug 12 15:11:19 <node name> systemd[1]: Starting Internet Key Exchange (IKE) Protocol Daemon for IPsec...
Aug 12 15:11:28 <node name> addconn[248754]: cannot load config '/etc/ipsec.conf': /etc/ipsec.d/openshift.conf:1: syntax error []
Aug 12 15:11:28 <node name> systemd[1]: ipsec.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED
A potential solution, as suggested by Engineering, would be to implement a check in the `ExecStartPre` script to handle this scenario.
For example, the start logic could be modified to handle a broken config. The current start logic is:
/usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig
But it could be modified to something like:
if [ -e /etc/ipsec.d/openshift.conf ] && ! /usr/libexec/ipsec/addconn --config /etc/ipsec.d/openshift.conf --checkconfig; then
    # The config exists but fails validation: remove it so it can be regenerated.
    echo "openshift.conf is not correct! Removing."
    rm -f /etc/ipsec.d/openshift.conf
fi
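The same idea could be packaged as a small function, sketched below under some assumptions. `heal_ipsec_conf` is a hypothetical name, the optional second argument exists only so the validator can be swapped out for testing, and moving the file to a `.corrupt` copy instead of deleting it is a variation on the proposal above (it keeps the broken file for later analysis). The sketch also assumes, as the description implies, that a valid `openshift.conf` is regenerated once the broken one is out of the way.

```shell
#!/bin/sh
# Sketch of a pre-start self-healing check (not the shipped implementation).
# Usage: heal_ipsec_conf <conf-file> [validator-binary]
heal_ipsec_conf() {
    conf="$1"
    # Validator defaults to the addconn binary from the existing ExecStartPre line.
    checker="${2:-/usr/libexec/ipsec/addconn}"

    # Nothing to heal if the file does not exist.
    [ -e "$conf" ] || return 0

    # Run the command directly; do not wrap it in [ ... ], which would not execute it.
    if "$checker" --config "$conf" --checkconfig >/dev/null 2>&1; then
        return 0
    fi

    echo "$conf failed validation; moving it to $conf.corrupt" >&2
    mv -f "$conf" "$conf.corrupt"
}
```

It would be called before the existing `--checkconfig` step, for example `heal_ipsec_conf /etc/ipsec.d/openshift.conf`.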