
Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.18
Component/s: Networking / ovn-kubernetes
Labels:
- SDN:OVNK:IPSEC

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The `ovn-ipsec-host` pod on a node is in a `CrashLoopBackOff` state because `ipsec.service` fails to start due to a corrupted `openshift.conf` file. The corruption can be caused, for example, by a power outage that occurs while the file is being written, leaving it with invalid syntax. The `ovn-ipsec` container has an `ExecStartPre` command that checks the config file and prevents the container from starting if the file is broken. The system has no automated self-healing mechanism to detect and repair the corrupted file, so manual intervention is required to delete the file and restart the services. This affects node stability and the security of pod-to-pod communication.

Version-Release number of selected component (if applicable):

This was observed in OpenShift Container Platform 4.18.20, but other OpenShift versions are probably affected as well.

How reproducible:

Always, once the configuration file is in a corrupted state. The initial corruption can be reproduced by truncating the file.

Steps to Reproduce:

1. Enable IPsec on an OpenShift cluster.
2. Corrupt the file `/etc/ipsec.d/openshift.conf` by truncating it in the middle of a line. For example:

$ cat /etc/ipsec.d/openshift.conf
n-1
    left=<redacted>
    right=<redacted>
    leftid=@<redacted>
    rightid=@<redacted>
    leftcert="ovs_certkey_<redacted>"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp

conn <redacted>
    auto=start
    left=<redacted>
    right=<redacted>
    leftid=@<redacted>
    rightid=@<redacted>
    leftcert="ovs_certkey_<redacted>"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
[...]

3. Reboot the node.
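
Step 2 can be sketched as a small shell helper for a disposable test node. This is only an illustration: `corrupt_conf` and the `.orig` backup suffix are hypothetical names, not part of OpenShift or Libreswan. It assumes the corruption looks like the sample above, where the leading bytes of the first `conn` line are missing, so it drops bytes from the start of the file rather than from the end.

```shell
#!/bin/sh
# Hypothetical helper for reproducing the corruption on a test node.
# Usage: corrupt_conf FILE SKIP
# Keeps a copy of FILE as FILE.orig, then rewrites FILE without its
# first SKIP bytes so that it starts in the middle of a line.
corrupt_conf() {
    conf="$1"
    skip="$2"
    # Preserve the original so the node can be restored afterwards.
    cp -p "$conf" "$conf.orig" || return 1
    # "tail -c +N" prints from byte N onwards, so +$((skip + 1))
    # drops exactly SKIP leading bytes.
    tail -c "+$((skip + 1))" "$conf.orig" > "$conf"
}
```

On a test node this might be invoked as root with `corrupt_conf /etc/ipsec.d/openshift.conf 3` before rebooting. Note that the `.orig` copy stays in `/etc/ipsec.d/` until it is removed.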

Actual results:

* The `ovn-ipsec-host` pod on that node fails to start and enters a `CrashLoopBackOff` state.
* The `ipsec.service` on that node fails to start with a syntax error in `/etc/ipsec.d/openshift.conf`.
* The `ovn-ipsec` container's pre-start check fails, preventing it from starting.
* Manual intervention is required to delete the corrupted configuration file and restart the services.

Expected results:

* The `ovn-ipsec-host` pod and `ipsec.service` should be resilient to file corruption.
* In the event of a corrupted `openshift.conf` file, the `ovn-ipsec-host` pod should have a self-healing mechanism to detect the invalid syntax, delete the broken file, and regenerate a new, valid one.
* The `clusteroperator/network` should not report a degraded status.

Additional info:

The issue was identified from a customer support case.

The issue was seen in the `ipsec.service` logs and the `ovn-ipsec-host` pod logs. The `ipsec.service` fails to start because of a syntax error in the `/etc/ipsec.d/openshift.conf` file. The corruption, caused by a power outage in this case, prevents the `ovs-monitor-ipsec` daemon from running correctly, which in turn prevents the `ovn-ipsec-host` pod from starting. The `ovn-ipsec-host` pod checks the validity of the IPsec configuration as part of its startup process. This creates a loop: the pod cannot start because the config is broken, and the config cannot be fixed because the pod that fixes it cannot start.

Here are the relevant log snippets from the case:

`systemctl status` output for `ipsec.service`:

$ systemctl status ipsec.service
● ipsec.service - Internet Key Exchange (IKE) Protocol Daemon for IPsec
     Loaded: loaded (/usr/lib/systemd/system/ipsec.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/ipsec.service.d
             └─01-after-configure-ovs.conf
     Active: deactivating (stop-post) (Result: exit-code)
       Docs: man:ipsec(8)
             man:pluto(8)
             man:ipsec.conf(5)
    Process: 248754 ExecStartPre=/usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig (code=exited, status=3)
    Process: 248913 ExecStopPost=/bin/bash -c if test "$EXIT_STATUS" != "12"; then /sbin/ip xfrm policy flush; /sbin/ip xfrm state flush; fi (code=exited, status=0/SUCCESS)
Cntrl PID: 248935 (ipsec)
      Tasks: 6 (limit: 816509)
     Memory: 2.7M
        CPU: 10.611s
     CGroup: /system.slice/ipsec.service
             ├─248935 /usr/bin/sh /usr/sbin/ipsec --stopnflog
             ├─248936 /usr/bin/sh /usr/sbin/ipsec --stopnflog
             ├─248937 /usr/libexec/ipsec/addconn --ctlsocket /run/pluto/pluto.ctl --configsetup
             ├─248938 grep -v #
             ├─248939 grep nflog
             └─248940 sed -e "s/^.*=//" -e "s/'//g"
Aug 12 15:11:19 <node name> systemd[1]: Starting Internet Key Exchange (IKE) Protocol Daemon for IPsec...
Aug 12 15:11:28 <node name> addconn[248754]: cannot load config '/etc/ipsec.conf': /etc/ipsec.d/openshift.conf:1: syntax error []
Aug 12 15:11:28 <node name> systemd[1]: ipsec.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED

A potential solution, as suggested by Engineering, would be to implement a check in the `ExecStartPre` script to handle this scenario.

For example, the start logic could be modified to handle a broken config. Currently the start logic is:

/usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig

But it could be modified to something like:

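# Run addconn directly and negate its exit status; wrapping the command
# in "[ ! ... ]" would only test its words as strings, not execute it.
if [ -e /etc/ipsec.d/openshift.conf ] && ! /usr/libexec/ipsec/addconn --config /etc/ipsec.d/openshift.conf --checkconfig; then
    echo "openshift.conf is not correct!  Removing."
    rm -f /etc/ipsec.d/openshift.conf
fi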
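
The same idea can be sketched as a small reusable function. This is a minimal illustration, not the fix Engineering will ship: `heal_conf` and the `.corrupt` suffix are hypothetical names, and the validator is passed in as a command so that the logic can be exercised without Libreswan installed. Instead of deleting the broken file, the sketch moves it aside so it remains available for later analysis. It assumes that `/etc/ipsec.conf` only includes files matching `/etc/ipsec.d/*.conf`, so the renamed file would no longer be loaded.

```shell
#!/bin/sh
# Hypothetical self-healing helper (names are illustrative only).
# Usage: heal_conf CONF CHECK_COMMAND [ARGS...]
# Runs CHECK_COMMAND; if it exits non-zero, CONF is moved to CONF.corrupt
# so that a fresh file can be regenerated in its place.
heal_conf() {
    conf="$1"
    shift
    # Nothing to validate if the file does not exist.
    [ -e "$conf" ] || return 0
    # Exit status 0 from the checker means the file is valid.
    if "$@" >/dev/null 2>&1; then
        return 0
    fi
    echo "$conf failed validation; moving it to $conf.corrupt" >&2
    mv -f "$conf" "$conf.corrupt"
}
```

With the check command quoted in this report, the call would look like `heal_conf /etc/ipsec.d/openshift.conf /usr/libexec/ipsec/addconn --config /etc/ipsec.d/openshift.conf --checkconfig`.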

links to

KCS - ovn-ipsec-host pod on a node fails to start and enters CrashLoopBackOff
