  1. OpenShift Bugs
  2. OCPBUGS-60952

Failed to recover from a corrupted IPsec configuration file


    • Quality / Stability / Reliability

      Description of problem:

      The `ovn-ipsec-host` pod on a node is in a `CrashLoopBackOff` state because `ipsec.service` fails to start due to a corrupted `openshift.conf` file. The corruption can be caused, for example, by a power outage that occurs while the file is being written, leaving the file with invalid syntax. The `ovn-ipsec` container has an `ExecStartPre` command that checks the config file, which prevents it from starting if the file is broken. The system lacks an automated self-healing mechanism to detect and repair the corrupted file, so manual intervention is required to delete the file and restart the services. This affects node stability and the security of pod-to-pod communication.

      Version-Release number of selected component (if applicable):

      This was observed in OpenShift Container Platform 4.18.20, but other OpenShift versions are probably affected as well.

      How reproducible:

      Always, once the configuration file is in a corrupted state. The initial corruption can be reproduced by truncating the file.

      Steps to Reproduce:

      1.  Enable IPsec on an OpenShift cluster.
      2.  Cause corruption of the file `/etc/ipsec.d/openshift.conf` by truncating the file in the middle of a line. For example:

       

      $ cat /etc/ipsec.d/openshift.conf
      n-1
          left=<redacted>
          right=<redacted>
          leftid=@<redacted>
          rightid=@<redacted>
          leftcert="ovs_certkey_<redacted>"
          leftrsasigkey=%cert
          rightca=%same
          leftprotoport=udp/6081
          rightprotoport=udp
      
      conn <redacted>
          auto=start
          left=<redacted>
          right=<redacted>
          leftid=@<redacted>
          rightid=@<redacted>
          leftcert="ovs_certkey_<redacted>"
          leftrsasigkey=%cert
          rightca=%same
          leftprotoport=udp
          rightprotoport=udp/6081
      [...] 

      3.  Reboot the node.
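
The corruption in step 2 can be simulated with a small helper. This is a minimal sketch, assuming a disposable test cluster and a root shell on the node; `corrupt_head` and the `.orig` backup suffix are hypothetical names, not part of OpenShift or Libreswan. It drops the first bytes of the file so that line 1 starts mid-line, similar to the `n-1` excerpt above.

```shell
# Sketch only: simulate a partially written file by dropping its first N bytes.
# corrupt_head is a hypothetical helper name used for illustration.
corrupt_head() {
    local file="$1" drop_bytes="$2"
    # Keep an untouched copy next to the file so the test can be undone.
    cp -p "$file" "${file}.orig" || return 1
    # tail -c +K prints from byte K onward, so +$((N + 1)) skips the first N bytes.
    # Writing through a redirect keeps the original inode and permissions.
    tail -c "+$((drop_bytes + 1))" "${file}.orig" > "$file"
}
```

For example, `corrupt_head /etc/ipsec.d/openshift.conf 7` would cut seven bytes from the start of the first `conn` line. The backup does not end in `.conf`, so it should not be picked up if `/etc/ipsec.conf` only includes `/etc/ipsec.d/*.conf`.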

       

      Actual results:

        * The `ovn-ipsec-host` pod on that node fails to start and enters a `CrashLoopBackOff` state.
        * The `ipsec.service` on that node fails to start with a syntax error in `/etc/ipsec.d/openshift.conf`.
        * The `ovn-ipsec` container's pre-start check fails, preventing it from starting.
        * Manual intervention is required to delete the corrupted configuration file and restart the services.
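
The manual workaround from the last bullet could look roughly like the following when run as root on the affected node (for example via `oc debug node/<node name>` followed by `chroot /host`). This is a sketch under assumptions: `recover_ipsec_conf` and the `DRY_RUN` switch are illustrative names, and whether the `ovn-ipsec-host` pod then recovers on its own next restart has not been verified here.

```shell
# Sketch only: remove the corrupted file and restart IPsec on the node.
# recover_ipsec_conf and DRY_RUN are hypothetical names for illustration.
recover_ipsec_conf() {
    local conf="${1:-/etc/ipsec.d/openshift.conf}"
    local run=""
    # With DRY_RUN=1 the commands are printed instead of executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi
    $run rm -f "$conf"
    $run systemctl restart ipsec.service
}
```

Running `DRY_RUN=1 recover_ipsec_conf` first prints the two commands without touching the node.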

      Expected results:

        * The `ovn-ipsec-host` pod and `ipsec.service` should be resilient to file corruption.
        * In the event of a corrupted `openshift.conf` file, the `ovn-ipsec-host` pod should have a self-healing mechanism to detect the invalid syntax, delete the broken file, and regenerate a new, valid one.
        * The `clusteroperator/network` should not report a degraded status.

      Additional info:

      The issue was identified from a customer support case. 

      The issue was seen in the `ipsec.service` logs and the `ovn-ipsec-host` pod logs. The `ipsec.service` fails to start because of a syntax error in the `/etc/ipsec.d/openshift.conf` file. The file corruption caused by a power outage prevents the `ovs-monitor-ipsec` daemon from running correctly, which in turn prevents the `ovn-ipsec-host` pod from starting. The `ovn-ipsec-host` pod is configured to check the validity of the IPsec configuration as a part of its startup process, creating a loop where it cannot start because the config is broken, and the config cannot be fixed because the pod that fixes it cannot start.

      Here are the relevant log snippets from the case:

      `systemctl` status for ipsec.service:

      $ systemctl status ipsec.service
      ● ipsec.service - Internet Key Exchange (IKE) Protocol Daemon for IPsec
           Loaded: loaded (/usr/lib/systemd/system/ipsec.service; enabled; preset: disabled)
          Drop-In: /etc/systemd/system/ipsec.service.d
                   └─01-after-configure-ovs.conf
           Active: deactivating (stop-post) (Result: exit-code)
             Docs: man:ipsec(8)
                   man:pluto(8)
                   man:ipsec.conf(5)
          Process: 248754 ExecStartPre=/usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig (code=exited, status=3)
          Process: 248913 ExecStopPost=/bin/bash -c if test "$EXIT_STATUS" != "12"; then /sbin/ip xfrm policy flush; /sbin/ip xfrm state flush; fi (code=exited, status=0/SUCCESS)
      Cntrl PID: 248935 (ipsec)
            Tasks: 6 (limit: 816509)
           Memory: 2.7M
              CPU: 10.611s
           CGroup: /system.slice/ipsec.service
                   ├─248935 /usr/bin/sh /usr/sbin/ipsec --stopnflog
                   ├─248936 /usr/bin/sh /usr/sbin/ipsec --stopnflog
                   ├─248937 /usr/libexec/ipsec/addconn --ctlsocket /run/pluto/pluto.ctl --configsetup
                   ├─248938 grep -v #
                   ├─248939 grep nflog
                   └─248940 sed -e "s/^.*=//" -e "s/'//g"
      Aug 12 15:11:19 <node name> systemd[1]: Starting Internet Key Exchange (IKE) Protocol Daemon for IPsec...
      Aug 12 15:11:28 <node name> addconn[248754]: cannot load config '/etc/ipsec.conf': /etc/ipsec.d/openshift.conf:1: syntax error []
      Aug 12 15:11:28 <node name> systemd[1]: ipsec.service: Control process exited, code=exited, status=3/NOTIMPLEMENTED
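
To spot this condition on other nodes, the `addconn` message above can be matched in the journal. The helper below is a sketch: `find_broken_conf` is a hypothetical name, and the pattern is derived only from the single log line quoted in this report, so other Libreswan versions may word the error differently.

```shell
# Sketch only: extract "file:line" from addconn's "cannot load config" error.
# find_broken_conf is a hypothetical helper; it reads journal text on stdin.
find_broken_conf() {
    # Prints the included file and line number that failed to parse;
    # lines that do not match the pattern are ignored.
    sed -n "s/.*cannot load config '[^']*': \([^ ]*:[0-9][0-9]*\): syntax error.*/\1/p"
}
```

A possible invocation on the node is `journalctl -u ipsec.service --no-pager | find_broken_conf`.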
      

      A potential solution, as suggested by Engineering, would be to implement a check in the `ExecStartPre` script to handle this scenario.

      For example, the start logic could be modified to handle a broken config. Currently the start logic is:

      /usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig

      But it could be modified to something like:

      # Run addconn directly: wrapping it in [ ... ] would test a string, not execute the check.
      if [ -e /etc/ipsec.d/openshift.conf ] && ! /usr/libexec/ipsec/addconn --config /etc/ipsec.d/openshift.conf --checkconfig; then
          echo "openshift.conf is not correct!  Removing."
          rm -f /etc/ipsec.d/openshift.conf
      fi
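
A variation on this idea is to move the broken file aside instead of deleting it, so that it stays available for debugging. This is a sketch, not the agreed fix: `heal_ipsec_conf`, the `CHECK_CMD` override, and the `.corrupt` suffix are assumptions for illustration, and it relies on `/etc/ipsec.conf` only including `/etc/ipsec.d/*.conf` and on the file being regenerated once the pod can start again.

```shell
# Sketch only: quarantine a broken openshift.conf before ipsec starts.
# heal_ipsec_conf, CHECK_CMD and the ".corrupt" suffix are hypothetical.
heal_ipsec_conf() {
    local conf="${1:-/etc/ipsec.d/openshift.conf}"
    # Validator binary; defaults to the addconn path quoted in this report.
    local check="${CHECK_CMD:-/usr/libexec/ipsec/addconn}"
    # Nothing to validate if the file does not exist.
    [ -e "$conf" ] || return 0
    # A file that passes the syntax check is left untouched.
    if "$check" --config "$conf" --checkconfig; then
        return 0
    fi
    echo "$conf failed validation; moving it aside." >&2
    # The new name no longer ends in .conf, so a *.conf include skips it.
    mv -f "$conf" "${conf}.corrupt"
}
```

Making the validator overridable through `CHECK_CMD` allows the logic to be exercised on a workstation without Libreswan installed.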

       

              rravaiol@redhat.com Riccardo Ravaioli
              rh-ee-dcoronel David Coronel
              Anurag Saxena Anurag Saxena