Fast Datapath Product / FDP-846

ovs-monitor-ipsec can't proceed if 'ipsec auto' process is stuck


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • openvswitch3.1

      Unfortunately, multiple versions of Libreswan, from 4.6 through 4.15, can hang at scale while running ipsec auto --start. We did not test newer releases, but all versions after 4.5 seem to have the same bug. When it happens, the process list looks like this:

      root         149  0.0  0.0  30660 23604 ?        S    16:06   0:00 /usr/libexec/platform-python /usr/share/openvswitch/scripts/ovs-monitor-ipsec --pidfile=/var/run/openvswitch/ovs-monitor-ipsec.pid --ike-daemon=libreswan --no-restart-ike-daemon --ipsec-d /var/lib/ipsec/
      root         159  0.0  0.0  33176 19732 ?        S    16:06   0:01 /usr/libexec/platform-python /usr/share/openvswitch/scripts/ovs-monitor-ipsec --pidfile=/var/run/openvswitch/ovs-monitor-ipsec.pid --ike-daemon=libreswan --no-restart-ike-daemon --ipsec-d /var/lib/ipsec/
      root       14847  0.0  0.0   4528  3468 ?        S    16:28   0:00 /usr/bin/sh /usr/libexec/ipsec/auto --config /etc/ipsec.conf --ctlsocket /run/pluto/pluto.ctl --start --asynchronous ovn-a953f7-0-out-1
      root       14865  0.0  0.0   2772  1132 ?        S    16:28   0:00 /usr/libexec/ipsec/whack --ctlsocket /run/pluto/pluto.ctl --asynchronous --name ovn-a953f7-0-out-1 --initiate
      

      Basically:

      1. ovs-monitor-ipsec calls ipsec auto --start --asynchronous for a new connection.
      2. ipsec auto calls ipsec whack --initiate.
      3. ipsec whack connects to pluto and asks it to initiate the connection.
      4. pluto never replies.
      5. ipsec whack waits forever.
      6. Every other process up the chain waits with it.

      This task is to make ovs-monitor-ipsec more resilient by killing the ipsec auto process if it gets stuck for a long time. This way the monitor will be able to proceed and start other connections.
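
      A minimal sketch of the idea, not the actual ovs-monitor-ipsec code: run the command in its own session and, if it does not finish in time, kill the whole process group so that helpers started by the ipsec auto shell script (such as whack) go away too. The function name run_with_timeout and the timeout value are hypothetical.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout):
    """Run cmd; kill its whole process group if it exceeds timeout seconds.

    Returns (returncode, stdout, stderr, timed_out).
    Hypothetical helper, for illustration only.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        # The child calls setsid(), so it leads a new process group
        # whose id equals proc.pid.
        start_new_session=True,
    )
    try:
        out, err = proc.communicate(timeout=timeout)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        # Kill the group, not just the direct child: a grandchild that
        # keeps the output pipes open would otherwise block the final
        # communicate() call below.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # The group exited on its own in the meantime.
        out, err = proc.communicate()
        return proc.returncode, out, err, True
```

      With a helper like this, a stuck start would surface as timed_out=True and the monitor could move on to the next connection instead of blocking forever.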

      In practice, it seems that pluto does actually initiate the connection and just doesn't reply for some reason (Libreswan folks are investigating), so simply killing the process may be fine. However, reconciliation logic is still required to check that the connection was actually loaded in the end.
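
      One possible shape for that check, as a sketch under assumptions: after a pass, compare the connections that should exist with the ones the IKE daemon reports as loaded. How the loaded set is obtained is left out here, and the function and key names are hypothetical.

```python
def reconcile(desired, loaded, killed):
    """Classify connections after a start pass.

    desired: names that should be loaded
    loaded:  names the IKE daemon reports as loaded
    killed:  names whose start command was killed after getting stuck
    Hypothetical helper, for illustration only.
    """
    desired, loaded, killed = set(desired), set(loaded), set(killed)
    return {
        # The start command was killed, but the connection was loaded
        # anyway: nothing more to do.
        "recovered": sorted(desired & loaded & killed),
        # Wanted but not loaded: try to start it again on the next pass.
        "retry": sorted(desired - loaded),
    }
```

      For example, if ovn-a953f7-0-out-1 was killed but shows up as loaded, it lands in "recovered"; a connection that is wanted but absent from the loaded set lands in "retry".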

      The Libreswan issue appears mostly at scale. We can reproduce it by deploying a 500-node OpenShift cluster with IPsec. There is a chance of reproducing it at 120 nodes, but it doesn't always appear there; 500 nodes seems to be much more reliable. Smaller reproducers are not known at this time.

            imaximet@redhat.com Ilya Maximets