OpenShift Bugs / OCPBUGS-3824

[4.12] IPsec pods restart due to liveness probe failures in clusters with more than 150 nodes

      This is a clone of issue OCPBUGS-2598. The following is the description of the original issue:

      Description of problem:

      The liveness probe of the ipsec pods fails on large clusters. The command currently executed in the ipsec container is

      ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status

      The problem is the "ipsec/status" subcommand. On clusters with a high node count, this command returns a list covering every node daemon in the cluster, so its completion time grows as the node count grows.
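
      To see the effect directly, the probe command can be timed inside one of the ovn-ipsec pods. A rough sketch follows; the container name "ovn-ipsec" is an assumption and may differ per release:

      # Hedged sketch: time the probe command inside one ovn-ipsec pod
      $ POD=$(oc -n openshift-ovn-kubernetes get pods -l app=ovn-ipsec -o name | head -n 1)
      $ oc -n openshift-ovn-kubernetes exec "$POD" -c ovn-ipsec -- \
          bash -c 'time (ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status)'

      On a large cluster the reported time can approach or exceed the probe timeout.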

      This causes the main command

      ovs-appctl -t ovs-monitor-ipsec

      to hang until the subcommand has finished.

      Because the liveness and readiness probe values are hardcoded in the manifest of the ipsec container here: https://github.com/openshift/cluster-network-operator/blob/9c1181e34316d34db49d573698d2779b008bcc20/bindata/network/ovn-kubernetes/common/ipsec.yaml, the 60-second liveness probe timeout becomes insufficient as the node count grows. On a cluster with 170+ nodes this resulted in 15+ ipsec pods in a CrashLoopBackOff state.
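
      For reference, the relevant probe stanza in that manifest looks roughly like the following. Only the probe command and the 60-second timeout come from this report; the remaining fields are illustrative:

      livenessProbe:
        exec:
          command:
          - /bin/bash
          - -c
          - ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
        timeoutSeconds: 60   # hardcoded; becomes insufficient as the node count grows
        periodSeconds: 60    # illustrative value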

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.10, but the same is likely visible on other versions too.

      How reproducible:

      I was not able to reproduce this because it requires an extremely high amount of resources, and there is little point since the issue has already been identified.

      Steps to Reproduce:

      1. Install an OpenShift cluster with IPsec enabled
      2. Scale to 170+ nodes (for example by scaling the worker machinesets; see the sketch after these steps)
      3. Notice that the ipsec pods start going into a CrashLoopBackOff state with failed liveness/readiness probes.
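
      A rough sketch of step 2 (the machineset names are cluster-specific placeholders):

      # List the worker machinesets, then scale them until the cluster has 170+ nodes
      $ oc -n openshift-machine-api get machinesets
      $ oc -n openshift-machine-api scale machineset <worker-machineset-a> --replicas=60
      $ oc -n openshift-machine-api scale machineset <worker-machineset-b> --replicas=60
      $ oc -n openshift-machine-api scale machineset <worker-machineset-c> --replicas=60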
      

      Actual results:

      IPsec pods are stuck in a CrashLoopBackOff state

      Expected results:

      IPsec pods work normally

      Additional info:

      We have provided a workaround in which the CVO and CNO operators are scaled to 0 replicas so that the liveness probe timeout can be increased to 600 seconds; this recovered the cluster.
      As a next step the customer will try to reduce the node count, restore the default liveness timeout, and bring the operators back to see whether the cluster stabilizes.
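
      A rough sketch of that workaround (the standard operator deployment names are assumed, and the daemonset name "ovn-ipsec" and container index may differ per release):

      # Scale down CVO and CNO so they do not revert the manual probe change
      $ oc -n openshift-cluster-version scale deployment cluster-version-operator --replicas=0
      $ oc -n openshift-network-operator scale deployment network-operator --replicas=0
      # Raise the liveness probe timeout on the ipsec daemonset to 600 seconds
      $ oc -n openshift-ovn-kubernetes patch daemonset ovn-ipsec --type=json \
          -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":600}]'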

       


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory, and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2022:7399


            Anurag Saxena added a comment -

             $ ogpow -n openshift-ovn-kubernetes -l=app=ovn-ipsec | grep -i running | wc -l
            154
            [anusaxen@anusaxen ~]$ oc get machineset -A
            NAMESPACE               NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
            openshift-machine-api   qe-anurag67r-47542-worker-us-east-2a   75        75        75      75          104m
            openshift-machine-api   qe-anurag67r-47542-worker-us-east-2b   75        75        75      75          104m
            openshift-machine-api   qe-anurag67r-47542-worker-us-east-2c   1         1         1       1           104m
            [anusaxen@anusaxen ~]$ oc get clusterversion
            NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.12.0-0.nightly-2022-11-25-231352   True        False         84m     Cluster version is 4.12.0-0.nightly-2022-11-25-231352
            

