OCPBUGS-52951

[release-4.17] Unexpected Behavior During Cluster Upgrade (4.14.23 to 4.15.15) for the ovn-ipsec-host pods.

      IPsec on RHEL workers is currently not supported due to a regression found while testing fixes for OCPBUGS-52280.
      Regression bug: https://issues.redhat.com/browse/OCPBUGS-53316
      This must be documented.
    • Bug Fix
    • Done
    • Customer Escalated

      Issue:-

      During the upgrade, the ovn-ipsec pods get into a crash loop state with the below error:-

      2024-07-04T14:09:29.507289285Z + counter=0
      2024-07-04T14:09:29.507487324Z + '[' -f /etc/cni/net.d/10-ovn-kubernetes.conf ']'
      2024-07-04T14:09:29.507533492Z + echo 'ovnkube-node has configured node.'
      2024-07-04T14:09:29.507558436Z ovnkube-node has configured node.
      2024-07-04T14:09:29.507584586Z + pgrep pluto
      2024-07-04T14:09:29.562637753Z + echo 'pluto is not running, enable the service and/or check system logs'
      2024-07-04T14:09:29.562751199Z pluto is not running, enable the service and/or check system logs
      2024-07-04T14:09:29.562812899Z + exit 2 
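
      The trace above is set -x output from the pod's check script. A minimal sketch of the logic it appears to run, reconstructed from the trace (the actual entrypoint in the ovn-ipsec-host daemonset may differ):

      #!/bin/bash
      # Sketch reconstructed from the set -x trace above; the real script
      # in the ovn-ipsec-host pod spec may differ.
      counter=0
      if [ -f /etc/cni/net.d/10-ovn-kubernetes.conf ]; then
          echo 'ovnkube-node has configured node.'
      fi
      # This is the check that fails: pluto (the libreswan IKE daemon)
      # must already be running on the host.
      if ! pgrep pluto; then
          echo 'pluto is not running, enable the service and/or check system logs'
          exit 2
      fi

      To see why pluto is down on an affected node, the host-side service can be inspected directly (libreswan's pluto runs under the ipsec.service systemd unit):

      oc debug node/sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com -- chroot /host systemctl status ipsec.service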

      Pods in crash loop state and the nodes they run on:-

      ovn-ipsec-host-9k5pt                    0/1     CrashLoopBackOff   8          17m   10.86.75.32      sv0a4101.lab-openshift-na.hybrid.sunlifecorp.com   <none>           <none>
      
      ovn-ipsec-host-tbhgs                    0/1     CrashLoopBackOff   5          5m    10.86.75.29      sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com   <none>           <none>
      
      ovn-ipsec-host-xdfr8                    0/1     CrashLoopBackOff   16         1h    10.86.75.30      sv0a4099.lab-openshift-na.hybrid.sunlifecorp.com   <none>           <none> 

      Nodes:-

      sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com   Ready    patchgroup3,worker   1y    v1.27.13+401bb48   10.86.75.29      <none>        Red Hat Enterprise Linux CoreOS 414.92.202404231906-0 (Plow)   5.14.0-284.64.1.el9_2.x86_64   cri-o://1.27.5-2.rhaos4.14.gitbe29f54.el9
      
      sv0a4099.lab-openshift-na.hybrid.sunlifecorp.com   Ready    patchgroup3,worker   1y    v1.27.13+401bb48   10.86.75.30      <none>        Red Hat Enterprise Linux CoreOS 414.92.202404231906-0 (Plow)   5.14.0-284.64.1.el9_2.x86_64   cri-o://1.27.5-2.rhaos4.14.gitbe29f54.el9
      
      sv0a4101.lab-openshift-na.hybrid.sunlifecorp.com   Ready    patchgroup4,worker   1y    v1.27.13+401bb48   10.86.75.32      <none>        Red Hat Enterprise Linux CoreOS 414.92.202404231906-0 (Plow)   5.14.0-284.64.1.el9_2.x86_64   cri-o://1.27.5-2.rhaos4.14.gitbe29f54.el9 
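
      For reference, the pod and node listings above can be regenerated with wide output (the ovn-ipsec-host pods run in the openshift-ovn-kubernetes namespace):

      oc -n openshift-ovn-kubernetes get pods -o wide | grep ovn-ipsec-host
      oc get nodes -o wide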

      Observation:-

      The customer has the below MCPs:-
      
      oc get mcp
      NAME                     CONFIG                                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master                   rendered-master-f0299e2a6f235ab8a60300290d828678                   True      False      False      3              3                   3                     0                      1y
      worker                   rendered-worker-835bec655f5412a6f39c814ffc84c7bc                   True      False      False      0              0                   0                     0                      296d
      workerpool-patchgroup1   rendered-workerpool-patchgroup1-835bec655f5412a6f39c814ffc84c7bc   True      False      False      2              2                   2                     0                      1y
      workerpool-patchgroup2   rendered-workerpool-patchgroup2-835bec655f5412a6f39c814ffc84c7bc   True      False      False      3              3                   3                     0                      1y
      workerpool-patchgroup3   rendered-workerpool-patchgroup3-a5138fe6904ec0741aecb7a7c83111cd   False     False      False      2              0                   0                     0                      1y
      workerpool-patchgroup4   rendered-workerpool-patchgroup4-a5138fe6904ec0741aecb7a7c83111cd   False     False      False      1              0                   0                     0                      1y 

      Procedure Followed:-

       

      Pausing and Unpausing MCPs
      - Pause MCP2, MCP3, and MCP4 before the upgrade, leaving MCP1 unpaused.
      - During the ovn-ipsec-host update:
        - Reboot the MCP1 nodes, then unpause MCP2, MCP3, and MCP4 one at a time to preserve workload redundancy.
        - Repeat the same steps during the machine-config stage to preserve application redundancy (see the command sketch after this list).
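
      A sketch of that pause/unpause flow with oc, assuming MCP1-MCP4 refer to the workerpool-patchgroup1-4 pools listed above (spec.paused is the standard MachineConfigPool pause field):

      # Pause patchgroups 2-4 before the upgrade; patchgroup1 stays unpaused.
      for pool in workerpool-patchgroup2 workerpool-patchgroup3 workerpool-patchgroup4; do
          oc patch mcp/"$pool" --type merge -p '{"spec":{"paused":true}}'
      done

      # After the patchgroup1 nodes have rolled, unpause the remaining pools
      # one at a time, waiting for each rollout to finish before the next.
      for pool in workerpool-patchgroup2 workerpool-patchgroup3 workerpool-patchgroup4; do
          oc patch mcp/"$pool" --type merge -p '{"spec":{"paused":false}}'
          oc wait mcp/"$pool" --for=condition=Updated --timeout=30m
      done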

      Issue with ovn-ipsec-host Update:-

      - During the ovn-ipsec update, despite rebooting the MCPs one at a time, ovn-ipsec pods on workers from different MCPs randomly enter CrashLoopBackOff with the error quoted at the start of this report.
      - These pods remain in CrashLoopBackOff until the respective MCPs are unpaused, sometimes causing delays of more than 3 hours.

      While rebooting the MCP1 nodes, ovn-ipsec-host pods on MCP2, MCP3, and MCP4 may enter CrashLoopBackOff. The pods recover only when their respective MCPs are unpaused.
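
      One way to confirm that the crashing pods sit on nodes whose machine config is pinned behind by the pause is to compare each node's current and desired rendered-config annotations (node name is an example from the listing above):

      oc get node sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com \
          -o jsonpath='current: {.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}desired: {.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'

      If the two differ on a paused pool while the ovn-ipsec-host daemonset has already rolled forward, that version skew would be consistent with the pods expecting host-side pluto that the older machine config on paused nodes does not yet enable. This is an assumption based on the symptoms above, not something confirmed in this report.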

      Concern:-

      • Why are the ovn-ipsec pods on other worker nodes in a crash loop state when their MCP is paused and no update/upgrade is running on those nodes?
      • Also, will there be an application outage while the ovn-ipsec pods are in a crash state? (One way to check is sketched below.)
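
      On the outage question, a quick check is whether pluto is up and IPsec SAs are still established on a node with a crashing pod; ipsec whack --trafficstatus lists live SAs and their byte counters (a diagnostic sketch, not a definitive answer):

      oc debug node/sv0a4101.lab-openshift-na.hybrid.sunlifecorp.com -- chroot /host \
          /bin/sh -c 'pgrep pluto; ipsec whack --trafficstatus'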

      This behavior appears to be a bug: on the paused MCPs no upgrade is running, yet the ovn-ipsec pods on those nodes are crashing.

       
