- Bug
- Resolution: Done-Errata
- Major
- None
- 4.14.z, 4.15.z
Issue:-
During the upgrade, the ovn-ipsec pods go into a crash loop state with the error below:-
2024-07-04T14:09:29.507289285Z + counter=0
2024-07-04T14:09:29.507487324Z + '[' -f /etc/cni/net.d/10-ovn-kubernetes.conf ']'
2024-07-04T14:09:29.507533492Z + echo 'ovnkube-node has configured node.'
2024-07-04T14:09:29.507558436Z ovnkube-node has configured node.
2024-07-04T14:09:29.507584586Z + pgrep pluto
2024-07-04T14:09:29.562637753Z + echo 'pluto is not running, enable the service and/or check system logs'
2024-07-04T14:09:29.562751199Z pluto is not running, enable the service and/or check system logs
2024-07-04T14:09:29.562812899Z + exit 2
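The trace above suggests the order of the checks the container runs at startup. The following is a minimal sketch reconstructed only from that trace, not the actual script: the function names `ipsec_host_precheck` and `pluto_running`, the injectable parameters, and the return value for the "not configured" branch are all assumptions made for illustration.

```python
import os
import subprocess

# Path taken from the trace above.
CNI_CONF = "/etc/cni/net.d/10-ovn-kubernetes.conf"


def pluto_running() -> bool:
    """Return True if `pgrep pluto` finds a process (exit status 0)."""
    result = subprocess.run(["pgrep", "pluto"], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def ipsec_host_precheck(conf_exists=os.path.isfile, process_running=pluto_running) -> int:
    """Hypothetical reconstruction: return the exit code implied by the trace."""
    if not conf_exists(CNI_CONF):
        # Not shown in the trace; `counter=0` hints that the real script
        # waits and retries here. Returning 1 is a placeholder assumption.
        return 1
    print("ovnkube-node has configured node.")
    if not process_running():
        print("pluto is not running, enable the service and/or check system logs")
        return 2  # matches `exit 2` in the trace
    return 0
```

Read this way, the pod exits with status 2 whenever the CNI configuration is present but no `pluto` process is found, which is consistent with the repeated restarts reported here.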
Pods in crash loop state and the nodes they run on:-
ovn-ipsec-host-9k5pt  0/1  CrashLoopBackOff  8   17m  10.86.75.32  sv0a4101.lab-openshift-na.hybrid.sunlifecorp.com  <none>  <none>
ovn-ipsec-host-tbhgs  0/1  CrashLoopBackOff  5   5m   10.86.75.29  sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com  <none>  <none>
ovn-ipsec-host-xdfr8  0/1  CrashLoopBackOff  16  1h   10.86.75.30  sv0a4099.lab-openshift-na.hybrid.sunlifecorp.com  <none>  <none>
Nodes:-
sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com  Ready  patchgroup3,worker  1y  v1.27.13+401bb48  10.86.75.29  <none>  Red Hat Enterprise Linux CoreOS 414.92.202404231906-0 (Plow)  5.14.0-284.64.1.el9_2.x86_64  cri-o://1.27.5-2.rhaos4.14.gitbe29f54.el9
sv0a4099.lab-openshift-na.hybrid.sunlifecorp.com  Ready  patchgroup3,worker  1y  v1.27.13+401bb48  10.86.75.30  <none>  Red Hat Enterprise Linux CoreOS 414.92.202404231906-0 (Plow)  5.14.0-284.64.1.el9_2.x86_64  cri-o://1.27.5-2.rhaos4.14.gitbe29f54.el9
sv0a4101.lab-openshift-na.hybrid.sunlifecorp.com  Ready  patchgroup4,worker  1y  v1.27.13+401bb48  10.86.75.32  <none>  Red Hat Enterprise Linux CoreOS 414.92.202404231906-0 (Plow)  5.14.0-284.64.1.el9_2.x86_64  cri-o://1.27.5-2.rhaos4.14.gitbe29f54.el9
Observation:-
Customer has the below MCP:-
oc get mcp
NAME                    CONFIG                                                           UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
master                  rendered-master-f0299e2a6f235ab8a60300290d828678                  True     False     False     3             3                  3                    0                     1y
worker                  rendered-worker-835bec655f5412a6f39c814ffc84c7bc                  True     False     False     0             0                  0                    0                     296d
workerpool-patchgroup1  rendered-workerpool-patchgroup1-835bec655f5412a6f39c814ffc84c7bc  True     False     False     2             2                  2                    0                     1y
workerpool-patchgroup2  rendered-workerpool-patchgroup2-835bec655f5412a6f39c814ffc84c7bc  True     False     False     3             3                  3                    0                     1y
workerpool-patchgroup3  rendered-workerpool-patchgroup3-a5138fe6904ec0741aecb7a7c83111cd  False    False     False     2             0                  0                    0                     1y
workerpool-patchgroup4  rendered-workerpool-patchgroup4-a5138fe6904ec0741aecb7a7c83111cd  False    False     False     1             0                  0                    0                     1y
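To pick out the pools that have not yet been updated from output like the above, a small parser can help. This is an illustrative sketch only: the function name `pools_not_updated` is hypothetical, and it assumes the whitespace-separated column layout shown in this ticket's `oc get mcp` output.

```python
# Rows copied from the `oc get mcp` output in this ticket.
SAMPLE = """\
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-f0299e2a6f235ab8a60300290d828678 True False False 3 3 3 0 1y
worker rendered-worker-835bec655f5412a6f39c814ffc84c7bc True False False 0 0 0 0 296d
workerpool-patchgroup1 rendered-workerpool-patchgroup1-835bec655f5412a6f39c814ffc84c7bc True False False 2 2 2 0 1y
workerpool-patchgroup2 rendered-workerpool-patchgroup2-835bec655f5412a6f39c814ffc84c7bc True False False 3 3 3 0 1y
workerpool-patchgroup3 rendered-workerpool-patchgroup3-a5138fe6904ec0741aecb7a7c83111cd False False False 2 0 0 0 1y
workerpool-patchgroup4 rendered-workerpool-patchgroup4-a5138fe6904ec0741aecb7a7c83111cd False False False 1 0 0 0 1y
"""


def pools_not_updated(oc_get_mcp_output: str) -> list[str]:
    """Return the names of pools whose UPDATED column is not 'True'."""
    rows = [ln.split() for ln in oc_get_mcp_output.splitlines() if ln.strip()]
    header, body = rows[0], rows[1:]
    name_idx = header.index("NAME")
    updated_idx = header.index("UPDATED")
    return [row[name_idx] for row in body if row[updated_idx] != "True"]


print(pools_not_updated(SAMPLE))
# ['workerpool-patchgroup3', 'workerpool-patchgroup4']
```

The two pools reported are the ones whose nodes (patchgroup3 and patchgroup4) host the crash-looping pods listed earlier.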
Procedure Followed:-
Pausing and Unpausing MCPs
- Pause MCP2, MCP3, and MCP4 before the upgrade, leaving MCP1 unpaused.
- During the ovn-ipsec-host update, reboot the MCP1 nodes and then unpause MCP2, MCP3, and MCP4 one at a time for workload redundancy.
- Repeat similar steps during the machine-config stage to ensure application redundancy.
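The pause and unpause steps above can be sketched as a dry-run plan of `oc patch` commands. This is a hedged illustration, not the customer's actual tooling: the helper names `pause_command` and `rollout_plan` are hypothetical, and mapping MCP2 through MCP4 to `workerpool-patchgroup2` through `workerpool-patchgroup4` is an assumption based on the pool names in this ticket. The sketch only builds the commands and does not run them or wait for a pool to finish updating.

```python
import json


def pause_command(pool: str, paused: bool) -> list[str]:
    """Build an `oc patch` command that sets spec.paused on one pool."""
    patch = json.dumps({"spec": {"paused": paused}})
    return ["oc", "patch", f"machineconfigpool/{pool}",
            "--type", "merge", "--patch", patch]


def rollout_plan(later_pools: list[str]) -> list[list[str]]:
    """Pause every later pool before the upgrade, then unpause them in order."""
    plan = [pause_command(pool, True) for pool in later_pools]
    # In practice each unpause would wait for the previous pool to finish.
    plan += [pause_command(pool, False) for pool in later_pools]
    return plan


# Assumed pool names; MCP1 (workerpool-patchgroup1) stays unpaused throughout.
pools = ["workerpool-patchgroup2", "workerpool-patchgroup3", "workerpool-patchgroup4"]
for cmd in rollout_plan(pools):
    print(" ".join(cmd))
```

Keeping the plan as data makes it easy to review the sequence before anything touches the cluster.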
Issue with ovn-ipsec-host Update:-
- During the ovn-ipsec update, even though the MCPs are rebooted one at a time, ovn-ipsec pods on workers from different MCPs randomly enter the CrashLoopBackOff state with the error shown at the start of this report.
- These pods remain in CrashLoopBackOff until their respective MCPs are unpaused, sometimes causing delays of more than 3 hours.
For example, while the MCP1 nodes are rebooting, ovn-ipsec-host pods on MCP2, MCP3, and MCP4 may enter CrashLoopBackOff. The pods recover only when their respective MCPs are unpaused.
Concern:-
- Why are the ovn-ipsec pods on other worker nodes in a crash loop state when their MCP is paused and no update or upgrade is running on those nodes?
- Will there be any application outage while the ovn-ipsec pods are in a crash loop state?
This behavior appears to be a bug: no upgrade is running on the paused MCPs, yet the ovn-ipsec pods on their nodes are in a crash loop state.
- clones: OCPBUGS-52949 [release-4.18] Unexpected Behavior During Cluster Upgrade (4.14.23 to 4.15.15) for the ovn-ipsec-host pods. (Closed)
- depends on: OCPBUGS-52949 [release-4.18] Unexpected Behavior During Cluster Upgrade (4.14.23 to 4.15.15) for the ovn-ipsec-host pods. (Closed)
- is cloned by: OCPBUGS-52952 [release-4.16] Unexpected Behavior During Cluster Upgrade (4.14.23 to 4.15.15) for the ovn-ipsec-host pods. (Verified)
- is depended on by: OCPBUGS-52952 [release-4.16] Unexpected Behavior During Cluster Upgrade (4.14.23 to 4.15.15) for the ovn-ipsec-host pods. (Verified)
- links to: RHBA-2025:3565 OpenShift Container Platform 4.17.z bug fix update