OCPBUGS-52951

[release-4.17] Unexpected Behavior During Cluster Upgrade (4.14.23 to 4.15.15) for the ovn-ipsec-host pods.

      IPsec on RHEL workers is currently not supported due to a regression found while testing fixes for OCPBUGS-52280.
      Regression bug: https://issues.redhat.com/browse/OCPBUGS-53316
      This must be documented.
    • Bug Fix
    • Done
    • Customer Escalated

      Issue:-

      During the upgrade, the ovn-ipsec pods get into a crash loop state with the below error:-

      2024-07-04T14:09:29.507289285Z + counter=0
      2024-07-04T14:09:29.507487324Z + '[' -f /etc/cni/net.d/10-ovn-kubernetes.conf ']'
      2024-07-04T14:09:29.507533492Z + echo 'ovnkube-node has configured node.'
      2024-07-04T14:09:29.507558436Z ovnkube-node has configured node.
      2024-07-04T14:09:29.507584586Z + pgrep pluto
      2024-07-04T14:09:29.562637753Z + echo 'pluto is not running, enable the service and/or check system logs'
      2024-07-04T14:09:29.562751199Z pluto is not running, enable the service and/or check system logs
      2024-07-04T14:09:29.562812899Z + exit 2 
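
      The trace above is set -x output from the pod's check script. A minimal sketch of the logic it appears to run, reconstructed from the trace (the actual entrypoint in the ovn-ipsec-host daemonset may differ):

      #!/bin/bash
      # Sketch reconstructed from the set -x trace above; the real script
      # in the ovn-ipsec-host pod spec may differ.
      counter=0
      if [ -f /etc/cni/net.d/10-ovn-kubernetes.conf ]; then
          echo 'ovnkube-node has configured node.'
      fi
      # This is the check that fails: pluto (the libreswan IKE daemon)
      # must already be running on the host.
      if ! pgrep pluto; then
          echo 'pluto is not running, enable the service and/or check system logs'
          exit 2
      fi

      To see why pluto is down on an affected node, the host-side service can be inspected directly (libreswan's pluto runs under the ipsec.service systemd unit):

      oc debug node/sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com -- chroot /host systemctl status ipsec.service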

      Pods in crash loop state and the nodes they run on:-

      ovn-ipsec-host-9k5pt                    0/1     CrashLoopBackOff   8          17m   10.86.75.32      sv0a4101.lab-openshift-na.hybrid.sunlifecorp.com   <none>           <none>
      
      ovn-ipsec-host-tbhgs                    0/1     CrashLoopBackOff   5          5m    10.86.75.29      sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com   <none>           <none>
      
      ovn-ipsec-host-xdfr8                    0/1     CrashLoopBackOff   16         1h    10.86.75.30      sv0a4099.lab-openshift-na.hybrid.sunlifecorp.com   <none>           <none> 

      Nodes:-

      sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com   Ready    patchgroup3,worker   1y    v1.27.13+401bb48   10.86.75.29      <none>        Red Hat Enterprise Linux CoreOS 414.92.202404231906-0 (Plow)   5.14.0-284.64.1.el9_2.x86_64   cri-o://1.27.5-2.rhaos4.14.gitbe29f54.el9
      
      sv0a4099.lab-openshift-na.hybrid.sunlifecorp.com   Ready    patchgroup3,worker   1y    v1.27.13+401bb48   10.86.75.30      <none>        Red Hat Enterprise Linux CoreOS 414.92.202404231906-0 (Plow)   5.14.0-284.64.1.el9_2.x86_64   cri-o://1.27.5-2.rhaos4.14.gitbe29f54.el9
      
      sv0a4101.lab-openshift-na.hybrid.sunlifecorp.com   Ready    patchgroup4,worker   1y    v1.27.13+401bb48   10.86.75.32      <none>        Red Hat Enterprise Linux CoreOS 414.92.202404231906-0 (Plow)   5.14.0-284.64.1.el9_2.x86_64   cri-o://1.27.5-2.rhaos4.14.gitbe29f54.el9 
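
      For reference, the pod and node listings above can be regenerated with wide output (the ovn-ipsec-host pods run in the openshift-ovn-kubernetes namespace):

      oc -n openshift-ovn-kubernetes get pods -o wide | grep ovn-ipsec-host
      oc get nodes -o wide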

      Observation:-

      The customer has the below MCPs:-
      
      oc get mcp
      NAME                     CONFIG                                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master                   rendered-master-f0299e2a6f235ab8a60300290d828678                   True      False      False      3              3                   3                     0                      1y
      worker                   rendered-worker-835bec655f5412a6f39c814ffc84c7bc                   True      False      False      0              0                   0                     0                      296d
      workerpool-patchgroup1   rendered-workerpool-patchgroup1-835bec655f5412a6f39c814ffc84c7bc   True      False      False      2              2                   2                     0                      1y
      workerpool-patchgroup2   rendered-workerpool-patchgroup2-835bec655f5412a6f39c814ffc84c7bc   True      False      False      3              3                   3                     0                      1y
      workerpool-patchgroup3   rendered-workerpool-patchgroup3-a5138fe6904ec0741aecb7a7c83111cd   False     False      False      2              0                   0                     0                      1y
      workerpool-patchgroup4   rendered-workerpool-patchgroup4-a5138fe6904ec0741aecb7a7c83111cd   False     False      False      1              0                   0                     0                      1y 

      Procedure Followed:-

       

      Pausing and Unpausing MCPs
      - Pause MCP2, MCP3, and MCP4 before the upgrade, leaving MCP1 unpaused.
      - During the ovn-ipsec-host update:
        - Reboot the MCP1 nodes, then unpause MCP2, MCP3, and MCP4 one at a time to preserve workload redundancy.
        - Repeat the same steps during the machine-config stage to preserve application redundancy (see the command sketch after this list).
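
      A sketch of that pause/unpause flow with oc, assuming MCP1-MCP4 refer to the workerpool-patchgroup1-4 pools listed above (spec.paused is the standard MachineConfigPool pause field):

      # Pause patchgroups 2-4 before the upgrade; patchgroup1 stays unpaused.
      for pool in workerpool-patchgroup2 workerpool-patchgroup3 workerpool-patchgroup4; do
          oc patch mcp/"$pool" --type merge -p '{"spec":{"paused":true}}'
      done

      # After the patchgroup1 nodes have rolled, unpause the remaining pools
      # one at a time, waiting for each rollout to finish before the next.
      for pool in workerpool-patchgroup2 workerpool-patchgroup3 workerpool-patchgroup4; do
          oc patch mcp/"$pool" --type merge -p '{"spec":{"paused":false}}'
          oc wait mcp/"$pool" --for=condition=Updated --timeout=30m
      done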

      Issue with ovn-ipsec-host Update:-

      - During the ovn-ipsec update, despite rebooting the MCPs one at a time, ovn-ipsec pods on workers from different MCPs randomly enter CrashLoopBackOff with the error quoted at the start of this report.
      - These pods remain in CrashLoopBackOff until the respective MCPs are unpaused, sometimes causing delays of more than 3 hours.

      While rebooting the MCP1 nodes, ovn-ipsec-host pods on MCP2, MCP3, and MCP4 may enter CrashLoopBackOff. The pods recover only when their respective MCPs are unpaused.
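
      One way to confirm that the crashing pods sit on nodes whose machine config is pinned behind by the pause is to compare each node's current and desired rendered-config annotations (node name is an example from the listing above):

      oc get node sv0a4098.lab-openshift-na.hybrid.sunlifecorp.com \
          -o jsonpath='current: {.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}desired: {.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'

      If the two differ on a paused pool while the ovn-ipsec-host daemonset has already rolled forward, that version skew would be consistent with the pods expecting host-side pluto that the older machine config on paused nodes does not yet enable. This is an assumption based on the symptoms above, not something confirmed in this report.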

      Concern:-

      • Why are the ovn-ipsec pods on other worker nodes in a crash loop state when their MCP is paused and no update/upgrade is running on those nodes?
      • Also, will there be an application outage while the ovn-ipsec pods are in a crash state? (One way to check is sketched below.)
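
      On the outage question, a quick check is whether pluto is up and IPsec SAs are still established on a node with a crashing pod; ipsec whack --trafficstatus lists live SAs and their byte counters (a diagnostic sketch, not a definitive answer):

      oc debug node/sv0a4101.lab-openshift-na.hybrid.sunlifecorp.com -- chroot /host \
          /bin/sh -c 'pgrep pluto; ipsec whack --trafficstatus'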

      This behavior appears to be a bug: on the paused MCPs no upgrade is running, yet the ovn-ipsec pods on those nodes are crashing.

       
