- Bug
- Resolution: Done
- Normal
- 4.15, 4.16, 4.17, 4.18, 4.19
- Quality / Stability / Reliability
- False
- 5
- OSDOCS Sprint 272, OSDOCS Sprint 273
- 2
Description:
This is not a functional bug but a documentation gap that leads to customer confusion and potential operational impact due to unexpected reboots.
During OpenShift Container Platform (OCP) upgrades on clusters with IPsec enabled, nodes may reboot twice in succession instead of the usual single reboot. This has been observed when upgrading from OCP 4.16 to 4.17, specifically from 4.16.36 to 4.17.27, and is confirmed to be expected behavior.
The first reboot is caused by the Cluster Network Operator (CNO) rolling out the IPsec machine configs. The second reboot is caused by the Machine Config Operator (MCO) rolling out the latest set of machine configs after its own upgrade.
The double reboot was also observed during the 4.14 to 4.15 upgrade (when IPsec moved to host-based), but not from 4.15 to 4.16. Whether it occurs depends on whether the libreswan and NetworkManager-libreswan versions in the target OCP release differ from those in the source OCP release; if they do, the CNO must update its machine config.
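The condition described above can be sketched as a simple version comparison. This is only an illustration of the stated rule, not how the CNO actually makes the decision; the function name and the version strings are hypothetical placeholders and do not correspond to any real OCP release.

```python
# Packages named in this ticket whose version change triggers a CNO machine config update.
IPSEC_PACKAGES = ("libreswan", "NetworkManager-libreswan")


def expects_double_reboot(source_versions, target_versions):
    """Return True if any IPsec package version differs between the two releases.

    Both arguments map package name -> version string. The helper and its
    inputs are hypothetical; real versions would come from the release contents.
    """
    return any(
        source_versions.get(pkg) != target_versions.get(pkg)
        for pkg in IPSEC_PACKAGES
    )


# Placeholder version strings, for illustration only.
source = {"libreswan": "A.1", "NetworkManager-libreswan": "B.1"}
same_target = {"libreswan": "A.1", "NetworkManager-libreswan": "B.1"}
changed_target = {"libreswan": "A.2", "NetworkManager-libreswan": "B.1"}

print(expects_double_reboot(source, same_target))     # no package change
print(expects_double_reboot(source, changed_target))  # libreswan changed
```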
Customers have expressed surprise at this behavior, as it was not always the case, and every node reboot is a significant event for some customers.
Proposed Resolution:
Update the OpenShift Container Platform documentation to state clearly that double node reboots are expected behavior during upgrades of IPsec-enabled clusters, particularly when the libreswan and NetworkManager-libreswan packages are updated.
Specifically, add this information to the networking documentation under the IPsec configuration section. The suggested location is:
This should be done in the OCP documentation for 4.15 and later.
Workaround (to achieve a single reboot):
Users can pause the worker MachineConfigPools during the cluster upgrade and unpause them once it completes. The CNO and MCO changes are then applied together, so the worker nodes reboot only once.
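A minimal sketch of the pause and unpause steps, assuming the `oc` CLI is installed and logged in with sufficient privileges, and that the pool is named `worker`. The helper names and the `dry_run` flag are hypothetical; the underlying operation is a merge patch that sets `spec.paused` on the MachineConfigPool. By default the helper only returns the command so it can be reviewed before anything is run.

```python
from __future__ import annotations

import json
import subprocess


def pause_patch_command(pool: str, paused: bool) -> list[str]:
    """Build the `oc patch` argv that sets spec.paused on a MachineConfigPool."""
    patch = json.dumps({"spec": {"paused": paused}})
    return [
        "oc", "patch", f"machineconfigpool/{pool}",
        "--type", "merge",
        "--patch", patch,
    ]


def set_pool_paused(pool: str, paused: bool, dry_run: bool = True) -> list[str]:
    """Return the command; run it against the cluster only when dry_run is False."""
    cmd = pause_patch_command(pool, paused)
    if not dry_run:
        # Raises CalledProcessError if `oc` exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd
```

In use, `set_pool_paused("worker", True, dry_run=False)` would be called before starting the upgrade and `set_pool_paused("worker", False, dry_run=False)` after it completes. While a pool is paused, its nodes receive no machine config changes at all, so the pool should not stay paused longer than the upgrade requires.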
Additional Information:
- A KCS article summarizing this behavior has been created: https://access.redhat.com/solutions/7124386