[OCPBUGS-47802] Multiple reboots during EUS upgrade on Control Plane nodes - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.14.z, 4.15.z, 4.16.z
Component/s: Machine Config Operator
Labels:
- mco-triaged

Severity:
Moderate
Regression:
None
Blocked:
False
Blocked Reason:
None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Release Note Status:
Done
RH Private Keywords:
Target Version:
4.17.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-46460~~. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42636. The following is the description of the original issue:
—
Description of problem:

    During the EUS-to-EUS upgrade of a multi-node OpenShift (MNO) cluster from 4.14.16 to 4.16.11 on bare metal, we have seen that, depending on the custom configuration (for example a performance profile or a container runtime config), one or more control plane nodes are rebooted multiple times.

This appears to be a race condition. When the first rendered MachineConfig is generated, the first control plane node starts its reboot (maxUnavailable is set to 1 on the master MCP). At that moment a new rendered MachineConfig is generated, which means a second reboot for that node. Once this first node has been rebooted the second time, the remaining control plane nodes are rebooted only once, because no further rendered MachineConfigs are generated.

Version-Release number of selected component (if applicable):

    OCP 4.14.16 > 4.15.31 > 4.16.11

How reproducible:

    Upgrade a multi-node OCP cluster that has a custom configuration such as a performance profile or a container runtime configuration (for example, forcing cgroups v1 or switching from runc to crun).
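
As an illustration of the runc-to-crun customization mentioned above, a container runtime configuration targeting the control plane might look like the sketch below. The object name `enable-crun-master` is hypothetical, and the manifest is not taken from the affected cluster. It assumes the `defaultRuntime` field of `ContainerRuntimeConfig` is available in the installed version.

```yaml
# Sketch only: switches the default container runtime to crun on master nodes.
# The metadata.name value is a placeholder chosen for this example.
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: enable-crun-master
spec:
  machineConfigPoolSelector:
    matchLabels:
      # Label carried by the default master MachineConfigPool
      pools.operator.machineconfiguration.openshift.io/master: ""
  containerRuntimeConfig:
    defaultRuntime: crun
```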

Steps to Reproduce:

    1. Deploy a multi-node (MNO) OCP 4.14 cluster on bare metal with a custom manifest such as the following:

---
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: v1

    2. Upgrade the cluster to the next available minor version, for instance 4.15.31, as a partial upgrade with the worker Machine Config Pool paused.

    3. Monitor the upgrade process (cluster operators, MachineConfigs, Machine Config Pools, and nodes).
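
For step 2, pausing the worker pool amounts to setting `spec.paused` on the worker MachineConfigPool. The fragment below shows only the relevant field and is meant to be merged into the existing object (for example with a patch), not applied as a complete manifest. How the pool was actually paused in the affected cluster is not stated in this report.

```yaml
# Fragment only: the field that pauses rollout of new rendered configs
# to the worker pool. Merge into the existing object; do not apply as-is.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  paused: true
```

Setting `paused` back to `false` after the final hop of the EUS upgrade lets the worker nodes pick up the accumulated changes.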

Actual results:

    Once almost all Cluster Operators, except the Machine Config Operator, are at version 4.15.31, review the rendered MachineConfigs generated for the master Machine Config Pool and monitor the nodes. You will see that a new rendered MachineConfig is generated once the first control plane node has been rebooted.

Expected results:

  During an upgrade, only one rendered MachineConfig should be generated per Machine Config Pool, and each node should need only one reboot to finish the upgrade.

Additional info:

blocks

OCPBUGS-48116 Multiple reboots during EUS upgrade on Control Plane nodes

Closed

clones

OCPBUGS-46460 Multiple reboots during EUS upgrade on Control Plane nodes

Closed

is blocked by

OCPBUGS-46460 Multiple reboots during EUS upgrade on Control Plane nodes

Closed

is cloned by

OCPBUGS-48116 Multiple reboots during EUS upgrade on Control Plane nodes

Closed

links to

openshift/machine-config-operator#4775: OCPBUGS-47802: OCPBUGS-47801: trying to wait for sub-controllers

RHBA-2025:0115 OpenShift Container Platform 4.17.z bug fix update


Assignee: Team MCO

Reporter: OpenShift Prow Bot

QA Contact: Sergio Regidor de la Rosa

Votes: 0

Watchers: 5

Created: 2025/01/06 8:09 PM

Updated: 2025/01/14 9:53 AM

Resolved: 2025/01/14 9:53 AM
