Bug
Resolution: Duplicate
Major
4.16.z
Quality / Stability / Reliability
Moderate
This is a clone of issue OCPBUGS-48116. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-47802. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-46460. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-42636. The following is the description of the original issue:
—
Description of problem:
During the upgrade of an OpenShift cluster from 4.16.37 to 4.16.38 on baremetal, we observed that master nodes are rebooted multiple times (three times instead of the expected single reboot). This exactly matches the behavior described in OCPBUGS-48116, which was supposedly fixed in 4.16.32 via RHSA-2025:0650. The issue occurs in environments with custom container runtime configurations, where a second MachineConfig render is unnecessarily generated during the upgrade process.
How reproducible:
Upgrade a multi-node OCP cluster that has a custom configuration such as a performance profile or a container runtime configuration (for example, forcing cgroups v1 or changing the default runtime from runc to crun).
Steps to Reproduce:
1. Deploy an OCP 4.16.37 cluster on baremetal with a ContainerRuntimeConfig that sets crun as the default runtime (a sketch of such a config is shown after this list)
2. Upgrade the cluster to 4.16.38
3. Monitor the upgrade process (cluster operators, Machine Configs, Machine Config Pools and nodes)
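For reference, a minimal example of the custom configuration from step 1 and the checks used for step 3 is sketched below, assuming the standard MachineConfigPool label; the resource name is illustrative and not taken from the affected cluster:
```
# Sketch only: an example ContainerRuntimeConfig that sets crun as the
# default runtime for the master pool (name and selector are illustrative).
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: enable-crun-master
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  containerRuntimeConfig:
    defaultRuntime: crun
EOF

# Checks used while monitoring the upgrade (step 3): cluster operators,
# rendered MachineConfigs, MachineConfigPools and nodes.
oc get clusteroperators
oc get machineconfigs --sort-by=.metadata.creationTimestamp
oc get machineconfigpools
oc get nodes -o wide
```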
Actual results:
After most of the Cluster Operators are updated to 4.16.38 (except the Machine Config Operator), the following was observed:
1. A rendered machine config (e.g., rendered-master-21[]eee) is generated for the master MCP
2. The first master node begins rebooting
3. While that node is rebooting, another rendered machine config (e.g., rendered-master-6b[]af7) is generated containing an unnecessary container runtime configuration
4. The first master node then has to reboot a second time to apply this new config
5. This significantly extends application downtime and increases resource usage, which has a financial impact for the customer; the duplicate render and the pending second reboot can be observed with the commands sketched after this list.
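As a rough sketch (assuming the standard MCO node annotations; the rendered config names will of course differ), the duplicate render and the pending second reboot can be observed like this:
```
# Two new rendered-master-* configs appear during this upgrade instead of one.
oc get machineconfigs --sort-by=.metadata.creationTimestamp | grep rendered-master

# Rendered config the master pool is currently targeting.
oc get machineconfigpool master -o jsonpath='{.spec.configuration.name}{"\n"}'

# Desired vs. current config per master node; a renewed mismatch after the
# first reboot indicates that a second reboot is still pending.
oc get nodes -l node-role.kubernetes.io/master= -o jsonpath='{range .items[*]}{.metadata.name}{" desired="}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{" current="}{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}{end}'
```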
Expected results:
The expected result is that during an upgrade only one Machine Config render is generated per Machine Config Pool, and each node reboots only once to complete the upgrade.
Additional info:
This issue was supposedly fixed in 4.16.32 (OCPBUGS-48116, RHSA-2025:0650) but is still occurring in 4.16.38.
The diff between the two rendered configs shows the unnecessary container runtime configuration being added:
```
$ diff 04129367/0020-rendered-master-6b[]af7.yaml 04129367/0060-rendered-master-21[]eee.yaml
7c7
< creationTimestamp: "2025-04-23T08:35:09Z"
---
> creationTimestamp: "2025-04-23T08:15:08Z"
9c9
< name: rendered-master-6b[]af7
---
> name: rendered-master-21[]eee
17,18c17,18
< resourceVersion: "1344163"
< uid: 2b4603cb-001e-4353-98ca-81988ecd1a99
---
> resourceVersion: "1338139"
> uid: 43117626-92a8-4204-9fbd-3129e5815ebb
407,412d406
< - contents:
< compression: ""
< source: data:text/plain;charset=utf-8;base64,W2NyaW9dCiAgW2NyaW8ucnVudGltZV0KICAgIGRlZmF1bHRfcnVudGltZSA9ICJjcnVuIgo=
< mode: 420
< overwrite: true
< path: /etc/crio/crio.conf.d/01-ctrcfg-defaultRuntime
```
When decoded, it is
```
[crio]
  [crio.runtime]
    default_runtime = "crun"
```
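The decoded content above can be reproduced directly from the base64 payload embedded in the diff:
```
# Decode the file contents carried by the extra rendered config; the payload
# is everything after "base64," in the data URL shown in the diff above.
echo 'W2NyaW9dCiAgW2NyaW8ucnVudGltZV0KICAgIGRlZmF1bHRfcnVudGltZSA9ICJjcnVuIgo=' | base64 -d
```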
OCPBUGS-42636 Multiple reboots during EUS upgrade on Control Plane nodes - Closed
OCPBUGS-48116 Multiple reboots during EUS upgrade on Control Plane nodes - Closed