Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: 4.16
Affects Version/s: 4.15
Component/s: Machine Config Operator
Labels:
- mco-triaged

Test Coverage:
Regression:
No
Epic Link:
On Cluster Layering Tech Preview
Sprint:
MCO Sprint 248, MCO Sprint 249
sprint_count:
2
Blocked:
False
Blocked Reason:
None
Release Note Text:
OCB opt-in process should consider both the nodes' MachineConfigs as well as images
Release Note Type:
Bug Fix
Release Note Status:
In Progress
Target Version:
4.16.0
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

When opting into on-cluster builds on both the worker and control plane MachineConfigPools, the maxUnavailable value on the MachineConfigPools is not respected when the newly built image is rolled out to all of the nodes in a given pool.

Version-Release number of selected component (if applicable):

How reproducible:

Intermittently reproducible. I'm still working out which conditions need to be present for this to occur.

Steps to Reproduce:

    1. Opt an OpenShift cluster into on-cluster builds by following these instructions: https://github.com/openshift/machine-config-operator/blob/master/docs/OnClusterBuildInstructions.md
    2. Ensure that both the worker and control plane MachineConfigPools are opted in.

Actual results:

Multiple nodes in both the control plane and worker MachineConfigPools are cordoned and drained simultaneously, irrespective of the maxUnavailable value. This is particularly problematic for control plane nodes, since draining more than one control plane node at a time can cause etcd issues. In addition, PDBs (Pod Disruption Budgets) can make the config change take substantially longer or block it completely.

I've mostly seen this issue affect control plane nodes, but I've also seen it impact both control plane and worker nodes.

Expected results:

I would have expected the new OS image to be rolled out the same way that new MachineConfigs are rolled out. In other words, a single node (or up to maxUnavailable nodes for non-control-plane pools) is cordoned, drained, updated, and uncordoned at a time.
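The expected behavior can be sketched as a simple capacity check. This is only an illustration, not the MCO's actual code: the `pool` type and `updateCapacity` function are hypothetical, maxUnavailable is assumed to be already resolved to an absolute node count, an unset value is assumed to default to 1, and control plane pools are assumed to be limited to one node at a time.

```go
package main

import "fmt"

// pool is a hypothetical, simplified stand-in for a MachineConfigPool.
type pool struct {
	name           string
	maxUnavailable int  // assumed to be an absolute node count, not a percentage
	controlPlane   bool // true for the control plane pool
}

// updateCapacity returns how many additional nodes may be cordoned and
// drained right now, given how many nodes in the pool are already unavailable.
func updateCapacity(p pool, unavailable int) int {
	limit := p.maxUnavailable
	if limit < 1 {
		limit = 1 // assumption: an unset value behaves like 1
	}
	if p.controlPlane {
		limit = 1 // assumption: control plane nodes update one at a time
	}
	if unavailable >= limit {
		return 0
	}
	return limit - unavailable
}

func main() {
	worker := pool{name: "worker", maxUnavailable: 3}
	master := pool{name: "master", maxUnavailable: 3, controlPlane: true}

	fmt.Println(updateCapacity(worker, 1)) // 2: two more workers may start updating
	fmt.Println(updateCapacity(master, 0)) // 1: a single control plane node may start
	fmt.Println(updateCapacity(master, 1)) // 0: one is already down, so wait
}
```

With a check like this, the behavior described under "Actual results" would correspond to more nodes being selected than the returned capacity allows.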

Additional info:

I suspect the bug may be someplace within the NodeController since that's the part of the MCO that controls which nodes update at a given time. That said, I've had difficulty reliably reproducing this issue, so finding a root cause could be more involved. This also seems to be mostly confined to the initial opt-in process. Subsequent updates seem to follow the original "rules" more closely.
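The release note and the linked pull request title both point at node state being judged by the MachineConfig and the OS image together. A minimal sketch of such a check follows. It is an assumption about the shape of the fix, not a copy of it: `nodeUpToDate` and `countPending` are hypothetical helpers, nodes are reduced to their annotation maps, and the annotation keys are assumed to follow the `machineconfiguration.openshift.io/` prefix used by the MCO.

```go
package main

import "fmt"

// Annotation keys, assumed here for illustration.
const (
	currentConfigKey = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigKey = "machineconfiguration.openshift.io/desiredConfig"
	currentImageKey  = "machineconfiguration.openshift.io/currentImage"
	desiredImageKey  = "machineconfiguration.openshift.io/desiredImage"
)

// nodeUpToDate reports whether a node, described only by its annotations,
// has reached both its desired MachineConfig and its desired OS image.
// Missing image annotations compare as equal empty strings, so a node in a
// pool that has not opted in is judged on its config alone.
func nodeUpToDate(annotations map[string]string) bool {
	if annotations[currentConfigKey] != annotations[desiredConfigKey] {
		return false
	}
	return annotations[currentImageKey] == annotations[desiredImageKey]
}

// countPending counts nodes that still have work outstanding and should
// therefore count against the pool's maxUnavailable budget.
func countPending(nodes []map[string]string) int {
	pending := 0
	for _, annotations := range nodes {
		if !nodeUpToDate(annotations) {
			pending++
		}
	}
	return pending
}

func main() {
	nodes := []map[string]string{
		// Config matches, but the newly built image has not been applied yet.
		{
			currentConfigKey: "rendered-worker-a",
			desiredConfigKey: "rendered-worker-a",
			currentImageKey:  "",
			desiredImageKey:  "registry.example.com/os@sha256:abc",
		},
		// Fully up to date.
		{
			currentConfigKey: "rendered-worker-a",
			desiredConfigKey: "rendered-worker-a",
			currentImageKey:  "registry.example.com/os@sha256:abc",
			desiredImageKey:  "registry.example.com/os@sha256:abc",
		},
	}
	fmt.Println(countPending(nodes)) // 1
}
```

A check that compared only the config annotations would treat the first node as finished, which would be consistent with the issue being mostly confined to the initial opt-in, when the rendered config is unchanged but the image is new. That reading is a hypothesis, not a confirmed root cause.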

relates to

MCO-759 BuildController should be aware of OpenShift upgrades (Closed)

links to

openshift/machine-config-operator#4135: OCPBUGS-24705: consider currentImage and desiredImage annotations

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update

Assignee:: Zack Zlotnik

Reporter:: Zack Zlotnik

QA Contact:: Sergio Regidor de la Rosa

Votes:: 0

Watchers:: 5

Created:: 2023/12/08 7:02 PM

Updated:: 2024/06/27 11:34 AM

Resolved:: 2024/03/06 9:20 PM
