[OCPBUGS-34970] [release-4.16] control-plane-machine-set goes Available=False with UnavailableReplicas during updates

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.0
Affects Version/s: 4.13, 4.12, 4.14, 4.15, 4.16, 4.17
Component/s: Cloud Compute / Unknown
Labels:
None

Severity:
Moderate
Regression:
No
Sprint:
CLOUD Sprint 254, CLOUD Sprint 255
sprint_count:
2
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None

Release Note Text:

* Previously, a node associated with a rebooting machine briefly having a status of `Ready=Unknown` triggered the `UnavailableReplicas` condition in the Control Plane Machine Set Operator.
This condition caused the Operator to enter the `Available=False` state and trigger alerts because that state indicates a nonfunctional component that requires immediate administrator intervention.
This alert should not have been triggered for the brief and expected unavailability while rebooting.
With this release, a grace period for node unreadiness is added to avoid triggering unnecessary alerts.
(link:https://issues.redhat.com/browse/OCPBUGS-34970[*OCPBUGS-34970*])

Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.16.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue OCPBUGS-20061. The following is the description of the original issue:
—

Description of problem:

Possibly reviving OCPBUGS-10771, the control-plane-machine-set ClusterOperator occasionally goes Available=False with reason=UnavailableReplicas. For example, this run includes:

: [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available	1h34m30s
{  3 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Oct 03 22:03:29.822 - 106s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
Oct 03 22:08:34.162 - 98s   E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
Oct 03 22:13:01.645 - 118s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)

But those are the nodes rebooting into newer RHCOS, and they do not warrant immediate admin intervention. Teaching the CPMS operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
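
The requested behavior can be illustrated with a small grace-period check. This is only a sketch, not the operator's actual code (the operator is a separate Go project): the function name `replica_counts_as_available` and the 300-second value are hypothetical, chosen for illustration. The sample durations are the three Available=False intervals quoted above, used here as a stand-in for how long a node was unready.

```python
from datetime import timedelta

# Hypothetical grace period; the value used by the real operator
# is not stated in this issue.
UNREADY_NODE_GRACE_PERIOD = timedelta(seconds=300)


def replica_counts_as_available(unready_for, grace_period=UNREADY_NODE_GRACE_PERIOD):
    """Return True if a replica should still count as available.

    unready_for is how long the node has been not Ready,
    or None if the node is currently Ready.
    """
    if unready_for is None:
        return True
    # A node that has been unready for no longer than the grace period
    # is treated as a brief hiccup (for example, a reboot).
    return unready_for <= grace_period


# Durations of the three Available=False intervals from the CI excerpt.
observed = [timedelta(seconds=s) for s in (106, 98, 118)]
tolerated = [replica_counts_as_available(d) for d in observed]
```

Under this assumed threshold, all three intervals would be tolerated, while a node that stays unready for longer would still count as an unavailable replica.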

Version-Release number of selected component (if applicable):

4.15. Possibly all supported versions of the CPMS operator have this exposure.

How reproducible:

Looks like many (all?) 4.15 update jobs have near 100% reproducibility for some kind of issue with CPMS going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today. Feel free to push back if you feel that some of these do warrant immediate admin intervention.

Steps to Reproduce:

w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort

Actual results:

periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 127% of failures match = 78% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 200% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 114% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 143% of failures match = 91% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 207% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 7 runs, 43% failed, 200% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 71 runs, 24% failed, 382% of failures match = 92% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 70 runs, 30% failed, 281% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 71 runs, 38% failed, 233% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 171% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 63 runs, 37% failed, 222% of failures match = 81% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 100% of failures match = 54% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 90% of failures match = 56% impact
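
To rank the jobs above by impact, the summary lines can be parsed with a short script. This is a sketch that assumes every line follows the exact layout shown in this list; the helper name `parse_summary` and the dictionary keys are hypothetical.

```python
import re

# Matches lines such as:
#   <job> (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse_summary(line):
    """Parse one summary line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "job": m["job"],
        "runs": int(m["runs"]),
        "failed_pct": int(m["failed"]),
        "match_pct": int(m["match"]),
        "impact_pct": int(m["impact"]),
    }


# Two lines copied from the results above.
sample = [
    "periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, "
    "50% failed, 33% of failures match = 17% impact",
    "periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, "
    "41% failed, 207% of failures match = 86% impact",
]

rows = [r for r in map(parse_summary, sample) if r is not None]
# Highest impact first.
rows.sort(key=lambda r: r["impact_pct"], reverse=True)
```

The impact column appears to be roughly the product of the two preceding percentages (for example, 50% × 33% ≈ 17%), with small differences that are presumably due to rounding of the displayed values.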

Expected results:

CPMS goes Available=False if and only if immediate admin intervention is appropriate.

blocks

OCPBUGS-34971 [release-4.15] control-plane-machine-set goes Available=False with UnavailableReplicas during updates

Closed

clones

OCPBUGS-20061 control-plane-machine-set goes Available=False with UnavailableReplicas during updates

Closed

is blocked by

OCPBUGS-20061 control-plane-machine-set goes Available=False with UnavailableReplicas during updates

Closed

links to

openshift/cluster-control-plane-machine-set-operator#298: [release-4.16] OCPBUGS-34970: Add unreadyNodeGracePeriod for allowing brief node hiccups

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update

Errata Tool added a comment - 2024/06/27 11:49 AM

Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

For information on the advisory (Critical: OpenShift Container Platform 4.16.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2024:0041


Milind Yadav added a comment - 2024/06/14 4:54 AM

Based on results -
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/release-openshift-origin-installer-launch-aws-modern/1801449857324421120

(Search for "[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available" in the passed section.)

Moving to VERIFIED


OpenShift Jira Bot added a comment - 2024/06/12 2:45 PM

Hi ddonati@redhat.com,

Bugs should not be moved to Verified without first providing a Release Note Type("Bug Fix" or "No Doc Update") and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the Bug to Verified.


Assignee: Damiano Donati

Reporter: OpenShift Prow Bot

QA Contact: Milind Yadav

Votes: 0

Watchers: 8

Created: 2024/06/05 6:32 PM

Updated: 2024/06/27 11:49 AM

Resolved: 2024/06/27 11:49 AM
