Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.15.z
Affects Version/s: 4.13, 4.12, 4.14, 4.15, 4.16, 4.17
Component/s: Cloud Compute / Unknown
Labels: None
Severity: Moderate
Regression: No
Sprint: CLOUD Sprint 254, CLOUD Sprint 255
sprint_count: 2
Release Blocker: Rejected
Blocked: False
Blocked Reason: None
Release Note Text:

* Previously, during node reboots, especially during update operations, the node that interacts with the rebooting machine entered a `Ready=Unknown` state for a short amount of time. This situation caused the Control Plane Machine Set Operator to enter an `UnavailableReplicas` condition and then an `Available=false` state. The `Available=false` state triggers alerts that demand urgent action, but in this case, intervention was only required for a short period of time until the node rebooted. With this release, a grace period for node unreadiness is provided: if a node enters an unready state, the Control Plane Machine Set Operator does not immediately enter an `UnavailableReplicas` condition or an `Available=false` state. (link:https://issues.redhat.com/browse/OCPBUGS-34971[*OCPBUGS-34971*]).
Release Note Type: Bug Fix
Release Note Status: In Progress
Target Version: 4.15.z
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue OCPBUGS-20061. The following is the description of the original issue:
—

Description of problem:

Possibly reviving OCPBUGS-10771, the control-plane-machine-set ClusterOperator occasionally goes Available=False with reason=UnavailableReplicas. For example, this run includes:

: [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available	1h34m30s
{  3 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

Oct 03 22:03:29.822 - 106s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
Oct 03 22:08:34.162 - 98s   E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)
Oct 03 22:13:01.645 - 118s  E clusteroperator/control-plane-machine-set condition/Available reason/UnavailableReplicas status/False Missing 1 available replica(s)

But those are the nodes rebooting into newer RHCOS, and they do not warrant immediate admin intervention. Teaching the CPMS operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
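
A rough sketch of the requested behavior, in the spirit of the `unreadyNodeGracePeriod` named in the linked pull request. This is Python for illustration only, not the operator's Go code. The five-minute value, the type, and the function names are hypothetical, and the rule that Available follows "ready replicas >= desired replicas" is an assumption based on the "Missing 1 available replica(s)" message.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

# Hypothetical value for illustration; the real grace period is not stated in this issue.
UNREADY_NODE_GRACE_PERIOD = timedelta(minutes=5)


@dataclass
class NodeReadyCondition:
    """Simplified stand-in for a node's Ready condition."""
    status: str                      # "True", "False", or "Unknown"
    last_transition_time: datetime   # when the condition last changed status


def counts_as_ready(cond: NodeReadyCondition, now: datetime,
                    grace: timedelta = UNREADY_NODE_GRACE_PERIOD) -> bool:
    """Treat a node as ready if it is Ready=True, or if it only
    became unready within the grace period (for example, a reboot)."""
    if cond.status == "True":
        return True
    return now - cond.last_transition_time < grace


def is_available(conditions: Iterable[NodeReadyCondition], desired_replicas: int,
                 now: datetime,
                 grace: timedelta = UNREADY_NODE_GRACE_PERIOD) -> bool:
    """Assumed rule: Available=True while enough replicas count as ready."""
    ready = sum(1 for c in conditions if counts_as_ready(c, now, grace))
    return ready >= desired_replicas
```

With a rule like this, the 98s to 118s outages in the run above would stay inside the grace period, while a node that has been unready for longer would still flip the operator to Available=False.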

Version-Release number of selected component (if applicable):

4.15. Possibly all supported versions of the CPMS operator have this exposure.

How reproducible:

Looks like many (all?) 4.15 update jobs have near 100% reproducibility for some kind of issue with CPMS going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today. Feel free to push back if you feel that some of these do warrant immediate admin intervention.

Steps to Reproduce:

w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/control-plane-machine-set+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort

Actual results:

periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 225% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 127% of failures match = 78% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 200% of failures match = 95% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 114% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 143% of failures match = 91% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 41% failed, 207% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 7 runs, 43% failed, 200% of failures match = 86% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn (all) - 6 runs, 50% failed, 33% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 71 runs, 24% failed, 382% of failures match = 92% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 70 runs, 30% failed, 281% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 71 runs, 38% failed, 233% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 171% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 63 runs, 37% failed, 222% of failures match = 81% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 7 runs, 43% failed, 233% of failures match = 100% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 100% of failures match = 54% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 90% of failures match = 56% impact
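
For reading the list above: the numbers suggest that impact is roughly the failed percentage multiplied by the match percentage, and that match values above 100% mean the test-case failure also appeared in runs that did not fail overall. That interpretation is inferred from the figures, not taken from the search tool's documentation. A small Python sketch (helper names are hypothetical) that parses one result line and recomputes the estimate:

```python
import re
from typing import NamedTuple, Optional

# Matches lines of the form shown under "Actual results".
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


class JobImpact(NamedTuple):
    job: str
    runs: int
    failed_pct: int
    match_pct: int
    impact_pct: int


def parse_line(line: str) -> Optional[JobImpact]:
    """Parse one result line; return None if it does not match the format."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return JobImpact(m["job"], int(m["runs"]), int(m["failed"]),
                     int(m["match"]), int(m["impact"]))


def estimated_impact(job: JobImpact) -> float:
    """Assumed relationship: impact ~= failed% * match% / 100.
    The reported inputs are already rounded, so expect small differences."""
    return job.failed_pct * job.match_pct / 100
```

For example, the `e2e-gcp-ovn` line (50% failed, 33% of failures match) gives an estimate of 16.5, against a reported impact of 17%.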

Expected results:

CPMS goes Available=False if and only if immediate admin intervention is appropriate.

is blocked by

OCPBUGS-34970 [release-4.16] control-plane-machine-set goes Available=False with UnavailableReplicas during updates

Closed

is cloned by

OCPBUGS-48211 [release-4.14] control-plane-machine-set goes Available=False with UnavailableReplicas during updates

Closed

is depended on by

OCPBUGS-48211 [release-4.14] control-plane-machine-set goes Available=False with UnavailableReplicas during updates

Closed

links to

openshift/cluster-control-plane-machine-set-operator#299: [release-4.15] OCPBUGS-34971: Add unreadyNodeGracePeriod for allowing brief node hiccups

RHBA-2024:4041 OpenShift Container Platform 4.15.z bug fix update

Assignee: Damiano Donati
Reporter: OpenShift Prow Bot
QA Contact: Milind Yadav
Votes: 0
Watchers: 7
Created: 2024/06/05 6:33 PM
Updated: 2025/01/09 10:08 AM
Resolved: 2024/06/26 12:06 PM
