OCPBUGS-27359

Spurious "wait has exceeded 40 minutes" when etcd operator briefly goes degraded in late upgrade

    • Release Note Not Required

      This is a clone of issue OCPBUGS-25862. The following is the description of the original issue:

      Description of problem:

      At 17:26:09, the cluster is happily upgrading nodes:

      An update is in progress for 57m58s: Working towards 4.14.1: 734 of 859 done (85% complete), waiting on machine-config
      

      At 17:26:54, the upgrade starts rebooting master nodes and cluster operators (COs) get noisy (this one specifically is OCPBUGS-20061):

      An update is in progress for 58m50s: Unable to apply 4.14.1: the cluster operator control-plane-machine-set is not available
      

      ~Two minutes later, at 17:29:07, CVO starts to shout about waiting on operators for over 40 minutes, despite not having indicated that anything was wrong earlier:

      An update is in progress for 1h1m2s: Unable to apply 4.14.1: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
      

      This is only because these operators go briefly degraded during the master reboot (which they shouldn't, but that is a different story). CVO computes its 40-minute threshold against the time when it first started to upgrade the given operator, so it (see the sketch after this list):

      1. Upgrades etcd / KAS very early in the upgrade, noting the time when it started to do that
      2. These two COs upgrade successfully and the upgrade proceeds
      3. Eventually the cluster starts rebooting masters and etcd/KAS go degraded
      4. CVO compares the current time against the noted time, discovers that more than 40 minutes have passed, and starts warning about it.
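
      The following is a minimal, self-contained Go sketch of the timing behaviour described above. The names (operatorWaitStart, noteUpgradeStart, reportSlowOperators) and the threshold handling are hypothetical illustrations, not the actual cluster-version-operator code; the point is only that the 40-minute comparison is made against when the operator's upgrade first started, not against how long the operator has currently been unhealthy.

      // Hypothetical sketch of the reported behaviour; not the real CVO implementation.
      package main

      import (
          "fmt"
          "time"
      )

      // operatorWaitStart records, in memory only, when CVO first started
      // working on each cluster operator during this upgrade.
      var operatorWaitStart = map[string]time.Time{}

      // noteUpgradeStart is called once per operator, early in the upgrade.
      func noteUpgradeStart(name string, now time.Time) {
          if _, ok := operatorWaitStart[name]; !ok {
              operatorWaitStart[name] = now
          }
      }

      // reportSlowOperators reproduces the spurious warning: any operator that
      // is currently unhealthy is compared against the time noted when its
      // upgrade started, not against how long it has been unhealthy right now.
      func reportSlowOperators(unhealthy []string, now time.Time) string {
          var slow []string
          for _, name := range unhealthy {
              if start, ok := operatorWaitStart[name]; ok && now.Sub(start) > 40*time.Minute {
                  slow = append(slow, name)
              }
          }
          if len(slow) == 0 {
              return ""
          }
          return fmt.Sprintf("wait has exceeded 40 minutes for these operators: %v", slow)
      }

      func main() {
          start := time.Now()
          // etcd and kube-apiserver are upgraded very early and finish quickly...
          noteUpgradeStart("etcd", start)
          noteUpgradeStart("kube-apiserver", start)
          // ...but an hour later a master reboot makes them briefly degraded,
          // and the check fires immediately because an hour > 40 minutes.
          later := start.Add(time.Hour)
          fmt.Println(reportSlowOperators([]string{"etcd", "kube-apiserver"}, later))
      }

      Because the noted times in this sketch live only in process memory, restarting CVO discards them, which is why reproduction condition 2 below requires rebooting a master that does not host CVO.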

      Version-Release number of selected component (if applicable):

      all

      How reproducible:

      Not entirely deterministic:

      1. the upgrade must run for more than 40 minutes between upgrading etcd and starting to upgrade nodes
      2. the upgrade must reboot a master that is not running CVO (otherwise a new CVO instance starts without the saved times; they are only kept in memory)

      Steps to Reproduce:

      1. Watch oc adm upgrade during the upgrade

      Actual results:

      A spurious "waiting for over 40m" message pops up out of the blue

      Expected results:

      CVO simply says "waiting up to 40m on" and this eventually goes away as the node comes back up and etcd stops being degraded.
