Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.15.0
Affects Version/s: 4.13, 4.12, 4.11, 4.10, 4.14, 4.15
Component/s: Etcd
Labels: None
Severity: Moderate
Regression: No
Story Points: 3
Sprint: ETCD Sprint 242, ETCD Sprint 243, ETCD Sprint 244
sprint_count: 3
Release Blocker: Rejected
Blocked: False
Blocked Reason: None
Release Note Text: N/A
Release Note Type: Release Note Not Required
Target Version: 4.15.0


Description of problem:

With the fix for BZ 2079803 [1], we introduced a backup trigger on every z-release (instead of only on every y-release). Unfortunately, we did not update the CVO logic [2] along with it. That logic is what holds the upgrade until a snapshot has been taken, and it still applies only to minor updates.

Currently we have a split state machine (thanks Trevor):

... today we have this for minor updates:
1. User bumps ClusterVersion spec asking for a minor update
2. CVO checks for a recent etcd backup.  Until it is available, we refuse to accept the retarget request.
3. Once the etcd backup is available (assuming no other precondition issues), we accept the retarget and start updating.

While for patch updates:
1. User bumps ClusterVersion spec asking for a patch update.
2. CVO accepts the retarget, sets status.desired, and starts the update.
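
A minimal Go sketch of the split state machine described above, to make the gap explicit. This is an illustrative model only, not the CVO's code: the type UpdateKind and the functions acceptsRetarget and snapshotMayRaceUpdate are hypothetical names, and other preconditions are ignored.

```go
package main

import "fmt"

// UpdateKind distinguishes the two update paths (hypothetical type).
type UpdateKind int

const (
	MinorUpdate UpdateKind = iota // y-stream update
	PatchUpdate                   // z-stream update
)

// acceptsRetarget models whether the retarget request is accepted.
// Minor updates are gated on a recent etcd backup; patch updates are
// accepted right away, regardless of backup state.
func acceptsRetarget(kind UpdateKind, recentBackup bool) bool {
	if kind == MinorUpdate {
		return recentBackup
	}
	return true
}

// snapshotMayRaceUpdate reports whether the update is allowed to start
// while the backup is still outstanding, which is the window in which
// a snapshot can capture partially upgraded state.
func snapshotMayRaceUpdate(kind UpdateKind, recentBackup bool) bool {
	return acceptsRetarget(kind, recentBackup) && !recentBackup
}

func main() {
	for _, kind := range []UpdateKind{MinorUpdate, PatchUpdate} {
		fmt.Printf("kind=%d accepted=%t mayRace=%t\n",
			kind,
			acceptsRetarget(kind, false),
			snapshotMayRaceUpdate(kind, false))
	}
}
```

In this model only the patch path can start an update while the backup is still pending.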


In the latter case (patch updates), the CEO may take a snapshot while the upgrade is already running (race condition). This creates an inconsistent snapshot, which on restore would just re-attempt the (botched) upgrade.


[1] https://github.com/openshift/cluster-etcd-operator/pull/835
[2] https://github.com/openshift/cluster-version-operator/blob/master/pkg/payload/precondition/clusterversion/etcdbackup.go#L76-L77

Version-Release number of selected component (if applicable):

any OCP > 4.10

How reproducible:

almost always (race condition between CEO and CVO)

Steps to Reproduce:

1. Trigger a z-upgrade.
2. Observe when the etcd backup is taken; it might happen after the upgrade is already in progress.

Actual results:

The snapshot that was created contains parts of the newly upgraded OCP (CVO CRD or any other operator state).

Expected results:

The snapshot should not contain any information that could come through with the z-upgrade.

Additional info:

Either the CVO should also wait on z-upgrades to ensure the snapshots consistently reflect a pre-upgrade state, or we revert the z-stream upgrade behavior again.
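
The two alternatives above can be sketched in Go as follows. This is a hedged illustration under stated assumptions, not code from either operator: majorMinor, isPatchUpdate, mayStartUpdate, and backupTriggered are hypothetical helpers, and versions are assumed to be dotted strings whose first two components are integers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorMinor extracts the first two numeric components of a dotted
// version string (hypothetical helper).
func majorMinor(version string) (int, int, error) {
	parts := strings.Split(version, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("version %q has no minor component", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("bad major in %q: %w", version, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("bad minor in %q: %w", version, err)
	}
	return major, minor, nil
}

// isPatchUpdate reports whether current -> desired keeps the same
// major.minor, i.e. whether it is a z-stream update.
func isPatchUpdate(current, desired string) (bool, error) {
	curMajor, curMinor, err := majorMinor(current)
	if err != nil {
		return false, err
	}
	desMajor, desMinor, err := majorMinor(desired)
	if err != nil {
		return false, err
	}
	return curMajor == desMajor && curMinor == desMinor, nil
}

// Alternative 1: gate every update, minor or patch, on a recent backup.
func mayStartUpdate(recentBackup bool) bool {
	return recentBackup
}

// Alternative 2: revert the z-stream behavior, so a backup is only
// triggered when the update is not a patch update.
func backupTriggered(current, desired string) (bool, error) {
	patch, err := isPatchUpdate(current, desired)
	if err != nil {
		return false, err
	}
	return !patch, nil
}

func main() {
	triggered, err := backupTriggered("4.14.1", "4.14.2")
	fmt.Println(triggered, err)
	fmt.Println(mayStartUpdate(false))
}
```

Either alternative closes the race in this model: the first by never starting before the backup exists, the second by never taking a backup on the ungated path.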

—

wcabanba@redhat.com and our team decided to entirely remove the controller.

trking will drop the requirement in the CVO.

blocks

OCPBUGS-22477 [4.14] Remove z-upgrades from UpgradeBackupController

Closed

is cloned by

OCPBUGS-22477 [4.14] Remove z-upgrades from UpgradeBackupController

Closed

relates to

OCPBUGS-20128 Updating Cluster documentation should suggest backup of etcd

Closed

links to

openshift/cluster-etcd-operator#1129: OCPBUGS-18984: remove UpgradeBackupController

openshift/cluster-version-operator#968: OCPBUGS-18984: pkg/payload/precondition/clusterversion/etcdbackup: Drop precondition

RHEA-2023:7198 rpm


Assignee: Mustafa Elbehery

Reporter: Thomas Jungblut

QA Contact: Ge Liu

Votes: 0

Watchers: 8

Created: 2023/09/14 8:22 AM

Updated: 2024/02/27 8:51 PM

Resolved: 2024/02/27 8:51 PM
