Description of problem:
With the fix for BZ 2079803 [1] we introduced a backup trigger on every z-release (instead of every y-release). Sadly, we did not update the CVO logic [2] along with it. That logic holds an upgrade until a snapshot has been taken, but only for minor updates.

Currently we have a split state machine (thanks Trevor):

... today we have this for minor updates:

1. User bumps ClusterVersion spec asking for a minor update.
2. CVO checks for a recent etcd backup. Until it is available, we refuse to accept the retarget request.
3. Once the etcd backup is available (assuming no other precondition issues), we accept the retarget and start updating.

While for patch updates:

1. User bumps ClusterVersion spec asking for a patch update.
2. CVO accepts the retarget, sets status.desired, and starts in on the update.

In the latter case, the CEO might take a snapshot while the upgrade is already running (race condition). This creates an inconsistent snapshot, which on restore would just re-attempt to execute the (botched) upgrade.

[1] https://github.com/openshift/cluster-etcd-operator/pull/835
[2] https://github.com/openshift/cluster-version-operator/blob/master/pkg/payload/precondition/clusterversion/etcdbackup.go#L76-L77
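
To make the split easier to see, here is a minimal Go sketch of the two paths described above. It is not the CVO's actual code: the names UpdateKind, parse, kindOf and acceptRetarget are hypothetical and only model the behavior as described in this report.

```go
package main

import "fmt"

// UpdateKind is a hypothetical classification of an update request.
type UpdateKind int

const (
	Patch UpdateKind = iota // z-stream update, e.g. 4.11.1 -> 4.11.2
	Minor                   // y-stream update, e.g. 4.10.5 -> 4.11.0
)

// version is a simplified major.minor.patch triple (assumption: no pre-release suffixes).
type version struct{ major, minor, patch int }

// parse reads a "major.minor.patch" string into a version.
func parse(s string) (version, error) {
	var v version
	_, err := fmt.Sscanf(s, "%d.%d.%d", &v.major, &v.minor, &v.patch)
	return v, err
}

// kindOf treats any change in major or minor as a minor update, otherwise a patch update.
func kindOf(from, to version) UpdateKind {
	if from.major != to.major || from.minor != to.minor {
		return Minor
	}
	return Patch
}

// acceptRetarget models the split state machine from this report:
// minor updates wait for an etcd backup, patch updates are accepted immediately.
func acceptRetarget(kind UpdateKind, backupAvailable bool) bool {
	if kind == Minor {
		return backupAvailable
	}
	return true
}

func main() {
	from, _ := parse("4.11.1")
	to, _ := parse("4.11.2")
	// A patch update is accepted even though no backup exists yet,
	// which is the window in which the race with the CEO can occur.
	fmt.Println(acceptRetarget(kindOf(from, to), false))
}
```

In this model, the patch path never consults backupAvailable, so a backup triggered by the same z-release can land after the update has started.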
Version-Release number of selected component (if applicable):
any OCP > 4.10
How reproducible:
almost always (race condition between CEO and CVO)
Steps to Reproduce:
1. Trigger a z-upgrade.
2. Observe when the etcd backup is taken; it might happen after the upgrade is already in progress.
Actual results:
The snapshot that was created contains parts of the newly upgraded OCP (CVO CRD or any other operator state).
Expected results:
The snapshot should not contain any information that could come through with the z-upgrade.
Additional info:
Either the CVO should also wait on z-upgrades to ensure the snapshots consistently reflect a pre-upgrade state, or we revert the z-stream upgrade behavior again.
—
wcabanba@redhat.com and our team decided to entirely remove the controller.
trking to drop the requirement in CVO.
- blocks: OCPBUGS-22477 [4.14] Remove z-upgrades from UpgradeBackupController (Closed)
- is cloned by: OCPBUGS-22477 [4.14] Remove z-upgrades from UpgradeBackupController (Closed)
- relates to: OCPBUGS-20128 Updating Cluster documentation should suggest backup of etcd (Closed)
- links to: RHEA-2023:7198 rpm