  1. OpenShift Bugs
  2. OCPBUGS-18984

Potentially inconsistent snapshots taken from UpgradeBackupController on z releases


    • Bug
    • Resolution: Done-Errata
    • Major
    • 4.15.0
    • 4.13, 4.12, 4.11, 4.10, 4.14, 4.15
    • Etcd
    • None
    • Moderate
    • No
    • 3
    • ETCD Sprint 242, ETCD Sprint 243, ETCD Sprint 244
    • 3
    • Rejected
    • False
    • N/A
    • Release Note Not Required

      Description of problem:

      With the fix for BZ 2079803 [1] we introduced a backup trigger on every z-release (instead of only on every y-release). Sadly, we did not update the matching CVO logic [2], which is the part that actually holds the upgrade back until a snapshot has been taken.
      
      Currently we have a split state machine (thanks Trevor):
      
      ... today we have this for minor updates:
      1. User bumps ClusterVersion spec asking for a minor update
      2. CVO checks for a recent etcd backup.  Until it is available, we refuse to accept the retarget request.
      3. Once the etcd backup is available (assuming no other precondition issues), we accept the retarget and start updating.
      
      While for patch updates:
      1. User bumps ClusterVersion spec asking for a patch update.
      2. CVO accepts the retarget, sets status.desired, and starts in on the update.
      
      
      In the latter case (patch updates), the CEO might take its snapshot while the upgrade is already running (race condition). This creates an inconsistent snapshot, which on restore would just re-attempt the (botched) upgrade.
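
The split described above can be illustrated with a minimal Go sketch. This is not the CVO's actual precondition code: the helper names `parseMinor` and `waitsForBackup` and the version strings are assumptions made up for this example. It only shows why a patch update is accepted without waiting for a backup, which is what opens the race window.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMinor extracts the major and minor components from a version such as
// "4.13.2". Hypothetical helper, for illustration only.
func parseMinor(version string) (int, int, error) {
	parts := strings.Split(version, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, err
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// waitsForBackup models the split state machine: only a change of the major
// or minor version makes the update wait for a recent etcd backup. A patch
// (z-stream) update is accepted immediately.
func waitsForBackup(current, desired string) (bool, error) {
	curMajor, curMinor, err := parseMinor(current)
	if err != nil {
		return false, err
	}
	desMajor, desMinor, err := parseMinor(desired)
	if err != nil {
		return false, err
	}
	return curMajor != desMajor || curMinor != desMinor, nil
}

func main() {
	for _, target := range []string{"4.14.0", "4.13.5"} {
		wait, err := waitsForBackup("4.13.2", target)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("4.13.2 -> %s: waits for backup = %v\n", target, wait)
	}
}
```

In this sketch the patch update returns `false`, so nothing orders the backup before the start of the update.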
      
      
      [1] https://github.com/openshift/cluster-etcd-operator/pull/835
      [2] https://github.com/openshift/cluster-version-operator/blob/master/pkg/payload/precondition/clusterversion/etcdbackup.go#L76-L77
      
      
      

      Version-Release number of selected component (if applicable):

      any OCP > 4.10

      How reproducible:

      almost always (race condition between CEO and CVO)

      Steps to Reproduce:

      1. trigger a z-upgrade
      2. observe when the etcd backup is taken, it might happen after the upgrade is already in progress
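
One way to make step 2 concrete is to compare two timestamps: when the backup finished and when the update started (for example, the start time recorded for the update in the ClusterVersion history). The Go sketch below assumes both timestamps have already been collected by hand; the function names and the sample timestamps are hypothetical and only illustrate the comparison.

```go
package main

import (
	"fmt"
	"time"
)

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
// Hypothetical helper to keep the example short.
func mustParse(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

// snapshotIsPreUpgrade reports whether the backup finished strictly before
// the update started. Anything else means the snapshot may already contain
// state written by the upgrade.
func snapshotIsPreUpgrade(backupDone, updateStarted time.Time) bool {
	return backupDone.Before(updateStarted)
}

func main() {
	// Made-up sample values, not taken from a real cluster.
	updateStarted := mustParse("2023-09-14T10:00:00Z")
	backupDone := mustParse("2023-09-14T10:02:30Z")

	if !snapshotIsPreUpgrade(backupDone, updateStarted) {
		fmt.Printf("backup finished %s after the update started; snapshot may be inconsistent\n",
			backupDone.Sub(updateStarted))
	}
}
```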
      
      

      Actual results:

      The snapshot that was created contains parts of the newly upgraded OCP (CVO CRD or any other operator state). 

      Expected results:

      The snapshot should not contain any state introduced by the z-upgrade. 

      Additional info:

      Either the CVO should also wait on z-upgrades, so that snapshots are consistently taken from a pre-upgrade state, or we should revert the backup trigger on z-stream upgrades again.
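
The first option could look roughly like the Go sketch below: every version change, minor or patch, is held back until a recent backup exists. The function `acceptRetarget` and its parameters are hypothetical and do not mirror the CVO's real precondition interface.

```go
package main

import "fmt"

// acceptRetarget sketches the "CVO also waits on z-upgrades" option: any
// change of the target version is only accepted once a recent etcd backup
// is available, regardless of whether it is a minor or a patch update.
func acceptRetarget(current, desired string, recentBackup bool) bool {
	if current == desired {
		return true // no update requested, so no backup is needed
	}
	return recentBackup
}

func main() {
	fmt.Println(acceptRetarget("4.13.2", "4.13.5", false)) // false: patch update waits
	fmt.Println(acceptRetarget("4.13.2", "4.13.5", true))  // true: backup exists, update may start
}
```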

      wcabanba@redhat.com and our team decided to entirely remove the controller.

      trking will drop the corresponding requirement in the CVO. 

       

            melbeher@redhat.com Mustafa Elbehery
            tjungblu@redhat.com Thomas Jungblut
            Ge Liu