OpenShift Hive / HIVE-1599

Bug: Claims in excess of pool size sometimes get deleted when they don't oughtta


    • Resolution: Done

      Creating a number of ClusterClaims that exceeds the ClusterPool.Size will create ClusterDeployments to satisfy all those claims. See the analysis in HIVE-1593 for a walkthrough of this for the edge case where the pool size is zero and we satisfy one claim.

      We decided in the 2021-07-21 scrum that this is the desired behavior (as opposed to failing the claim, or making it wait until the pool has capacity).

      However, if we hit a timing window between the clusterpool_controller and the clusterclaim_controller, such "excess" CDs can sometimes be deleted spuriously. Here's how it happens:

      Note that the clusterclaim_controller's Reconcile just needs to not hit that update before Bob loads the CDs. It doesn't matter if the clusterclaim_controller successfully updates after that point, since Bob is deleting without checking ResourceVersion.

      SafeDelete would narrow the window somewhat, but doesn't fully solve the problem; we need Bob to recognize that Sally is claimed so she can't end up a candidate for deletion.

      We could update the CDs and the claims at the same time (during Fred). But we still have to account for partial failures (the claim update succeeds, but the CD update fails).

      We could add code to iterate over assigned claims (which we ignore today) and try to match them against unclaimed CDs, moving matched CDs over to the claimed bucket and/or updating them on the spot.
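
      A rough sketch of that matching pass, in Go. The claim and clusterDeployment types, their fields, and reconcileAssigned are simplified, hypothetical stand-ins for illustration; they are not Hive's actual types or controller code.

```go
package main

import "fmt"

// claim is a simplified, hypothetical stand-in for a ClusterClaim.
type claim struct {
	name       string
	assignedCD string // empty until the claim has been assigned a CD
}

// clusterDeployment is a simplified, hypothetical stand-in for a CD.
type clusterDeployment struct {
	name      string
	claimName string // empty while the CD still looks unclaimed
}

// reconcileAssigned walks the assigned claims and moves any CD they point at
// out of the unclaimed bucket and into the claimed bucket, stamping the claim
// name on the CD so it can no longer be picked as a deletion candidate.
func reconcileAssigned(claims []claim, unclaimed, claimed []clusterDeployment) (stillUnclaimed, nowClaimed []clusterDeployment) {
	// Index assigned claims by the CD they reference.
	wanted := map[string]string{}
	for _, c := range claims {
		if c.assignedCD != "" {
			wanted[c.assignedCD] = c.name
		}
	}

	nowClaimed = append(nowClaimed, claimed...)
	for _, cd := range unclaimed {
		if claimName, ok := wanted[cd.name]; ok {
			cd.claimName = claimName // a real controller would also persist this
			nowClaimed = append(nowClaimed, cd)
			continue
		}
		stillUnclaimed = append(stillUnclaimed, cd)
	}
	return stillUnclaimed, nowClaimed
}

func main() {
	claims := []claim{
		{name: "claim-1", assignedCD: "sally"}, // assigned, but the CD was not updated yet
		{name: "claim-2"},                      // still pending
	}
	unclaimed := []clusterDeployment{{name: "sally"}, {name: "spare"}}

	stillUnclaimed, nowClaimed := reconcileAssigned(claims, unclaimed, nil)
	fmt.Println("deletion candidates:", stillUnclaimed)
	fmt.Println("claimed:", nowClaimed)
}
```

      Only CDs left in the unclaimed bucket after this pass would be eligible for deletion. The same pass would also repair a partial failure where the claim update succeeded but the CD update did not.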

      See this Slack thread for further discussion, musings, false starts, and more. Bring coffee.

      Done: Excess clusters survive to fulfill their destiny.


            efried.openshift Eric Fried
            efried.openshift Eric Fried
            Lin Wang Lin Wang