Type: Story
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Labels:
- CI

Activity Type:
None
Blocked:
False
Blocked Reason:
None
Ready:
False
Epic Link:
None
Story Points:
None

Target Version:
None
Release Blocker:
None
Sprint:
None

Since HIVE-1576, you can specify ClusterPool.Spec.RunningCount to keep some number of unclaimed pool clusters active.

If your pool is in steady state and you create #claims new claims:
- If #claims <= RunningCount, all of those claims will get Running ClusterDeployments (CDs).
- If RunningCount < #claims <= Size, RunningCount of them will be Running when claimed, and the remaining #claims - RunningCount will be kicked from Hibernating to Running.
- If #claims > Size, Size of them will behave as above, and #claims new CDs will be provisioned so that the remaining #claims - Size claims can be assigned once those CDs finish installing.
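The case analysis above is arithmetic on Size, RunningCount and #claims. Below is a minimal sketch that assumes a steady-state pool with Size unclaimed CDs, RunningCount of them Running. The `claim_outcome` function and `ClaimOutcome` type are hypothetical illustrations and are not Hive's API:

```python
from dataclasses import dataclass


@dataclass
class ClaimOutcome:
    running_on_claim: int  # claims that get an already-Running CD
    resumed: int           # claims whose CD is kicked from Hibernating to Running
    waiting: int           # claims that wait for a newly provisioned CD to install


def claim_outcome(size: int, running_count: int, claims: int) -> ClaimOutcome:
    """Hypothetical model of how `claims` new claims are served by a pool in
    steady state: `size` unclaimed CDs, `running_count` of them Running."""
    from_pool = min(claims, size)            # the pool can serve at most Size claims
    running = min(from_pool, running_count)  # the Running CDs are claimed first
    return ClaimOutcome(
        running_on_claim=running,
        resumed=from_pool - running,
        waiting=claims - from_pool,
    )


if __name__ == "__main__":
    # Size=5, RunningCount=2, seven claims
    print(claim_outcome(size=5, running_count=2, claims=7))
    # ClaimOutcome(running_on_claim=2, resumed=3, waiting=2)
```

For example, with Size=5 and RunningCount=2, four claims yield two Running CDs immediately and two resumed from Hibernating. Seven claims yield two Running, three resumed, and two claims left waiting for new CDs.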

In this last scenario, if RunningCount >= #claims - Size, you got lucky: your excess claims will get CDs that were provisioned with PowerState=Running. But if RunningCount < #claims - Size, then #claims - Size - RunningCount of them will get CDs provisioned with PowerState=Hibernating. As soon as those CDs finish installing they'll be assigned, whereupon they're kicked to Running, but there will be some interval during which hibernation could have taken effect [1][2].
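The split in this last scenario can be sketched the same way. The helper `excess_power_states` is hypothetical. It follows the description directly: of the #claims - Size excess claims, at most RunningCount get CDs provisioned Running, and the rest (max(0, #claims - Size - RunningCount)) get CDs provisioned Hibernating.

```python
from typing import Tuple


def excess_power_states(size: int, running_count: int, claims: int) -> Tuple[int, int]:
    """Hypothetical helper: split the #claims - Size excess claims by the
    PowerState of the CD provisioned for them.

    Returns (provisioned_running, provisioned_hibernating).
    """
    excess = max(0, claims - size)
    # Per the description, only RunningCount of the new CDs are provisioned Running...
    provisioned_running = min(excess, running_count)
    # ...so max(0, #claims - Size - RunningCount) excess claims get Hibernating ones.
    provisioned_hibernating = excess - provisioned_running
    return provisioned_running, provisioned_hibernating


if __name__ == "__main__":
    print(excess_power_states(size=5, running_count=2, claims=6))   # (1, 0): lucky
    print(excess_power_states(size=5, running_count=2, claims=10))  # (2, 3): unlucky
```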

The point of this card is to mitigate that last scenario such that those excess CDs are provisioned with PowerState=Running.

Except that's not quite right. Because we want to enforce FIFO behavior, we always want the oldest CDs to be Running, and the oldest CDs to be claimed. I think those aren't necessarily (all of) the CDs we're provisioning to handle the excess.
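To make the FIFO requirement concrete, here is a minimal sketch. It assumes each unclaimed CD carries a creation timestamp: the oldest RunningCount CDs are the ones kept Running, and the oldest CD overall is the next to be claimed. The `PoolCD` type and both functions are hypothetical; this is not the Hive controller's logic.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set


@dataclass
class PoolCD:
    name: str
    created: datetime  # hypothetical stand-in for the CD's creation timestamp


def fifo_running(cds: List[PoolCD], running_count: int) -> Set[str]:
    """Names of the unclaimed CDs to keep Running: the oldest running_count."""
    oldest_first = sorted(cds, key=lambda cd: cd.created)
    return {cd.name for cd in oldest_first[:running_count]}


def fifo_next_claimed(cds: List[PoolCD]) -> str:
    """Name of the CD the next claim should get: the oldest unclaimed one."""
    return min(cds, key=lambda cd: cd.created).name


if __name__ == "__main__":
    pool = [
        PoolCD("cd-c", datetime(2021, 9, 22, 12, 0)),
        PoolCD("cd-a", datetime(2021, 9, 22, 10, 0)),
        PoolCD("cd-b", datetime(2021, 9, 22, 11, 0)),
    ]
    print(sorted(fifo_running(pool, running_count=2)))  # ['cd-a', 'cd-b']
    print(fifo_next_claimed(pool))                      # cd-a
```

Both functions use the same ordering, so whenever RunningCount > 0 the next claim lands on a Running CD.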

So I think the answer here is that we should calculate and use an "effective RunningCount", which is the original RunningCount plus some number calculated based on claims in excess of the pool size (zero when there are no excess claims). My brain doesn't know exactly what that calculation should be right now, but that's the idea.
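The following is offered only to show the shape of the idea. It is the most naive candidate: a hypothetical function that adds the full excess (#claims - Size, floored at zero) to RunningCount. It satisfies the "zero when there are no excess claims" property. It does not address the FIFO concern above, and it is not necessarily the calculation that ended up in openshift/hive#1567.

```python
def effective_running_count(running_count: int, size: int, claims: int) -> int:
    """HYPOTHETICAL candidate, not necessarily what this card implemented:
    RunningCount plus the number of claims in excess of the pool Size.
    The added term is zero whenever #claims <= Size."""
    excess = max(0, claims - size)
    return running_count + excess


if __name__ == "__main__":
    print(effective_running_count(running_count=2, size=5, claims=3))   # 2
    print(effective_running_count(running_count=2, size=5, claims=10))  # 7
```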

[1] Note that this scenario already existed prior to RunningCount: it applied any time a claim was waiting for the pool to finish installing a CD.
[2] Hibernation is mostly a black box to Hive. We trigger machine (de)activations, but don't really know or control what happens at the cloud provider. For example, we don't know whether an in-progress hibernation is reversible, or whether it needs to finish hibernating before it turns around and starts back up. And the answer might differ between cloud providers. What I'm getting at is that we don't really know the cost of this edge case.

relates to

HIVE-1576 ClusterPools: Keep it hot

Closed

links to

openshift/hive#1528: ClusterPool RunningCount

openshift/hive#1567: ClusterPool: running clusters for excess claims

Assignee: Eric Fried

Reporter: Eric Fried

Need Info From: None

Contributors: None

QA Contact: Lin Wang

Doc Contact: None

Votes: 0

Watchers: 2

Created: 2021/09/22 3:23 PM

Updated: 2022/09/09 7:14 AM

Resolved: 2021/10/18 3:27 AM
