Epic
Resolution: Done
MachinePool/MachineSet matching when subnets (un)specified
Background
HIVE-2254 describes hive's machinepool controller failing to match (and therefore reconcile) MachineSets on vSphere when <the installer version used to create the spoke cluster> and <the version of vendored installer code in hive> straddled OCPSTRAT-153, the change with which that code started generating MachineSets named with numeric suffixes according to failure domain (aka availability zone).
Ultimately the fix in that card entailed changing hive code to stop using name string matching entirely, and instead:
- Match MachineSets to MachinePools via labels
- Match <MachineSets generated by the controller> to <existing MachineSets on the spoke> based on matching their failure domains
The second change resulted in the bug that is the subject of this card.
The Bug
For AWS, installer code generates the Subnet component of the providerConfig for a MachineSet differently depending on whether subnets were provided in the input:
- If subnets were provided, providerConfig.Subnet identifies the subnet by ID.
- If subnets were not provided, providerConfig.Subnet uses Filters by tag:Name instead.
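To make the difference concrete, here is a minimal Python sketch of the two shapes. The dict layout is a simplified stand-in for providerConfig.Subnet (an ID versus a list of filters). The helper name `subnet_ref`, the tag value format, and the plain equality comparison are illustrative assumptions, not installer or hive code.

```python
def subnet_ref(zone, subnet_id=None, name_prefix="mycluster-private"):
    """Return a simplified stand-in for providerConfig.Subnet.

    Hypothetical helper: the real installer logic is more involved, and
    the tag:Name value format used here is an assumption.
    """
    if subnet_id is not None:
        # Subnets were provided in the input: reference the subnet by ID.
        return {"id": subnet_id}
    # Subnets were not provided: reference the subnet by a tag:Name filter.
    return {"filters": [{"name": "tag:Name", "values": [f"{name_prefix}-{zone}"]}]}


by_id = subnet_ref("ap-northeast-1a", subnet_id="subnet-07c3a778191d310ee")
by_filter = subnet_ref("ap-northeast-1a")

# Both references may resolve to the same subnet in AWS, but as data
# structures they are not equal, so a naive comparison treats them as
# different failure domains.
print(by_id == by_filter)
```

The point of the sketch is only that the two references are structurally different even when they identify the same subnet.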
For day 0 (via installer), "provided in the input" means "provided in the install-config".
Example:
apiVersion: v1
...
compute:
- name: worker
  platform:
    aws:
      subnets:
      - subnet-07c3a778191d310ee
      - subnet-0064a771e53f30bc8
      - subnet-0b163898f4ebb7c96
      zones:
      - ap-northeast-1a
      - ap-northeast-1c
      - ap-northeast-1d
...
For day 2 (hive's MachinePool controller), at the moment it means "provided in the MachinePool".
Example:
apiVersion: hive.openshift.io/v1
kind: MachinePool
...
spec:
  ...
  platform:
    aws:
      ...
      subnets:
      - subnet-07c3a778191d310ee
      - subnet-0064a771e53f30bc8
      - subnet-0b163898f4ebb7c96
      zones:
      - ap-northeast-1a
      - ap-northeast-1c
      - ap-northeast-1d
...
(Aside: if subnets are enumerated, zones must also be enumerated, and match. This is an orthogonal issue, described by HIVE-2227, but the restriction applies to testing here.)
The problem arises when:
- The fix for HIVE-2254 is present
- A MachineSet is created via one of those paths
- A corresponding MachinePool uses the other.
To be explicit (and to inform the test matrix):
Subnets enumerated (Day 0 install-config) | Subnets enumerated (Day 2 MachinePool) | Bug manifests |
---|---|---|
no | no | no |
no | yes | yes |
yes | no | yes |
yes | yes | no |
In other words, the bug manifests if and only if day 0 install-config and day 2 MachinePool differ with respect to whether they enumerated subnets.
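For building the test matrix, the condition above reduces to an exclusive-or. The following Python sketch enumerates the four cases. The function name `bug_manifests` is made up for illustration and is not part of hive.

```python
from itertools import product


def bug_manifests(day0_enumerated: bool, day2_enumerated: bool) -> bool:
    """True when install-config and MachinePool disagree about enumerating subnets."""
    return day0_enumerated != day2_enumerated


# Walk the same four combinations as the table above.
for day0, day2 in product([False, True], repeat=2):
    print(f"day0={day0!s:<5} day2={day2!s:<5} bug={bug_manifests(day0, day2)}")
```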
*TODO:*
- Determine if this impacts other cloud providers!
- I don't think there's a failure mode dependent on whether zones are enumerated or not, but it would be worth checking.
The Symptom
- The MachinePool controller refuses to reconcile.
- No error appears in its status.
- The machinepool controller logs show an error like:
time="..." level=error msg="unable to create machine set" controller=machinepool error="machinesets.machine.openshift.io \"$MACHINESET_NAME\" already exists" machinePool=$MACHINEPOOL_NAMESPACE/$MACHINEPOOL_NAME reconcileID=...
Workaround
Make your MachinePool.Spec.Platform.AWS.Subnets "match" your install-config: either both should enumerate (the same) subnets, or both should omit subnets.
The Fix
See HIVE-2443. In short, we're going to look up subnets in AWS and label each MachineSet with the (unambiguous, explicit) subnet ID; then always use that – and never Filters – for matching failure domains.
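As a rough illustration of that idea (not the HIVE-2443 implementation), the Python sketch below resolves either reference shape to an explicit subnet ID, stores it as a label, and compares only the labels. The label key `SUBNET_LABEL`, the dict-based MachineSet stand-ins, the tag value, and the `lookup` table that stands in for the AWS query are all assumptions made for this example.

```python
SUBNET_LABEL = "example.hive/subnet-id"  # hypothetical label key


def resolve_subnet_id(subnet_ref, lookup):
    """Return the explicit subnet ID for either reference shape.

    `lookup` maps a tag:Name value to a subnet ID and stands in for the
    AWS lookup described above.
    """
    if "id" in subnet_ref:
        return subnet_ref["id"]
    name = subnet_ref["filters"][0]["values"][0]
    return lookup[name]


def label_machineset(machineset, lookup):
    """Record the resolved subnet ID as a label on the MachineSet stand-in."""
    labels = machineset.setdefault("labels", {})
    labels[SUBNET_LABEL] = resolve_subnet_id(machineset["subnet"], lookup)
    return machineset


def same_failure_domain(a, b):
    """Match on the explicit subnet ID label only, never on Filters."""
    return a["labels"][SUBNET_LABEL] == b["labels"][SUBNET_LABEL]


# Assumed mapping from tag:Name to subnet ID.
lookup = {"mycluster-private-ap-northeast-1a": "subnet-07c3a778191d310ee"}

existing = label_machineset(
    {"subnet": {"filters": [{"name": "tag:Name",
                             "values": ["mycluster-private-ap-northeast-1a"]}]}},
    lookup,
)
generated = label_machineset({"subnet": {"id": "subnet-07c3a778191d310ee"}}, lookup)

print(same_failure_domain(existing, generated))
```

With both sides normalized to the subnet ID, a MachineSet created from a Filters-based reference and one generated from an ID-based reference compare as the same failure domain.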
Is caused by: HIVE-2254 vSphere: Duplicate machinesets are created when a new cluster is deployed using ACM (Closed)