OpenShift Hive / HIVE-2441

MachinePool/MachineSet matching when subnets (un)specified

    • Type: Epic
    • Resolution: Unresolved
    • Priority: Major
    • Status: To Do
    • Progress: 0% To Do, 50% In Progress, 50% Done

      Background

      HIVE-2254 describes a problem in which hive's machinepool controller could not match (and therefore could not reconcile) MachineSets on vSphere. It occurred when <the installer version used to create the spoke cluster> and <the version of vendored installer code in hive> straddled OCPSTRAT-153, the change in which the installer code started generating MachineSets named with numeric suffixes according to failure domain (aka availability zone).

      Ultimately the fix in that card entailed changing hive code to stop using name string matching entirely, and instead:

      • Match MachineSets to MachinePools via labels
      • Match <MachineSets generated by the controller> to <existing MachineSets on the spoke> based on matching their failure domains

      The second change introduced the bug that is the subject of this card.

      The Bug

      For AWS, installer code generates the Subnet component of the providerConfig for a MachineSet differently depending on whether subnets were provided in the input:

      • If subnets were provided, providerConfig.Subnet identifies the subnet by ID.
      • If subnets were not provided, providerConfig.Subnet uses Filters by tag:Name instead.
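
      The two shapes can be sketched as plain Python dicts. The field names (id, filters, name, values) are my reading of the AWS provider spec's subnet reference, and the tag value is a made-up placeholder, so treat this as an illustration rather than actual installer output. The comparison function is a hypothetical naive matcher, not hive's code; it only shows that the two forms never compare equal, even when both point at the same subnet in AWS.

```python
# Illustrative only: two ways a MachineSet's providerConfig.Subnet can refer
# to the same AWS subnet. The tag value is a hypothetical placeholder.

# Subnets were provided in the input: the subnet is identified by ID.
subnet_by_id = {"id": "subnet-07c3a778191d310ee"}

# Subnets were not provided: the subnet is selected by a tag:Name filter.
subnet_by_filter = {
    "filters": [
        {"name": "tag:Name", "values": ["example-cluster-private-ap-northeast-1a"]}
    ]
}


def same_subnet_reference(a: dict, b: dict) -> bool:
    """Naive structural comparison of two subnet references (hypothetical)."""
    return a == b


# Both may point at the same subnet in AWS, but they do not compare equal.
print(same_subnet_reference(subnet_by_id, subnet_by_filter))
```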

      For day 0 (via installer), "provided in the input" means "provided in the install-config".

      Example:

      apiVersion: v1
      ...
      compute:
      - name: worker
        platform:
          aws:
            subnets:
            - subnet-07c3a778191d310ee
            - subnet-0064a771e53f30bc8
            - subnet-0b163898f4ebb7c96
            zones:
            - ap-northeast-1a
            - ap-northeast-1c
            - ap-northeast-1d
      ...
      

      For day 2 (hive's MachinePool controller), it currently means "provided in the MachinePool".

      Example:

      apiVersion: hive.openshift.io/v1
      kind: MachinePool
      ...
      spec:
        ...
        platform:
          aws:
            ...
            subnets:
            - subnet-07c3a778191d310ee
            - subnet-0064a771e53f30bc8
            - subnet-0b163898f4ebb7c96
            zones:
            - ap-northeast-1a
            - ap-northeast-1c
            - ap-northeast-1d
      ...
      

      (Aside: if subnets are enumerated, zones must also be enumerated, and match. This is an orthogonal issue, described by HIVE-2227, but the restriction applies to testing here.)

      The problem arises when:

      • The fix for HIVE-2254 is present
      • A MachineSet is created via one of those paths
      • A corresponding MachinePool uses the other.

      To be explicit (and to inform the test matrix):

      Subnets enumerated      | Subnets enumerated   | Bug manifests
      (Day 0 install-config)  | (Day 2 MachinePool)  |
      ------------------------+----------------------+--------------
      no                      | no                   | no
      no                      | yes                  | yes
      yes                     | no                   | yes
      yes                     | yes                  | no

      In other words, the bug manifests if and only if day 0 install-config and day 2 MachinePool differ with respect to whether they enumerated subnets.
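
      For anyone scripting the test matrix, the rule is a logical XOR of the two conditions. A minimal sketch follows; the function and parameter names are mine, not anything in hive.

```python
from itertools import product


def bug_manifests(day0_enumerates_subnets: bool, day2_enumerates_subnets: bool) -> bool:
    """The bug appears exactly when the two sides disagree (logical XOR)."""
    return day0_enumerates_subnets != day2_enumerates_subnets


def yes_no(flag: bool) -> str:
    """Render a boolean the way the matrix does."""
    return "yes" if flag else "no"


# Enumerate all four combinations: day 0 first, then day 2, then the outcome.
for day0, day2 in product([False, True], repeat=2):
    print(yes_no(day0), yes_no(day2), yes_no(bug_manifests(day0, day2)))
```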

      TODO:

      • Determine if this impacts other cloud providers!
      • I don't think there's a failure mode dependent on whether zones are enumerated or not, but it would be worth checking.

      The Symptom

      • The MachinePool controller refuses to reconcile.
      • No error appears in the MachinePool's status.
      • The machinepool controller logs show an error like:

      time="..." level=error msg="unable to create machine set" controller=machinepool error="machinesets.machine.openshift.io \"$MACHINESET_NAME\" already exists" machinePool=$MACHINEPOOL_NAMESPACE/$MACHINEPOOL_NAME reconcileID=...
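
      Since nothing shows up in status, the logs are the only place to spot this. Below is a rough sketch of a filter for hive controller log lines. It assumes the fields appear in the same order as in the line above; the function name and the sample MachineSet and MachinePool names are hypothetical.

```python
import re

# Matches the "already exists" error from the machinepool controller and
# captures the namespace/name of the affected MachinePool.
# Assumes the field order shown in the example log line.
SYMPTOM = re.compile(
    r'msg="unable to create machine set"'
    r'.*controller=machinepool'
    r'.*already exists'
    r'.*machinePool=(?P<pool>\S+)'
)


def affected_machinepool(log_line: str):
    """Return "namespace/name" if the line shows the symptom, else None."""
    match = SYMPTOM.search(log_line)
    return match.group("pool") if match else None


# Hypothetical sample line; the names are placeholders.
sample = (
    r'time="..." level=error msg="unable to create machine set" '
    r'controller=machinepool '
    r'error="machinesets.machine.openshift.io \"mycluster-worker-0\" already exists" '
    r'machinePool=mynamespace/mypool reconcileID=...'
)
print(affected_machinepool(sample))
```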

      Workaround

      Make your MachinePool.Spec.Platform.AWS.Subnets "match" your install-config: either both should enumerate (the same) subnets, or both should omit subnets.
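
      A quick way to check whether a given pair "matches" is sketched below. It takes the install-config and the MachinePool as already-parsed dicts laid out as in the examples above (compute[].platform.aws.subnets and spec.platform.aws.subnets). The function name and the pool_name parameter are mine, and the install-config layout is assumed to be the one shown in this card.

```python
def subnets_consistent(install_config: dict, machine_pool: dict, pool_name: str = "worker") -> bool:
    """True if both sides enumerate the same subnets, or both omit them."""
    # Find the compute pool with the given name in the install-config.
    compute = next(
        (c for c in install_config.get("compute", []) if c.get("name") == pool_name),
        {},
    )
    day0 = compute.get("platform", {}).get("aws", {}).get("subnets") or []
    day2 = machine_pool.get("spec", {}).get("platform", {}).get("aws", {}).get("subnets") or []
    # Order does not matter for this check; two empty lists also count as a match.
    return set(day0) == set(day2)


# Example: install-config enumerates subnets, MachinePool omits them.
install_config = {
    "compute": [
        {"name": "worker", "platform": {"aws": {"subnets": ["subnet-07c3a778191d310ee"]}}}
    ]
}
machine_pool = {"spec": {"platform": {"aws": {}}}}
print(subnets_consistent(install_config, machine_pool))
```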

      The Fix

      See HIVE-2443. In short, we're going to look up subnets in AWS and label each MachineSet with the (unambiguous, explicit) subnet ID; then always use that – and never Filters – for matching failure domains.
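
      To illustrate the idea only (this is not the HIVE-2443 implementation): once every subnet reference is resolved to an explicit ID, matching becomes a plain ID comparison regardless of which form the reference took. In the sketch below the AWS lookup is replaced by a hypothetical in-memory mapping, and all names and the tag value are placeholders.

```python
# Hypothetical stand-in for an AWS lookup: maps a tag:Name value to a subnet ID.
NAME_TO_ID = {"example-cluster-private-ap-northeast-1a": "subnet-07c3a778191d310ee"}


def resolve_subnet_id(ref: dict, name_to_id: dict) -> str:
    """Resolve a subnet reference (by ID or by tag:Name filter) to a subnet ID."""
    if ref.get("id"):
        return ref["id"]
    for flt in ref.get("filters", []):
        if flt.get("name") == "tag:Name":
            for value in flt.get("values", []):
                if value in name_to_id:
                    return name_to_id[value]
    raise LookupError("could not resolve subnet reference to an ID")


def same_failure_domain(a: dict, b: dict, name_to_id: dict) -> bool:
    """Compare two subnet references by their resolved IDs only."""
    return resolve_subnet_id(a, name_to_id) == resolve_subnet_id(b, name_to_id)


by_id = {"id": "subnet-07c3a778191d310ee"}
by_filter = {
    "filters": [
        {"name": "tag:Name", "values": ["example-cluster-private-ap-northeast-1a"]}
    ]
}
print(same_failure_domain(by_id, by_filter, NAME_TO_ID))
```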

            Leah Leshchinsky (leah_leshchinsky)
            Eric Fried (efried.openshift)
            Mingxia Huang
            Jianping Shu