OpenShift Hosted Control Plane / HOSTEDCP-526

Hypershift creates extra EC2 instances


    • Sprint: Hypershift Sprint 15, Hypershift Sprint 16

      While working on spinning up hosted clusters via OCM, I noticed that the Hypershift Operator creates extra worker nodes. Here is the NodePool CR on the management cluster:

      % oc get nodepool -n ocm-sbarouti-1tmr67kf784u4e8mojeehkaoh22bioos sbarouti308-workers -oyaml
      apiVersion: hypershift.openshift.io/v1alpha1
      kind: NodePool
      metadata:
        annotations:
          hypershift.openshift.io/nodePoolCurrentConfig: 13dca8db
          hypershift.openshift.io/nodePoolCurrentConfigVersion: 7ec88832
        creationTimestamp: "2022-07-26T14:59:15Z"
        finalizers:
        - hypershift.openshift.io/finalizer
        generation: 64
        labels:
          hypershift.openshift.io/auto-created-for-infra: 1tmr67kf784u4e8mojeehkaoh22bioos
        name: sbarouti308-workers
        namespace: ocm-sbarouti-1tmr67kf784u4e8mojeehkaoh22bioos
        ownerReferences:
        - apiVersion: hypershift.openshift.io/v1alpha1
          kind: HostedCluster
          name: sbarouti308
          uid: 837a8a49-d31c-4455-b810-f861d90453f5
        resourceVersion: "228032783"
        uid: 1aa33bf5-8cd2-4bdc-95fd-2596fa64cbf8
      spec:
        clusterName: sbarouti308
        management:
          autoRepair: false
          replace:
            rollingUpdate:
              maxSurge: 1
              maxUnavailable: 0
            strategy: RollingUpdate
          upgradeType: Replace
        nodeCount: 2
        platform:
          aws:
            ami: ami-00c25def04737ae38
            instanceProfile: sbarouti-1tmr67kf784u4e8mojeehkaoh22bioos-sbarouti308-worker
            instanceType: m5.xlarge
            resourceTags:
            - key: api.openshift.com/environment
              value: sbarouti
            - key: api.openshift.com/id
              value: 1tmr67kf784u4e8mojeehkaoh22bioos
            - key: api.openshift.com/name
              value: sbarouti308
            rootVolume:
              size: 300
              type: gp3
            subnet:
              id: subnet-0098aace7c02a6c70
          type: AWS
        release:
          image: quay.io/openshift-release-dev/ocp-release@sha256:5b1a987e21b199321d200ac20ae27390a75c1f44b83805dadfae7e5a967b9e5d
        replicas: 2
      status:
        conditions:
        - lastTransitionTime: "2022-07-26T14:59:16Z"
          observedGeneration: 64
          reason: AsExpected
          status: "False"
          type: AutoscalingEnabled
        - lastTransitionTime: "2022-07-26T14:59:16Z"
          observedGeneration: 64
          reason: AsExpected
          status: "True"
          type: UpdateManagementEnabled
        - lastTransitionTime: "2022-07-26T15:01:51Z"
          message: 'Using release image: quay.io/openshift-release-dev/ocp-release@sha256:5b1a987e21b199321d200ac20ae27390a75c1f44b83805dadfae7e5a967b9e5d'
          observedGeneration: 64
          reason: AsExpected
          status: "True"
          type: ValidReleaseImage
        - lastTransitionTime: "2022-07-26T15:01:51Z"
          observedGeneration: 64
          reason: AsExpected
          status: "True"
          type: ValidMachineConfig
        - lastTransitionTime: "2022-07-26T15:01:51Z"
          observedGeneration: 64
          reason: AsExpected
          status: "False"
          type: AutorepairEnabled
        - lastTransitionTime: "2022-07-26T15:10:39Z"
          observedGeneration: 64
          reason: AsExpected
          status: "True"
          type: Ready
        replicas: 2
        version: 4.11.0-rc.5 

      Context:

      The nodeCount is just one inconsistency; there is also a deprecated field for the AWS roles, but that is not what causes the extra instances. As mentioned above, it is the awsTemplate that gets a tag from the hypershift backend ("kubernetes.io/cluster/" + hcluster.Spec.InfraID) which is not present in the mock referenced HC. So when either the hypershiftDeployment reconciles the manifestWork, or the manifestWork reconciles the HC (I am not sure which one), the tag gets stamped on and then removed again, producing a new generation for the HC and for the awsTemplate and hence a MachineDeployment rollout (see the sketch below). The generation for the HC and the awsTemplate is > 2k: https://coreos.slack.com/archives/C01C8502FMM/p1658944183684319?thread_ts=1658927592.373669&cid=C01C8502FMM

      This is a good exercise for debugging, but there are plenty of things that could have broken this environment. I am more interested in creating one from scratch, revisiting all the OCM steps, and then observing as we move towards the new workflow. This sync issue will be alleviated by removing the level of indirection coming from the hypershiftDeployment; for manifestWork we need to revisit the behaviour for unset fields; and hypershift should avoid self-updating the HC spec. For the tag in particular I see no need for it to be set this way, it could be done transparently, but there are more fields like this, e.g. .clusterID.
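      For illustration only, here is a minimal Go sketch of the race described above. It is not the actual Hypershift or OCM code; the function name backendTags, the example values, and the "owned" tag value are assumptions made for this sketch.

      package main

      import "fmt"

      // backendTags mimics, conceptually, how the hypershift backend renders the
      // template tags: the user-declared tags plus the cluster tag derived from
      // the infra ID ("owned" is assumed here as the tag value).
      func backendTags(userTags map[string]string, infraID string) map[string]string {
          tags := map[string]string{}
          for k, v := range userTags {
              tags[k] = v
          }
          tags["kubernetes.io/cluster/"+infraID] = "owned"
          return tags
      }

      func main() {
          infraID := "1tmr67kf784u4e8mojeehkaoh22bioos"
          // Tags as declared in the mock referenced HC.
          hcTags := map[string]string{"api.openshift.com/name": "sbarouti308"}

          // Pass 1: the backend stamps the cluster tag onto the template.
          rendered := backendTags(hcTags, infraID)

          // Pass 2: a reconciler whose desired state is only hcTags treats the
          // extra key as drift and strips it again.
          delete(rendered, "kubernetes.io/cluster/"+infraID)

          // Each add/strip cycle changes the spec, bumps the generation, and
          // triggers a MachineDeployment rollout; repeated over time the
          // generation climbs into the thousands, as observed on this cluster.
          fmt.Println("template tags back to the HC view:", len(rendered) == len(hcTags))
      }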

      Hypershift mitigates this with https://github.com/openshift/hypershift/pull/1625

      On the OCM side of things, instead of fixing the provisioning of those clusters, I suggest we move forward with dropping the HypershiftDeployment, which is the GA plan: https://issues.redhat.com/browse/SDE-2107

       

      DoD:

      The NodePool controller enforces the "kubernetes.io/cluster/" + hcluster.Spec.InfraID tag in the awsTemplate so there is no race with the HC (see the sketch below).
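      A minimal sketch of what that enforcement could look like, assuming a CAPA-style map of additional tags on the template spec. This is not the actual NodePool controller code; the helper name enforceClusterTag and the "owned" value are illustrative.

      package main

      import "fmt"

      // enforceClusterTag stamps the cluster tag idempotently: applying it any
      // number of times yields the same tag set, so repeated reconciles converge
      // on one spec instead of adding and removing the tag.
      func enforceClusterTag(additionalTags map[string]string, infraID string) map[string]string {
          out := map[string]string{}
          for k, v := range additionalTags {
              out[k] = v
          }
          out["kubernetes.io/cluster/"+infraID] = "owned"
          return out
      }

      func main() {
          infraID := "1tmr67kf784u4e8mojeehkaoh22bioos"
          tags := map[string]string{"api.openshift.com/id": infraID}

          once := enforceClusterTag(tags, infraID)
          twice := enforceClusterTag(once, infraID)

          // Identical both times: no oscillation, no generation churn, no rollout.
          fmt.Println(len(once) == len(twice))
      }

      With the NodePool controller as the single owner of this tag, an external update to the HC that omits it no longer produces a spec diff that rolls the MachineDeployment.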

       

      Attachments:

        1. capi-logs.txt (4.17 MB, Samira Barouti)
        2. hypershift-operator-logs.txt (27.97 MB, Samira Barouti)
        3. Screen Shot 2022-07-25 at 1.56.49 PM.png (471 kB, Samira Barouti)

              Alberto Garcia Lamela (agarcial@redhat.com)
              Samira Barouti (sbarouti@redhat.com, Inactive)
              He Liu