- Bug
- Resolution: Done
- Major
- None
- None
- None
- False
- None
- False
- Hypershift Sprint 15, Hypershift Sprint 16
- 0
- 0
- 0
While working on spinning up Hosted clusters via OCM, I noticed that the Hypershift Operator creates extra worker nodes. Here is the NodePool CR on the management cluster (MC):
% oc get nodepool -n ocm-sbarouti-1tmr67kf784u4e8mojeehkaoh22bioos sbarouti308-workers -oyaml
apiVersion: hypershift.openshift.io/v1alpha1
kind: NodePool
metadata:
  annotations:
    hypershift.openshift.io/nodePoolCurrentConfig: 13dca8db
    hypershift.openshift.io/nodePoolCurrentConfigVersion: 7ec88832
  creationTimestamp: "2022-07-26T14:59:15Z"
  finalizers:
  - hypershift.openshift.io/finalizer
  generation: 64
  labels:
    hypershift.openshift.io/auto-created-for-infra: 1tmr67kf784u4e8mojeehkaoh22bioos
  name: sbarouti308-workers
  namespace: ocm-sbarouti-1tmr67kf784u4e8mojeehkaoh22bioos
  ownerReferences:
  - apiVersion: hypershift.openshift.io/v1alpha1
    kind: HostedCluster
    name: sbarouti308
    uid: 837a8a49-d31c-4455-b810-f861d90453f5
  resourceVersion: "228032783"
  uid: 1aa33bf5-8cd2-4bdc-95fd-2596fa64cbf8
spec:
  clusterName: sbarouti308
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: Replace
  nodeCount: 2
  platform:
    aws:
      ami: ami-00c25def04737ae38
      instanceProfile: sbarouti-1tmr67kf784u4e8mojeehkaoh22bioos-sbarouti308-worker
      instanceType: m5.xlarge
      resourceTags:
      - key: api.openshift.com/environment
        value: sbarouti
      - key: api.openshift.com/id
        value: 1tmr67kf784u4e8mojeehkaoh22bioos
      - key: api.openshift.com/name
        value: sbarouti308
      rootVolume:
        size: 300
        type: gp3
      subnet:
        id: subnet-0098aace7c02a6c70
    type: AWS
  release:
    image: quay.io/openshift-release-dev/ocp-release@sha256:5b1a987e21b199321d200ac20ae27390a75c1f44b83805dadfae7e5a967b9e5d
  replicas: 2
status:
  conditions:
  - lastTransitionTime: "2022-07-26T14:59:16Z"
    observedGeneration: 64
    reason: AsExpected
    status: "False"
    type: AutoscalingEnabled
  - lastTransitionTime: "2022-07-26T14:59:16Z"
    observedGeneration: 64
    reason: AsExpected
    status: "True"
    type: UpdateManagementEnabled
  - lastTransitionTime: "2022-07-26T15:01:51Z"
    message: 'Using release image: quay.io/openshift-release-dev/ocp-release@sha256:5b1a987e21b199321d200ac20ae27390a75c1f44b83805dadfae7e5a967b9e5d'
    observedGeneration: 64
    reason: AsExpected
    status: "True"
    type: ValidReleaseImage
  - lastTransitionTime: "2022-07-26T15:01:51Z"
    observedGeneration: 64
    reason: AsExpected
    status: "True"
    type: ValidMachineConfig
  - lastTransitionTime: "2022-07-26T15:01:51Z"
    observedGeneration: 64
    reason: AsExpected
    status: "False"
    type: AutorepairEnabled
  - lastTransitionTime: "2022-07-26T15:10:39Z"
    observedGeneration: 64
    reason: AsExpected
    status: "True"
    type: Ready
  replicas: 2
  version: 4.11.0-rc.5
Context:
The nodeCount is just one inconsistency; there's also a deprecated field for AWS roles, but that's not what causes the extra instances. As mentioned above, it's the AWSTemplate, which gets a tag from the Hypershift backend ("kubernetes.io/cluster/" + hcluster.Spec.InfraID) that is not in the mock referenced HC. So when either the HypershiftDeployment reconciles the ManifestWork or the ManifestWork reconciles the HC (I'm not sure which one), the tag gets stamped on and then removed, causing a new generation for the HC and for the AWSTemplate, and hence a MachineDeployment rollout. The generation for the HC and the AWSTemplate is > 2k: https://coreos.slack.com/archives/C01C8502FMM/p1658944183684319?thread_ts=1658927592.373669&cid=C01C8502FMM

This is a good exercise for debugging, but there are plenty of things that could have broken this env. I'm more interested in creating one from scratch, revisiting all the OCM steps, and then observing and moving towards the new workflow. This sync issue will be alleviated by removing the level of indirection coming from the HypershiftDeployment. For ManifestWork we need to revisit the behaviour for unset fields, and Hypershift should avoid self-updating the HC spec; for the tag in particular I see no need for it in the spec, it could be done transparently, but there are more things like this, e.g. .clusterID.
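As a rough illustration of that flip-flop, here is a minimal, self-contained Go sketch with hypothetical types and tag values (not the actual Hypershift or CAPI code): two reconcilers keep overwriting the same resourceTags field with different desired lists, so every pass flips the cluster tag on or off and bumps the generation, which in turn triggers a rollout.

package main

import "fmt"

// Tag is a simplified stand-in for an AWS resource tag carried on the template spec.
type Tag struct{ Key, Value string }

// ocmDesiredTags mirrors only what the (mock) HostedCluster spec carries.
func ocmDesiredTags() []Tag {
	return []Tag{{Key: "api.openshift.com/id", Value: "1tmr67kf784u4e8mojeehkaoh22bioos"}}
}

// hypershiftDesiredTags adds the ownership tag derived from the InfraID, which the
// mock HC spec does not carry, so the other reconciler keeps stripping it again.
func hypershiftDesiredTags(infraID string) []Tag {
	return append(ocmDesiredTags(), Tag{Key: "kubernetes.io/cluster/" + infraID, Value: "owned"})
}

func main() {
	infraID := "1tmr67kf784u4e8mojeehkaoh22bioos"
	current := ocmDesiredTags()
	generation := 1
	// Alternating reconciles: each side overwrites resourceTags wholesale with its own
	// desired list, so the tag is added, then removed, and the generation keeps climbing.
	for i := 0; i < 4; i++ {
		desired := ocmDesiredTags()
		if i%2 == 0 {
			desired = hypershiftDesiredTags(infraID)
		}
		if len(desired) != len(current) { // naive "spec changed" check, for the sketch only
			current = desired
			generation++
			fmt.Printf("reconcile %d: generation bumped to %d (tags: %d)\n", i, generation, len(current))
		}
	}
}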
Hypershift mitigates this with https://github.com/openshift/hypershift/pull/1625
On the OCM side of things, instead of fixing the provisioning of those clusters, I suggest we move forward with dropping the HypershiftDeployment, which is the GA plan. https://issues.redhat.com/browse/SDE-2107
DoD:
The NodePool controller enforces the "kubernetes.io/cluster/" + hcluster.Spec.InfraID tag in the AWSTemplate so there's no race with the HC.
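A minimal sketch of what that enforcement could look like, with simplified hypothetical types and helper names (the real change is the Hypershift PR linked above): the NodePool controller idempotently merges the cluster tag into the template's tags before writing it, so repeated reconciles are no-ops and the HC side has nothing to race against.

package main

import "fmt"

// Tag is a simplified stand-in for an AWS resource tag on the template spec.
type Tag struct{ Key, Value string }

// ensureClusterTag returns tags with the "kubernetes.io/cluster/<infraID>" entry
// guaranteed to be present (set to "owned"), without duplicating or reordering
// existing entries. Calling it repeatedly yields the same result, so the template
// spec, and therefore its generation, stays stable across reconciles.
func ensureClusterTag(tags []Tag, infraID string) []Tag {
	key := "kubernetes.io/cluster/" + infraID
	for i, t := range tags {
		if t.Key == key {
			tags[i].Value = "owned"
			return tags
		}
	}
	return append(tags, Tag{Key: key, Value: "owned"})
}

func main() {
	// Tags as they come from the NodePool spec (no cluster tag, as in the CR above).
	spec := []Tag{{Key: "api.openshift.com/name", Value: "sbarouti308"}}
	enforced := ensureClusterTag(spec, "1tmr67kf784u4e8mojeehkaoh22bioos")
	// Running it again changes nothing, so no spurious MachineDeployment rollout.
	enforced = ensureClusterTag(enforced, "1tmr67kf784u4e8mojeehkaoh22bioos")
	fmt.Println(enforced)
}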