- Bug
- Resolution: Done
- Major
- None
- None
- None
- False
- None
- False
- Hypershift Sprint 15, Hypershift Sprint 16
- 0
- 0
- 0
While working on spinning up Hosted clusters via OCM, I noticed that the Hypershift Operator creates extra worker nodes. Here is the NodePool CR on the management cluster (MC):
% oc get nodepool -n ocm-sbarouti-1tmr67kf784u4e8mojeehkaoh22bioos sbarouti308-workers -oyaml
apiVersion: hypershift.openshift.io/v1alpha1
kind: NodePool
metadata:
  annotations:
    hypershift.openshift.io/nodePoolCurrentConfig: 13dca8db
    hypershift.openshift.io/nodePoolCurrentConfigVersion: 7ec88832
  creationTimestamp: "2022-07-26T14:59:15Z"
  finalizers:
  - hypershift.openshift.io/finalizer
  generation: 64
  labels:
    hypershift.openshift.io/auto-created-for-infra: 1tmr67kf784u4e8mojeehkaoh22bioos
  name: sbarouti308-workers
  namespace: ocm-sbarouti-1tmr67kf784u4e8mojeehkaoh22bioos
  ownerReferences:
  - apiVersion: hypershift.openshift.io/v1alpha1
    kind: HostedCluster
    name: sbarouti308
    uid: 837a8a49-d31c-4455-b810-f861d90453f5
  resourceVersion: "228032783"
  uid: 1aa33bf5-8cd2-4bdc-95fd-2596fa64cbf8
spec:
  clusterName: sbarouti308
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: Replace
  nodeCount: 2
  platform:
    aws:
      ami: ami-00c25def04737ae38
      instanceProfile: sbarouti-1tmr67kf784u4e8mojeehkaoh22bioos-sbarouti308-worker
      instanceType: m5.xlarge
      resourceTags:
      - key: api.openshift.com/environment
        value: sbarouti
      - key: api.openshift.com/id
        value: 1tmr67kf784u4e8mojeehkaoh22bioos
      - key: api.openshift.com/name
        value: sbarouti308
      rootVolume:
        size: 300
        type: gp3
      subnet:
        id: subnet-0098aace7c02a6c70
    type: AWS
  release:
    image: quay.io/openshift-release-dev/ocp-release@sha256:5b1a987e21b199321d200ac20ae27390a75c1f44b83805dadfae7e5a967b9e5d
  replicas: 2
status:
  conditions:
  - lastTransitionTime: "2022-07-26T14:59:16Z"
    observedGeneration: 64
    reason: AsExpected
    status: "False"
    type: AutoscalingEnabled
  - lastTransitionTime: "2022-07-26T14:59:16Z"
    observedGeneration: 64
    reason: AsExpected
    status: "True"
    type: UpdateManagementEnabled
  - lastTransitionTime: "2022-07-26T15:01:51Z"
    message: 'Using release image: quay.io/openshift-release-dev/ocp-release@sha256:5b1a987e21b199321d200ac20ae27390a75c1f44b83805dadfae7e5a967b9e5d'
    observedGeneration: 64
    reason: AsExpected
    status: "True"
    type: ValidReleaseImage
  - lastTransitionTime: "2022-07-26T15:01:51Z"
    observedGeneration: 64
    reason: AsExpected
    status: "True"
    type: ValidMachineConfig
  - lastTransitionTime: "2022-07-26T15:01:51Z"
    observedGeneration: 64
    reason: AsExpected
    status: "False"
    type: AutorepairEnabled
  - lastTransitionTime: "2022-07-26T15:10:39Z"
    observedGeneration: 64
    reason: AsExpected
    status: "True"
    type: Ready
  replicas: 2
  version: 4.11.0-rc.5
Context:
The nodeCount is just one inconsistency; there's also a deprecated field for AWS roles, but that's not what causes the extra instances. As mentioned above, it's the AWSTemplate, which gets a tag from the Hypershift backend ("kubernetes.io/cluster/" + hcluster.Spec.InfraID) that is not in the mock referenced HC. So when either the HypershiftDeployment reconciles the ManifestWork or the ManifestWork reconciles the HC (I'm not sure which one), the tag gets stamped on and then removed, causing a new generation for the HC and for the AWSTemplate, and hence a MachineDeployment rollout. The generation for the HC and the AWSTemplate is > 2k: https://coreos.slack.com/archives/C01C8502FMM/p1658944183684319?thread_ts=1658927592.373669&cid=C01C8502FMM

This is a good exercise for debugging, but there are plenty of things that could have broken this env. I'm more interested in creating one from scratch, revisiting all the OCM steps, and then observing and moving towards the new workflow. This sync issue will be alleviated by removing the level of indirection coming from the HypershiftDeployment. For ManifestWork we need to revisit the behaviour for unset fields, and Hypershift should avoid self-updating the HC spec; for the tag in particular I see no need for it in the spec, it could be done transparently, but there are more things like this, e.g. .clusterID.
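As a rough illustration of that flip-flop, here is a minimal, self-contained Go sketch with hypothetical types and tag values (not the actual Hypershift or CAPI code): two reconcilers keep overwriting the same resourceTags field with different desired lists, so every pass flips the cluster tag on or off and bumps the generation, which in turn triggers a rollout.

package main

import "fmt"

// Tag is a simplified stand-in for an AWS resource tag carried on the template spec.
type Tag struct{ Key, Value string }

// ocmDesiredTags mirrors only what the (mock) HostedCluster spec carries.
func ocmDesiredTags() []Tag {
	return []Tag{{Key: "api.openshift.com/id", Value: "1tmr67kf784u4e8mojeehkaoh22bioos"}}
}

// hypershiftDesiredTags adds the ownership tag derived from the InfraID, which the
// mock HC spec does not carry, so the other reconciler keeps stripping it again.
func hypershiftDesiredTags(infraID string) []Tag {
	return append(ocmDesiredTags(), Tag{Key: "kubernetes.io/cluster/" + infraID, Value: "owned"})
}

func main() {
	infraID := "1tmr67kf784u4e8mojeehkaoh22bioos"
	current := ocmDesiredTags()
	generation := 1
	// Alternating reconciles: each side overwrites resourceTags wholesale with its own
	// desired list, so the tag is added, then removed, and the generation keeps climbing.
	for i := 0; i < 4; i++ {
		desired := ocmDesiredTags()
		if i%2 == 0 {
			desired = hypershiftDesiredTags(infraID)
		}
		if len(desired) != len(current) { // naive "spec changed" check, for the sketch only
			current = desired
			generation++
			fmt.Printf("reconcile %d: generation bumped to %d (tags: %d)\n", i, generation, len(current))
		}
	}
}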
Hypershift mitigates this with https://github.com/openshift/hypershift/pull/1625
On the OCM side of things, instead of fixing the provisioning of those clusters, I suggest we move forward with dropping the HypershiftDeployment, which is the GA plan. https://issues.redhat.com/browse/SDE-2107
DoD:
The NodePool controller enforces the "kubernetes.io/cluster/" + hcluster.Spec.InfraID tag in the AWSTemplate so there's no race with the HC.
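A minimal sketch of what that enforcement could look like, with simplified hypothetical types and helper names (the real change is the Hypershift PR linked above): the NodePool controller idempotently merges the cluster tag into the template's tags before writing it, so repeated reconciles are no-ops and the HC side has nothing to race against.

package main

import "fmt"

// Tag is a simplified stand-in for an AWS resource tag on the template spec.
type Tag struct{ Key, Value string }

// ensureClusterTag returns tags with the "kubernetes.io/cluster/<infraID>" entry
// guaranteed to be present (set to "owned"), without duplicating or reordering
// existing entries. Calling it repeatedly yields the same result, so the template
// spec, and therefore its generation, stays stable across reconciles.
func ensureClusterTag(tags []Tag, infraID string) []Tag {
	key := "kubernetes.io/cluster/" + infraID
	for i, t := range tags {
		if t.Key == key {
			tags[i].Value = "owned"
			return tags
		}
	}
	return append(tags, Tag{Key: key, Value: "owned"})
}

func main() {
	// Tags as they come from the NodePool spec (no cluster tag, as in the CR above).
	spec := []Tag{{Key: "api.openshift.com/name", Value: "sbarouti308"}}
	enforced := ensureClusterTag(spec, "1tmr67kf784u4e8mojeehkaoh22bioos")
	// Running it again changes nothing, so no spurious MachineDeployment rollout.
	enforced = ensureClusterTag(enforced, "1tmr67kf784u4e8mojeehkaoh22bioos")
	fmt.Println(enforced)
}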