-
Bug
-
Resolution: Done
-
Major
-
None
-
4.12
-
None
-
Moderate
-
None
-
CLOUD Sprint 226
-
1
-
Rejected
-
False
Description of problem:
The CPMS (ControlPlaneMachineSet) failureDomains are not consistent with the master machines on a heterogeneous cluster after upgrading from 4.11 to 4.12.
Version-Release number of selected component (if applicable):
4.11.9-multi -> 4.12.0-0.nightly-multi-2022-10-20-153503
How reproducible:
always
Steps to Reproduce:
1. Launch a 4.11 heterogeneous cluster on AWS. We used the automated template https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_11/ipi-on-aws/versioned-installer-x86_arm64_heterogeneous_workers

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.9-multi   True        False         25m     Cluster version is 4.11.9-multi
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION        AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.9-multi   True        False         False      25m
baremetal                                  4.11.9-multi   True        False         False      43m
cloud-controller-manager                   4.11.9-multi   True        False         False      45m
cloud-credential                           4.11.9-multi   True        False         False      46m
cluster-api                                4.11.9-multi   True        False         False      44m
cluster-autoscaler                         4.11.9-multi   True        False         False      43m
config-operator                            4.11.9-multi   True        False         False      44m
console                                    4.11.9-multi   True        False         False      31m
csi-snapshot-controller                    4.11.9-multi   True        False         False      43m
dns                                        4.11.9-multi   True        False         False      43m
etcd                                       4.11.9-multi   True        False         False      42m
image-registry                             4.11.9-multi   True        False         False      38m
ingress                                    4.11.9-multi   True        False         False      38m
insights                                   4.11.9-multi   True        False         False      37m
kube-apiserver                             4.11.9-multi   True        False         False      40m
kube-controller-manager                    4.11.9-multi   True        False         False      41m
kube-scheduler                             4.11.9-multi   True        False         False      41m
kube-storage-version-migrator              4.11.9-multi   True        False         False      44m
machine-api                                4.11.9-multi   True        False         False      40m
machine-approver                           4.11.9-multi   True        False         False      43m
machine-config                             4.11.9-multi   True        False         False      42m
marketplace                                4.11.9-multi   True        False         False      43m
monitoring                                 4.11.9-multi   True        False         False      35m
network                                    4.11.9-multi   True        False         False      45m
node-tuning                                4.11.9-multi   True        False         False      43m
openshift-apiserver                        4.11.9-multi   True        False         False      38m
openshift-controller-manager               4.11.9-multi   True        False         False      43m
openshift-samples                          4.11.9-multi   True        False         False      37m
operator-lifecycle-manager                 4.11.9-multi   True        False         False      43m
operator-lifecycle-manager-catalog         4.11.9-multi   True        False         False      43m
operator-lifecycle-manager-packageserver   4.11.9-multi   True        False         False      38m
service-ca                                 4.11.9-multi   True        False         False      44m
storage                                    4.11.9-multi   True        False         False      38m

2. Upgrade to 4.12.0-0.nightly-multi-2022-10-20-153503.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-multi-2022-10-20-153503   True        False         15m     Cluster version is 4.12.0-0.nightly-multi-2022-10-20-153503
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      120m
baremetal                                  4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      137m
cloud-controller-manager                   4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      140m
cloud-credential                           4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      140m
cluster-api                                4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      138m
cluster-autoscaler                         4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      137m
config-operator                            4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      138m
console                                    4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      125m
control-plane-machine-set                  4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      55m
csi-snapshot-controller                    4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      138m
dns                                        4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      137m
etcd                                       4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      136m
image-registry                             4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      132m
ingress                                    4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      132m
insights                                   4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      132m
kube-apiserver                             4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      134m
kube-controller-manager                    4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      135m
kube-scheduler                             4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      135m
kube-storage-version-migrator              4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      27m
machine-api                                4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      134m
machine-approver                           4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      138m
machine-config                             4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      76m
marketplace                                4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      137m
monitoring                                 4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      130m
network                                    4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      139m
node-tuning                                4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      52m
openshift-apiserver                        4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      132m
openshift-controller-manager               4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      52m
openshift-samples                          4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      55m
operator-lifecycle-manager                 4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      138m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      138m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      132m
platform-operators-aggregated              4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      27m
service-ca                                 4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      138m
storage                                    4.12.0-0.nightly-multi-2022-10-20-153503   True        False         False      132m

3. A CPMS now exists, but its failureDomains list shows us-east-2a, us-east-2c, us-east-2a, us-east-2b, which is not consistent with the master machines (us-east-2a, us-east-2b, us-east-2c).

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                                       PHASE     TYPE         REGION      ZONE         AGE
huliu-aws411he2-gbt55-master-0                             Running   m6i.xlarge   us-east-2   us-east-2a   113m
huliu-aws411he2-gbt55-master-1                             Running   m6i.xlarge   us-east-2   us-east-2b   113m
huliu-aws411he2-gbt55-master-2                             Running   m6i.xlarge   us-east-2   us-east-2c   113m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-nmkwf   Running   m6g.large    us-east-2   us-east-2a   109m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-xw2df   Running   m6g.large    us-east-2   us-east-2a   109m
huliu-aws411he2-gbt55-worker-us-east-2a-pbsxw              Running   m6i.xlarge   us-east-2   us-east-2a   109m
huliu-aws411he2-gbt55-worker-us-east-2b-tpzn2              Running   m6i.xlarge   us-east-2   us-east-2b   109m
huliu-aws411he2-gbt55-worker-us-east-2c-bxchx              Running   m6i.xlarge   us-east-2   us-east-2c   109m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE      AGE
cluster   3         3         3       2                       Inactive   44m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -o yaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  creationTimestamp: "2022-10-21T09:19:02Z"
  finalizers:
  - controlplanemachineset.machine.openshift.io
  generation: 1
  name: cluster
  namespace: openshift-machine-api
  resourceVersion: "63863"
  uid: c33d01d3-c7f3-411f-aaed-4c5339b166d3
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-aws411he2-gbt55
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  state: Inactive
  strategy:
    type: RollingUpdate
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        aws:
        - placement:
            availabilityZone: us-east-2a
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2a
            type: Filters
        - placement:
            availabilityZone: us-east-2c
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2c
            type: Filters
        - placement:
            availabilityZone: us-east-2a
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2a
            type: Filters
        - placement:
            availabilityZone: us-east-2b
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2b
            type: Filters
        platform: AWS
      metadata:
        labels:
          machine.openshift.io/cluster-api-cluster: huliu-aws411he2-gbt55
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
      spec:
        lifecycleHooks: {}
        metadata: {}
        providerSpec:
          value:
            ami:
              id: ami-0abf0ec5cdd856934
            apiVersion: machine.openshift.io/v1beta1
            blockDevices:
            - ebs:
                encrypted: true
                iops: 0
                kmsKey:
                  arn: ""
                volumeSize: 120
                volumeType: gp3
            credentialsSecret:
              name: aws-cloud-credentials
            deviceIndex: 0
            iamInstanceProfile:
              id: huliu-aws411he2-gbt55-master-profile
            instanceType: m6i.xlarge
            kind: AWSMachineProviderConfig
            loadBalancers:
            - name: huliu-aws411he2-gbt55-int
              type: network
            - name: huliu-aws411he2-gbt55-ext
              type: network
            metadata:
              creationTimestamp: null
            metadataServiceOptions: {}
            placement:
              region: us-east-2
            securityGroups:
            - filters:
              - name: tag:Name
                values:
                - huliu-aws411he2-gbt55-master-sg
            subnet: {}
            tags:
            - name: kubernetes.io/cluster/huliu-aws411he2-gbt55
              value: owned
            userDataSecret:
              name: master-user-data

4. If I edit the CPMS and change state from Inactive to Active, it triggers an update immediately. However, no update seems necessary, because the three master machines are already in the CPMS failureDomains.

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                                       PHASE     TYPE         REGION      ZONE         AGE
huliu-aws411he2-gbt55-master-0                             Running   m6i.xlarge   us-east-2   us-east-2a   138m
huliu-aws411he2-gbt55-master-1                             Running   m6i.xlarge   us-east-2   us-east-2b   138m
huliu-aws411he2-gbt55-master-gcszr-2                       Running   m6i.xlarge   us-east-2   us-east-2a   14m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-nmkwf   Running   m6g.large    us-east-2   us-east-2a   134m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-xw2df   Running   m6g.large    us-east-2   us-east-2a   134m
huliu-aws411he2-gbt55-worker-us-east-2a-pbsxw              Running   m6i.xlarge   us-east-2   us-east-2a   134m
huliu-aws411he2-gbt55-worker-us-east-2b-tpzn2              Running   m6i.xlarge   us-east-2   us-east-2b   134m
huliu-aws411he2-gbt55-worker-us-east-2c-bxchx              Running   m6i.xlarge   us-east-2   us-east-2c   134m
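
To make the failureDomains inspection in step 3 repeatable, the availability zones can be pulled out of the CPMS object programmatically. The following is a minimal sketch, assuming the object was exported as JSON (for example with `oc get controlplanemachineset cluster -n openshift-machine-api -o json`). The function name `failure_domain_zones` is hypothetical, and the sample document is trimmed by hand to the fields shown in the YAML above.

```python
import json


def failure_domain_zones(cpms: dict) -> list:
    """Return the AWS availability zones listed in a CPMS, in order.

    Hypothetical helper: it only follows the field layout visible in
    this report (spec.template.<machineType>.failureDomains.aws).
    """
    template = cpms["spec"]["template"]
    machine_type = template["machineType"]
    domains = template[machine_type].get("failureDomains", {})
    return [fd["placement"]["availabilityZone"] for fd in domains.get("aws", [])]


# Trimmed-down copy of the CPMS from this report (illustrative only).
sample = json.loads("""
{
  "spec": {
    "template": {
      "machineType": "machines_v1beta1_machine_openshift_io",
      "machines_v1beta1_machine_openshift_io": {
        "failureDomains": {
          "platform": "AWS",
          "aws": [
            {"placement": {"availabilityZone": "us-east-2a"}},
            {"placement": {"availabilityZone": "us-east-2c"}},
            {"placement": {"availabilityZone": "us-east-2a"}},
            {"placement": {"availabilityZone": "us-east-2b"}}
          ]
        }
      }
    }
  }
}
""")

zones = failure_domain_zones(sample)
print(zones)
```

With a real export, `json.load` on the saved file would replace the inline sample.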
Actual results:
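The CPMS failureDomains are not consistent with the master machines: us-east-2a is listed twice. After the CPMS is set to Active, huliu-aws411he2-gbt55-master-2 (us-east-2c) is replaced by huliu-aws411he2-gbt55-master-gcszr-2 (us-east-2a).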
Expected results:
The CPMS failureDomains should be consistent with the master machines.
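
The expected result can be expressed as a small check. This is only a sketch under one reading of "consistent": every master zone appears in the CPMS failureDomains, no zone is listed more than once, and no failure domain lacks a master. That definition and the function name `check_failure_domains` are assumptions for illustration, not the operator's actual logic. The zone lists are the ones reported in step 3.

```python
from collections import Counter


def check_failure_domains(cpms_zones, master_zones):
    """Compare CPMS failure-domain zones with master machine zones.

    Assumed definition of consistency (not taken from the operator):
    no duplicate zones, no master zone missing from the CPMS, and no
    CPMS zone without a master machine.
    """
    counts = Counter(cpms_zones)
    duplicates = sorted(zone for zone, n in counts.items() if n > 1)
    missing = sorted(set(master_zones) - set(cpms_zones))
    unused = sorted(set(cpms_zones) - set(master_zones))
    return {
        "duplicates": duplicates,
        "missing": missing,
        "unused": unused,
        "consistent": not (duplicates or missing or unused),
    }


# Values observed in this report after the upgrade to 4.12.
cpms_zones = ["us-east-2a", "us-east-2c", "us-east-2a", "us-east-2b"]
master_zones = ["us-east-2a", "us-east-2b", "us-east-2c"]

report = check_failure_domains(cpms_zones, master_zones)
print(report)
# {'duplicates': ['us-east-2a'], 'missing': [], 'unused': [], 'consistent': False}
```

For the data in this report, the only finding is the duplicated us-east-2a entry.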
Additional info:
Must-gather: https://drive.google.com/file/d/1fnz22ay9wvXPwKirkSmjX7qCj2aIH8Wg/view?usp=sharing
Installing a fresh 4.12 heterogeneous cluster does not show this issue, and neither does upgrading a non-heterogeneous cluster. So it seems to occur only when a heterogeneous cluster is upgraded.