Bug · Resolution: Done · Major · 4.12 · Quality / Stability / Reliability · Moderate · Rejected · CLOUD Sprint 226
Description of problem:
CPMS failureDomains does not stay consistent with the master machines on a heterogeneous cluster after upgrading from 4.11 to 4.12.
Version-Release number of selected component (if applicable):
4.11.9-multi -> 4.12.0-0.nightly-multi-2022-10-20-153503
How reproducible:
always
Steps to Reproduce:
1. Launch a 4.11 heterogeneous cluster on AWS; we use the automated template
https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_11/ipi-on-aws/versioned-installer-x86_arm64_heterogeneous_workers
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.9-multi True False 25m Cluster version is 4.11.9-multi
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.9-multi True False False 25m
baremetal 4.11.9-multi True False False 43m
cloud-controller-manager 4.11.9-multi True False False 45m
cloud-credential 4.11.9-multi True False False 46m
cluster-api 4.11.9-multi True False False 44m
cluster-autoscaler 4.11.9-multi True False False 43m
config-operator 4.11.9-multi True False False 44m
console 4.11.9-multi True False False 31m
csi-snapshot-controller 4.11.9-multi True False False 43m
dns 4.11.9-multi True False False 43m
etcd 4.11.9-multi True False False 42m
image-registry 4.11.9-multi True False False 38m
ingress 4.11.9-multi True False False 38m
insights 4.11.9-multi True False False 37m
kube-apiserver 4.11.9-multi True False False 40m
kube-controller-manager 4.11.9-multi True False False 41m
kube-scheduler 4.11.9-multi True False False 41m
kube-storage-version-migrator 4.11.9-multi True False False 44m
machine-api 4.11.9-multi True False False 40m
machine-approver 4.11.9-multi True False False 43m
machine-config 4.11.9-multi True False False 42m
marketplace 4.11.9-multi True False False 43m
monitoring 4.11.9-multi True False False 35m
network 4.11.9-multi True False False 45m
node-tuning 4.11.9-multi True False False 43m
openshift-apiserver 4.11.9-multi True False False 38m
openshift-controller-manager 4.11.9-multi True False False 43m
openshift-samples 4.11.9-multi True False False 37m
operator-lifecycle-manager 4.11.9-multi True False False 43m
operator-lifecycle-manager-catalog 4.11.9-multi True False False 43m
operator-lifecycle-manager-packageserver 4.11.9-multi True False False 38m
service-ca 4.11.9-multi True False False 44m
storage 4.11.9-multi True False False 38m
2. Upgrade to 4.12.0-0.nightly-multi-2022-10-20-153503.
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-0.nightly-multi-2022-10-20-153503 True False 15m Cluster version is 4.12.0-0.nightly-multi-2022-10-20-153503
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 120m
baremetal 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 137m
cloud-controller-manager 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 140m
cloud-credential 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 140m
cluster-api 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
cluster-autoscaler 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 137m
config-operator 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
console 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 125m
control-plane-machine-set 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 55m
csi-snapshot-controller 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
dns 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 137m
etcd 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 136m
image-registry 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
ingress 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
insights 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
kube-apiserver 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 134m
kube-controller-manager 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 135m
kube-scheduler 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 135m
kube-storage-version-migrator 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 27m
machine-api 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 134m
machine-approver 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
machine-config 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 76m
marketplace 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 137m
monitoring 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 130m
network 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 139m
node-tuning 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 52m
openshift-apiserver 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
openshift-controller-manager 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 52m
openshift-samples 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 55m
operator-lifecycle-manager 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
operator-lifecycle-manager-catalog 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
operator-lifecycle-manager-packageserver 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
platform-operators-aggregated 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 27m
service-ca 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
storage 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
3. Found that a CPMS was created, but its failureDomains list shows us-east-2a, us-east-2c, us-east-2a, us-east-2b, which is not consistent with the master machines (us-east-2a, us-east-2b, us-east-2c): us-east-2a is listed twice, giving four entries for three machines.
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-aws411he2-gbt55-master-0 Running m6i.xlarge us-east-2 us-east-2a 113m
huliu-aws411he2-gbt55-master-1 Running m6i.xlarge us-east-2 us-east-2b 113m
huliu-aws411he2-gbt55-master-2 Running m6i.xlarge us-east-2 us-east-2c 113m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-nmkwf Running m6g.large us-east-2 us-east-2a 109m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-xw2df Running m6g.large us-east-2 us-east-2a 109m
huliu-aws411he2-gbt55-worker-us-east-2a-pbsxw Running m6i.xlarge us-east-2 us-east-2a 109m
huliu-aws411he2-gbt55-worker-us-east-2b-tpzn2 Running m6i.xlarge us-east-2 us-east-2b 109m
huliu-aws411he2-gbt55-worker-us-east-2c-bxchx Running m6i.xlarge us-east-2 us-east-2c 109m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 3 3 2 Inactive 44m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -o yaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  creationTimestamp: "2022-10-21T09:19:02Z"
  finalizers:
  - controlplanemachineset.machine.openshift.io
  generation: 1
  name: cluster
  namespace: openshift-machine-api
  resourceVersion: "63863"
  uid: c33d01d3-c7f3-411f-aaed-4c5339b166d3
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-aws411he2-gbt55
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  state: Inactive
  strategy:
    type: RollingUpdate
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        aws:
        - placement:
            availabilityZone: us-east-2a
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2a
            type: Filters
        - placement:
            availabilityZone: us-east-2c
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2c
            type: Filters
        - placement:
            availabilityZone: us-east-2a
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2a
            type: Filters
        - placement:
            availabilityZone: us-east-2b
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2b
            type: Filters
        platform: AWS
      metadata:
        labels:
          machine.openshift.io/cluster-api-cluster: huliu-aws411he2-gbt55
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
      spec:
        lifecycleHooks: {}
        metadata: {}
        providerSpec:
          value:
            ami:
              id: ami-0abf0ec5cdd856934
            apiVersion: machine.openshift.io/v1beta1
            blockDevices:
            - ebs:
                encrypted: true
                iops: 0
                kmsKey:
                  arn: ""
                volumeSize: 120
                volumeType: gp3
            credentialsSecret:
              name: aws-cloud-credentials
            deviceIndex: 0
            iamInstanceProfile:
              id: huliu-aws411he2-gbt55-master-profile
            instanceType: m6i.xlarge
            kind: AWSMachineProviderConfig
            loadBalancers:
            - name: huliu-aws411he2-gbt55-int
              type: network
            - name: huliu-aws411he2-gbt55-ext
              type: network
            metadata:
              creationTimestamp: null
            metadataServiceOptions: {}
            placement:
              region: us-east-2
            securityGroups:
            - filters:
              - name: tag:Name
                values:
                - huliu-aws411he2-gbt55-master-sg
            subnet: {}
            tags:
            - name: kubernetes.io/cluster/huliu-aws411he2-gbt55
              value: owned
            userDataSecret:
              name: master-user-data
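The mismatch in step 3 reduces to a multiset comparison. A minimal sketch in Python, using the zone lists copied verbatim from the outputs above (the helper name `consistent` is our illustration, not part of any OpenShift API):

```python
from collections import Counter

# Zones of the actual master machines, from `oc get machine`.
master_zones = ["us-east-2a", "us-east-2b", "us-east-2c"]

# Zones listed under failureDomains.aws in the CPMS spec above.
cpms_zones = ["us-east-2a", "us-east-2c", "us-east-2a", "us-east-2b"]

def consistent(masters, domains):
    # The generated failureDomains should cover exactly the zones the
    # masters occupy, with no duplicate entries.
    return Counter(domains) == Counter(masters)

print(consistent(master_zones, cpms_zones))  # False: us-east-2a listed twice
```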
4. If I edit the CPMS and change state from Inactive to Active, it triggers an update immediately. But no update should be needed, as all three master machines are already in zones listed in the CPMS failureDomains.
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-aws411he2-gbt55-master-0 Running m6i.xlarge us-east-2 us-east-2a 138m
huliu-aws411he2-gbt55-master-1 Running m6i.xlarge us-east-2 us-east-2b 138m
huliu-aws411he2-gbt55-master-gcszr-2 Running m6i.xlarge us-east-2 us-east-2a 14m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-nmkwf Running m6g.large us-east-2 us-east-2a 134m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-xw2df Running m6g.large us-east-2 us-east-2a 134m
huliu-aws411he2-gbt55-worker-us-east-2a-pbsxw Running m6i.xlarge us-east-2 us-east-2a 134m
huliu-aws411he2-gbt55-worker-us-east-2b-tpzn2 Running m6i.xlarge us-east-2 us-east-2b 134m
huliu-aws411he2-gbt55-worker-us-east-2c-bxchx Running m6i.xlarge us-east-2 us-east-2c 134m
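One way to see why activating the CPMS triggers a replacement even though every master already sits in a listed zone: the generated list carries a surplus entry. A minimal sketch with the zone lists copied from the outputs above (the surplus computation is our illustration of the symptom, not the operator's actual reconciliation code), consistent with master-2 being recreated in us-east-2a:

```python
from collections import Counter

master_zones = ["us-east-2a", "us-east-2b", "us-east-2c"]   # oc get machine
failure_domains = ["us-east-2a", "us-east-2c",
                   "us-east-2a", "us-east-2b"]              # CPMS spec

# Entries in failureDomains with no matching master machine: the
# surplus us-east-2a entry left over after pairing each master with
# one failure domain.
surplus = Counter(failure_domains) - Counter(master_zones)
print(dict(surplus))  # {'us-east-2a': 1}
```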
Actual results:
CPMS failureDomains does not stay consistent with the master machines: us-east-2a is listed twice.
Expected results:
CPMS failureDomains should stay consistent with the master machines, with exactly one entry per master failure domain.
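The expected behavior can be sketched as deduplicating the generated list by availability zone while preserving first-seen order, so three masters yield three entries (`dedupe_failure_domains` is a hypothetical helper, not the operator's actual code):

```python
def dedupe_failure_domains(domains):
    """Keep one failure domain per availability zone, preserving
    first-seen order, so three masters yield three entries."""
    seen = set()
    out = []
    for fd in domains:
        zone = fd["placement"]["availabilityZone"]
        if zone not in seen:
            seen.add(zone)
            out.append(fd)
    return out

# The four entries generated on this cluster, reduced to their zones.
fds = [{"placement": {"availabilityZone": z}}
       for z in ["us-east-2a", "us-east-2c", "us-east-2a", "us-east-2b"]]
print([fd["placement"]["availabilityZone"]
       for fd in dedupe_failure_domains(fds)])
# ['us-east-2a', 'us-east-2c', 'us-east-2b']
```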
Additional info:
Must-gather: https://drive.google.com/file/d/1fnz22ay9wvXPwKirkSmjX7qCj2aIH8Wg/view?usp=sharing
Installing a fresh 4.12 heterogeneous cluster does not hit this issue, and neither does upgrading a non-heterogeneous cluster, so it seems to occur only when upgrading a heterogeneous cluster.