Bug
Resolution: Done
Normal
4.13
Quality / Stability / Reliability
Moderate
CLOUD Sprint 228
Done
Bug Fix
Description of problem:
Deleting or adding a failureDomain in the CPMS to trigger an update does not work correctly on GCP.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2022-11-19-182111
How reproducible:
always
Steps to Reproduce:
1. Launch a 4.13 cluster on GCP
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2022-11-19-182111 True False 80m Cluster version is 4.13.0-0.nightly-2022-11-19-182111
liuhuali@Lius-MacBook-Pro huali-test % oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.huliu-gcp13c2.qe.gcp.devcluster.openshift.com:6443".
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-gcp13c2-6sh7k-master-0 Running n2-standard-4 us-central1 us-central1-a 102m
huliu-gcp13c2-6sh7k-master-1 Running n2-standard-4 us-central1 us-central1-b 102m
huliu-gcp13c2-6sh7k-master-2 Running n2-standard-4 us-central1 us-central1-c 102m
huliu-gcp13c2-6sh7k-worker-a-8sftf Running n2-standard-4 us-central1 us-central1-a 99m
huliu-gcp13c2-6sh7k-worker-b-zb48r Running n2-standard-4 us-central1 us-central1-b 99m
huliu-gcp13c2-6sh7k-worker-c-tlrzl Running n2-standard-4 us-central1 us-central1-c 99m
liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
huliu-gcp13c2-6sh7k-worker-a 1 1 1 1 102m
huliu-gcp13c2-6sh7k-worker-b 1 1 1 1 102m
huliu-gcp13c2-6sh7k-worker-c 1 1 1 1 102m
huliu-gcp13c2-6sh7k-worker-f 0 0 102m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 3 3 3 Inactive 99m
2. Edit the CPMS and change the state to Active
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 3 3 3 Active 100m
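The interactive `oc edit` in step 2 could also be done non-interactively. The following is a minimal sketch, assuming the state lives at `.spec.state` of the ControlPlaneMachineSet named `cluster`; the helper names `cpms_state_patch` and `activate_cpms` are hypothetical and not part of `oc`.

```shell
#!/bin/sh
# Build the merge patch that sets .spec.state on a ControlPlaneMachineSet.
# Hypothetical helper; the field path is an assumption about the CPMS API.
cpms_state_patch() {
  printf '{"spec":{"state":"%s"}}' "$1"
}

# Apply the patch to the CPMS named "cluster".
# Requires a logged-in oc session; nothing runs until this is called.
activate_cpms() {
  oc patch controlplanemachineset cluster \
    -n openshift-machine-api \
    --type=merge \
    -p "$(cpms_state_patch Active)"
}
```

Calling `activate_cpms` and then `oc get controlplanemachineset` should show the STATE column change from Inactive to Active, as in the output above.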
3. Edit the CPMS. By default there are four failureDomains (us-central1-a, us-central1-b, us-central1-c, us-central1-f). Delete the first one (us-central1-a). The new machine gets stuck in Provisioning.
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-gcp13c2-6sh7k-master-0 Running n2-standard-4 us-central1 us-central1-a 104m
huliu-gcp13c2-6sh7k-master-1 Running n2-standard-4 us-central1 us-central1-b 104m
huliu-gcp13c2-6sh7k-master-2 Running n2-standard-4 us-central1 us-central1-c 104m
huliu-gcp13c2-6sh7k-master-gb5b4-0 Provisioning 3s
huliu-gcp13c2-6sh7k-worker-a-8sftf Running n2-standard-4 us-central1 us-central1-a 101m
huliu-gcp13c2-6sh7k-worker-b-zb48r Running n2-standard-4 us-central1 us-central1-b 101m
huliu-gcp13c2-6sh7k-worker-c-tlrzl Running n2-standard-4 us-central1 us-central1-c 101m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-gcp13c2-6sh7k-master-0 Running n2-standard-4 us-central1 us-central1-a 131m
huliu-gcp13c2-6sh7k-master-1 Running n2-standard-4 us-central1 us-central1-b 131m
huliu-gcp13c2-6sh7k-master-2 Running n2-standard-4 us-central1 us-central1-c 131m
huliu-gcp13c2-6sh7k-master-gb5b4-0 Provisioning n2-standard-4 us-central1 us-central1-f 26m
huliu-gcp13c2-6sh7k-worker-a-8sftf Running n2-standard-4 us-central1 us-central1-a 127m
huliu-gcp13c2-6sh7k-worker-b-zb48r Running n2-standard-4 us-central1 us-central1-b 127m
huliu-gcp13c2-6sh7k-worker-c-tlrzl Running n2-standard-4 us-central1 us-central1-c 127m
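The failureDomain removal in step 3 can be expressed as a JSON patch instead of an interactive edit, which may make the reproduction easier to script. This is a sketch under assumptions: the GCP failureDomains are taken to live at `.spec.template.machines_v1beta1_machine_openshift_io.failureDomains.gcp`, us-central1-a is taken to be index 0 (the report says it is the first entry), and the helper names are hypothetical.

```shell
#!/bin/sh
# Build a JSON patch that removes the GCP failureDomain at the given index.
# The path below is an assumption about the CPMS schema, not taken from the report.
failure_domain_remove_patch() {
  printf '[{"op":"remove","path":"/spec/template/machines_v1beta1_machine_openshift_io/failureDomains/gcp/%s"}]' "$1"
}

# Remove the failureDomain at index $1 from the CPMS named "cluster".
# Requires a logged-in oc session; nothing runs until this is called.
remove_failure_domain() {
  oc patch controlplanemachineset cluster \
    -n openshift-machine-api \
    --type=json \
    -p "$(failure_domain_remove_patch "$1")"
}
```

For example, `remove_failure_domain 0` would drop the first entry. It is worth listing the current failureDomains first to confirm which index holds us-central1-a.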
machine-controller log:
E1121 05:10:15.654929 1 actuator.go:53] huliu-gcp13c2-6sh7k-master-gb5b4-0 error: huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound
E1121 05:10:15.655015 1 controller.go:315] huliu-gcp13c2-6sh7k-master-gb5b4-0: error updating machine: huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound, retrying in 30s seconds
I1121 05:10:15.655829 1 recorder.go:103] events "msg"="huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"huliu-gcp13c2-6sh7k-master-gb5b4-0","uid":"008cbb45-2b29-493e-8985-37f87fe6a98d","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"60780"} "reason"="FailedUpdate" "type"="Warning"
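All three log lines point at the same missing GCP resource. As a small aid for triage, the sketch below pulls the resource path out of such a line with `sed`; `missing_resource` is a hypothetical helper name, and the pattern simply matches the `The resource '...' was not found` wording seen in the log above.

```shell
#!/bin/sh
# Read log lines on stdin and print the path of any GCP resource
# reported as "not found" (one path per matching line).
missing_resource() {
  sed -n "s/.*The resource '\([^']*\)' was not found.*/\1/p"
}
```

Piping the machine-controller log through `missing_resource` yields the instance group path for us-central1-f. Whether that unmanaged instance group exists can then be checked on the GCP side, for example with `gcloud compute instance-groups unmanaged describe` and the matching `--zone`.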
4. Edit the CPMS and add the failureDomain (us-central1-a) back. The new machine gets stuck in Deleting.
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-gcp13c2-6sh7k-master-0 Running n2-standard-4 us-central1 us-central1-a 3h37m
huliu-gcp13c2-6sh7k-master-1 Running n2-standard-4 us-central1 us-central1-b 3h37m
huliu-gcp13c2-6sh7k-master-2 Running n2-standard-4 us-central1 us-central1-c 3h37m
huliu-gcp13c2-6sh7k-master-gb5b4-0 Deleting n2-standard-4 us-central1 us-central1-f 113m
huliu-gcp13c2-6sh7k-worker-a-8sftf Running n2-standard-4 us-central1 us-central1-a 3h34m
huliu-gcp13c2-6sh7k-worker-b-zb48r Running n2-standard-4 us-central1 us-central1-b 3h34m
huliu-gcp13c2-6sh7k-worker-c-tlrzl Running n2-standard-4 us-central1 us-central1-c 3h34m
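When repeating the steps above, it can help to filter the machine list down to machines that are not Running. A minimal sketch, assuming the default `oc get machine` column order shown above (NAME first, PHASE second); `non_running_machines` is a hypothetical helper name.

```shell
#!/bin/sh
# Read `oc get machine` output on stdin, skip the header row,
# and print "NAME PHASE" for every machine whose phase is not Running.
non_running_machines() {
  awk 'NR > 1 && $2 != "Running" { print $1, $2 }'
}
```

Usage would be along the lines of `oc get machine -n openshift-machine-api | non_running_machines`. For the listing above this leaves only the stuck master machine.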
Actual results:
When a failureDomain is deleted, the new machine gets stuck in Provisioning. When the failureDomain is added back, the new machine gets stuck in Deleting.
Expected results:
When a failureDomain is deleted, the new machine should reach Running, and when the failureDomain is added back, the new machine should be deleted successfully. Alternatively, if the machine cannot be created in the target failureDomain, it should go to Failed when the failureDomain is deleted, and it should be deleted successfully when the failureDomain is added back.
Additional info:
Must-gather: https://drive.google.com/file/d/1AxnVwToQ15g6M4Mc5S7rh62FygM44B6f/view?usp=sharing
A worker machine was created successfully in this failureDomain:
huliu-gcp13c2-6sh7k-worker-f-g5h77 Running n2-standard-4 us-central1 us-central1-f 8m36s