-
Bug
-
Resolution: Done
-
Normal
-
None
-
4.13
-
None
-
Moderate
-
None
-
CLOUD Sprint 228
-
1
-
False
-
-
N/A
-
Bug Fix
-
Done
Description of problem:
Deleting or adding a failureDomain in the CPMS to trigger an update does not work correctly on GCP.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2022-11-19-182111
How reproducible:
always
Steps to Reproduce:
1. Launch a 4.13 cluster on GCP.

    liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.13.0-0.nightly-2022-11-19-182111   True        False         80m     Cluster version is 4.13.0-0.nightly-2022-11-19-182111
    liuhuali@Lius-MacBook-Pro huali-test % oc project openshift-machine-api
    Now using project "openshift-machine-api" on server "https://api.huliu-gcp13c2.qe.gcp.devcluster.openshift.com:6443".
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    NAME                                 PHASE     TYPE            REGION        ZONE            AGE
    huliu-gcp13c2-6sh7k-master-0         Running   n2-standard-4   us-central1   us-central1-a   102m
    huliu-gcp13c2-6sh7k-master-1         Running   n2-standard-4   us-central1   us-central1-b   102m
    huliu-gcp13c2-6sh7k-master-2         Running   n2-standard-4   us-central1   us-central1-c   102m
    huliu-gcp13c2-6sh7k-worker-a-8sftf   Running   n2-standard-4   us-central1   us-central1-a   99m
    huliu-gcp13c2-6sh7k-worker-b-zb48r   Running   n2-standard-4   us-central1   us-central1-b   99m
    huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running   n2-standard-4   us-central1   us-central1-c   99m
    liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
    NAME                           DESIRED   CURRENT   READY   AVAILABLE   AGE
    huliu-gcp13c2-6sh7k-worker-a   1         1         1       1           102m
    huliu-gcp13c2-6sh7k-worker-b   1         1         1       1           102m
    huliu-gcp13c2-6sh7k-worker-c   1         1         1       1           102m
    huliu-gcp13c2-6sh7k-worker-f   0         0                             102m
    liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
    NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE      AGE
    cluster   3         3         3       3                       Inactive   99m

2. Edit the CPMS and change its state to Active.

    liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
    controlplanemachineset.machine.openshift.io/cluster edited
    liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
    NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
    cluster   3         3         3       3                       Active   100m

3. Edit the CPMS. There are four failureDomains (us-central1-a, us-central1-b, us-central1-c, us-central1-f) by default. Delete the first one (us-central1-a). The new machine gets stuck in Provisioning.

    liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
    controlplanemachineset.machine.openshift.io/cluster edited
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    NAME                                 PHASE          TYPE            REGION        ZONE            AGE
    huliu-gcp13c2-6sh7k-master-0         Running        n2-standard-4   us-central1   us-central1-a   104m
    huliu-gcp13c2-6sh7k-master-1         Running        n2-standard-4   us-central1   us-central1-b   104m
    huliu-gcp13c2-6sh7k-master-2         Running        n2-standard-4   us-central1   us-central1-c   104m
    huliu-gcp13c2-6sh7k-master-gb5b4-0   Provisioning                                                 3s
    huliu-gcp13c2-6sh7k-worker-a-8sftf   Running        n2-standard-4   us-central1   us-central1-a   101m
    huliu-gcp13c2-6sh7k-worker-b-zb48r   Running        n2-standard-4   us-central1   us-central1-b   101m
    huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running        n2-standard-4   us-central1   us-central1-c   101m
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    NAME                                 PHASE          TYPE            REGION        ZONE            AGE
    huliu-gcp13c2-6sh7k-master-0         Running        n2-standard-4   us-central1   us-central1-a   131m
    huliu-gcp13c2-6sh7k-master-1         Running        n2-standard-4   us-central1   us-central1-b   131m
    huliu-gcp13c2-6sh7k-master-2         Running        n2-standard-4   us-central1   us-central1-c   131m
    huliu-gcp13c2-6sh7k-master-gb5b4-0   Provisioning   n2-standard-4   us-central1   us-central1-f   26m
    huliu-gcp13c2-6sh7k-worker-a-8sftf   Running        n2-standard-4   us-central1   us-central1-a   127m
    huliu-gcp13c2-6sh7k-worker-b-zb48r   Running        n2-standard-4   us-central1   us-central1-b   127m
    huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running        n2-standard-4   us-central1   us-central1-c   127m

machine-controller log:

    E1121 05:10:15.654929 1 actuator.go:53] huliu-gcp13c2-6sh7k-master-gb5b4-0 error: huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound
    E1121 05:10:15.655015 1 controller.go:315] huliu-gcp13c2-6sh7k-master-gb5b4-0: error updating machine: huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound, retrying in 30s seconds
    I1121 05:10:15.655829 1 recorder.go:103] events "msg"="huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"huliu-gcp13c2-6sh7k-master-gb5b4-0","uid":"008cbb45-2b29-493e-8985-37f87fe6a98d","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"60780"} "reason"="FailedUpdate" "type"="Warning"

4. Edit the CPMS and add the failureDomain (us-central1-a) back. The machine gets stuck in Deleting.

    liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
    controlplanemachineset.machine.openshift.io/cluster edited
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    NAME                                 PHASE      TYPE            REGION        ZONE            AGE
    huliu-gcp13c2-6sh7k-master-0         Running    n2-standard-4   us-central1   us-central1-a   3h37m
    huliu-gcp13c2-6sh7k-master-1         Running    n2-standard-4   us-central1   us-central1-b   3h37m
    huliu-gcp13c2-6sh7k-master-2         Running    n2-standard-4   us-central1   us-central1-c   3h37m
    huliu-gcp13c2-6sh7k-master-gb5b4-0   Deleting   n2-standard-4   us-central1   us-central1-f   113m
    huliu-gcp13c2-6sh7k-worker-a-8sftf   Running    n2-standard-4   us-central1   us-central1-a   3h34m
    huliu-gcp13c2-6sh7k-worker-b-zb48r   Running    n2-standard-4   us-central1   us-central1-b   3h34m
    huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running    n2-standard-4   us-central1   us-central1-c   3h34m
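The machine-controller error in step 3 names the exact GCP resource that could not be found. As a triage aid, the following minimal sketch pulls the project, zone, and instance group out of such a log line. The function name `missing_instance_group` and the regular expression are illustrative assumptions based only on the message format shown in this report, not part of any OpenShift tooling.

```python
import re

# Matches the resource path in the googleapi 404 message quoted in step 3.
# The pattern is an assumption derived from that single log format.
RESOURCE_RE = re.compile(
    r"The resource 'projects/(?P<project>[^/]+)/zones/(?P<zone>[^/]+)"
    r"/instanceGroups/(?P<group>[^']+)' was not found"
)


def missing_instance_group(log_line):
    """Return the project, zone, and group named in a 404 log line, or None."""
    match = RESOURCE_RE.search(log_line)
    return match.groupdict() if match else None


# Error text copied from the machine-controller log in this report.
line = (
    "failed to register instance to instance group: failed to fetch running "
    "instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: "
    "instanceGroupsListInstances request failed: googleapi: Error 404: "
    "The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/"
    "huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound"
)

print(missing_instance_group(line))
```

For the line above, the extracted zone is us-central1-f and the group is huliu-gcp13c2-6sh7k-master-us-central1-f, which is the failureDomain the replacement machine was placed in.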
Actual results:
When a failureDomain is deleted, the new machine gets stuck in Provisioning. When the failureDomain is added back, the new machine gets stuck in Deleting.
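To spot this condition from `oc get machine` output, a small helper can flag machines that are not Running after some time. This is only a sketch: the names `age_to_seconds` and `stuck_machines` and the 10-minute threshold are assumptions chosen for illustration, and the parser relies on the column layout shown in this report (name first, phase second, age last).

```python
import re


def age_to_seconds(age):
    """Convert an age such as '3s', '26m', or '3h37m' to seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([dhms])", age))


def stuck_machines(table, threshold_seconds=600):
    """Return (name, phase, age) for machines in any phase other than
    Running whose age is at or above the threshold."""
    stuck = []
    for row in table.strip().splitlines()[1:]:  # skip the header row
        fields = row.split()
        if len(fields) < 3:
            continue  # not a machine row
        name, phase, age = fields[0], fields[1], fields[-1]
        if phase != "Running" and age_to_seconds(age) >= threshold_seconds:
            stuck.append((name, phase, age))
    return stuck


# Rows taken from the second `oc get machine` listing in step 3.
sample = """
NAME PHASE TYPE REGION ZONE AGE
huliu-gcp13c2-6sh7k-master-0 Running n2-standard-4 us-central1 us-central1-a 131m
huliu-gcp13c2-6sh7k-master-gb5b4-0 Provisioning n2-standard-4 us-central1 us-central1-f 26m
"""

print(stuck_machines(sample))
```

Because the age is read from the last column, the helper also handles the early listing in which the new machine has no type, region, or zone yet.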
Expected results:
When a failureDomain is deleted, the new machine should reach Running; when the failureDomain is added back, the new machine should be deleted successfully. Alternatively, if the machine cannot be created in the failureDomain, the new machine should go to Failed when the failureDomain is deleted, and it should be deleted successfully when the failureDomain is added back.
Additional info:
Must-gather: https://drive.google.com/file/d/1AxnVwToQ15g6M4Mc5S7rh62FygM44B6f/view?usp=sharing
A worker machine was created successfully in this failureDomain:
    huliu-gcp13c2-6sh7k-worker-f-g5h77   Running   n2-standard-4   us-central1   us-central1-f   8m36s
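Since a worker machine runs in us-central1-f, the 404 in step 3 appears to concern the control plane instance group rather than the zone itself. The sketch below lists which failureDomains would lack a master instance group, assuming the groups are named `<infrastructure ID>-master-<zone>` as the log message suggests. The function names and the `existing` set are hypothetical example input, not data collected from this cluster.

```python
def expected_master_groups(infra_id, zones):
    """Map each zone to the instance group name the log message implies."""
    return {zone: f"{infra_id}-master-{zone}" for zone in zones}


def zones_without_group(infra_id, zones, existing_groups):
    """Return the zones whose expected master instance group is absent."""
    expected = expected_master_groups(infra_id, zones)
    return [zone for zone in zones if expected[zone] not in existing_groups]


infra_id = "huliu-gcp13c2-6sh7k"
failure_domains = ["us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f"]

# Hypothetical: assume groups exist only for the three original master zones.
existing = {
    "huliu-gcp13c2-6sh7k-master-us-central1-a",
    "huliu-gcp13c2-6sh7k-master-us-central1-b",
    "huliu-gcp13c2-6sh7k-master-us-central1-c",
}

print(zones_without_group(infra_id, failure_domains, existing))
```

Under that assumption, only us-central1-f is reported, matching the group named in the error.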