OpenShift Bugs / OCPBUGS-3904

Delete/Add a failureDomain in CPMS to trigger update cannot work right on GCP


    • Type: Bug
    • Resolution: Done
    • Priority: Normal
    • None
    • 4.13
    • None
    • Moderate
    • None
    • CLOUD Sprint 228
    • 1
    • False
    • N/A
    • Bug Fix
    • Done

      Description of problem:

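      On GCP, deleting or adding a failureDomain in the CPMS to trigger an update does not work correctly: the replacement machine gets stuck in Provisioning, and after the failureDomain is added back it gets stuck in Deleting.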

      Version-Release number of selected component (if applicable):

      4.13.0-0.nightly-2022-11-19-182111

      How reproducible:

      always

      Steps to Reproduce:

      1. Launch a 4.13 cluster on GCP.
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.13.0-0.nightly-2022-11-19-182111   True        False         80m     Cluster version is 4.13.0-0.nightly-2022-11-19-182111
      liuhuali@Lius-MacBook-Pro huali-test % oc project openshift-machine-api
      Now using project "openshift-machine-api" on server "https://api.huliu-gcp13c2.qe.gcp.devcluster.openshift.com:6443".
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                 PHASE     TYPE            REGION        ZONE            AGE
      huliu-gcp13c2-6sh7k-master-0         Running   n2-standard-4   us-central1   us-central1-a   102m
      huliu-gcp13c2-6sh7k-master-1         Running   n2-standard-4   us-central1   us-central1-b   102m
      huliu-gcp13c2-6sh7k-master-2         Running   n2-standard-4   us-central1   us-central1-c   102m
      huliu-gcp13c2-6sh7k-worker-a-8sftf   Running   n2-standard-4   us-central1   us-central1-a   99m
      huliu-gcp13c2-6sh7k-worker-b-zb48r   Running   n2-standard-4   us-central1   us-central1-b   99m
      huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running   n2-standard-4   us-central1   us-central1-c   99m
      liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
      NAME                           DESIRED   CURRENT   READY   AVAILABLE   AGE
      huliu-gcp13c2-6sh7k-worker-a   1         1         1       1           102m
      huliu-gcp13c2-6sh7k-worker-b   1         1         1       1           102m
      huliu-gcp13c2-6sh7k-worker-c   1         1         1       1           102m
      huliu-gcp13c2-6sh7k-worker-f   0         0                             102m
      liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
      NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE      AGE
      cluster   3         3         3       3                       Inactive   99m
      
      2. Edit the CPMS and change the state to Active.
      liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
      controlplanemachineset.machine.openshift.io/cluster edited
      liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
      NAME      DESIRED   CURRENT   READY   UPDATED   UNAVAILABLE   STATE    AGE
      cluster   3         3         3       3                       Active   100m 
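
Step 2 could also be done without the interactive editor. The following is a minimal sketch, assuming the state lives at `spec.state` in the CPMS object (the field path is not shown in this report) and that `oc` is logged in to the cluster. The fallback branch only prints the command, so the snippet is safe to run on a machine without `oc`.

```shell
# Merge patch that sets the CPMS state to Active.
# Assumption: the field is spec.state; this report does not show the CPMS YAML.
PATCH='{"spec":{"state":"Active"}}'

# Apply the patch only when oc is available; otherwise print the command.
if command -v oc >/dev/null 2>&1; then
  oc -n openshift-machine-api patch controlplanemachineset cluster --type merge -p "$PATCH"
else
  echo "would run: oc -n openshift-machine-api patch controlplanemachineset cluster --type merge -p '$PATCH'"
fi
```

Afterwards, `oc get controlplanemachineset` should report the STATE column as Active, as in the output above.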
      
      3. Edit the CPMS. By default there are four failureDomains (us-central1-a, us-central1-b, us-central1-c, us-central1-f). Delete the first one (us-central1-a). The new machine gets stuck in Provisioning.
      
      liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
      controlplanemachineset.machine.openshift.io/cluster edited
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                 PHASE          TYPE            REGION        ZONE            AGE
      huliu-gcp13c2-6sh7k-master-0         Running        n2-standard-4   us-central1   us-central1-a   104m
      huliu-gcp13c2-6sh7k-master-1         Running        n2-standard-4   us-central1   us-central1-b   104m
      huliu-gcp13c2-6sh7k-master-2         Running        n2-standard-4   us-central1   us-central1-c   104m
      huliu-gcp13c2-6sh7k-master-gb5b4-0   Provisioning                                                 3s
      huliu-gcp13c2-6sh7k-worker-a-8sftf   Running        n2-standard-4   us-central1   us-central1-a   101m
      huliu-gcp13c2-6sh7k-worker-b-zb48r   Running        n2-standard-4   us-central1   us-central1-b   101m
      huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running        n2-standard-4   us-central1   us-central1-c   101m
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                 PHASE          TYPE            REGION        ZONE            AGE
      huliu-gcp13c2-6sh7k-master-0         Running        n2-standard-4   us-central1   us-central1-a   131m
      huliu-gcp13c2-6sh7k-master-1         Running        n2-standard-4   us-central1   us-central1-b   131m
      huliu-gcp13c2-6sh7k-master-2         Running        n2-standard-4   us-central1   us-central1-c   131m
      huliu-gcp13c2-6sh7k-master-gb5b4-0   Provisioning   n2-standard-4   us-central1   us-central1-f   26m
      huliu-gcp13c2-6sh7k-worker-a-8sftf   Running        n2-standard-4   us-central1   us-central1-a   127m
      huliu-gcp13c2-6sh7k-worker-b-zb48r   Running        n2-standard-4   us-central1   us-central1-b   127m
      huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running        n2-standard-4   us-central1   us-central1-c   127m
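
The failureDomain removal in step 3 could be scripted as a JSON patch instead of `oc edit`. This is a sketch under two assumptions that the report does not confirm: the GCP failureDomains are stored at `spec.template.machines_v1beta1_machine_openshift_io.failureDomains.gcp`, and us-central1-a is the first list entry (index 0), as the step describes.

```shell
# JSON patch that removes the first GCP failureDomain (index 0).
# Assumption: the path below matches the CPMS schema on this cluster.
PATCH='[{"op":"remove","path":"/spec/template/machines_v1beta1_machine_openshift_io/failureDomains/gcp/0"}]'

# Apply the patch only when oc is available; otherwise print the command.
if command -v oc >/dev/null 2>&1; then
  oc -n openshift-machine-api patch controlplanemachineset cluster --type json -p "$PATCH"
else
  echo "would run: oc -n openshift-machine-api patch controlplanemachineset cluster --type json -p '$PATCH'"
fi
```

Because the patch addresses the entry by index, it is worth checking the current order of the failureDomains before applying it.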
      
      machine-controller log:
      E1121 05:10:15.654929       1 actuator.go:53] huliu-gcp13c2-6sh7k-master-gb5b4-0 error: huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound
      E1121 05:10:15.655015       1 controller.go:315] huliu-gcp13c2-6sh7k-master-gb5b4-0: error updating machine: huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound, retrying in 30s seconds
      I1121 05:10:15.655829       1 recorder.go:103] events "msg"="huliu-gcp13c2-6sh7k-master-gb5b4-0: reconciler failed to Update machine: failed to register instance to instance group: failed to fetch running instances in instance group huliu-gcp13c2-6sh7k-master-us-central1-f: instanceGroupsListInstances request failed: googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"huliu-gcp13c2-6sh7k-master-gb5b4-0","uid":"008cbb45-2b29-493e-8985-37f87fe6a98d","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"60780"} "reason"="FailedUpdate" "type"="Warning" 
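
The log lines above all point at one missing GCP resource. As a small illustration, the zone and instance group name can be pulled out of the 404 message with standard text tools. The variable names are illustrative only, and the log line is shortened to its googleapi part.

```shell
# The googleapi part of the machine-controller error shown above.
LOG_LINE="googleapi: Error 404: The resource 'projects/openshift-qe/zones/us-central1-f/instanceGroups/huliu-gcp13c2-6sh7k-master-us-central1-f' was not found, notFound"

# Extract the quoted resource path from the message.
RESOURCE=$(printf '%s\n' "$LOG_LINE" | sed -n "s/.*The resource '\([^']*\)' was not found.*/\1/p")

# The path has the form projects/<project>/zones/<zone>/instanceGroups/<name>.
ZONE=$(printf '%s\n' "$RESOURCE" | cut -d/ -f4)
GROUP=$(printf '%s\n' "$RESOURCE" | cut -d/ -f6)

echo "missing instance group: $GROUP (zone: $ZONE)"
```

The extracted name follows the pattern `<infra-id>-master-<zone>`. In this reproduction no control plane machine had run in us-central1-f before the failureDomain was removed.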
      
      4. Edit the CPMS and add the failureDomain (us-central1-a) back. The new machine gets stuck in Deleting.
      
      liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
      controlplanemachineset.machine.openshift.io/cluster edited
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                 PHASE      TYPE            REGION        ZONE            AGE
      huliu-gcp13c2-6sh7k-master-0         Running    n2-standard-4   us-central1   us-central1-a   3h37m
      huliu-gcp13c2-6sh7k-master-1         Running    n2-standard-4   us-central1   us-central1-b   3h37m
      huliu-gcp13c2-6sh7k-master-2         Running    n2-standard-4   us-central1   us-central1-c   3h37m
      huliu-gcp13c2-6sh7k-master-gb5b4-0   Deleting   n2-standard-4   us-central1   us-central1-f   113m
      huliu-gcp13c2-6sh7k-worker-a-8sftf   Running    n2-standard-4   us-central1   us-central1-a   3h34m
      huliu-gcp13c2-6sh7k-worker-b-zb48r   Running    n2-standard-4   us-central1   us-central1-b   3h34m
      huliu-gcp13c2-6sh7k-worker-c-tlrzl   Running    n2-standard-4   us-central1   us-central1-c   3h34m

      Actual results:

      When a failureDomain is deleted, the new machine gets stuck in Provisioning. When the failureDomain is added back, the new machine gets stuck in Deleting.

      Expected results:

      When a failureDomain is deleted, the new machine should reach Running. When the failureDomain is added back, the new machine should be deleted successfully.
      Alternatively, if the machine cannot be created in the target failureDomain, it should go to Failed when the failureDomain is deleted, and it should be deleted successfully when the failureDomain is added back.

      Additional info:

      Must-gather: 
      https://drive.google.com/file/d/1AxnVwToQ15g6M4Mc5S7rh62FygM44B6f/view?usp=sharing
      
      A worker machine was created successfully in this failureDomain (us-central1-f):
      huliu-gcp13c2-6sh7k-worker-f-g5h77   Running    n2-standard-4   us-central1   us-central1-f   8m36s

              dodvarka@redhat.com Daniel Odvarka (Inactive)
              huliu@redhat.com Huali Liu
              Huali Liu
              Jeana Routh