Cluster gets stuck or becomes unreachable when adding a second subnet in controlplanemachineset on Nutanix

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • 4.18, 4.19
    • None
    • Critical
    • None
    • False
      * The following known issues with configuring multiple subnets exist in {product-title} version 4.18:
      +
      --
      ** Adding subnets above the existing subnet in the `subnets` stanza causes a control plane node to become stuck in the `SchedulingDisabled` state.
      As a workaround, only add subnets below the existing subnet in the `subnets` stanza.

      ** Sometimes, after adding a subnet, the updated control plane machines appear in the Nutanix console but the {product-title} cluster is unreachable.
      There is no workaround for this issue.
      --
      +
      (link:https://issues.redhat.com/browse/OCPBUGS-50904[*OCPBUGS-50904*])
    • Known Issue
    • Proposed

      Description of problem:

      Case 1: Adding the new subnet in front of the original subnet in controlplanemachineset causes the cluster to get stuck.
      Case 2: Adding the new subnet after the original subnet in controlplanemachineset sometimes results in a successful RollingUpdate, but sometimes leaves the cluster unreachable.
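
The two cases differ only in where the original subnet ends up in the edited `subnets` list. As a minimal sketch, assuming a subnet was actually added, an edit could be classified before it is applied. The function name `classify_subnet_edit` is hypothetical and is not part of `oc`:

```shell
#!/bin/sh
# Hypothetical helper (not an oc subcommand): classify a subnet edit by where
# the original subnet ends up in the edited list.
#   $1    = UUID of the subnet that was configured before the edit
#   $2... = subnet UUIDs in the order they appear after the edit
classify_subnet_edit() {
  original=$1
  shift
  if [ "$#" -eq 0 ]; then
    echo "invalid: edited list is empty"
    return 1
  fi
  if [ "$1" = "$original" ]; then
    # Original subnet is still first: new subnets were added after it (case 2).
    echo "case2"
  else
    # Another subnet was placed ahead of the original one (case 1).
    echo "case1"
  fi
}

# UUIDs taken from the reproduction steps in this report (case 1 ordering).
classify_subnet_edit ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1 \
  efe26e93-f6cf-4d89-8104-009e85201fa8 \
  ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
# prints: case1
```

A `case1` result corresponds to the ordering that reproduced the stuck node every time in this report; `case2` is the ordering that only fails intermittently.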

      Version-Release number of selected component (if applicable):

          4.18.0-0.nightly-2025-02-14-222249

      How reproducible:

         100% for case 1, 50% for case 2 in my testing

      Steps to Reproduce:

          1. Install a 4.18 cluster on Nutanix
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.0-0.nightly-2025-02-14-222249   True        False         40m     Cluster version is 4.18.0-0.nightly-2025-02-14-222249
      liuhuali@Lius-MacBook-Pro huali-test % oc get infrastructure cluster -oyaml
      apiVersion: config.openshift.io/v1
      kind: Infrastructure
      metadata:
        creationTimestamp: "2025-02-17T00:40:20Z"
        generation: 1
        name: cluster
        resourceVersion: "519"
        uid: d0cafa11-dcdf-4f36-ba5b-2a5b0db2e6b8
      spec:
        cloudConfig:
          key: config
          name: cloud-provider-config
        platformSpec:
          nutanix:
            failureDomains: []
            prismCentral:
              address: prismcentral.lts-cluster.nutanix-dev.devcluster.openshift.com
              port: 9440
            prismElements:
            - endpoint:
                address: 10.0.128.159
                port: 9440
              name: Development-LTS
          type: Nutanix
      status:
        apiServerInternalURI: https://api-int.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
        apiServerURL: https://api.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
        controlPlaneTopology: HighlyAvailable
        cpuPartitioning: None
        etcdDiscoveryDomain: ""
        infrastructureName: ci-op-37d7j87w-590c2-8vq5j
        infrastructureTopology: HighlyAvailable
        platform: Nutanix
        platformStatus:
          nutanix:
            apiServerInternalIP: 10.0.130.10
            apiServerInternalIPs:
            - 10.0.130.10
            ingressIP: 10.0.130.11
            ingressIPs:
            - 10.0.130.11
            loadBalancer:
              type: OpenShiftManagedDefault
          type: Nutanix
      
          2. Add a second subnet in controlplanemachineset.
      For case 1, add the new subnet in front of the original subnet in controlplanemachineset.
      Before adding:
      
                  subnets:
                  - type: uuid
                    uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
      After adding:
      
                  subnets:
                  - type: uuid
                    uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
                  - type: uuid
                    uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
      
      For case 2, add the new subnet after the original subnet in controlplanemachineset.
      Before adding:
      
                  subnets:
                  - type: uuid
                    uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
      After adding:
      
                  subnets:
                  - type: uuid
                    uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
                  - type: uuid
                    uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
      
          3. For case 1, one old master gets stuck (in my testing, sometimes master-0, sometimes master-1, and sometimes master-2).
          
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                        PHASE      TYPE   REGION    ZONE              AGE
      ci-op-37d7j87w-590c2-8vq5j-master-1         Deleting   AHV    Unnamed   Development-LTS   3h53m
      ci-op-37d7j87w-590c2-8vq5j-master-2         Running    AHV    Unnamed   Development-LTS   3h53m
      ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Running    AHV    Unnamed   Development-LTS   166m
      ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Running    AHV    Unnamed   Development-LTS   156m
      ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Running    AHV    Unnamed   Development-LTS   3h50m
      ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Running    AHV    Unnamed   Development-LTS   3h50m
      ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Running    AHV    Unnamed   Development-LTS   3h50m
      liuhuali@Lius-MacBook-Pro huali-test % oc get node
      NAME                                        STATUS                     ROLES                  AGE     VERSION
      ci-op-37d7j87w-590c2-8vq5j-master-1         Ready,SchedulingDisabled   control-plane,master   3h53m   v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-master-2         Ready                      control-plane,master   3h53m   v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Ready                      control-plane,master   164m    v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Ready                      control-plane,master   154m    v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Ready                      worker                 3h37m   v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Ready                      worker                 3h37m   v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Ready                      worker                 3h37m   v1.31.5
      liuhuali@Lius-MacBook-Pro huali-test % oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h28m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
      baremetal                                  4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      cloud-controller-manager                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h52m   
      cloud-credential                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      cluster-autoscaler                         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      config-operator                            4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      console                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h34m   
      control-plane-machine-set                  4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h46m   Observed 1 replica(s) in need of update
      csi-snapshot-controller                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      dns                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      etcd                                       4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h48m   NodeInstallerProgressing: 2 nodes are at revision 8; 1 node is at revision 10; 1 node is at revision 15; 0 nodes have achieved new revision 17
      image-registry                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h19m   
      ingress                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h35m   
      insights                                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      kube-apiserver                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h46m   GuardControllerDegraded: Missing operand on node ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1
      kube-controller-manager                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h46m   
      kube-scheduler                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h48m   
      kube-storage-version-migrator              4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m    
      machine-api                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h37m   
      machine-approver                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      machine-config                             4.18.0-0.nightly-2025-02-14-222249   True        False         True       3h50m   Failed to resync 4.18.0-0.nightly-2025-02-14-222249 because: error during syncRequiredMachineConfigPools: [context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 4, ready 3, updated: 4, unavailable: 1, degraded: 0)]
      marketplace                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      monitoring                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h33m   
      network                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      node-tuning                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      154m    
      olm                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m    
      openshift-apiserver                        4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h35m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
      openshift-controller-manager               4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m   
      openshift-samples                          4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m   
      operator-lifecycle-manager                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      operator-lifecycle-manager-catalog         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      operator-lifecycle-manager-packageserver   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m   
      service-ca                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      storage                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      liuhuali@Lius-MacBook-Pro huali-test % 
      
      For case 2, I am unable to connect to the cluster, but I can see on the Nutanix console that the masters are being replaced by new masters through the RollingUpdate: https://drive.google.com/file/d/1-UbFiUiyhmeBVTBAVaB23jthiZI0VjAm/view?usp=sharing
      
      liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset
      controlplanemachineset.machine.openshift.io/cluster edited
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                        PHASE          TYPE   REGION    ZONE              AGE
      ci-op-0pdvmm2s-f3468-7khf5-master-0         Running        AHV    Unnamed   Development-LTS   71m
      ci-op-0pdvmm2s-f3468-7khf5-master-1         Running        AHV    Unnamed   Development-LTS   71m
      ci-op-0pdvmm2s-f3468-7khf5-master-2         Running        AHV    Unnamed   Development-LTS   71m
      ci-op-0pdvmm2s-f3468-7khf5-master-qmj72-0   Provisioning                                      5s
      ci-op-0pdvmm2s-f3468-7khf5-worker-fbj48     Running        AHV    Unnamed   Development-LTS   68m
      ci-op-0pdvmm2s-f3468-7khf5-worker-pv8jw     Running        AHV    Unnamed   Development-LTS   68m
      ci-op-0pdvmm2s-f3468-7khf5-worker-xpwrf     Running        AHV    Unnamed   Development-LTS   68m
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine                        
      Unable to connect to the server: net/http: TLS handshake timeout
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % 
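
The case 1 symptom shows up as a machine that stays in the `Deleting` phase. As a rough sketch, the table printed by `oc get machine` could be filtered for such machines. The function name `machines_deleting` is hypothetical, and the filter assumes the column layout shown in the output above (NAME first, PHASE second):

```shell
#!/bin/sh
# Hypothetical filter: read `oc get machine` table output on stdin and print
# the names of machines whose PHASE column is "Deleting".
machines_deleting() {
  # NR > 1 skips the header row; $1 is NAME and $2 is PHASE.
  awk 'NR > 1 && $2 == "Deleting" { print $1 }'
}

# Sample rows copied from the case 1 output in this report.
machines_deleting <<'EOF'
NAME                                        PHASE      TYPE   REGION    ZONE              AGE
ci-op-37d7j87w-590c2-8vq5j-master-1         Deleting   AHV    Unnamed   Development-LTS   3h53m
ci-op-37d7j87w-590c2-8vq5j-master-2         Running    AHV    Unnamed   Development-LTS   3h53m
EOF
# prints: ci-op-37d7j87w-590c2-8vq5j-master-1
```

On a live cluster the same filter would be fed from `oc get machine | machines_deleting`. A machine in `Deleting` is not necessarily stuck; the AGE column and repeated checks are still needed to tell a slow replacement from a hung one.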

      Actual results:

          The cluster gets stuck or becomes unreachable.

      Expected results:

          The RollingUpdate completes successfully and the cluster remains reachable.
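
Whether the cluster stays reachable during the update could be checked with a bounded retry loop instead of repeated manual calls. The sketch below is only an illustration: `retry` is a hypothetical helper, and the attempt count and delay are arbitrary values, not figures from this report:

```shell
#!/bin/sh
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts; report success as soon as one attempt succeeds.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "reachable after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Do not sleep after the final attempt.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "unreachable after $attempts attempt(s)"
  return 1
}

# Stand-in probe so the sketch runs without a cluster.
retry 3 0 true
# prints: reachable after 1 attempt(s)
```

Against a real cluster the probe would be the same command used in this report, for example `retry 10 30 oc get machine`.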

      Additional info:

          must-gather for case 1: https://drive.google.com/file/d/1ZeN_5bnCYbOFuCihv1zIt3Y26rmNynBw/view?usp=sharing

            Huali Liu added a comment - - edited

            Today I tested adding failureDomains as a day-2 operation on 4.18, following test case OCP-70808 ([ipi-on-nutanix] adding failureDomains to an existing Nutanix cluster) but setting two subnets for each failure domain. I hit the issue: the cluster became unreachable.

            Steps:

            1. Install a Nutanix IPI cluster without failureDomains

            liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
            NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.18.0-0.nightly-2025-02-18-114102   True        False         100m    Cluster version is 4.18.0-0.nightly-2025-02-18-114102
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            NAME                                      PHASE     TYPE   REGION    ZONE              AGE
            ci-op-0g1t02w6-9c0d9-wt5g7-master-0       Running   AHV    Unnamed   Development-GPU   128m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-1       Running   AHV    Unnamed   Development-GPU   128m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2       Running   AHV    Unnamed   Development-GPU   128m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8   Running   AHV    Unnamed   Development-GPU   125m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd   Running   AHV    Unnamed   Development-GPU   125m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695   Running   AHV    Unnamed   Development-GPU   125m 

            2. Enable the feature gate and wait for the cluster to become ready

            liuhuali@Lius-MacBook-Pro huali-test % oc edit featuregate                  
            featuregate.config.openshift.io/cluster edited
            
            spec:
               customNoUpgrade:
                 enabled:
                 - NutanixMultiSubnets
               featureSet: CustomNoUpgrade
            
            liuhuali@Lius-MacBook-Pro huali-test % oc get co
            NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
            authentication                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      105m    
            baremetal                                  4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            cloud-controller-manager                   4.18.0-0.nightly-2025-02-18-114102   True        False         False      132m    
            cloud-credential                           4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            cluster-autoscaler                         4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            config-operator                            4.18.0-0.nightly-2025-02-18-114102   True        False         False      131m    
            console                                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      111m    
            control-plane-machine-set                  4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            csi-snapshot-controller                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      131m    
            dns                                        4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            etcd                                       4.18.0-0.nightly-2025-02-18-114102   True        False         False      129m    
            image-registry                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      102m    
            ingress                                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      114m    
            insights                                   4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            kube-apiserver                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      127m    
            kube-controller-manager                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      128m    
            kube-scheduler                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      127m    
            kube-storage-version-migrator              4.18.0-0.nightly-2025-02-18-114102   True        False         False      131m    
            machine-api                                4.18.0-0.nightly-2025-02-18-114102   True        False         False      116m    
            machine-approver                           4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            machine-config                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            marketplace                                4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            monitoring                                 4.18.0-0.nightly-2025-02-18-114102   True        False         False      113m    
            network                                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      131m    
            node-tuning                                4.18.0-0.nightly-2025-02-18-114102   True        False         False      115m    
            olm                                        4.18.0-0.nightly-2025-02-18-114102   True        False         False      115m    
            openshift-apiserver                        4.18.0-0.nightly-2025-02-18-114102   True        False         False      122m    
            openshift-controller-manager               4.18.0-0.nightly-2025-02-18-114102   True        False         False      127m    
            openshift-samples                          4.18.0-0.nightly-2025-02-18-114102   True        False         False      121m    
            operator-lifecycle-manager                 4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            operator-lifecycle-manager-catalog         4.18.0-0.nightly-2025-02-18-114102   True        False         False      130m    
            operator-lifecycle-manager-packageserver   4.18.0-0.nightly-2025-02-18-114102   True        False         False      122m    
            service-ca                                 4.18.0-0.nightly-2025-02-18-114102   True        False         False      131m    
            storage                                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      131m 
            liuhuali@Lius-MacBook-Pro huali-test % oc get node
            NAME                                      STATUS   ROLES                  AGE     VERSION
            ci-op-0g1t02w6-9c0d9-wt5g7-master-0       Ready    control-plane,master   3h24m   v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-master-1       Ready    control-plane,master   3h24m   v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2       Ready    control-plane,master   3h24m   v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8   Ready    worker                 3h8m    v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd   Ready    worker                 3h8m    v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695   Ready    worker                 3h8m    v1.31.5
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine 
            NAME                                      PHASE     TYPE   REGION    ZONE              AGE
            ci-op-0g1t02w6-9c0d9-wt5g7-master-0       Running   AHV    Unnamed   Development-GPU   3h25m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-1       Running   AHV    Unnamed   Development-GPU   3h25m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2       Running   AHV    Unnamed   Development-GPU   3h25m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8   Running   AHV    Unnamed   Development-GPU   3h22m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd   Running   AHV    Unnamed   Development-GPU   3h22m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695   Running   AHV    Unnamed   Development-GPU   3h22m

            3. Edit the infrastructure cluster object to add failureDomains; the masters are not updated at this point

            liuhuali@Lius-MacBook-Pro huali-test % oc get infrastructure cluster -oyaml
            apiVersion: config.openshift.io/v1
            kind: Infrastructure
            metadata:
              creationTimestamp: "2025-02-19T03:20:58Z"
              generation: 2
              name: cluster
              resourceVersion: "84629"
              uid: e5dcc0f0-44e3-4299-adc9-0ebcc96edbed
            spec:
              cloudConfig:
                key: config
                name: cloud-provider-config
              platformSpec:
                nutanix:
                  failureDomains:
                  - cluster:
                      type: UUID
                      uuid: 0005d9a8-fa6c-56da-c90a-fde303e4d564
                    name: failure-domain-1
                    subnets:
                    - type: UUID
                      uuid: 512c1d6f-c6e7-4746-8ae2-9c3e1db2aba6
                    - type: UUID
                      uuid: a94cb75c-24ff-4ee2-85cf-c2f906ee9fe5
                  - cluster:
                      type: UUID
                      uuid: 00060c83-8946-571b-1aba-2c629cb14c98
                    name: failure-domain-2
                    subnets:
                    - type: UUID
                      uuid: d1b1b617-23de-4a9d-b53f-4b386fc27600
                    - type: UUID
                      uuid: 43e96b2b-5027-469f-8b56-6e8f1b0acc17
                  - cluster:
                      type: UUID
                      uuid: 00060e88-8997-8522-8b58-4c17d4b97414
                    name: failure-domain-3
                    subnets:
                    - type: UUID
                      uuid: 3624b067-61e2-4703-b8bf-3810de5cbac1
                    - type: UUID
                      uuid: 0a949005-15a6-4e81-b1f8-7487e2bd308a
                  prismCentral:
                    address: prismcentral.sts-cluster.nutanix-dev.devcluster.openshift.com
                    port: 9440
                  prismElements:
                  - endpoint:
                      address: 10.0.128.243
                      port: 9440
                    name: Development-GPU
                type: Nutanix
            status:
              apiServerInternalURI: https://api-int.ci-op-0g1t02w6-9c0d9.nutanix-ci.devcluster.openshift.com:6443
              apiServerURL: https://api.ci-op-0g1t02w6-9c0d9.nutanix-ci.devcluster.openshift.com:6443
              controlPlaneTopology: HighlyAvailable
              cpuPartitioning: None
              etcdDiscoveryDomain: ""
              infrastructureName: ci-op-0g1t02w6-9c0d9-wt5g7
              infrastructureTopology: HighlyAvailable
              platform: Nutanix
              platformStatus:
                nutanix:
                  apiServerInternalIP: 10.0.200.14
                  apiServerInternalIPs:
                  - 10.0.200.14
                  ingressIP: 10.0.200.15
                  ingressIPs:
                  - 10.0.200.15
                  loadBalancer:
                    type: OpenShiftManagedDefault
                type: Nutanix 
            
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            NAME                                      PHASE     TYPE   REGION    ZONE              AGE
            ci-op-0g1t02w6-9c0d9-wt5g7-master-0       Running   AHV    Unnamed   Development-GPU   3h30m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-1       Running   AHV    Unnamed   Development-GPU   3h30m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2       Running   AHV    Unnamed   Development-GPU   3h30m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8   Running   AHV    Unnamed   Development-GPU   3h27m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd   Running   AHV    Unnamed   Development-GPU   3h27m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695   Running   AHV    Unnamed   Development-GPU   3h27m
            liuhuali@Lius-MacBook-Pro huali-test % oc get node
            NAME                                      STATUS   ROLES                  AGE     VERSION
            ci-op-0g1t02w6-9c0d9-wt5g7-master-0       Ready    control-plane,master   3h30m   v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-master-1       Ready    control-plane,master   3h30m   v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2       Ready    control-plane,master   3h30m   v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8   Ready    worker                 3h13m   v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd   Ready    worker                 3h13m   v1.31.5
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695   Ready    worker                 3h13m   v1.31.5
            liuhuali@Lius-MacBook-Pro huali-test % oc get co
            NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
            authentication                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h3m    
            baremetal                                  4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h28m   
            cloud-controller-manager                   4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h30m   
            cloud-credential                           4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h28m   
            cluster-autoscaler                         4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h28m   
            config-operator                            4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h28m   
            console                                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h8m    
            control-plane-machine-set                  4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h27m   
            csi-snapshot-controller                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h28m   
            dns                                        4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h27m   
            etcd                                       4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h26m   
            image-registry                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      72m     
            ingress                                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h11m   
            insights                                   4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h27m   
            kube-apiserver                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h25m   
            kube-controller-manager                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h25m   
            kube-scheduler                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h25m   
            kube-storage-version-migrator              4.18.0-0.nightly-2025-02-18-114102   True        False         False      68m     
            machine-api                                4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h13m   
            machine-approver                           4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h28m   
            machine-config                             4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h27m   
            marketplace                                4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h27m   
            monitoring                                 4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h11m   
            network                                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h28m   
            node-tuning                                4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h12m   
            olm                                        4.18.0-0.nightly-2025-02-18-114102   True        False         False      70m     
            openshift-apiserver                        4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h20m   
            openshift-controller-manager               4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h24m   
            openshift-samples                          4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h19m   
            operator-lifecycle-manager                 4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h27m   
            operator-lifecycle-manager-catalog         4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h27m   
            operator-lifecycle-manager-packageserver   4.18.0-0.nightly-2025-02-18-114102   True        False         False      73m     
            service-ca                                 4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h28m   
            storage                                    4.18.0-0.nightly-2025-02-18-114102   True        False         False      3h28m 

            4. Edit the controlplanemachineset cluster object to add failureDomains. The masters start updating, but after some time the cluster becomes unreachable.

            liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset
            controlplanemachineset.machine.openshift.io/cluster edited
            
            spec:
            ...
              template:
                machineType: machines_v1beta1_machine_openshift_io
                machines_v1beta1_machine_openshift_io:
                  failureDomains:
                    platform: Nutanix
                    nutanix:
                    - name: failure-domain-1
                    - name: failure-domain-2
            
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            NAME                                        PHASE          TYPE   REGION    ZONE              AGE
            ci-op-0g1t02w6-9c0d9-wt5g7-master-0         Running        AHV    Unnamed   Development-GPU   3h32m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-1         Running        AHV    Unnamed   Development-GPU   3h32m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2         Running        AHV    Unnamed   Development-GPU   3h32m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-s547v-0   Provisioning                                      4s
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8     Running        AHV    Unnamed   Development-GPU   3h29m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd     Running        AHV    Unnamed   Development-GPU   3h29m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695     Running        AHV    Unnamed   Development-GPU   3h29m
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            NAME                                        PHASE      TYPE   REGION    ZONE              AGE
            ci-op-0g1t02w6-9c0d9-wt5g7-master-0         Deleting   AHV    Unnamed   Development-GPU   3h41m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-1         Running    AHV    Unnamed   Development-GPU   3h41m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2         Running    AHV    Unnamed   Development-GPU   3h41m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-s547v-0   Running    AHV    Unnamed   Development-STS   9m5s
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8     Running    AHV    Unnamed   Development-GPU   3h38m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd     Running    AHV    Unnamed   Development-GPU   3h38m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695     Running    AHV    Unnamed   Development-GPU   3h38m
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            NAME                                        PHASE      TYPE   REGION    ZONE                AGE
            ci-op-0g1t02w6-9c0d9-wt5g7-master-1         Deleting   AHV    Unnamed   Development-GPU     3h50m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2         Running    AHV    Unnamed   Development-GPU     3h50m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-s547v-0   Running    AHV    Unnamed   Development-STS     18m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-vvhgz-1   Running    AHV    Unnamed   Development-zonal   8m38s
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8     Running    AHV    Unnamed   Development-GPU     3h47m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd     Running    AHV    Unnamed   Development-GPU     3h47m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695     Running    AHV    Unnamed   Development-GPU     3h47m 
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            NAME                                        PHASE      TYPE   REGION    ZONE                AGE
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2         Deleting   AHV    Unnamed   Development-GPU     4h19m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-s547v-0   Running    AHV    Unnamed   Development-STS     47m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-vvhgz-1   Running    AHV    Unnamed   Development-zonal   37m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-w7jxr-2   Running    AHV    Unnamed   Development-STS     28m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8     Running    AHV    Unnamed   Development-GPU     4h16m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd     Running    AHV    Unnamed   Development-GPU     4h16m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695     Running    AHV    Unnamed   Development-GPU     4h16m
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            NAME                                        PHASE      TYPE   REGION    ZONE                AGE
            ci-op-0g1t02w6-9c0d9-wt5g7-master-2         Deleting   AHV    Unnamed   Development-GPU     5h10m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-s547v-0   Running    AHV    Unnamed   Development-STS     98m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-vvhgz-1   Running    AHV    Unnamed   Development-zonal   89m
            ci-op-0g1t02w6-9c0d9-wt5g7-master-w7jxr-2   Running    AHV    Unnamed   Development-STS     79m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-5lgp8     Running    AHV    Unnamed   Development-GPU     5h7m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-bpjnd     Running    AHV    Unnamed   Development-GPU     5h7m
            ci-op-0g1t02w6-9c0d9-wt5g7-worker-c6695     Running    AHV    Unnamed   Development-GPU     5h7m
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            Unable to connect to the server: net/http: TLS handshake timeout
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            Unable to connect to the server: net/http: TLS handshake timeout
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            Unable to connect to the server: net/http: TLS handshake timeout
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            Unable to connect to the server: net/http: TLS handshake timeout
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            Unable to connect to the server: net/http: TLS handshake timeout
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            Unable to connect to the server: net/http: TLS handshake timeout
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            Unable to connect to the server: net/http: TLS handshake timeout
            liuhuali@Lius-MacBook-Pro huali-test % 
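
The repeated `oc get machine` checks above can be condensed into a small filter. This is a minimal sketch and not part of the original report: `masters_deleting` is a hypothetical helper name. It assumes the column layout shown in the output above (NAME first, PHASE second) and that control plane machine names contain `-master-`.

```shell
# Hypothetical helper: list control plane machines that are in the Deleting phase.
# Reads `oc get machine` table output on stdin.
# Assumes NAME is column 1, PHASE is column 2, and master names contain "-master-".
masters_deleting() {
  awk 'NR > 1 && $1 ~ /-master-/ && $2 == "Deleting" { print $1 }'
}

# Usage against a live cluster (needs a reachable API server):
#   oc get machine | masters_deleting
```

If the same machine name is printed across checks that are far apart in time, as with master-2 in the session above, the rollout is probably stuck rather than slow. Once the API server stops answering, the filter prints nothing, so the error from `oc` itself still has to be checked.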

             


            Huali Liu added a comment - - edited

            Tried this on a multi-zone cluster on 4.19 today and still hit the issue.

            Steps: install a multi-zone Nutanix cluster (only one subnet for each failure domain); enable the feature gate and wait for the cluster to be ready; edit the infrastructure cluster object to add the second subnet for each failure domain (here I added the second subnet in front of the original subnet). The masters then start a RollingUpdate but get stuck on master-2, as shown here.
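
Because the position of the new subnet matters in this case, it can help to print the subnet order of each failure domain before and after the edit. This is a minimal sketch: `subnet_order` is a hypothetical helper name, and the field paths are taken from the Infrastructure object shown earlier in this issue.

```shell
# Hypothetical helper: print each failure domain name followed by its subnet
# UUIDs, in the order they appear in the Infrastructure cluster object.
subnet_order() {
  oc get infrastructure cluster \
    -o jsonpath='{range .spec.platformSpec.nutanix.failureDomains[*]}{.name}{": "}{.subnets[*].uuid}{"\n"}{end}'
}

# Usage (needs a reachable API server):
#   subnet_order
```

Each output line lists one failure domain with its subnets from first to last, so a subnet that was added in front of the original one appears first on its line.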

            liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
            NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.19.0-0.nightly-2025-02-14-215306   True        False         67m     Cluster version is 4.19.0-0.nightly-2025-02-14-215306
            liuhuali@Lius-MacBook-Pro huali-test % oc get co
            NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
            authentication                             4.19.0-0.nightly-2025-02-14-215306   True        False         False      6h52m   
            baremetal                                  4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            cloud-controller-manager                   4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h22m   
            cloud-credential                           4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            cluster-api                                4.19.0-0.nightly-2025-02-14-215306   True        False         False      5h17m   
            cluster-autoscaler                         4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            config-operator                            4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            console                                    4.19.0-0.nightly-2025-02-14-215306   True        False         False      5h8m    
            control-plane-machine-set                  4.19.0-0.nightly-2025-02-14-215306   True        True          False      7h16m   Waiting for 1 old replica(s) to be removed
            csi-snapshot-controller                    4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            dns                                        4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h19m   
            etcd                                       4.19.0-0.nightly-2025-02-14-215306   True        True          True       7h16m   GuardControllerDegraded: Missing operand on node ci-op-xv7rf1gn-b88b1-q9p7x-master-qm7wk-2
            image-registry                             4.19.0-0.nightly-2025-02-14-215306   True        False         False      5h15m   
            ingress                                    4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h7m    
            insights                                   4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            kube-apiserver                             4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h16m   
            kube-controller-manager                    4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h16m   
            kube-scheduler                             4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h18m   
            kube-storage-version-migrator              4.19.0-0.nightly-2025-02-14-215306   True        False         False      5h12m   
            machine-api                                4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h6m    
            machine-approver                           4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            machine-config                             4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h19m   
            marketplace                                4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            monitoring                                 4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h5m    
            network                                    4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            node-tuning                                4.19.0-0.nightly-2025-02-14-215306   True        False         False      3h10m   
            olm                                        4.19.0-0.nightly-2025-02-14-215306   True        False         False      5h4m    
            openshift-apiserver                        4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h      
            openshift-controller-manager               4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h16m   
            openshift-samples                          4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h9m    
            operator-lifecycle-manager                 4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h19m   
            operator-lifecycle-manager-catalog         4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h19m   
            operator-lifecycle-manager-packageserver   4.19.0-0.nightly-2025-02-14-215306   True        False         False      5h10m   
            service-ca                                 4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            storage                                    4.19.0-0.nightly-2025-02-14-215306   True        False         False      7h20m   
            liuhuali@Lius-MacBook-Pro huali-test % oc get machine
            NAME                                        PHASE      TYPE   REGION    ZONE                 AGE
            ci-op-xv7rf1gn-b88b1-q9p7x-master-2         Deleting   AHV    Unnamed   Development-zonal    7h23m
            ci-op-xv7rf1gn-b88b1-q9p7x-master-llcv5-0   Running    AHV    Unnamed   Development-zonal    4h45m
            ci-op-xv7rf1gn-b88b1-q9p7x-master-qm7wk-2   Running    AHV    Unnamed   Development-zonal    3h12m
            ci-op-xv7rf1gn-b88b1-q9p7x-master-xl7gn-1   Running    AHV    Unnamed   Development-zonal2   4h34m
            ci-op-xv7rf1gn-b88b1-q9p7x-worker-0-hvrkk   Running    AHV    Unnamed   Development-STS      7h20m
            ci-op-xv7rf1gn-b88b1-q9p7x-worker-0-psg5d   Running    AHV    Unnamed   Development-STS      7h20m
            ci-op-xv7rf1gn-b88b1-q9p7x-worker-1-95qbf   Running    AHV    Unnamed   Development-zonal    7h20m
            liuhuali@Lius-MacBook-Pro huali-test %  
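
A minimal sketch, not part of the original report, for picking the unhealthy rows out of a pasted `oc get co` table like the one above. The helper name `unhealthy_operators` and the column layout (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE, MESSAGE) are assumptions based only on the output shown here. Rows that do not look like operator rows (shell prompts, `oc get machine` rows) are skipped.

```python
# Hypothetical helper: filters pasted `oc get co` text, assuming the column
# order shown in the output above.
VALID_STATES = {"True", "False", "Unknown"}


def unhealthy_operators(table: str) -> list[tuple[str, str]]:
    """Return (name, message) for operators that are not Available=True,
    Progressing=False, Degraded=False."""
    results = []
    for line in table.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and rows too short to be data.
        if len(fields) < 6 or fields[0] == "NAME":
            continue
        name = fields[0]
        available, progressing, degraded = fields[2:5]
        # Skip rows whose status columns are not condition values
        # (for example shell prompts or machine rows).
        if not {available, progressing, degraded} <= VALID_STATES:
            continue
        if (available, progressing, degraded) != ("True", "False", "False"):
            results.append((name, " ".join(fields[6:])))
    return results


sample = """\
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
control-plane-machine-set   4.19.0-0.nightly-2025-02-14-215306   True   True   False   7h16m   Waiting for 1 old replica(s) to be removed
etcd   4.19.0-0.nightly-2025-02-14-215306   True   True   True   7h16m   GuardControllerDegraded: Missing operand on node ci-op-xv7rf1gn-b88b1-q9p7x-master-qm7wk-2
kube-apiserver   4.19.0-0.nightly-2025-02-14-215306   True   False   False   7h16m
"""

for name, message in unhealthy_operators(sample):
    print(f"{name}: {message}")
```

On the output above, this would single out `control-plane-machine-set` and `etcd`, the two operators whose rows carry a message.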

            Huali Liu added a comment -

            Yes, this occurs on day 2 only. sgaoshang tested day 0 and saw no issue.

            Yang Yang added a comment -

            Hi huliu@redhat.com, does this occur on day 2 only? Does the scenario work on day 0?

            OpenShift Jira Bot added a comment -

            Moved to Proposed

            Huali Liu added a comment -

            By the way, I set the Severity to Critical because the issue is very destructive to the cluster. Based on previous experience with similar bugs ( https://issues.redhat.com/browse/OCPBUGS-5306 and https://issues.redhat.com/browse/OCPBUGS-11025 ), I don't think this will be a blocker, but I respect the dev's opinion. Thanks!

              yanhli@redhat.com Yanhua Li
              huliu@redhat.com Huali Liu
              Huali Liu Huali Liu