OCPBUGS-50904

Cluster gets stuck or becomes unreachable when adding a second subnet to the controlplanemachineset on Nutanix

    • Type: Bug
    • Resolution: Unresolved
    • 4.18, 4.19
    • Critical
      * The following known issues with configuring multiple subnets exist in {product-title} version 4.18:
      +
      --
      ** Adding subnets above the existing subnet in the `subnets` stanza causes a control plane node to become stuck in the `SchedulingDisabled` state.
      As a workaround, only add subnets below the existing subnet in the `subnets` stanza.

      ** Sometimes, after adding a subnet, the updated control plane machines appear in the Nutanix console but the {product-title} cluster is unreachable.
      There is no workaround for this issue.
      --
      +
      (link:https://issues.redhat.com/browse/OCPBUGS-50904[*OCPBUGS-50904*])
    • Known Issue
    • Proposed

      Description of problem:

      Case 1: When the new subnet is added before the original subnet in the controlplanemachineset, the cluster gets stuck.
      Case 2: When the new subnet is added after the original subnet in the controlplanemachineset, the RollingUpdate sometimes succeeds, but sometimes the cluster becomes unreachable.

      Version-Release number of selected component (if applicable):

          4.18.0-0.nightly-2025-02-14-222249

      How reproducible:

         100% for case 1, 50% for case 2 in my testing

      Steps to Reproduce:

          1. Install a 4.18 cluster on Nutanix.
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.18.0-0.nightly-2025-02-14-222249   True        False         40m     Cluster version is 4.18.0-0.nightly-2025-02-14-222249
      liuhuali@Lius-MacBook-Pro huali-test % oc get infrastructure cluster -oyaml
      apiVersion: config.openshift.io/v1
      kind: Infrastructure
      metadata:
        creationTimestamp: "2025-02-17T00:40:20Z"
        generation: 1
        name: cluster
        resourceVersion: "519"
        uid: d0cafa11-dcdf-4f36-ba5b-2a5b0db2e6b8
      spec:
        cloudConfig:
          key: config
          name: cloud-provider-config
        platformSpec:
          nutanix:
            failureDomains: []
            prismCentral:
              address: prismcentral.lts-cluster.nutanix-dev.devcluster.openshift.com
              port: 9440
            prismElements:
            - endpoint:
                address: 10.0.128.159
                port: 9440
              name: Development-LTS
          type: Nutanix
      status:
        apiServerInternalURI: https://api-int.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
        apiServerURL: https://api.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
        controlPlaneTopology: HighlyAvailable
        cpuPartitioning: None
        etcdDiscoveryDomain: ""
        infrastructureName: ci-op-37d7j87w-590c2-8vq5j
        infrastructureTopology: HighlyAvailable
        platform: Nutanix
        platformStatus:
          nutanix:
            apiServerInternalIP: 10.0.130.10
            apiServerInternalIPs:
            - 10.0.130.10
            ingressIP: 10.0.130.11
            ingressIPs:
            - 10.0.130.11
            loadBalancer:
              type: OpenShiftManagedDefault
          type: Nutanix
      
          2. Add a second subnet to the controlplanemachineset.
      For case 1, add the new subnet before the original subnet in the controlplanemachineset.
      before adding:
      
                  subnets:
                  - type: uuid
                    uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
      after adding:
      
                  subnets:
                  - type: uuid
                    uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
                  - type: uuid
                    uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
      
      For case 2, add the new subnet after the original subnet in the controlplanemachineset.
      before adding:
      
                  subnets:
                  - type: uuid
                    uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
      after adding:
      
                  subnets:
                  - type: uuid
                    uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
                  - type: uuid
                    uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
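      The case 2 edit (appending the new subnet after the existing one) can also be expressed as a JSON Patch instead of an interactive `oc edit`. The following is a minimal sketch, not part of the original report. It assumes the controlplanemachineset is named `cluster` in the `openshift-machine-api` namespace and that the subnets list lives under the `machines_v1beta1_machine_openshift_io` template path; the helper name `build_append_subnet_patch` is hypothetical. Appending this way only fixes the ordering, so it does not avoid the intermittent unreachable cluster described in case 2.

```shell
# Build a JSON Patch (RFC 6902) that appends one subnet to the END of the
# providerSpec subnets list. The trailing "/-" means "append to the array",
# so the existing subnet keeps its position as the first entry.
# NOTE: the path below is an assumption about the controlplanemachineset layout.
build_append_subnet_patch() {
  subnet_uuid=$1
  printf '[{"op":"add","path":"%s/-","value":{"type":"uuid","uuid":"%s"}}]\n' \
    "/spec/template/machines_v1beta1_machine_openshift_io/spec/providerSpec/value/subnets" \
    "$subnet_uuid"
}

# Example invocation (not executed here; resource name and namespace are assumptions):
#   oc -n openshift-machine-api patch controlplanemachineset cluster --type=json \
#     -p "$(build_append_subnet_patch efe26e93-f6cf-4d89-8104-009e85201fa8)"
```

      Printing the patch first makes it easy to review the exact change before it triggers a control plane rollout.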
      
          3. For case 1, one old master gets stuck (in my testing it was sometimes master-0, sometimes master-1, and sometimes master-2).
          
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                        PHASE      TYPE   REGION    ZONE              AGE
      ci-op-37d7j87w-590c2-8vq5j-master-1         Deleting   AHV    Unnamed   Development-LTS   3h53m
      ci-op-37d7j87w-590c2-8vq5j-master-2         Running    AHV    Unnamed   Development-LTS   3h53m
      ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Running    AHV    Unnamed   Development-LTS   166m
      ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Running    AHV    Unnamed   Development-LTS   156m
      ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Running    AHV    Unnamed   Development-LTS   3h50m
      ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Running    AHV    Unnamed   Development-LTS   3h50m
      ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Running    AHV    Unnamed   Development-LTS   3h50m
      liuhuali@Lius-MacBook-Pro huali-test % oc get node
      NAME                                        STATUS                     ROLES                  AGE     VERSION
      ci-op-37d7j87w-590c2-8vq5j-master-1         Ready,SchedulingDisabled   control-plane,master   3h53m   v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-master-2         Ready                      control-plane,master   3h53m   v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Ready                      control-plane,master   164m    v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Ready                      control-plane,master   154m    v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Ready                      worker                 3h37m   v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Ready                      worker                 3h37m   v1.31.5
      ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Ready                      worker                 3h37m   v1.31.5
      liuhuali@Lius-MacBook-Pro huali-test % oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h28m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
      baremetal                                  4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      cloud-controller-manager                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h52m   
      cloud-credential                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      cluster-autoscaler                         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      config-operator                            4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      console                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h34m   
      control-plane-machine-set                  4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h46m   Observed 1 replica(s) in need of update
      csi-snapshot-controller                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      dns                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      etcd                                       4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h48m   NodeInstallerProgressing: 2 nodes are at revision 8; 1 node is at revision 10; 1 node is at revision 15; 0 nodes have achieved new revision 17
      image-registry                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h19m   
      ingress                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h35m   
      insights                                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      kube-apiserver                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h46m   GuardControllerDegraded: Missing operand on node ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1
      kube-controller-manager                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h46m   
      kube-scheduler                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h48m   
      kube-storage-version-migrator              4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m    
      machine-api                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h37m   
      machine-approver                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      machine-config                             4.18.0-0.nightly-2025-02-14-222249   True        False         True       3h50m   Failed to resync 4.18.0-0.nightly-2025-02-14-222249 because: error during syncRequiredMachineConfigPools: [context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 4, ready 3, updated: 4, unavailable: 1, degraded: 0)]
      marketplace                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      monitoring                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h33m   
      network                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      node-tuning                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      154m    
      olm                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m    
      openshift-apiserver                        4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h35m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
      openshift-controller-manager               4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m   
      openshift-samples                          4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m   
      operator-lifecycle-manager                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      operator-lifecycle-manager-catalog         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
      operator-lifecycle-manager-packageserver   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m   
      service-ca                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      storage                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
      liuhuali@Lius-MacBook-Pro huali-test % 
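      To spot the stuck master quickly in the case 1 output above, the `oc get node` table can be filtered for the `SchedulingDisabled` status. This is a minimal sketch; the function name `cordoned_nodes` is hypothetical, and it assumes the default table layout shown above, where STATUS is the second column.

```shell
# Print the names of nodes whose STATUS column contains SchedulingDisabled.
# Reads the default `oc get node` table on stdin and skips the header row.
cordoned_nodes() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Example invocation (not executed here):
#   oc get node | cordoned_nodes
```

      Run against the node listing above, this would print only `ci-op-37d7j87w-590c2-8vq5j-master-1`, the node that stays cordoned while its machine is in the `Deleting` phase.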
      
      For case 2, I am unable to connect to the cluster, but I can see on the Nutanix console that the masters are being replaced by new masters through the RollingUpdate: https://drive.google.com/file/d/1-UbFiUiyhmeBVTBAVaB23jthiZI0VjAm/view?usp=sharing 
      
      liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset
      controlplanemachineset.machine.openshift.io/cluster edited
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                        PHASE          TYPE   REGION    ZONE              AGE
      ci-op-0pdvmm2s-f3468-7khf5-master-0         Running        AHV    Unnamed   Development-LTS   71m
      ci-op-0pdvmm2s-f3468-7khf5-master-1         Running        AHV    Unnamed   Development-LTS   71m
      ci-op-0pdvmm2s-f3468-7khf5-master-2         Running        AHV    Unnamed   Development-LTS   71m
      ci-op-0pdvmm2s-f3468-7khf5-master-qmj72-0   Provisioning                                      5s
      ci-op-0pdvmm2s-f3468-7khf5-worker-fbj48     Running        AHV    Unnamed   Development-LTS   68m
      ci-op-0pdvmm2s-f3468-7khf5-worker-pv8jw     Running        AHV    Unnamed   Development-LTS   68m
      ci-op-0pdvmm2s-f3468-7khf5-worker-xpwrf     Running        AHV    Unnamed   Development-LTS   68m
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine                        
      Unable to connect to the server: net/http: TLS handshake timeout
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      Unable to connect to the server: EOF
      liuhuali@Lius-MacBook-Pro huali-test % 
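      The repeated manual `oc get machine` attempts above can be replaced by a small retry loop when reproducing case 2, which makes it easier to tell a brief API interruption during the rollout from a cluster that never comes back. This is a minimal sketch; the `retry` helper is hypothetical and the attempt count and delay are arbitrary choices, not values from the original report.

```shell
# Retry a command until it succeeds or the attempts run out.
# usage: retry <attempts> <delay_seconds> <command> [args...]
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}

# Example invocation (not executed here): poll for up to about 5 minutes.
#   retry 30 10 oc get machine --request-timeout=10s
```

      If the loop exhausts its attempts while the Nutanix console shows the new masters running, that matches the unreachable state reported for case 2.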

      Actual results:

          The cluster gets stuck or becomes unreachable.

      Expected results:

          The RollingUpdate completes successfully and the cluster remains reachable.

      Additional info:

          must-gather for case 1: https://drive.google.com/file/d/1ZeN_5bnCYbOFuCihv1zIt3Y26rmNynBw/view?usp=sharing

              yanhli@redhat.com Yanhua Li
              huliu@redhat.com Huali Liu
              Huali Liu Huali Liu