[OCPBUGS-50904] The cluster stuck or unable to connect when adding second subnet in controlplanemachineset on Nutanix - Red Hat Issue Tracker

Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.18, 4.19
Component/s: Cloud Compute / Nutanix Provider
Labels:
None

Severity:
Critical
Regression:
None
Blocked:
False
Blocked Reason:

None
Release Note Text:

* The following known issues with configuring multiple subnets exist in {product-title} version 4.18:
+
--
** Adding subnets above the existing subnet in the `subnets` stanza causes a control plane node to become stuck in the `SchedulingDisabled` state.
As a workaround, only add subnets below the existing subnet in the `subnets` stanza.

** Sometimes, after adding a subnet, the updated control plane machines appear in the Nutanix console but the {product-title} cluster is unreachable.
There is no workaround for this issue.
--
+
(link:https://issues.redhat.com/browse/OCPBUGS-50904[*OCPBUGS-50904*])

Release Note Type:
Known Issue
Release Note Status:
Proposed

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

Case 1: Adding the new subnet before the original subnet in the controlplanemachineset causes the cluster to get stuck.
Case 2: Adding the new subnet after the original subnet in the controlplanemachineset sometimes results in a successful RollingUpdate, but sometimes leaves the cluster unreachable.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2025-02-14-222249

How reproducible:

   100% for case 1, 50% for case 2 in my testing

Steps to Reproduce:

    1. Install a 4.18 cluster on Nutanix.
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.nightly-2025-02-14-222249   True        False         40m     Cluster version is 4.18.0-0.nightly-2025-02-14-222249
liuhuali@Lius-MacBook-Pro huali-test % oc get infrastructure cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2025-02-17T00:40:20Z"
  generation: 1
  name: cluster
  resourceVersion: "519"
  uid: d0cafa11-dcdf-4f36-ba5b-2a5b0db2e6b8
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    nutanix:
      failureDomains: []
      prismCentral:
        address: prismcentral.lts-cluster.nutanix-dev.devcluster.openshift.com
        port: 9440
      prismElements:
      - endpoint:
          address: 10.0.128.159
          port: 9440
        name: Development-LTS
    type: Nutanix
status:
  apiServerInternalURI: https://api-int.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
  apiServerURL: https://api.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: ci-op-37d7j87w-590c2-8vq5j
  infrastructureTopology: HighlyAvailable
  platform: Nutanix
  platformStatus:
    nutanix:
      apiServerInternalIP: 10.0.130.10
      apiServerInternalIPs:
      - 10.0.130.10
      ingressIP: 10.0.130.11
      ingressIPs:
      - 10.0.130.11
      loadBalancer:
        type: OpenShiftManagedDefault
    type: Nutanix

    2. Add a second subnet in the controlplanemachineset.
For case 1, add the new subnet before the original subnet in the controlplanemachineset.
before adding:

            subnets:
            - type: uuid
              uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
after adding:

            subnets:
            - type: uuid
              uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
            - type: uuid
              uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1

For case 2, add the new subnet after the original subnet in the controlplanemachineset.
before adding:

            subnets:
            - type: uuid
              uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
after adding:

            subnets:
            - type: uuid
              uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
            - type: uuid
              uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
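The case 2 ordering above, which is also the workaround proposed in the release note text, can be sketched in Python as follows. This is only an illustration under stated assumptions: `append_subnet` is a hypothetical helper, and it works on a plain list of subnet entries rather than on a live controlplanemachineset object.

```python
from typing import Dict, List

Subnet = Dict[str, str]


def append_subnet(subnets: List[Subnet], new_uuid: str) -> List[Subnet]:
    """Return a copy of `subnets` with the new subnet added at the end.

    Existing entries keep their positions, which matches the case 2
    ordering. Putting the new subnet first is the case 1 ordering.
    (Hypothetical helper, not part of any OpenShift tooling.)
    """
    if any(s.get("uuid") == new_uuid for s in subnets):
        raise ValueError(f"subnet {new_uuid} is already listed")
    return [dict(s) for s in subnets] + [{"type": "uuid", "uuid": new_uuid}]


# UUIDs taken from the report above.
original = [{"type": "uuid", "uuid": "ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1"}]
updated = append_subnet(original, "efe26e93-f6cf-4d89-8104-009e85201fa8")
print([s["uuid"] for s in updated])
```

Note that, per this report, the appended ordering avoids the case 1 hang but can still leave the cluster unreachable.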

    3. For case 1, one old master gets stuck (in my testing it was sometimes master-0, sometimes master-1, and sometimes master-2).
    
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE      TYPE   REGION    ZONE              AGE
ci-op-37d7j87w-590c2-8vq5j-master-1         Deleting   AHV    Unnamed   Development-LTS   3h53m
ci-op-37d7j87w-590c2-8vq5j-master-2         Running    AHV    Unnamed   Development-LTS   3h53m
ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Running    AHV    Unnamed   Development-LTS   166m
ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Running    AHV    Unnamed   Development-LTS   156m
ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Running    AHV    Unnamed   Development-LTS   3h50m
ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Running    AHV    Unnamed   Development-LTS   3h50m
ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Running    AHV    Unnamed   Development-LTS   3h50m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                        STATUS                     ROLES                  AGE     VERSION
ci-op-37d7j87w-590c2-8vq5j-master-1         Ready,SchedulingDisabled   control-plane,master   3h53m   v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-2         Ready                      control-plane,master   3h53m   v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Ready                      control-plane,master   164m    v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Ready                      control-plane,master   154m    v1.31.5
ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Ready                      worker                 3h37m   v1.31.5
ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Ready                      worker                 3h37m   v1.31.5
ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Ready                      worker                 3h37m   v1.31.5
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h28m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
cloud-controller-manager                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h52m   
cloud-credential                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
cluster-autoscaler                         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
config-operator                            4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
console                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h34m   
control-plane-machine-set                  4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h46m   Observed 1 replica(s) in need of update
csi-snapshot-controller                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
dns                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
etcd                                       4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h48m   NodeInstallerProgressing: 2 nodes are at revision 8; 1 node is at revision 10; 1 node is at revision 15; 0 nodes have achieved new revision 17
image-registry                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h19m   
ingress                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h35m   
insights                                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
kube-apiserver                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h46m   GuardControllerDegraded: Missing operand on node ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1
kube-controller-manager                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h46m   
kube-scheduler                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h48m   
kube-storage-version-migrator              4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m    
machine-api                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h37m   
machine-approver                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
machine-config                             4.18.0-0.nightly-2025-02-14-222249   True        False         True       3h50m   Failed to resync 4.18.0-0.nightly-2025-02-14-222249 because: error during syncRequiredMachineConfigPools: [context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 4, ready 3, updated: 4, unavailable: 1, degraded: 0)]
marketplace                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
monitoring                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h33m   
network                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
node-tuning                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      154m    
olm                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m    
openshift-apiserver                        4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h35m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m   
openshift-samples                          4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m   
operator-lifecycle-manager                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
operator-lifecycle-manager-catalog         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m   
operator-lifecycle-manager-packageserver   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m   
service-ca                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
storage                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m   
liuhuali@Lius-MacBook-Pro huali-test % 
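The stuck state shown above can be picked out of the command output with a small script. The following is a minimal sketch, assuming the default column layouts of `oc get node` and `oc get co` as they appear in this report; `cordoned_nodes` and `degraded_operators` are hypothetical helper names, and the sample text is copied from the output above.

```python
from typing import Dict, List


def cordoned_nodes(oc_get_node_output: str) -> List[str]:
    """Names of nodes whose STATUS column includes SchedulingDisabled."""
    names = []
    for line in oc_get_node_output.strip().splitlines()[1:]:  # skip header
        fields = line.split()
        if len(fields) >= 2 and "SchedulingDisabled" in fields[1].split(","):
            names.append(fields[0])
    return names


def degraded_operators(oc_get_co_output: str) -> Dict[str, str]:
    """Map each operator with DEGRADED=True to its MESSAGE (may be empty)."""
    result = {}
    for line in oc_get_co_output.strip().splitlines()[1:]:  # skip header
        # Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
        fields = line.split(None, 6)
        if len(fields) >= 6 and fields[4] == "True":
            result[fields[0]] = fields[6].strip() if len(fields) > 6 else ""
    return result


NODES = """\
NAME                                        STATUS                     ROLES                  AGE     VERSION
ci-op-37d7j87w-590c2-8vq5j-master-1         Ready,SchedulingDisabled   control-plane,master   3h53m   v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-2         Ready                      control-plane,master   3h53m   v1.31.5
"""

OPERATORS = """\
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
baremetal                                  4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
kube-apiserver                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h46m   GuardControllerDegraded: Missing operand on node ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1
"""

print(cordoned_nodes(NODES))
print(degraded_operators(OPERATORS))
```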

For case 2, I am unable to connect to the cluster, but I can see on the Nutanix console that the masters are being replaced by new masters through the RollingUpdate: https://drive.google.com/file/d/1-UbFiUiyhmeBVTBAVaB23jthiZI0VjAm/view?usp=sharing

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE          TYPE   REGION    ZONE              AGE
ci-op-0pdvmm2s-f3468-7khf5-master-0         Running        AHV    Unnamed   Development-LTS   71m
ci-op-0pdvmm2s-f3468-7khf5-master-1         Running        AHV    Unnamed   Development-LTS   71m
ci-op-0pdvmm2s-f3468-7khf5-master-2         Running        AHV    Unnamed   Development-LTS   71m
ci-op-0pdvmm2s-f3468-7khf5-master-qmj72-0   Provisioning                                      5s
ci-op-0pdvmm2s-f3468-7khf5-worker-fbj48     Running        AHV    Unnamed   Development-LTS   68m
ci-op-0pdvmm2s-f3468-7khf5-worker-pv8jw     Running        AHV    Unnamed   Development-LTS   68m
ci-op-0pdvmm2s-f3468-7khf5-worker-xpwrf     Running        AHV    Unnamed   Development-LTS   68m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine                        
Unable to connect to the server: net/http: TLS handshake timeout
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test %
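Because case 2 fails only part of the time, a reproduction script may want to poll the API for a bounded period instead of retrying by hand as above. The sketch below is one possible approach, not something used in this report: `wait_until_reachable` and `oc_probe` are hypothetical names, and the attempt count and delay are arbitrary assumptions.

```python
import subprocess
import time
from typing import Callable


def wait_until_reachable(probe: Callable[[], bool],
                         attempts: int = 30,
                         delay_seconds: float = 10.0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` up to `attempts` times; return True on the first success."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:  # no need to wait after the last try
            sleep(delay_seconds)
    return False


def oc_probe() -> bool:
    """Hypothetical probe: True if `oc get machine` exits with status 0."""
    try:
        done = subprocess.run(["oc", "get", "machine"],
                              capture_output=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return done.returncode == 0


# Usage against a real cluster (not executed here):
#     reachable = wait_until_reachable(oc_probe)
```

Passing the probe in as a callable keeps the polling logic independent of `oc`, so it can be exercised without a cluster.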

Actual results:

    The cluster gets stuck or becomes unreachable.

Expected results:

    The RollingUpdate completes successfully and the cluster remains reachable.

Additional info:

    must-gather for case 1: https://drive.google.com/file/d/1ZeN_5bnCYbOFuCihv1zIt3Y26rmNynBw/view?usp=sharing

Assignee: Yanhua Li

Reporter: Huali Liu

QA Contact: Huali Liu

Votes: 0

Watchers: 4

Created: 2025/02/17 4:50 AM

Updated: 2025/02/19 10:08 AM
