-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.18, 4.19
-
None
-
Critical
-
None
-
False
-
-
-
Known Issue
-
Proposed
Description of problem:
Case 1: Adding the new subnet in front of the original subnet in the controlplanemachineset causes the cluster to get stuck.
Case 2: Adding the new subnet after the original subnet in the controlplanemachineset sometimes lets the RollingUpdate finish successfully, but sometimes leaves the cluster unreachable.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2025-02-14-222249
How reproducible:
100% for case 1, 50% for case 2 in my testing
Steps to Reproduce:
1. Install a 4.18 cluster on Nutanix.

    liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.18.0-0.nightly-2025-02-14-222249   True        False         40m     Cluster version is 4.18.0-0.nightly-2025-02-14-222249
    liuhuali@Lius-MacBook-Pro huali-test % oc get infrastructure cluster -oyaml
    apiVersion: config.openshift.io/v1
    kind: Infrastructure
    metadata:
      creationTimestamp: "2025-02-17T00:40:20Z"
      generation: 1
      name: cluster
      resourceVersion: "519"
      uid: d0cafa11-dcdf-4f36-ba5b-2a5b0db2e6b8
    spec:
      cloudConfig:
        key: config
        name: cloud-provider-config
      platformSpec:
        nutanix:
          failureDomains: []
          prismCentral:
            address: prismcentral.lts-cluster.nutanix-dev.devcluster.openshift.com
            port: 9440
          prismElements:
          - endpoint:
              address: 10.0.128.159
              port: 9440
            name: Development-LTS
        type: Nutanix
    status:
      apiServerInternalURI: https://api-int.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
      apiServerURL: https://api.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
      controlPlaneTopology: HighlyAvailable
      cpuPartitioning: None
      etcdDiscoveryDomain: ""
      infrastructureName: ci-op-37d7j87w-590c2-8vq5j
      infrastructureTopology: HighlyAvailable
      platform: Nutanix
      platformStatus:
        nutanix:
          apiServerInternalIP: 10.0.130.10
          apiServerInternalIPs:
          - 10.0.130.10
          ingressIP: 10.0.130.11
          ingressIPs:
          - 10.0.130.11
          loadBalancer:
            type: OpenShiftManagedDefault
        type: Nutanix

2. Add a second subnet in the controlplanemachineset.

For case 1, add the new subnet in front of the original subnet in the controlplanemachineset.

before adding:

    subnets:
    - type: uuid
      uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1

after adding:

    subnets:
    - type: uuid
      uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
    - type: uuid
      uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1

For case 2, add the new subnet after the original subnet in the controlplanemachineset.

before adding:

    subnets:
    - type: uuid
      uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1

after adding:

    subnets:
    - type: uuid
      uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
    - type: uuid
      uuid: efe26e93-f6cf-4d89-8104-009e85201fa8

3. Check the result.

For case 1, one old master gets stuck (in my testing it was sometimes master-0, sometimes master-1, and sometimes master-2).

    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    NAME                                        PHASE      TYPE   REGION    ZONE              AGE
    ci-op-37d7j87w-590c2-8vq5j-master-1         Deleting   AHV    Unnamed   Development-LTS   3h53m
    ci-op-37d7j87w-590c2-8vq5j-master-2         Running    AHV    Unnamed   Development-LTS   3h53m
    ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Running    AHV    Unnamed   Development-LTS   166m
    ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Running    AHV    Unnamed   Development-LTS   156m
    ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Running    AHV    Unnamed   Development-LTS   3h50m
    ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Running    AHV    Unnamed   Development-LTS   3h50m
    ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Running    AHV    Unnamed   Development-LTS   3h50m
    liuhuali@Lius-MacBook-Pro huali-test % oc get node
    NAME                                        STATUS                     ROLES                  AGE     VERSION
    ci-op-37d7j87w-590c2-8vq5j-master-1         Ready,SchedulingDisabled   control-plane,master   3h53m   v1.31.5
    ci-op-37d7j87w-590c2-8vq5j-master-2         Ready                      control-plane,master   3h53m   v1.31.5
    ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Ready                      control-plane,master   164m    v1.31.5
    ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Ready                      control-plane,master   154m    v1.31.5
    ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Ready                      worker                 3h37m   v1.31.5
    ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Ready                      worker                 3h37m   v1.31.5
    ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Ready                      worker                 3h37m   v1.31.5
    liuhuali@Lius-MacBook-Pro huali-test % oc get co
    NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h28m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
    baremetal                                  4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
    cloud-controller-manager                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h52m
    cloud-credential                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
    cluster-autoscaler                         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
    config-operator                            4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
    console                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h34m
    control-plane-machine-set                  4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h46m   Observed 1 replica(s) in need of update
    csi-snapshot-controller                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
    dns                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
    etcd                                       4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h48m   NodeInstallerProgressing: 2 nodes are at revision 8; 1 node is at revision 10; 1 node is at revision 15; 0 nodes have achieved new revision 17
    image-registry                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h19m
    ingress                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h35m
    insights                                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
    kube-apiserver                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h46m   GuardControllerDegraded: Missing operand on node ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1
    kube-controller-manager                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h46m
    kube-scheduler                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h48m
    kube-storage-version-migrator              4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m
    machine-api                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h37m
    machine-approver                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
    machine-config                             4.18.0-0.nightly-2025-02-14-222249   True        False         True       3h50m   Failed to resync 4.18.0-0.nightly-2025-02-14-222249 because: error during syncRequiredMachineConfigPools: [context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 4, ready 3, updated: 4, unavailable: 1, degraded: 0)]
    marketplace                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
    monitoring                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h33m
    network                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
    node-tuning                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      154m
    olm                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m
    openshift-apiserver                        4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h35m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
    openshift-controller-manager               4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m
    openshift-samples                          4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m
    operator-lifecycle-manager                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
    operator-lifecycle-manager-catalog         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
    operator-lifecycle-manager-packageserver   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m
    service-ca                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
    storage                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
    liuhuali@Lius-MacBook-Pro huali-test %

For case 2, I am unable to connect to the cluster, but I can see on the Nutanix console that the masters are being replaced by new masters through the RollingUpdate:
https://drive.google.com/file/d/1-UbFiUiyhmeBVTBAVaB23jthiZI0VjAm/view?usp=sharing

    liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset
    controlplanemachineset.machine.openshift.io/cluster edited
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    NAME                                        PHASE          TYPE   REGION    ZONE              AGE
    ci-op-0pdvmm2s-f3468-7khf5-master-0         Running        AHV    Unnamed   Development-LTS   71m
    ci-op-0pdvmm2s-f3468-7khf5-master-1         Running        AHV    Unnamed   Development-LTS   71m
    ci-op-0pdvmm2s-f3468-7khf5-master-2         Running        AHV    Unnamed   Development-LTS   71m
    ci-op-0pdvmm2s-f3468-7khf5-master-qmj72-0   Provisioning                                      5s
    ci-op-0pdvmm2s-f3468-7khf5-worker-fbj48     Running        AHV    Unnamed   Development-LTS   68m
    ci-op-0pdvmm2s-f3468-7khf5-worker-pv8jw     Running        AHV    Unnamed   Development-LTS   68m
    ci-op-0pdvmm2s-f3468-7khf5-worker-xpwrf     Running        AHV    Unnamed   Development-LTS   68m
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    Unable to connect to the server: net/http: TLS handshake timeout
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    Unable to connect to the server: EOF
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    Unable to connect to the server: EOF
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    Unable to connect to the server: EOF
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    Unable to connect to the server: EOF
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    Unable to connect to the server: EOF
    liuhuali@Lius-MacBook-Pro huali-test % oc get machine
    Unable to connect to the server: EOF
    liuhuali@Lius-MacBook-Pro huali-test %
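The steps above made the change interactively with oc edit. For a scripted repro, the two subnet orderings could be sketched as JSON Patch documents, roughly as below. This is only a sketch under assumptions: the subnets path inside the ControlPlaneMachineSet spec and the openshift-machine-api namespace are assumed and were not taken from this cluster, and the helper name subnet_patch is hypothetical. The script only prints the oc patch commands and does not run them.

```python
import json

# Assumption: location of the Nutanix subnet list inside the
# ControlPlaneMachineSet spec; verify against the actual object before use.
SUBNETS_PATH = (
    "/spec/template/machines_v1beta1_machine_openshift_io"
    "/spec/providerSpec/value/subnets"
)


def subnet_patch(uuid: str, prepend: bool) -> str:
    """Build a JSON Patch (RFC 6902) document that adds one subnet entry.

    An "add" on index 0 inserts in front of the existing entries (case 1);
    the special index "-" appends after them (case 2).
    """
    index = "0" if prepend else "-"
    ops = [{
        "op": "add",
        "path": f"{SUBNETS_PATH}/{index}",
        "value": {"type": "uuid", "uuid": uuid},
    }]
    return json.dumps(ops)


# Second subnet UUID taken from step 2 of this report.
NEW_SUBNET = "efe26e93-f6cf-4d89-8104-009e85201fa8"

for label, prepend in (("case 1", True), ("case 2", False)):
    print(f"# {label}")
    print(
        "oc patch controlplanemachineset cluster -n openshift-machine-api "
        f"--type=json -p '{subnet_patch(NEW_SUBNET, prepend)}'"
    )
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; the only difference between the two cases is the final path segment (0 versus -).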
Actual results:
The cluster gets stuck (case 1) or becomes unreachable (case 2).
Expected results:
The RollingUpdate completes successfully and the cluster remains reachable.
Additional info:
must-gather for case 1: https://drive.google.com/file/d/1ZeN_5bnCYbOFuCihv1zIt3Y26rmNynBw/view?usp=sharing