Type: Bug
Resolution: Unresolved
Affects Versions: 4.18, 4.19
Priority: Critical
Release Note Type: Known Issue
Release Note Status: Proposed
Description of problem:
Case 1: Adding the new subnet in front of the original subnet in the ControlPlaneMachineSet leaves the cluster stuck.
Case 2: Adding the new subnet after the original subnet in the ControlPlaneMachineSet sometimes completes the RollingUpdate successfully, but sometimes leaves the cluster unreachable.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2025-02-14-222249
How reproducible:
100% for case 1, 50% for case 2 in my testing
Steps to Reproduce:
1. Install a 4.18 cluster on Nutanix.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-0.nightly-2025-02-14-222249   True        False         40m     Cluster version is 4.18.0-0.nightly-2025-02-14-222249

liuhuali@Lius-MacBook-Pro huali-test % oc get infrastructure cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2025-02-17T00:40:20Z"
  generation: 1
  name: cluster
  resourceVersion: "519"
  uid: d0cafa11-dcdf-4f36-ba5b-2a5b0db2e6b8
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    nutanix:
      failureDomains: []
      prismCentral:
        address: prismcentral.lts-cluster.nutanix-dev.devcluster.openshift.com
        port: 9440
      prismElements:
      - endpoint:
          address: 10.0.128.159
          port: 9440
        name: Development-LTS
    type: Nutanix
status:
  apiServerInternalURI: https://api-int.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
  apiServerURL: https://api.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: ci-op-37d7j87w-590c2-8vq5j
  infrastructureTopology: HighlyAvailable
  platform: Nutanix
  platformStatus:
    nutanix:
      apiServerInternalIP: 10.0.130.10
      apiServerInternalIPs:
      - 10.0.130.10
      ingressIP: 10.0.130.11
      ingressIPs:
      - 10.0.130.11
      loadBalancer:
        type: OpenShiftManagedDefault
    type: Nutanix

2. Add a second subnet in the ControlPlaneMachineSet (a sketch of where this list sits in the resource follows below).

For case 1, add the new subnet in front of the original subnet.

Before adding:
    subnets:
    - type: uuid
      uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1

After adding:
    subnets:
    - type: uuid
      uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
    - type: uuid
      uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1

For case 2, add the new subnet after the original subnet.

Before adding:
    subnets:
    - type: uuid
      uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1

After adding:
    subnets:
    - type: uuid
      uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
    - type: uuid
      uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
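For reference, a minimal sketch of where the subnets list edited in step 2 sits inside the ControlPlaneMachineSet (field names per the machine.openshift.io/v1 ControlPlaneMachineSet and NutanixMachineProviderConfig schemas; the surrounding values are illustrative, not copied from this cluster):

apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      spec:
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1
            kind: NutanixMachineProviderConfig
            # ... other provider fields elided ...
            subnets:                                       # the list edited in step 2
            - type: uuid
              uuid: efe26e93-f6cf-4d89-8104-009e85201fa8   # new subnet (case 1 order: first)
            - type: uuid
              uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1   # original subnet

The edit was made in place with: oc edit controlplanemachineset cluster -n openshift-machine-api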
3. Observe the result of the rolling update.

For case 1, one old master is stuck. (Sometimes it is master-0, sometimes master-1, sometimes master-2 in my testing.)

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE      TYPE   REGION    ZONE              AGE
ci-op-37d7j87w-590c2-8vq5j-master-1         Deleting   AHV    Unnamed   Development-LTS   3h53m
ci-op-37d7j87w-590c2-8vq5j-master-2         Running    AHV    Unnamed   Development-LTS   3h53m
ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Running    AHV    Unnamed   Development-LTS   166m
ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Running    AHV    Unnamed   Development-LTS   156m
ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Running    AHV    Unnamed   Development-LTS   3h50m
ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Running    AHV    Unnamed   Development-LTS   3h50m
ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Running    AHV    Unnamed   Development-LTS   3h50m

liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                        STATUS                     ROLES                  AGE     VERSION
ci-op-37d7j87w-590c2-8vq5j-master-1         Ready,SchedulingDisabled   control-plane,master   3h53m   v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-2         Ready                      control-plane,master   3h53m   v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Ready                      control-plane,master   164m    v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Ready                      control-plane,master   154m    v1.31.5
ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Ready                      worker                 3h37m   v1.31.5
ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Ready                      worker                 3h37m   v1.31.5
ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Ready                      worker                 3h37m   v1.31.5

liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h28m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
cloud-controller-manager                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h52m
cloud-credential                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
cluster-autoscaler                         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
config-operator                            4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
console                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h34m
control-plane-machine-set                  4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h46m   Observed 1 replica(s) in need of update
csi-snapshot-controller                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
dns                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
etcd                                       4.18.0-0.nightly-2025-02-14-222249   True        True          False      3h48m   NodeInstallerProgressing: 2 nodes are at revision 8; 1 node is at revision 10; 1 node is at revision 15; 0 nodes have achieved new revision 17
image-registry                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h19m
ingress                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h35m
insights                                   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
kube-apiserver                             4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h46m   GuardControllerDegraded: Missing operand on node ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1
kube-controller-manager                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h46m
kube-scheduler                             4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h48m
kube-storage-version-migrator              4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m
machine-api                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h37m
machine-approver                           4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
machine-config                             4.18.0-0.nightly-2025-02-14-222249   True        False         True       3h50m   Failed to resync 4.18.0-0.nightly-2025-02-14-222249 because: error during syncRequiredMachineConfigPools: [context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 4, ready 3, updated: 4, unavailable: 1, degraded: 0)]
marketplace                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
monitoring                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h33m
network                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
node-tuning                                4.18.0-0.nightly-2025-02-14-222249   True        False         False      154m
olm                                        4.18.0-0.nightly-2025-02-14-222249   True        False         False      104m
openshift-apiserver                        4.18.0-0.nightly-2025-02-14-222249   True        True          True       3h35m   APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m
openshift-samples                          4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m
operator-lifecycle-manager                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
operator-lifecycle-manager-catalog         4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h50m
operator-lifecycle-manager-packageserver   4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h41m
service-ca                                 4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
storage                                    4.18.0-0.nightly-2025-02-14-222249   True        False         False      3h51m
liuhuali@Lius-MacBook-Pro huali-test %

For case 2, I was unable to connect to the cluster, but on the Nutanix console I could see the masters rolling over to new masters: https://drive.google.com/file/d/1-UbFiUiyhmeBVTBAVaB23jthiZI0VjAm/view?usp=sharing

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE          TYPE   REGION    ZONE              AGE
ci-op-0pdvmm2s-f3468-7khf5-master-0         Running        AHV    Unnamed   Development-LTS   71m
ci-op-0pdvmm2s-f3468-7khf5-master-1         Running        AHV    Unnamed   Development-LTS   71m
ci-op-0pdvmm2s-f3468-7khf5-master-2         Running        AHV    Unnamed   Development-LTS   71m
ci-op-0pdvmm2s-f3468-7khf5-master-qmj72-0   Provisioning                                      5s
ci-op-0pdvmm2s-f3468-7khf5-worker-fbj48     Running        AHV    Unnamed   Development-LTS   68m
ci-op-0pdvmm2s-f3468-7khf5-worker-pv8jw     Running        AHV    Unnamed   Development-LTS   68m
ci-op-0pdvmm2s-f3468-7khf5-worker-xpwrf     Running        AHV    Unnamed   Development-LTS   68m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: net/http: TLS handshake timeout
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
(The last command was repeated five more times with the same "Unable to connect to the server: EOF" result.)
liuhuali@Lius-MacBook-Pro huali-test %
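In case 1 the API server stays reachable, so the state above (and the must-gather linked under Additional info) can be collected with stock oc commands; a minimal sketch:

# Inspect the stuck rolling update (standard oc invocations)
oc get machines -n openshift-machine-api -o wide   # shows the old master stuck in the Deleting phase
oc get nodes                                       # the drained master reports Ready,SchedulingDisabled
oc get co                                          # etcd, kube-apiserver, and machine-config report Progressing/Degraded
oc adm must-gather                                 # collects the diagnostics archive linked under Additional info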
Actual results:
The cluster is stuck (case 1) or unreachable (case 2).
Expected results:
The RollingUpdate completes successfully and the cluster remains reachable.
Additional info:
Must-gather for case 1: https://drive.google.com/file/d/1ZeN_5bnCYbOFuCihv1zIt3Y26rmNynBw/view?usp=sharing
Also tested adding failureDomains day 2 today on 4.18, following test case OCP-70808 ([ipi-on-nutanix] adding failureDomains to an existing Nutanix cluster) but with two subnets set for each failure domain. Hit the same issue: the cluster becomes unreachable.
Steps:
1. Install a Nutanix IPI cluster without failureDomains.
2. Enable the feature gate and wait for the cluster to become ready.
3. Edit the infrastructure cluster object to add failureDomains; the masters do not update.
4. Edit the controlplanemachineset cluster object to add failureDomains (a sketch of both edits follows below); the masters start updating, but after some time the cluster becomes unreachable.
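A minimal sketch of the day-2 edits from steps 3 and 4 (schemas per config.openshift.io/v1 Infrastructure and machine.openshift.io/v1 ControlPlaneMachineSet; the failure-domain name, cluster UUID, and subnet UUIDs are hypothetical placeholders, not the values used in this test):

# Step 2 (assumption): the feature gate was presumably enabled via the FeatureGate CR, e.g.
#   oc patch featuregate cluster --type merge -p '{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'

# Step 3: Infrastructure object "cluster", one failure domain with two subnets
spec:
  platformSpec:
    nutanix:
      failureDomains:
      - name: fd-1                                     # hypothetical failure-domain name
        cluster:
          type: uuid
          uuid: 00000000-0000-0000-0000-000000000001   # hypothetical Prism Element UUID
        subnets:
        - type: uuid
          uuid: 00000000-0000-0000-0000-000000000002   # hypothetical first subnet UUID
        - type: uuid
          uuid: 00000000-0000-0000-0000-000000000003   # hypothetical second subnet UUID

# Step 4: ControlPlaneMachineSet "cluster", reference the failure domain by name
spec:
  template:
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: Nutanix
        nutanix:
        - name: fd-1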