Bug
Resolution: Cannot Reproduce
4.18, 4.19
Quality / Stability / Reliability
Critical
Proposed
Known Issue
Description of problem:
Case 1: When the new subnet is added in front of the original subnet in the controlplanemachineset, the cluster gets stuck.
Case 2: When the new subnet is added after the original subnet in the controlplanemachineset, the RollingUpdate sometimes succeeds, but sometimes the cluster becomes unreachable.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2025-02-14-222249
How reproducible:
100% for case 1, 50% for case 2 in my testing
Steps to Reproduce:
1. Install a 4.18 cluster on Nutanix
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.18.0-0.nightly-2025-02-14-222249 True False 40m Cluster version is 4.18.0-0.nightly-2025-02-14-222249
liuhuali@Lius-MacBook-Pro huali-test % oc get infrastructure cluster -oyaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2025-02-17T00:40:20Z"
  generation: 1
  name: cluster
  resourceVersion: "519"
  uid: d0cafa11-dcdf-4f36-ba5b-2a5b0db2e6b8
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    nutanix:
      failureDomains: []
      prismCentral:
        address: prismcentral.lts-cluster.nutanix-dev.devcluster.openshift.com
        port: 9440
      prismElements:
      - endpoint:
          address: 10.0.128.159
          port: 9440
        name: Development-LTS
    type: Nutanix
status:
  apiServerInternalURI: https://api-int.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
  apiServerURL: https://api.ci-op-37d7j87w-590c2.nutanix-ci.devcluster.openshift.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: ci-op-37d7j87w-590c2-8vq5j
  infrastructureTopology: HighlyAvailable
  platform: Nutanix
  platformStatus:
    nutanix:
      apiServerInternalIP: 10.0.130.10
      apiServerInternalIPs:
      - 10.0.130.10
      ingressIP: 10.0.130.11
      ingressIPs:
      - 10.0.130.11
      loadBalancer:
        type: OpenShiftManagedDefault
    type: Nutanix
2. Add a second subnet to the controlplanemachineset.
For case 1, add the new subnet in front of the original subnet in the controlplanemachineset.
Before adding:
subnets:
- type: uuid
  uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
After adding:
subnets:
- type: uuid
  uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
- type: uuid
  uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
For case 2, add the new subnet after the original subnet in the controlplanemachineset.
Before adding:
subnets:
- type: uuid
  uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
After adding:
subnets:
- type: uuid
  uuid: ae6e2fd8-79fe-4a88-a0d0-7d66cc45bdb1
- type: uuid
  uuid: efe26e93-f6cf-4d89-8104-009e85201fa8
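The two edits in step 2 differ only in where the new subnet lands in the list. As a rough sketch (not what was done in this report, where `oc edit` was used), the same change could be expressed as a JSON Patch (RFC 6902) `add` operation: index `0` inserts before the original subnet (case 1), and `-` appends after it (case 2). The `SUBNETS_PATH` value and the helper name `build_subnet_patch` are assumptions for illustration; check the path against your own controlplanemachineset before relying on it.

```python
import json

# Assumed JSON path to the Nutanix subnet list inside the ControlPlaneMachineSet.
# This is not taken from the report; verify it against your own object.
SUBNETS_PATH = (
    "/spec/template/machines_v1beta1_machine_openshift_io"
    "/spec/providerSpec/value/subnets"
)

# UUID of the second subnet, as shown in step 2 of the report.
NEW_SUBNET_UUID = "efe26e93-f6cf-4d89-8104-009e85201fa8"


def build_subnet_patch(new_uuid: str, prepend: bool) -> str:
    """Return a JSON Patch document (as a string) that adds one subnet.

    prepend=True  -> case 1: insert before the existing subnet (index 0)
    prepend=False -> case 2: append after the existing subnet ("-")
    """
    index = "0" if prepend else "-"
    patch = [
        {
            "op": "add",
            "path": f"{SUBNETS_PATH}/{index}",
            "value": {"type": "uuid", "uuid": new_uuid},
        }
    ]
    return json.dumps(patch)


if __name__ == "__main__":
    for label, prepend in (("case 1", True), ("case 2", False)):
        print(label, build_subnet_patch(NEW_SUBNET_UUID, prepend))
```

The printed string could then be passed to something like `oc patch controlplanemachineset cluster -n openshift-machine-api --type=json -p '<patch>'`, assuming the default object name and namespace.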
3. For case 1, one old master gets stuck in the Deleting phase (in my testing it was sometimes master-0, sometimes master-1, sometimes master-2).
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE       TYPE  REGION    ZONE              AGE
ci-op-37d7j87w-590c2-8vq5j-master-1         Deleting    AHV   Unnamed   Development-LTS   3h53m
ci-op-37d7j87w-590c2-8vq5j-master-2         Running     AHV   Unnamed   Development-LTS   3h53m
ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0   Running     AHV   Unnamed   Development-LTS   166m
ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1   Running     AHV   Unnamed   Development-LTS   156m
ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj     Running     AHV   Unnamed   Development-LTS   3h50m
ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q     Running     AHV   Unnamed   Development-LTS   3h50m
ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b     Running     AHV   Unnamed   Development-LTS   3h50m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
ci-op-37d7j87w-590c2-8vq5j-master-1 Ready,SchedulingDisabled control-plane,master 3h53m v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-2 Ready control-plane,master 3h53m v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-58wn6-0 Ready control-plane,master 164m v1.31.5
ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1 Ready control-plane,master 154m v1.31.5
ci-op-37d7j87w-590c2-8vq5j-worker-gw9cj Ready worker 3h37m v1.31.5
ci-op-37d7j87w-590c2-8vq5j-worker-rcm2q Ready worker 3h37m v1.31.5
ci-op-37d7j87w-590c2-8vq5j-worker-tpd9b Ready worker 3h37m v1.31.5
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.18.0-0.nightly-2025-02-14-222249 True True True 3h28m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
cloud-controller-manager 4.18.0-0.nightly-2025-02-14-222249 True False False 3h52m
cloud-credential 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
cluster-autoscaler 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
config-operator 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
console 4.18.0-0.nightly-2025-02-14-222249 True False False 3h34m
control-plane-machine-set 4.18.0-0.nightly-2025-02-14-222249 True True False 3h46m Observed 1 replica(s) in need of update
csi-snapshot-controller 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
dns 4.18.0-0.nightly-2025-02-14-222249 True False False 3h50m
etcd 4.18.0-0.nightly-2025-02-14-222249 True True False 3h48m NodeInstallerProgressing: 2 nodes are at revision 8; 1 node is at revision 10; 1 node is at revision 15; 0 nodes have achieved new revision 17
image-registry 4.18.0-0.nightly-2025-02-14-222249 True False False 3h19m
ingress 4.18.0-0.nightly-2025-02-14-222249 True False False 3h35m
insights 4.18.0-0.nightly-2025-02-14-222249 True False False 3h50m
kube-apiserver 4.18.0-0.nightly-2025-02-14-222249 True True True 3h46m GuardControllerDegraded: Missing operand on node ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1
kube-controller-manager 4.18.0-0.nightly-2025-02-14-222249 True False False 3h46m
kube-scheduler 4.18.0-0.nightly-2025-02-14-222249 True False False 3h48m
kube-storage-version-migrator 4.18.0-0.nightly-2025-02-14-222249 True False False 104m
machine-api 4.18.0-0.nightly-2025-02-14-222249 True False False 3h37m
machine-approver 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
machine-config 4.18.0-0.nightly-2025-02-14-222249 True False True 3h50m Failed to resync 4.18.0-0.nightly-2025-02-14-222249 because: error during syncRequiredMachineConfigPools: [context deadline exceeded, error required MachineConfigPool master is not ready, retrying. Status: (total: 4, ready 3, updated: 4, unavailable: 1, degraded: 0)]
marketplace 4.18.0-0.nightly-2025-02-14-222249 True False False 3h50m
monitoring 4.18.0-0.nightly-2025-02-14-222249 True False False 3h33m
network 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
node-tuning 4.18.0-0.nightly-2025-02-14-222249 True False False 154m
olm 4.18.0-0.nightly-2025-02-14-222249 True False False 104m
openshift-apiserver 4.18.0-0.nightly-2025-02-14-222249 True True True 3h35m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager 4.18.0-0.nightly-2025-02-14-222249 True False False 3h41m
openshift-samples 4.18.0-0.nightly-2025-02-14-222249 True False False 3h41m
operator-lifecycle-manager 4.18.0-0.nightly-2025-02-14-222249 True False False 3h50m
operator-lifecycle-manager-catalog 4.18.0-0.nightly-2025-02-14-222249 True False False 3h50m
operator-lifecycle-manager-packageserver 4.18.0-0.nightly-2025-02-14-222249 True False False 3h41m
service-ca 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
storage 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
liuhuali@Lius-MacBook-Pro huali-test %
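The `oc get co` listing above is long, and only a handful of operators are actually progressing or degraded. A minimal sketch for pulling those rows out of pasted output follows; the function name `unhealthy_operators` is hypothetical, and the parser assumes the whitespace-separated column order shown in the header (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE).

```python
def unhealthy_operators(co_output: str) -> list[tuple[str, str]]:
    """Return (name, message) for operators with PROGRESSING or DEGRADED set to True."""
    results = []
    for line in co_output.strip().splitlines():
        # Split into at most 7 fields so the free-text MESSAGE stays in one piece.
        fields = line.split(None, 6)
        # Skip the header row and anything that is not an operator row (e.g. a shell prompt).
        if len(fields) < 6 or fields[0] == "NAME":
            continue
        name, _version, _available, progressing, degraded = fields[:5]
        message = fields[6] if len(fields) > 6 else ""
        if progressing == "True" or degraded == "True":
            results.append((name, message))
    return results


# A few rows copied from the output in this report.
SAMPLE = """\
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
baremetal 4.18.0-0.nightly-2025-02-14-222249 True False False 3h51m
control-plane-machine-set 4.18.0-0.nightly-2025-02-14-222249 True True False 3h46m Observed 1 replica(s) in need of update
kube-apiserver 4.18.0-0.nightly-2025-02-14-222249 True True True 3h46m GuardControllerDegraded: Missing operand on node ci-op-37d7j87w-590c2-8vq5j-master-mvwnh-1
"""

if __name__ == "__main__":
    for name, message in unhealthy_operators(SAMPLE):
        print(f"{name}: {message}")
```

Run against the full listing, this would surface authentication, control-plane-machine-set, etcd, kube-apiserver, machine-config and openshift-apiserver, which are the operators affected by the stuck master replacement.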
For case 2, I was unable to connect to the cluster, but I could see on the Nutanix console that the masters were being replaced with new masters by the RollingUpdate: https://drive.google.com/file/d/1-UbFiUiyhmeBVTBAVaB23jthiZI0VjAm/view?usp=sharing
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                        PHASE          TYPE  REGION    ZONE              AGE
ci-op-0pdvmm2s-f3468-7khf5-master-0         Running        AHV   Unnamed   Development-LTS   71m
ci-op-0pdvmm2s-f3468-7khf5-master-1         Running        AHV   Unnamed   Development-LTS   71m
ci-op-0pdvmm2s-f3468-7khf5-master-2         Running        AHV   Unnamed   Development-LTS   71m
ci-op-0pdvmm2s-f3468-7khf5-master-qmj72-0   Provisioning                                     5s
ci-op-0pdvmm2s-f3468-7khf5-worker-fbj48     Running        AHV   Unnamed   Development-LTS   68m
ci-op-0pdvmm2s-f3468-7khf5-worker-pv8jw     Running        AHV   Unnamed   Development-LTS   68m
ci-op-0pdvmm2s-f3468-7khf5-worker-xpwrf     Running        AHV   Unnamed   Development-LTS   68m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: net/http: TLS handshake timeout
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
Unable to connect to the server: EOF
liuhuali@Lius-MacBook-Pro huali-test %
Actual results:
The cluster gets stuck (case 1) or becomes unreachable (case 2).
Expected results:
The RollingUpdate completes successfully and the cluster remains reachable.
Additional info:
Must-gather for case 1: https://drive.google.com/file/d/1ZeN_5bnCYbOFuCihv1zIt3Y26rmNynBw/view?usp=sharing