Bug
Resolution: Not a Bug
Major
None
4.13
None
Critical
No
CLOUD Sprint 249, CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255, CLOUD Sprint 256, CLOUD Sprint 257, CLOUD Sprint 258, CLOUD Sprint 259, CLOUD Sprint 260, CLOUD Sprint 261
13
False
Description of problem:
When doing master replacement (a control plane machine set rolling update) on a cluster with the OVN network plugin, one old machine gets stuck in Deleting and many cluster operators become degraded. We previously tested this on AWS in 4.12 and reported https://issues.redhat.com/browse/OCPBUGS-5306; the failure rate seems to have decreased, but the problem still occurs. Tried this on five clusters, with the following results:
- vSphere + OVN: master-1 stuck in Deleting during the first rolling update
- vSphere + OVN: master-2 stuck in Deleting during the first rolling update
- vSphere + SDN: rolling update 4 times, no issue
- vSphere + SDN: rolling update 3 times, no issue
- GCP + OVN: master-0 stuck in Deleting during the second rolling update
Did not check on Azure because of https://issues.redhat.com/browse/OCPBUGS-7359 on Azure.
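For quick triage on a cluster in this state, something like the following can show why the old machine is stuck (illustrative commands only; the machine name below is taken from the first vSphere + OVN cluster, and the machine-api-controllers deployment/container names are the usual defaults, which may differ by version):

# List machine phases; the old master stays in Deleting
oc get machines -n openshift-machine-api
# Look at the deletion timestamp and any finalizers still holding the machine
oc get machine huliu-vs29d-j6nns-master-1 -n openshift-machine-api -o yaml | grep -A3 -E 'deletionTimestamp|finalizers'
# Check the machine controller for drain/delete errors on that machine
oc logs -n openshift-machine-api deployment/machine-api-controllers -c machine-controller | grep huliu-vs29d-j6nns-master-1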
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-03-28-014156
How reproducible:
Not consistently; tried on 7 clusters, 4 hit the issue and 3 did not.
Steps to Reproduce:
1. Create the CPMS:

liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlplanemachineset_vsphere.yaml
controlplanemachineset.machine.openshift.io/cluster created

2. Edit the CPMS to trigger a master rolling update; here I changed numCPUs from 8 to 4 (a non-interactive equivalent of this edit is sketched after the steps):

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs29d-j6nns-master-0         Running                               82m
huliu-vs29d-j6nns-master-1         Running                               82m
huliu-vs29d-j6nns-master-2         Running                               82m
huliu-vs29d-j6nns-master-t644l-0   Provisioning                          3s
huliu-vs29d-j6nns-worker-0-46x2r   Running                               76m
huliu-vs29d-j6nns-worker-0-xsjqt   Running                               76m

3. Several hours later the rolling update is stuck: master-1 stays in Deleting and several cluster operators are degraded:

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE      TYPE   REGION   ZONE   AGE
huliu-vs29d-j6nns-master-1         Deleting                          6h38m
huliu-vs29d-j6nns-master-2         Running                           6h38m
huliu-vs29d-j6nns-master-t644l-0   Running                           5h16m
huliu-vs29d-j6nns-master-wqv42-1   Running                           5h7m
huliu-vs29d-j6nns-worker-0-46x2r   Running                           6h32m
huliu-vs29d-j6nns-worker-0-xsjqt   Running                           6h32m

liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2023-03-28-014156   True        True          True       65s     APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver (2 containers are waiting in pending apiserver-5f48dd5fc5-7m478 pod)...
baremetal                                  4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
cloud-controller-manager                   4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h36m
cloud-credential                           4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h37m
cluster-autoscaler                         4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
config-operator                            4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m
console                                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      4m5s
control-plane-machine-set                  4.13.0-0.nightly-2023-03-28-014156   True        True          False      6h33m   Observed 1 replica(s) in need of update
csi-snapshot-controller                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
dns                                        4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
etcd                                       4.13.0-0.nightly-2023-03-28-014156   True        True          True       6h31m   GuardControllerDegraded: Missing operand on node huliu-vs29d-j6nns-master-wqv42-1...
image-registry                             4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h9m
ingress                                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h20m
insights                                   4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h26m
kube-apiserver                             4.13.0-0.nightly-2023-03-28-014156   True        True          True       6h28m   GuardControllerDegraded: Missing operand on node huliu-vs29d-j6nns-master-wqv42-1...
kube-controller-manager                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h30m
kube-scheduler                             4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h30m
kube-storage-version-migrator              4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m
machine-api                                4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h21m
machine-approver                           4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m
machine-config                             4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
marketplace                                4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
monitoring                                 4.13.0-0.nightly-2023-03-28-014156   False       True          True       39s     deleting Thanos Ruler Route failed: the server is currently unable to handle the request (delete routes.route.openshift.io thanos-ruler), reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus Federate Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s-federate)
network                                    4.13.0-0.nightly-2023-03-28-014156   True        True          False      6h32m   DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 1 nodes)...
node-tuning                                4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
openshift-apiserver                        4.13.0-0.nightly-2023-03-28-014156   False       True          True       77s     APIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
openshift-controller-manager               4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
openshift-samples                          4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h26m
operator-lifecycle-manager                 4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h26m
service-ca                                 4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m
storage                                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h29m
liuhuali@Lius-MacBook-Pro huali-test %
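As referenced in step 2, the numCPUs change can also be made non-interactively. This is a minimal sketch assuming the standard ControlPlaneMachineSet field layout for the vSphere provider spec; the actual controlplanemachineset_vsphere.yaml used for this test is not attached, so the JSON patch path is an assumption:

# Change numCPUs from 8 to 4 on the CPMS template's vSphere provider spec
oc patch controlplanemachineset cluster -n openshift-machine-api --type json -p '[
  {"op": "replace",
   "path": "/spec/template/machines_v1beta1_machine_openshift_io/spec/providerSpec/value/numCPUs",
   "value": 4}
]'

With the default RollingUpdate strategy, this spec change is what triggers the replacement of each master shown in the outputs above.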
Actual results:
The rolling update cannot complete successfully; one old master machine stays stuck in Deleting and several cluster operators become degraded.
Expected results:
The rolling update should complete successfully, with all old master machines deleted and all cluster operators healthy.
Additional info:
Must-gather of the first vSphere + OVN cluster: https://drive.google.com/file/d/1AmH9Eu2qkHN41QSoyb0b_jSkHyq6WHmW/view?usp=sharing
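For completeness, the archive above is a standard must-gather; an equivalent collection from a cluster in this state would look something like the following (the exact flags used for this particular archive are not recorded here, and the destination directory name is only an example):

oc adm must-gather --dest-dir=./must-gather-vsphere-ovn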