[OCPBUGS-11025] One old machine stuck in Deleting and many co get degraded when doing master replacement on the cluster with OVN network on vSphere - Red Hat Issue Tracker

Type: Bug
Resolution: Not a Bug
Priority: Major
Fix Version/s: None
Affects Version/s: 4.13
Component/s: Cloud Compute / Unknown
Labels:
None

Severity:
Critical
Regression:
No
Sprint:
CLOUD Sprint 249, CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255, CLOUD Sprint 256, CLOUD Sprint 257, CLOUD Sprint 258, CLOUD Sprint 259, CLOUD Sprint 260, CLOUD Sprint 261
sprint_count:
13
Blocked:
False
Blocked Reason:

None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

One old machine gets stuck in Deleting and many cluster operators (co) become degraded when replacing masters on a cluster with the OVN network. We previously tested this on AWS in 4.12 and reported it as https://issues.redhat.com/browse/OCPBUGS-5306. The failure rate seems to have decreased, but the problem still happens.

Tried this on five clusters; the test results are below:
vSphere + OVN: master-1 stuck in Deleting in the first rolling update
vSphere + OVN: master-2 stuck in Deleting in the first rolling update
vSphere + SDN: rolling update 4 times, no issue.
vSphere + SDN: rolling update 3 times, no issue.
GCP + OVN: master-0 stuck in Deleting in the second rolling update

Did not check on Azure because bug https://issues.redhat.com/browse/OCPBUGS-7359 affects Azure.

Version-Release number of selected component (if applicable):

4.13.0-0.nightly-2023-03-28-014156

How reproducible:

Not sure. Tried on 7 clusters: 4 had the issue, 3 did not.

Steps to Reproduce:

1. Create the CPMS (ControlPlaneMachineSet)
liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlplanemachineset_vsphere.yaml
controlplanemachineset.machine.openshift.io/cluster created 

2. Edit the CPMS to trigger a master update; here I change numCPUs from 8 to 4
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE          TYPE   REGION   ZONE   AGE
huliu-vs29d-j6nns-master-0         Running                               82m
huliu-vs29d-j6nns-master-1         Running                               82m
huliu-vs29d-j6nns-master-2         Running                               82m
huliu-vs29d-j6nns-master-t644l-0   Provisioning                          3s
huliu-vs29d-j6nns-worker-0-46x2r   Running                               76m
huliu-vs29d-j6nns-worker-0-xsjqt   Running                               76m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE      TYPE   REGION   ZONE   AGE
huliu-vs29d-j6nns-master-1         Deleting                          6h38m
huliu-vs29d-j6nns-master-2         Running                           6h38m
huliu-vs29d-j6nns-master-t644l-0   Running                           5h16m
huliu-vs29d-j6nns-master-wqv42-1   Running                           5h7m
huliu-vs29d-j6nns-worker-0-46x2r   Running                           6h32m
huliu-vs29d-j6nns-worker-0-xsjqt   Running                           6h32m
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.13.0-0.nightly-2023-03-28-014156   True        True          True       65s     APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver (2 containers are waiting in pending apiserver-5f48dd5fc5-7m478 pod)...
baremetal                                  4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
cloud-controller-manager                   4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h36m   
cloud-credential                           4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h37m   
cluster-autoscaler                         4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
config-operator                            4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m   
console                                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      4m5s    
control-plane-machine-set                  4.13.0-0.nightly-2023-03-28-014156   True        True          False      6h33m   Observed 1 replica(s) in need of update
csi-snapshot-controller                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
dns                                        4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
etcd                                       4.13.0-0.nightly-2023-03-28-014156   True        True          True       6h31m   GuardControllerDegraded: Missing operand on node huliu-vs29d-j6nns-master-wqv42-1...
image-registry                             4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h9m    
ingress                                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h20m   
insights                                   4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h26m   
kube-apiserver                             4.13.0-0.nightly-2023-03-28-014156   True        True          True       6h28m   GuardControllerDegraded: Missing operand on node huliu-vs29d-j6nns-master-wqv42-1...
kube-controller-manager                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h30m   
kube-scheduler                             4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h30m   
kube-storage-version-migrator              4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m   
machine-api                                4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h21m   
machine-approver                           4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m   
machine-config                             4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
marketplace                                4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
monitoring                                 4.13.0-0.nightly-2023-03-28-014156   False       True          True       39s     deleting Thanos Ruler Route failed: the server is currently unable to handle the request (delete routes.route.openshift.io thanos-ruler), reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus Federate Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s-federate)
network                                    4.13.0-0.nightly-2023-03-28-014156   True        True          False      6h32m   DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 1 nodes)...
node-tuning                                4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
openshift-apiserver                        4.13.0-0.nightly-2023-03-28-014156   False       True          True       77s     APIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
openshift-controller-manager               4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
openshift-samples                          4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h26m   
operator-lifecycle-manager                 4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h26m   
service-ca                                 4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m   
storage                                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h29m   
liuhuali@Lius-MacBook-Pro huali-test %
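
The two checks in the transcript above (a machine left in Deleting, and operators reporting DEGRADED=True) can be pulled out of the command output with a small filter. This is only a sketch: `stuck_machines` and `degraded_operators` are hypothetical helper names, not part of `oc`, and they assume the default column layout shown above (PHASE is the 2nd column of `oc get machine`, DEGRADED is the 5th column of `oc get co`).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of oc). Both read command output on stdin
# and assume the default table layout shown in the transcript.

# Print the names of machines whose PHASE column (2nd field) is "Deleting".
stuck_machines() {
  awk 'NR > 1 && $2 == "Deleting" { print $1 }'
}

# Print the names of cluster operators whose DEGRADED column (5th field) is "True".
degraded_operators() {
  awk 'NR > 1 && $5 == "True" { print $1 }'
}

# Usage, assuming a logged-in session:
#   oc get machine -n openshift-machine-api | stuck_machines
#   oc get co | degraded_operators
```

The filters only report the current state; they do not tell whether a machine in Deleting is stuck or still draining, so the AGE column and repeated runs are still needed to judge that.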

Actual results:

RollingUpdate cannot complete successfully

Expected results:

RollingUpdate should complete successfully

Additional info:

must-gather from the first vSphere + OVN cluster: https://drive.google.com/file/d/1AmH9Eu2qkHN41QSoyb0b_jSkHyq6WHmW/view?usp=sharing

Assignee: Radek Manak

Reporter: Huali Liu

QA Contact: Huali Liu

Votes: 0

Watchers: 6

Created: 2023/03/29 8:14 AM

Updated: 2024/10/31 4:42 PM

Resolved: 2024/10/31 4:42 PM
