OpenShift Bugs / OCPBUGS-11025

One old machine stuck in Deleting and many cluster operators become degraded during master replacement on a cluster with OVN network on vSphere

    • Bug
    • Resolution: Not a Bug
    • Major
    • None
    • 4.13
    • None
    • Critical
    • No
    • CLOUD Sprint 249, CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255, CLOUD Sprint 256, CLOUD Sprint 257, CLOUD Sprint 258, CLOUD Sprint 259, CLOUD Sprint 260, CLOUD Sprint 261
    • 13
    • False

      Description of problem:

      One old machine gets stuck in Deleting and many cluster operators become degraded when doing master replacement on a cluster with OVN network. We previously tested this on AWS in 4.12 and reported it as https://issues.redhat.com/browse/OCPBUGS-5306. The failure rate seems to have decreased, but the issue still happens.
      
      Tried this on five clusters; the results are as follows:
      vSphere + OVN: master-1 stuck in Deleting during the first rolling update
      vSphere + OVN: master-2 stuck in Deleting during the first rolling update
      vSphere + SDN: 4 rolling updates, no issue
      vSphere + SDN: 3 rolling updates, no issue
      GCP + OVN: master-0 stuck in Deleting during the second rolling update
      
      Did not check on Azure because of an existing Azure bug, https://issues.redhat.com/browse/OCPBUGS-7359.

      Version-Release number of selected component (if applicable):

      4.13.0-0.nightly-2023-03-28-014156

      How reproducible:

      Not sure. Tried on 7 clusters: 4 hit the issue, 3 did not.

      Steps to Reproduce:

      1. Create the ControlPlaneMachineSet (CPMS)
      liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlplanemachineset_vsphere.yaml
      controlplanemachineset.machine.openshift.io/cluster created 
      
      2. Edit the CPMS to trigger a master update; here I changed numCPUs from 8 to 4
      liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
      controlplanemachineset.machine.openshift.io/cluster edited
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                               PHASE          TYPE   REGION   ZONE   AGE
      huliu-vs29d-j6nns-master-0         Running                               82m
      huliu-vs29d-j6nns-master-1         Running                               82m
      huliu-vs29d-j6nns-master-2         Running                               82m
      huliu-vs29d-j6nns-master-t644l-0   Provisioning                          3s
      huliu-vs29d-j6nns-worker-0-46x2r   Running                               76m
      huliu-vs29d-j6nns-worker-0-xsjqt   Running                               76m
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                               PHASE      TYPE   REGION   ZONE   AGE
      huliu-vs29d-j6nns-master-1         Deleting                          6h38m
      huliu-vs29d-j6nns-master-2         Running                           6h38m
      huliu-vs29d-j6nns-master-t644l-0   Running                           5h16m
      huliu-vs29d-j6nns-master-wqv42-1   Running                           5h7m
      huliu-vs29d-j6nns-worker-0-46x2r   Running                           6h32m
      huliu-vs29d-j6nns-worker-0-xsjqt   Running                           6h32m
      liuhuali@Lius-MacBook-Pro huali-test % oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.13.0-0.nightly-2023-03-28-014156   True        True          True       65s     APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver (2 containers are waiting in pending apiserver-5f48dd5fc5-7m478 pod)...
      baremetal                                  4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      cloud-controller-manager                   4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h36m   
      cloud-credential                           4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h37m   
      cluster-autoscaler                         4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      config-operator                            4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m   
      console                                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      4m5s    
      control-plane-machine-set                  4.13.0-0.nightly-2023-03-28-014156   True        True          False      6h33m   Observed 1 replica(s) in need of update
      csi-snapshot-controller                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      dns                                        4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      etcd                                       4.13.0-0.nightly-2023-03-28-014156   True        True          True       6h31m   GuardControllerDegraded: Missing operand on node huliu-vs29d-j6nns-master-wqv42-1...
      image-registry                             4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h9m    
      ingress                                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h20m   
      insights                                   4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h26m   
      kube-apiserver                             4.13.0-0.nightly-2023-03-28-014156   True        True          True       6h28m   GuardControllerDegraded: Missing operand on node huliu-vs29d-j6nns-master-wqv42-1...
      kube-controller-manager                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h30m   
      kube-scheduler                             4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h30m   
      kube-storage-version-migrator              4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m   
      machine-api                                4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h21m   
      machine-approver                           4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m   
      machine-config                             4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      marketplace                                4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      monitoring                                 4.13.0-0.nightly-2023-03-28-014156   False       True          True       39s     deleting Thanos Ruler Route failed: the server is currently unable to handle the request (delete routes.route.openshift.io thanos-ruler), reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus Federate Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s-federate)
      network                                    4.13.0-0.nightly-2023-03-28-014156   True        True          False      6h32m   DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 1 nodes)...
      node-tuning                                4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      openshift-apiserver                        4.13.0-0.nightly-2023-03-28-014156   False       True          True       77s     APIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
      openshift-controller-manager               4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      openshift-samples                          4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h26m   
      operator-lifecycle-manager                 4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      operator-lifecycle-manager-catalog         4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h32m   
      operator-lifecycle-manager-packageserver   4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h26m   
      service-ca                                 4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h33m   
      storage                                    4.13.0-0.nightly-2023-03-28-014156   True        False         False      6h29m   
      liuhuali@Lius-MacBook-Pro huali-test %  
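
The state shown in the transcript above can also be checked without reading the tables by eye. The following is a minimal sketch, not part of the original report: the helper names `deleting_machines` and `degraded_operators` are hypothetical, and both assume the default table layout of `oc get machine` and `oc get co` as printed above (PHASE in the second column, DEGRADED in the fifth).

```shell
#!/bin/sh
# Hypothetical helpers for triaging a stalled control plane rollout.
# Both read the default table output of `oc get ...` on stdin.

# Print the names of machines whose PHASE column is "Deleting".
# Assumes columns: NAME PHASE TYPE REGION ZONE AGE (header on line 1).
deleting_machines() {
  awk 'NR > 1 && $2 == "Deleting" { print $1 }'
}

# Print the names of cluster operators whose DEGRADED column is "True".
# Assumes columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE.
degraded_operators() {
  awk 'NR > 1 && $5 == "True" { print $1 }'
}
```

Assuming a logged-in `oc` session, they could be used as `oc get machine -n openshift-machine-api | deleting_machines` and `oc get co | degraded_operators`. On the output above, the first would print huliu-vs29d-j6nns-master-1 and the second would print authentication, etcd, kube-apiserver, monitoring and openshift-apiserver.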

      Actual results:

      The RollingUpdate does not complete successfully

      Expected results:

      RollingUpdate should complete successfully

      Additional info:

      must-gather from the first vSphere + OVN cluster: https://drive.google.com/file/d/1AmH9Eu2qkHN41QSoyb0b_jSkHyq6WHmW/view?usp=sharing

              rmanak@redhat.com Radek Manak
              huliu@redhat.com Huali Liu
              Huali Liu Huali Liu