OpenShift Bugs / OCPBUGS-11675

OCP upgrade from 4.12.11 to 4.13.0-rc.2-x86_64 did not complete with unknown error


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Undefined
    • Affects Version: 4.13.0
    • Component: Compliance Operator
    • Quality / Stability / Reliability
    • Sprint: NHE Sprint 235

      Description of problem:

       OCP upgrade from 4.12.11 to 4.13.0 did not complete due to an unknown error on a bare-metal cluster with FIPS enabled and the SR-IOV Operator installed.
      
      

      Version-Release number of selected component (if applicable):

      4.12.11->4.13.0-rc.2-x86_64
      

      How reproducible:

      Seen once, against a cluster with FIPS enabled and the SR-IOV Operator installed.
      

      Steps to Reproduce:

      1. Upgrade an OCP 4.12.11 / CNV 4.12.3 cluster to 4.13.0-rc.2-x86_64.
      

      Actual results:

      All master nodes updated successfully, but two worker nodes remained cordoned (SchedulingDisabled):
      ================
      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get nodes
      NAME                                             STATUS                     ROLES                  AGE     VERSION
      cnv-qe-infra-29.cnvqe2.lab.eng.rdu2.redhat.com   Ready                      control-plane,master   8h      v1.26.2+dc93b13
      cnv-qe-infra-30.cnvqe2.lab.eng.rdu2.redhat.com   Ready                      control-plane,master   8h      v1.26.2+dc93b13
      cnv-qe-infra-31.cnvqe2.lab.eng.rdu2.redhat.com   Ready                      control-plane,master   8h      v1.26.2+dc93b13
      cnv-qe-infra-32.cnvqe2.lab.eng.rdu2.redhat.com   Ready                      worker                 7h21m   v1.26.2+dc93b13
      cnv-qe-infra-33.cnvqe2.lab.eng.rdu2.redhat.com   Ready,SchedulingDisabled   worker                 7h20m   v1.26.2+dc93b13
      cnv-qe-infra-34.cnvqe2.lab.eng.rdu2.redhat.com   Ready,SchedulingDisabled   worker                 7h20m   v1.26.2+dc93b13
      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$
      ==================
      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-4193256bfd798c06fc09b2787927c3f5   True      False      False      3              3                   3                     0                      8h
      worker   rendered-worker-84bfa5c08e63b044134da899b133c96f   False     False      False      3              1                   3                     0                      8h
      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ 
      ===================
      The worker MachineConfigPool reports this Updating condition:
      ===================
       lastTransitionTime: "2023-04-11T20:36:49Z"
            message: Pool is paused; will not update to rendered-worker-84bfa5c08e63b044134da899b133c96f
            reason: ""
            status: "False"
            type: Updating
      ===================
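      The condition above says the worker pool is paused, which would explain why the rendered config is not rolling out. A possible way to confirm and clear that state, assuming cluster-admin access to this cluster (the pool name "worker" is taken from the output above), is:

      ```shell
      # Check whether the worker pool is paused; prints "true" if so
      oc get mcp worker -o jsonpath='{.spec.paused}{"\n"}'

      # If paused, unpause so the MCO can resume rolling out
      # rendered-worker-84bfa5c08e63b044134da899b133c96f
      oc patch mcp worker --type merge --patch '{"spec":{"paused":false}}'
      ```

      Note this only addresses the symptom; the bug is about what paused the pool in the first place during the upgrade.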
      ClusterOperators report:
      ===================
      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get co
      NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.13.0-rc.2   True        False         False      70m     
      cloud-controller-manager                   4.13.0-rc.2   True        False         False      8h      
      cloud-credential                           4.13.0-rc.2   True        False         False      8h      
      cluster-autoscaler                         4.13.0-rc.2   True        False         False      8h      
      config-operator                            4.13.0-rc.2   True        False         False      8h      
      console                                    4.13.0-rc.2   True        False         False      7h25m   
      control-plane-machine-set                  4.13.0-rc.2   True        False         False      8h      
      csi-snapshot-controller                    4.13.0-rc.2   True        False         False      8h      
      dns                                        4.13.0-rc.2   True        False         False      8h      
      etcd                                       4.13.0-rc.2   True        False         False      8h      
      image-registry                             4.13.0-rc.2   True        False         False      112m    
      ingress                                    4.13.0-rc.2   True        True          True       112m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
      insights                                   4.13.0-rc.2   True        False         False      8h      
      kube-apiserver                             4.13.0-rc.2   True        False         False      7h53m   
      kube-controller-manager                    4.13.0-rc.2   True        False         False      8h      
      kube-scheduler                             4.13.0-rc.2   True        False         False      8h      
      kube-storage-version-migrator              4.13.0-rc.2   True        False         False      174m    
      machine-api                                4.13.0-rc.2   True        False         False      7h30m   
      machine-approver                           4.13.0-rc.2   True        False         False      8h      
      machine-config                             4.13.0-rc.2   True        False         False      138m    
      marketplace                                4.13.0-rc.2   True        False         False      8h      
      monitoring                                 4.13.0-rc.2   False       True          True       99m     reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
      network                                    4.13.0-rc.2   True        False         False      8h      
      node-tuning                                4.13.0-rc.2   True        False         False      3h45m   
      openshift-apiserver                        4.13.0-rc.2   True        False         False      124m    
      openshift-controller-manager               4.13.0-rc.2   True        False         False      8h      
      openshift-samples                          4.13.0-rc.2   True        False         False      3h47m   
      operator-lifecycle-manager                 4.13.0-rc.2   True        False         False      8h      
      operator-lifecycle-manager-catalog         4.13.0-rc.2   True        False         False      8h      
      operator-lifecycle-manager-packageserver   4.13.0-rc.2   True        False         False      7h57m   
      service-ca                                 4.13.0-rc.2   True        False         False      8h      
      storage                                    4.13.0-rc.2   True        False         False      8h      
      [cnv-qe-jenkins@cnv-qe-infra-01 ~]$ 
      ======================
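      The ingress and monitoring operators are both degraded because of unavailable replicas, which is consistent with two cordoned workers having no schedulable capacity. A sketch of follow-up triage commands, assuming access to the same cluster:

      ```shell
      # Locate the unavailable router replica behind the ingress Degraded condition
      oc -n openshift-ingress get pods -o wide

      # Inspect the stuck admission-webhook rollout reported by the monitoring CO
      oc -n openshift-monitoring get pods
      oc -n openshift-monitoring describe deployment prometheus-operator-admission-webhook
      ```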
      

      Expected results:

      Upgrade to complete successfully.
      

      Additional info:

      Must gather is saved here: https://drive.google.com/drive/folders/11agooCxc0fUX9_utLTFonCoembhS-9mY?usp=share_link
      

              Vincent Shen (wenshen@redhat.com)
              Debarati Basu-Nag (rhn-support-dbasunag)
              Sergio Regidor de la Rosa
              Votes: 0
              Watchers: 8