OCPBUGS-29316 (OpenShift Bugs)

OCP upgrade to nightly build failed on provider cluster - OVN-K fails to process annotation on live-migratable VM pods


    • Important
    • No
    • SDN Sprint 249, SDN Sprint 250, SDN Sprint 253
    • 3
    • Rejected
    • False

      This is a clone of issue OCPBUGS-27853. The following is the description of the original issue:

      Description of problem:

      Upgrading OCP from 4.14.7 to a 4.15.0 nightly build failed on the provider cluster, which is part of a provider-client setup.
      Platform: IBM Cloud Bare Metal cluster.
      
      Steps performed:
      
      Step 1:
      
      $ oc patch clusterversions/version -p '{"spec":{"channel":"stable-4.15"}}' --type=merge
      clusterversion.config.openshift.io/version patched
      
      Step 2:
      $ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837 --allow-explicit-upgrade --force
      warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
      warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
      warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
      Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-18-050837
      
      The cluster was not upgraded successfully.
      
       
      $ oc get clusteroperator | grep -v "4.15.0-0.nightly-2024-01-18-050837   True        False         False"
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.15.0-0.nightly-2024-01-18-050837   True        False         True       111s    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
      console                                    4.15.0-0.nightly-2024-01-18-050837   False       False         False      111s    RouteHealthAvailable: console route is not admitted
      dns                                        4.15.0-0.nightly-2024-01-18-050837   True        True          False      12d     DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5.\nHave 5 available node-resolver pods, want 6."
      etcd                                       4.15.0-0.nightly-2024-01-18-050837   True        False         True       12d     EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [{Member:ID:14147288297306253147 name:"baremetal2-06.qe.rh-ocs.com" peerURLs:"https://52.116.161.167:2380" clientURLs:"https://52.116.161.167:2379"  Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://52.116.161.167:2379]: context deadline exceeded} {Member:ID:15369339084089827159 name:"baremetal2-03.qe.rh-ocs.com" peerURLs:"https://52.116.161.164:2380" clientURLs:"https://52.116.161.164:2379"  Healthy:true Took:9.617293ms Error:<nil>} {Member:ID:17481226479420161008 name:"baremetal2-04.qe.rh-ocs.com" peerURLs:"https://52.116.161.165:2380" clientURLs:"https://52.116.161.165:2379"  Healthy:true Took:9.090133ms Error:<nil>}]...
      image-registry                             4.15.0-0.nightly-2024-01-18-050837   True        True          False      12d     Progressing: All registry resources are removed...
      machine-config                             4.14.7                               True        True          True       7d22h   Unable to apply 4.15.0-0.nightly-2024-01-18-050837: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: controller version mismatch for rendered-master-9b7e02d956d965d0906def1426cb03b5 expected eaab8f3562b864ef0cc7758a6b19cc48c6d09ed8 has 7649b9274cde2fb50a61a579e3891c8ead2d79c5: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-34b4781f1a0fe7119765487c383afbb3, retrying]]
      monitoring                                 4.15.0-0.nightly-2024-01-18-050837   False       True          True       7m54s   UpdatingUserWorkloadPrometheus: client rate limiter Wait returned an error: context deadline exceeded, UpdatingUserWorkloadThanosRuler: waiting for ThanosRuler object changes failed: waiting for Thanos Ruler openshift-user-workload-monitoring/user-workload: context deadline exceeded
      network                                    4.15.0-0.nightly-2024-01-18-050837   True        True          False      12d     DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)...
      node-tuning                                4.15.0-0.nightly-2024-01-18-050837   True        True          False      98m     Working towards "4.15.0-0.nightly-2024-01-18-050837"
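
The `grep -v` filter above depends on the exact column spacing of the table. As a minimal sketch, assuming the VERSION column is never empty, the same check can be done on whitespace-separated fields with awk. `unhealthy_operators` is a hypothetical helper name, not an `oc` subcommand.

```shell
# Hypothetical helper, not part of oc: reads the table printed by
# `oc get clusteroperator` on stdin and prints every operator that is not
# at the target version or not Available=True/Progressing=False/Degraded=False.
# Assumes the VERSION column is never empty, otherwise the fields shift.
unhealthy_operators() {
  awk -v target="$1" '
    NR == 1 { next }   # skip the header row
    $2 != target || $3 != "True" || $4 != "False" || $5 != "False" { print $1, $2, $3, $4, $5 }
  '
}

# Example (run against a live cluster):
#   oc get clusteroperator | unhealthy_operators 4.15.0-0.nightly-2024-01-18-050837
```

The MESSAGE column is dropped on purpose so that the long etcd and machine-config messages do not hide the status columns.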
      
      
      $ oc get machineconfigpool
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-9b7e02d956d965d0906def1426cb03b5   False     True       True       3              0                   0                     1                      12d
      worker   rendered-worker-4f54b43e9f934f0659761929f55201a1   False     True       True       3              1                   1                     1                      12d
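
To compare pool progress between runs, the table above can be condensed to one line per pool. This is a sketch only: `mcp_progress` is a hypothetical helper, and the field positions assume the default column layout of `oc get machineconfigpool` shown above.

```shell
# Hypothetical helper: reads `oc get machineconfigpool` table output on stdin.
# Field positions follow the default columns shown in this report:
#   $1 NAME, $6 MACHINECOUNT, $8 UPDATEDMACHINECOUNT, $9 DEGRADEDMACHINECOUNT
mcp_progress() {
  awk 'NR > 1 && NF { printf "%s: %d/%d updated, %d degraded\n", $1, $8, $6, $9 }'
}

# Example (run against a live cluster):
#   oc get machineconfigpool | mcp_progress
```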
      
      
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.7    True        True          120m    Unable to apply 4.15.0-0.nightly-2024-01-18-050837: an unknown error has occurred: MultipleErrors
      
      
      $ oc get nodes
      NAME                          STATUS                     ROLES                         AGE   VERSION
      baremetal2-01.qe.rh-ocs.com   Ready                      worker                        12d   v1.27.8+4fab27b
      baremetal2-02.qe.rh-ocs.com   Ready                      worker                        12d   v1.27.8+4fab27b
      baremetal2-03.qe.rh-ocs.com   Ready                      control-plane,master,worker   12d   v1.27.8+4fab27b
      baremetal2-04.qe.rh-ocs.com   Ready                      control-plane,master,worker   12d   v1.27.8+4fab27b
      baremetal2-05.qe.rh-ocs.com   Ready                      worker                        12d   v1.28.5+c84a6b8
      baremetal2-06.qe.rh-ocs.com   Ready,SchedulingDisabled   control-plane,master,worker   12d   v1.27.8+4fab27b
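
The node list shows a single node on the newer kubelet. A minimal sketch for counting nodes per kubelet version, assuming VERSION is the last column of `oc get nodes` as in the output above (`kubelet_versions` is a hypothetical helper name):

```shell
# Hypothetical helper: reads `oc get nodes` table output on stdin and prints
# "<count> <kubelet version>" for each distinct value in the last column.
kubelet_versions() {
  awk 'NR > 1 && NF { count[$NF]++ } END { for (v in count) print count[v], v }' | sort -k2
}

# Example (run against a live cluster):
#   oc get nodes | kubelet_versions
```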
      ----------------------------------------------------
      
      While trying to bring the cluster back to a healthy state, the following steps were taken:
      The node baremetal2-06.qe.rh-ocs.com was uncordoned.
      
      An upgrade to a later nightly build was then attempted with this command:
      
      $ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500 --allow-explicit-upgrade --force --allow-upgrade-with-warnings=true
      warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
      warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
      warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
      warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading:  Reason: ClusterOperatorsDegraded
        Message: Unable to apply 4.15.0-0.nightly-2024-01-18-050837: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
      Requesting update to release image registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2024-01-22-051500
      
      
      The upgrade to 4.15.0-0.nightly-2024-01-22-051500 was also unsuccessful.
      Node baremetal2-01.qe.rh-ocs.com was then drained manually to see whether that would help.
      
      Some cluster operators stayed on the previous version; others moved to the Degraded state.
      
      $ oc get machineconfigpool
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-9b7e02d956d965d0906def1426cb03b5   False     True       False      3              1                   1                     0                      13d
      worker   rendered-worker-4f54b43e9f934f0659761929f55201a1   False     True       True       3              1                   1                     1                      13d
      
      
      $ oc get pdb -n openshift-storage
      NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
      rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     11d
      rook-ceph-mon-pdb                                 N/A             1                 1                     11d
      rook-ceph-osd                                     N/A             1                 1                     3h17m
      
      
      $ oc rsh rook-ceph-tools-57fd4d4d68-p2psh ceph osd tree
      ID  CLASS  WEIGHT   TYPE NAME                             STATUS  REWEIGHT  PRI-AFF
      -1         5.23672  root default                                                   
      -5         1.74557      host baremetal2-01-qe-rh-ocs-com                           
       1    ssd  0.87279          osd.1                             up   1.00000  1.00000
       4    ssd  0.87279          osd.4                             up   1.00000  1.00000
      -7         1.74557      host baremetal2-02-qe-rh-ocs-com                           
       3    ssd  0.87279          osd.3                             up   1.00000  1.00000
       5    ssd  0.87279          osd.5                             up   1.00000  1.00000
      -3         1.74557      host baremetal2-05-qe-rh-ocs-com                           
       0    ssd  0.87279          osd.0                             up   1.00000  1.00000
       2    ssd  0.87279          osd.2                             up   1.00000  1.00000
      
      
      OCP must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/hcp414-aaa/hcp414-aaa_20240112T084548/logs/must-gather-ibm-bm2-provider/must-gather.local.1079362865726528648/

       

      Version-Release number of selected component (if applicable):

      Initial version:
      OCP 4.14.7
      ODF 4.14.4-5.fusion-hci
      OpenShift Virtualization: kubevirt-hyperconverged-operator.4.16.0-380
      Local Storage: local-storage-operator.v4.14.0-202312132033
      OpenShift Data Foundation Client : ocs-client-operator.v4.14.4-5.fusion-hci

      How reproducible:

      This is the first occurrence of the issue.

      Steps to Reproduce:

          1. On a provider-client HCI setup, upgrade the provider cluster to a nightly build of OCP.
          

      Actual results:

          The OCP upgrade was not successful. Some operators became degraded. The worker machineconfigpool reports a degraded machine count of 1.

      Expected results:

      The OCP upgrade from 4.14.7 to the nightly build should succeed.

      Additional info:

          There are 3 hosted clients present

              jcaamano@redhat.com Jaime Caamaño Ruiz
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhanqi Zhao Zhanqi Zhao