OCPBUGS-22484

4.12.41 to 4.13.19 upgrade is blocked by machine-config

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.13.z
    • RHCOS
    • Critical
    • No
    • Rejected
    • False

      Description of problem:

      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/44510/consoleFull

      profile: 02_UPI_on_Baremetal-packet_OVN-dual-stack_Disk-encryption_Disk-mirroring_Etcd-encryption

      The intended upgrade path was 4.12.41-x86_64 -> 4.13.19-x86_64,4.14.0-x86_64.

      The first hop, 4.12.41 to 4.13.19, failed:

      $ oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.13.19   True        False         True       172m    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
      baremetal                                  4.13.19   True        False         False      4h55m   
      cloud-controller-manager                   4.13.19   True        False         False      4h58m   
      cloud-credential                           4.13.19   True        False         False      4h59m   
      cluster-autoscaler                         4.13.19   True        False         False      4h54m   
      config-operator                            4.13.19   True        False         False      4h55m   
      console                                    4.13.19   True        False         False      4h43m   
      control-plane-machine-set                  4.13.19   True        False         False      4h55m   
      csi-snapshot-controller                    4.13.19   True        False         False      4h55m   
      dns                                        4.13.19   True        True          False      4h54m   DNS "default" reports Progressing=True: "Have 4 available node-resolver pods, want 6."
      etcd                                       4.13.19   True        False         True       4h53m   ClusterMemberControllerDegraded: unhealthy members found during reconciling members...
      image-registry                             4.13.19   True        True          False      175m    Progressing: The registry is ready...
      ingress                                    4.13.19   True        False         False      4h44m   
      insights                                   4.13.19   True        False         False      4h49m   
      kube-apiserver                             4.13.19   True        False         True       4h50m   NodeControllerDegraded: The master nodes not ready: node "master-00.juzhao-44510.qe.devcluster.openshift.com" not ready since 2023-10-27 11:13:46 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-controller-manager                    4.13.19   True        False         True       4h52m   NodeControllerDegraded: The master nodes not ready: node "master-00.juzhao-44510.qe.devcluster.openshift.com" not ready since 2023-10-27 11:13:46 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-scheduler                             4.13.19   True        False         True       4h52m   NodeControllerDegraded: The master nodes not ready: node "master-00.juzhao-44510.qe.devcluster.openshift.com" not ready since 2023-10-27 11:13:46 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
      kube-storage-version-migrator              4.13.19   True        False         False      89m     
      machine-api                                4.13.19   True        False         False      4h55m   
      machine-approver                           4.13.19   True        False         False      4h54m   
      machine-config                             4.12.41   False       True          True       69m     Cluster not available for [{operator 4.12.41}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2)]
      marketplace                                4.13.19   True        False         False      4h54m   
      monitoring                                 4.13.19   False       True          True       66m     reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: Too many daemonset pods are unavailable (2 > 1 max unavailable).
      network                                    4.13.19   True        True          False      4h54m   DaemonSet "/openshift-multus/multus" is not available (awaiting 2 nodes)...
      node-tuning                                4.13.19   True        True          False      121m    Working towards "4.13.19"
      openshift-apiserver                        4.13.19   True        False         True       4h48m   APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
      openshift-controller-manager               4.13.19   True        False         False      4h51m   
      openshift-samples                          4.13.19   True        False         False      121m    
      operator-lifecycle-manager                 4.13.19   True        False         False      4h54m   
      operator-lifecycle-manager-catalog         4.13.19   True        False         False      4h54m   
      operator-lifecycle-manager-packageserver   4.13.19   True        False         False      4h48m   
      service-ca                                 4.13.19   True        False         False      4h55m   
      storage                                    4.13.19   True        False         False      4h55m   
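
To make the table above easier to triage, here is a minimal Python sketch that parses `oc get co` text output and lists the operators that are unavailable, degraded, or not yet at the target version. The function names `parse_cluster_operators` and `needs_attention` are hypothetical helpers, not part of `oc`; the sample rows are copied from the output above with the MESSAGE text shortened.

```python
# Sample rows taken from the `oc get co` output in this report (MESSAGE shortened).
SAMPLE_CO = """\
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.13.19   True        False         True       172m    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable
baremetal        4.13.19   True        False         False      4h55m
machine-config   4.12.41   False       True          True       69m     Cluster not available for [{operator 4.12.41}]
monitoring       4.13.19   False       True          True       66m     reconciling node-exporter DaemonSet failed
"""


def parse_cluster_operators(text):
    """Turn the plain-text `oc get co` table into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        # MESSAGE may contain spaces, so split into at most 7 fields.
        parts = line.split(None, 6)
        if len(parts) < 6:
            continue  # not a data row
        rows.append({
            "name": parts[0],
            "version": parts[1],
            "available": parts[2] == "True",
            "progressing": parts[3] == "True",
            "degraded": parts[4] == "True",
            "since": parts[5],
            "message": parts[6].strip() if len(parts) > 6 else "",
        })
    return rows


def needs_attention(rows, target_version):
    """Names of operators that are unavailable, degraded, or not at target_version."""
    return [
        r["name"] for r in rows
        if not r["available"] or r["degraded"] or r["version"] != target_version
    ]


if __name__ == "__main__":
    for name in needs_attention(parse_cluster_operators(SAMPLE_CO), "4.13.19"):
        print(name)
```

On the full table, machine-config is the only operator still reporting 4.12.41.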
      

      The machine config pools (MCPs) are also stuck in Updating:

      $ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-4abc568f4584b75998ea779c818f03c4   False     True       False      3              0                   0                     0                      4h56m
      worker   rendered-worker-b9f20eedfe223ba9bb8cd926ea93db77   False     True       False      3              1                   1                     0                      4h56m
      $ oc get node
      NAME                                                 STATUS                        ROLES                  AGE     VERSION
      master-00.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   control-plane,master   4h59m   v1.25.14+31e0558
      master-01.juzhao-44510.qe.devcluster.openshift.com   Ready                         control-plane,master   4h58m   v1.25.14+31e0558
      master-02.juzhao-44510.qe.devcluster.openshift.com   Ready                         control-plane,master   4h58m   v1.25.14+31e0558
      worker-00.juzhao-44510.qe.devcluster.openshift.com   Ready                         worker                 4h44m   v1.25.14+31e0558
      worker-01.juzhao-44510.qe.devcluster.openshift.com   Ready                         worker                 4h46m   v1.26.9+636f2be
      worker-02.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   worker                 4h45m   v1.25.14+31e0558
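
The node list above can be summarized with a short Python sketch that flags NotReady nodes and groups nodes by kubelet version. The helper names (`parse_nodes`, `not_ready`, `by_kubelet_version`) are hypothetical; the input is the `oc get node` text exactly as pasted in this report.

```python
# Node table copied from the `oc get node` output in this report.
SAMPLE_NODES = """\
NAME                                                 STATUS                        ROLES                  AGE     VERSION
master-00.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   control-plane,master   4h59m   v1.25.14+31e0558
master-01.juzhao-44510.qe.devcluster.openshift.com   Ready                         control-plane,master   4h58m   v1.25.14+31e0558
master-02.juzhao-44510.qe.devcluster.openshift.com   Ready                         control-plane,master   4h58m   v1.25.14+31e0558
worker-00.juzhao-44510.qe.devcluster.openshift.com   Ready                         worker                 4h44m   v1.25.14+31e0558
worker-01.juzhao-44510.qe.devcluster.openshift.com   Ready                         worker                 4h46m   v1.26.9+636f2be
worker-02.juzhao-44510.qe.devcluster.openshift.com   NotReady,SchedulingDisabled   worker                 4h45m   v1.25.14+31e0558
"""


def parse_nodes(text):
    """Parse the five-column `oc get node` table into a list of dicts."""
    nodes = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) != 5:
            continue  # unexpected row shape; ignore it
        name, status, roles, _age, version = parts
        nodes.append({
            "name": name,
            "conditions": status.split(","),  # e.g. ["NotReady", "SchedulingDisabled"]
            "roles": roles.split(","),
            "version": version,
        })
    return nodes


def not_ready(nodes):
    """Names of nodes whose STATUS column includes NotReady."""
    return [n["name"] for n in nodes if "NotReady" in n["conditions"]]


def by_kubelet_version(nodes):
    """Map each kubelet version to the list of node names running it."""
    groups = {}
    for n in nodes:
        groups.setdefault(n["version"], []).append(n["name"])
    return groups


if __name__ == "__main__":
    nodes = parse_nodes(SAMPLE_NODES)
    print("NotReady:", not_ready(nodes))
    for version, names in by_kubelet_version(nodes).items():
        print(version, len(names))
```

Only worker-01 runs a v1.26 kubelet, which matches UPDATEDMACHINECOUNT 1 for the worker pool. The two NotReady nodes (master-00 and worker-02) are consistent with the "unavailable: 2" count in the machine-config-daemon daemonset message.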
      

      clusterversion info

      $ oc get clusterversion version -oyaml
      apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      metadata:
        creationTimestamp: "2023-10-27T07:39:05Z"
        generation: 3
        name: version
        resourceVersion: "191974"
        uid: 6b3d8f94-8948-4756-a0f9-0b9fb39bb91d
      spec:
        channel: stable-4.12
        clusterID: 0b4919c9-3b65-4c77-9e0c-cea5933494ca
        desiredUpdate:
          force: true
          image: quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a
          version: ""
      status:
        availableUpdates: null
        capabilities:
          enabledCapabilities:
          - CSISnapshot
          - Console
          - Insights
          - NodeTuning
          - Storage
          - baremetal
          - marketplace
          - openshift-samples
          knownCapabilities:
          - CSISnapshot
          - Console
          - Insights
          - NodeTuning
          - Storage
          - baremetal
          - marketplace
          - openshift-samples
        conditions:
        - lastTransitionTime: "2023-10-27T07:39:07Z"
          message: 'Unable to retrieve available updates: currently reconciling cluster
            version 4.13.19 not found in the "stable-4.12" channel'
          reason: VersionNotFound
          status: "False"
          type: RetrievedUpdates
        - lastTransitionTime: "2023-10-27T10:10:34Z"
          message: Capabilities match configured spec
          reason: AsExpected
          status: "False"
          type: ImplicitlyEnabledCapabilities
        - lastTransitionTime: "2023-10-27T07:39:07Z"
          message: Payload loaded version="4.13.19" image="quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a"
            architecture="amd64"
          reason: PayloadLoaded
          status: "True"
          type: ReleaseAccepted
        - lastTransitionTime: "2023-10-27T08:03:40Z"
          message: Done applying 4.12.41
          status: "True"
          type: Available
        - lastTransitionTime: "2023-10-27T11:15:58Z"
          message: Cluster operators etcd, kube-apiserver are degraded
          reason: ClusterOperatorsDegraded
          status: "True"
          type: Failing
        - lastTransitionTime: "2023-10-27T10:10:20Z"
          message: 'Unable to apply 4.13.19: wait has exceeded 40 minutes for these operators:
            etcd, kube-apiserver'
          reason: ClusterOperatorsDegraded
          status: "True"
          type: Progressing
        - lastTransitionTime: "2023-10-27T11:14:04Z"
          message: 'Cluster operator machine-config should not be upgraded between minor
            versions: One or more machine config pools are updating, please see `oc get
            mcp` for further details'
          reason: PoolUpdating
          status: "False"
          type: Upgradeable
        desired:
          image: quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a
          url: https://access.redhat.com/errata/RHSA-2023:6130
          version: 4.13.19
        history:
        - acceptedRisks: |-
            Forced through blocking failures: Multiple precondition checks failed:
            * Precondition "EtcdRecentBackup" failed because of "ControllerStarted": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
            * Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.12.41 to 4.13.19 is unknown.
          completionTime: null
          image: quay.io/openshift-release-dev/ocp-release@sha256:f8ba6f54eae419aba17926417d950ae18e06021beae9d7947a8b8243ad48353a
          startedTime: "2023-10-27T10:10:20Z"
          state: Partial
          verified: true
          version: 4.13.19
        - completionTime: "2023-10-27T08:03:40Z"
          image: quay.io/openshift-release-dev/ocp-release@sha256:59c93fdfff4ecca2ca6d6bb0ec722bca2bb08152252ae10ce486a9fc80c82dcf
          startedTime: "2023-10-27T07:39:07Z"
          state: Completed
          verified: false
          version: 4.12.41
        observedGeneration: 3
        versionHash: K84dirQ2oDM= 
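
To pick the problem conditions out of a ClusterVersion object like the one above, a small Python sketch can compare each condition against an expected status. It assumes the object has already been loaded into a dict, for example by passing the output of `oc get clusterversion version -o json` to `json.loads`. The `EXPECTED` map and the function name `unexpected_conditions` are illustrative assumptions, not an official health definition. Progressing is left out because it is normally True during an upgrade.

```python
# Assumed "healthy" status per condition type (illustrative, not an official definition).
EXPECTED = {
    "Available": "True",
    "Failing": "False",
    "Upgradeable": "True",
    "RetrievedUpdates": "True",
    "ReleaseAccepted": "True",
}

# Subset of the conditions shown in this report, with folded messages joined.
SAMPLE_CV = {
    "status": {
        "conditions": [
            {"type": "RetrievedUpdates", "status": "False", "reason": "VersionNotFound",
             "message": 'Unable to retrieve available updates: currently reconciling '
                        'cluster version 4.13.19 not found in the "stable-4.12" channel'},
            {"type": "Available", "status": "True",
             "message": "Done applying 4.12.41"},
            {"type": "Failing", "status": "True", "reason": "ClusterOperatorsDegraded",
             "message": "Cluster operators etcd, kube-apiserver are degraded"},
            {"type": "Progressing", "status": "True", "reason": "ClusterOperatorsDegraded",
             "message": "Unable to apply 4.13.19: wait has exceeded 40 minutes for "
                        "these operators: etcd, kube-apiserver"},
            {"type": "Upgradeable", "status": "False", "reason": "PoolUpdating",
             "message": "Cluster operator machine-config should not be upgraded "
                        "between minor versions: One or more machine config pools "
                        "are updating, please see `oc get mcp` for further details"},
        ]
    }
}


def unexpected_conditions(cluster_version):
    """Return (type, reason, message) for conditions whose status differs from EXPECTED."""
    found = []
    for cond in cluster_version.get("status", {}).get("conditions", []):
        want = EXPECTED.get(cond["type"])
        if want is not None and cond["status"] != want:
            found.append((cond["type"], cond.get("reason", ""), cond.get("message", "")))
    return found


if __name__ == "__main__":
    for ctype, reason, message in unexpected_conditions(SAMPLE_CV):
        print(f"{ctype} ({reason}): {message}")
```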

      The same issue also appears in a Prow job. On https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.19 , click the "Failed" link for the 4.12.41 to 4.13.19 upgrade and check its logs:

      https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1717511870073540608/build-log.txt

      Oct 26 14:31:54.647: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
      Oct 26 14:32:04.645: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
      Oct 26 14:32:14.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
      Oct 26 14:32:24.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
      Oct 26 14:32:24.832: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false) 
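
To estimate how long a pool has been reporting Updating in a build log like the excerpt above, the lines can be parsed with a regular expression. This is a minimal Python sketch: the helper names are hypothetical, and the year is an assumption (the log lines omit it, so 2023 is taken from the timestamps elsewhere in this report). Month abbreviations are parsed with `%b`, which assumes an English locale.

```python
import re
from datetime import datetime

# Log excerpt copied from the Prow build log quoted in this report.
SAMPLE_LOG = """\
Oct 26 14:31:54.647: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:04.645: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:14.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:24.646: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
Oct 26 14:32:24.832: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
"""

POOL_LINE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{1,6}): INFO: "
    r"Pool (?P<pool>\S+) is still reporting "
    r"\(Updated: (?P<updated>\w+), Updating: (?P<updating>\w+), Degraded: (?P<degraded>\w+)\)"
)


def parse_pool_lines(log_text, year=2023):
    """Return (timestamp, pool, updating) tuples for every matching log line.

    The year is not present in the log, so it is supplied by the caller (assumed 2023).
    """
    entries = []
    for line in log_text.splitlines():
        m = POOL_LINE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        entries.append((ts, m["pool"], m["updating"] == "true"))
    return entries


def updating_span_seconds(entries, pool):
    """Seconds between the first and last 'Updating: true' line seen for a pool."""
    times = [ts for ts, name, updating in entries if name == pool and updating]
    if not times:
        return 0.0
    return (max(times) - min(times)).total_seconds()


if __name__ == "__main__":
    entries = parse_pool_lines(SAMPLE_LOG)
    print(len(entries), "matching lines")
    print(updating_span_seconds(entries, "worker"), "seconds covered for pool worker")
```

The excerpt covers only about 30 seconds; run against the full build log, the same function would show the total time the worker pool stayed in Updating.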

      Version-Release number of selected component (if applicable):

      4.12.41 to 4.13.19 upgrade

      How reproducible:

      not always

      Steps to Reproduce:

      1. Upgrade a cluster from 4.12.41 to 4.13.19.

      Actual results:

      The upgrade fails.

      Expected results:

      The upgrade should succeed.

      Additional info:

      A related upgrade run succeeded: 4.11.52-x86_64 -> 4.12.41-x86_64,4.13.19-x86_64,4.14.0-x86_64
      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/44490/
      
      It seems the issue does not always happen.

            Unassigned
            Junqi Zhao (juzhao@redhat.com)
            Sergio Regidor de la Rosa