Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-33065

[CNF-IBU]: Upgrade stuck at "InProgress" state before pivot for long time

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done-Errata
    • Icon: Undefined Undefined
    • None
    • 4.16.0
    • LCA operator
    • None
    • Important
    • No
    • CNF Ran Sprint 253
    • 1
    • Rejected
    • False
    • Hide

      None

      Show
      None

      Description of problem:

      1. Upgrade stuck at the state "InProgress" before pivot for long time (even "Starting handleUpgrade" not kicked in) 
      2. No event/transaction written or observed in LCA operator pod container "manager" logs.

      Version-Release number of selected component (if applicable):

      TALM bundle: 4.16.0-23
      Seed SNO Cluster: OCP 4.16.0-0.nightly-2024-04-16-195622
      Target SNO cluster before upgrade : OCP 4.14.23
      Target Hub cluster: OCP 4.16.0-0.nightly-2024-04-16-195622
      ACM: 2.10.2
      GitOps: 1.12.0
      LCA operator bundle container version: 4.16.0-43
      LCA Operator version: 4.16.0-36 
      Recert image version: 4.16.0-10

      How reproducible:

          Observed once in 3 times. Intermittent issue.

      Steps to Reproduce:

      [1] Make sure all policies reporting  compliant except site specific config policy (which has destination [new labels] extra manifests) on target sno cluster.
      [2] Apply IBU Prep CGU via TALM from target hub cluster.
      [3] Make sure IBU CR reporting Prep stage completed successfully.
      [4] Import stale backucp CR (acm-klusterlet) with different cluster id label other than actual target sno cluster id 
      [5] Apply IBU upgrade policy to trigger upgrade from target hub cluster. {{ Sample snippet logs }}
      [kni@registry.kni-qe-53 ibu-cgu]$ cat upgrade-cgu.yaml 
      apiVersion: ran.openshift.io/v1alpha1
      kind: ClusterGroupUpgrade
      metadata:
        name: cgu-ibu-upgrade
        namespace: default
      spec:
        actions:
          beforeEnable:
            addClusterAnnotations:
              import.open-cluster-management.io/disable-auto-import: "true"
          afterCompletion:
            removeClusterAnnotations:
            - import.open-cluster-management.io/disable-auto-import
        clusters: 
        - helix57
        enable: true
        managedPolicies: 
        - group-ibu-upgrade-policy
        remediationStrategy: 
          canaries: 
            - helix57
          maxConcurrency: 1 
          timeout: 240
      [kni@registry.kni-qe-53 ibu-cgu]$ oc -n default get cgu 
      NAME               AGE   STATE        DETAILS
      cgu-ibu-finalize   54m   Completed    All clusters are compliant with all the managed policies
      cgu-ibu-prep       53m   Completed    All clusters are compliant with all the managed policies
      cgu-ibu-upgrade    46m   InProgress   Remediating non-compliant policies
      [kni@registry.kni-qe-53 ibu-cgu]$ 
      =============================
      [6] Observed upgrade stuck at "InProgress" with details "Backing up Application Data" before pivot for long time (more than 40 mins)     

      Actual results:

      kni@registry.kni-qe-53 ~]$ oc get ibu
      NAME      AGE   DESIRED STAGE   STATE        DETAILS
      upgrade   45m   Upgrade         InProgress   Backing up Application Data
      [kni@registry.kni-qe-53 ~]$ oc get ibu upgrade -o yaml
      apiVersion: lca.openshift.io/v1alpha1
      kind: ImageBasedUpgrade
      metadata:
        annotations:
          auto-rollback-on-failure.lca.openshift.io/init-monitor: Disabled
        creationTimestamp: "2024-04-27T23:01:03Z"
        generation: 4
        name: upgrade
        resourceVersion: "249028"
        uid: 1a437a64-547d-4b08-8e0a-69809bf16698
      spec:
        autoRollbackOnFailure: {}
        oadpContent:
        - name: oadp-cm
          namespace: openshift-adp
        seedImageRef:
          image: quay.io/rh_ee_pmohanra/ibu-seed-sno0:4.16.0-0.nightly-2024-04-16-195622
          pullSecretRef:
            name: seed-pull-secret
          version: 4.16.0-0.nightly-2024-04-16-195622
        stage: Upgrade
      status:
        conditions:
        - lastTransitionTime: "2024-04-27T23:11:41Z"
          message: In progress
          observedGeneration: 3
          reason: InProgress
          status: "False"
          type: Idle
        - lastTransitionTime: "2024-04-27T23:12:58Z"
          message: Prep completed
          observedGeneration: 3
          reason: Completed
          status: "False"
          type: PrepInProgress
        - lastTransitionTime: "2024-04-27T23:12:58Z"
          message: Prep completed successfully
          observedGeneration: 3
          reason: Completed
          status: "True"
          type: PrepCompleted
        - lastTransitionTime: "2024-04-27T23:18:49Z"
          message: Backing up Application Data
          observedGeneration: 4
          reason: InProgress
          status: "True"
          type: UpgradeInProgress
        observedGeneration: 4
        validNextStages:
        - Idle
      [kni@registry.kni-qe-53 ~]$     

      Expected results:

      [1] Upgrade should not stuck at "InProgress" state before pivot for long time.
      
      [2] We should have a mechanism like lca-init-monitor.service that runs post-reboot with a configurable timeout to monitor upgrade events before pivot

      Additional info:

        /SUT logs before the test //
      [kni@registry.kni-qe-53 ~]$ oc -n openshift-adp get backups
      NAME             AGE
      acm-klusterlet   39s
      [kni@registry.kni-qe-53 ~]$ oc -n openshift-adp get restores.velero.io 
      No resources found in openshift-adp namespace.
      [kni@registry.kni-qe-53 ~]$ oc -n openshift-adp get backups acm-klusterlet -o yaml | grep -i clusterid
          config.openshift.io/clusterID: 6b0b914f-bb0a-4f27-914c-e3dce9a67e91
      [kni@registry.kni-qe-53 ~]$  
      [kni@registry.kni-qe-53 ~]$ oc get clusterversions.config.openshift.io -o yaml | grep -i clusterid
          clusterID: 8b2e5d4f-dc16-478b-ab27-72124fe09f6d
      [kni@registry.kni-qe-53 ~]$ oc get clusterversions.config.openshift.io
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.23   True        False         23h     Cluster version is 4.14.23
      [kni@registry.kni-qe-53 ~]$ oc get no
      NAME                              STATUS   ROLES                         AGE   VERSION
      helix57.lab.eng.rdu2.redhat.com   Ready    control-plane,master,worker   23h   v1.27.13+401bb48
      [kni@registry.kni-qe-53 ~]$ 
      //SUT logs after the test //
      [kni@registry.kni-qe-53 ~]$ oc get ibu
      NAME      AGE   DESIRED STAGE   STATE        DETAILS
      upgrade   45m   Upgrade         InProgress   Backing up Application Data
      [kni@registry.kni-qe-53 ~]$ oc get ibu upgrade -o yaml
      apiVersion: lca.openshift.io/v1alpha1
      kind: ImageBasedUpgrade
      metadata:
        annotations:
          auto-rollback-on-failure.lca.openshift.io/init-monitor: Disabled
        creationTimestamp: "2024-04-27T23:01:03Z"
        generation: 4
        name: upgrade
        resourceVersion: "249028"
        uid: 1a437a64-547d-4b08-8e0a-69809bf16698
      spec:
        autoRollbackOnFailure: {}
        oadpContent:
        - name: oadp-cm
          namespace: openshift-adp
        seedImageRef:
          image: quay.io/rh_ee_pmohanra/ibu-seed-sno0:4.16.0-0.nightly-2024-04-16-195622
          pullSecretRef:
            name: seed-pull-secret
          version: 4.16.0-0.nightly-2024-04-16-195622
        stage: Upgrade
      status:
        conditions:
        - lastTransitionTime: "2024-04-27T23:11:41Z"
          message: In progress
          observedGeneration: 3
          reason: InProgress
          status: "False"
          type: Idle
        - lastTransitionTime: "2024-04-27T23:12:58Z"
          message: Prep completed
          observedGeneration: 3
          reason: Completed
          status: "False"
          type: PrepInProgress
        - lastTransitionTime: "2024-04-27T23:12:58Z"
          message: Prep completed successfully
          observedGeneration: 3
          reason: Completed
          status: "True"
          type: PrepCompleted
        - lastTransitionTime: "2024-04-27T23:18:49Z"
          message: Backing up Application Data
          observedGeneration: 4
          reason: InProgress
          status: "True"
          type: UpgradeInProgress
        observedGeneration: 4
        validNextStages:
        - Idle
      [kni@registry.kni-qe-53 ~]$ 
      [kni@registry.kni-qe-53 ~]$ oc get policy -A
      NAMESPACE   NAME                                          REMEDIATION ACTION   COMPLIANCE STATE   AGE
      helix57     ztp-common.common-config-policy               inform               Compliant          50m
      helix57     ztp-common.common-lca-subscriptions-policy    inform               Compliant          50m
      helix57     ztp-common.common-oadp-subscriptions-policy   inform               Compliant          50m
      helix57     ztp-common.common-subscriptions-policy        inform               Compliant          50m
      helix57     ztp-common.oadp-cm-common-policies                                 Compliant          50m
      helix57     ztp-group.group-du-sno-config-policy          inform               NonCompliant       50m
      helix57     ztp-group.group-ibu-finalize-policy           inform               NonCompliant       156m
      helix57     ztp-group.group-ibu-prep-policy               inform               NonCompliant       156m
      helix57     ztp-group.group-ibu-rollback-policy           inform               NonCompliant       156m
      helix57     ztp-group.group-ibu-upgrade-policy            enforce              NonCompliant       156m
      helix57     ztp-site.helix57-config-policy                inform               NonCompliant       50m
      helix57     ztp-site.helix57-oadp-config-policy           inform               Compliant          50m
      [kni@registry.kni-qe-53 ~]$ 
      [kni@registry.kni-qe-53 ~]$ oc -n openshift-adp get backups 
      No resources found in openshift-adp namespace.
      [kni@registry.kni-qe-53 ~]$ 
      [root@helix57 ~]# oc get ibu
      NAME      AGE   DESIRED STAGE   STATE        DETAILS
      upgrade   31m   Upgrade         InProgress   Backing up Application Data
      [root@helix57 ~]# oc get ibu
      NAME      AGE   DESIRED STAGE   STATE        DETAILS
      upgrade   37m   Upgrade         InProgress   Backing up Application Data
      [root@helix57 ~]# 
      
      =============Complete LCA operator logs during the test=========================
      024-04-27T23:12:58Z    INFO    controllers.ImageBasedUpgrade    Prep stage completed successfully!
      2024-04-27T23:12:58Z    INFO    controllers.ImageBasedUpgrade    Done handlePrep
      2024-04-27T23:12:58Z    INFO    controllers.ImageBasedUpgrade    Finish reconciling IBU    {"name": {"name":"upgrade"}, "requeueRightAway": false}
      2024-04-27T23:12:58Z    INFO    controllers.ImageBasedUpgrade    Start reconciling IBU    {"name": {"name":"upgrade"}}
      2024-04-27T23:12:58Z    INFO    controllers.ImageBasedUpgrade    Loaded IBU    {"name": {"name":"upgrade"}, "version": "245549", "desired stage": "Prep"}
      INFO[0715] Executing /usr/bin/env with args [-- rpm-ostree status --json] 
      2024-04-27T23:12:58Z    INFO    controllers.ImageBasedUpgrade    Finish reconciling IBU    {"name": {"name":"upgrade"}, "requeueRightAway": false}
      2024-04-27T23:13:11Z    INFO    controllers.ImageBasedUpgrade    Start reconciling IBU    {"name": {"name":"upgrade"}}
      2024-04-27T23:13:11Z    INFO    controllers.ImageBasedUpgrade    Loaded IBU    {"name": {"name":"upgrade"}, "version": "245549", "desired stage": "Prep"}
      INFO[0728] Executing /usr/bin/env with args [-- rpm-ostree status --json] 
      2024-04-27T23:13:11Z    INFO    controllers.ImageBasedUpgrade    Finish reconciling IBU    {"name": {"name":"upgrade"}, "requeueRightAway": false}
      ===============================
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade    Start reconciling IBU    {"name": {"name":"upgrade"}}
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade    Loaded IBU    {"name": {"name":"upgrade"}, "version": "246889", "desired stage": "Upgrade"}
      INFO[1066] Executing /usr/bin/env with args [-- rpm-ostree status --json] 
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade    Starting handleUpgrade
      INFO[1066] Executing /usr/bin/env with args [-- rpm-ostree status --json] 
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade    stateroots    {"current stateroot:": "rhcos", "desired stateroot": "rhcos_4.16.0_0.nightly_2024_04_16_195622"}
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade    Running PrePivot handler
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    Running health check for Upgrade (pre-pivot)
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    All cluster operators are now ready
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    MachineConfigPool ready
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    Node is ready
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    Skipping check of ClusterServiceVersion/lifecycle-agent.v4.16.0
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    All CSVs are ready
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    Cluster version is ready
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    All CSRs are approved
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    SriovNetworkNodeState is ready
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    DataProtectionApplication dataprotectionapplication is reconciled
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    All BackupStorageLocations are available
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    Health checks done
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    Handling backups with OADP operator
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.BackupRestore    No PersistentVolumes found in the cluster, skipping.
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.UpgradeHandler    Processing backup    {"groupIndex": 1, "totalGroups": 1}
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade.BackupRestore    Found stale Backup, DeleteBackupRequest has been sent    {"backup": "acm-klusterlet"}
      2024-04-27T23:18:49Z    INFO    controllers.ImageBasedUpgrade    Finish reconciling IBU    {"name": {"name":"upgrade"}, "requeueRightAway": false}
      ============================

            angwang@redhat.com Angie Wang
            rh-ee-pmohanra Periyamaruthu Mohanraj
            Periyamaruthu Mohanraj Periyamaruthu Mohanraj
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: