-
Bug
-
Resolution: Done-Errata
-
Undefined
-
None
-
4.16.0
-
None
-
Important
-
No
-
CNF Ran Sprint 253
-
1
-
Rejected
-
False
-
-
Description of problem:
1. Upgrade stuck at the state "InProgress" before pivot for long time (even "Starting handleUpgrade" not kicked in) 2. No event/transaction written or observed in LCA operator pod container "manager" logs.
Version-Release number of selected component (if applicable):
TALM bundle: 4.16.0-23 Seed SNO Cluster: OCP 4.16.0-0.nightly-2024-04-16-195622 Target SNO cluster before upgrade : OCP 4.14.23 Target Hub cluster: OCP 4.16.0-0.nightly-2024-04-16-195622 ACM: 2.10.2 GitOps: 1.12.0 LCA operator bundle container version: 4.16.0-43 LCA Operator version: 4.16.0-36 Recert image version: 4.16.0-10
How reproducible:
Observed once in 3 times. Intermittent issue.
Steps to Reproduce:
[1] Make sure all policies reporting compliant except site specific config policy (which has destination [new labels] extra manifests) on target sno cluster. [2] Apply IBU Prep CGU via TALM from target hub cluster. [3] Make sure IBU CR reporting Prep stage completed successfully. [4] Import stale backucp CR (acm-klusterlet) with different cluster id label other than actual target sno cluster id [5] Apply IBU upgrade policy to trigger upgrade from target hub cluster. {{ Sample snippet logs }} [kni@registry.kni-qe-53 ibu-cgu]$ cat upgrade-cgu.yaml apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-ibu-upgrade namespace: default spec: actions: beforeEnable: addClusterAnnotations: import.open-cluster-management.io/disable-auto-import: "true" afterCompletion: removeClusterAnnotations: - import.open-cluster-management.io/disable-auto-import clusters: - helix57 enable: true managedPolicies: - group-ibu-upgrade-policy remediationStrategy: canaries: - helix57 maxConcurrency: 1 timeout: 240 [kni@registry.kni-qe-53 ibu-cgu]$ oc -n default get cgu NAME AGE STATE DETAILS cgu-ibu-finalize 54m Completed All clusters are compliant with all the managed policies cgu-ibu-prep 53m Completed All clusters are compliant with all the managed policies cgu-ibu-upgrade 46m InProgress Remediating non-compliant policies [kni@registry.kni-qe-53 ibu-cgu]$ ============================= [6] Observed upgrade stuck at "InProgress" with details "Backing up Application Data" before pivot for long time (more than 40 mins)
Actual results:
kni@registry.kni-qe-53 ~]$ oc get ibu NAME AGE DESIRED STAGE STATE DETAILS upgrade 45m Upgrade InProgress Backing up Application Data [kni@registry.kni-qe-53 ~]$ oc get ibu upgrade -o yaml apiVersion: lca.openshift.io/v1alpha1 kind: ImageBasedUpgrade metadata: annotations: auto-rollback-on-failure.lca.openshift.io/init-monitor: Disabled creationTimestamp: "2024-04-27T23:01:03Z" generation: 4 name: upgrade resourceVersion: "249028" uid: 1a437a64-547d-4b08-8e0a-69809bf16698 spec: autoRollbackOnFailure: {} oadpContent: - name: oadp-cm namespace: openshift-adp seedImageRef: image: quay.io/rh_ee_pmohanra/ibu-seed-sno0:4.16.0-0.nightly-2024-04-16-195622 pullSecretRef: name: seed-pull-secret version: 4.16.0-0.nightly-2024-04-16-195622 stage: Upgrade status: conditions: - lastTransitionTime: "2024-04-27T23:11:41Z" message: In progress observedGeneration: 3 reason: InProgress status: "False" type: Idle - lastTransitionTime: "2024-04-27T23:12:58Z" message: Prep completed observedGeneration: 3 reason: Completed status: "False" type: PrepInProgress - lastTransitionTime: "2024-04-27T23:12:58Z" message: Prep completed successfully observedGeneration: 3 reason: Completed status: "True" type: PrepCompleted - lastTransitionTime: "2024-04-27T23:18:49Z" message: Backing up Application Data observedGeneration: 4 reason: InProgress status: "True" type: UpgradeInProgress observedGeneration: 4 validNextStages: - Idle [kni@registry.kni-qe-53 ~]$
Expected results:
[1] Upgrade should not stuck at "InProgress" state before pivot for long time. [2] We should have a mechanism like lca-init-monitor.service that runs post-reboot with a configurable timeout to monitor upgrade events before pivot
Additional info:
/SUT logs before the test // [kni@registry.kni-qe-53 ~]$ oc -n openshift-adp get backups NAME AGE acm-klusterlet 39s [kni@registry.kni-qe-53 ~]$ oc -n openshift-adp get restores.velero.io No resources found in openshift-adp namespace. [kni@registry.kni-qe-53 ~]$ oc -n openshift-adp get backups acm-klusterlet -o yaml | grep -i clusterid config.openshift.io/clusterID: 6b0b914f-bb0a-4f27-914c-e3dce9a67e91 [kni@registry.kni-qe-53 ~]$ [kni@registry.kni-qe-53 ~]$ oc get clusterversions.config.openshift.io -o yaml | grep -i clusterid clusterID: 8b2e5d4f-dc16-478b-ab27-72124fe09f6d [kni@registry.kni-qe-53 ~]$ oc get clusterversions.config.openshift.io NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.14.23 True False 23h Cluster version is 4.14.23 [kni@registry.kni-qe-53 ~]$ oc get no NAME STATUS ROLES AGE VERSION helix57.lab.eng.rdu2.redhat.com Ready control-plane,master,worker 23h v1.27.13+401bb48 [kni@registry.kni-qe-53 ~]$ //SUT logs after the test // [kni@registry.kni-qe-53 ~]$ oc get ibu NAME AGE DESIRED STAGE STATE DETAILS upgrade 45m Upgrade InProgress Backing up Application Data [kni@registry.kni-qe-53 ~]$ oc get ibu upgrade -o yaml apiVersion: lca.openshift.io/v1alpha1 kind: ImageBasedUpgrade metadata: annotations: auto-rollback-on-failure.lca.openshift.io/init-monitor: Disabled creationTimestamp: "2024-04-27T23:01:03Z" generation: 4 name: upgrade resourceVersion: "249028" uid: 1a437a64-547d-4b08-8e0a-69809bf16698 spec: autoRollbackOnFailure: {} oadpContent: - name: oadp-cm namespace: openshift-adp seedImageRef: image: quay.io/rh_ee_pmohanra/ibu-seed-sno0:4.16.0-0.nightly-2024-04-16-195622 pullSecretRef: name: seed-pull-secret version: 4.16.0-0.nightly-2024-04-16-195622 stage: Upgrade status: conditions: - lastTransitionTime: "2024-04-27T23:11:41Z" message: In progress observedGeneration: 3 reason: InProgress status: "False" type: Idle - lastTransitionTime: "2024-04-27T23:12:58Z" message: Prep completed observedGeneration: 3 reason: Completed status: "False" type: PrepInProgress - lastTransitionTime: "2024-04-27T23:12:58Z" message: Prep completed successfully observedGeneration: 3 reason: Completed status: "True" type: PrepCompleted - lastTransitionTime: "2024-04-27T23:18:49Z" message: Backing up Application Data observedGeneration: 4 reason: InProgress status: "True" type: UpgradeInProgress observedGeneration: 4 validNextStages: - Idle [kni@registry.kni-qe-53 ~]$ [kni@registry.kni-qe-53 ~]$ oc get policy -A NAMESPACE NAME REMEDIATION ACTION COMPLIANCE STATE AGE helix57 ztp-common.common-config-policy inform Compliant 50m helix57 ztp-common.common-lca-subscriptions-policy inform Compliant 50m helix57 ztp-common.common-oadp-subscriptions-policy inform Compliant 50m helix57 ztp-common.common-subscriptions-policy inform Compliant 50m helix57 ztp-common.oadp-cm-common-policies Compliant 50m helix57 ztp-group.group-du-sno-config-policy inform NonCompliant 50m helix57 ztp-group.group-ibu-finalize-policy inform NonCompliant 156m helix57 ztp-group.group-ibu-prep-policy inform NonCompliant 156m helix57 ztp-group.group-ibu-rollback-policy inform NonCompliant 156m helix57 ztp-group.group-ibu-upgrade-policy enforce NonCompliant 156m helix57 ztp-site.helix57-config-policy inform NonCompliant 50m helix57 ztp-site.helix57-oadp-config-policy inform Compliant 50m [kni@registry.kni-qe-53 ~]$ [kni@registry.kni-qe-53 ~]$ oc -n openshift-adp get backups No resources found in openshift-adp namespace. [kni@registry.kni-qe-53 ~]$ [root@helix57 ~]# oc get ibu NAME AGE DESIRED STAGE STATE DETAILS upgrade 31m Upgrade InProgress Backing up Application Data [root@helix57 ~]# oc get ibu NAME AGE DESIRED STAGE STATE DETAILS upgrade 37m Upgrade InProgress Backing up Application Data [root@helix57 ~]# =============Complete LCA operator logs during the test========================= 024-04-27T23:12:58Z INFO controllers.ImageBasedUpgrade Prep stage completed successfully! 2024-04-27T23:12:58Z INFO controllers.ImageBasedUpgrade Done handlePrep 2024-04-27T23:12:58Z INFO controllers.ImageBasedUpgrade Finish reconciling IBU {"name": {"name":"upgrade"}, "requeueRightAway": false} 2024-04-27T23:12:58Z INFO controllers.ImageBasedUpgrade Start reconciling IBU {"name": {"name":"upgrade"}} 2024-04-27T23:12:58Z INFO controllers.ImageBasedUpgrade Loaded IBU {"name": {"name":"upgrade"}, "version": "245549", "desired stage": "Prep"} INFO[0715] Executing /usr/bin/env with args [-- rpm-ostree status --json] 2024-04-27T23:12:58Z INFO controllers.ImageBasedUpgrade Finish reconciling IBU {"name": {"name":"upgrade"}, "requeueRightAway": false} 2024-04-27T23:13:11Z INFO controllers.ImageBasedUpgrade Start reconciling IBU {"name": {"name":"upgrade"}} 2024-04-27T23:13:11Z INFO controllers.ImageBasedUpgrade Loaded IBU {"name": {"name":"upgrade"}, "version": "245549", "desired stage": "Prep"} INFO[0728] Executing /usr/bin/env with args [-- rpm-ostree status --json] 2024-04-27T23:13:11Z INFO controllers.ImageBasedUpgrade Finish reconciling IBU {"name": {"name":"upgrade"}, "requeueRightAway": false} =============================== 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade Start reconciling IBU {"name": {"name":"upgrade"}} 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade Loaded IBU {"name": {"name":"upgrade"}, "version": "246889", "desired stage": "Upgrade"} INFO[1066] Executing /usr/bin/env with args [-- rpm-ostree status --json] 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade Starting handleUpgrade INFO[1066] Executing /usr/bin/env with args [-- rpm-ostree status --json] 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade stateroots {"current stateroot:": "rhcos", "desired stateroot": "rhcos_4.16.0_0.nightly_2024_04_16_195622"} 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade Running PrePivot handler 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler Running health check for Upgrade (pre-pivot) 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler All cluster operators are now ready 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler MachineConfigPool ready 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler Node is ready 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler Skipping check of ClusterServiceVersion/lifecycle-agent.v4.16.0 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler All CSVs are ready 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler Cluster version is ready 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler All CSRs are approved 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler SriovNetworkNodeState is ready 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler DataProtectionApplication dataprotectionapplication is reconciled 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler All BackupStorageLocations are available 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler Health checks done 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler Handling backups with OADP operator 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.BackupRestore No PersistentVolumes found in the cluster, skipping. 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.UpgradeHandler Processing backup {"groupIndex": 1, "totalGroups": 1} 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade.BackupRestore Found stale Backup, DeleteBackupRequest has been sent {"backup": "acm-klusterlet"} 2024-04-27T23:18:49Z INFO controllers.ImageBasedUpgrade Finish reconciling IBU {"name": {"name":"upgrade"}, "requeueRightAway": false} ============================
- links to
-
RHEA-2024:128907 OpenShift Container Platform 4.16.0 IBU extras update
- mentioned on