-
Bug
-
Resolution: Duplicate
-
Undefined
-
None
-
4.12
-
None
-
Moderate
-
None
-
False
-
-
Description of problem:
TALM backup only reports the status of one cluster when two clusters are being backed up, and a completed backup is never reported for the missing cluster. The backup-agent pod log for the unreported cluster confirms that its backup did in fact complete. Repeated tests show that no particular cluster is preferred: one run will show status for one cluster, and a subsequent run will show status for the other.
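For quick triage, the per-cluster backup status can be read straight from the CGU with a jsonpath query (a minimal sketch; the resource name and namespace are taken from the failing run shown under Actual results):

# Lists the per-cluster entries under .status.backup.status;
# both ocp-edge87 and ocp-edge88 should appear, but in the failing runs only one of the two ever does.
oc get cgu -n talm-test generated-cgu-disk-full-multiple-spokes -o jsonpath='{.status.backup.status}'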
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Consistently. The test was run several times in succession and the issue reproduced on every run.
Steps to Reproduce:
The environment consists of a hub cluster with two spokes. Tests were executed via automation: https://gitlab.cee.redhat.com/cnf/cnf-gotests/-/blob/master/test/ran/talm/tests/talm_backup.go
The test case It("should not affect backup on second spoke in same batch") fails, both within the full test suite and when run in isolation. The test case deliberately causes the backup on one spoke to fail, while the other spoke should (and does) back up successfully. The test does not pass because the CGU only reports backup results for one cluster. A reconstruction of the generated CGU is sketched below.
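For reference, this is a minimal sketch of a ClusterGroupUpgrade roughly equivalent to the one the automation generates, reconstructed from the spec captured under Actual results (the disk-full fault that the test injects on one spoke is not part of this manifest and would need to be set up separately):

# Reconstructed from the generated CGU's spec; apply on the hub, e.g. oc apply -f cgu.yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: generated-cgu-disk-full-multiple-spokes
  namespace: talm-test
spec:
  actions:
    afterCompletion:
      deleteObjects: true
    beforeEnable: {}
  backup: true
  clusters:
  - ocp-edge87
  - ocp-edge88
  enable: true
  managedPolicies:
  - generated-policy-disk-full-multiple-spokes
  preCaching: false
  remediationStrategy:
    maxConcurrency: 100
    timeout: 250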
Actual results:
Here is the CGU output. Note that only one cluster status is shown:

[kni@registry.kni-qe-18 ~]$ oc get cgu -n talm-test generated-cgu-disk-full-multiple-spokes -o yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  creationTimestamp: "2023-01-12T17:30:41Z"
  finalizers:
  - ran.openshift.io/cleanup-finalizer
  generation: 1
  name: generated-cgu-disk-full-multiple-spokes
  namespace: talm-test
  resourceVersion: "107942509"
  uid: a9c3ef94-2edc-45bf-92c3-f87a0ac913ad
spec:
  actions:
    afterCompletion:
      deleteObjects: true
    beforeEnable: {}
  backup: true
  clusters:
  - ocp-edge87
  - ocp-edge88
  enable: true
  managedPolicies:
  - generated-policy-disk-full-multiple-spokes
  preCaching: false
  remediationStrategy:
    maxConcurrency: 100
    timeout: 250
status:
  backup:
    clusters:
    - ocp-edge87
    - ocp-edge88
    startedAt: "2023-01-12T17:30:41Z"
    status:
      ocp-edge88: Starting
  computedMaxConcurrency: 2
  conditions:
  - lastTransitionTime: "2023-01-12T17:30:41Z"
    message: All selected clusters are valid
    reason: ClusterSelectionCompleted
    status: "True"
    type: ClustersSelected
  - lastTransitionTime: "2023-01-12T17:30:41Z"
    message: Completed validation
    reason: ValidationCompleted
    status: "True"
    type: Validated
  - lastTransitionTime: "2023-01-12T17:30:41Z"
    message: Backup in progress for 1 clusters
    reason: InProgress
    status: "False"
    type: BackupSuceeded
  - lastTransitionTime: "2023-01-12T17:30:41Z"
    message: Cluster backup is in progress
    reason: NotStarted
    status: "False"
    type: Progressing
  copiedPolicies:
  - generated-cgu-disk-full-multiple-spokes-generat-zz6lg
  managedPoliciesForUpgrade:
  - name: generated-policy-disk-full-multiple-spokes
    namespace: talm-test
  managedPoliciesNs:
    generated-policy-disk-full-multiple-spokes: talm-test
  placementBindings:
  - generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
  placementRules:
  - generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
  remediationPlan:
  - - ocp-edge87
    - ocp-edge88
  safeResourceNames:
    generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes: generated-cgu-disk-full-multiple-spokes-generat-zz6lg
    generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-config: generated-cgu-disk-full-multiple-spokes-generated-policy--tmfd8
    generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement: generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
  status:
    startedAt: "2023-01-12T17:30:41Z"
Expected results:
The CGU should show status for both clusters:

status:
  backup:
    clusters:
    - ocp-edge87
    - ocp-edge88
    startedAt: "2023-01-12T17:30:41Z"
    status:
      ocp-edge87: Starting
      ocp-edge88: Starting
Additional info:
This is the backup-agent log from the backed-up cluster whose status was not reported:

7c759c7f-tb27v 2/2 Running 10 (2d3h ago) 3d
[kni@registry.kni-qe-18 ~]$ oc logs -n openshift-talo-backup backup-agent-xvwxr
INFO[0000] ------------------------------------------------------------
INFO[0000] Cleaning up old content...
INFO[0000] ------------------------------------------------------------
INFO[0000] fullpath: /var/recovery/upgrade-recovery.sh
INFO[0000] fullpath: /var/recovery/cluster
INFO[0000] fullpath: /var/recovery/etc.exclude.list
INFO[0000] fullpath: /var/recovery/etc
INFO[0000] fullpath: /var/recovery/usrlocal
INFO[0000] fullpath: /var/recovery/kubelet
INFO[0000] Old directories deleted with contents
INFO[0000] Old contents have been cleaned up
INFO[0000] Available disk space : 843.69 GiB; Estimated disk space required for backup: 1.12 GiB
INFO[0000] Sufficient disk space found to trigger backup
INFO[0000] Upgrade recovery script written
INFO[0000] Running: bash -c /var/recovery/upgrade-recovery.sh --take-backup --dir /var/recovery
INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Taking backup
INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Wiping previous deployments and pinning active
INFO[0000] error: Out of range deployment index 1, expected < 1
INFO[0000] Deployment 0 is already pinned
INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Backing up container cluster and required files
INFO[0000] Certificate /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt is missing. Checking in different directory
INFO[0000] Certificate /etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt found!
INFO[0001] found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-10
INFO[0001] found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-6
INFO[0001] found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6
INFO[0001] found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-4
INFO[0001] etcdctl is already installed
INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.315Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/recovery/cluster/snapshot_2023-01-12_173210.db.part"}
INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.323Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:212","msg":"opened snapshot stream; downloading"}
INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.324Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://10.46.46.67:2379"}
INFO[0003] {"level":"info","ts":"2023-01-12T17:32:12.787Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:220","msg":"completed snapshot read; closing"}
INFO[0003] {"level":"info","ts":"2023-01-12T17:32:13.085Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://10.46.46.67:2379","size":"119 MB","took":"1 second ago"}
INFO[0003] {"level":"info","ts":"2023-01-12T17:32:13.085Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/recovery/cluster/snapshot_2023-01-12_173210.db"}
INFO[0003] Snapshot saved at /var/recovery/cluster/snapshot_2023-01-12_173210.db
INFO[0003] Deprecated: Use `etcdutl snapshot status` instead.
INFO[0003]
INFO[0003] {"hash":1827950534,"revision":2542000,"totalKey":10282,"totalSize":119013376}
INFO[0003] snapshot db and kube resources are successfully saved to /var/recovery/cluster
INFO[0004] Command succeeded: rsync -a /etc/ /var/recovery/etc/
INFO[0004] Command succeeded: rsync -a /usr/local/ /var/recovery/usrlocal/
INFO[0008] Command succeeded: rsync -a /var/lib/kubelet/ /var/recovery/kubelet/
INFO[0008] ##### Thu Jan 12 17:32:18 UTC 2023: Backup complete
INFO[0008] ------------------------------------------------------------
INFO[0008] backup has successfully finished ...
- duplicates: OCPBUGS-5797 TALM backup CGU only indicates status of one cluster when two clusters are being backed up (Closed)