Bug
Resolution: Duplicate
Version: 4.12
Quality / Stability / Reliability
Severity: Moderate
Description of problem:
TALM backup only reports the status of one cluster when two clusters are being backed up. The missing cluster is never reported as having completed its backup, although the backup-agent pod log for that cluster confirms the backup did in fact complete. Repeated tests show that no particular cluster is preferred: one test will show status for one cluster, a subsequent test will show status for the other.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Consistently; the test was run several times in succession.
Steps to Reproduce:
Environment consists of a hub cluster with two spokes.
Tests were executed via automation:
https://gitlab.cee.redhat.com/cnf/cnf-gotests/-/blob/master/test/ran/talm/tests/talm_backup.go
Test case "It("should not affect backup on second spoke in same batch" fails, within test suite and as isolated test.
Test case causes one spoke backup to fail, but the other should (and does) back up successfully. Test doe not pass because cgu is only reporting results for one cluster.
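For reference, a minimal manual sketch of the same scenario outside the automation. The CGU below mirrors the generated spec shown under Actual results; the CGU name and the disk-fill step are placeholders chosen here for illustration, not taken from the test code:

# 1. On one spoke (e.g. ocp-edge87), fill the filesystem backing /var/recovery so
#    that its backup fails for lack of space (the automation drives this step; the
#    exact mechanism, e.g. fallocate-ing a large file on the node, is up to the tester).

# 2. On the hub, apply a CGU that backs up both spokes in the same batch:
cat <<'EOF' | oc apply -f -
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-backup-two-spokes        # placeholder name
  namespace: talm-test
spec:
  backup: true
  enable: true
  clusters:
  - ocp-edge87
  - ocp-edge88
  managedPolicies:
  - generated-policy-disk-full-multiple-spokes
  remediationStrategy:
    maxConcurrency: 100
    timeout: 250
EOF

# 3. Poll the per-cluster backup status; the bug is that only one of the two
#    spokes ever appears under .status.backup.status:
oc -n talm-test get cgu cgu-backup-two-spokes \
  -o jsonpath='{.status.backup.status}{"\n"}'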
Actual results:
Here is the CGU output. Note that status is shown for only one cluster:
[kni@registry.kni-qe-18 ~]$ oc get cgu -n talm-test generated-cgu-disk-full-multiple-spokes -o yaml
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  creationTimestamp: "2023-01-12T17:30:41Z"
  finalizers:
  - ran.openshift.io/cleanup-finalizer
  generation: 1
  name: generated-cgu-disk-full-multiple-spokes
  namespace: talm-test
  resourceVersion: "107942509"
  uid: a9c3ef94-2edc-45bf-92c3-f87a0ac913ad
spec:
  actions:
    afterCompletion:
      deleteObjects: true
    beforeEnable: {}
  backup: true
  clusters:
  - ocp-edge87
  - ocp-edge88
  enable: true
  managedPolicies:
  - generated-policy-disk-full-multiple-spokes
  preCaching: false
  remediationStrategy:
    maxConcurrency: 100
    timeout: 250
status:
  backup:
    clusters:
    - ocp-edge87
    - ocp-edge88
    startedAt: "2023-01-12T17:30:41Z"
    status:
      ocp-edge88: Starting
  computedMaxConcurrency: 2
  conditions:
  - lastTransitionTime: "2023-01-12T17:30:41Z"
    message: All selected clusters are valid
    reason: ClusterSelectionCompleted
    status: "True"
    type: ClustersSelected
  - lastTransitionTime: "2023-01-12T17:30:41Z"
    message: Completed validation
    reason: ValidationCompleted
    status: "True"
    type: Validated
  - lastTransitionTime: "2023-01-12T17:30:41Z"
    message: Backup in progress for 1 clusters
    reason: InProgress
    status: "False"
    type: BackupSuceeded
  - lastTransitionTime: "2023-01-12T17:30:41Z"
    message: Cluster backup is in progress
    reason: NotStarted
    status: "False"
    type: Progressing
  copiedPolicies:
  - generated-cgu-disk-full-multiple-spokes-generat-zz6lg
  managedPoliciesForUpgrade:
  - name: generated-policy-disk-full-multiple-spokes
    namespace: talm-test
  managedPoliciesNs:
    generated-policy-disk-full-multiple-spokes: talm-test
  placementBindings:
  - generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
  placementRules:
  - generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
  remediationPlan:
  - - ocp-edge87
    - ocp-edge88
  safeResourceNames:
    generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes: generated-cgu-disk-full-multiple-spokes-generat-zz6lg
    generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-config: generated-cgu-disk-full-multiple-spokes-generated-policy--tmfd8
    generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement: generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
  status:
    startedAt: "2023-01-12T17:30:41Z"
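The same gap can be seen by pulling just the per-cluster backup status out of the CGU (a command added here for illustration; for this run it returns only the ocp-edge88 entry):

oc -n talm-test get cgu generated-cgu-disk-full-multiple-spokes \
  -o jsonpath='{.status.backup.status}{"\n"}'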
Expected results:
The CGU should show backup status for both clusters:
status:
  backup:
    clusters:
    - ocp-edge87
    - ocp-edge88
    startedAt: "2023-01-12T17:30:41Z"
    status:
      ocp-edge87: Starting
      ocp-edge88: Starting
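A simple shell check for the expected outcome (a hypothetical check written for this report, not the assertion used by cnf-gotests) is to require an entry under .status.backup.status for every cluster in the CGU:

for c in ocp-edge87 ocp-edge88; do
  # 2>/dev/null keeps the loop quiet if jsonpath errors on a missing map key
  s=$(oc -n talm-test get cgu generated-cgu-disk-full-multiple-spokes \
        -o jsonpath="{.status.backup.status['$c']}" 2>/dev/null)
  [ -n "$s" ] || echo "no backup status reported for $c"
done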
Additional info:
This is the backup-agent log from the cluster whose backup was not reported; it shows the backup completing successfully:
[kni@registry.kni-qe-18 ~]$ oc logs -n openshift-talo-backup backup-agent-xvwxr
INFO[0000] ------------------------------------------------------------
INFO[0000] Cleaning up old content...
INFO[0000] ------------------------------------------------------------
INFO[0000]
fullpath: /var/recovery/upgrade-recovery.sh
INFO[0000]
fullpath: /var/recovery/cluster
INFO[0000]
fullpath: /var/recovery/etc.exclude.list
INFO[0000]
fullpath: /var/recovery/etc
INFO[0000]
fullpath: /var/recovery/usrlocal
INFO[0000]
fullpath: /var/recovery/kubelet
INFO[0000] Old directories deleted with contents
INFO[0000] Old contents have been cleaned up
INFO[0000] Available disk space : 843.69 GiB; Estimated disk space required for backup: 1.12 GiB
INFO[0000] Sufficient disk space found to trigger backup
INFO[0000] Upgrade recovery script written
INFO[0000] Running: bash -c /var/recovery/upgrade-recovery.sh --take-backup --dir /var/recovery
INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Taking backup
INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Wiping previous deployments and pinning active
INFO[0000] error: Out of range deployment index 1, expected < 1
INFO[0000] Deployment 0 is already pinned
INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Backing up container cluster and required files
INFO[0000] Certificate /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt is missing. Checking in different directory
INFO[0000] Certificate /etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt found!
INFO[0001] found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-10
INFO[0001] found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-6
INFO[0001] found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6
INFO[0001] found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-4
INFO[0001] etcdctl is already installed
INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.315Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/recovery/cluster/snapshot_2023-01-12_173210.db.part"}
INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.323Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:212","msg":"opened snapshot stream; downloading"}
INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.324Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://10.46.46.67:2379"}
INFO[0003] {"level":"info","ts":"2023-01-12T17:32:12.787Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:220","msg":"completed snapshot read; closing"}
INFO[0003] {"level":"info","ts":"2023-01-12T17:32:13.085Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://10.46.46.67:2379","size":"119 MB","took":"1 second ago"}
INFO[0003] {"level":"info","ts":"2023-01-12T17:32:13.085Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/recovery/cluster/snapshot_2023-01-12_173210.db"}
INFO[0003] Snapshot saved at /var/recovery/cluster/snapshot_2023-01-12_173210.db
INFO[0003] Deprecated: Use `etcdutl snapshot status` instead.
INFO[0003]
INFO[0003] {"hash":1827950534,"revision":2542000,"totalKey":10282,"totalSize":119013376}
INFO[0003] snapshot db and kube resources are successfully saved to /var/recovery/cluster
INFO[0004] Command succeeded: rsync -a /etc/ /var/recovery/etc/
INFO[0004] Command succeeded: rsync -a /usr/local/ /var/recovery/usrlocal/
INFO[0008] Command succeeded: rsync -a /var/lib/kubelet/ /var/recovery/kubelet/
INFO[0008] ##### Thu Jan 12 17:32:18 UTC 2023: Backup complete
INFO[0008] ------------------------------------------------------------
INFO[0008] backup has successfully finished ...
duplicates: OCPBUGS-5797 TALM backup CGU only indicates status of one cluster when two clusters are being backed up (Closed)