OpenShift Bugs / OCPBUGS-5796

TALM backup CGU only indicates status of one cluster when two clusters are being backed up

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version: 4.12
    • Component: TALM Operator
    • Severity: Moderate

      Description of problem:

      TALM backup only reports the status of one cluster when two are being backed up, and it never reports a completed backup for the missing cluster.  The backup-agent pod log for the unreported cluster confirms that its backup did in fact complete.  Repeated tests show that no particular cluster is preferred: one test will show status for one cluster, a subsequent test will show status for the other.
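      A quick way to see the incomplete per-cluster map is to pull .status.backup.status directly from the CGU (name and namespace are from the run captured under "Actual results" below; the jsonpath query is just a convenience for inspection, not part of the test automation):
      
      $ oc get cgu -n talm-test generated-cgu-disk-full-multiple-spokes -o jsonpath='{.status.backup.status}'
      
      On the failing runs this map contains an entry for only one of the two spokes.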

      Version-Release number of selected component (if applicable):

      4.12

      How reproducible:

      Consistently.  The test was run several times in succession, and each run reported backup status for only one of the two clusters.

      Steps to Reproduce:

      The environment consists of a hub cluster with two spokes.
      
      Tests were executed via automation:
      https://gitlab.cee.redhat.com/cnf/cnf-gotests/-/blob/master/test/ran/talm/tests/talm_backup.go
      Test case "It("should not affect backup on second spoke in same batch" fails, within test suite and as isolated test.
      
      The test case deliberately causes the backup on one spoke to fail, while the other spoke should (and does) back up successfully.  The test does not pass because the CGU only reports results for one cluster.
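      
      For reference, a minimal ClusterGroupUpgrade of the kind the automation generates (fields copied from the spec captured under "Actual results" below; the name and namespace are from this particular run and would differ in a manual reproduction):
      
      apiVersion: ran.openshift.io/v1alpha1
      kind: ClusterGroupUpgrade
      metadata:
        name: generated-cgu-disk-full-multiple-spokes
        namespace: talm-test
      spec:
        backup: true
        enable: true
        clusters:
        - ocp-edge87
        - ocp-edge88
        managedPolicies:
        - generated-policy-disk-full-multiple-spokes
        remediationStrategy:
          maxConcurrency: 100
          timeout: 250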
      

      Actual results:

      Here is the CGU output.  Note that status is shown for only one cluster:
      
      [kni@registry.kni-qe-18 ~]$ oc get cgu -n talm-test generated-cgu-disk-full-multiple-spokes -o yaml 
      apiVersion: ran.openshift.io/v1alpha1
      kind: ClusterGroupUpgrade
      metadata:
        creationTimestamp: "2023-01-12T17:30:41Z"
        finalizers:
        - ran.openshift.io/cleanup-finalizer
        generation: 1
        name: generated-cgu-disk-full-multiple-spokes
        namespace: talm-test
        resourceVersion: "107942509"
        uid: a9c3ef94-2edc-45bf-92c3-f87a0ac913ad
      spec:
        actions:
          afterCompletion:
            deleteObjects: true
          beforeEnable: {}
        backup: true
        clusters:
        - ocp-edge87
        - ocp-edge88
        enable: true
        managedPolicies:
        - generated-policy-disk-full-multiple-spokes
        preCaching: false
        remediationStrategy:
          maxConcurrency: 100
          timeout: 250
      status:
        backup:
          clusters:
          - ocp-edge87
          - ocp-edge88
          startedAt: "2023-01-12T17:30:41Z"
          status:
            ocp-edge88: Starting
        computedMaxConcurrency: 2
        conditions:
        - lastTransitionTime: "2023-01-12T17:30:41Z"
          message: All selected clusters are valid
          reason: ClusterSelectionCompleted
          status: "True"
          type: ClustersSelected
        - lastTransitionTime: "2023-01-12T17:30:41Z"
          message: Completed validation
          reason: ValidationCompleted
          status: "True"
          type: Validated
        - lastTransitionTime: "2023-01-12T17:30:41Z"
          message: Backup in progress for 1 clusters
          reason: InProgress
          status: "False"
          type: BackupSuceeded
        - lastTransitionTime: "2023-01-12T17:30:41Z"
          message: Cluster backup is in progress
          reason: NotStarted
          status: "False"
          type: Progressing
        copiedPolicies:
        - generated-cgu-disk-full-multiple-spokes-generat-zz6lg
        managedPoliciesForUpgrade:
        - name: generated-policy-disk-full-multiple-spokes
          namespace: talm-test
        managedPoliciesNs:
          generated-policy-disk-full-multiple-spokes: talm-test
        placementBindings:
        - generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
        placementRules:
        - generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
        remediationPlan:
        - - ocp-edge87
          - ocp-edge88
        safeResourceNames:
          generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes: generated-cgu-disk-full-multiple-spokes-generat-zz6lg
          generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-config: generated-cgu-disk-full-multiple-spokes-generated-policy--tmfd8
          generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement: generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
        status:
          startedAt: "2023-01-12T17:30:41Z"
      
      

      Expected results:

      The CGU should show status for both clusters:
      
      status:
        backup:
          clusters:
          - ocp-edge87
          - ocp-edge88
          startedAt: "2023-01-12T17:30:41Z"
          status:
            ocp-edge87: Starting
            ocp-edge88: Starting
      
      

      Additional info:

      This is the backup-agent pod log from the cluster whose completed backup is not reported in the CGU status:
      
      [kni@registry.kni-qe-18 ~]$ oc logs -n openshift-talo-backup backup-agent-xvwxr 
      INFO[0000] ------------------------------------------------------------ 
      INFO[0000] Cleaning up old content...                   
      INFO[0000] ------------------------------------------------------------ 
      INFO[0000] 
      fullpath: /var/recovery/upgrade-recovery.sh 
      INFO[0000] 
      fullpath: /var/recovery/cluster             
      INFO[0000] 
      fullpath: /var/recovery/etc.exclude.list    
      INFO[0000] 
      fullpath: /var/recovery/etc                 
      INFO[0000] 
      fullpath: /var/recovery/usrlocal            
      INFO[0000] 
      fullpath: /var/recovery/kubelet             
      INFO[0000] Old directories deleted with contents        
      INFO[0000] Old contents have been cleaned up            
      INFO[0000] Available disk space : 843.69 GiB; Estimated disk space required for backup: 1.12 GiB  
      INFO[0000] Sufficient disk space found to trigger backup 
      INFO[0000] Upgrade recovery script written              
      INFO[0000] Running: bash -c /var/recovery/upgrade-recovery.sh --take-backup --dir /var/recovery 
      INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Taking backup 
      INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Wiping previous deployments and pinning active 
      INFO[0000] error: Out of range deployment index 1, expected < 1 
      INFO[0000] Deployment 0 is already pinned               
      INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Backing up container cluster and required files 
      INFO[0000] Certificate /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt is missing. Checking in different directory 
      INFO[0000] Certificate /etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt found! 
      INFO[0001] found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-10 
      INFO[0001] found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-6 
      INFO[0001] found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6 
      INFO[0001] found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-4 
      INFO[0001] etcdctl is already installed                 
      INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.315Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/recovery/cluster/snapshot_2023-01-12_173210.db.part"} 
      INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.323Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:212","msg":"opened snapshot stream; downloading"} 
      INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.324Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://10.46.46.67:2379"} 
      INFO[0003] {"level":"info","ts":"2023-01-12T17:32:12.787Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:220","msg":"completed snapshot read; closing"} 
      INFO[0003] {"level":"info","ts":"2023-01-12T17:32:13.085Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://10.46.46.67:2379","size":"119 MB","took":"1 second ago"} 
      INFO[0003] {"level":"info","ts":"2023-01-12T17:32:13.085Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/recovery/cluster/snapshot_2023-01-12_173210.db"} 
      INFO[0003] Snapshot saved at /var/recovery/cluster/snapshot_2023-01-12_173210.db 
      INFO[0003] Deprecated: Use `etcdutl snapshot status` instead. 
      INFO[0003]                                              
      INFO[0003] {"hash":1827950534,"revision":2542000,"totalKey":10282,"totalSize":119013376} 
      INFO[0003] snapshot db and kube resources are successfully saved to /var/recovery/cluster 
      INFO[0004] Command succeeded: rsync -a /etc/ /var/recovery/etc/ 
      INFO[0004] Command succeeded: rsync -a /usr/local/ /var/recovery/usrlocal/ 
      INFO[0008] Command succeeded: rsync -a /var/lib/kubelet/ /var/recovery/kubelet/ 
      INFO[0008] ##### Thu Jan 12 17:32:18 UTC 2023: Backup complete 
      INFO[0008] ------------------------------------------------------------ 
      INFO[0008] backup has successfully finished ...        
      
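      If further confirmation is needed that the backup actually landed on the unreported spoke, the recovery artifacts listed in the log above can be checked directly on the node; a hypothetical example (the node name is a placeholder, run against the spoke's kubeconfig):
      
      $ oc debug node/<spoke-node> -- chroot /host ls -l /var/recovery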
      

            Jun Chen (jche@redhat.com)
            Bonnie Block (bblock@redhat.com)
            Yang Liu