OpenShift Bugs / OCPBUGS-5796

TALM backup CGU only indicates status of one cluster when two clusters are being backed up


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version: 4.12
    • Component: TALM Operator
    • Severity: Moderate

      Description of problem:

      TALM backup is only reporting the status of one cluster when two are being backed up; a completed backup is never reported for the second cluster.  The backup-agent pod log for the unreported cluster confirms that the backup did in fact complete.  Repeated tests show that no particular cluster is preferred: one test will show status for one cluster, a subsequent test will show status for the other.
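      
      As a quick way to observe the symptom outside the automation (plain oc; the CGU name, namespace, and status field layout are the ones shown under Actual results below):
      
      # Print the per-cluster backup status map. With two clusters being backed up,
      # entries for both ocp-edge87 and ocp-edge88 are expected, but only one appears.
      oc get cgu -n talm-test generated-cgu-disk-full-multiple-spokes \
        -o jsonpath='{.status.backup.status}{"\n"}'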

      Version-Release number of selected component (if applicable):

      4.12

      How reproducible:

      Test was run several times in succession.

      Steps to Reproduce:

      Environment consists of a hub cluster with two spokes.
      
      Tests were executed via automation:
      https://gitlab.cee.redhat.com/cnf/cnf-gotests/-/blob/master/test/ran/talm/tests/talm_backup.go
      Test case "It("should not affect backup on second spoke in same batch" fails, within test suite and as isolated test.
      
      Test case causes one spoke backup to fail, but the other should (and does) back up successfully.  Test doe not pass because cgu is only reporting results for one cluster.
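      
      This is a hand-written sketch, not the exact object created by the test: the spec fields mirror the generated CGU shown under Actual results, and it assumes the generated inform policy already exists on the hub.
      
      apiVersion: ran.openshift.io/v1alpha1
      kind: ClusterGroupUpgrade
      metadata:
        name: generated-cgu-disk-full-multiple-spokes
        namespace: talm-test
      spec:
        backup: true          # triggers the pre-upgrade backup on each spoke
        enable: true
        preCaching: false
        clusters:
        - ocp-edge87
        - ocp-edge88
        managedPolicies:
        - generated-policy-disk-full-multiple-spokes
        remediationStrategy:
          maxConcurrency: 100
          timeout: 250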
      

      Actual results:

      Here is the CGU output.  Note that backup status is shown for only one cluster:
      
      [kni@registry.kni-qe-18 ~]$ oc get cgu -n talm-test generated-cgu-disk-full-multiple-spokes -o yaml 
      apiVersion: ran.openshift.io/v1alpha1
      kind: ClusterGroupUpgrade
      metadata:
        creationTimestamp: "2023-01-12T17:30:41Z"
        finalizers:
        - ran.openshift.io/cleanup-finalizer
        generation: 1
        name: generated-cgu-disk-full-multiple-spokes
        namespace: talm-test
        resourceVersion: "107942509"
        uid: a9c3ef94-2edc-45bf-92c3-f87a0ac913ad
      spec:
        actions:
          afterCompletion:
            deleteObjects: true
          beforeEnable: {}
        backup: true
        clusters:
        - ocp-edge87
        - ocp-edge88
        enable: true
        managedPolicies:
        - generated-policy-disk-full-multiple-spokes
        preCaching: false
        remediationStrategy:
          maxConcurrency: 100
          timeout: 250
      status:
        backup:
          clusters:
          - ocp-edge87
          - ocp-edge88
          startedAt: "2023-01-12T17:30:41Z"
          status:
            ocp-edge88: Starting
        computedMaxConcurrency: 2
        conditions:
        - lastTransitionTime: "2023-01-12T17:30:41Z"
          message: All selected clusters are valid
          reason: ClusterSelectionCompleted
          status: "True"
          type: ClustersSelected
        - lastTransitionTime: "2023-01-12T17:30:41Z"
          message: Completed validation
          reason: ValidationCompleted
          status: "True"
          type: Validated
        - lastTransitionTime: "2023-01-12T17:30:41Z"
          message: Backup in progress for 1 clusters
          reason: InProgress
          status: "False"
          type: BackupSuceeded
        - lastTransitionTime: "2023-01-12T17:30:41Z"
          message: Cluster backup is in progress
          reason: NotStarted
          status: "False"
          type: Progressing
        copiedPolicies:
        - generated-cgu-disk-full-multiple-spokes-generat-zz6lg
        managedPoliciesForUpgrade:
        - name: generated-policy-disk-full-multiple-spokes
          namespace: talm-test
        managedPoliciesNs:
          generated-policy-disk-full-multiple-spokes: talm-test
        placementBindings:
        - generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
        placementRules:
        - generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
        remediationPlan:
        - - ocp-edge87
          - ocp-edge88
        safeResourceNames:
          generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes: generated-cgu-disk-full-multiple-spokes-generat-zz6lg
          generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-config: generated-cgu-disk-full-multiple-spokes-generated-policy--tmfd8
          generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement: generated-cgu-disk-full-multiple-spokes-generated-policy-disk-full-multiple-spokes-placement-kstpc
        status:
          startedAt: "2023-01-12T17:30:41Z"
      
      

      Expected results:

      The CGU should show backup status for both clusters:
      
      status:
        backup:
          clusters:
          - ocp-edge87
          - ocp-edge88
          startedAt: "2023-01-12T17:30:41Z"
          status:
            ocp-edge87: Starting
            ocp-edge88: Starting
      
      

      Additional info:

      This is the backup-agent log from the cluster whose backup completed but was never reported in the CGU status:
      
      7c759c7f-tb27v                     2/2     Running            10 (2d3h ago)    3d
      [kni@registry.kni-qe-18 ~]$ oc logs -n openshift-talo-backup backup-agent-xvwxr 
      INFO[0000] ------------------------------------------------------------ 
      INFO[0000] Cleaning up old content...                   
      INFO[0000] ------------------------------------------------------------ 
      INFO[0000] 
      fullpath: /var/recovery/upgrade-recovery.sh 
      INFO[0000] 
      fullpath: /var/recovery/cluster             
      INFO[0000] 
      fullpath: /var/recovery/etc.exclude.list    
      INFO[0000] 
      fullpath: /var/recovery/etc                 
      INFO[0000] 
      fullpath: /var/recovery/usrlocal            
      INFO[0000] 
      fullpath: /var/recovery/kubelet             
      INFO[0000] Old directories deleted with contents        
      INFO[0000] Old contents have been cleaned up            
      INFO[0000] Available disk space : 843.69 GiB; Estimated disk space required for backup: 1.12 GiB  
      INFO[0000] Sufficient disk space found to trigger backup 
      INFO[0000] Upgrade recovery script written              
      INFO[0000] Running: bash -c /var/recovery/upgrade-recovery.sh --take-backup --dir /var/recovery 
      INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Taking backup 
      INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Wiping previous deployments and pinning active 
      INFO[0000] error: Out of range deployment index 1, expected < 1 
      INFO[0000] Deployment 0 is already pinned               
      INFO[0000] ##### Thu Jan 12 17:32:10 UTC 2023: Backing up container cluster and required files 
      INFO[0000] Certificate /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt is missing. Checking in different directory 
      INFO[0000] Certificate /etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt found! 
      INFO[0001] found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-10 
      INFO[0001] found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-6 
      INFO[0001] found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6 
      INFO[0001] found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-4 
      INFO[0001] etcdctl is already installed                 
      INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.315Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/recovery/cluster/snapshot_2023-01-12_173210.db.part"} 
      INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.323Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:212","msg":"opened snapshot stream; downloading"} 
      INFO[0001] {"level":"info","ts":"2023-01-12T17:32:11.324Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://10.46.46.67:2379"} 
      INFO[0003] {"level":"info","ts":"2023-01-12T17:32:12.787Z","logger":"client","caller":"v3@v3.5.6/maintenance.go:220","msg":"completed snapshot read; closing"} 
      INFO[0003] {"level":"info","ts":"2023-01-12T17:32:13.085Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://10.46.46.67:2379","size":"119 MB","took":"1 second ago"} 
      INFO[0003] {"level":"info","ts":"2023-01-12T17:32:13.085Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/recovery/cluster/snapshot_2023-01-12_173210.db"} 
      INFO[0003] Snapshot saved at /var/recovery/cluster/snapshot_2023-01-12_173210.db 
      INFO[0003] Deprecated: Use `etcdutl snapshot status` instead. 
      INFO[0003]                                              
      INFO[0003] {"hash":1827950534,"revision":2542000,"totalKey":10282,"totalSize":119013376} 
      INFO[0003] snapshot db and kube resources are successfully saved to /var/recovery/cluster 
      INFO[0004] Command succeeded: rsync -a /etc/ /var/recovery/etc/ 
      INFO[0004] Command succeeded: rsync -a /usr/local/ /var/recovery/usrlocal/ 
      INFO[0008] Command succeeded: rsync -a /var/lib/kubelet/ /var/recovery/kubelet/ 
      INFO[0008] ##### Thu Jan 12 17:32:18 UTC 2023: Backup complete 
      INFO[0008] ------------------------------------------------------------ 
      INFO[0008] backup has successfully finished ...        
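      
      If direct access to the spoke cluster is available (a kubeconfig for the spoke, not the hub), the backup artifacts can also be spot-checked on the node itself. This is an extra step not performed in this report, and <spoke-master-node> is a placeholder for the spoke's node name:
      
      # List the recovery artifacts the backup-agent writes under /var/recovery
      # (the same paths that appear in the log above).
      oc debug node/<spoke-master-node> -- chroot /host ls -l /var/recovery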
      
      
