
OCPBUGS-19637: Cluster Backup Fails in upgrade-recovery.sh


      This is a clone of issue OCPBUGS-19555. The following is the description of the original issue:

      Description of problem:

      Cluster backup consistently fails when a CGU (ClusterGroupUpgrade) is created with backup: true.

      Version-Release number of selected component (if applicable):

      TALM v4.14.0-62
      OCP 4.14.0-rc.1

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install hub cluster with OCP 4.14.0-rc.1
      2. Install latest TALM on hub cluster
      3. Provision managed cluster with OCP 4.14.0-rc.1
      4. Create a CGU with backup: true (see the example manifest below)
      5. Enable the CGU
      6. The CGU fails with backup status: UnrecoverableError
      7. View the backup-agent pod logs on the managed cluster
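
      For reference, a minimal CGU manifest of the kind used in step 4 might look like the following. The names, namespace, and policy are placeholders rather than values from this report, and the exact fields should be checked against the TALM version in use:

      # cgu-backup-test.yaml -- hypothetical example; cluster, namespace, and policy names are placeholders
      apiVersion: ran.openshift.io/v1alpha1
      kind: ClusterGroupUpgrade
      metadata:
        name: cgu-backup-test
        namespace: ztp-install
      spec:
        backup: true                  # requests the pre-upgrade backup on the managed cluster (step 4)
        enable: false                 # set to true to start the CGU (step 5)
        clusters:
          - spoke1                    # placeholder managed cluster name
        managedPolicies:
          - example-upgrade-policy    # placeholder policy name
        remediationStrategy:
          maxConcurrency: 1
          timeout: 240

      After applying and enabling the CGU (for example, oc apply -f cgu-backup-test.yaml, then setting enable: true), the backup state is reported in the CGU status, which is where the UnrecoverableError from step 6 shows up; the exact status field layout may vary by TALM version.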

      Actual results:

      Backup fails (the backup-agent pod exits with status 1; see the logs below).

      Expected results:

      Backup should succeed.

      Additional info:

      [kni@registry auth]$ oc logs -n openshift-talo-backup backup-agent-jnt9p --follow
      INFO[0002] Successfully remounted /host/sysroot with r/w permission 
      INFO[0002] ------------------------------------------------------------ 
      INFO[0002] Cleaning up old content...                   
      INFO[0002] ------------------------------------------------------------ 
      INFO[0002] 
      fullpath: /var/recovery/upgrade-recovery.sh 
      INFO[0002] 
      fullpath: /var/recovery/cluster             
      INFO[0002] 
      fullpath: /var/recovery/etc.exclude.list    
      INFO[0002] 
      fullpath: /var/recovery/etc                 
      INFO[0002] 
      fullpath: /var/recovery/local               
      INFO[0002] 
      fullpath: /var/recovery/kubelet             
      INFO[0025] 
      fullpath: /var/recovery/extras.tgz          
      INFO[0025] Old directories deleted with contents        
      INFO[0025] Old contents have been cleaned up            
      INFO[0031] Available disk space : 456.74 GiB; Estimated disk space required for backup: 32.28 GiB  
      INFO[0031] Sufficient disk space found to trigger backup 
      INFO[0031] Upgrade recovery script written              
      INFO[0031] Running: bash -c /var/recovery/upgrade-recovery.sh --take-backup --dir /var/recovery 
      INFO[0031] ##### Thu Sep 21 14:00:48 UTC 2023: Taking backup 
      INFO[0031] ##### Thu Sep 21 14:00:48 UTC 2023: Wiping previous deployments and pinning active 
      INFO[0031] error: Out of range deployment index 1, expected < 1 
      INFO[0031] Deployment 0 is already pinned               
      INFO[0031] ##### Thu Sep 21 14:00:48 UTC 2023: Backing up container cluster and required files 
      INFO[0031] Certificate /etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt is missing. Checking in different directory 
      INFO[0031] Certificate /etc/kubernetes/static-pod-resources/etcd-certs/configmaps/etcd-serving-ca/ca-bundle.crt found! 
      INFO[0031] found latest kube-apiserver: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-9 
      INFO[0031] found latest kube-controller-manager: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-6 
      INFO[0031] found latest kube-scheduler: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6 
      INFO[0031] found latest etcd: /etc/kubernetes/static-pod-resources/etcd-pod-2 
      INFO[0031] etcdctl is already installed                 
      INFO[0031] etcdutl is already installed                 
      INFO[0031] {"level":"info","ts":"2023-09-21T14:00:48.48003Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/recovery/cluster/snapshot_2023-09-21_140048__POSSIBLY_DIRTY__.db.part"} 
      INFO[0031] {"level":"info","ts":"2023-09-21T14:00:48.490246Z","logger":"client","caller":"v3@v3.5.9/maintenance.go:212","msg":"opened snapshot stream; downloading"} 
      INFO[0031] {"level":"info","ts":"2023-09-21T14:00:48.49028Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://10.46.46.66:2379"} 
      INFO[0033] {"level":"info","ts":"2023-09-21T14:00:50.158759Z","logger":"client","caller":"v3@v3.5.9/maintenance.go:220","msg":"completed snapshot read; closing"} 
      INFO[0033] {"level":"info","ts":"2023-09-21T14:00:50.407955Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://10.46.46.66:2379","size":"115 MB","took":"1 second ago"} 
      INFO[0033] {"level":"info","ts":"2023-09-21T14:00:50.408049Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/recovery/cluster/snapshot_2023-09-21_140048__POSSIBLY_DIRTY__.db"} 
      INFO[0033] Snapshot saved at /var/recovery/cluster/snapshot_2023-09-21_140048__POSSIBLY_DIRTY__.db 
      INFO[0033] {"hash":1281395486,"revision":693323,"totalKey":7404,"totalSize":115171328} 
      INFO[0033] snapshot db and kube resources are successfully saved to /var/recovery/cluster 
      INFO[0034] Command succeeded: cp -Ra /etc/ /var/recovery/ 
      INFO[0034] Command succeeded: cp -Ra /usr/local/ /var/recovery/ 
      INFO[0099] Command succeeded: cp -Ra /var/lib/kubelet/ /var/recovery/ 
      INFO[0099] tar: Removing leading `/' from member names  
      INFO[0099] tar: /var/lib/ovn-ic/etc/enable_dynamic_cpu_affinity: Cannot stat: No such file or directory 
      INFO[0099] tar: Exiting with failure status due to previous errors 
      INFO[0099] ##### Thu Sep 21 14:01:55 UTC 2023: Failed to backup additional managed files 
      ERRO[0099] exit status 1                                
      Error: exit status 1
      Usage:
        upgrade-recovery launchBackup [flags]

      Flags:
        -h, --help   help for launchBackup
      exit status 1
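
      The failure occurs at the "Backing up additional managed files" step: tar aborts because /var/lib/ovn-ic/etc/enable_dynamic_cpu_affinity is in the archive list but does not exist on this 4.14.0-rc.1 node, and the resulting non-zero tar exit is what the backup agent reports and the CGU surfaces as UnrecoverableError. As an illustration only (this is not the actual upgrade-recovery.sh code, and the file list below is hypothetical), the archive step could tolerate paths that are absent on a given release along these lines:

      #!/bin/bash
      # Hypothetical sketch -- the real upgrade-recovery.sh file list and tar options may differ.
      EXTRA_FILES=(
          /var/lib/ovn-ic/etc/enable_dynamic_cpu_affinity   # absent on this node, per the log above
          # ... other managed files ...
      )

      # Keep only the paths that actually exist on this node.
      existing=()
      for f in "${EXTRA_FILES[@]}"; do
          [ -e "$f" ] && existing+=("$f")
      done

      # Archive only the paths that were found, so one missing file cannot fail the whole backup.
      # GNU tar's --ignore-failed-read is another way to keep such read errors non-fatal.
      if [ "${#existing[@]}" -gt 0 ]; then
          tar czf /var/recovery/extras.tgz "${existing[@]}"
      fi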
      
