OpenShift API for Data Protection / OADP-324

Add label to expired backups failing garbage collection


Sprints: OADP Sprint 216, OADP Sprint 217, OADP Sprint 218

Resolution summary: Labels are added to backups that fail garbage collection after they expire. This lets users filter for these backups for further processing.

Original issue:
It seems that backups in the FailedValidation phase are not garbage-collected when their expiration time is reached.

Proposed solution: allow users to act on backup deletion failures.
For example, a user could run the following manually or in a script:
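~~~
oc get backup -n openshift-adp -l velero.io/deleteFailReason=BSLNotFound -o name
# oc delete does not take names piped on stdin, so select by label directly
oc delete backup -n openshift-adp -l velero.io/deleteFailReason=BSLNotFound
~~~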
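The selector-based cleanup can also be wrapped in a small script that lists matches first and only deletes when asked. This is a minimal sketch: the function name `cleanup_gc_failed_backups` and the `DRY_RUN` switch are hypothetical, and the default selector reuses the label key proposed in this issue, which may not match the key the merged upstream change actually sets. Check the labels on your own backups before relying on it.

~~~sh
#!/bin/sh
# Hypothetical helper: delete Backup CRs that carry a GC-failure label.
# Assumption: the default selector is the key proposed in this issue;
# the label Velero actually sets may differ.
cleanup_gc_failed_backups() {
  ns="${1:-openshift-adp}"
  selector="${2:-velero.io/deleteFailReason=BSLNotFound}"

  # List matching backups as "backup.velero.io/<name>" lines.
  names=$(oc get backup -n "$ns" -l "$selector" -o name) || return 1

  if [ -z "$names" ]; then
    echo "no backups match $selector in $ns"
    return 0
  fi

  echo "$names"

  # Safe by default: only delete when DRY_RUN=0 is set explicitly.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "dry run: not deleting"
    return 0
  fi

  # Unquoted on purpose: each line becomes a separate TYPE/NAME argument.
  # shellcheck disable=SC2086
  oc delete -n "$ns" $names
}
~~~

Usage: run `cleanup_gc_failed_backups` to review the matches, then `DRY_RUN=0 cleanup_gc_failed_backups` to delete them.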
       

This is related to the Slack discussion https://coreos.slack.com/archives/C0144ECKUJ0/p1646422802402329?thread_ts=1646405908.138569&cid=C0144ECKUJ0
Upstream issue: https://github.com/vmware-tanzu/velero/issues/4728
PR: https://github.com/vmware-tanzu/velero/pull/4757
       

It also seems to have some overlap with https://issues.redhat.com/browse/OADP-178

The reasoning behind OADP-178 seems to be that a failed backup may not be in the bucket at all, and the failure is stored on the cluster, where the user can check what the error is.

This supports the argument that FailedValidation backups should not be synced automatically, so that the user can still investigate the issue.

       

I think an expired backup in the FailedValidation state is different. The user set this backup to be deleted after a certain duration, and that duration has now elapsed. The object is not removed as soon as it stops syncing; it is removed only when its expiration time is due. The user has therefore had the whole lifetime of the resource to investigate and debug. If expired backups in the FailedValidation state are never deleted, the user is left with leftover resources that must be deleted manually.
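To find the leftover resources described above, one option is to list backups with their phase and expiration time and keep only the expired FailedValidation ones. This is a minimal sketch, assuming input lines of the form `NAME PHASE EXPIRATION` (as produced by `oc` custom-columns output) and RFC 3339 UTC timestamps such as `2022-03-04T16:52:08Z`, which compare correctly as plain strings. The function name `expired_failed_validation` is hypothetical.

~~~sh
#!/bin/sh
# Hypothetical helper: print names of FailedValidation backups whose
# expiration time has passed. Reads "NAME PHASE EXPIRATION" lines on stdin.
# Optional $1 overrides "now" (same RFC 3339 UTC format), e.g. for testing.
expired_failed_validation() {
  now="${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  # Same-shaped UTC timestamps sort lexicographically, so awk's string
  # comparison is enough; "<none>" marks a backup with no expiration set.
  awk -v now="$now" '$2 == "FailedValidation" && $3 != "<none>" && $3 < now { print $1 }'
}
~~~

Usage: `oc get backup -n openshift-adp --no-headers -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,EXP:.status.expiration | expired_failed_validation`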

       
I have this invalid backup; its storage location was set incorrectly:
~~~
status:
  expiration: '2022-03-04T16:52:08Z'   # <<<
  formatVersion: 1.1.0
  phase: FailedValidation              # <<<
  validationErrors:
    - >-
      an existing backup storage location wasn't specified at backup creation
      time and the server default 'default' doesn't exist. Please address this
      issue (see `velero backup-location -h` for options) and create a new
      backup. Error: BackupStorageLocation.velero.io "default" not found
~~~

This is what the log says:
~~~
time="2022-03-04T19:20:08Z" level=info msg="Backup has expired" backup=openshift-adp/acm-validation-policy-schedule-20220303174043 controller=gc expiration="2022-03-04 16:51:52 +0000 UTC" logSource="pkg/controller/gc_controller.go:135"
time="2022-03-04T19:20:08Z" level=warning msg="Backup cannot be garbage-collected because backup storage location default does not exist" backup=openshift-adp/acm-validation-policy-schedule-20220303174043 controller=gc expiration="2022-03-04 16:51:52 +0000 UTC" logSource="pkg/controller/gc_controller.go:143"
~~~

People: Tiger Kaovilai (tkaovila@redhat.com), Valentina Birsan (vbirsan@redhat.com), Prasad Joshi
