Bug
Resolution: Done
Normal
OADP 1.1.0
Passed
OADP Sprint 216, OADP Sprint 217, OADP Sprint 218
Untriaged
Resolution summary: labels are added to backups that fail garbage collection after they expire, so that users can filter for these backups and process them further.
Original issue:
Backups in the FailedValidation state do not appear to be garbage-collected when their expiration time is reached.
Proposed solution: allow the user to act on backup deletion failures.
The user can run something like the following, manually or from a script:
~~~
oc get backup -n openshift-adp -l velero.io/deleteFailReason=BSLNotFound -o name | xargs oc delete -n openshift-adp
~~~
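The one-liner above can be wrapped so that the matching backups are listed before anything is deleted. This is a minimal sketch, not a tested procedure: the helper names `list_gc_failed_backups` and `delete_gc_failed_backups` and the `OC` and `NAMESPACE` variables are made up for illustration, and the label key is the one proposed in this issue, which may differ from what the merged upstream change uses.

```shell
# Sketch only: helper names, OC and NAMESPACE are illustrative assumptions.
OC="${OC:-oc}"
NAMESPACE="${NAMESPACE:-openshift-adp}"

# Print the names of backups labeled with the given GC failure reason.
list_gc_failed_backups() {
  "$OC" get backup -n "$NAMESPACE" \
    -l "velero.io/deleteFailReason=$1" -o name
}

# Delete those backups; do nothing when no backup carries the label.
delete_gc_failed_backups() {
  names="$(list_gc_failed_backups "$1")"
  if [ -z "$names" ]; then
    echo "no backups labeled $1"
    return 0
  fi
  echo "$names" | xargs "$OC" delete -n "$NAMESPACE"
}
```

Running `list_gc_failed_backups BSLNotFound` first gives a preview; `delete_gc_failed_backups BSLNotFound` then removes the same set.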
This is related to the Slack discussion at https://coreos.slack.com/archives/C0144ECKUJ0/p1646422802402329?thread_ts=1646405908.138569&cid=C0144ECKUJ0
Upstream issue: https://github.com/vmware-tanzu/velero/issues/4728
PR: https://github.com/vmware-tanzu/velero/pull/4757
It also seems to have some points in common with https://issues.redhat.com/browse/OADP-178
The thinking there seems to be that a failed backup may not be in the bucket at all, and that the failure is stored on the cluster so that you can check what the error is.
This makes sense as an argument that FailedValidation backups should not be automatically synced, so that the user can still investigate the issue.
I think that an expired backup in the FailedValidation state is different. The user set this backup to be deleted after a certain duration, and the object has simply expired. The object is not removed as soon as it is not syncing; it is removed when its expiration time is due. So the user has had enough time to investigate and debug while the resource was available. If we never delete expired backups in the FailedValidation state, the user is left with leftover resources that have to be deleted manually.
I have this invalid backup - the storage location was set incorrectly
status:
  expiration: '2022-03-04T16:52:08Z'  # <<<
  formatVersion: 1.1.0
  phase: FailedValidation  # <<<
  validationErrors:
    - >-
      an existing backup storage location wasn't specified at backup creation
      time and the server default 'default' doesn't exist. Please address this
      issue (see `velero backup-location -h` for options) and create a new
      backup. Error: BackupStorageLocation.velero.io "default" not found
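The two marked fields, `expiration` and `phase`, are what identify this case: a backup that failed validation and is already past its expiry. A rough way to test the expiry half from a shell is sketched below. `is_expired` is a hypothetical helper, and it assumes GNU `date` (the `-d` option), so it will not work unchanged with BSD/macOS `date`.

```shell
# is_expired TIMESTAMP
# Succeeds (exit 0) when the RFC 3339 timestamp is not in the future.
# Hypothetical helper; assumes GNU date for the -d option.
is_expired() {
  expiry_epoch="$(date -u -d "$1" +%s)" || return 2
  now_epoch="$(date -u +%s)"
  [ "$expiry_epoch" -le "$now_epoch" ]
}

# Example with the value from the status above:
#   is_expired '2022-03-04T16:52:08Z' && echo "expired"
```

On a cluster, the timestamp could be read with `oc get backup <name> -n openshift-adp -o jsonpath='{.status.expiration}'` and passed to the helper.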
This is what the log says:
time="2022-03-04T19:20:08Z" level=info msg="Backup has expired" backup=openshift-adp/acm-validation-policy-schedule-20220303174043 controller=gc expiration="2022-03-04 16:51:52 +0000 UTC" logSource="pkg/controller/gc_controller.go:135"
time="2022-03-04T19:20:08Z" level=warning msg="Backup cannot be garbage-collected because backup storage location default does not exist" backup=openshift-adp/acm-validation-policy-schedule-20220303174043 controller=gc expiration="2022-03-04 16:51:52 +0000 UTC" logSource="pkg/controller/gc_controller.go:143"
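To find every backup affected by this warning, the gc controller lines can be filtered out of the Velero log. The function below is a sketch that assumes the log format shown above (a `controller=gc` field and a `backup=<namespace>/<name>` field); `gc_stuck_backups` is an illustrative name, not part of Velero or OADP.

```shell
# gc_stuck_backups: read Velero log lines on stdin and print the unique
# namespace/name of each backup the gc controller could not collect.
# Illustrative helper; relies on the log format shown in this issue.
gc_stuck_backups() {
  grep 'controller=gc' \
    | grep 'cannot be garbage-collected' \
    | sed -n 's/.* backup=\([^ ]*\) .*/\1/p' \
    | sort -u
}
```

It could be fed with something like `oc logs deployment/velero -n openshift-adp | gc_stuck_backups`, assuming the Velero deployment is named `velero` in that namespace.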
Relates to: OADP-178 Failed/PartiallyFailed backups hang on Kubernetes although removed from bucket (Closed)