Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: OADP 1.1.0
Affects Version/s: OADP 1.1.0
Component/s: velero
Labels:
- Pull_Request_Sent
- velero-1.9

Blocked:
False
Blocked Reason:
None
Ready:
False
QEStatus:
Passed
Market:

Sprint:
OADP Sprint 216, OADP Sprint 217, OADP Sprint 218
sprint_count:
3
Cost of Delay:
0
WSJF:
0
Risk Score:
0

Root Cause:
Untriaged

Regression:
None

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Resolution summary: Labels are added to backups that fail garbage collection after expiry. This allows users to filter for these backups for further processing.

Original issue:
It seems that backups in the FailedValidation state are not garbage-collected when their expiration time is due.

Proposed solution: allow the user to take action on backup deletion failures.
The user can, manually or in a script, do something like:
~~~
oc get backup -n openshift-adp -l velero.io/deleteFailReason="BSLNotFound" -o name | xargs oc delete -n openshift-adp
~~~
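
For use in a script, the command above could be wrapped so that the matching backups are listed before anything is deleted. This is only a sketch under stated assumptions: the function name `delete_gc_failed_backups` and the `OC` and `DRY_RUN` variables are hypothetical, and the label key `velero.io/deleteFailReason` is taken from the proposal above and may differ in the released version.

```shell
#!/bin/sh
# Sketch: delete backups labelled as having failed garbage collection.
# Assumptions (not from the issue): the function name, OC and DRY_RUN.
# The label key is the one proposed above; check it against your Velero version.

OC="${OC:-oc}"            # CLI to use; kubectl should work as well
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print what would be deleted

delete_gc_failed_backups() {
    ns="$1"
    reason="$2"
    # -o name prints one "backup.velero.io/<name>" entry per line.
    names=$("$OC" get backup -n "$ns" \
        -l "velero.io/deleteFailReason=$reason" -o name) || return 1
    if [ -z "$names" ]; then
        echo "no backups labelled $reason in $ns"
        return 0
    fi
    for name in $names; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "would delete $name"
        else
            "$OC" delete -n "$ns" "$name" || return 1
        fi
    done
}
```

Calling `delete_gc_failed_backups openshift-adp BSLNotFound` only prints the candidates by default; setting `DRY_RUN=0` performs the deletion.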

This is related to the Slack discussion at https://coreos.slack.com/archives/C0144ECKUJ0/p1646422802402329?thread_ts=1646405908.138569&cid=C0144ECKUJ0
Upstream issue: https://github.com/vmware-tanzu/velero/issues/4728
PR: https://github.com/vmware-tanzu/velero/pull/4757

It also seems to have some points in common with https://issues.redhat.com/browse/OADP-178

The thinking there seems to be that a failed backup may not be in the bucket at all, and that the failure is stored on the cluster so that you can check what the error is.

This makes sense for the argument that FailedValidation backups should not be automatically synced so that the user can still investigate the issue.

I think that an expired backup in the FailedValidation state is different. The user set this backup to be deleted after a certain duration, and the object has now expired. The object is not removed as soon as it is not syncing; it is removed only when its expiration time is due. So the user must have had enough time to investigate and debug while the resource was available. If we never delete expired backups in the FailedValidation state, the user will have to deal with leftover resources that must be deleted manually.

I have this invalid backup; the storage location was set incorrectly:

status:
  expiration: '2022-03-04T16:52:08Z'   <<<<<<
  formatVersion: 1.1.0
  phase: FailedValidation   <<<
  validationErrors:
    - >-
      an existing backup storage location wasn't specified at backup creation
      time and the server default 'default' doesn't exist. Please address this
      issue (see `velero backup-location -h` for options) and create a new
      backup. Error: BackupStorageLocation.velero.io "default" not found

This is what the log says:
time="2022-03-04T19:20:08Z" level=info msg="Backup has expired" backup=openshift-adp/acm-validation-policy-schedule-20220303174043 controller=gc expiration="2022-03-04 16:51:52 +0000 UTC" logSource="pkg/controller/gc_controller.go:135"
time="2022-03-04T19:20:08Z" level=warning msg="Backup cannot be garbage-collected because backup storage location default does not exist" backup=openshift-adp/acm-validation-policy-schedule-20220303174043 controller=gc expiration="2022-03-04 16:51:52 +0000 UTC" logSource="pkg/controller/gc_controller.go:143"
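
To find other backups in the same situation (expired but still in FailedValidation), the `status.phase` and `status.expiration` fields can be compared against the current time. The following is a rough sketch, assuming `oc` with JSONPath output and UTC timestamps in the format shown above. The function names and the `OC` variable are hypothetical.

```shell
#!/bin/sh
# Sketch: list backups that are expired and still in FailedValidation.
# Assumptions (not from the issue): the function names and the OC variable.

OC="${OC:-oc}"   # CLI to use; kubectl should work as well

# Reads "name<TAB>phase<TAB>expiration" lines on stdin and prints the names
# whose phase is FailedValidation and whose expiration is earlier than $1.
# Both timestamps must be UTC in the form 2022-03-04T16:52:08Z, so that
# plain string comparison matches chronological order.
filter_expired_failed_validation() {
    awk -F '\t' -v now="$1" \
        '$2 == "FailedValidation" && $3 != "" && $3 < now { print $1 }'
}

# Prints the names of expired FailedValidation backups in namespace $1.
list_expired_failed_validation() {
    ns="$1"
    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    "$OC" get backup -n "$ns" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.expiration}{"\n"}{end}' \
        | filter_expired_failed_validation "$now"
}
```

For example, `list_expired_failed_validation openshift-adp` prints one backup name per line. Backups without an expiration value are skipped.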

relates to

OADP-178 Failed/PartiallyFailed backups hang on Kubernetes although removed from bucket

Closed

links to

PR #4757

TTL: Expired Backup should be garbage collected even if they fail validation

Assignee: Tiger Kaovilai

Reporter: Valentina Birsan

QA Contact: Prasad Joshi

Votes: 0

Watchers: 5

Created: 2022/03/04 9:07 PM

Updated: 2022/09/01 1:28 AM

Resolved: 2022/09/01 1:28 AM
