Bug | Critical | Resolution: Unresolved | Status: ToDo | Labels: Customer Escalated, Customer Facing
Dear team,
When restoring a PVC that was at 100% disk utilization at backup time, the restore fails with a "disk full" error.
Steps to reproduce:
1. Create an app/pod with a PVC
2. Fill this PVC to 100% usage with "dd" or a similar tool
3. Take a backup using OADP
4. Restore from the backup into a new namespace (or the same one)
5. The restore fails with a "disk full" error message, and the pod using this PVC hangs in the "restore-wait" init container
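The reproduction above can be sketched roughly as follows. All names (namespace "demo", pod "app-0", mount path /data, backup name) are placeholders, and the exact Velero flags may differ depending on the OADP/Velero version in use:

```shell
# 1.-2. Fill the PVC (assumed mounted at /data in pod "app-0") to 100% usage
kubectl -n demo exec app-0 -- sh -c 'dd if=/dev/zero of=/data/filler bs=1M || true'

# 3. Back up the namespace with file-system backup of volumes
velero backup create full-pvc-backup \
  --include-namespaces demo \
  --default-volumes-to-fs-backup

# 4. Restore into a new namespace
velero restore create --from-backup full-pvc-backup \
  --namespace-mappings demo:demo-restored

# 5. Observe the pod stuck on its init container
kubectl -n demo-restored get pods
kubectl -n demo-restored describe pod app-0
```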
Workaround:
1. Kill the hanging pod. It respawns and comes up fine, since the "restore-wait" init container was killed and no longer blocks pod startup.
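In practice the workaround is a single pod deletion (namespace and pod name are hypothetical):

```shell
# Delete the stuck pod; its controller recreates it and the pod starts normally
kubectl -n demo-restored delete pod app-0
```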
Reason:
1. PVCs are recreated from the stored config
2. Data is copied to these PVCs from the backup files
3. HERE IT HAPPENS: a "done" file has to be written to a hidden ".velero" directory in the root path of the PVC, and the "restore-wait" init container waits for this "done" file to appear
4. Since the PVC is at 100% usage after the data restore, there is no space left on the device to create this "done" file
Solution:
Separate the user data on the volume from the signaling files needed by the restore process.
Mitigation in Lab Setup:
1. Mount the PVC into the pod
2. Create and mount an "emptyDir" volume at PVCroot/.velero
3. The user data at 100% usage is restored to the recreated PVC, while Velero's "done" file is written to the emptyDir mount and is therefore unaffected by the original PVC being at 100% usage
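The lab mitigation can be sketched as a pod spec fragment, assuming the PVC is mounted at /data; the image, volume names, and claim name are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-0
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    volumeMounts:
    - name: data
      mountPath: /data            # user data, restored to 100% usage
    - name: velero-signal
      mountPath: /data/.velero    # emptyDir shadows .velero on the full PVC
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data
  - name: velero-signal
    emptyDir: {}                  # the "done" file lands here instead
```

Because the more specific mount path (/data/.velero) shadows the PVC's own .velero directory, the "done" file is written to the emptyDir and never competes with user data for space on the PVC.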
Thanks, Chris