-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
OADP 1.4.5
-
Incidents & Support
-
3
-
False
-
-
False
-
ToDo
-
-
-
Important
-
Very Likely
-
0
-
None
-
Unset
-
Unknown
-
None
Description of problem:
With FSB mode, old PersistentVolumes (this cluster has a long history, OCP 4.8), have labels 'failure-domain.beta.kubernetes.io/region' and 'failure-domain.beta.kubernetes.io/zone' which are now deprecated in favor of 'nodeAffinity ... topology.disk.csi.azure.com/zone|region' labels.
When restoring a PV which has these old labels, they see, after restic restore, that velero-server adds these 'failure-domain' labels to the restored persistent volume while the final restored pod is up and running.
This only happens after restic completes, before, the PV doesn't have any labels, though it does have the topology nodeAffinity which is in the zone the pod, which was previously created during the initial phase of restoration, got scheduled onto.
This ends with a restored pod scheduled onto a node within Zone A for example, and with a PV with 'failure-domain' labels pointing to Zone B but with the original nodeAffinity pointing to node labels 'topology.disk.csi.azure.com/zone|region' on Zone A
If they restart the pod, it won't start because the volume now has these 'failure-domain' labels pointing to a different zone.
They have to remove the labels manually to allow the pod to start.
Version-Release number of selected component (if applicable):
OCP 4.18
OADP 1.4.5
How reproducible:
All the time in customer environment with PVs with old 'failure-domain' labels
Steps to Reproduce:
1. Have an OCP cluster with some history related to upgrades, in this case the cluster was installed initially on OCP 4.8 in Azure cloud. It might also reproduce by adding labels to PVs.
2. Have PVs created in previous versions where the 'failure-domain.beta.kubernetes.io' labels were added. Currently, those labels are deprecated in favor of 'topology.disk.csi.azure.com' ones
3. Backup and restore DeploymentConfigs using these volumes.
Actual results:
After pod restore, check the labels in the PV associated to it looking for old ones (failure-domain), restart the pod, it won't be able to start because velero adds old 'failure-domain' labels to the associated PV, pointing to a different zone than the one exposed in the nodeAffinity 'topology.disk.csi.azure.com' label in the restored PV.
Expected results:
Velero to skip this step where it adds labels after restore completes
Additional info:
Workaround is to either manually delete old labels from restored PV or remove old labels from source PV before backing them up.
resource-modifier solution didn't work because the pod to be restored doesn't have these old labels , and initially, the to be restored PV doesn't have the labels either, it is after restic completes the restore that velero add these labels