Description of problem:
When a Deployment that references an ImageStream is restored, the field spec.template.spec.containers[0].image is not set properly.
The field is then updated by an OpenShift controller to match the ImageStreamTag, which triggers the creation of a new ReplicaSet and a Pod rollout.
The Pod rollout breaks FSB volume restoration and the Pod restoration postHooks.
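For context, on OpenShift a Deployment typically references an ImageStream through the image.openshift.io/triggers annotation; the image-trigger controller then patches spec.template.spec.containers[0].image whenever the referenced ImageStreamTag resolves to a different image, which is the update described above. A minimal sketch of such a Deployment (the mariadb names and the openshift namespace are illustrative assumptions, not the exact reproducer attachment):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: oadp-restore-issue
  namespace: oadp-restore-issue
  annotations:
    # The image-trigger controller keeps the container image in sync with the
    # ImageStreamTag and rewrites spec.template.spec.containers[0].image,
    # including right after a restore.
    image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"mariadb:latest","namespace":"openshift"},"fieldPath":"spec.template.spec.containers[?(@.name==\"mariadb\")].image"}]'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oadp-restore-issue
  template:
    metadata:
      labels:
        app: oadp-restore-issue
    spec:
      containers:
        - name: mariadb
          image: image-registry.openshift-image-registry.svc:5000/openshift/mariadb:latest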
Version-Release number of selected component (if applicable):
- OCP-v4.16.0-ec.5
- oadp-operator.v1.3.1
kind: DataProtectionApplication
apiVersion: oadp.openshift.io/v1alpha1
metadata:
  name: dpa-instance
  namespace: openshift-adp
spec:
  backupLocations:
    - name: minio-internal
      velero:
        config:
          insecureSkipTLSVerify: "true"
          profile: default
          region: us-east-1
          s3ForcePathStyle: "true"
          s3Url: https://minio.minio.svc.cluster.local
        credential:
          key: cloud
          name: oadp-credentials-minio-internal
        default: true
        objectStorage:
          bucket: oadp
          prefix: velero
        provider: aws
  configuration:
    nodeAgent:
      enable: true
      uploaderType: kopia
    velero:
      defaultPlugins:
        - aws
        - csi
        - kubevirt
        - openshift
How reproducible:
I narrowed it down to a lightweight reproducer; see attachment reproducer.yaml.
Steps to Reproduce:
1. Create a simple workload with a Namespace, a PVC, and a Deployment referencing an ImageStream (e.g. mariadb), see attachment reproducer.yaml
> oc apply -f reproducer.yaml
> oc -n oadp-restore-issue wait deployment/oadp-restore-issue --for=condition=Available
2. Create a Backup of this workload, see attachment backup.yaml (a sketch of the backup/restore hooks follows these steps)
> oc apply -f backup.yaml
> oc wait -f backup.yaml --for=jsonpath='{.status.phase}=Completed'
3. Note the specific content of the Pod Volume
> oc -n oadp-restore-issue exec deployment/oadp-restore-issue -- ls /var/lib/mysql/data/
- oadp-restore-issue-9b97bf565-sxf9z.pid: this is also the name of the Pod
- velero-backup-pre-hook: created by the backup preHook and included in the Backup
- velero-backup-post-hook: created by the backup postHook and not included in the Backup
4. Delete the workload
> oc delete -n oadp-restore-issue deploy/oadp-restore-issue pvc/oadp-restore-issue
5. Restore the workload, see attachment restore.yaml
> oc apply -f restore.yaml
> oc wait -f restore.yaml --for=jsonpath='{.status.phase}=Completed'
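The backup preHook/postHook and the restore postHook referenced in these steps are defined in the attached backup.yaml and restore.yaml. As a rough sketch of what such Velero exec hooks look like (the container name mariadb, the hook commands, and defaultVolumesToFsBackup are assumptions, not the exact attachments):

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: oadp-restore-issue
  namespace: openshift-adp
spec:
  includedNamespaces:
    - oadp-restore-issue
  defaultVolumesToFsBackup: true  # assumed way of enabling FSB for the PVC
  hooks:
    resources:
      - name: oadp-restore-issue
        includedNamespaces:
          - oadp-restore-issue
        pre:
          - exec:
              # Creates the file listed in step 3 before the volume is backed up.
              container: mariadb
              command: ["/bin/sh", "-c", "touch /var/lib/mysql/data/velero-backup-pre-hook"]
        post:
          - exec:
              # Runs after the volume backup, so the file is not part of the Backup.
              container: mariadb
              command: ["/bin/sh", "-c", "touch /var/lib/mysql/data/velero-backup-post-hook"]
---
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: oadp-restore-issue
  namespace: openshift-adp
spec:
  backupName: oadp-restore-issue
  hooks:
    resources:
      - name: oadp-restore-issue
        includedNamespaces:
          - oadp-restore-issue
        postHooks:
          - exec:
              # Expected to create velero-restore-post-hook once the Pod is Ready.
              container: mariadb
              command: ["/bin/sh", "-c", "touch /var/lib/mysql/data/velero-restore-post-hook"]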
Actual results:
While the Restore is reported as successful, the Pod restoration postHook was not executed and the Volume content was not restored.
> oc -n oadp-restore-issue exec deployment/oadp-restore-issue -- ls /var/lib/mysql/data/
- oadp-restore-issue-6b5754d444-lpccv.pid: the previous pid file is missing and a new one has been created with a new Pod name
- velero-backup-pre-hook: the file has not been restored and is missing
- velero-restore-post-hook: the file has not been created by the Restore postHook
The problem does not occur when restoring the Pod without the Deployment (--exclude-resources=deployments.apps).
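For reference, the equivalent of that CLI flag on the Restore resource itself is spec.excludedResources; a minimal sketch (the restore name is illustrative):

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: oadp-restore-issue-no-deployments
  namespace: openshift-adp
spec:
  backupName: oadp-restore-issue
  # Skip restoring the Deployment so no ImageStream-triggered rollout occurs.
  excludedResources:
    - deployments.apps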
Expected results:
The Pod restoration postHook must be executed and the Volume content restored.
Additional info:
In the Restore logs (see attachment restore.log), we can see that the post-restore hooks were not executed because the restored Pod disappeared before being Ready:
> level=warning msg="pod entered phase Failed before some post-restore exec hooks ran"
> level=error msg="hook oadp-restore-issue in container oadp-restore-issue in pod oadp-restore-issue/oadp-restore-issue-9b97bf565-sxf9z not executed: context canceled"
> level=info msg="Waiting for all post-restore-exec hooks to complete"
> level=info msg="Done waiting for all post-restore exec hooks to complete"
Although I couldn't find Volume restoration error messages in the logs, I do think both issues are related.
Upon investigation, it seems the issue is triggered by the restore of the Deployment:
1. The Deployment is backed up with the field spec.template.spec.containers[0].image set to image-registry.openshift-image-registry.svc:5000/openshift/mariadb@sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45 (see attachment deployment.backup.yaml, extracted from the BackupDownload backup-data.tar.gz)
2. The Deployment is restored with the field spec.template.spec.containers[0].image set to image-registry.openshift-image-registry.svc:5000/openshift/mariadb; the image hash is missing (see attachment deployment.restore.yaml)
3. The Deployment is updated by an OpenShift controller to set the field spec.template.spec.containers[0].image back to image-registry.openshift-image-registry.svc:5000/openshift/mariadb@sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45, which matches the ImageStreamTag (see attachment deployment.restore.yaml; the field can also be checked directly with the command after this list)
4. This update triggers a rollout of the Pod, which is terminated before it is Ready, so the Restore postHooks are not executed.
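To observe the image field being rewritten after the restore, it can be queried directly (assuming the reproducer namespace and Deployment name):
> oc -n oadp-restore-issue get deployment/oadp-restore-issue -o jsonpath='{.spec.template.spec.containers[0].image}'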