Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: OADP 1.4.1
Affects Version/s: OADP 1.3.1
Component/s: Documentation
Labels:
- triaged

Activity Type:
Quality / Stability / Reliability
Story Points:
3
Blocked:
False
Blocked Reason:

None

Ready:
False
QEStatus:
ToDo
Intelligence Requested:
Market:

Sprint:
Sprint 13-MMSDOCS 2024
sprint_count:
1
Severity:
Important
WSJF:
2.667
Risk Probability:
Very Likely
Risk Score:
0
Cost of Delay:
8

Workstream:

None

Root Cause:
Unset
Failure Category:
Unknown

Regression:
No

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem:

When a Deployment referencing an ImageStream is restored, the field spec.template.spec.containers[0].image is not set properly: the image digest is missing.

This field is updated afterwards by an OpenShift controller to match the ImageStreamTag, which triggers the creation of a new ReplicaSet and a Pod rollout.

The Pod rollout breaks FSB volume restoration and Pod restoration postHooks.

Version-Release number of selected component (if applicable):

OCP-v4.16.0-ec.5
oadp-operator.v1.3.1

kind: DataProtectionApplication
apiVersion: oadp.openshift.io/v1alpha1
metadata:
  name: dpa-instance
  namespace: openshift-adp
spec:
  backupLocations:
    - name: minio-internal
      velero:
        config:
          insecureSkipTLSVerify: "true"
          profile: default
          region: us-east-1
          s3ForcePathStyle: "true"
          s3Url: https://minio.minio.svc.cluster.local
        credential:
          key: cloud
          name: oadp-credentials-minio-internal
        default: true
        objectStorage:
          bucket: oadp
          prefix: velero
        provider: aws
  configuration:
    nodeAgent:
      enable: true
      uploaderType: kopia
    velero:
      defaultPlugins:
        - aws
        - csi
        - kubevirt
        - openshift

How reproducible

Always. I narrowed the issue down to a lightweight reproducer, see attachment reproducer.yaml.

Steps to Reproduce

1. Create a simple workload with a Namespace, a PVC, and a Deployment referencing an imagestream (e.g. mariadb), see attachment reproducer.yaml

> oc apply -f reproducer.yaml
> oc -n oadp-restore-issue wait deployment/oadp-restore-issue --for=condition=Available

2. Create a Backup of this workload, see attachment backup.yaml

> oc apply -f backup.yaml
> oc wait -f backup.yaml --for=jsonpath='{.status.phase}=Completed'

3. Note the specific content of the Pod Volume

> oc -n oadp-restore-issue exec deployment/oadp-restore-issue -- ls /var/lib/mysql/data/

oadp-restore-issue-9b97bf565-sxf9z.pid: this is also the name of the Pod
velero-backup-pre-hook: created by the backup preHook and included in the Backup
velero-backup-post-hook: created by the backup postHook and not included in the Backup

4. Delete the workload

> oc delete -n oadp-restore-issue deploy/oadp-restore-issue pvc/oadp-restore-issue

5. Restore the workload, see attachment restore.yaml

> oc apply -f restore.yaml
> oc wait -f restore.yaml --for=jsonpath='{.status.phase}=Completed'

Actual results

Although the Restore is reported as successful, the Pod restoration postHook was
not executed and the Volume content was not restored.

> oc -n oadp-restore-issue exec deployment/oadp-restore-issue -- ls /var/lib/mysql/data/

oadp-restore-issue-6b5754d444-lpccv.pid: the previous pid file is missing and a new one has been created with a new Pod name
velero-backup-pre-hook: the file has not been restored and is missing
velero-restore-post-hook: the file has not been created by the Restore postHook

The problem does not occur when restoring the Pod without the Deployment (--exclude-resources=deployments.apps).

Expected results

The Pod restoration postHook must be executed and the Volume content restored.

Additional info

In the Restore logs (see attachment restore.log), we can see that the post-restore hooks were not executed because the restored Pod disappeared before being Ready:

> level=warning msg="pod entered phase Failed before some post-restore exec hooks ran"
> level=error msg="hook oadp-restore-issue in container oadp-restore-issue in pod oadp-restore-issue/oadp-restore-issue-9b97bf565-sxf9z not executed: context canceled"
> level=info msg="Waiting for all post-restore-exec hooks to complete"
> level=info msg="Done waiting for all post-restore exec hooks to complete"
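To pull only the relevant lines out of a long restore log, a small filter along the following lines may help. This is a minimal sketch: the function name filter_hook_failures is hypothetical, and it assumes the log uses the level=... msg="..." format shown above.

```shell
#!/bin/sh
# Hypothetical helper: keep only warning/error lines that mention hooks.
# Reads the log on stdin, so it works with a file redirect or a pipe.
filter_hook_failures() {
  # First grep keeps warning and error levels, second keeps hook-related lines.
  # "|| true" avoids a non-zero exit status when nothing matches.
  grep -E 'level=(warning|error)' | grep -i 'hook' || true
}
```

It can be run as `filter_hook_failures < restore.log`; on the excerpt above it should keep the warning and the error line and drop the two info lines.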

Although I couldn't find any Volume restoration error messages in the logs, I do think both problems are related.

Upon investigation it seems the issue is triggered by the Restoration of the Deployment:

1. The Deployment is backed up with field spec.template.spec.containers[0].image set to image-registry.openshift-image-registry.svc:5000/openshift/mariadb@sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45 (see attachment deployment.backup.yaml extracted from BackupDownload in backup-data.tar.gz)

2. The Deployment is restored with field spec.template.spec.containers[0].image set to image-registry.openshift-image-registry.svc:5000/openshift/mariadb, the image hash is missing (see attachment deployment.restore.yaml)

3. The Deployment is updated by an OpenShift controller to set the field spec.template.spec.containers[0].image to image-registry.openshift-image-registry.svc:5000/openshift/mariadb@sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45, which matches the ImageStreamTag (see attachment deployment.restore.yaml)

4. This update triggers a rollout: the restored Pod is terminated before it becomes Ready, so the Restore postHooks are not executed.
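The difference between the backed-up and the restored image reference in the steps above can be checked mechanically. The following is a minimal sketch, not part of OADP or Velero: the function names is_digest_pinned and digest_dropped are hypothetical, and the check only looks for an @sha256: suffix in the reference.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the image reference
# is pinned to a sha256 digest.
is_digest_pinned() {
  case "$1" in
    *@sha256:*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: compare the image recorded in the backup ($1)
# with the image on the restored object ($2) and report whether the
# digest was dropped on the way.
digest_dropped() {
  if is_digest_pinned "$1" && ! is_digest_pinned "$2"; then
    echo "digest dropped"
  else
    echo "digest not dropped"
  fi
}
```

The restored value can be read with `oc -n oadp-restore-issue get deployment/oadp-restore-issue -o jsonpath='{.spec.template.spec.containers[0].image}'`, although it has to be captured before the controller update described in step 3 for the comparison to be meaningful.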