  1. OpenShift API for Data Protection
  2. OADP-3954

Deployment referencing ImageStream not restored properly leading to corrupted Pod / Volume



      Description of problem:

      When a Deployment referencing an ImageStream is restored, the field spec.template.spec.containers[0].image is not set properly: it is restored without the image digest.

      This field is updated afterwards by an OpenShift controller to match the ImageStreamTag, which triggers the creation of a new ReplicaSet and a Pod rollout.

      The Pod rollout breaks File System Backup (FSB) volume restoration and Pod restoration postHooks.

      Version-Release number of selected component (if applicable):

      • OCP-v4.16.0-ec.5
      • oadp-operator.v1.3.1
      kind: DataProtectionApplication
      apiVersion: oadp.openshift.io/v1alpha1
      metadata:
        name: dpa-instance
        namespace: openshift-adp
      spec:
        backupLocations:
          - name: minio-internal
            velero:
              config:
                insecureSkipTLSVerify: "true"
                profile: default
                region: us-east-1
                s3ForcePathStyle: "true"
                s3Url: https://minio.minio.svc.cluster.local
              credential:
                key: cloud
                name: oadp-credentials-minio-internal
              default: true
              objectStorage:
                bucket: oadp
                prefix: velero
              provider: aws
        configuration:
          nodeAgent:
            enable: true
            uploaderType: kopia
          velero:
            defaultPlugins:
              - aws
              - csi
              - kubevirt
              - openshift
      

      How reproducible

      I narrowed the problem down to a lightweight reproducer; see attachment reproducer.yaml.

      Steps to Reproduce

      1. Create a simple workload with a Namespace, a PVC, and a Deployment referencing an ImageStream (e.g. mariadb); see attachment reproducer.yaml

      > oc apply -f reproducer.yaml
      > oc -n oadp-restore-issue wait deployment/oadp-restore-issue --for=condition=Available
      

      2. Create a Backup of this workload, see attachment backup.yaml

      > oc apply -f backup.yaml
      > oc wait -f backup.yaml --for=jsonpath='{.status.phase}=Completed'
      

      3. Note the specific content of the Pod Volume

      > oc -n oadp-restore-issue exec deployment/oadp-restore-issue -- ls /var/lib/mysql/data/
      
      • oadp-restore-issue-9b97bf565-sxf9z.pid: this is also the name of the Pod
      • velero-backup-pre-hook: created by the backup preHook and included in the Backup
      • velero-backup-post-hook: created by the backup postHook and not included in the Backup

      4. Delete the workload

      > oc delete -n oadp-restore-issue deploy/oadp-restore-issue pvc/oadp-restore-issue
      

      5. Restore the workload, see attachment restore.yaml

      > oc apply -f restore.yaml
      > oc wait -f restore.yaml --for=jsonpath='{.status.phase}=Completed'
      

      Actual results

      Although the Restore is reported as successful, the Pod restoration postHook was
      not executed and the Volume content was not restored.

      > oc -n oadp-restore-issue exec deployment/oadp-restore-issue -- ls /var/lib/mysql/data/
      
      • oadp-restore-issue-6b5754d444-lpccv.pid: the previous pid file is missing and a new one has been created with a new Pod name
      • velero-backup-pre-hook: the file has not been restored and is missing
      • velero-restore-post-hook: the file has not been created by the Restore postHook

      The problem does not occur when restoring the Pod without the Deployment (--exclude-resources=deployments.apps).
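
The exclusion mentioned above can also be written declaratively in the Restore manifest. The following is a minimal sketch, assuming the Restore lives in the openshift-adp namespace; the function name render_restore and the Restore and Backup names passed to it are hypothetical and may differ from those in the attached backup.yaml and restore.yaml. This only helps to isolate the problem, since the Deployment itself is then not restored.

```shell
#!/bin/sh
# Sketch: print a Velero Restore manifest that skips Deployments.
# render_restore and its two arguments are illustrative names,
# not taken from the attachments.
render_restore() {
  restore_name=$1
  backup_name=$2
  cat <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ${restore_name}
  namespace: openshift-adp
spec:
  backupName: ${backup_name}
  excludedResources:
    - deployments.apps
EOF
}

# Usage (requires a cluster and an existing Backup):
#   render_restore my-restore my-backup | oc apply -f -
```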

      Expected results

      The Pod restoration postHook must be executed and the Volume content restored.

      Additional info

      In the Restore logs (see attachment restore.log), we can see that the post-restore hooks were not executed because the restored Pod disappeared before being Ready:

      > level=warning msg="pod entered phase Failed before some post-restore exec hooks ran"
      > level=error msg="hook oadp-restore-issue in container oadp-restore-issue in pod oadp-restore-issue/oadp-restore-issue-9b97bf565-sxf9z not executed: context canceled"
      > level=info msg="Waiting for all post-restore-exec hooks to complete"
      > level=info msg="Done waiting for all post-restore exec hooks to complete"
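
To pull these hook failures out of a longer log, filtering on the level field and the word "hook" is usually enough. A minimal sketch, assuming the log keeps the level=... msg="..." layout shown above; the function name hook_failures is hypothetical.

```shell
#!/bin/sh
# Sketch: keep only warning/error lines that mention hooks.
# Reads a Velero restore log on stdin.
hook_failures() {
  grep -E 'level=(warning|error) .*hook'
}

# Usage, with the attached log:
#   hook_failures < restore.log
```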
      

      Although I couldn't find any Volume restoration error messages in the logs, I do think both problems are related.

      Upon investigation, it seems the issue is triggered by the restoration of the Deployment:

      1. The Deployment is backed up with field spec.template.spec.containers[0].image set to image-registry.openshift-image-registry.svc:5000/openshift/mariadb@sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45 (see attachment deployment.backup.yaml, extracted from BackupDownload in backup-data.tar.gz)

      2. The Deployment is restored with field spec.template.spec.containers[0].image set to image-registry.openshift-image-registry.svc:5000/openshift/mariadb, the image hash is missing (see attachment deployment.restore.yaml)

      3. The Deployment is updated by an OpenShift controller to set the field spec.template.spec.containers[0].image to image-registry.openshift-image-registry.svc:5000/openshift/mariadb@sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45, which matches the ImageStreamTag (see attachment deployment.restore.yaml)

      4. This update triggers a Pod rollout: the restored Pod is terminated before it becomes Ready, so the Restore postHooks are not executed.
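
The difference between steps 1 and 2 comes down to whether the image reference carries a digest. A minimal sketch for checking this on a given reference; the function name is_digest_pinned is hypothetical, and the usage comment assumes the namespace and Deployment name from the reproducer.

```shell
#!/bin/sh
# Sketch: succeed if an image reference is pinned to a sha256 digest,
# fail if it is a bare repository or tag reference.
is_digest_pinned() {
  case "$1" in
    *@sha256:*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Usage against a live cluster, right after the Restore completes:
#   image=$(oc -n oadp-restore-issue get deployment/oadp-restore-issue \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   is_digest_pinned "$image" && echo "pinned" || echo "not pinned"
```

Running the check immediately after the Restore and again a few seconds later should show the reference changing from unpinned to pinned if the behaviour described above is what is happening.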

        1. backup.details (4 kB)
        2. backup.log (279 kB)
        3. backup.yaml (1 kB)
        4. backup-data.tar.gz (6 kB)
        5. deployment.backup.yaml (9 kB)
        6. deployment.final.yaml (5 kB)
        7. deployment.restore.yaml (4 kB)
        8. reproducer.yaml (2 kB)
        9. restore.details (2 kB)
        10. restore.log (38 kB)
        11. restore.yaml (0.7 kB)

              rhn-support-shdeshpa Shruti Deshpande
              dollierp@redhat.com Denis Ollier Pinas
              Amos Mastbaum Amos Mastbaum