  1. OpenShift API for Data Protection
  2. OADP-3954

Deployment referencing ImageStream not restored properly leading to corrupted Pod / Volume


    • Type: Bug
    • Resolution: Unresolved
    • OADP 1.3.1
    • openshift-plugin
    • ToDo
    • No
    • Important

      Description of problem:

      When a Deployment referencing an ImageStream is restored, the field spec.template.spec.containers[0].image is not set properly.

      An OpenShift controller subsequently updates this field to match the ImageStreamTag, which triggers the creation of a new ReplicaSet and a Pod rollout.

      The Pod rollout breaks File System Backup (FSB) volume restoration and the Pod restoration postHooks.
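
      The difference between the backed-up and restored image references can be checked mechanically. The following is a minimal sketch, not part of OADP or Velero: the helper name is_pinned_by_digest is hypothetical, and the two image strings are the ones quoted later in this report.

```python
def is_pinned_by_digest(image: str) -> bool:
    """Return True when the image reference ends with an @sha256:<64 hex> digest."""
    _, sep, digest = image.rpartition("@")
    if not sep or not digest.startswith("sha256:"):
        return False
    hex_part = digest[len("sha256:"):]
    return len(hex_part) == 64 and all(c in "0123456789abcdef" for c in hex_part)


REGISTRY_IMAGE = "image-registry.openshift-image-registry.svc:5000/openshift/mariadb"
DIGEST = "sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45"

backed_up_image = f"{REGISTRY_IMAGE}@{DIGEST}"  # value stored in the Backup
restored_image = REGISTRY_IMAGE                 # value observed right after the Restore

print("backup pinned by digest: ", is_pinned_by_digest(backed_up_image))
print("restore pinned by digest:", is_pinned_by_digest(restored_image))
```

      A restored Deployment whose image is no longer pinned is the one that the controller rewrites afterwards.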

      Version-Release number of selected component (if applicable):

      • OCP-v4.16.0-ec.5
      • oadp-operator.v1.3.1

      DataProtectionApplication in use:

      kind: DataProtectionApplication
      apiVersion: oadp.openshift.io/v1alpha1
      metadata:
        name: dpa-instance
        namespace: openshift-adp
      spec:
        backupLocations:
          - name: minio-internal
            velero:
              config:
                insecureSkipTLSVerify: "true"
                profile: default
                region: us-east-1
                s3ForcePathStyle: "true"
                s3Url: https://minio.minio.svc.cluster.local
              credential:
                key: cloud
                name: oadp-credentials-minio-internal
              default: true
              objectStorage:
                bucket: oadp
                prefix: velero
              provider: aws
        configuration:
          nodeAgent:
            enable: true
            uploaderType: kopia
          velero:
            defaultPlugins:
              - aws
              - csi
              - kubevirt
              - openshift
      

      How reproducible

      I narrowed the issue down to a lightweight reproducer; see attachment reproducer.yaml.

      Steps to Reproduce

      1. Create a simple workload with a Namespace, a PVC, and a Deployment referencing an ImageStream (e.g. mariadb); see attachment reproducer.yaml

      > oc apply -f reproducer.yaml
      > oc -n oadp-restore-issue wait deployment/oadp-restore-issue --for=condition=Available
      

      2. Create a Backup of this workload, see attachment backup.yaml

      > oc apply -f backup.yaml
      > oc wait -f backup.yaml --for=jsonpath='{.status.phase}=Completed'
      

      3. Note the specific content of the Pod Volume

      > oc -n oadp-restore-issue exec deployment/oadp-restore-issue -- ls /var/lib/mysql/data/
      
      • oadp-restore-issue-9b97bf565-sxf9z.pid: this is also the name of the Pod
      • velero-backup-pre-hook: created by the backup preHook and included in the Backup
      • velero-backup-post-hook: created by the backup postHook and not included in the Backup

      4. Delete the workload

      > oc delete -n oadp-restore-issue deploy/oadp-restore-issue pvc/oadp-restore-issue
      

      5. Restore the workload, see attachment restore.yaml

      > oc apply -f restore.yaml
      > oc wait -f restore.yaml --for=jsonpath='{.status.phase}=Completed'
      

      Actual results

      Although the Restore is reported as successful, the Pod restoration postHook was not executed and the Volume content was not restored.

      > oc -n oadp-restore-issue exec deployment/oadp-restore-issue -- ls /var/lib/mysql/data/
      
      • oadp-restore-issue-6b5754d444-lpccv.pid: the previous pid file is missing and a new one has been created with a new Pod name
      • velero-backup-pre-hook: the file has not been restored and is missing
      • velero-restore-post-hook: the file has not been created by the Restore postHook

      The problem does not occur when restoring the Pod without the Deployment (--exclude-resources=deployments.apps).
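
      For reference, the workaround can also be expressed as a Restore resource rather than a CLI flag. The sketch below generates such a manifest as JSON; it relies on the spec.excludedResources field of the velero.io/v1 Restore API. The Restore name and the Backup name are placeholders (the attached backup.yaml and restore.yaml may use different names), and the helper build_restore is hypothetical.

```python
import json


def build_restore(name, backup_name, namespace, excluded_resources):
    """Build a velero.io/v1 Restore manifest that skips the given resource types."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Restore",
        "metadata": {"name": name, "namespace": "openshift-adp"},
        "spec": {
            "backupName": backup_name,
            "includedNamespaces": [namespace],
            "excludedResources": list(excluded_resources),
        },
    }


manifest = build_restore(
    name="oadp-restore-issue-no-deployment",  # placeholder Restore name
    backup_name="oadp-restore-issue",         # placeholder, must match the real Backup
    namespace="oadp-restore-issue",
    excluded_resources=["deployments.apps"],
)
print(json.dumps(manifest, indent=2))
```

      The printed JSON can be piped to oc apply -f - in the same way as the attached restore.yaml.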

      Expected results

      The Pod restoration postHook must be executed and the Volume content restored.

      Additional info

      In the Restore logs (see attachment restore.log), we can see that the post-restore hooks were not executed because the restored Pod disappeared before being Ready:

      > level=warning msg="pod entered phase Failed before some post-restore exec hooks ran"
      > level=error msg="hook oadp-restore-issue in container oadp-restore-issue in pod oadp-restore-issue/oadp-restore-issue-9b97bf565-sxf9z not executed: context canceled"
      > level=info msg="Waiting for all post-restore-exec hooks to complete"
      > level=info msg="Done waiting for all post-restore exec hooks to complete"
      

      Although I could not find any Volume restoration error messages in the logs, I believe both failures are related.
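
      The skipped hooks can be extracted from a Restore log with a short script. This is a sketch only: it assumes each log line carries level=... and msg="..." fields as in the excerpt above (the full restore.log may contain additional fields), and the function name skipped_hooks is hypothetical.

```python
import re

# Matches the level and the quoted message of one log line.
LOG_LINE = re.compile(r'level=(?P<level>\w+)\s+msg="(?P<msg>(?:[^"\\]|\\.)*)"')
# Matches the "hook ... not executed" message quoted in this report.
SKIPPED_HOOK = re.compile(
    r"hook (?P<hook>\S+) in container (?P<container>\S+) "
    r"in pod (?P<pod>\S+) not executed: (?P<reason>.+)"
)


def skipped_hooks(log_text):
    """Return one dict per post-restore hook that the log reports as not executed."""
    found = []
    for line in log_text.splitlines():
        entry = LOG_LINE.search(line)
        if entry is None:
            continue
        hook = SKIPPED_HOOK.search(entry.group("msg"))
        if hook is not None:
            found.append(hook.groupdict())
    return found


SAMPLE = (
    'level=warning msg="pod entered phase Failed before some post-restore exec hooks ran"\n'
    'level=error msg="hook oadp-restore-issue in container oadp-restore-issue in pod '
    'oadp-restore-issue/oadp-restore-issue-9b97bf565-sxf9z not executed: context canceled"\n'
    'level=info msg="Waiting for all post-restore-exec hooks to complete"\n'
)

for hook in skipped_hooks(SAMPLE):
    print(hook["pod"], "->", hook["reason"])
```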

      Upon investigation it seems the issue is triggered by the Restoration of the Deployment:

      1. The Deployment is backed up with field spec.template.spec.containers[0].image set to image-registry.openshift-image-registry.svc:5000/openshift/mariadb@sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45 (see attachment deployment.backup.yaml, extracted from BackupDownload in backup-data.tar.gz)

      2. The Deployment is restored with field spec.template.spec.containers[0].image set to image-registry.openshift-image-registry.svc:5000/openshift/mariadb; the image digest is missing (see attachment deployment.restore.yaml)

      3. The Deployment is updated by an OpenShift controller, which sets the field spec.template.spec.containers[0].image to image-registry.openshift-image-registry.svc:5000/openshift/mariadb@sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45 to match the ImageStreamTag (see attachment deployment.final.yaml)

      4. This update triggers a rollout: the restored Pod is terminated before it becomes Ready, so the Restore postHooks are not executed.
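
      The sequence above can be sketched by comparing the container images of the three Deployment states. Any change to spec.template causes a Deployment rollout, so a non-empty difference between the restored and final states corresponds to the rollout in step 4. The manifests below are reduced to the image field for illustration (the real attachments contain full objects), and the helper names are hypothetical.

```python
IMAGE = "image-registry.openshift-image-registry.svc:5000/openshift/mariadb"
DIGEST = "sha256:3dcf999b44b15f270d6ccd7450b95f16c22d8c2081e81b76c9393be854648e45"


def deployment_with_image(image):
    """Build a stripped-down Deployment containing only the container image field."""
    container = {"name": "oadp-restore-issue", "image": image}
    return {"spec": {"template": {"spec": {"containers": [container]}}}}


def container_images(deployment):
    """Map container name to image for the Pod template of a Deployment."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return {c["name"]: c["image"] for c in containers}


def image_changes(before, after):
    """Return {container: (old_image, new_image)} for every image that differs."""
    old, new = container_images(before), container_images(after)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


backed_up = deployment_with_image(f"{IMAGE}@{DIGEST}")  # step 1
restored = deployment_with_image(IMAGE)                 # step 2
final = deployment_with_image(f"{IMAGE}@{DIGEST}")      # step 3

print("restore altered the image:     ", bool(image_changes(backed_up, restored)))
print("controller update -> rollout:  ", bool(image_changes(restored, final)))
print("backup and final states differ:", bool(image_changes(backed_up, final)))
```

      In this model the backed-up and final states are identical, which suggests that no rollout would be needed if the restored Deployment kept the digest recorded in the Backup.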

      Attachments (all uploaded by Denis Ollier Pinas)

        1. restore.yaml (0.7 kB)
        2. restore.log (38 kB)
        3. restore.details (2 kB)
        4. reproducer.yaml (2 kB)
        5. deployment.restore.yaml (4 kB)
        6. deployment.final.yaml (5 kB)
        7. deployment.backup.yaml (9 kB)
        8. backup-data.tar.gz (6 kB)
        9. backup.yaml (1 kB)
        10. backup.log (279 kB)
        11. backup.details (4 kB)

            wnstb Wes Hayutin
            dollierp@redhat.com Denis Ollier Pinas