OpenShift API for Data Protection
OADP-3866

Restore failed, CR stuck on "In Progress", some PVRs missing phase status, with Kopia/Restic

      This bug prevents us from running our Kopia/Restic flows, but our Data Mover flows run fine.
      This bug does not reproduce in 1.3.0 GA (latest) or in 1.3.1-27, but it does reproduce in 1.3.1-54, -57, and -59, where the Velero version changed. Note: this statement is based on runs done on April 10, 2024 (today).
       


       
      Manual reproduction steps: see Steps to Reproduce below (the same 10 ansible-playbook invocations, backup, and restore).

      Description of problem:

      During Kopia or Restic restore operations there is an issue with PodVolumeRestore (PVR) resources: the restore CR ends up as "PartiallyFailed".
      When checking the restore CR, some PVRs are marked as Completed, some as Failed, and those that appear as "New" never started at all.
      This issue has been reproduced during restores using both Restic and Kopia.

      PVR:

      39 - "Completed"
      3  - "Failed"
      58 - "New", without any progress (marked as New in the restore CR describe output; the PVR resources themselves have no "Progress:" status at all)

      [kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc  get  podvolumerestore  -nopenshift-adp |grep restic | grep  Completed  |wc -l
      39
      
      [kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc  get  podvolumerestore  -nopenshift-adp |grep restic | grep Failed  |wc -l
      3
      
      [kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc  get  podvolumerestore  -nopenshift-adp |grep restic | grep -v "Failed\|Completed"  |wc -l
      58
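
      For reference, the PVRs that have no phase at all can also be listed with a custom-columns query instead of grepping the default output. This is a generic oc sketch, not taken from the original report; missing fields are printed as <none>:

      # count PVRs whose .status.phase is unset
      oc get podvolumerestore -n openshift-adp -o custom-columns=NAME:.metadata.name,PHASE:.status.phase | grep -c '<none>'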
      

      PVB:

      [kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc get Podvolumebackup -A |grep data |grep Completed | wc -l 
      100  
      [kni@f07-h27-000-r640 ~]$ velero restore describe restore-restic-datagen-single-ns-100pods-cephrbd
      Name:         restore-restic-datagen-single-ns-100pods-cephrbd
      Namespace:    openshift-adp
      Labels:       <none>
      Annotations:  <none>
      Phase:                       PartiallyFailed (run 'velero restore logs restore-restic-datagen-single-ns-100pods-cephrbd' for more information)
      Total items to be restored:  521
      Items restored:              521
      Started:    2024-04-07 14:33:50 +0000 UTC
      Completed:  2024-04-09 14:33:50 +0000 UTC
      Warnings:
        Velero:     <none>
        Cluster:  could not restore, CustomResourceDefinition "clusterserviceversions.operators.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version
        Namespaces:
          datagen-single-ns-100pods-cephrbd:  could not restore, RoleBinding "system:image-pullers" already exists. Warning: the in-cluster version is different than the backed-up version
                                              could not restore, ConfigMap "kube-root-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
                                              could not restore, ConfigMap "openshift-service-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
                                              could not restore, ClusterServiceVersion "volsync-product.v0.7.4-0.1698026108.p" already exists. Warning: the in-cluster version is different than the backed-up version
                                              could not restore, RoleBinding "system:deployers" already exists. Warning: the in-cluster version is different than the backed-up version
                                              could not restore, RoleBinding "system:image-builders" already exists. Warning: the in-cluster version is different than the backed-up version
                                              could not restore, RoleBinding "system:image-pullers" already exists. Warning: the in-cluster version is different than the backed-up version
      Errors:
        Velero:   pod volume restore failed: data path restore failed: chdir /host_pods/1a3eda7f-cc21-4aa1-ba7d-12b9f4102584/volumes/kubernetes.io~csi/pvc-fc80f14b-0236-44ac-88bc-4c3b4e14aed3/mount: no such file or directory
                  pod volume restore failed: data path restore failed: chdir /host_pods/804a47c6-6eca-44c7-be3a-34a7c4ea749c/volumes/kubernetes.io~csi/pvc-4db942e2-4416-4965-be93-bf5ddb990a78/mount: no such file or directory
                  pod volume restore failed: data path restore failed: chdir /host_pods/cb26a3f3-1bcd-421d-a5ff-77a0e48ca5c0/volumes/kubernetes.io~csi/pvc-e8fa3eb5-a89d-4e93-9b9a-9430915f7b87/mount: no such file or directory
                  timed out waiting for all PodVolumeRestores to complete
                  ... ("timed out waiting for all PodVolumeRestores to complete" is repeated 58 times in total)
        Cluster:    <none>
        Namespaces: <none>
      Backup:  backup-restic-datagen-single-ns-100pods-cephrbd
      Namespaces:
        Included:  all namespaces found in the backup
        Excluded:  <none>
      Resources:
        Included:        *
        Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
        Cluster-scoped:  auto
      Namespace mappings:  <none>
      Label selector:  <none>
      Or label selector:  <none>
      Restore PVs:  auto
      restic Restores (specify --details for more information):
        Completed:  39
        Failed:     3
        New:        58
      Existing Resource Policy:   <none>
      ItemOperationTimeout:       4h0m0s
      Preserve Service NodePorts:  auto
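
      The summary above only shows counts; per-PVR details and the underlying errors can be pulled with the commands the output itself points at (illustration only, using the names from this run and assuming the velero CLI is pointed at the openshift-adp namespace):

      velero restore describe restore-restic-datagen-single-ns-100pods-cephrbd --details -n openshift-adp
      velero restore logs restore-restic-datagen-single-ns-100pods-cephrbd -n openshift-adp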
       

      PV & Pods:

      [kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc get pv -n datagen-single-ns-100pods-cephrbd | grep gen |wc -l 
      100
      [kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc get pods -ndatagen-single-ns-100pods-cephrbd |grep Running |wc -l 
      100  

      From the velero log:

      I0404 15:56:15.980518       1 request.go:690] Waited for 1.045157493s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/flowcontrol.apiserver.k8s.io/v1beta3?timeout=32s
      
      
      E0405 07:40:22.212924       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
      E0405 07:41:15.985431       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
      E0405 07:42:07.039601       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
      E0405 07:42:49.140141       1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
      
      
      
      

      From the node-agent log (lines truncated in the console capture):
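      The excerpts below come from the node-agent pods. A hedged sketch of how similar output can be collected (assuming the OADP 1.3 default layout, where the node-agent runs as a DaemonSet in openshift-adp and its pods are named node-agent-*; adjust if your deployment differs):

      # dump PodVolumeRestore-related lines from every node-agent pod
      for p in $(oc get pods -n openshift-adp -o name | grep node-agent); do
        oc logs -n openshift-adp "$p" | grep PodVolumeRestore
      done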

       time="2024-04-07T14:36:11Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ldw7j control │
      │ time="2024-04-07T14:36:11Z" level=info msg="Got volume dir" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ldw7j controlle │
      │ time="2024-04-07T14:36:11Z" level=info msg="Found path matching glob" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ldw7j │
      │ time="2024-04-07T14:36:11Z" level=info msg="Founding existing repo" backupLocation=bucket logSource="/remote-source/velero/app/pkg/repository/ensurer.go:86 │
      │ time="2024-04-07T14:36:12Z" level=info msg="FileSystemBR is initialized" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ld │
      │ time="2024-04-07T14:36:12Z" level=info msg="Async fs restore data path started" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cep │
      │ time="2024-04-07T14:36:13Z" level=info msg="Error cannot be convert to ExitError format." PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-1 │
      │ time="2024-04-07T14:36:13Z" level=info msg="Run command=restore, stdout=, stderr=" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods- │
      │ time="2024-04-07T14:36:13Z" level=error msg="Async fs restore data path failed" controller=PodVolumeRestore error="chdir /host_pods/cb26a3f3-1bcd-421d-a5ff │
      │ time="2024-04-07T14:36:13Z" level=info msg="FileSystemBR is closed" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ldw7j c │

       

       time="2024-04-07T14:36:09Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gzf79 control │
      │ time="2024-04-07T14:36:09Z" level=info msg="Got volume dir" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gzf79 controlle │
      │ time="2024-04-07T14:36:09Z" level=info msg="Found path matching glob" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gzf79 │
      │ time="2024-04-07T14:36:09Z" level=info msg="Founding existing repo" backupLocation=bucket logSource="/remote-source/velero/app/pkg/repository/ensurer.go:86 │
      │ time="2024-04-07T14:36:10Z" level=info msg="FileSystemBR is initialized" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gz │
      │ time="2024-04-07T14:36:10Z" level=info msg="Async fs restore data path started" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cep │
      │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-pw6kj control │
      │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-52c5q control │
      │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-w8h7f control │
      │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-zhtth control │
      │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-jlc8x control │
      │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-mg6w2 control │
      │ time="2024-04-07T14:36:11Z" level=info msg="Error cannot be convert to ExitError format." PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-1 │
      │ time="2024-04-07T14:36:11Z" level=info msg="Run command=restore, stdout=, stderr=" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods- │
      │ time="2024-04-07T14:36:11Z" level=error msg="Async fs restore data path failed" controller=PodVolumeRestore error="chdir /host_pods/804a47c6-6eca-44c7-be3a │
      │ time="2024-04-07T14:36:11Z" level=info msg="FileSystemBR is closed" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gzf79 c │ 
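
      The chdir errors above reference paths under /host_pods, which in the node-agent DaemonSet is typically a hostPath mount of /var/lib/kubelet/pods on the node. As a hedged diagnostic sketch (node name, pod UID, and PVC ID are placeholders, not values from this run), whether such a path actually exists on the node can be checked with oc debug:

      oc debug node/<node-name> -- chroot /host ls /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pvc-id>/mount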

      This issue was reproduced on 3 different clusters by the Perf & Scale team,
      also on the latest OADP build, 1.3.1-57.

      All the logs from the above cycle can be found here:

      https://drive.google.com/drive/folders/1WKXxlNuugmg_Dc-9k89waM-uIuK0dwLr?usp=sharing 

      Version-Release number of selected component (if applicable):

      OADP 1.3.1-54

      ODF 4.14.6
      OCP 4.14.13

      How reproducible:

      Reproduced on 3 different clusters by the Perf & Scale team.

      Steps to Reproduce:
      1. Clone the repo to your /tmp folder on a machine with access to the OpenShift cluster: git clone git@gitlab.cee.redhat.com:mlehrer/mpqe-scale-scripts.git
      2. Execute the following ansible-playbook command 10 times, varying the -0 through -9 suffix in deployment_name and pvc_name (a convenience loop is sketched after the command list):

      ​ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-0 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-0 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
      ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-1 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-1 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
      ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-2 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-2 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
      ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-3 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-3 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
      ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-4 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-4 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
      ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-5 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-5 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
      ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-6 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-6 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
      ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-7 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-7 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
      ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-8 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-8 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
      ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-9 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-9 sc=ocs-storagecluster-ceph-rbd' -vvvv
      
       

      The commands above will deploy 10 pods with 2 GB of utilized data each in namespace perf-datagen-case3-rbd.
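
      A minimal convenience sketch of the same 10 invocations as a shell loop (same repo path and extra-vars as above; the loop itself is not part of the original scripts):

      for i in $(seq 0 9); do
        ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml \
          --extra-vars "dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-${i} dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-${i} sc=ocs-storagecluster-ceph-rbd" \
          -vvvv
      done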

      3. Perform a backup of perf-datagen-case3-rbd; this will complete successfully.

      4. Perform a restore of perf-datagen-case3-rbd; it will show 1 or 2 successful PodVolumeRestores, then 1 PVR fails and the restore waits until the Velero timeout expires.

      Actual results:

      1 or 2 PVRs succeed, 1 fails, and the other 7 or 8 PVRs remain without any status until the Velero timeout expires; the restore is unsuccessful.

       

      Expected results:

      Restore should complete successfully. A successful restore is possible in 1.3.1-27 (Velero 1.12.3); it fails in 1.3.1-54, -57, and -59.
       
       

       

      Additional info:

            wnstb Wes Hayutin
            tzahia Tzahi Ashkenazi