Bug
Resolution: Not a Bug
Blocker
OADP 1.3.1
Description of problem:
During Kopia or Restic restore operations there is an issue with PodVolumeRestore (PVR) resources: the restore CR ends as "PartiallyFailed".
When checking the restore CR, some PVRs are marked as Completed, some as Failed, and those that appear as New never started at all.
This issue has been reproduced with both Restic and Kopia restores.
PVR:
39 - "Completed"
3 - "Failed"
58 - "New", without any progress: shown as New in the restore CR describe output; the PVR resources themselves have no "Progress:" field at all
[kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc get podvolumerestore -nopenshift-adp | grep restic | grep Completed | wc -l
39
[kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc get podvolumerestore -nopenshift-adp | grep restic | grep Failed | wc -l
3
[kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc get podvolumerestore -nopenshift-adp | grep restic | grep -v "Failed\|Completed" | wc -l
58
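The same breakdown can be read straight from the PVR status fields; a minimal sketch (assumes jq is installed and that PVRs which never started report no status.phase):

# Count PodVolumeRestores per phase; PVRs that never started usually carry no
# status.phase, so they are bucketed as "New" here.
oc get podvolumerestore -n openshift-adp -o json \
  | jq -r '.items[].status.phase // "New"' \
  | sort | uniq -c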
PVB:
[kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc get Podvolumebackup -A | grep data | grep Completed | wc -l
100
[kni@f07-h27-000-r640 ~]$ velero restore describe restore-restic-datagen-single-ns-100pods-cephrbd
Name:         restore-restic-datagen-single-ns-100pods-cephrbd
Namespace:    openshift-adp
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs restore-restic-datagen-single-ns-100pods-cephrbd' for more information)

Total items to be restored:  521
Items restored:              521

Started:    2024-04-07 14:33:50 +0000 UTC
Completed:  2024-04-09 14:33:50 +0000 UTC

Warnings:
  Velero:   <none>
  Cluster:  could not restore, CustomResourceDefinition "clusterserviceversions.operators.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version
  Namespaces:
    datagen-single-ns-100pods-cephrbd:
      could not restore, RoleBinding "system:image-pullers" already exists. Warning: the in-cluster version is different than the backed-up version
      could not restore, ConfigMap "kube-root-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
      could not restore, ConfigMap "openshift-service-ca.crt" already exists. Warning: the in-cluster version is different than the backed-up version
      could not restore, ClusterServiceVersion "volsync-product.v0.7.4-0.1698026108.p" already exists. Warning: the in-cluster version is different than the backed-up version
      could not restore, RoleBinding "system:deployers" already exists. Warning: the in-cluster version is different than the backed-up version
      could not restore, RoleBinding "system:image-builders" already exists. Warning: the in-cluster version is different than the backed-up version
      could not restore, RoleBinding "system:image-pullers" already exists. Warning: the in-cluster version is different than the backed-up version

Errors:
  Velero:
    pod volume restore failed: data path restore failed: chdir /host_pods/1a3eda7f-cc21-4aa1-ba7d-12b9f4102584/volumes/kubernetes.io~csi/pvc-fc80f14b-0236-44ac-88bc-4c3b4e14aed3/mount: no such file or directory
    pod volume restore failed: data path restore failed: chdir /host_pods/804a47c6-6eca-44c7-be3a-34a7c4ea749c/volumes/kubernetes.io~csi/pvc-4db942e2-4416-4965-be93-bf5ddb990a78/mount: no such file or directory
    pod volume restore failed: data path restore failed: chdir /host_pods/cb26a3f3-1bcd-421d-a5ff-77a0e48ca5c0/volumes/kubernetes.io~csi/pvc-e8fa3eb5-a89d-4e93-9b9a-9430915f7b87/mount: no such file or directory
    timed out waiting for all PodVolumeRestores to complete
    timed out waiting for all PodVolumeRestores to complete
    ... ("timed out waiting for all PodVolumeRestores to complete" repeats many more times) ...
  Cluster:    <none>
  Namespaces: <none>

Backup:  backup-restic-datagen-single-ns-100pods-cephrbd

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Or label selector:  <none>

Restore PVs:  auto

restic Restores (specify --details for more information):
  Completed:  39
  Failed:     3
  New:        58

Existing Resource Policy:   <none>
ItemOperationTimeout:       4h0m0s

Preserve Service NodePorts:  auto
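The chdir failures suggest the node-agent is using a pod UID that no longer exists on the node. A small diagnostic sketch (assuming the PVR spec exposes the target pod reference, which may vary by Velero version) to compare the UID recorded in each failed PVR with the UID of the pod currently running under that name:

# For each failed PodVolumeRestore, print the pod UID recorded in the PVR next
# to the UID of the pod that currently exists with that name, to spot stale UIDs.
oc get podvolumerestore -n openshift-adp -o json \
  | jq -r '.items[] | select(.status.phase == "Failed")
      | [.metadata.name, .spec.pod.namespace, .spec.pod.name, .spec.pod.uid] | @tsv' \
  | while IFS=$'\t' read -r pvr ns pod uid; do
      live=$(oc get pod -n "$ns" "$pod" -o jsonpath='{.metadata.uid}' 2>/dev/null)
      echo "$pvr: PVR uid=$uid, live pod uid=${live:-<pod not found>}"
    done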
PV & Pods
[kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc get pv -n datagen-single-ns-100pods-cephrbd | grep gen | wc -l
100
[kni@f07-h27-000-r640 benchmark-runner-assistant]$ oc get pods -ndatagen-single-ns-100pods-cephrbd | grep Running | wc -l
100
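To rule out a genuinely missing volume path, the directory from the chdir error can also be checked directly on the node; a sketch assuming the node-agent's /host_pods mount corresponds to /var/lib/kubelet/pods on the host (the node name is a placeholder; the pod UID and PV name below are taken from the first chdir error above):

# Check on the node whether the pod volume path from the chdir error exists.
NODE=worker-0                                   # node running the target pod (placeholder)
POD_UID=1a3eda7f-cc21-4aa1-ba7d-12b9f4102584    # pod UID from the chdir error
PV=pvc-fc80f14b-0236-44ac-88bc-4c3b4e14aed3     # PV name from the chdir error
oc debug "node/${NODE}" -- chroot /host \
  ls -ld "/var/lib/kubelet/pods/${POD_UID}/volumes/kubernetes.io~csi/${PV}/mount"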
From the velero log:
I0404 15:56:15.980518 1 request.go:690] Waited for 1.045157493s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/flowcontrol.apiserver.k8s.io/v1beta3?timeout=32s
E0405 07:40:22.212924 1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
E0405 07:41:15.985431 1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
E0405 07:42:07.039601 1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
E0405 07:42:49.140141 1 reflector.go:140] pkg/mod/k8s.io/client-go@v0.25.6/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server does not allow this method on the requested resource
From the node-agent log (lines truncated by the console viewer):
time="2024-04-07T14:36:11Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ldw7j control │ │ time="2024-04-07T14:36:11Z" level=info msg="Got volume dir" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ldw7j controlle │ │ time="2024-04-07T14:36:11Z" level=info msg="Found path matching glob" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ldw7j │ │ time="2024-04-07T14:36:11Z" level=info msg="Founding existing repo" backupLocation=bucket logSource="/remote-source/velero/app/pkg/repository/ensurer.go:86 │ │ time="2024-04-07T14:36:12Z" level=info msg="FileSystemBR is initialized" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ld │ │ time="2024-04-07T14:36:12Z" level=info msg="Async fs restore data path started" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cep │ │ time="2024-04-07T14:36:13Z" level=info msg="Error cannot be convert to ExitError format." PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-1 │ │ time="2024-04-07T14:36:13Z" level=info msg="Run command=restore, stdout=, stderr=" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods- │ │ time="2024-04-07T14:36:13Z" level=error msg="Async fs restore data path failed" controller=PodVolumeRestore error="chdir /host_pods/cb26a3f3-1bcd-421d-a5ff │ │ time="2024-04-07T14:36:13Z" level=info msg="FileSystemBR is closed" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-ldw7j c │
time="2024-04-07T14:36:09Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gzf79 control │ │ time="2024-04-07T14:36:09Z" level=info msg="Got volume dir" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gzf79 controlle │ │ time="2024-04-07T14:36:09Z" level=info msg="Found path matching glob" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gzf79 │ │ time="2024-04-07T14:36:09Z" level=info msg="Founding existing repo" backupLocation=bucket logSource="/remote-source/velero/app/pkg/repository/ensurer.go:86 │ │ time="2024-04-07T14:36:10Z" level=info msg="FileSystemBR is initialized" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gz │ │ time="2024-04-07T14:36:10Z" level=info msg="Async fs restore data path started" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cep │ │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-pw6kj control │ │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-52c5q control │ │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-w8h7f control │ │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-zhtth control │ │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-jlc8x control │ │ time="2024-04-07T14:36:10Z" level=info msg="Restore starting" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-mg6w2 control │ │ time="2024-04-07T14:36:11Z" level=info msg="Error cannot be convert to ExitError format." PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-1 │ │ time="2024-04-07T14:36:11Z" level=info msg="Run command=restore, stdout=, stderr=" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods- │ │ time="2024-04-07T14:36:11Z" level=error msg="Async fs restore data path failed" controller=PodVolumeRestore error="chdir /host_pods/804a47c6-6eca-44c7-be3a │ │ time="2024-04-07T14:36:11Z" level=info msg="FileSystemBR is closed" PodVolumeRestore=openshift-adp/restore-restic-datagen-single-ns-100pods-cephrbd-gzf79 c │
This issue was reproduced on 3 different clusters by the perf and scale team, and also on the latest OADP build, 1.3.1-57.
All the logs from the above cycle can be found here:
https://drive.google.com/drive/folders/1WKXxlNuugmg_Dc-9k89waM-uIuK0dwLr?usp=sharing
Version-Release number of selected component (if applicable):
OADP 1.3.1-54
ODF 4.14.6
OCP 4.14.13
How reproducible:
Reproduced on 3 different clusters in the perf and scale team, including on the latest OADP build 1.3.1-57.
Steps to Reproduce:
1. Clone the repo into your /tmp folder on a machine with access to the OpenShift cluster:
   git clone git@gitlab.cee.redhat.com:mlehrer/mpqe-scale-scripts.git
2. Execute this ansible-playbook command 10 times, once per -0 .. -9 suffix (a loop equivalent is sketched after these steps):
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-0 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-0 sc=ocs-storagecluster-ceph-rbd' -vvvv
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-1 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-1 sc=ocs-storagecluster-ceph-rbd' -vvvv
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-2 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-2 sc=ocs-storagecluster-ceph-rbd' -vvvv
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-3 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-3 sc=ocs-storagecluster-ceph-rbd' -vvvv
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-4 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-4 sc=ocs-storagecluster-ceph-rbd' -vvvv
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-5 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-5 sc=ocs-storagecluster-ceph-rbd' -vvvv
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-6 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-6 sc=ocs-storagecluster-ceph-rbd' -vvvv
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-7 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-7 sc=ocs-storagecluster-ceph-rbd' -vvvv
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-8 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-8 sc=ocs-storagecluster-ceph-rbd' -vvvv
ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml --extra-vars 'dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-9 dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-9 sc=ocs-storagecluster-ceph-rbd' -vvvv
The above commands will deploy 10 pods with 2 GB of utilized data in namespace perf-datagen-case3-rbd.
3. Perform a backup of perf-datagen-case3-rbd; this will complete successfully.
4. Perform a restore of perf-datagen-case3-rbd; it will show 1 or 2 successful PodVolumeRestores, then fail 1 PVR and wait until the Velero timeout expires.
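For reference, the ten playbook invocations from step 2 can be generated with a loop; a minimal sketch using the same parameters as above:

# Run the data-generator playbook ten times, varying only the -0 .. -9 suffix
# on deployment_name and pvc_name.
for i in $(seq 0 9); do
  vars="dir_count=30 files_count=230 files_size=307200 dept_count=1 pvc_size=6Gi"
  vars+=" deployment_name=deploy-perf-datagen-0-0-7-6gi-10-rbd-${i}"
  vars+=" dataset_path=/opt/mounts/mnt1/ namespace=perf-datagen-case-0-0-7-cephrbd"
  vars+=" pvc_name=pvc-perf-datagen-0-0-7-6gi-10-rbd-${i} sc=ocs-storagecluster-ceph-rbd"
  ansible-playbook /tmp/mpqe-scale-scripts/mtc-helpers/data-generator/playbooks/playbook_case3.yml \
    --extra-vars "$vars" -vvvv
done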
Actual results:
1 or 2 PVRs succeed and 1 fails; the other 7 or 8 PVRs remain without any status until the Velero timeout expires, and the restore is unsuccessful.
Expected results:
A successful restore. Restore succeeds with OADP 1.3.1-27 (Velero 1.12.3) but fails with 1.3.1-54, -57, and -59.
Additional info:
- mentioned on