- Bug
- Resolution: Done-Errata
- Critical
- OADP 1.4.0
- oadp-operator-bundle-container-1.4.1-20
Description of problem:
During a restore using Kopia, the node-agent pod was killed due to excessive consumption of ephemeral-storage.
Perf cycle case:
restore-kopia-pvc-util-2-1-6-9-cephrbd-swift-1.5t
Because no limit is set for this resource (limit = 0), the node-agent pod consumed all the available ephemeral-storage capacity on the worker node, which caused the pod to be killed and recreated from scratch during the restore operation.
1. The error reported in the restore CR should be changed to something more precise.
2. We need an option in the DPA to control the ephemeral-storage capacity for the velero and node-agent pods (see the sketch below).
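A minimal sketch of what the requested option could look like, assuming the ephemeral-storage setting were exposed through the existing podConfig.resourceAllocations fields of the DPA (the DPA name and all values below are illustrative, not taken from this report and not confirmed to be honored by OADP 1.4.0):

```yaml
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-sample              # illustrative name
  namespace: openshift-adp
spec:
  configuration:
    velero:
      podConfig:
        resourceAllocations:
          limits:
            ephemeral-storage: 10Gi   # illustrative cap for the velero pod
    nodeAgent:
      enable: true
      uploaderType: kopia
      podConfig:
        resourceAllocations:
          requests:
            ephemeral-storage: 2Gi    # non-zero request so the kubelet/scheduler account for it
          limits:
            ephemeral-storage: 50Gi   # illustrative cap for each node-agent pod
```

With a per-container limit in place, the kubelet would evict only a node-agent pod that exceeds its own cap, instead of letting it drive the whole node into DiskPressure.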
From the restore CR:
Errors:
  Velero:     pod volume restore failed: get a podvolumerestore with status "InProgress" during the server starting, mark it as "Failed"
  Cluster:    <none>
  Namespaces: <none>
From the OCP events (the node-agent container was using ~183 GiB of ephemeral-storage with a request of 0, against an eviction threshold of ~67 GiB):
Stopping container registry-server
The node was low on resource: ephemeral-storage. Threshold quantity: 71851730317, available: 68348632Ki. Container node-agent was using 192072140Ki, request is 0, has larger consumption of ephemeral-storage.
Stopping container node-agent
Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:...
Successfully assigned openshift-adp/node-agent-9h44x to worker002-r640
Found succeeded daemon pod openshift-adp/node-agent-8vtvr on node worker002-r640, will try to delete it
Successfully assigned openshift-adp/node-agent-fd92r to worker002-r640
The node had condition: [DiskPressure].
The node had condition: [DiskPressure].
Deleted pod: node-agent-fd92r
Created pod: node-agent-9h44x
Found failed daemon pod openshift-adp/node-agent-fd92r on node worker002-r640, will try to kill it
Deleted pod: node-agent-8vtvr
Created pod: node-agent-fd92r
Created pod: node-agent-tbw82
Deleted pod: node-agent-9h44x
Found failed daemon pod openshift-adp/node-agent-9h44x on node worker002-r640, will try to kill it
The node had condition: [DiskPressure].
Successfully assigned openshift-adp/node-agent-tbw82 to worker002-r640
Found failed daemon pod openshift-adp/node-agent-tbw82 on node worker002-r640, will try to kill it
Successfully assigned openshift-adp/node-agent-9zjdd to worker002-r640
Deleted pod: node-agent-tbw82
Created pod: node-agent-9zjdd
The node had condition: [DiskPressure].
Successfully assigned openshift-adp/node-agent-g5phh to worker002-r640
The node had condition: [DiskPressure].
Created pod: node-agent-g5phh
Deleted pod: node-agent-9zjdd
Found failed daemon pod openshift-adp/node-agent-9zjdd on node worker002-r640, will try to kill it
Successfully assigned openshift-adp/node-agent-dk56z to worker002-r640
Found failed daemon pod openshift-adp/node-agent-g5phh on node worker002-r640, will try to kill it
The node had condition: [DiskPressure].
Created pod: node-agent-dk56z
Deleted pod: node-agent-g5phh
Successfully assigned openshift-adp/node-agent-gsmv8 to worker002-r640
Found failed daemon pod openshift-adp/node-agent-dk56z on node worker002-r640, will try to kill it
Deleted pod: node-agent-dk56z
Created pod: node-agent-gsmv8
Created container node-agent
Successfully pulled image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:19a083030bdfdf56de52d4a9de4c95d4216dcfe863a395304ef5681d9c720ebe" in 670ms (670ms including waiting)
Add eth0 [10.128.3.10/23] from openshift-sdn
Started container node-agent
Pulling image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:19a083030bdfdf56de52d4a9de4c95d4216dcfe863a395304ef5681d9c720ebe"
Node worker002-r640 status is now: NodeHasNoDiskPressure
Successfully assigned openshift-marketplace/redhat-marketplace-s6j6x to master-1
Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fdac21f6fe1c9e627a5025d9a92be272c60e8c7f3ca05638fa5bcffac669f7dd" already present on machine
Add eth0 [10.130.1.161/23] from openshift-sdn
Created container extract-utilities
Started container extract-utilities
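For context, the eviction event above shows the node-agent container running with an ephemeral-storage request of 0, so the kubelet has no per-container budget to enforce and can only react once the whole node reaches DiskPressure. The following is a hypothetical excerpt of a node-agent DaemonSet container spec (standard Kubernetes resources syntax, values illustrative and not defaults shipped by OADP) showing the kind of request/limit the requested DPA option would need to inject:

```yaml
containers:
  - name: node-agent
    image: registry.redhat.io/oadp/oadp-velero-rhel9@sha256:19a083030bdfdf56de52d4a9de4c95d4216dcfe863a395304ef5681d9c720ebe
    resources:
      requests:
        ephemeral-storage: 2Gi    # non-zero budget the scheduler and kubelet can account for
      limits:
        ephemeral-storage: 50Gi   # the pod is evicted when it exceeds its own cap, not when the node runs dry
```

Note that the node-agent DaemonSet is managed by the operator, so manual edits of this kind can be reconciled away; the DPA-level option sketched above is the supportable path.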
Version-Release number of selected component (if applicable):
OCP: 4.15.11
ODF: 4.15.3
OADP: 1.4.0-8
Cloud33: BM
References:
https://docs.openshift.com/container-platform/4.15/storage/understanding-ephemeral-storage.html
https://github.com/vmware-tanzu/velero/issues/5827
- relates to: OADP-4855 Kopia leaving cache on worker node (New)
- links to: RHBA-2024:132893 OpenShift API for Data Protection (OADP) 1.4.1 security and bug fix update
- mentioned on:
  1. (QE) Add test coverage for OADP-4379 | In Progress | Prasad Joshi