OpenShift API for Data Protection
OADP-4379

node-agent pod killed due to excessive consumption of ephemeral-storage



      Description of problem:

      During a restore using Kopia, the node-agent pod was killed due to excessive ephemeral-storage consumption.

      Perf cycle case:

      restore-kopia-pvc-util-2-1-6-9-cephrbd-swift-1.5t

      Because no limit is set for this resource (limit = 0), the node-agent pod consumed all the available ephemeral-storage capacity on the worker node, which caused the pod to be killed and re-created from scratch during the restore operation.

      1. The error reported in the restore CR should be changed to something more precise.
      2. We need an option in the DPA to control the ephemeral-storage capacity for the velero and node-agent pods.
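
      Request 2 would fit naturally next to the existing podConfig.resourceAllocations fields of the DPA. A minimal sketch of what that could look like (the DPA structure below follows the current OADP API; treating ephemeral-storage as an accepted resource name here, and the 20Gi/200Gi values, are assumptions for illustration only):

```yaml
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-sample
  namespace: openshift-adp
spec:
  configuration:
    velero:
      podConfig:
        resourceAllocations:
          limits:
            ephemeral-storage: 20Gi    # assumed/proposed: cap velero pod scratch space
    nodeAgent:
      enable: true
      uploaderType: kopia
      podConfig:
        resourceAllocations:
          limits:
            ephemeral-storage: 200Gi   # assumed/proposed: cap node-agent scratch space
```

      Setting a limit would at least make the kubelet account for the pod's usage and evict it predictably, instead of letting it drive the whole node into DiskPressure.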

       
      From the restore CR:

       Errors:
        Velero:   pod volume restore failed: get a podvolumerestore with status "InProgress" during the server starting, mark it as "Failed"
        Cluster:    <none>
        Namespaces: <none>
      

      From the OCP events:

       Stopping container registry-server                                                                                                                                                                                
      The node was low on resource: ephemeral-storage. Threshold quantity: 71851730317, available: 68348632Ki. Container node-agent was using 192072140Ki, request is 0, has larger consumption of ephemeral-storage.   
      Stopping container node-agent                                                                                                                                                                                     
      Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:...                                                                                                                                        
      Successfully assigned openshift-adp/node-agent-9h44x to worker002-r640                                                                                                                                            
      Found succeeded daemon pod openshift-adp/node-agent-8vtvr on node worker002-r640, will try to delete it                                                                                                           
      Successfully assigned openshift-adp/node-agent-fd92r to worker002-r640                                                                                                                                            
      The node had condition: [DiskPressure].                                                                                                                                                                           
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Deleted pod: node-agent-fd92r                                                                                                                                                                                     
      Created pod: node-agent-9h44x                                                                                                                                                                                     
      Found failed daemon pod openshift-adp/node-agent-fd92r on node worker002-r640, will try to kill it                                                                                                                
      Deleted pod: node-agent-8vtvr                                                                                                                                                                                     
      Created pod: node-agent-fd92r                                                                                                                                                                                     
      Created pod: node-agent-tbw82                                                                                                                                                                                     
      Deleted pod: node-agent-9h44x                                                                                                                                                                                     
      Found failed daemon pod openshift-adp/node-agent-9h44x on node worker002-r640, will try to kill it                                                                                                                
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Successfully assigned openshift-adp/node-agent-tbw82 to worker002-r640                                                                                                                                            
      Found failed daemon pod openshift-adp/node-agent-tbw82 on node worker002-r640, will try to kill it                                                                                                                
      Successfully assigned openshift-adp/node-agent-9zjdd to worker002-r640                                                                                                                                            
      Deleted pod: node-agent-tbw82                                                                                                                                                                                     
      Created pod: node-agent-9zjdd                                                                                                                                                                                     
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Successfully assigned openshift-adp/node-agent-g5phh to worker002-r640                                                                                                                                            
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Created pod: node-agent-g5phh                                                                                                                                                                                     
      Deleted pod: node-agent-9zjdd                                                                                                                                                                                     
      Found failed daemon pod openshift-adp/node-agent-9zjdd on node worker002-r640, will try to kill it                                                                                                                
      Successfully assigned openshift-adp/node-agent-dk56z to worker002-r640                                                                                                                                            
      Found failed daemon pod openshift-adp/node-agent-g5phh on node worker002-r640, will try to kill it                                                                                                                
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Created pod: node-agent-dk56z                                                                                                                                                                                     
      Deleted pod: node-agent-g5phh                                                                                                                                                                                     
      Successfully assigned openshift-adp/node-agent-gsmv8 to worker002-r640                                                                                                                                            
      Found failed daemon pod openshift-adp/node-agent-dk56z on node worker002-r640, will try to kill it                                                                                                                
      Deleted pod: node-agent-dk56z                                                                                                                                                                                     
      Created pod: node-agent-gsmv8                                                                                                                                                                                     
      Created container node-agent                                                                                                                                                                                      
      Successfully pulled image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:19a083030bdfdf56de52d4a9de4c95d4216dcfe863a395304ef5681d9c720ebe" in 670ms (670ms including waiting)                                  
      Add eth0 [10.128.3.10/23] from openshift-sdn                                                                                                                                                                      
      Started container node-agent                                                                                                                                                                                      
      Pulling image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:19a083030bdfdf56de52d4a9de4c95d4216dcfe863a395304ef5681d9c720ebe"                                                                                 
      Node worker002-r640 status is now: NodeHasNoDiskPressure                                                                                                                                                          
      Successfully assigned openshift-marketplace/redhat-marketplace-s6j6x to master-1                                                                                                                                  
      Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fdac21f6fe1c9e627a5025d9a92be272c60e8c7f3ca05638fa5bcffac669f7dd" already present on machine                                               
      Add eth0 [10.130.1.161/23] from openshift-sdn                                                                                                                                                                     
      Created container extract-utilities                                                                                                                                                                               
      Started container extract-utilities                     
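
      To put the eviction message in perspective, the quantities convert as follows (illustrative arithmetic only; the values are copied from the event above):

```python
# Convert the quantities from the eviction event into GiB for comparison.
GIB = 1024 ** 3

threshold_bytes = 71_851_730_317       # "Threshold quantity: 71851730317" (bytes)
available_bytes = 68_348_632 * 1024    # "available: 68348632Ki"
node_agent_bytes = 192_072_140 * 1024  # "Container node-agent was using 192072140Ki"

print(f"threshold:  {threshold_bytes / GIB:.1f} GiB")   # ~66.9 GiB
print(f"available:  {available_bytes / GIB:.1f} GiB")   # ~65.2 GiB
print(f"node-agent: {node_agent_bytes / GIB:.1f} GiB")  # ~183.2 GiB
```

      So the node-agent pod was using roughly 183 GiB of ephemeral storage with request = 0 and no limit, nearly three times the node's eviction threshold, which is why the kubelet evicted it.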

      Version-Release number of selected component (if applicable):

      OCP:   4.15.11
      ODF:   4.15.3
      OADP:  1.4.0-8
      Cloud33: BM

      References:
      https://docs.openshift.com/container-platform/4.15/storage/understanding-ephemeral-storage.html
      https://github.com/vmware-tanzu/velero/issues/5827

        1. Worker9_Var.txt
          22 kB
        2. ThirdCycle_NotRun.txt
          10 kB
        3. SecondCycle.txt
          0.3 kB
        4. output.tar
          1.81 MB
        5. FirstCycle.txt
          0.6 kB

            msouzaol Mateus Oliveira
            tzahia Tzahi Ashkenazi
            David Vaanunu David Vaanunu