OpenShift API for Data Protection
OADP-4379

node-agent pod killed due to excessive consumption of ephemeral-storage



      Description of problem:

      During a restore using Kopia, the node-agent pod was killed due to excessive ephemeral-storage consumption.

      Perf cycle case:

      restore-kopia-pvc-util-2-1-6-9-cephrbd-swift-1.5t

      Because no limit is set for this resource (limit = 0), the node-agent pod consumed all the available ephemeral-storage capacity on the worker node, which caused the pod to be killed and re-created from scratch during the restore operation.

      1. The error reported in the restore CR should be changed to something more precise.
      2. We need an option in the DPA to control the ephemeral-storage capacity for the velero and node-agent pods.
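
      Request 2 would fit naturally next to the existing podConfig.resourceAllocations fields of the DPA. A minimal sketch of what that could look like (the DPA structure below follows the current OADP API; treating ephemeral-storage as an accepted resource name here, and the 20Gi/200Gi values, are assumptions for illustration only):

```yaml
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: dpa-sample
  namespace: openshift-adp
spec:
  configuration:
    velero:
      podConfig:
        resourceAllocations:
          limits:
            ephemeral-storage: 20Gi    # assumed/proposed: cap velero pod scratch space
    nodeAgent:
      enable: true
      uploaderType: kopia
      podConfig:
        resourceAllocations:
          limits:
            ephemeral-storage: 200Gi   # assumed/proposed: cap node-agent scratch space
```

      Setting a limit would at least make the kubelet account for the pod's usage and evict it predictably, instead of letting it drive the whole node into DiskPressure.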

       
      From the restore CR:

       Errors:
        Velero:   pod volume restore failed: get a podvolumerestore with status "InProgress" during the server starting, mark it as "Failed"
        Cluster:    <none>
        Namespaces: <none>
      

      From the OCP events:

       Stopping container registry-server                                                                                                                                                                                
      The node was low on resource: ephemeral-storage. Threshold quantity: 71851730317, available: 68348632Ki. Container node-agent was using 192072140Ki, request is 0, has larger consumption of ephemeral-storage.   
      Stopping container node-agent                                                                                                                                                                                     
      Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:...                                                                                                                                        
      Successfully assigned openshift-adp/node-agent-9h44x to worker002-r640                                                                                                                                            
      Found succeeded daemon pod openshift-adp/node-agent-8vtvr on node worker002-r640, will try to delete it                                                                                                           
      Successfully assigned openshift-adp/node-agent-fd92r to worker002-r640                                                                                                                                            
      The node had condition: [DiskPressure].                                                                                                                                                                           
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Deleted pod: node-agent-fd92r                                                                                                                                                                                     
      Created pod: node-agent-9h44x                                                                                                                                                                                     
      Found failed daemon pod openshift-adp/node-agent-fd92r on node worker002-r640, will try to kill it                                                                                                                
      Deleted pod: node-agent-8vtvr                                                                                                                                                                                     
      Created pod: node-agent-fd92r                                                                                                                                                                                     
      Created pod: node-agent-tbw82                                                                                                                                                                                     
      Deleted pod: node-agent-9h44x                                                                                                                                                                                     
      Found failed daemon pod openshift-adp/node-agent-9h44x on node worker002-r640, will try to kill it                                                                                                                
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Successfully assigned openshift-adp/node-agent-tbw82 to worker002-r640                                                                                                                                            
      Found failed daemon pod openshift-adp/node-agent-tbw82 on node worker002-r640, will try to kill it                                                                                                                
      Successfully assigned openshift-adp/node-agent-9zjdd to worker002-r640                                                                                                                                            
      Deleted pod: node-agent-tbw82                                                                                                                                                                                     
      Created pod: node-agent-9zjdd                                                                                                                                                                                     
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Successfully assigned openshift-adp/node-agent-g5phh to worker002-r640                                                                                                                                            
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Created pod: node-agent-g5phh                                                                                                                                                                                     
      Deleted pod: node-agent-9zjdd                                                                                                                                                                                     
      Found failed daemon pod openshift-adp/node-agent-9zjdd on node worker002-r640, will try to kill it                                                                                                                
      Successfully assigned openshift-adp/node-agent-dk56z to worker002-r640                                                                                                                                            
      Found failed daemon pod openshift-adp/node-agent-g5phh on node worker002-r640, will try to kill it                                                                                                                
      The node had condition: [DiskPressure].                                                                                                                                                                           
      Created pod: node-agent-dk56z                                                                                                                                                                                     
      Deleted pod: node-agent-g5phh                                                                                                                                                                                     
      Successfully assigned openshift-adp/node-agent-gsmv8 to worker002-r640                                                                                                                                            
      Found failed daemon pod openshift-adp/node-agent-dk56z on node worker002-r640, will try to kill it                                                                                                                
      Deleted pod: node-agent-dk56z                                                                                                                                                                                     
      Created pod: node-agent-gsmv8                                                                                                                                                                                     
      Created container node-agent                                                                                                                                                                                      
      Successfully pulled image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:19a083030bdfdf56de52d4a9de4c95d4216dcfe863a395304ef5681d9c720ebe" in 670ms (670ms including waiting)                                  
      Add eth0 [10.128.3.10/23] from openshift-sdn                                                                                                                                                                      
      Started container node-agent                                                                                                                                                                                      
      Pulling image "registry.redhat.io/oadp/oadp-velero-rhel9@sha256:19a083030bdfdf56de52d4a9de4c95d4216dcfe863a395304ef5681d9c720ebe"                                                                                 
      Node worker002-r640 status is now: NodeHasNoDiskPressure                                                                                                                                                          
      Successfully assigned openshift-marketplace/redhat-marketplace-s6j6x to master-1                                                                                                                                  
      Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fdac21f6fe1c9e627a5025d9a92be272c60e8c7f3ca05638fa5bcffac669f7dd" already present on machine                                               
      Add eth0 [10.130.1.161/23] from openshift-sdn                                                                                                                                                                     
      Created container extract-utilities                                                                                                                                                                               
      Started container extract-utilities                     
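
      To put the eviction message in perspective, the quantities convert as follows (illustrative arithmetic only; the values are copied from the event above):

```python
# Convert the quantities from the eviction event into GiB for comparison.
GIB = 1024 ** 3

threshold_bytes = 71_851_730_317       # "Threshold quantity: 71851730317" (bytes)
available_bytes = 68_348_632 * 1024    # "available: 68348632Ki"
node_agent_bytes = 192_072_140 * 1024  # "Container node-agent was using 192072140Ki"

print(f"threshold:  {threshold_bytes / GIB:.1f} GiB")   # ~66.9 GiB
print(f"available:  {available_bytes / GIB:.1f} GiB")   # ~65.2 GiB
print(f"node-agent: {node_agent_bytes / GIB:.1f} GiB")  # ~183.2 GiB
```

      So the node-agent pod was using roughly 183 GiB of ephemeral storage with request = 0 and no limit, nearly three times the node's eviction threshold, which is why the kubelet evicted it.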

      Version-Release number of selected component (if applicable):

      OCP:   4.15.11
      ODF:   4.15.3
      OADP:  1.4.0-8
      Cloud33: BM

      References:
      https://docs.openshift.com/container-platform/4.15/storage/understanding-ephemeral-storage.html
      https://github.com/vmware-tanzu/velero/issues/5827

        1. Worker9_Var.txt
          22 kB
        2. ThirdCycle_NotRun.txt
          10 kB
        3. SecondCycle.txt
          0.3 kB
        4. output.tar
          1.81 MB
        5. FirstCycle.txt
          0.6 kB

            msouzaol Mateus Oliveira
            tzahia Tzahi Ashkenazi
            David Vaanunu David Vaanunu