OpenShift Bugs / OCPBUGS-27290

capi pods restarting and getting evicted


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.14
    • Component: HyperShift / Agent
    • Severity: Important

      Description of problem:

      capi pods exited and restarted. The original capi pod failed/finished at Wed, 17 Jan 2024 12:18:04, and the replacement pod's logs start at 2024-01-17T12:57:50Z. The original pod was evicted with:
      Status:               Failed
      Reason:               Evicted
      Message:              The node was low on resource: ephemeral-storage. Threshold quantity: 6351693871, available: 6143252Ki. Container manager was using 332Ki, request is 0, has larger consumption of ephemeral-storage. 
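
      A minimal sketch of how the eviction can be confirmed from the hub cluster. The hosted control plane namespace clusters-hosted-0 and the pod/node names below are placeholders, not values taken from this report:

          export KUBECONFIG=~/clusterconfigs/auth/hub-kubeconfig

          # List pods that ended in the Failed phase (evicted pods stay Failed until cleaned up)
          oc get pods -n clusters-hosted-0 --field-selector=status.phase=Failed

          # Show the eviction Reason/Message recorded on a specific pod
          oc describe pod <capi-pod-name> -n clusters-hosted-0

          # Check the node that ran the pod for ephemeral-storage / disk pressure
          oc describe node <node-name> | grep -i -E 'ephemeral-storage|DiskPressure'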

      Version-Release number of selected component (if applicable):

          (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc version
      Client Version: 4.14.0-0.nightly-2024-01-15-085353
      Kustomize Version: v5.0.1
      Server Version: 4.14.9
      Kubernetes Version: v1.27.9+5c56cc3
      (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc get hc -A --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      NAMESPACE   NAME       VERSION   KUBECONFIG                  PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
      clusters    hosted-0   4.14.9    hosted-0-admin-kubeconfig   Completed   True        False         The hosted control plane is available
      (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ 
      
      
      Updated the HyperShift operator image to: https://quay.io/repository/hypershift/hypershift-operator/manifest/sha256:6ffead94e14d7aa56e4fd9adb9e45e337c1b32a3fc191c10a5c1306c3990fc9f
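
      For reference, a sketch of how the running operator image can be confirmed on the hub, assuming the default hypershift namespace and the operator deployment name used by hypershift install:

          oc get deployment operator -n hypershift \
            -o jsonpath='{.spec.template.spec.containers[0].image}'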
          

      How reproducible:

      Happens sometimes 

      Steps to Reproduce:

          1. Deploy a hub cluster and, on it, a hosted cluster with 6 nodes using the agent provider.

          2. Run this automation test for scaling down: https://gitlab.cee.redhat.com/ocp-edge-qe/ocp-edge-auto/-/blob/master/edge_tests/deployment/installer/scale/test_scale_nodepool.py?ref_type=heads#L43

          3. Although the test passed, the teardown that scales the node pool back to 2 nodes took more than 30 minutes, so the teardown failed on a timeout (see the sketch after these steps for driving the same scale-down manually).

          4. I don't know exactly when, but after roughly 3 hours the node pool scale to 2 nodes completed; if needed, I can recheck with a longer timeout to get the accurate time.
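
      A minimal sketch of driving and watching the same scale-down by hand from the hub. The NodePool name hosted-0 and the clusters namespace are assumptions based on the oc get hc output above; NodePool is scaled here via its scale subresource:

          export KUBECONFIG=~/clusterconfigs/auth/hub-kubeconfig

          # Scale the NodePool back to 2 replicas
          oc scale nodepool hosted-0 -n clusters --replicas=2

          # Watch desired vs. current replicas while nodes are removed
          oc get nodepool hosted-0 -n clusters -w

          # Agent provider: watch agents being unbound from the hosted cluster
          oc get agents -A -w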

      Actual results:

          Scaling the node pool back down to 2 nodes took more than 30 minutes (it only completed after roughly 3 hours).

      Expected results:

          Scaling completes in a reasonable time (< 10 minutes).

      Additional info:

          

       

            Assignee: Crystal Chun (cchun@redhat.com)
            Reporter: Gal Amado (rhn-support-gamado)