OpenShift Bugs · OCPBUGS-29287

Pods are stuck in "Terminating" status causing nodepool autoscaling to fail adding new nodes


      Severity: Important

      Description of problem:

          Pods remain stuck in Terminating indefinitely, even though nodeDrainTimeout is set to 30s.

      Version-Release number of selected component (if applicable):

      [kni@ocp-edge99 ~]$ ~/hypershift_working/hypershift/bin/hcp -v
      hcp version openshift/hypershift: 6d15150857fa2d8d3c134deb7624d9cab42889dc. Latest supported OCP: 4.14.0
      [kni@ocp-edge99 ~]$ oc version
      Client Version: 4.13.0-0.nightly-2023-06-09-152551
      Kustomize Version: v4.5.7
      Server Version: 4.14.0-0.nightly-2024-02-06-070712
      Kubernetes Version: v1.27.10+28ed2d7
      [kni@ocp-edge99 ~]$     

      How reproducible:

          Intermittent; happens sometimes, but not on every run.

       

      Steps to Reproduce:

          1. Deploy a hub cluster and, on it, a hosted cluster with 6 nodes using the agent provider (I used https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/CI/job/job-runner/2205/).
          2. Scale the nodepool to 2 nodes:
      [kni@ocp-edge99 ~]$ oc scale nodepool/hosted-0 --namespace clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig --replicas=2
      
      
      3. Deploy a load of 10 pod replicas and check that it is reachable (a hello-openshift pod that responds to an HTTP curl will do) - the 2 remaining nodes can still carry that load of 10 pods.
      
      4. Enable nodepool autoscaling with min 1 / max 5 nodes:
      [kni@ocp-edge99 ~]$ oc -n clusters patch nodepool hosted-0 --type=json -p '[{"op": "remove", "path": "/spec/replicas"},{"op":"add", "path": "/spec/autoScaling", "value": { "max": 5, "min": 1 }}]'
      
      5. Deploy a load of 50 pods. Autoscaling tries to allocate 5 nodes for that load, but manages to allocate a total of only 3 nodes.
      6. After the timeout, the test teardown fails to scale back down to 2 nodes and fails to clean up the environment because pod removal hangs.
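      For reference, after step 4 the NodePool spec should look roughly like the sketch below. This is hand-written from the HyperShift NodePool API shape, not captured from the cluster; the 30s nodeDrainTimeout mentioned in the description is included:

      ```yaml
      apiVersion: hypershift.openshift.io/v1beta1
      kind: NodePool
      metadata:
        name: hosted-0
        namespace: clusters
      spec:
        clusterName: hosted-0
        platform:
          type: Agent
        nodeDrainTimeout: 30s
        # replicas was removed by the patch above; it must not be set
        # together with autoScaling
        autoScaling:
          min: 1
          max: 5
      ```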
      
      

      Actual results:

      23 pods are stuck in Terminating; the first 3 are shown here:
      [kni@ocp-edge99 ~]$  oc get pods -n pod-test-project --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig | grep hello | grep Terminating
      hello-openshift-5bb8d466b6-2f4gs   0/1     Terminating   0          44h
      hello-openshift-5bb8d466b6-2w9hf   0/1     Terminating   0          44h
      hello-openshift-5bb8d466b6-42vv9   0/1     Terminating   0          44h
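      The full count of stuck pods can be pulled straight out of the `oc get pods` output; a small sketch, demonstrated here on the captured sample above rather than against the live cluster:

      ```shell
      # Sample captured from:
      #   oc get pods -n pod-test-project --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig | grep hello
      sample='hello-openshift-5bb8d466b6-2f4gs   0/1     Terminating   0          44h
      hello-openshift-5bb8d466b6-2w9hf   0/1     Terminating   0          44h
      hello-openshift-5bb8d466b6-42vv9   0/1     Terminating   0          44h'

      # Count rows whose STATUS column (field 3) reads Terminating
      echo "$sample" | awk '$3 == "Terminating"' | wc -l   # prints 3
      ```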
      
      In the cluster-api logs, draining of the nodes fails:
      
      E0207 14:44:36.337471       1 machine_controller.go:641] "Drain failed, retry in 20s" err="error when evicting pods/\"thanos-querier-967c66688-6w6pm\" -n \"openshift-monitoring\": global timeout reached: 20s" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="clusters-hosted-0/hosted-0-5d7f7fc9d6xb668d-g54rb" namespace="clusters-hosted-0" name="hosted-0-5d7f7fc9d6xb668d-g54rb" reconcileID=26672e6f-9d08-4146-b99c-22f2116e04f9 MachineSet="clusters-hosted-0/hosted-0-5d7f7fc9d6xb668d" MachineDeployment="clusters-hosted-0/hosted-0" Cluster="clusters-hosted-0/hosted-0-79lpm" Node="hosted-worker-0-1"
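      The controller log line embeds the pod and namespace that block the drain. A sketch that extracts both from a captured log line (the backslash-escaped quotes appear verbatim in the log; this runs on the excerpt, not the live controller):

      ```shell
      # Excerpt of the machine controller error above
      line='E0207 14:44:36.337471       1 machine_controller.go:641] "Drain failed, retry in 20s" err="error when evicting pods/\"thanos-querier-967c66688-6w6pm\" -n \"openshift-monitoring\": global timeout reached: 20s"'

      # Pull out the pod name between pods/\" and \"
      pod=$(echo "$line" | sed -n 's|.*pods/\\"\([^\\]*\)\\".*|\1|p')
      # Pull out the namespace between -n \" and \"
      ns=$(echo "$line" | sed -n 's|.*-n \\"\([^\\]*\)\\".*|\1|p')
      echo "$pod $ns"   # prints: thanos-querier-967c66688-6w6pm openshift-monitoring
      ```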
      The nodepool keeps iterating, alternating between 3 and 5 desired nodes every few minutes. Sometimes:
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get nodepool -n clusters
      NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
      hosted-0   hosted-0                   3               True          False        4.14.11                                      
      
      
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$  oc get nodes -n pod-test-project --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      NAME                STATUS     ROLES    AGE   VERSION
      hosted-worker-0-1   NotReady   worker   11m   v1.27.10+28ed2d7
      hosted-worker-0-2   Ready      worker   32m   v1.27.10+28ed2d7
      hosted-worker-0-3   Ready      worker   20h   v1.27.10+28ed2d7
      hosted-worker-0-4   Ready      worker   20h   v1.27.10+28ed2d7
      hosted-worker-0-5   NotReady   worker   12m   v1.27.10+28ed2d7
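      The unhealthy workers can be filtered out of the `oc get nodes` output; a sketch run against the snapshot captured above rather than the live hosted cluster:

      ```shell
      # Snapshot captured from:
      #   oc get nodes --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      sample='NAME                STATUS     ROLES    AGE   VERSION
      hosted-worker-0-1   NotReady   worker   11m   v1.27.10+28ed2d7
      hosted-worker-0-2   Ready      worker   32m   v1.27.10+28ed2d7
      hosted-worker-0-3   Ready      worker   20h   v1.27.10+28ed2d7
      hosted-worker-0-4   Ready      worker   20h   v1.27.10+28ed2d7
      hosted-worker-0-5   NotReady   worker   12m   v1.27.10+28ed2d7'

      # Print node names whose STATUS column is NotReady, skipping the header
      echo "$sample" | awk 'NR > 1 && $2 == "NotReady" {print $1}'
      # prints: hosted-worker-0-1 and hosted-worker-0-5
      ```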
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get agents -n hosted-0
      NAME                                   CLUSTER    APPROVED   ROLE     STAGE
      35e1beb9-16ee-40b0-9b46-4418ff7c700d              true       worker   
      918f6bcd-dd6f-4114-994b-bd0ef4eef341   hosted-0   true       worker   Done
      a3efb0ed-a94a-4312-ba68-35cf2d6fc21d   hosted-0   true       worker   Done
      a4ddefcb-5822-4963-adf8-2af97680d637              true       worker   
      bd353678-5ac3-47d5-9a2e-7098c750fe4f              true       worker   
      d958ea6d-e511-4126-b14b-b79f6555e50b   hosted-0   true       worker   Done
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ 
      
      
      ---------------------------------------------------------------------------
      
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get nodepool -n clusters
      NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
      hosted-0   hosted-0                   3               True          False        4.14.11                                      Minimum availability requires 5 replicas, current 3 available
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get agents -n hosted-0
      NAME                                   CLUSTER    APPROVED   ROLE     STAGE
      35e1beb9-16ee-40b0-9b46-4418ff7c700d              true       worker   
      918f6bcd-dd6f-4114-994b-bd0ef4eef341   hosted-0   true       worker   Done
      a3efb0ed-a94a-4312-ba68-35cf2d6fc21d   hosted-0   true       worker   Done
      a4ddefcb-5822-4963-adf8-2af97680d637   hosted-0   true       worker   Joined
      bd353678-5ac3-47d5-9a2e-7098c750fe4f   hosted-0   true       worker   Joined
      d958ea6d-e511-4126-b14b-b79f6555e50b   hosted-0   true       worker   Done
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$

       

       

      Expected results:

      - No pods remain in 'Terminating' status for 44h.
      - 5 current nodes in the 'get nodepool' output.
      - 5 active nodes in 'get nodes'.
      - 5 allocated agents in Ready state.

       

      Additional info in the Slack thread: https://redhat-internal.slack.com/archives/C058TF9K37Z/p1707931770923379

       

            Crystal Chun (cchun@redhat.com)
            Gal Amado (rhn-support-gamado)