Bug
Resolution: Unresolved
Undefined
None
4.14
Important
No
False
Description of problem:
Pods remain stuck in Terminating indefinitely, even though nodeDrainTimeout is set to 30s.
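For context, nodeDrainTimeout is a field on the NodePool spec. A minimal sketch of how the 30s value can be set and verified (the value matches the description above; the exact command used in this run is an assumption):
# set the drain timeout on the NodePool (hosted-0 in the clusters namespace, run against the hub cluster)
oc -n clusters patch nodepool hosted-0 --type=merge -p '{"spec":{"nodeDrainTimeout":"30s"}}'
# verify the value
oc -n clusters get nodepool hosted-0 -o jsonpath='{.spec.nodeDrainTimeout}'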
Version-Release number of selected component (if applicable):
[kni@ocp-edge99 ~]$ ~/hypershift_working/hypershift/bin/hcp -v
hcp version openshift/hypershift: 6d15150857fa2d8d3c134deb7624d9cab42889dc. Latest supported OCP: 4.14.0
[kni@ocp-edge99 ~]$ oc version
Client Version: 4.13.0-0.nightly-2023-06-09-152551
Kustomize Version: v4.5.7
Server Version: 4.14.0-0.nightly-2024-02-06-070712
Kubernetes Version: v1.27.10+28ed2d7
How reproducible:
Happens sometimes (intermittent).
Steps to Reproduce:
1. Deploy a hub cluster, and on it a hosted cluster with 6 nodes, agent provider (I used https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/CI/job/job-runner/2205/).
2. Scale the nodepool to 2 nodes:
[kni@ocp-edge99 ~]$ oc scale nodepool/hosted-0 --namespace clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig --replicas=2
3. Deploy a load of 10 pod replicas and check that it is reachable (a hello-openshift pod that responds to an HTTP curl will do; see the sketch after these steps) - the 2 remaining nodes are still enough for that load of 10 pods.
4. Enable autoscaling on the nodepool with min 1 and max 5 nodes:
[kni@ocp-edge99 ~]$ oc -n clusters patch nodepool hosted-0 --type=json -p '[{"op": "remove", "path": "/spec/replicas"},{"op":"add", "path": "/spec/autoScaling", "value": { "max": 5, "min": 1 }}]'
5. Deploy a load of 50 pods; autoscaling tries to allocate 5 nodes for it, but manages to allocate a total of only 3 nodes.
6. After the timeout, the test teardown fails to scale back down to 2 nodes and fails to clean up the environment (pod removal).
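For reference, the load from steps 3 and 5 can be reproduced with a plain Deployment; a minimal sketch, assuming the standard hello-openshift image (the exact manifest used by the automation is not shown here):
# namespace used by the test in the hosted cluster
oc --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig create namespace pod-test-project
# create the 10-replica load and expose it for the curl check (step 3)
oc --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig -n pod-test-project create deployment hello-openshift --image=quay.io/openshift/origin-hello-openshift --replicas=10
oc --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig -n pod-test-project expose deployment hello-openshift --port=8080
# later, grow the load to 50 pods to trigger autoscaling (step 5)
oc --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig -n pod-test-project scale deployment hello-openshift --replicas=50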
Actual results:
Some 23 pods are stuck in Terminating; pasting the first 3:
[kni@ocp-edge99 ~]$ oc get pods -n pod-test-project --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig | grep hello | grep Terminating
hello-openshift-5bb8d466b6-2f4gs   0/1   Terminating   0   44h
hello-openshift-5bb8d466b6-2w9hf   0/1   Terminating   0   44h
hello-openshift-5bb8d466b6-42vv9   0/1   Terminating   0   44h
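One way to check whether the stuck pods all sit on nodes that were drained or removed (a diagnostic sketch, not output captured from this run):
# show which node each Terminating pod is still bound to
oc get pods -n pod-test-project --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig -o wide | grep Terminating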
In the cluster-api logs, draining of the nodes fails:
E0207 14:44:36.337471 1 machine_controller.go:641] "Drain failed, retry in 20s" err="error when evicting pods/\"thanos-querier-967c66688-6w6pm\" -n \"openshift-monitoring\": global timeout reached: 20s" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="clusters-hosted-0/hosted-0-5d7f7fc9d6xb668d-g54rb" namespace="clusters-hosted-0" name="hosted-0-5d7f7fc9d6xb668d-g54rb" reconcileID=26672e6f-9d08-4146-b99c-22f2116e04f9 MachineSet="clusters-hosted-0/hosted-0-5d7f7fc9d6xb668d" MachineDeployment="clusters-hosted-0/hosted-0" Cluster="clusters-hosted-0/hosted-0-79lpm" Node="hosted-worker-0-1"
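To cross-check the drain timeout the machine controller is actually working with, the CAPI Machine object can be inspected (a diagnostic sketch; spec.nodeDrainTimeout is the upstream Cluster API Machine field, and the Machine name is taken from the log line above):
# run against the hub cluster, in the hosted control plane namespace
oc -n clusters-hosted-0 get machine hosted-0-5d7f7fc9d6xb668d-g54rb -o jsonpath='{.spec.nodeDrainTimeout}'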
The nodepool keeps iterating, flapping between 3 and 5 nodes every few minutes:
Sometimes:
(.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get nodepool -n clusters
NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
hosted-0   hosted-0                   3               True          False        4.14.11
(.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get nodes -n pod-test-project --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
NAME                STATUS     ROLES    AGE   VERSION
hosted-worker-0-1   NotReady   worker   11m   v1.27.10+28ed2d7
hosted-worker-0-2   Ready      worker   32m   v1.27.10+28ed2d7
hosted-worker-0-3   Ready      worker   20h   v1.27.10+28ed2d7
hosted-worker-0-4   Ready      worker   20h   v1.27.10+28ed2d7
hosted-worker-0-5   NotReady   worker   12m   v1.27.10+28ed2d7
(.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get agents -n hosted-0
NAME                                   CLUSTER    APPROVED   ROLE     STAGE
35e1beb9-16ee-40b0-9b46-4418ff7c700d              true       worker
918f6bcd-dd6f-4114-994b-bd0ef4eef341   hosted-0   true       worker   Done
a3efb0ed-a94a-4312-ba68-35cf2d6fc21d   hosted-0   true       worker   Done
a4ddefcb-5822-4963-adf8-2af97680d637              true       worker
bd353678-5ac3-47d5-9a2e-7098c750fe4f              true       worker
d958ea6d-e511-4126-b14b-b79f6555e50b   hosted-0   true       worker   Done
(.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$
---------------------------------------------------------------------------
And at other times:
(.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get nodepool -n clusters
NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
hosted-0   hosted-0                   3               True          False        4.14.11                                     Minimum availability requires 5 replicas, current 3 available
(.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get agents -n hosted-0
NAME                                   CLUSTER    APPROVED   ROLE     STAGE
35e1beb9-16ee-40b0-9b46-4418ff7c700d              true       worker
918f6bcd-dd6f-4114-994b-bd0ef4eef341   hosted-0   true       worker   Done
a3efb0ed-a94a-4312-ba68-35cf2d6fc21d   hosted-0   true       worker   Done
a4ddefcb-5822-4963-adf8-2af97680d637   hosted-0   true       worker   Joined
bd353678-5ac3-47d5-9a2e-7098c750fe4f   hosted-0   true       worker   Joined
d958ea6d-e511-4126-b14b-b79f6555e50b   hosted-0   true       worker   Done
(.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$
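In HyperShift the cluster-autoscaler runs in the hosted control plane namespace on the hub cluster; its logs should show why it keeps flapping between 3 and 5 nodes. A sketch of how to collect them, assuming the default deployment name:
# autoscaler scale-up/scale-down decisions for the hosted cluster (run against the hub cluster)
oc -n clusters-hosted-0 logs deployment/cluster-autoscaler --tail=100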
Expected results:
- No pods are in 'Terminating' status for 44h.
- 5 current nodes in the 'get nodepool' output.
- 5 active nodes in 'get nodes'.
- 5 allocated agents in ready state.
Additional info:
Slack thread: https://redhat-internal.slack.com/archives/C058TF9K37Z/p1707931770923379
incorporates:
OCPBUGS-27774 - Hosted cluster's monitoring cluster operator gets unavailable for failing to reconcile node-exporter DaemonSet (Closed)