OpenShift Bugs · OCPBUGS-29287

Pods are stuck in "Terminating" status causing nodepool autoscaling to fail adding new nodes


      Severity: Important

      Description of problem:

          Pods remain stuck in Terminating indefinitely, even though nodeDrainTimeout is set to 30s.

      Version-Release number of selected component (if applicable):

      [kni@ocp-edge99 ~]$ ~/hypershift_working/hypershift/bin/hcp -v
      hcp version openshift/hypershift: 6d15150857fa2d8d3c134deb7624d9cab42889dc. Latest supported OCP: 4.14.0
      [kni@ocp-edge99 ~]$ oc version
      Client Version: 4.13.0-0.nightly-2023-06-09-152551
      Kustomize Version: v4.5.7
      Server Version: 4.14.0-0.nightly-2024-02-06-070712
      Kubernetes Version: v1.27.10+28ed2d7
      [kni@ocp-edge99 ~]$     

      How reproducible:

          Intermittent; happens sometimes, but not on every run.

       

      Steps to Reproduce:

          1. Deploy a hub cluster and, on it, a hosted cluster with 6 nodes using the agent provider (I used https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/CI/job/job-runner/2205/).
          2. Scale the nodepool to 2 nodes:
      [kni@ocp-edge99 ~]$ oc scale nodepool/hosted-0 --namespace clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig --replicas=2
      
      
      3. Deploy a load of 10 pod replicas and check that it is reachable (a hello-openshift pod that responds to an HTTP curl will do) - the 2 remaining nodes can still carry that load of 10 pods.
      
      4. Enable nodepool autoscaling with min 1 / max 5 nodes:
      [kni@ocp-edge99 ~]$ oc -n clusters patch nodepool hosted-0 --type=json -p '[{"op": "remove", "path": "/spec/replicas"},{"op":"add", "path": "/spec/autoScaling", "value": { "max": 5, "min": 1 }}]'
      
      5. Deploy a load of 50 pods. Autoscaling tries to allocate 5 nodes for that load, but manages to allocate a total of only 3 nodes.
      6. After the timeout, the test teardown fails to scale back down to 2 nodes and fails to clean up the environment because pod removal hangs.
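      For reference, after step 4 the NodePool spec should look roughly like the sketch below. This is hand-written from the HyperShift NodePool API shape, not captured from the cluster; the 30s nodeDrainTimeout mentioned in the description is included:

      ```yaml
      apiVersion: hypershift.openshift.io/v1beta1
      kind: NodePool
      metadata:
        name: hosted-0
        namespace: clusters
      spec:
        clusterName: hosted-0
        platform:
          type: Agent
        nodeDrainTimeout: 30s
        # replicas was removed by the patch above; it must not be set
        # together with autoScaling
        autoScaling:
          min: 1
          max: 5
      ```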
      
      

      Actual results:

      23 pods are stuck in Terminating; the first 3 are shown here:
      [kni@ocp-edge99 ~]$  oc get pods -n pod-test-project --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig | grep hello | grep Terminating
      hello-openshift-5bb8d466b6-2f4gs   0/1     Terminating   0          44h
      hello-openshift-5bb8d466b6-2w9hf   0/1     Terminating   0          44h
      hello-openshift-5bb8d466b6-42vv9   0/1     Terminating   0          44h
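      The full count of stuck pods can be pulled straight out of the `oc get pods` output; a small sketch, demonstrated here on the captured sample above rather than against the live cluster:

      ```shell
      # Sample captured from:
      #   oc get pods -n pod-test-project --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig | grep hello
      sample='hello-openshift-5bb8d466b6-2f4gs   0/1     Terminating   0          44h
      hello-openshift-5bb8d466b6-2w9hf   0/1     Terminating   0          44h
      hello-openshift-5bb8d466b6-42vv9   0/1     Terminating   0          44h'

      # Count rows whose STATUS column (field 3) reads Terminating
      echo "$sample" | awk '$3 == "Terminating"' | wc -l   # prints 3
      ```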
      
      In the cluster-api logs, draining of the nodes fails:
      
      E0207 14:44:36.337471       1 machine_controller.go:641] "Drain failed, retry in 20s" err="error when evicting pods/\"thanos-querier-967c66688-6w6pm\" -n \"openshift-monitoring\": global timeout reached: 20s" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="clusters-hosted-0/hosted-0-5d7f7fc9d6xb668d-g54rb" namespace="clusters-hosted-0" name="hosted-0-5d7f7fc9d6xb668d-g54rb" reconcileID=26672e6f-9d08-4146-b99c-22f2116e04f9 MachineSet="clusters-hosted-0/hosted-0-5d7f7fc9d6xb668d" MachineDeployment="clusters-hosted-0/hosted-0" Cluster="clusters-hosted-0/hosted-0-79lpm" Node="hosted-worker-0-1"
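      The controller log line embeds the pod and namespace that block the drain. A sketch that extracts both from a captured log line (the backslash-escaped quotes appear verbatim in the log; this runs on the excerpt, not the live controller):

      ```shell
      # Excerpt of the machine controller error above
      line='E0207 14:44:36.337471       1 machine_controller.go:641] "Drain failed, retry in 20s" err="error when evicting pods/\"thanos-querier-967c66688-6w6pm\" -n \"openshift-monitoring\": global timeout reached: 20s"'

      # Pull out the pod name between pods/\" and \"
      pod=$(echo "$line" | sed -n 's|.*pods/\\"\([^\\]*\)\\".*|\1|p')
      # Pull out the namespace between -n \" and \"
      ns=$(echo "$line" | sed -n 's|.*-n \\"\([^\\]*\)\\".*|\1|p')
      echo "$pod $ns"   # prints: thanos-querier-967c66688-6w6pm openshift-monitoring
      ```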
      The nodepool keeps iterating, alternating between 3 and 5 desired nodes every few minutes. Sometimes:
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get nodepool -n clusters
      NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
      hosted-0   hosted-0                   3               True          False        4.14.11                                      
      
      
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$  oc get nodes -n pod-test-project --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      NAME                STATUS     ROLES    AGE   VERSION
      hosted-worker-0-1   NotReady   worker   11m   v1.27.10+28ed2d7
      hosted-worker-0-2   Ready      worker   32m   v1.27.10+28ed2d7
      hosted-worker-0-3   Ready      worker   20h   v1.27.10+28ed2d7
      hosted-worker-0-4   Ready      worker   20h   v1.27.10+28ed2d7
      hosted-worker-0-5   NotReady   worker   12m   v1.27.10+28ed2d7
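      The unhealthy workers can be filtered out of the `oc get nodes` output; a sketch run against the snapshot captured above rather than the live hosted cluster:

      ```shell
      # Snapshot captured from:
      #   oc get nodes --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      sample='NAME                STATUS     ROLES    AGE   VERSION
      hosted-worker-0-1   NotReady   worker   11m   v1.27.10+28ed2d7
      hosted-worker-0-2   Ready      worker   32m   v1.27.10+28ed2d7
      hosted-worker-0-3   Ready      worker   20h   v1.27.10+28ed2d7
      hosted-worker-0-4   Ready      worker   20h   v1.27.10+28ed2d7
      hosted-worker-0-5   NotReady   worker   12m   v1.27.10+28ed2d7'

      # Print node names whose STATUS column is NotReady, skipping the header
      echo "$sample" | awk 'NR > 1 && $2 == "NotReady" {print $1}'
      # prints: hosted-worker-0-1 and hosted-worker-0-5
      ```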
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get agents -n hosted-0
      NAME                                   CLUSTER    APPROVED   ROLE     STAGE
      35e1beb9-16ee-40b0-9b46-4418ff7c700d              true       worker   
      918f6bcd-dd6f-4114-994b-bd0ef4eef341   hosted-0   true       worker   Done
      a3efb0ed-a94a-4312-ba68-35cf2d6fc21d   hosted-0   true       worker   Done
      a4ddefcb-5822-4963-adf8-2af97680d637              true       worker   
      bd353678-5ac3-47d5-9a2e-7098c750fe4f              true       worker   
      d958ea6d-e511-4126-b14b-b79f6555e50b   hosted-0   true       worker   Done
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ 
      
      
      ---------------------------------------------------------------------------
      
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get nodepool -n clusters
      NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
      hosted-0   hosted-0                   3               True          False        4.14.11                                      Minimum availability requires 5 replicas, current 3 available
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$ oc get agents -n hosted-0
      NAME                                   CLUSTER    APPROVED   ROLE     STAGE
      35e1beb9-16ee-40b0-9b46-4418ff7c700d              true       worker   
      918f6bcd-dd6f-4114-994b-bd0ef4eef341   hosted-0   true       worker   Done
      a3efb0ed-a94a-4312-ba68-35cf2d6fc21d   hosted-0   true       worker   Done
      a4ddefcb-5822-4963-adf8-2af97680d637   hosted-0   true       worker   Joined
      bd353678-5ac3-47d5-9a2e-7098c750fe4f   hosted-0   true       worker   Joined
      d958ea6d-e511-4126-b14b-b79f6555e50b   hosted-0   true       worker   Done
      (.venv) [kni@ocp-edge99 ocp-edge-auto_cluster]$

       

       

      Expected results:

      - No pods remain in 'Terminating' status for 44h.
      - 5 current nodes in the 'get nodepool' output.
      - 5 active nodes in 'get nodes'.
      - 5 allocated agents in Ready state.

       

      Additional info in the Slack thread: https://redhat-internal.slack.com/archives/C058TF9K37Z/p1707931770923379

       

            Crystal Chun (cchun@redhat.com)
            Gal Amado (rhn-support-gamado)