OpenShift Bugs / OCPBUGS-27290

capi pods restarting and getting evicted


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.14
    • Component: HyperShift / Agent
    • Severity: Important

      Description of problem:

      capi pods exited and restarted. The original capi pod failed/finished at Wed, 17 Jan 2024 12:18:04, and the replacement pod's logs start at 2024-01-17T12:57:50Z. The original pod was evicted with:
      Status:               Failed
      Reason:               Evicted
      Message:              The node was low on resource: ephemeral-storage. Threshold quantity: 6351693871, available: 6143252Ki. Container manager was using 332Ki, request is 0, has larger consumption of ephemeral-storage. 
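
      A minimal sketch of how the eviction can be confirmed from the hub cluster. The hosted control plane namespace clusters-hosted-0 and the pod/node names below are placeholders, not values taken from this report:

          export KUBECONFIG=~/clusterconfigs/auth/hub-kubeconfig

          # List pods that ended in the Failed phase (evicted pods stay Failed until cleaned up)
          oc get pods -n clusters-hosted-0 --field-selector=status.phase=Failed

          # Show the eviction Reason/Message recorded on a specific pod
          oc describe pod <capi-pod-name> -n clusters-hosted-0

          # Check the node that ran the pod for ephemeral-storage / disk pressure
          oc describe node <node-name> | grep -i -E 'ephemeral-storage|DiskPressure'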

      Version-Release number of selected component (if applicable):

          (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc version
      Client Version: 4.14.0-0.nightly-2024-01-15-085353
      Kustomize Version: v5.0.1
      Server Version: 4.14.9
      Kubernetes Version: v1.27.9+5c56cc3
      (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ oc get hc -A --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      NAMESPACE   NAME       VERSION   KUBECONFIG                  PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
      clusters    hosted-0   4.14.9    hosted-0-admin-kubeconfig   Completed   True        False         The hosted control plane is available
      (.venv) [kni@ocp-edge77 ocp-edge-auto_cluster]$ 
      
      
      Updated the HyperShift operator image to: https://quay.io/repository/hypershift/hypershift-operator/manifest/sha256:6ffead94e14d7aa56e4fd9adb9e45e337c1b32a3fc191c10a5c1306c3990fc9f
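
      For reference, a sketch of how the running operator image can be confirmed on the hub, assuming the default hypershift namespace and the operator deployment name used by hypershift install:

          oc get deployment operator -n hypershift \
            -o jsonpath='{.spec.template.spec.containers[0].image}'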
          

      How reproducible:

      Happens sometimes 

      Steps to Reproduce:

          1. Deploy a hub cluster and, on it, a hosted cluster with 6 nodes using the agent provider.

          2. Run this automation test for scaling down: https://gitlab.cee.redhat.com/ocp-edge-qe/ocp-edge-auto/-/blob/master/edge_tests/deployment/installer/scale/test_scale_nodepool.py?ref_type=heads#L43

          3. Although the test passed, the teardown that scales the node pool back to 2 nodes took more than 30 minutes, so the teardown failed on a timeout (see the sketch after these steps for driving the same scale-down manually).

          4. I don't know exactly when, but after roughly 3 hours the node pool scale to 2 nodes completed; if needed, I can recheck with a longer timeout to get the accurate time.
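
      A minimal sketch of driving and watching the same scale-down by hand from the hub. The NodePool name hosted-0 and the clusters namespace are assumptions based on the oc get hc output above; NodePool is scaled here via its scale subresource:

          export KUBECONFIG=~/clusterconfigs/auth/hub-kubeconfig

          # Scale the NodePool back to 2 replicas
          oc scale nodepool hosted-0 -n clusters --replicas=2

          # Watch desired vs. current replicas while nodes are removed
          oc get nodepool hosted-0 -n clusters -w

          # Agent provider: watch agents being unbound from the hosted cluster
          oc get agents -A -w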

      Actual results:

          Scaling the node pool back down to 2 nodes took more than 30 minutes (it only completed after roughly 3 hours).

      Expected results:

          Scaling completes in a reasonable time (< 10 minutes).

      Additional info:

          

       

            Assignee: Crystal Chun (cchun@redhat.com)
            Reporter: Gal Amado (rhn-support-gamado)