OCPBUGS-27774: Hosted cluster's monitoring cluster operator becomes unavailable after failing to reconcile the node-exporter DaemonSet


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • Affects Version/s: 4.14, 4.15
    • Component/s: HyperShift / Agent
    • Severity: Moderate

      Description of problem:

The monitoring cluster operator is unavailable:
      
      monitoring                                 4.15.0-rc.3   False       True          True       29m     UpdatingNodeExporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: context deadline exceeded
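
      The rollout the operator is waiting on can be inspected directly against the hosted cluster; a minimal sketch, using the same kubeconfig paths as in the sessions below:
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc rollout status daemonset/node-exporter -n openshift-monitoring --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get daemonset node-exporter -n openshift-monitoring --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      
      Presumably the rollout never completes because the NotReady nodes cannot run the updated node-exporter pods.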
      
      Autoscaling the nodepool from 2 nodes to 5 (by applying a load of 50 pods of 256Mi each), the 3rd node is up after 04:49 min, but the next 2 nodes are still not ready after the 20 min timeout, and their agents are stuck in the "Joined" stage for most of that time.
      After disabling autoscaling and setting the node count back to 2, the hosted cluster shows 4 nodes, 2 of which are NotReady.
        
      In the nodepool there are 2 nodes and autoscaling is off, as expected, but with an irrelevant message: "Scaling down MachineSet to 2 replicas (actual 4)".
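      
      For reference, the kind of load that drives the autoscaler can be approximated with a plain Deployment of memory-requesting pods; a minimal sketch (the name "mem-load" and the pause image are illustrative, not the exact test manifest):
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc create deployment mem-load --image=registry.k8s.io/pause:3.9 --replicas=50 --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc set resources deployment/mem-load --requests=memory=256Mi --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig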
      

       

      Version-Release number of selected component (if applicable):

       [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc version
      Client Version: 4.14.0-0.nightly-2023-07-27-104118
      Kustomize Version: v5.0.1
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get hc -A --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      NAMESPACE   NAME       VERSION       KUBECONFIG                  PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
      clusters    hosted-0   4.15.0-rc.3   hosted-0-admin-kubeconfig   Completed   True        False         The hosted control plane is available
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ 
         

      How reproducible:

Happens intermittently.

      Steps to Reproduce:

          1. Deploy a hub cluster and, on it, a hosted cluster with 6 nodes using the agent provider (I used https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/CI/job/job-runner/2205/).
          2. Run test_toggling_autoscaling_nodepool (https://gitlab.cee.redhat.com/ocp-edge-qe/ocp-edge-auto/-/blob/master/edge_tests/deployment/installer/scale/test_scale_nodepool.py#L322); a sketch of the equivalent manual toggle follows these steps.
          3. The test fails because the expected number of nodes is not reached within the 20 min timeout.
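
      As referenced in step 2, the toggle amounts to patching the NodePool on the hub; a sketch of the equivalent manual commands, assuming the hypershift.openshift.io/v1beta1 NodePool API, where spec.replicas and spec.autoScaling are mutually exclusive:
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc patch nodepool hosted-0 -n clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig --type=json -p '[{"op":"remove","path":"/spec/replicas"},{"op":"add","path":"/spec/autoScaling","value":{"min":2,"max":5}}]'
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc patch nodepool hosted-0 -n clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig --type=json -p '[{"op":"remove","path":"/spec/autoScaling"},{"op":"add","path":"/spec/replicas","value":2}]'
      
      The first command enables autoscaling between 2 and 5 nodes; the second disables it and pins the pool back to 2 replicas.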

      Actual results:

          (.venv) [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get co  --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig 
      NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      monitoring                                 4.15.0-rc.3   False       True          True       63m     UpdatingNodeExporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: context deadline exceeded
      
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get nodes --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig 
      NAME                STATUS     ROLES    AGE   VERSION
      hosted-worker-0-1   NotReady   worker   65m   v1.28.5+c84a6b8
      hosted-worker-0-2   Ready      worker   18h   v1.28.5+c84a6b8
      hosted-worker-0-4   NotReady   worker   96m   v1.28.5+c84a6b8
      hosted-worker-0-5   Ready      worker   18h   v1.28.5+c84a6b8
      
      (.venv) [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get nodepool -A -o wide --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      NAMESPACE   NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION       UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
      clusters    hosted-0   hosted-0   2               2               False         False        4.15.0-rc.3                                      Scaling down MachineSet to 2 replicas (actual 4)
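      
      Since the agents sat in the "Joined" stage, the hub-side Agent resources and the hosted-side node conditions are worth capturing together; a sketch (the Agent CRD namespace varies by deployment, hence -A):
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get agent -A --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc describe node hosted-worker-0-1 --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig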
      
      

      Expected results:

           
          (.venv) [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get co  --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig 
      NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      monitoring                                 4.15.0-rc.3   True        False         False      63m
      We would expect 5 nodes during the test itself; after disabling autoscaling and scaling explicitly back to 2 nodes:
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get nodes --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig 
      NAME                STATUS     ROLES    AGE   VERSION
      hosted-worker-0-2   Ready      worker   18h   v1.28.5+c84a6b8
      hosted-worker-0-5   Ready      worker   18h   v1.28.5+c84a6b8
      
      (.venv) [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get nodepool -A -o wide --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      NAMESPACE   NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION       UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
      clusters    hosted-0   hosted-0   2               2               False         False        4.15.0-rc.3

      Additional info:

          

       

            Assignee: Crystal Chun (cchun@redhat.com)
            Reporter: Gal Amado (rhn-support-gamado)
            Votes: 0
            Watchers: 5