OCPBUGS-27774: Hosted cluster's monitoring cluster operator becomes unavailable after failing to reconcile the node-exporter DaemonSet


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • Affects Version/s: 4.14, 4.15
    • Component/s: HyperShift / Agent
    • Severity: Moderate

      Description of problem:

The monitoring cluster operator is unavailable:
      
      monitoring                                 4.15.0-rc.3   False       True          True       29m     UpdatingNodeExporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: context deadline exceeded
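
      The rollout the operator is waiting on can be inspected directly against the hosted cluster; a minimal sketch, using the same kubeconfig paths as in the sessions below:
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc rollout status daemonset/node-exporter -n openshift-monitoring --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get daemonset node-exporter -n openshift-monitoring --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      
      Presumably the rollout never completes because the NotReady nodes cannot run the updated node-exporter pods.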
      
      Autoscaling the nodepool from 2 nodes to 5 (by applying a load of 50 pods of 256Mi each), the 3rd node is up after 04:49 min, but the next 2 nodes are still not ready after the 20 min timeout, and their agents are stuck in the "Joined" stage for most of that time.
      After disabling autoscaling and setting the node count back to 2, the hosted cluster shows 4 nodes, 2 of which are NotReady.
        
      In the nodepool there are 2 nodes and autoscaling is off, as expected, but with an irrelevant message: "Scaling down MachineSet to 2 replicas (actual 4)".
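      
      For reference, the kind of load that drives the autoscaler can be approximated with a plain Deployment of memory-requesting pods; a minimal sketch (the name "mem-load" and the pause image are illustrative, not the exact test manifest):
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc create deployment mem-load --image=registry.k8s.io/pause:3.9 --replicas=50 --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc set resources deployment/mem-load --requests=memory=256Mi --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig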
      

       

      Version-Release number of selected component (if applicable):

       [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc version
      Client Version: 4.14.0-0.nightly-2023-07-27-104118
      Kustomize Version: v5.0.1
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get hc -A --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      NAMESPACE   NAME       VERSION       KUBECONFIG                  PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
      clusters    hosted-0   4.15.0-rc.3   hosted-0-admin-kubeconfig   Completed   True        False         The hosted control plane is available
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ 
         

      How reproducible:

Happens intermittently.

      Steps to Reproduce:

          1. Deploy a hub cluster and, on it, a hosted cluster with 6 nodes using the agent provider (I used https://auto-jenkins-csb-kniqe.apps.ocp-c1.prod.psi.redhat.com/job/CI/job/job-runner/2205/).
          2. Run test_toggling_autoscaling_nodepool (https://gitlab.cee.redhat.com/ocp-edge-qe/ocp-edge-auto/-/blob/master/edge_tests/deployment/installer/scale/test_scale_nodepool.py#L322); a sketch of the equivalent manual toggle follows these steps.
          3. The test fails because the expected number of nodes is not reached within the 20 min timeout.
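
      As referenced in step 2, the toggle amounts to patching the NodePool on the hub; a sketch of the equivalent manual commands, assuming the hypershift.openshift.io/v1beta1 NodePool API, where spec.replicas and spec.autoScaling are mutually exclusive:
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc patch nodepool hosted-0 -n clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig --type=json -p '[{"op":"remove","path":"/spec/replicas"},{"op":"add","path":"/spec/autoScaling","value":{"min":2,"max":5}}]'
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc patch nodepool hosted-0 -n clusters --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig --type=json -p '[{"op":"remove","path":"/spec/autoScaling"},{"op":"add","path":"/spec/replicas","value":2}]'
      
      The first command enables autoscaling between 2 and 5 nodes; the second disables it and pins the pool back to 2 replicas.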

      Actual results:

          (.venv) [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get co  --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig 
      NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      monitoring                                 4.15.0-rc.3   False       True          True       63m     UpdatingNodeExporter: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: context deadline exceeded
      
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get nodes --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig 
      NAME                STATUS     ROLES    AGE   VERSION
      hosted-worker-0-1   NotReady   worker   65m   v1.28.5+c84a6b8
      hosted-worker-0-2   Ready      worker   18h   v1.28.5+c84a6b8
      hosted-worker-0-4   NotReady   worker   96m   v1.28.5+c84a6b8
      hosted-worker-0-5   Ready      worker   18h   v1.28.5+c84a6b8
      
      (.venv) [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get nodepool -A -o wide --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      NAMESPACE   NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION       UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
      clusters    hosted-0   hosted-0   2               2               False         False        4.15.0-rc.3                                      Scaling down MachineSet to 2 replicas (actual 4)
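      
      Since the agents sat in the "Joined" stage, the hub-side Agent resources and the hosted-side node conditions are worth capturing together; a sketch (the Agent CRD namespace varies by deployment, hence -A):
      
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get agent -A --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc describe node hosted-worker-0-1 --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig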
      
      

      Expected results:

           
          (.venv) [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get co  --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig 
      NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      monitoring                                 4.15.0-rc.3   True        False         False      63m
      We would expect 5 nodes during the test itself; after disabling autoscaling and scaling explicitly back to 2 nodes:
      [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get nodes --kubeconfig ~/clusterconfigs/hosted-0/auth/kubeconfig 
      NAME                STATUS     ROLES    AGE   VERSION
      hosted-worker-0-2   Ready      worker   18h   v1.28.5+c84a6b8
      hosted-worker-0-5   Ready      worker   18h   v1.28.5+c84a6b8
      
      (.venv) [kni@ocp-edge119 ocp-edge-auto_cluster]$ oc get nodepool -A -o wide --kubeconfig ~/clusterconfigs/auth/hub-kubeconfig
      NAMESPACE   NAME       CLUSTER    DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION       UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
      clusters    hosted-0   hosted-0   2               2               False         False        4.15.0-rc.3

      Additional info:

          

       

            Assignee: Crystal Chun (cchun@redhat.com)
            Reporter: Gal Amado (rhn-support-gamado)
            Votes: 0
            Watchers: 5