OpenShift Bugs / OCPBUGS-48367

4.17: The kubelet crashes on the server worker during GTP-U tests


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Normal
    • 4.17.z, 4.16.z
    • Node / Kubelet
    • Quality / Stability / Reliability
    • Important

      Description of problem:

      4.17: The kubelet crashes on the server worker during GTP-U tests.
      
      
      [root@ocp-svc ~]# sudo virsh list --all
       Id   Name              State
      ---------------------------------
       1    dciokd-master-0   running
       2    dciokd-master-1   running
       3    dciokd-master-2   running
       4    dciokd-worker-0   running
       5    dciokd-worker-1   running
      
      
      [root@ocp-svc ~]# oc get nodes -A
      NAME              STATUS     ROLES                   AGE     VERSION
      dciokd-master-0   Ready      control-plane,master    4d13h   v1.29.7+4510e9c
      dciokd-master-1   Ready      control-plane,master    4d13h   v1.29.7+4510e9c
      dciokd-master-2   Ready      control-plane,master    4d13h   v1.29.7+4510e9c
      dciokd-worker-0   Ready      kcos-licensing,worker   4d13h   v1.29.7+4510e9c
      dciokd-worker-1   NotReady   worker                  4d13h   v1.29.7+4510e9c
      
      
      # dciokd-worker-1
      Conditions:
        Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
        ----             ------    -----------------                 ------------------                ------              -------
        MemoryPressure   Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
        DiskPressure     Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
        PIDPressure      Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
        Ready            Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status. 
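
      The conditions above resemble the output of `oc describe node`. The report does not show the command that produced them, so that is an assumption. A minimal shell sketch for pulling the same information follows; the helper names `node_ready_status` and `node_details` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only; helper names are hypothetical and not part of the report.

# Print the status of the node's Ready condition (True, False, or Unknown).
node_ready_status() {
  local node="$1"
  oc get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Print the full node description, including the Conditions table.
node_details() {
  local node="$1"
  oc describe node "$node"
}
```

      For example, `node_ready_status dciokd-worker-1` would be expected to print `Unknown` while the node is in the state shown above.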

      Version-Release number of selected component (if applicable):

      4.17.z    

      How reproducible:

      Always. The crash shows up on every test run.

      Steps to Reproduce:

      The scenario described by our Telco partner to reproduce the worker failure is as follows:
      
      1. Install OCP 4.17 with the Agent-based Installer (ABI) on VM nodes. This is a vanilla OCP setup, so no resources are reserved for the kubelet via a KubeletConfig.
      2. Configure SR-IOV VFs on worker-0 and worker-1 (Mellanox ConnectX-5).
      3. Run GTP-U tests from worker-0 toward the VF SR-IOV interface on worker-1. The test is configured to consume 12 CPUs, while 30 vCPUs are available on the worker.
      4. The kubelet on the server (worker-1) crashes during the tests. 
      5. The failed node is not accessible via SSH and cannot be rebooted.
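
      Step 1 depends on no kubelet reservation being configured. One way to check that precondition before a test run is sketched below. The report does not include these commands, and the helper names `show_cpu_reservation` and `list_kubelet_configs` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only; helper names are hypothetical and not part of the report.

# Print "<capacity> <allocatable>" CPU for a node. The gap between the two
# values is the CPU the kubelet holds back from pods through reservations.
show_cpu_reservation() {
  local node="$1"
  oc get node "$node" \
    -o jsonpath='{.status.capacity.cpu}{" "}{.status.allocatable.cpu}{"\n"}'
}

# List KubeletConfig objects; in the vanilla setup from step 1 none are expected.
list_kubelet_configs() {
  oc get kubeletconfig
}
```

      Running `show_cpu_reservation dciokd-worker-1` before and after applying a KubeletConfig should make any change in reserved CPU visible.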

      Actual results:

      The kubelet on the server (worker-1) crashes during the tests. 

      Expected results:

      The kubelet on the node should not crash, even under intensive network traffic.

      Additional info:

      1. A must-gather is attached.
      2. Workaround: Explicitly reserving some resources for the kubelet prevents it from crashing. The KubeletConfig below is taken from https://docs.openshift.com/container-platform/4.17/nodes/nodes/nodes-nodes-managing.html#nodes-nodes-managing-about_nodes-nodes-managing
      
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: custom-config 
      spec:
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: enabled 
        kubeletConfig: 
          podsPerCore: 10
          maxPods: 250
          systemReserved:
            cpu: 2000m
            memory: 1Gi 
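
      The KubeletConfig above only takes effect on machine config pools that carry the `custom-kubelet: enabled` label. A minimal sketch of applying it follows, assuming the target pool is named `worker` and the manifest is saved as `custom-config.yaml`. Both names and the helper `apply_kubelet_workaround` are illustrative assumptions, not taken from the report.

```shell
#!/usr/bin/env bash
# Sketch only: apply the KubeletConfig workaround to a machine config pool.
# Assumptions: pool name "worker" and manifest file "custom-config.yaml"
# are hypothetical defaults.

apply_kubelet_workaround() {
  local pool="${1:-worker}"
  local manifest="${2:-custom-config.yaml}"

  # The KubeletConfig selects pools by label, so the pool must carry it.
  oc label machineconfigpool "$pool" custom-kubelet=enabled --overwrite || return 1

  # Create or update the KubeletConfig object from the manifest.
  oc apply -f "$manifest" || return 1

  # Show the pool's rollout state while the new config is applied.
  oc get machineconfigpool "$pool"
}
```

      Calling `apply_kubelet_workaround worker custom-config.yaml` labels the pool, applies the manifest, and prints the pool status. The nodes in the pool are expected to restart as the new configuration rolls out.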

              pehunt@redhat.com Peter Hunt
              tkrishto@redhat.com Tatiana Krishtop (Inactive)
              Cameron Meadors Cameron Meadors