Bug
Resolution: Cannot Reproduce
Normal
4.17.z, 4.16.z
Quality / Stability / Reliability
False
Important
Description of problem:
4.17: The kubelet crashes on the server worker during GTP-U tests.

[root@ocp-svc ~]# sudo virsh list --all
 Id   Name              State
---------------------------------
 1    dciokd-master-0   running
 2    dciokd-master-1   running
 3    dciokd-master-2   running
 4    dciokd-worker-0   running
 5    dciokd-worker-1   running

[root@ocp-svc ~]# oc get nodes -A
NAME              STATUS     ROLES                   AGE     VERSION
dciokd-master-0   Ready      control-plane,master    4d13h   v1.29.7+4510e9c
dciokd-master-1   Ready      control-plane,master    4d13h   v1.29.7+4510e9c
dciokd-master-2   Ready      control-plane,master    4d13h   v1.29.7+4510e9c
dciokd-worker-0   Ready      kcos-licensing,worker   4d13h   v1.29.7+4510e9c
dciokd-worker-1   NotReady   worker                  4d13h   v1.29.7+4510e9c

# dciokd-worker-1
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
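To spot the affected node quickly during a test run, the node listing above can be filtered. This is a minimal sketch, assuming the default column layout of `oc get nodes` (NAME first, STATUS second); the function name `not_ready_nodes` is hypothetical.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads the plain-text output of `oc get nodes` on stdin and skips the header row.
# Note: combined states such as "Ready,SchedulingDisabled" are also reported.
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage (requires cluster access):
#   oc get nodes | not_ready_nodes
```

Fed the listing from this report, the filter would print only dciokd-worker-1.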
Version-Release number of selected component (if applicable):
4.17.z
How reproducible:
Always. The crash is consistently reproducible and occurs on every test run.
Steps to Reproduce:
The scenario described by our Telco partner to reproduce the worker failure is as follows:
1. Install OCP-4.17 with the Agent-based Installer (ABI) on VM nodes. This is a vanilla OCP setup, so no resources are reserved for the kubelet via a KubeletConfig.
2. Configure SR-IOV VFs on worker-0 and worker-1 (Mellanox ConnectX-5).
3. Run GTP-U tests from worker-0 toward the SR-IOV VF interface on worker-1. The test is configured to consume 12 CPUs, while 30 vCPUs are available on the worker.
4. The kubelet on the server (worker-1) crashes during the tests.
5. The failed node is not accessible via SSH and cannot be rebooted.
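To check how much CPU a node actually holds back from pods, one can compare its capacity with its allocatable CPU. The sketch below only does the arithmetic; the helper names `to_millicpu` and `reserved_millicpu` are hypothetical, and it assumes the values are whole CPUs (for example `30`) or millicores (for example `29500m`), not fractional values such as `1.5`.

```shell
# Convert a Kubernetes CPU quantity ("30" or "29500m") to millicores.
# Assumption: fractional values such as "1.5" are not handled.
to_millicpu() {
    case "$1" in
        *m) echo "${1%m}" ;;
        *)  echo $(( $1 * 1000 )) ;;
    esac
}

# Print capacity minus allocatable, in millicores.
reserved_millicpu() {
    echo $(( $(to_millicpu "$1") - $(to_millicpu "$2") ))
}

# Usage (requires cluster access):
#   cap=$(oc get node dciokd-worker-1 -o jsonpath='{.status.capacity.cpu}')
#   alloc=$(oc get node dciokd-worker-1 -o jsonpath='{.status.allocatable.cpu}')
#   reserved_millicpu "$cap" "$alloc"
```

Running the same check before and after changing the kubelet reservation shows whether the new values have taken effect on the node.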
Actual results:
The kubelet on the server (worker-1) crashes during the tests.
Expected results:
The kubelet on the node should not crash, even under intensive network traffic.
Additional info:
1. A must-gather is attached.
2. Workaround: Explicitly reserving some resources for the kubelet prevents it from crashing. The KubeletConfig is taken from here: https://docs.openshift.com/container-platform/4.17/nodes/nodes/nodes-nodes-managing.html#nodes-nodes-managing-about_nodes-nodes-managing

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled
  kubeletConfig:
    podsPerCore: 10
    maxPods: 250
    systemReserved:
      cpu: 2000m
      memory: 1Gi
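The KubeletConfig in the workaround only applies to machine config pools that carry the label `custom-kubelet: enabled`, so the pool has to be labeled before the config has any effect. The following is a sketch of how that might be done, assuming the worker pool is named `worker` and the manifest is saved as `custom-kubelet.yaml` (a hypothetical file name). The `OC` variable is an illustrative addition that allows a dry run with `OC=echo`.

```shell
# Sketch only. Assumptions: pool name "worker", manifest file custom-kubelet.yaml.
OC="${OC:-oc}"   # set OC=echo to print the commands instead of running them

apply_kubelet_workaround() {
    pool="${1:-worker}"
    manifest="${2:-custom-kubelet.yaml}"
    # Label the pool so the KubeletConfig's machineConfigPoolSelector matches it.
    $OC label machineconfigpool "$pool" custom-kubelet=enabled || return 1
    # Create the KubeletConfig from the saved manifest.
    $OC apply -f "$manifest" || return 1
    # Show the pool so the rollout of the new configuration can be followed.
    $OC get machineconfigpool "$pool"
}

# Usage (requires cluster access):
#   apply_kubelet_workaround worker custom-kubelet.yaml
```

Rolling out a kubelet configuration change typically restarts the nodes in the pool one at a time, so it is best applied outside of a test run.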