Type: Bug
Resolution: Cannot Reproduce
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.17.z, 4.16.z
Component/s: Node / Kubelet
Labels:
- 4.16
- 4.17
- green
- triaged

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

On 4.17, the kubelet on the server-side worker (dciokd-worker-1) crashes during GTP-U tests.


[root@ocp-svc ~]# sudo virsh list --all
 Id   Name              State
---------------------------------
 1    dciokd-master-0   running
 2    dciokd-master-1   running
 3    dciokd-master-2   running
 4    dciokd-worker-0   running
 5    dciokd-worker-1   running


[root@ocp-svc ~]# oc get nodes -A
NAME              STATUS     ROLES                   AGE     VERSION
dciokd-master-0   Ready      control-plane,master    4d13h   v1.29.7+4510e9c
dciokd-master-1   Ready      control-plane,master    4d13h   v1.29.7+4510e9c
dciokd-master-2   Ready      control-plane,master    4d13h   v1.29.7+4510e9c
dciokd-worker-0   Ready      kcos-licensing,worker   4d13h   v1.29.7+4510e9c
dciokd-worker-1   NotReady   worker                  4d13h   v1.29.7+4510e9c


# dciokd-worker-1
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Mon, 14 Oct 2024 02:19:34 +0200   Mon, 14 Oct 2024 02:22:03 +0200   NodeStatusUnknown   Kubelet stopped posting node status.

Version-Release number of selected component (if applicable):

4.17.z

How reproducible:

Always. The failure occurs on every test run.

Steps to Reproduce:

The scenario described by our Telco partner to reproduce the worker failure is as follows:

1. Install OCP 4.17 with the Agent-based Installer (ABI) on VM nodes. This is a vanilla OCP setup, so no resources are reserved for the kubelet via a KubeletConfig.
2. Configure SR-IOV VFs on worker-0 and worker-1 (Mellanox ConnectX-5).
3. Run GTP-U tests from worker-0 toward the SR-IOV VF interface on worker-1. The test is configured to consume 12 CPUs, while 30 vCPUs are available on the worker.
4. The kubelet on the server (worker-1) crashes during the tests.
5. The failed node is not accessible via SSH and cannot be rebooted.

Actual results:

The kubelet on the server (worker-1) crashes during the tests.

Expected results:

The kubelet on the node should not crash, even under intensive network traffic.

Additional info:

1. The must-gather is attached.
2. Workaround: Explicitly reserving some resources for the kubelet prevents it from crashing. The KubeletConfig below is taken from: https://docs.openshift.com/container-platform/4.17/nodes/nodes/nodes-nodes-managing.html#nodes-nodes-managing-about_nodes-nodes-managing

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-config 
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled 
  kubeletConfig: 
    podsPerCore: 10
    maxPods: 250
    systemReserved:
      cpu: 2000m
      memory: 1Gi
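
As a rough illustration of what the workaround changes, the sketch below computes the CPU left allocatable to pods once `systemReserved.cpu` is set. It assumes the usual Kubernetes Node Allocatable arithmetic for CPU (capacity minus `kubeReserved` minus `systemReserved`), takes the 30 vCPUs from the reproduction steps as node capacity, and assumes `kubeReserved` is unset because it does not appear in the KubeletConfig above. The helper names are hypothetical. This is not a statement of root cause: the report does not confirm why the kubelet crashes.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "2000m", "2" or "0.5" to millicores."""
    q = quantity.strip()
    if q.endswith("m"):
        return int(q[:-1])
    return int(round(float(q) * 1000))


def allocatable_cpu(capacity: str, system_reserved: str = "0", kube_reserved: str = "0") -> int:
    """Allocatable CPU in millicores: capacity - kubeReserved - systemReserved."""
    return (
        cpu_to_millicores(capacity)
        - cpu_to_millicores(kube_reserved)
        - cpu_to_millicores(system_reserved)
    )


if __name__ == "__main__":
    # Assumption: 30 vCPUs capacity (from the reproduction steps), 2000m reserved (from the workaround).
    without_reservation = allocatable_cpu("30")
    with_reservation = allocatable_cpu("30", system_reserved="2000m")
    test_demand = cpu_to_millicores("12")  # CPUs the GTP-U test is configured to consume
    print("allocatable without reservation (m):", without_reservation)
    print("allocatable with reservation (m):", with_reservation)
    print("headroom after test demand (m):", with_reservation - test_demand)
```

With the reservation in place, two CPUs are withheld from pod scheduling for system daemons, while the 12-CPU test still fits within what remains allocatable.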

Assignee: Peter Hunt

Reporter: Tatiana Krishtop (Inactive)

QA Contact: Cameron Meadors

Need Info From: None

Votes: 0

Watchers: 7

Created: 2025/01/14 12:07 PM

Updated: 2025/09/04 2:55 PM

Resolved: 2025/09/04 2:55 PM
