Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.11, 4.10
Component/s: Node / Kubelet
Labels:

Severity: Moderate
Regression: None
Blocked: False
Blocked Reason: None
Target Version: 4.11.z

This is a clone of issue ~~OCPBUGS-4805~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-4101~~. The following is the description of the original issue:
—
Description of problem:

We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.

One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. After the upgrade, /etc/node-sizing.env on its master nodes contained an empty SYSTEM_RESERVED_ES value:

---
cat /etc/node-sizing.env
SYSTEM_RESERVED_MEMORY=5.36Gi
SYSTEM_RESERVED_CPU=0.11
SYSTEM_RESERVED_ES=
---

This prevented the kubelet from starting. To restore service, the file was manually updated to set a value (1Gi) and the kubelet was restarted.
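For triage on other nodes, the check can be scripted. The following is a minimal sketch, not part of any OpenShift tooling: the function name find_empty_keys is hypothetical, and it assumes the file holds plain KEY=VALUE lines as in the excerpt above.

```python
def find_empty_keys(env_text):
    """Return the names of KEY=VALUE entries whose value is empty."""
    empty = []
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat a bare `KEY=` and a quoted empty string as empty.
        if value.strip().strip("\"'") == "":
            empty.append(key.strip())
    return empty


# Contents of /etc/node-sizing.env as reported for clusterA.
sample = """SYSTEM_RESERVED_MEMORY=5.36Gi
SYSTEM_RESERVED_CPU=0.11
SYSTEM_RESERVED_ES=
"""

print(find_empty_keys(sample))  # ['SYSTEM_RESERVED_ES']
```

On a node, the same function could be fed the real file contents, for example via open("/etc/node-sizing.env").read().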

We are uncertain what conditions led to this occurring on the clusterA master nodes as part of the upgrade.

A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. Its worker nodes were hit by a similar problem, in this case because a custom node-sizing-enabled.env MachineConfig did not set SYSTEM_RESERVED_ES.

This caused existing worker nodes to go into a NotReady state after the upgrade. New nodes also failed to join the cluster because their kubelet was affected in the same way.

For clusterB, the reason the value is empty is better understood.

However, in both clusters, an empty SYSTEM_RESERVED_ES on a node can prevent the kubelet from starting.

We have some asks as a result:
- Can MCO be made to recover from this situation if it occurs, perhaps through application of a safe default if none exists, such that kubelet would start correctly?
- Can there possibly be alerting that could indicate and draw attention to the misconfiguration?
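
To make the first ask concrete, here is a minimal sketch of what "apply a safe default if none exists" could mean for the env file. It is an illustration only, not how MCO handles this: the function name fill_empty_values and the ASSUMED_DEFAULTS table are hypothetical, and 1Gi is used only because it is the value that restored service in the manual workaround described above.

```python
# Hypothetical fallback table. 1Gi is the value used in the manual
# workaround in this report, not a documented product default.
ASSUMED_DEFAULTS = {"SYSTEM_RESERVED_ES": "1Gi"}


def fill_empty_values(env_text, defaults=ASSUMED_DEFAULTS):
    """Return env_text with empty or missing default-able keys filled in."""
    seen = set()
    out = []
    for raw in env_text.splitlines():
        key, sep, value = raw.partition("=")
        key = key.strip()
        if sep and key in defaults:
            seen.add(key)
            # Key present but empty (the clusterA case): substitute the default.
            if not value.strip():
                raw = f"{key}={defaults[key]}"
        out.append(raw)
    # Key absent entirely (the clusterB case): append it.
    for key, value in defaults.items():
        if key not in seen:
            out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


broken = "SYSTEM_RESERVED_MEMORY=5.36Gi\nSYSTEM_RESERVED_CPU=0.11\nSYSTEM_RESERVED_ES=\n"
print(fill_empty_values(broken), end="")
```

Values that are already set are left untouched, so a custom MachineConfig that does set SYSTEM_RESERVED_ES would keep its own value under this scheme.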

Version-Release number of selected component (if applicable):

4.11.17

How reproducible:

Not reproduced so far on a fresh cluster upgraded from 4.11.16 to 4.11.17.

Expected results:

If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env, a default should be applied and/or the kubelet should be able to continue running.

Additional info:

Issue Links:

blocks: OCPBUGS-5831 Empty/missing node-sizing SYSTEM_RESERVED_ES parameter can result in kubelet not starting (Closed)
clones: OCPBUGS-4805 Empty/missing node-sizing SYSTEM_RESERVED_ES parameter can result in kubelet not starting (Closed)
is blocked by: OCPBUGS-4805 Empty/missing node-sizing SYSTEM_RESERVED_ES parameter can result in kubelet not starting (Closed)
is cloned by: OCPBUGS-5831 Empty/missing node-sizing SYSTEM_RESERVED_ES parameter can result in kubelet not starting (Closed)
is related to: OCPNODE-1367 Impact Empty/missing node-sizing SYSTEM_RESERVED_ES parameter can result in kubelet not starting (Closed)
links to: openshift/machine-config-operator#3459: [release-4.11] OCPBUGS-4945: Do not allow empty system reserved values

Assignee: Harshal Patil
Reporter: OpenShift Prow Bot
QA Contact: Sunil Choudhary
Contributing Groups: Red Hat Employee
Votes: 0
Watchers: 10
Created: 2022/12/15 3:43 PM
Updated: 2023/10/10 3:04 AM
Resolved: 2023/01/23 3:53 PM
