Bug
Resolution: Unresolved
Critical
4.16
Incidents & Support
CORENET Sprint 284
Description of problem:
I have a customer (IHAC) with a new bare metal cluster. Immediately after migration, they experienced pod scheduling problems, and some pods could not start at all.
Investigation showed that the IP range allocated to each node (hostPrefix 23) limits each node to 510 pods. The customer now has large bare metal workers and over 6000 pods in total. After migration, with the 510-pod limit per node, the bare metal workers could not handle the full pod workload.
The customer therefore wants to raise this limit to accommodate around 6000 pods, which requires a larger address space per node. The core problem is that the hostPrefix cannot be changed after cluster installation; this is not supported as a Day-2 operation. The customer is currently on OpenShift Container Platform (OCP) 4.16 and plans to upgrade to 4.18.
A KCS article [1] covers increasing the pod network and changing the host prefix. The customer requested and received Support Exception SUPPORTEX-29444 [2] to change the hostPrefix from 23 to 21, which allows 2046 pods per node. The support exception has also been approved by PM and Engineering.
[1] KCS: https://access.redhat.com/solutions/6456731
[2] Support Exception: https://issues.redhat.com/browse/SUPPORTEX-29444
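For reference, the per-node figures above (510 pods at hostPrefix 23, 2046 at hostPrefix 21) follow from the size of an IPv4 subnet minus its network and broadcast addresses. The sketch below reproduces that arithmetic and also shows how many per-node subnets fit into a cluster network. The 10.128.0.0/14 CIDR is only the OpenShift installer default and is an assumption here, since the customer's clusterNetwork CIDR is not stated in this report. The helper names are hypothetical, and the CNI plugin may reserve additional addresses per node, so treat the results as upper bounds.

```python
import ipaddress


def usable_pod_ips(host_prefix: int) -> int:
    """Addresses in a per-node IPv4 subnet, minus network and broadcast."""
    return 2 ** (32 - host_prefix) - 2


def node_subnets(cluster_cidr: str, host_prefix: int) -> int:
    """Number of per-node subnets that can be carved out of the cluster network."""
    net = ipaddress.ip_network(cluster_cidr)
    if host_prefix < net.prefixlen:
        raise ValueError("hostPrefix must not be shorter than the cluster CIDR prefix")
    return 2 ** (host_prefix - net.prefixlen)


# ASSUMPTION: installer default CIDR; the customer's actual value is not in this report.
ASSUMED_CLUSTER_CIDR = "10.128.0.0/14"

for prefix in (23, 21):
    print(
        f"hostPrefix {prefix}: "
        f"{usable_pod_ips(prefix)} pod IPs per node, "
        f"{node_subnets(ASSUMED_CLUSTER_CIDR, prefix)} node subnets in {ASSUMED_CLUSTER_CIDR}"
    )
```

Under the assumed /14 network, moving from hostPrefix 23 to 21 raises the per-node address count from 510 to 2046 but reduces the number of node subnets from 512 to 128, so the maximum node count should be checked against the customer's real CIDR before applying the change.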
The issue arose when the customer began testing the procedure from the KCS in a lab cluster. During the node draining step, the cluster lost connectivity and API server access. While the cluster was unresponsive, high CPU usage alerts were observed in VMware; this was resolved by forcing restarts of the etcd nodes.
We have captured an SOS report from the drained node for review, and need assistance from engineering to suggest next steps so that this change can be executed successfully.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Execute the steps in this private KCS on a bare metal vSphere cluster: https://access.redhat.com/solutions/6456731
Actual results:
1. During the node draining step, the cluster lost connectivity and API server access.
2. High CPU usage alerts were observed in VMware while the cluster was unresponsive; this was resolved by forcing restarts of the etcd nodes.
Expected results:
The hostPrefix should be changed successfully without breaking the cluster.
Additional info:
Affected Platforms: Baremetal cluster on vSphere