-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
4.15.z
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
AUTOSCALE - Sprint 270
-
1
Description of problem:
Job Execution: 250 jobs were launched in the optimiser-prd namespace; 248 ran successfully, but 2 remain in a Pending state.
Autoscaling: The Cluster Autoscaler created 32 new nodes to handle the workload, and all nodes appear in a Ready state.
Detected Issue: Despite the new nodes, two pods were not scheduled, indicating a possible restriction related to resources, affinity, or another cluster condition preventing their assignment.
Our colleague marpears@redhat.com proposed the following analysis:
1. Node Taints Prevent Scheduling:
- The node ip-10-163-150-170.eu-west-1.compute.internal has taints that may have been left over from a previous scale-down of the machine pool.
- This node is the only one that meets the pending pod's nodeSelector requirements and has enough allocatable memory.
- However, pods cannot be scheduled on it because of the ToBeDeletedByClusterAutoscaler taint.
2. Possible Fix in OpenShift 4.18:
- The issue may be resolved in OpenShift 4.18, as part of changes aligning OpenShift's autoscaler with upstream Kubernetes.
- A reference to the commit that introduces this fix is provided: https://github.com/openshift/kubernetes-autoscaler/commit/8c7fe0fc19e8b8e29af065a6ab91372803834015
- An internal Slack thread is also mentioned for further discussion: https://redhat-internal.slack.com/archives/C061LV49G1W/p1737103554643559
3. Bug Reference for Another Related Taint:
- The DeletionCandidateOfClusterAutoscaler taint seems related to an OpenShift bug: https://issues.redhat.com/browse/OCPBUGS-42132
spec:
  providerID: aws:///eu-west-1b/i-0b625b74bb78f2410
  taints:
  - effect: NoSchedule
    key: mp
    value: mp01
  - effect: NoSchedule
    key: type
    value: compute
  - effect: NoExecute
    key: type
    value: compute
  - effect: PreferNoSchedule
    key: DeletionCandidateOfClusterAutoscaler
    value: "1740615052"
  - effect: NoSchedule
    key: ToBeDeletedByClusterAutoscaler
    value: "1740615662"
4. Request to Check with Engineering and Consider Backporting:
- We request your help in verifying whether engineering agrees with his assessment of the issue.
- Since the customer (CX) is still running OCP 4.16 and is far from upgrading to 4.18, he asks whether engineering could backport the fix to 4.16.
- This would help resolve the issue sooner rather than waiting for an upgrade.
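
To make the taint analysis above easier to check, here is a minimal Python sketch that picks the Cluster Autoscaler taints out of the node spec quoted in this report and marks which of them are hard scheduling blockers. The function name `autoscaler_taints` and the constant names are hypothetical and exist only for illustration. The sketch also assumes that the taint values are Unix epoch seconds, which matches our understanding of upstream Cluster Autoscaler behaviour but should be confirmed by engineering.

```python
from datetime import datetime, timezone

# Taints copied from the node spec quoted in this report.
NODE_TAINTS = [
    {"effect": "NoSchedule", "key": "mp", "value": "mp01"},
    {"effect": "NoSchedule", "key": "type", "value": "compute"},
    {"effect": "NoExecute", "key": "type", "value": "compute"},
    {"effect": "PreferNoSchedule",
     "key": "DeletionCandidateOfClusterAutoscaler", "value": "1740615052"},
    {"effect": "NoSchedule",
     "key": "ToBeDeletedByClusterAutoscaler", "value": "1740615662"},
]

# Taint keys named in this report as being set by the Cluster Autoscaler.
AUTOSCALER_TAINT_KEYS = {
    "ToBeDeletedByClusterAutoscaler",
    "DeletionCandidateOfClusterAutoscaler",
}

# NoSchedule and NoExecute reject pods that do not tolerate the taint;
# PreferNoSchedule is only a soft preference for the scheduler.
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


def autoscaler_taints(taints):
    """Return the autoscaler taints found in `taints` (hypothetical helper).

    Each entry records whether the effect is a hard scheduling blocker and
    the time the taint was applied, ASSUMING the value is Unix epoch seconds.
    """
    found = []
    for taint in taints:
        if taint["key"] not in AUTOSCALER_TAINT_KEYS:
            continue
        found.append({
            "key": taint["key"],
            "effect": taint["effect"],
            "blocks_scheduling": taint["effect"] in HARD_EFFECTS,
            "applied_at": datetime.fromtimestamp(int(taint["value"]),
                                                 tz=timezone.utc),
        })
    return found


if __name__ == "__main__":
    for entry in autoscaler_taints(NODE_TAINTS):
        print(entry["key"], entry["effect"],
              "hard" if entry["blocks_scheduling"] else "soft",
              entry["applied_at"].isoformat())
```

Under that assumption, only ToBeDeletedByClusterAutoscaler is a hard blocker among the two autoscaler taints, which is consistent with the assessment in item 1. The mp and type taints are ignored here on the assumption that the pending pods already tolerate them, since the report states this node meets their nodeSelector requirements.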
- is blocked by: OCPBUGS-54231 After scale down the last node has ToBeDeletedByClusterAutoscaler taint (Closed)