Feature Request
Resolution: Unresolved
Product / Portfolio Work
Issue:
In ROSA HCP clusters configured with:
- Minimum worker nodes = 2
- Cluster Autoscaler enabled
- Default OpenShift managed components configuration

We observe transient node scale-out events triggered during RollingUpdate of the OpenShift metrics-server component, even though the existing 2 nodes have sufficient CPU, memory, and other resources.
Detailed Technical Analysis:
Found:
Deployment: metrics-server
Strategy: RollingUpdate (maxUnavailable: 1, maxSurge: 25%)
Container args: --shutdown-delay-duration=150s
Replicas: 2

This combination can leave new pods in Pending status while old pods are still Terminating.
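Expressed as the corresponding Deployment YAML fields, a sketch of the observed settings (fragment only, not the full manifest):

```yaml
# Relevant fields of the metrics-server Deployment (values as observed above)
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 25%   # rounds up to 1 surge pod when replicas is 2
  template:
    spec:
      containers:
      - args:
        - --shutdown-delay-duration=150s
```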
$ oc describe deployment/metrics-server -n openshift-monitoring
Name: metrics-server
...
Replicas: 2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 25% max surge
...
Containers:
  --shutdown-delay-duration=150s
...
$ oc get pods/metrics-server-xxxx -n openshift-monitoring -o yaml
apiVersion: v1
kind: Pod
...
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: metrics-server
            app.kubernetes.io/name: metrics-server
            app.kubernetes.io/part-of: openshift-monitoring
        namespaces:
        - openshift-monitoring
        topologyKey: kubernetes.io/hostname
Sequence during update:
1. The old metrics-server pod enters graceful termination (150s shutdown delay).
2. A new metrics-server pod is created due to maxSurge.
3. For a period of time:
   a) The old pod is Terminating but still holding resources (and still counted by the required pod anti-affinity on kubernetes.io/hostname)
   b) The new pod is Pending
4. The scheduler determines the new pod cannot be placed (the required anti-affinity rules out both existing nodes, and/or allocatable capacity appears insufficient).
5. The Machine Autoscaler sees the unschedulable pod.
6. The Autoscaler scales out a new node.
7. After the old pod exits, scale-down eventually follows.
I0202 08:05:57.430673 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-monitoring", Name:"metrics-server-xxxxx", UID:"xxxxx", APIVersion:"v1", ResourceVersion:"xxxxx", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{MachineDeployment/ocm-production-xxxxx-xxxxx/xxxxx-workers-1 1->2 (max: 2)}]
I0202 08:22:39.484581 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-xxxxxx.ap-northeast-1.compute.internal", UID:"xxxxxx", APIVersion:"v1", ResourceVersion:"xxxxx", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' marked the node as toBeDeleted/unschedulable
Key Point:
The autoscaler decision is made while the terminating pod still occupies resources, before they are reclaimed. This creates a temporary scheduling-pressure window that does not reflect steady-state cluster capacity.
Why This Matters:
In minimal-node ROSA clusters (2 nodes):
- Platform components run with tight packing
- Graceful termination windows create deterministic transient capacity pressure
- This pressure consistently triggers scale-out
Result:
- Additional AWS instances launched
- Short-lived infrastructure cost
- Scale-out events not driven by customer workload
Request: We request engineering evaluation of whether metrics-server rolling updates in minimal-node clusters can be handled in a more topology-aware manner, to prevent transient scale-out that is not driven by user workload.
Perhaps one of the changes below could help avoid this issue; please check whether it helps:
- Add the cluster-autoscaler.kubernetes.io/pod-scale-up-delay: "150s" annotation to the pod template, so pending pods do not immediately trigger the autoscaler
- Or change the RollingUpdateStrategy to maxSurge: 0, maxUnavailable: 1, so no extra pod is created during the rollout
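As a sketch, the two suggestions would look roughly like the fragments below. These are illustrative only; since the metrics-server Deployment is managed by the Cluster Monitoring Operator, direct edits may be reconciled away, so where these fields would actually be set needs engineering validation.

```yaml
# Option 1 (sketch): delay the autoscaler's reaction to the surge pod.
# The annotation goes on the pod template, so the cluster autoscaler
# waits 150s before treating the new Pending pod as a scale-up trigger.
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/pod-scale-up-delay: "150s"
---
# Option 2 (sketch): avoid creating a surge pod at all during rollout.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
```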