OpenShift Request For Enhancement
RFE-8834

Avoid transient Cluster Autoscaler scale-out triggered by metrics-server component RollingUpdate in minimal-node ROSA HCP clusters


    • Feature Request
    • Resolution: Unresolved
    • ROSA
    • Product / Portfolio Work

      Issue:

      In ROSA HCP clusters configured with:
      - Minimum worker nodes = 2
      - Cluster Autoscaler enabled
      - Default OpenShift managed components configuration

      We observe transient node scale-out events triggered during
      RollingUpdate of the OpenShift metrics-server component, even though
      the existing 2 nodes have sufficient CPU, memory, and other resources.

       

      Detailed Technical Analysis:

      Found:
      Deployment: metrics-server
      Strategy: RollingUpdate
      maxUnavailable: 1
      maxSurge: 25%
      Container args: --shutdown-delay-duration=150s
      Replicas: 2

      This can leave the new (surge) pod in Pending status while the old pod
      is still in Terminating status and holding its resources.

      $ oc describe deployment/metrics-server -n openshift-monitoring
      Name:                   metrics-server
      ...
      Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
      StrategyType:           RollingUpdate
      MinReadySeconds:        0
      RollingUpdateStrategy:  1 max unavailable, 25% max surge
      ...
        Containers:
            --shutdown-delay-duration=150s
      ...
      $ oc get pods/metrics-server-xxxx -n openshift-monitoring -o yaml
      apiVersion: v1
      kind: Pod
      ...
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/component: metrics-server
                  app.kubernetes.io/name: metrics-server
                  app.kubernetes.io/part-of: openshift-monitoring
              namespaces:
              - openshift-monitoring
              topologyKey: kubernetes.io/hostname
      
      

      Sequence during update:
      1. The old metrics-server pod enters graceful termination (150s shutdown delay)
      2. A new metrics-server pod is created due to maxSurge
      3. For a period of time:
         a) The old pod is Terminating but still holding resources
         b) The new pod is Pending (the required pod anti-affinity on
            kubernetes.io/hostname shown above also blocks it from either
            node that still runs a metrics-server pod)
      4. The scheduler determines there is insufficient schedulable capacity
      5. The Cluster Autoscaler sees the unschedulable pod
      6. The Autoscaler scales out
      7. After the old pod exits, scale-down eventually occurs
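
      The window above can be observed directly while a metrics-server rollout
      is in progress. A rough sketch, assuming access to a live ROSA HCP
      cluster; the label selector comes from the pod labels shown in the YAML
      above:

      # Illustrative only: requires a live cluster during the rollout.
      # Watch the old pod go Terminating while the surge pod sits Pending:
      $ oc get pods -n openshift-monitoring \
          -l app.kubernetes.io/name=metrics-server -w

      # Check whether the autoscaler reacted to the Pending surge pod:
      $ oc get events -n openshift-monitoring \
          --field-selector reason=TriggeredScaleUp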

      I0202 08:05:57.430673       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-monitoring", Name:"metrics-server-xxxxx", UID:"xxxxx", APIVersion:"v1", ResourceVersion:"xxxxx", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{MachineDeployment/ocm-production-xxxxx-xxxxx/xxxxx-workers-1 1->2 (max: 2)}]
      
      
      I0202 08:22:39.484581       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-xxxxxx.ap-northeast-1.compute.internal", UID:"xxxxxx", APIVersion:"v1", ResourceVersion:"xxxxx", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' marked the node as toBeDeleted/unschedulable
      

      Key Point:

      The autoscaler's decision is made while the terminating pod still
      occupies resources but before they are reclaimed. This creates a
      temporary scheduling-pressure window that does not reflect steady-state
      cluster capacity.

       

      Why This Matters:
      In minimal-node ROSA clusters (2 nodes):
      - Platform components run with tight packing
      - Graceful termination windows create deterministic transient capacity pressure
      - This pressure consistently triggers scale-out

      Result:
      - Additional AWS instances launched
      - Short-lived infrastructure cost
      - Scale-out events not driven by customer workload

      Request: Requesting engineering evaluation of whether metrics-server
      rolling updates in minimal-node clusters can be handled in a more
      topology-aware manner, to prevent transient scale-out that is not
      driven by user workload.

       
      The changes below may help avoid this issue; please check whether either helps.

      1. Add the cluster-autoscaler.kubernetes.io/pod-scale-up-delay: "150s" annotation to the pod template, so a pending pod does not immediately trigger the autoscaler
      2. Or change the RollingUpdateStrategy to maxSurge: 0, maxUnavailable: 1 so no extra pod is created during rollout
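
      As a sketch only: on ROSA the metrics-server Deployment is managed by
      the Cluster Monitoring Operator, so a direct edit would likely be
      reconciled away, but the two suggested settings would look like this in
      the Deployment spec (the "150s" value mirrors the
      --shutdown-delay-duration above). Upstream, the Cluster Autoscaler also
      has a global --new-pod-scale-up-delay flag with a similar effect.

      ```yaml
      # Illustrative fragment of deployment/metrics-server; manual edits are
      # expected to be reconciled away by the operator.
      spec:
        strategy:
          type: RollingUpdate
          rollingUpdate:
            maxSurge: 0        # option 2: no surge pod, so nothing goes Pending
            maxUnavailable: 1  # one replica is recreated in place
        template:
          metadata:
            annotations:
              # option 1: have the autoscaler wait out the shutdown delay
              # before reacting to a Pending pod from this Deployment
              cluster-autoscaler.kubernetes.io/pod-scale-up-delay: "150s"
      ```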

              rh-ee-adejong Aaren de Jong
              rhn-support-jayu Jacob Yu