OCPBUGS-19697

[GCP 4.14] [Azure/AWS <=4.13] Pod didn't trigger arm64 machineset scale out from 0 when a required node selector term on non-amd64 nodes is set



      This is a clone of issue OCPBUGS-18137. The following is the description of the original issue:

      Description of problem:

      When a workload includes a node selector term on the label kubernetes.io/arch and the allowed values do not include amd64, the cluster autoscaler does not scale out a valid non-amd64 machine set if its current replica count is 0 and (for 4.14+) no architecture capacity annotation is set (ref. MIXEDARCH-129).

      The issue is due to https://github.com/openshift/kubernetes-autoscaler/blob/f0ceeacfca57014d07f53211a034641d52d85cfd/cluster-autoscaler/cloudprovider/utils.go#L33

      This bug should be addressed first for clusters whose control plane and data plane share the same architecture.

      For multi-arch compute clusters, there is probably no alternative to having the capacity annotation set properly on the machine set, either manually or by the cloud provider actuator, as already discussed in the MIXEDARCH-129 work; otherwise the autoscaler would have to rely on the control plane architecture.
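
The resolution order described above can be sketched as follows: prefer an architecture taken from the machine set's capacity annotation, and only fall back to a default when the annotation is missing. This is a minimal illustration, not the autoscaler's actual code. The helper name `templateArch`, the `fallbackArch` parameter, and the annotation key used here are assumptions made for the example; check the exact annotation key against your release.

```go
package main

import (
	"fmt"
	"strings"
)

// archLabelKey is the well-known Kubernetes node label for CPU architecture.
const archLabelKey = "kubernetes.io/arch"

// capacityLabelsAnnotation is ASSUMED to be the machine set annotation that
// carries extra node labels for scale-from-zero ("k1=v1,k2=v2").
const capacityLabelsAnnotation = "capacity.cluster-autoscaler.kubernetes.io/labels"

// templateArch is a hypothetical helper. It returns the architecture that a
// node template built for an empty machine set would be labeled with:
// the value from the capacity annotation if present, else fallbackArch.
func templateArch(annotations map[string]string, fallbackArch string) string {
	for _, pair := range strings.Split(annotations[capacityLabelsAnnotation], ",") {
		key, value, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if ok && key == archLabelKey && value != "" {
			return value
		}
	}
	return fallbackArch
}

func main() {
	// No annotation: the fallback decides, whatever the machines really are.
	fmt.Println(templateArch(map[string]string{}, "amd64"))

	// Annotation present: the declared architecture wins over the fallback.
	annotated := map[string]string{capacityLabelsAnnotation: "kubernetes.io/arch=arm64"}
	fmt.Println(templateArch(annotated, "amd64"))
}
```

On a single-architecture arm64 cluster, passing the control plane architecture as `fallbackArch` would yield the right label without any annotation. On a multi-arch compute cluster no single fallback is correct for every machine set, which is why the annotation is needed there.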

      Version-Release number of selected component (if applicable):

      - ARM64 IPI on GCP 4.14
      - ARM64 IPI on AWS and Azure <=4.13
      - In general, non-amd64 single-arch clusters supporting autoscale from 0

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create an arm64 IPI cluster on GCP
      2. Set one of the machinesets to have 0 replicas:
          oc scale --replicas=0 -n openshift-machine-api machineset/adistefa-a1-zn8pg-worker-f
      3. Deploy the default autoscaler
      4. Deploy the machine autoscaler for the given machineset
      5. Deploy a workload with node affinity to arm64-only nodes, large resource requests, and enough replicas that the pods cannot all fit on the existing nodes.

      Actual results:

      From the pod events: 
      
      pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
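
The event above is consistent with the pod's required `In` expression being evaluated against a template node whose architecture label is wrong. The sketch below is a simplified stand-in for that check, assuming the template for the empty machine set is labeled amd64. The `requirement` type and `matchesIn` function are hypothetical and are not part of the scheduler or autoscaler API.

```go
package main

import "fmt"

// requirement is a simplified, hypothetical stand-in for a node selector
// expression using the "In" operator.
type requirement struct {
	Key    string
	Values []string
}

// matchesIn reports whether the node labels contain r.Key with a value
// that is listed in r.Values.
func matchesIn(labels map[string]string, r requirement) bool {
	got, ok := labels[r.Key]
	if !ok {
		return false
	}
	for _, v := range r.Values {
		if v == got {
			return true
		}
	}
	return false
}

func main() {
	// The affinity term from the Deployment in this report.
	req := requirement{Key: "kubernetes.io/arch", Values: []string{"arm64"}}

	// ASSUMPTION: labels of the template node built for the empty machine set.
	assumedTemplate := map[string]string{"kubernetes.io/arch": "amd64"}
	// Labels a node from this machine set would carry once it exists.
	realNode := map[string]string{"kubernetes.io/arch": "arm64"}

	fmt.Println(matchesIn(assumedTemplate, req)) // false: no scale-up is triggered
	fmt.Println(matchesIn(realNode, req))        // true: the pod would fit
}
```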
      

      Expected results:

      The cluster autoscaler scales the machineset with 0 replicas in order to provide resources for the pending pods.
      

      Additional info:

      ---
      apiVersion: autoscaling.openshift.io/v1
      kind: ClusterAutoscaler
      metadata:
        name: default
      spec: {}
      ---
      apiVersion: autoscaling.openshift.io/v1beta1
      kind: MachineAutoscaler
      metadata:
        name: worker-us-east-1a
        namespace: openshift-machine-api
      spec:
        minReplicas: 0
        maxReplicas: 12
        scaleTargetRef:
          apiVersion: machine.openshift.io/v1beta1
          kind: MachineSet
          name: adistefa-a1-zn8pg-worker-f
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        namespace: openshift-machine-api
        name: 'my-deployment'
        annotations: {}
      spec:
        selector:
          matchLabels:
            app: name
        replicas: 3
        template:
          metadata:
            labels:
              app: name
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                      - key: kubernetes.io/arch
                        operator: In
                        values:
                          - "arm64"
            containers:
              - name: container
                image: >-
                  image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
                ports:
                  - containerPort: 8080
                    protocol: TCP
                env: []
                resources:
                  requests:
                    cpu: "2"
            imagePullSecrets: []
        strategy:
          type: RollingUpdate
          rollingUpdate:
            maxSurge: 25%
            maxUnavailable: 25%
        paused: false

              rhn-support-adistefa Alessandro Di Stefano
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhaohua Sun Zhaohua Sun