  1. OpenShift Bugs
  2. OCPBUGS-13541

Pod with a GPU request, an assigned volume, and a nodeSelector fails to trigger OpenShift Container Platform 4 node scale-up


Details

    • Moderate
    • No
    • Rejected
    • False

    Description

      Description of problem:

      A pod that requests a GPU, references a volume, and sets a nodeSelector on "topology.kubernetes.io/zone" fails to trigger an OpenShift Container Platform 4 node scale-up, as the following log shows:
      
      I0511 12:31:41.594376       1 request.go:1171] Response Body: {"kind":"Scale","apiVersion":"autoscaling/v1","metadata":{"name":"foo-mbr2h-gpu-us-east-2b","namespace":"openshift-machine-api","uid":"c75f1686-b15f-4fa5-bee0-a78ea711a3d5","resourceVersion":"4502637","creationTimestamp":"2023-05-11T11:23:56Z"},"spec":{},"status":{"replicas":0}}
      I0511 12:31:41.594532       1 clusterapi_provider.go:67] discovered node group: MachineSet/openshift-machine-api/foo-mbr2h-gpu-us-east-2b (min: 0, max: 2, replicas: 0)
      I0511 12:31:41.594691       1 binder.go:724] "PVC is not bound" PVC="project-300/gpu"
      I0511 12:31:41.594799       1 scale_up.go:93] Pod gpu-848f7d47d9-9rzqz can't be scheduled on MachineSet/openshift-machine-api/foo-mbr2h-gpu-us-east-2b, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
      I0511 12:31:41.594826       1 scale_up.go:262] No pod can fit to MachineSet/openshift-machine-api/foo-mbr2h-gpu-us-east-2b
      I0511 12:31:41.594883       1 binder.go:724] "PVC is not bound" PVC="project-300/gpu"
      I0511 12:31:41.594920       1 scale_up.go:93] Pod gpu-848f7d47d9-9rzqz can't be scheduled on MachineSet/openshift-machine-api/foo-mbr2h-gpu-us-east-2a, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
      I0511 12:31:41.594939       1 scale_up.go:262] No pod can fit to MachineSet/openshift-machine-api/foo-mbr2h-gpu-us-east-2a
      I0511 12:31:41.595005       1 binder.go:724] "PVC is not bound" PVC="project-300/gpu"
      I0511 12:31:41.595062       1 scale_up.go:93] Pod gpu-848f7d47d9-9rzqz can't be scheduled on MachineSet/openshift-machine-api/foo-mbr2h-gpu-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
      I0511 12:31:41.595093       1 scale_up.go:262] No pod can fit to MachineSet/openshift-machine-api/foo-mbr2h-gpu-us-east-2c
      I0511 12:31:41.595157       1 binder.go:724] "PVC is not bound" PVC="project-300/gpu"
      I0511 12:31:41.595203       1 scale_up.go:93] Pod gpu-848f7d47d9-9rzqz can't be scheduled on MachineSet/openshift-machine-api/foo-mbr2h-ossm-us-east-2a, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
      I0511 12:31:41.595224       1 scale_up.go:262] No pod can fit to MachineSet/openshift-machine-api/foo-mbr2h-ossm-us-east-2a
      I0511 12:31:41.595299       1 binder.go:724] "PVC is not bound" PVC="project-300/gpu"
      I0511 12:31:41.595378       1 scale_up.go:93] Pod gpu-848f7d47d9-9rzqz can't be scheduled on MachineSet/openshift-machine-api/foo-mbr2h-ossm-us-east-2b, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
      I0511 12:31:41.595419       1 scale_up.go:262] No pod can fit to MachineSet/openshift-machine-api/foo-mbr2h-ossm-us-east-2b
      I0511 12:31:41.595440       1 scale_up.go:267] No expansion options
      
      When the nodeSelector is removed, the OpenShift Container Platform 4 node scale-up is triggered as expected.
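The NodeAffinity rejections in the log, and the fact that removing the nodeSelector lets the scale-up proceed, are consistent with plain label matching: a nodeSelector matches a node only when every key/value pair it lists is present in the node's labels. The following is a minimal sketch of that rule, not the autoscaler's actual code. The label sets are hypothetical, because the labels the autoscaler derives for a MachineSet with zero replicas do not appear in this report.

```python
ZONE_KEY = "topology.kubernetes.io/zone"


def node_selector_matches(node_selector: dict, node_labels: dict) -> bool:
    """Return True when every selector key/value pair is present in the labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


if __name__ == "__main__":
    selector = {ZONE_KEY: "us-east-2a"}

    # Hypothetical node labels, for illustration only.
    labels_with_zone = {ZONE_KEY: "us-east-2a", "kubernetes.io/os": "linux"}
    labels_without_zone = {"kubernetes.io/os": "linux"}

    print(node_selector_matches(selector, labels_with_zone))     # True
    print(node_selector_matches(selector, labels_without_zone))  # False
    # An empty selector matches any label set, which mirrors the
    # behaviour observed once the nodeSelector is removed.
    print(node_selector_matches({}, labels_without_zone))        # True
```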
      
      

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.10, 4.11, 4.12 and 4.13
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Install OpenShift Container Platform 4.
      2. Create a MachineSet with a MachineAutoscaler in each of 3 availability zones, using a GPU instanceType.
      3. Create a Deployment that references a PVC that is not Bound, sets nvidia.com/gpu: "1" in its resource requests, and sets a nodeSelector with topology.kubernetes.io/zone pointing to an available zone.
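The Deployment from step 3 could look roughly like the manifest built below. This is a sketch under assumptions, not the reporter's actual manifest: the name "gpu", the namespace "project-300", and the claim name "gpu" are inferred from the log lines above, the zone "us-east-2a" is inferred from the MachineSet names, and the container image, volume name, and mount path are hypothetical placeholders.

```python
import json


def build_gpu_deployment(zone: str) -> dict:
    """Build a Deployment manifest resembling step 3 (names are assumptions)."""
    labels = {"app": "gpu"}
    container = {
        "name": "gpu",
        # Hypothetical image; substitute a real GPU workload image.
        "image": "registry.example.com/gpu-workload:latest",
        "resources": {
            # Extended resources such as nvidia.com/gpu must be set in
            # limits; a request, if given, must equal the limit.
            "requests": {"nvidia.com/gpu": "1"},
            "limits": {"nvidia.com/gpu": "1"},
        },
        "volumeMounts": [{"name": "data", "mountPath": "/data"}],
    }
    pod_spec = {
        # Pin the pod to a single availability zone.
        "nodeSelector": {"topology.kubernetes.io/zone": zone},
        "containers": [container],
        # Reference the PVC that is still unbound at scheduling time.
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "gpu"}}
        ],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "gpu", "namespace": "project-300"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON.
    print(json.dumps(build_gpu_deployment("us-east-2a"), indent=2))
```

The JSON output can be piped to `oc apply -f -`, since manifests are accepted in JSON as well as YAML.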
      

      Actual results:

      0s          Normal    NotTriggerScaleUp      pod/gpu-5bcb679b75-6vc9v    pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector
      0s          Normal    NotTriggerScaleUp      pod/gpu-5bcb679b75-6vc9v    pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector
      

      Expected results:

      A scale-up that adds an OpenShift Container Platform 4 node with an available GPU is triggered, and the pod is eventually scheduled on the newly added node.
      

      Additional info:

      A similar problem was reported in https://bugzilla.redhat.com/show_bug.cgi?id=1891551 and has since been fixed. It is unclear to what extent this issue is related, but with OpenShift Container Platform 4.13 as the latest tested version, that fix should already be included and the problem should not occur.
      

            People

              mimccune@redhat.com Michael McCune
              rhn-support-sreber Simon Reber
              Zhaohua Sun Zhaohua Sun