OpenShift Bugs / OCPBUGS-59558

Bad pod spread/skew causing workload to block

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.18
    • Component: kube-scheduler
    • Impact: Quality / Stability / Reliability
      Description of problem:

          Deploying a workload that deletes and recreates pods on 100+ nodes while scaling it up to ~6000 pods in total (between server, client, and dpdk pods). About 55% of these pods consume SR-IOV VFs, which are a limited resource on the nodes (64 per node).
      
      When the pods are first deployed, the distribution is already skewed (see below). When the pods are recreated (churn), the distribution becomes worse (see below), causing pods to end up Pending with a VF-exhaustion error (see below).
      
      I have tried node/pod affinities and topology spread constraints (TSC) to influence the distribution. The only configuration that does seem to avoid this scenario is one where server pods (which can be scheduled anywhere) are never scheduled on the same nodes as dpdk pods (which run only on worker-dpdk nodes). Both pod types request VFs.
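      The separation described above can be expressed as a node-affinity rule on the server pods. This is a minimal sketch, assuming the dpdk nodes carry a `node-role.kubernetes.io/worker-dpdk` label (consistent with the role names in the tables below); it is illustrative, not the exact manifest used:
      
          # Server pod fragment: hard-exclude the worker-dpdk nodes so server
          # and dpdk pods never compete for the same per-node VF pool.
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node-role.kubernetes.io/worker-dpdk
                          operator: DoesNotExist
                        - key: node-role.kubernetes.io/worker
                          operator: Exists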
      
      pod description (scheduling error):
      Events:
        Type     Reason            Age                 From               Message
        ----     ------            ----                ----               -------
        Warning  FailedScheduling  7m4s                default-scheduler  0/118 nodes are available: 1 Insufficient hugepages-1Gi, 26 Insufficient openshift.io/intelnics2, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) were unschedulable, 85 node(s) didn't match Pod's node affinity/selector. preemption: 0/118 nodes are available: 26 No preemption victims found for incoming pod, 92 Preemption is not helpful for scheduling.
        Warning  FailedScheduling  54s (x2 over 6m2s)  default-scheduler  0/118 nodes are available: 1 Insufficient hugepages-1Gi, 26 Insufficient openshift.io/intelnics2, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) were unschedulable, 85 node(s) didn't match Pod's node affinity/selector. preemption: 0/118 nodes are available: 26 No preemption victims found for incoming pod, 92 Preemption is not helpful for scheduling.
      
      node events:
      7m37s       Warning   FailedScheduling       pod/dpdk-1-57fdbfbb54-ddz8t       0/118 nodes are available: 1 Insufficient hugepages-1Gi, 26 Insufficient openshift.io/intelnics2, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) were unschedulable, 85 node(s) didn't match Pod's node affinity/selector. preemption: 0/118 nodes are available: 26 No preemption victims found for incoming pod, 92 Preemption is not helpful for scheduling.
      87s         Warning   FailedScheduling       pod/dpdk-1-57fdbfbb54-ddz8t       0/118 nodes are available: 1 Insufficient hugepages-1Gi, 26 Insufficient openshift.io/intelnics2, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) were unschedulable, 85 node(s) didn't match Pod's node affinity/selector. preemption: 0/118 nodes are available: 26 No preemption victims found for incoming pod, 92 Preemption is not helpful for scheduling.
      
      Pod distribution/spread (server) - BEFORE churn, first deployment:
      pods  node               roles
      48  e23-h26-b03-fc640  customcnf,worker
      47  e23-h26-b02-fc640  customcnf,worker
      41  e17-h24-b01-fc640  worker,worker-metallb
      41  e17-h18-b04-fc640  worker,worker-metallb
      40  e17-h20-b04-fc640  worker,worker-metallb
      39  e18-h12-b02-fc640  worker,worker-metallb
      39  e17-h20-b01-fc640  worker,worker-metallb
      38  e17-h24-b02-fc640  worker,worker-metallb
      38  e17-h20-b03-fc640  worker,worker-metallb
      37  e23-h26-b01-fc640  customcnf,worker
      37  e18-h12-b03-fc640  worker,worker-metallb
      37  e17-h20-b02-fc640  worker,worker-metallb
      35  e17-h24-b04-fc640  worker,worker-metallb
      34  e18-h14-b01-fc640  customcnf,worker
      33  e20-h18-b01-fc640  customcnf,worker
      33  e19-h26-b02-fc640  customcnf,worker
      33  e19-h18-b02-fc640  customcnf,worker
      33  e18-h20-b02-fc640  customcnf,worker
      33  e18-h18-b02-fc640  customcnf,worker
      33  e18-h14-b04-fc640  customcnf,worker
      32  e23-h14-b03-fc640  customcnf,worker
      32  e19-h24-b02-fc640  customcnf,worker
      32  e19-h18-b04-fc640  customcnf,worker
      32  e19-h18-b03-fc640  customcnf,worker
      32  e18-h24-b03-fc640  customcnf,worker
      32  e18-h18-b04-fc640  customcnf,worker
      32  e18-h18-b03-fc640  customcnf,worker
      32  e17-h18-b03-fc640  customcnf,worker
      31  e23-h20-b02-fc640  customcnf,worker
      31  e23-h18-b03-fc640  customcnf,worker
      31  e22-h18-b03-fc640  customcnf,worker
      31  e22-h18-b01-fc640  customcnf,worker
      31  e20-h26-b01-fc640  customcnf,worker
      31  e20-h24-b01-fc640  customcnf,worker
      31  e20-h14-b01-fc640  customcnf,worker
      31  e20-h12-b01-fc640  customcnf,worker
      31  e19-h20-b02-fc640  customcnf,worker
      31  e18-h20-b04-fc640  customcnf,worker
      30  e23-h24-b04-fc640  customcnf,worker
      30  e23-h24-b03-fc640  customcnf,worker
      30  e23-h24-b02-fc640  customcnf,worker
      30  e23-h20-b04-fc640  customcnf,worker
      30  e23-h20-b01-fc640  customcnf,worker
      30  e23-h18-b02-fc640  customcnf,worker
      30  e23-h12-b04-fc640  customcnf,worker
      30  e21-h20-b03-fc640  customcnf,worker
      30  e20-h20-b03-fc640  customcnf,worker
      30  e20-h14-b02-fc640  customcnf,worker
      30  e20-h12-b04-fc640  customcnf,worker
      30  e19-h24-b04-fc640  customcnf,worker
      30  e18-h24-b01-fc640  customcnf,worker
      30  e18-h20-b01-fc640  customcnf,worker
      30  e16-h26-b04-fc640  customcnf,worker
      29  e23-h18-b01-fc640  customcnf,worker
      29  e23-h12-b03-fc640  customcnf,worker
      29  e20-h24-b04-fc640  customcnf,worker
      29  e20-h24-b02-fc640  customcnf,worker
      29  e20-h20-b04-fc640  customcnf,worker
      29  e20-h20-b02-fc640  customcnf,worker
      29  e20-h20-b01-fc640  customcnf,worker
      29  e20-h18-b02-fc640  customcnf,worker
      29  e20-h12-b03-fc640  customcnf,worker
      29  e19-h20-b04-fc640  customcnf,worker
      29  e19-h18-b01-fc640  customcnf,worker
      29  e18-h14-b03-fc640  customcnf,worker
      29  e18-h12-b04-fc640  customcnf,worker
      28  e23-h14-b04-fc640  customcnf,worker
      28  e20-h26-b04-fc640  customcnf,worker
      28  e20-h24-b03-fc640  customcnf,worker
      28  e20-h14-b03-fc640  customcnf,worker
      28  e19-h24-b01-fc640  customcnf,worker
      27  e19-h26-b04-fc640  customcnf,worker
      27  e19-h26-b03-fc640  customcnf,worker
      26  e22-h18-b02-fc640  customcnf,worker
      26  e21-h24-b02-fc640  customcnf,worker
      26  e18-h20-b03-fc640  customcnf,worker
      26  e18-h18-b01-fc640  customcnf,worker
      25  e19-h20-b01-fc640  customcnf,worker
      24  e23-h20-b03-fc640  customcnf,worker
      24  e19-h20-b03-fc640  customcnf,worker
      23  e20-h12-b02-fc640  customcnf,worker
      23  e17-h24-b03-fc640  worker,worker-metallb
      22  e19-h26-b01-fc640  customcnf,worker
      20  e20-h14-b04-fc640  customcnf,worker
      20  e19-h24-b03-fc640  customcnf,worker
      20  e17-h12-b04-fc640  worker,worker-dpdk
      17  e17-h14-b04-fc640  worker,worker-dpdk
      17  e16-h26-b02-fc640  worker,worker-dpdk
      16  e17-h18-b02-fc640  worker,worker-dpdk
      16  e17-h14-b01-fc640  worker,worker-dpdk
      16  e16-h14-b02-fc640  worker,worker-dpdk
      15  e17-h14-b03-fc640  worker,worker-dpdk
      15  e16-h20-b01-fc640  worker,worker-dpdk
      15  e16-h18-b02-fc640  worker,worker-dpdk
      14  e16-h18-b01-fc640  worker,worker-dpdk
      13  e17-h12-b03-fc640  worker,worker-dpdk
      13  e16-h24-b02-fc640  worker,worker-dpdk
      13  e16-h18-b03-fc640  worker,worker-dpdk
      12  e17-h12-b02-fc640  worker,worker-dpdk
      12  e16-h26-b01-fc640  worker,worker-dpdk
      11  e17-h12-b01-fc640  worker,worker-dpdk
      11  e16-h26-b03-fc640  worker,worker-dpdk
      11  e16-h24-b03-fc640  worker,worker-dpdk
      11  e16-h18-b04-fc640  worker,worker-dpdk
      10  e16-h14-b04-fc640  worker,worker-dpdk
      10  e16-h14-b01-fc640  worker,worker-dpdk
      9   e17-h14-b02-fc640  worker,worker-dpdk
      8   e17-h18-b01-fc640  worker,worker-dpdk
      6   e16-h24-b01-fc640  worker,worker-dpdk
      5   e16-h24-b04-fc640  worker,worker-dpdk
      3   e16-h20-b03-fc640  worker,worker-dpdk
      
      Pod spread (server), AFTER/DURING churn, recreating pods:
      pods  node               roles
      64  e23-h26-b03-fc640  customcnf,worker
      64  e23-h26-b02-fc640  customcnf,worker
      64  e18-h12-b03-fc640  worker,worker-metallb
      64  e18-h12-b02-fc640  worker,worker-metallb
      64  e17-h24-b04-fc640  worker,worker-metallb
      64  e17-h24-b03-fc640  worker,worker-metallb
      64  e17-h24-b02-fc640  worker,worker-metallb
      64  e17-h24-b01-fc640  worker,worker-metallb
      64  e17-h20-b04-fc640  worker,worker-metallb
      64  e17-h20-b03-fc640  worker,worker-metallb
      64  e17-h20-b02-fc640  worker,worker-metallb
      64  e17-h20-b01-fc640  worker,worker-metallb
      64  e17-h18-b04-fc640  worker,worker-metallb
      56  e17-h12-b04-fc640  worker,worker-dpdk
      54  e17-h14-b03-fc640  worker,worker-dpdk
      54  e17-h14-b02-fc640  worker,worker-dpdk
      54  e17-h14-b01-fc640  worker,worker-dpdk
      54  e17-h12-b03-fc640  worker,worker-dpdk
      54  e17-h12-b01-fc640  worker,worker-dpdk
      54  e16-h24-b02-fc640  worker,worker-dpdk
      54  e16-h24-b01-fc640  worker,worker-dpdk
      54  e16-h14-b01-fc640  worker,worker-dpdk
      52  e17-h18-b02-fc640  worker,worker-dpdk
      52  e17-h18-b01-fc640  worker,worker-dpdk
      52  e17-h12-b02-fc640  worker,worker-dpdk
      52  e16-h26-b02-fc640  worker,worker-dpdk
      52  e16-h24-b03-fc640  worker,worker-dpdk
      52  e16-h20-b01-fc640  worker,worker-dpdk
      52  e16-h18-b02-fc640  worker,worker-dpdk
      52  e16-h14-b04-fc640  worker,worker-dpdk
      52  e16-h14-b02-fc640  worker,worker-dpdk
      50  e17-h14-b04-fc640  worker,worker-dpdk
      50  e16-h26-b03-fc640  worker,worker-dpdk
      50  e16-h24-b04-fc640  worker,worker-dpdk
      50  e16-h18-b03-fc640  worker,worker-dpdk
      50  e16-h18-b01-fc640  worker,worker-dpdk
      48  e16-h20-b03-fc640  worker,worker-dpdk
      46  e16-h26-b01-fc640  worker,worker-dpdk
      46  e16-h18-b04-fc640  worker,worker-dpdk
      17  e23-h14-b03-fc640  customcnf,worker
      17  e19-h18-b02-fc640  customcnf,worker
      15  e23-h24-b03-fc640  customcnf,worker
      15  e18-h24-b01-fc640  customcnf,worker
      15  e18-h12-b04-fc640  customcnf,worker
      14  e23-h20-b04-fc640  customcnf,worker
      14  e22-h18-b03-fc640  customcnf,worker
      14  e22-h18-b02-fc640  customcnf,worker
      14  e21-h24-b02-fc640  customcnf,worker
      14  e20-h26-b04-fc640  customcnf,worker
      14  e20-h20-b04-fc640  customcnf,worker
      14  e20-h18-b02-fc640  customcnf,worker
      14  e19-h26-b02-fc640  customcnf,worker
      13  e23-h26-b01-fc640  customcnf,worker
      13  e23-h18-b02-fc640  customcnf,worker
      13  e23-h14-b04-fc640  customcnf,worker
      13  e23-h12-b03-fc640  customcnf,worker
      13  e20-h24-b01-fc640  customcnf,worker
      13  e20-h12-b04-fc640  customcnf,worker
      13  e20-h12-b01-fc640  customcnf,worker
      13  e19-h26-b03-fc640  customcnf,worker
      13  e19-h24-b04-fc640  customcnf,worker
      13  e19-h20-b04-fc640  customcnf,worker
      13  e19-h20-b02-fc640  customcnf,worker
      13  e19-h20-b01-fc640  customcnf,worker
      13  e19-h18-b03-fc640  customcnf,worker
      13  e18-h24-b03-fc640  customcnf,worker
      13  e18-h20-b01-fc640  customcnf,worker
      13  e18-h18-b02-fc640  customcnf,worker
      13  e18-h14-b04-fc640  customcnf,worker
      13  e18-h14-b01-fc640  customcnf,worker
      13  e17-h18-b03-fc640  customcnf,worker
      12  e23-h24-b02-fc640  customcnf,worker
      12  e23-h18-b03-fc640  customcnf,worker
      12  e23-h18-b01-fc640  customcnf,worker
      12  e23-h12-b04-fc640  customcnf,worker
      12  e21-h20-b03-fc640  customcnf,worker
      12  e20-h26-b01-fc640  customcnf,worker
      12  e20-h20-b03-fc640  customcnf,worker
      12  e20-h20-b02-fc640  customcnf,worker
      12  e20-h20-b01-fc640  customcnf,worker
      12  e20-h14-b01-fc640  customcnf,worker
      12  e20-h12-b02-fc640  customcnf,worker
      12  e19-h18-b04-fc640  customcnf,worker
      12  e19-h18-b01-fc640  customcnf,worker
      12  e18-h18-b04-fc640  customcnf,worker
      12  e16-h26-b04-fc640  customcnf,worker
      11  e23-h24-b04-fc640  customcnf,worker
      11  e23-h20-b03-fc640  customcnf,worker
      11  e22-h18-b01-fc640  customcnf,worker
      11  e20-h24-b04-fc640  customcnf,worker
      11  e20-h18-b01-fc640  customcnf,worker
      11  e20-h14-b04-fc640  customcnf,worker
      11  e20-h14-b03-fc640  customcnf,worker
      11  e20-h12-b03-fc640  customcnf,worker
      11  e19-h24-b01-fc640  customcnf,worker
      11  e18-h20-b04-fc640  customcnf,worker
      11  e18-h20-b03-fc640  customcnf,worker
      11  e18-h20-b02-fc640  customcnf,worker
      11  e18-h14-b03-fc640  customcnf,worker
      10  e23-h20-b01-fc640  customcnf,worker
      10  e19-h26-b04-fc640  customcnf,worker
      10  e18-h18-b03-fc640  customcnf,worker
      9   e23-h20-b02-fc640  customcnf,worker
      9   e20-h14-b02-fc640  customcnf,worker
      8   e19-h24-b02-fc640  customcnf,worker
      7   e19-h26-b01-fc640  customcnf,worker
      6   e20-h24-b03-fc640  customcnf,worker
      6   e20-h24-b02-fc640  customcnf,worker
      6   e19-h20-b03-fc640  customcnf,worker
      5   e19-h24-b03-fc640  customcnf,worker
      5   e18-h18-b01-fc640  customcnf,worker

      Version-Release number of selected component (if applicable):

          4.18

      How reproducible:

          Always with this workload at this scale, once at least 50% of the namespaces are churned (100% of the namespaces corresponds to 115).
      
      PS: To rule out SR-IOV itself as a source of the problem, I reset the SR-IOV VFs and restarted the config-daemon to verify that the VFs belonging to deleted namespaces are returned to the system.

      Steps to Reproduce:

          1. Run the workload to deploy 115 namespaces with 6000+ pods in total (20 server / 30 client / 2 dpdk pods per namespace).
          2. Churn (delete and recreate) at least 50% of the namespaces (57).
          3. Observe dpdk pods stuck Pending, with only about 70 namespaces recreated and a consistently bad distribution of server pods piling up on the worker-metallb and worker-dpdk nodes (see above).
      

      Actual results:

          Many pods pile up on a few nodes instead of using the capacity across the cluster, exhausting the resources of those nodes.

      Expected results:

       The workload should be able to deploy if the distribution/spread were more even: in total there are 7000+ VFs available (all worker nodes have 64), and the scale of the workload does not reach that number.

      Additional info:

         The workload deployed is kube-burner-ocp rds-core.
      
      Server pod spec, trying to prefer balance and "worker" nodes only (the pod anti-affinity preference is restored here under podAntiAffinity; in the original paste it appeared as a second, duplicate preferredDuringSchedulingIgnoredDuringExecution key under nodeAffinity, which is not valid):
      
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution: # block 'infra' and 'workload' labeled nodes
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: node-role.kubernetes.io/infra
                          operator: DoesNotExist
                        - key: node-role.kubernetes.io/workload
                          operator: DoesNotExist
                        - key: node-role.kubernetes.io/worker
                          operator: Exists
                preferredDuringSchedulingIgnoredDuringExecution:
                  - weight: 100
                    preference:
                      matchExpressions:
                        - key: node-role.kubernetes.io/worker
                          operator: Exists
              podAntiAffinity: # prefer spreading away from other 'nginx' pods
                preferredDuringSchedulingIgnoredDuringExecution:
                  - weight: 100
                    podAffinityTerm:
                      labelSelector:
                        matchLabels:
                          app: nginx
                      topologyKey: kubernetes.io/hostname
            topologySpreadConstraints:
              - maxSkew: 1
                topologyKey: kubernetes.io/hostname
                whenUnsatisfiable: ScheduleAnyway
                labelSelector:
                  matchLabels:
                    app: nginx
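      Note that with whenUnsatisfiable: ScheduleAnyway the maxSkew constraint is only a scoring preference, which the scheduler is free to violate under pressure. A stricter variant (a sketch, not a configuration tested in this report) enforces the skew at filtering time, at the cost of leaving pods Pending when no compliant node exists:
      
          topologySpreadConstraints:
            - maxSkew: 2                       # some slack, tune for the workload
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule # hard constraint: filters nodes instead of scoring them
              labelSelector:
                matchLabels:
                  app: nginx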
      

              Assignee: aos-workloads-staff (Workloads Team Bot Account)
              Reporter: rh-ee-sferlinr (Simone Ferlin-Reiter)
              QA Contact: Bhargavi Gudi