OpenShift Bugs / OCPBUGS-34583

secondary-scheduler doesn't account for new topology manager configuration unless restarted


    • Important
    • No
    • CNF Compute Sprint 255, CNF Compute Sprint 256
    • 2
    • Rejected
    • False
    • Known issue: if a change to the worker nodes' `kubeletconfig` updates the topology manager configuration, the change won't be noticed by the NUMA-aware scheduler, which may lead to incorrect scheduling decisions. The workaround is to restart the NUMA-aware scheduler.
    • Known Issue
    • In Progress
    • Status updates:
      2024-07-02: GREEN: fix merged, backported, d/s builds pending
      2024-06-25: GREEN: fix merged, backported, d/s builds pending
      2024-06-19: GREEN: u/s fix posted, got preliminary approval

      Description of problem:

          The numaresources secondary-scheduler doesn't account for a new topology-manager configuration unless its pod is deleted and recreated. This means a TopologyAffinityError (TAE) can still occur if the topology manager policy was updated to single-numa-node and the scheduler was not restarted after that update.
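      To confirm the kubelet side of the mismatch, check which topology manager policy the worker node is actually running with after the kubeletconfig change (a sketch; the node name is taken from the reproduction below, and the kubelet config path is assumed to be the standard RHCOS location):

      # Inspect the rendered kubelet configuration on the worker node
      # (/etc/kubernetes/kubelet.conf path assumed; adjust if your nodes differ)
      oc debug node/worker-0 -- chroot /host \
        grep -i topologyManager /etc/kubernetes/kubelet.conf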

      Version-Release number of selected component (if applicable):

          4.16

      How reproducible:

          always

      Steps to Reproduce:

      1. Start from a cluster whose kubeletconfig sets the topology manager policy to none (or any fresh cluster with the default policy).
      2. Install the NUMA Resources Operator (NROP) plus the secondary scheduler.
      3. Update the topology manager policy to single-numa-node (see the KubeletConfig sketch after the note below).
      4. Reproduce the TAE scenario with a simple guaranteed (CPU + memory) pod, for example the deployment below.
      
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: gu-one-cnt
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: test
        template:
          metadata:
            labels:
              app: test
          spec:
            schedulerName: topo-aware-scheduler 
            containers:
            - name: ctnr
              image: registry.hlxcl12.lab.eng.tlv2.redhat.com:5000/numaresources-operator-tests:4.16.999-snapshot
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  memory: "150Mi"
                  cpu: "70"
                  ephemeral-storage: "100Mi"
                requests:
                  memory: "150Mi"
                  cpu: "70"
                  ephemeral-storage: "100Mi"
      
      * This was tested on a cluster with 78 CPUs available at the node level, while at most 40 CPUs were available per NUMA zone. For reproduction, adjust the CPU amounts to whatever is required on your cluster to hit the TAE.
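      For step 3, a minimal KubeletConfig sketch that switches the worker nodes to the single-numa-node policy (the object name and the machine config pool selector label are assumptions; keep whatever CPU/memory reservation settings the cluster already uses):

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: worker-tm-single-numa                  # hypothetical name
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ""   # assumed worker pool label
        kubeletConfig:
          cpuManagerPolicy: static                   # static CPU manager so exclusive CPUs can be NUMA-aligned
          cpuManagerReconcilePeriod: 5s
          topologyManagerPolicy: single-numa-node

      For step 4, apply the deployment and scale it until the requests exceed what a single NUMA zone can satisfy (the file name and replica count are illustrative; the pod names in the output further down come from such a scaled deployment):

      # Apply the guaranteed-pod deployment and scale it up to trigger the TAE
      oc -n openshift-numaresources apply -f gu-one-cnt.yaml
      oc -n openshift-numaresources scale deployment/gu-one-cnt --replicas=20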

      Actual results:

      The pod is scheduled even though no resources are available within a single NUMA zone, leading to a TopologyAffinityError (TAE) on the scheduled pod(s).
      
      # oc get pod
      ..
      gu-one-cnt-796b55799-twxtg                          0/1     ContainerCreating        0               1s
      gu-one-cnt-796b55799-txt2m                          0/1     ContainerStatusUnknown   0               4s
      gu-one-cnt-796b55799-vmbhs                          0/1     ContainerStatusUnknown   0               3s
      gu-one-cnt-796b55799-vwfnn                          0/1     ContainerStatusUnknown   0               8s
      gu-one-cnt-796b55799-wjvg6                          0/1     ContainerStatusUnknown   0               7s
      gu-one-cnt-796b55799-xgrx9                          0/1     ContainerStatusUnknown   0               3s
      gu-one-cnt-796b55799-xhzkj                          0/1     ContainerStatusUnknown   0               8s
      gu-one-cnt-796b55799-xz8lz                          0/1     ContainerStatusUnknown   0               8s
      gu-one-cnt-796b55799-z7pmh                          0/1     ContainerStatusUnknown   0               7s
      gu-one-cnt-796b55799-z7zns                          0/1     ContainerStatusUnknown   0               3s
      gu-one-cnt-796b55799-zfvnb                          0/1     ContainerStatusUnknown   0               4s
      gu-one-cnt-796b55799-zjzfh                          0/1     ContainerStatusUnknown   0               8s
      numaresources-controller-manager-65445fd888-4vsw6   1/1     Running                  0               3h31m
      numaresourcesoperator-worker-h87tl                  2/2     Running                  10              17h
      numaresourcesoperator-worker-x9gc7                  2/2     Running                  10              17h
      secondary-scheduler-b56fbbf79-v7qmt                 1/1     Running                  0               17m
      
      
      from one pod's description:
      Events:
        Type     Reason                 Age   From                  Message
        ----     ------                 ----  ----                  -------
        Normal   Scheduled              25s   topo-aware-scheduler  Successfully assigned openshift-numaresources/gu-one-cnt-796b55799-zfvnb to worker-0
        Warning  TopologyAffinityError  25s   kubelet               Resources cannot be allocated with Topology locality
      

      Expected results:

          The pod should stay Pending instead of being scheduled onto a node that cannot satisfy single-NUMA-node alignment.

      Additional info:

          The workaround is to restart the secondary-scheduler pod by deleting it, as sketched below.
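      A sketch of the workaround, using the pod name from the output above (the actual pod name will differ per cluster):

      # Deleting the pod lets its Deployment recreate it; the new pod reads
      # the current topology manager configuration at startup.
      oc -n openshift-numaresources delete pod secondary-scheduler-b56fbbf79-v7qmt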

      People: Francesco Romani (fromani@redhat.com), Shereen Haj (rhn-support-shajmakh), Roy Shemtov