OCPBUGS-43280

Kubelet: Change in the available CPUs accounting


    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • 4.18.0
    • Node Tuning Operator
    • None
    • Quality / Stability / Reliability
    • False
    • None
    • None
    • Important
    • No
    • 2024-12-08: Once the upstream fix lands in k8s 1.33, we will require a backport. If it is not accepted within the 4.18 timeline, we will require a release note for a known issue.

      2024-12-05: Issue and fix filed upstream.
    • None
    • None
    • Rejected
    • CNF Compute Sprint 263, CNF Compute Sprint 264, CNF Compute Sprint 265, CNF Compute Sprint 266, CNF Compute Sprint 267, CNF Compute Sprint 268, CNF Compute Sprint 269, CNF Compute Sprint 270, CNF Compute Sprint 271, CNF Compute Sprint 272, CNF Compute Sprint 273, CNF Compute Sprint 274, CNF Compute Sprint 275, CNF Compute Sprint 276, CNF Compute Sprint 277, CNF Compute Sprint 278
    • 16
    • Done
    • Known Issue
    • Currently, pods that use a `guaranteed` QoS class and request whole CPUs might not restart automatically after a node reboot or kubelet restart. The issue might occur in nodes configured with a static CPU Manager policy and using the `full-pcpus-only` specification, and when most or all CPUs on the node are already allocated by such workloads. As a workaround, manually delete and recreate the affected pods. (link:https://issues.redhat.com/browse/OCPBUGS-43280[*OCPBUGS-43280*])
    • None
    • None
    • None
    • None

      Description of problem:

      NTO CI started failing with:
       • [FAILED] [247.873 seconds]
      [rfe_id:27363][performance] CPU Management Verification of cpu_manager_state file when kubelet is restart [It] [test_id: 73501] defaultCpuset should not change [tier-0]
      /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:309
        [FAILED] Expected
            <cpuset.CPUSet>: {
                elems: {0: {}, 2: {}},
            }
        to equal
            <cpuset.CPUSet>: {
                elems: {0: {}, 1: {}, 2: {}, 3: {}},
            }
        In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:332 @ 10/04/24 16:56:51.436 
      
      The failure occurred because the test pod could not be admitted after the kubelet restart.
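
The comparison the test performs can be sketched roughly as follows. This is a minimal illustration, not the test's actual Go code: it assumes the kubelet checkpoint file (typically `/var/lib/kubelet/cpu_manager_state`) is JSON with a `defaultCpuSet` field holding a CPU list string, and the sample contents, pod UID, and container name below are hypothetical values chosen to match the cpusets in the failure output.

```python
import json


def parse_cpuset(text):
    """Parse a CPU list string such as "0-1,3" into a set of ints."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def default_cpuset(state_json):
    """Return the default cpuset recorded in a checkpoint JSON string."""
    state = json.loads(state_json)
    return parse_cpuset(state["defaultCpuSet"])


def default_cpuset_unchanged(before_json, after_json):
    """True when the default cpuset is the same in both checkpoints."""
    return default_cpuset(before_json) == default_cpuset(after_json)


# Hypothetical checkpoint contents (assumed shape, not taken from the report).
before = '{"policyName": "static", "defaultCpuSet": "0-3", "entries": {}}'
after = (
    '{"policyName": "static", "defaultCpuSet": "0,2", '
    '"entries": {"pod-uid": {"test": "1,3"}}}'
)

if __name__ == "__main__":
    print(sorted(default_cpuset(before)))
    print(sorted(default_cpuset(after)))
    print(default_cpuset_unchanged(before, after))
```

With these sample inputs the two default cpusets differ, which mirrors the mismatch between `{0, 2}` and `{0, 1, 2, 3}` reported by the test.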
      
      The failure happens at this line:
      https://github.com/openshift/kubernetes/blob/cec2232a4be561df0ba32d98f43556f1cad1db01/pkg/kubelet/cm/cpumanager/policy_static.go#L352 
      
      Something has changed in how the kubelet accounts for `availablePhysicalCPUs`.
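
To make the accounting concern concrete, here is a toy model of one way such a rejection could arise after a restart. It is an assumption for illustration only and is not the kubelet's implementation: the function names, the `reuse_existing` switch, and the reserved/assigned CPU layout are all hypothetical. The idea is that if a container's own previously recorded CPUs are counted as unavailable when it is re-admitted, a request that fit before the restart no longer fits on a small node.

```python
def available_cpus(all_cpus, reserved, assignments):
    """CPUs that are neither reserved nor already assigned to a container."""
    allocated = set().union(*assignments.values())
    return all_cpus - reserved - allocated


def admit(container, requested, all_cpus, reserved, assignments, reuse_existing):
    """Toy admission check (hypothetical, not kubelet code).

    When reuse_existing is True, a container that already has a recorded
    assignment is admitted without consulting the free pool.
    """
    if reuse_existing and container in assignments:
        return True
    return requested <= len(available_cpus(all_cpus, reserved, assignments))


# Assumed 4-CPU node: CPUs 0 and 2 reserved, test container already holds 1 and 3.
ALL_CPUS = {0, 1, 2, 3}
RESERVED = {0, 2}
ASSIGNMENTS = {"test": {1, 3}}

if __name__ == "__main__":
    # Own allocation counted against the free pool: rejected.
    print(admit("test", 2, ALL_CPUS, RESERVED, ASSIGNMENTS, reuse_existing=False))
    # Existing allocation honored first: admitted.
    print(admit("test", 2, ALL_CPUS, RESERVED, ASSIGNMENTS, reuse_existing=True))
```

In this model the free pool is empty after the restart because the container's own CPUs are still recorded as assigned, so a 2-CPU request fails unless the existing assignment is honored first. Whether this matches the actual cause should be checked against the referenced line in `policy_static.go`.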
      
      

      Version-Release number of selected component (if applicable):

          4.18 (started happening after OCP was rebased on top of k8s 1.31)

      How reproducible:

          Always

      Steps to Reproduce:

          1. Set up a system with 4 CPUs and apply a performance profile with single-numa-policy
          2. Run pao-functests
          

      Actual results:

          Tests failing with:
       • [FAILED] [247.873 seconds]
      [rfe_id:27363][performance] CPU Management Verification of cpu_manager_state file when kubelet is restart [It] [test_id: 73501] defaultCpuset should not change [tier-0]
      /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:309
        [FAILED] Expected
            <cpuset.CPUSet>: {
                elems: {0: {}, 2: {}},
            }
        to equal
            <cpuset.CPUSet>: {
                elems: {0: {}, 1: {}, 2: {}, 3: {}},
            }
        In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:332 @ 10/04/24 16:56:51.436

      Expected results:

          Tests should pass

      Additional info:

          NOTE: The issue occurs only on systems with a small number of CPUs (4 in our case).

              titzhak Talor Itzhak
              titzhak Talor Itzhak
              None
              None
              Mallapadi Niranjan Mallapadi Niranjan
              None