Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.18.0
Component/s: Node Tuning Operator
Labels:
None

Severity:
Important
Regression:
Yes
Sprint:
CNF Compute Sprint 263, CNF Compute Sprint 264, CNF Compute Sprint 265, CNF Compute Sprint 266, CNF Compute Sprint 267, CNF Compute Sprint 268
sprint_count:
6
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Release Note Text:

The issue can reproduce when all of the following conditions hold:
1. Nodes are configured with the CPUManager static policy and the full-pcpus-only option.
2. All or almost all CPUs on the nodes are allocated to Guaranteed workloads requesting integral CPUs.
3. The node reboots or the kubelet restarts.

When these conditions are met, some of the Guaranteed workloads requesting integral CPUs might not come back up after the reboot or restart.

Workaround: delete the pods that did not come up and recreate them.

Release Note Type:
Known Issue
Release Note Status:
In Progress
Latest Status Summary:

2024-12-08: Once the upstream fix lands in k8s 1.33, we will require a backport. If it is not accepted within the 4.18 timeline, we will require a known-issue release note.

2024-12-05: Issue and fix filed upstream.

RH Private Keywords:
Target Version:

4.18.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

NTO CI started failing with:
 • [FAILED] [247.873 seconds]
[rfe_id:27363][performance] CPU Management Verification of cpu_manager_state file when kubelet is restart [It] [test_id: 73501] defaultCpuset should not change [tier-0]
/go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:309
  [FAILED] Expected
      <cpuset.CPUSet>: {
          elems: {0: {}, 2: {}},
      }
  to equal
      <cpuset.CPUSet>: {
          elems: {0: {}, 1: {}, 2: {}, 3: {}},
      }
  In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:332 @ 10/04/24 16:56:51.436 

The failure occurred because the test pod could not be admitted after the kubelet restart.

Additionally, the failure happens at this line:
https://github.com/openshift/kubernetes/blob/cec2232a4be561df0ba32d98f43556f1cad1db01/pkg/kubelet/cm/cpumanager/policy_static.go#L352

Something has changed in how the kubelet accounts for `availablePhysicalCPUs`.
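To make the suspected accounting problem easier to reason about, here is a minimal Go sketch of one way a pod could be rejected on re-admission. It is a toy model under stated assumptions, not the kubelet's code: the `cpuSet` type, the `admit` check, and the reserved/pod CPU split are all hypothetical, and the "crediting" step at the end is only one conceivable remedy, not necessarily what the linked upstream PR does.

```go
package main

import "fmt"

// cpuSet is a tiny stand-in for the kubelet's cpuset.CPUSet type.
type cpuSet map[int]struct{}

// newSet builds a cpuSet from a list of CPU IDs.
func newSet(cpus ...int) cpuSet {
	s := cpuSet{}
	for _, c := range cpus {
		s[c] = struct{}{}
	}
	return s
}

// minus returns the CPUs in s that are not in other.
func (s cpuSet) minus(other cpuSet) cpuSet {
	out := cpuSet{}
	for c := range s {
		if _, taken := other[c]; !taken {
			out[c] = struct{}{}
		}
	}
	return out
}

// admit is a hypothetical admission check: a request is rejected when it
// asks for more CPUs than are currently counted as available.
func admit(requested int, available cpuSet) bool {
	return requested <= len(available)
}

func main() {
	all := newSet(0, 1, 2, 3)
	reserved := newSet(0, 2) // assumed reserved CPUs, for illustration only
	podCPUs := newSet(1, 3)  // assumed CPUs recorded for the pod before the restart

	// First admission: nothing is assigned yet, so two CPUs are free.
	fmt.Println(admit(2, all.minus(reserved)))

	// Re-admission after restart: the pod's own CPUs are still recorded as
	// assigned, so a naive count sees zero free CPUs and rejects the pod.
	assigned := podCPUs
	fmt.Println(admit(2, all.minus(reserved).minus(assigned)))

	// Crediting the pod's own CPUs back before counting avoids the rejection.
	assignedToOthers := assigned.minus(podCPUs)
	fmt.Println(admit(2, all.minus(reserved).minus(assignedToOthers)))
}
```

Running it prints `true`, `false`, `true`. On a 4-CPU node there is no slack left once the pod's CPUs are counted as taken, which would be consistent with the issue showing up only on small systems.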

Version-Release number of selected component (if applicable):

    4.18 (started happening after OCP was rebased on top of k8s 1.31)

How reproducible:

    Always

Steps to Reproduce:

    1. Set up a system with 4 CPUs and apply performance-profile with single-numa-policy
    2. Run pao-functests

Actual results:

    Tests failing with:
 • [FAILED] [247.873 seconds]
[rfe_id:27363][performance] CPU Management Verification of cpu_manager_state file when kubelet is restart [It] [test_id: 73501] defaultCpuset should not change [tier-0]
/go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:309
  [FAILED] Expected
      <cpuset.CPUSet>: {
          elems: {0: {}, 2: {}},
      }
  to equal
      <cpuset.CPUSet>: {
          elems: {0: {}, 1: {}, 2: {}, 3: {}},
      }
  In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:332 @ 10/04/24 16:56:51.436

Expected results:

    Tests should pass

Additional info:

    NOTE: The issue occurs only on systems with a small number of CPUs (4 in our case).
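
For anyone checking a node by hand, the failing assertion boils down to comparing `defaultCpuSet` in the kubelet's `cpu_manager_state` file before and after the restart. The Go sketch below shows that comparison on two in-memory snapshots. The struct fields follow the JSON layout of the state file; the helper name `defaultChanged` and the sample snapshots are hypothetical and are not taken from the failing CI run.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpoint mirrors the JSON fields of the kubelet cpu_manager_state file.
type checkpoint struct {
	PolicyName    string                       `json:"policyName"`
	DefaultCPUSet string                       `json:"defaultCpuSet"`
	Entries       map[string]map[string]string `json:"entries,omitempty"`
	Checksum      uint64                       `json:"checksum"`
}

// defaultChanged reports whether defaultCpuSet differs between two snapshots.
// It compares the raw strings, so both snapshots must use the same notation.
func defaultChanged(before, after []byte) (bool, error) {
	var b, a checkpoint
	if err := json.Unmarshal(before, &b); err != nil {
		return false, fmt.Errorf("parsing first snapshot: %w", err)
	}
	if err := json.Unmarshal(after, &a); err != nil {
		return false, fmt.Errorf("parsing second snapshot: %w", err)
	}
	return b.DefaultCPUSet != a.DefaultCPUSet, nil
}

func main() {
	// Hypothetical snapshots, not taken from the failing CI run.
	before := []byte(`{"policyName":"static","defaultCpuSet":"0-3","checksum":1}`)
	after := []byte(`{"policyName":"static","defaultCpuSet":"0,2","entries":{"pod-uid":{"ctr":"1,3"}},"checksum":2}`)

	changed, err := defaultChanged(before, after)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("defaultCpuSet changed:", changed)
}
```

With these sample snapshots the program reports that the default set changed, because CPUs 1 and 3 are still listed under `entries` for a container in the second snapshot.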

blocks

OCPBUGS-43566 [4.17] E2E: test related to cpumanager state file check during kubelet restart fails

Closed

is cloned by

OCPBUGS-44177 [4.17] Kubelet: Change in the available CPUs accounting

ASSIGNED

OCPBUGS-43566 [4.17] E2E: test related to cpumanager state file check during kubelet restart fails

Closed

links to

https://github.com/kubernetes/kubernetes/pull/129079

Assignee: Talor Itzhak

Reporter: Talor Itzhak

QA Contact: Mallapadi Niranjan

Votes: 0

Watchers: 11

Created: 2024/10/14 9:38 AM

Updated: 2025/03/10 6:08 PM
