OCPBUGS-16976: Adding cgroup v2 in OpenShift 4.14 breaks critical low-latency features of NTO


    • Critical
      Note: Preliminary release note based on the proposed patch. Even after this bug is fixed, it remains a known issue that must be documented in both the release notes and the low latency tuning section.

      In {product-title} 4.14, all nodes use Linux control group version 2 (cgroup v2) for internal resource management in alignment with the default RHEL 9 configuration. However, if you apply a performance profile in your cluster, the low latency tuning features associated with the performance profile do not support cgroup v2.
      +
      As a result, if you apply a performance profile, all nodes in the cluster will reboot to switch back to the cgroup v1 configuration. This reboot includes control plane nodes and worker nodes that were not targeted by the performance profile.
      +
      To revert all nodes in the cluster to the cgroup v2 configuration, you must edit the `Node` resource. For more information, see xref:../nodes/clusters/nodes-cluster-cgroups-2.adoc#nodes-clusters-cgroups-2_nodes-cluster-cgroups-2[Configuring Linux cgroup v2]. You cannot revert the cluster to the cgroup v2 configuration by removing the last performance profile. (link:https://issues.redhat.com/browse/OCPBUGS-16976[*OCPBUGS-16976*])
    • Known Issue
    • Done
    • 8/1: critical/blocker, blocking CI lanes
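
      Per the release note above, reverting the cluster to cgroup v2 is done by editing the cluster-scoped `Node` configuration resource. A minimal sketch of that edit, assuming the `spec.cgroupMode` field described in the linked "Configuring Linux cgroup v2" docs:

      ```yaml
      apiVersion: config.openshift.io/v1
      kind: Node
      metadata:
        name: cluster        # the cluster-scoped Node config object is named "cluster"
      spec:
        cgroupMode: "v2"     # valid values are "v1" and "v2"
      ```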

      Description of problem:

      Critical low-latency features that the Node Tuning Operator provides through PAO functionality, such as CPU load balancing and CFS quota control, are failing.
      For example, guaranteed (GU) pods used to run latency tests such as oslat fail with the following error:

       

      Events:
        Type     Reason          Age              From               Message
        ----     ------          ----             ----               -------
        Normal   Scheduled       25s              default-scheduler  Successfully assigned default/pod1 to ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com
        Normal   AddedInterface  24s              multus             Add eth0 [10.135.0.82/23] from ovn-kubernetes
        Normal   Pulling         24s              kubelet            Pulling image "quay.io/openshift-kni/cnf-tests:4.13"
        Normal   Pulled          3s               kubelet            Successfully pulled image "quay.io/openshift-kni/cnf-tests:4.13" in 21.459560425s (21.459579982s including waiting)
        Normal   Pulled          2s               kubelet            Container image "quay.io/openshift-kni/cnf-tests:4.13" already present on machine
        Normal   Created         1s (x2 over 2s)  kubelet            Created container test-container1
        Warning  Failed          1s (x2 over 2s)  kubelet            Error: failed to run pre-start hook for container "test-container1": set CPU load balancing: disabling CPU load balancing on cgroupv2 not yet supported
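
      The pre-start hook fails because, on cgroup v1, CRI-O honors the annotation by writing `0` to the per-cpuset `cpuset.sched_load_balance` file, and cgroup v2 exposes no equivalent knob. A rough sketch of that check (hypothetical paths and helper, not CRI-O's actual code):

      ```shell
      #!/bin/sh
      # Disable kernel CPU load balancing for a cpuset cgroup directory.
      # Only cgroup v1 exposes cpuset.sched_load_balance; on cgroup v2 the
      # file does not exist, so the hook can only fail, as in the event above.
      disable_load_balancing() {
          knob="$1/cpuset.sched_load_balance"
          if [ -f "$knob" ]; then
              echo 0 > "$knob"    # 0 = load balancing off for this cpuset
          else
              echo "disabling CPU load balancing on cgroupv2 not yet supported" >&2
              return 1
          fi
      }
      ```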

       

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-07-27-172239

      How reproducible:

      Every time.

      Steps to Reproduce:

      1. Install OCP version 4.14
      2. Apply Performance profile
      3. Create a guaranteed (GU) pod with the CRI-O annotation that disables CPU load balancing:
      apiVersion: v1
      kind: Pod
      metadata:
        name: pod1
        annotations:
          cpu-load-balancing.crio.io: "disable"
        labels:
          name: "cpuloadbalancing1"
      spec:
        containers:
        - name: test-container1
          image: quay.io/openshift-kni/cnf-tests:4.13
          command:
          - sleep
          - inf
          resources:
            limits:
              memory: "100Mi"
              cpu: "4"
            requests:
              memory: "100Mi"
              cpu: "4"
        runtimeClassName: performance-performance
        nodeSelector:
          kubernetes.io/hostname: ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com
      
      4. Apply the above YAML.
      

       

      Actual results:

      Events:
        Type     Reason          Age              From               Message
        ----     ------          ----             ----               -------
        Normal   Scheduled       25s              default-scheduler  Successfully assigned default/pod1 to ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com
        Normal   AddedInterface  24s              multus             Add eth0 [10.135.0.82/23] from ovn-kubernetes
        Normal   Pulling         24s              kubelet            Pulling image "quay.io/openshift-kni/cnf-tests:4.13"
        Normal   Pulled          3s               kubelet            Successfully pulled image "quay.io/openshift-kni/cnf-tests:4.13" in 21.459560425s (21.459579982s including waiting)
        Normal   Pulled          2s               kubelet            Container image "quay.io/openshift-kni/cnf-tests:4.13" already present on machine
        Normal   Created         1s (x2 over 2s)  kubelet            Created container test-container1
        Warning  Failed          1s (x2 over 2s)  kubelet            Error: failed to run pre-start hook for container "test-container1": set CPU load balancing: disabling CPU load balancing on cgroupv2 not yet supported
      
       

      Expected results:

      The pod should be running and CPU load balancing should be disabled.
       

      Additional info:

       

            yquinn@redhat.com Yanir Quinn
            mniranja Mallapadi Niranjan
            Liquan Cui Liquan Cui