OCPBUGS-16976

Adding cgroup v2 in OpenShift 4.14 breaks critical low-latency features of NTO


    • Priority: Critical
    • Release Note Text:
      Note: Preliminary release note based on the proposed patch. Even after this bug is fixed, this remains a known issue that must be documented in both the release notes and the low latency tuning section.

      In {product-title} 4.14, all nodes use Linux control group version 2 (cgroup v2) for internal resource management in alignment with the default RHEL 9 configuration. However, if you apply a performance profile in your cluster, the low latency tuning features associated with the performance profile do not support cgroup v2.
      +
      As a result, if you apply a performance profile, all nodes in the cluster will reboot to switch back to the cgroup v1 configuration. This reboot includes control plane nodes and worker nodes that were not targeted by the performance profile.
      +
      To revert all nodes in the cluster to the cgroup v2 configuration, you must edit the `Node` resource. For more information, see xref:../nodes/clusters/nodes-cluster-cgroups-2.adoc#nodes-clusters-cgroups-2_nodes-cluster-cgroups-2[Configuring Linux cgroup v2]. You cannot revert the cluster to the cgroup v2 configuration by removing the last performance profile. (link:https://issues.redhat.com/browse/OCPBUGS-16976[*OCPBUGS-16976*]) A minimal sketch of the `Node` resource edit appears after this field list.
    • Release Note Type: Known Issue
    • Done
    • 8/1: critical/blocker, blocking CI lanes
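
      As noted in the release note above, reverting the cluster to cgroup v2 requires editing the `Node` resource. A minimal sketch of that edit, assuming the documented cluster-scoped nodes.config.openshift.io API described in "Configuring Linux cgroup v2":

        apiVersion: config.openshift.io/v1
        kind: Node
        metadata:
          name: cluster
        spec:
          cgroupMode: "v2"  # documented values are "v1" and "v2"; changing it triggers a node reboot

      Applying this manifest (for example with oc apply -f, or equivalently oc edit nodes.config cluster) reverts the nodes; per the release note, each node reboots to pick up the new cgroup configuration.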

      Description of problem:

      Critical features that the Node Tuning Operator (NTO) provides through its Performance Addon Operator (PAO) functionality, such as CPU load balancing and CPU quota, are failing.
      For example, guaranteed (GU) pods, such as those used to run latency tests like oslat, fail to start with the following error:

       

        Type     Reason          Age              From               Message
        ----     ------          ----             ----               -------
        Normal   Scheduled       25s              default-scheduler  Successfully assigned default/pod1 to ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com
        Normal   AddedInterface  24s              multus             Add eth0 [10.135.0.82/23] from ovn-kubernetes
        Normal   Pulling         24s              kubelet            Pulling image "quay.io/openshift-kni/cnf-tests:4.13"
        Normal   Pulled          3s               kubelet            Successfully pulled image "quay.io/openshift-kni/cnf-tests:4.13" in 21.459560425s (21.459579982s including waiting)
        Normal   Pulled          2s               kubelet            Container image "quay.io/openshift-kni/cnf-tests:4.13" already present on machine
        Normal   Created         1s (x2 over 2s)  kubelet            Created container test-container1
        Warning  Failed          1s (x2 over 2s)  kubelet            Error: failed to run pre-start hook for container "test-container1": set CPU load balancing: disabling CPU load balancing on cgroupv2 not yet supported
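
      The failure comes from the CRI-O pre-start hook: on cgroup v2 the cpuset controller does not expose the cpuset.sched_load_balance knob that the hook writes on cgroup v1, which is the likely source of the "not yet supported" error. To confirm which cgroup version a node is actually running, a standard check (not specific to this bug) is:

        oc debug node/ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com -- chroot /host stat -fc %T /sys/fs/cgroup
        # prints "cgroup2fs" on a cgroup v2 node and "tmpfs" on cgroup v1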

       

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-07-27-172239

      How reproducible:

      Every time.

      Steps to Reproduce:

      1. Install OCP version 4.14.
      2. Apply a performance profile.
      3. Create a GU pod with the CRI-O annotation that disables CPU load balancing:
      apiVersion: v1
      kind: Pod
      metadata:
        name: pod1
        annotations: 
          cpu-load-balancing.crio.io: "disable"
        labels:
          name: "cpuloadbalancing1"
      spec:
        containers:
        - name: test-container1
          image: quay.io/openshift-kni/cnf-tests:4.13
          command:
          - sleep
          - inf
          resources:
            limits:
              memory: "100Mi"
              cpu: "4"
            requests:
              memory: "100Mi"
              cpu: "4"
        runtimeClassName: performance-performance
        nodeSelector:
          kubernetes.io/hostname: ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com
      
      4. Apply the above YAML, as shown in the example below.
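
      A minimal way to apply the manifest and inspect the result (assuming the pod spec above is saved as pod1.yaml; the file name is only illustrative):

        oc apply -f pod1.yaml
        # watch the pod status, then inspect the events emitted by kubelet and CRI-O
        oc get pod pod1 -w
        oc describe pod pod1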
      

       

      Actual results:

      Events:
        Type     Reason          Age              From               Message
        ----     ------          ----             ----               -------
        Normal   Scheduled       25s              default-scheduler  Successfully assigned default/pod1 to ocp-worker-0.libvirt.lab.eng.tlv2.redhat.com
        Normal   AddedInterface  24s              multus             Add eth0 [10.135.0.82/23] from ovn-kubernetes
        Normal   Pulling         24s              kubelet            Pulling image "quay.io/openshift-kni/cnf-tests:4.13"
        Normal   Pulled          3s               kubelet            Successfully pulled image "quay.io/openshift-kni/cnf-tests:4.13" in 21.459560425s (21.459579982s including waiting)
        Normal   Pulled          2s               kubelet            Container image "quay.io/openshift-kni/cnf-tests:4.13" already present on machine
        Normal   Created         1s (x2 over 2s)  kubelet            Created container test-container1
        Warning  Failed          1s (x2 over 2s)  kubelet            Error: failed to run pre-start hook for container "test-container1": set CPU load balancing: disabling CPU load balancing on cgroupv2 not yet supported
      
       

      Expected results:

      The pod should be running and CPU load balancing should be disabled.
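
      One way to verify the expected behavior on a cgroup v1 node (a hedged sketch; the exact cgroup path depends on the pod UID and QoS class, so the slice and scope names below are placeholders):

        # The CRI-O hook disables load balancing on cgroup v1 by writing 0 to the
        # container's cpuset.sched_load_balance file:
        cat /sys/fs/cgroup/cpuset/kubepods.slice/<pod-slice>/<container-scope>/cpuset.sched_load_balance
        # expected output once the pod starts successfully: 0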
       

      Additional info:

       

        Yanir Quinn (yquinn@redhat.com)
        Mallapadi Niranjan (mniranja)
        Liquan Cui