OpenShift Node / OCPNODE-2305

Schedule crun process on the first CPU within the cgroups cpuset for container on isolated CPUs


    • Type: Epic
    • Resolution: Unresolved
    • BU Product Work
    • To Do
    • OCPSTRAT-1292 - Don't interrupt pinned CPU pods by exec probes
    • 67% To Do, 17% In Progress, 17% Done
    • L
    • Telco 5G RAN

      Epic Goal

      This epic introduces optional functionality in crun to direct exec operations to a specific CPU in a pod with the Guaranteed QoS class.

      Originally, this was going to be a direct port of runc commit afc23e33. It has since been decided that this was not the right approach (see https://github.com/opencontainers/runc/pull/4283 ). Instead, a new runtime-spec proposal (see https://github.com/opencontainers/runtime-spec/pull/1253 ) would allow the CPU affinity of executed processes to be specified; runc and crun would then use it to pin each new process to specific CPUs.

      This will likely require changes in cri-o to take advantage of the new runc/crun functionality, as well as a way to enable it for a pod or container (e.g. a new annotation).
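
The affinity mechanism discussed above can be illustrated with a minimal sketch, assuming a Linux host. It is not the crun or runc implementation: it only shows the underlying idea of restricting a new process to a chosen CPU set between fork and exec, using Python's `os.sched_setaffinity`. The helper name `run_pinned` is hypothetical.

```python
import os
import subprocess
import sys


def run_pinned(argv, cpus):
    """Run argv with its CPU affinity restricted to `cpus`; return its stdout.

    Illustrative only: a container runtime would apply the affinity itself
    when it creates the exec'ed process inside the container.
    """
    def set_affinity():
        # Runs in the child after fork and before exec.
        # pid 0 means "the calling process".
        os.sched_setaffinity(0, cpus)

    result = subprocess.run(
        argv,
        preexec_fn=set_affinity,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_pinned(["true"], {min(os.sched_getaffinity(0))})` would start the command on the lowest-numbered CPU that the caller is currently allowed to use.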

      Why is this important?

      Quoting from RFE-5011:

      When a partner develops a containerized DPDK application, the partner will want to give the busy-loop polling threads full, exclusive CPU access. A housekeeping process may also be running on a separate CPU inside the same pod.

      However, we have seen that certain common Kubernetes operations cannot be done when this type of configuration is run on the RT kernel. The list of these operations is:

      • Running oc exec / oc rsh / oc cp / oc rsync on the pod
      • Having exec probes for livenessProbe or readinessProbe
      • Having an exec postStart or preStop hook

      Those operations cannot be done because the new processes started in the pod will run at a non-RT priority and could land on one of the CPUs running the busy-loop polling threads. This can add latency to the DPDK application and, in the worst case, cause a deadlock between the non-RT process and a kernel thread. Several support cases have been opened in which the vmcore crash analysis showed this issue.

      Currently, there is no control over which CPU(s) a newly exec'ed process runs on in a pod. If there were a way to ensure that the new process does not run on the CPUs owned by the busy-loop polling threads, those common admin tasks could be run on the pod.

      Scenarios

      1. See above for a list of the operations which can cause an issue for a guaranteed QoS pod running an RT application with busy-loop threads.

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • Pods without the new annotation preserve the existing behavior: new processes in a pod can run on any CPU in the pod.
      • Pods with the new annotation have new processes assigned to the first CPU in the pod.
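
The two annotation-related criteria above can be expressed as a small decision function. This is a sketch under stated assumptions, not the planned cri-o or crun logic: the annotation key and its value are placeholders (the real name is still an open question in this epic), the container's CPUs are assumed to be given in the kernel's cpuset list format (e.g. `4-7,12`), and "first CPU" is assumed to mean the lowest-numbered CPU in that set.

```python
# Placeholder only: the real annotation name has not been decided.
HYPOTHETICAL_ANNOTATION = "example.openshift.io/exec-cpu-affinity"


def parse_cpuset(spec):
    """Parse a cpuset list string such as "0-2,5" into a set of CPU ids."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def exec_affinity(annotations, container_cpuset):
    """Return the CPUs an exec'ed process may run on.

    Without the (hypothetical) annotation, the existing behavior is kept:
    any CPU in the container's cpuset. With it, only the first CPU is used,
    assumed here to be the lowest-numbered one.
    """
    cpus = parse_cpuset(container_cpuset)
    if not cpus:
        raise ValueError("container cpuset is empty")
    if annotations.get(HYPOTHETICAL_ANNOTATION) == "first":
        return {min(cpus)}
    return cpus
```

With a container cpuset of `4-7,12`, a pod without the annotation would yield all five CPUs, while a pod carrying the placeholder annotation would yield only CPU 4.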

      Dependencies (internal and external)

      1. None

      Previous Work:

      1. runc commit afc23e33

      Open questions:

      1. Name for the annotation that will trigger this behavior.

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              bwensley@redhat.com Bart Wensley