This is a tracking/planning Epic to make the dependency between CNF and OCPNODE explicit.

Epic Goal

Enhance the existing CPU manager and topology manager kubelet policies, or propose new ones, to enable latency-optimal container pinning in constrained environments. The main example is RAN-like workers with 20-24 cores, possibly hyperthreaded. Two requirements collide: reducing overhead (using all cores) vs. avoiding noisy neighbours.

Why is this important?

There are not enough threads in total if we keep some of them unused
A latency-sensitive workload needs to avoid any neighbours on the same core(s)

Scenarios

The isolated CPU pool contains a partial core (one thread from a core that has a sibling in the reserved pool). The platform needs to make sure that anything latency sensitive is not pinned to that thread, because otherwise it will be affected by a noisy neighbour. This scenario is useful for minimizing the number of threads used for housekeeping (one thread for reserved and one for infrastructure pods).
A workload that is latency sensitive must be the only workload running on a core, or it must be rejected or trigger a noisy-neighbour warning of some kind
A workload that is security sensitive must be the only workload running on a core or must be rejected to make sure it cannot be compromised using timing and other cache related attacks (Spectre and other vulnerabilities included).
Being the only workload on a core might mean using all threads or making unused threads unavailable to others
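
The first scenario can be illustrated with a minimal sketch. It is not kubelet code; the function names and the toy topology (4 cores, 2 threads per core, core N owning threads N and N+4) are assumptions made only for illustration. It shows how isolated threads whose sibling sits outside the isolated pool could be identified, so that latency-sensitive containers are never pinned to them.

```python
from typing import Dict, FrozenSet, Set


def build_siblings(cores: int, threads_per_core: int) -> Dict[int, FrozenSet[int]]:
    """Map each thread id to the set of threads sharing its physical core.

    Hypothetical numbering: core N owns threads N, N + cores, N + 2*cores, ...
    """
    siblings: Dict[int, FrozenSet[int]] = {}
    for core in range(cores):
        group = frozenset(core + t * cores for t in range(threads_per_core))
        for cpu in group:
            siblings[cpu] = group
    return siblings


def partial_core_threads(siblings: Dict[int, FrozenSet[int]], isolated: Set[int]) -> Set[int]:
    """Return isolated threads whose core is not fully inside the isolated pool."""
    return {cpu for cpu in isolated if not siblings[cpu] <= isolated}


if __name__ == "__main__":
    topology = build_siblings(cores=4, threads_per_core=2)
    reserved = {0}                      # one housekeeping thread
    isolated = set(topology) - reserved
    partial = partial_core_threads(topology, isolated)
    print("partial-core threads:", sorted(partial))
    print("safe for latency-sensitive pinning:", sorted(isolated - partial))
```

In this toy layout thread 4 shares a core with reserved thread 0, so it is the only thread flagged; it could still serve infrastructure pods, which matches the housekeeping use described in the scenario.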

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
E2E or functional test must demonstrate the correct allocation happens.
A guaranteed latency sensitive workload has a way to be isolated from noisy neighbours on sibling threads
A guaranteed latency sensitive workload that does not occupy a whole core (all its threads) must be rejected with a meaningful error
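
The last criterion can be sketched as a simple admission check. This is an assumption-laden illustration, not the proposed policy: `WholeCoreError` and `check_whole_cores` are hypothetical names, and the check only covers the case where a whole-core request means a CPU count that is a multiple of the threads per core.

```python
class WholeCoreError(ValueError):
    """Hypothetical error for a request that would leave a core partially used."""


def check_whole_cores(requested_cpus: int, threads_per_core: int) -> int:
    """Return the number of physical cores needed, or reject the request.

    A request is accepted only if it fills every thread of each core it uses.
    """
    if requested_cpus <= 0 or threads_per_core <= 0:
        raise ValueError("requested_cpus and threads_per_core must be positive")
    if requested_cpus % threads_per_core != 0:
        raise WholeCoreError(
            f"requested {requested_cpus} CPUs, but whole-core allocation needs "
            f"a multiple of {threads_per_core} (threads per core)"
        )
    return requested_cpus // threads_per_core


if __name__ == "__main__":
    print(check_whole_cores(4, 2))      # accepted: fills two whole cores
    try:
        check_whole_cores(3, 2)         # rejected: one sibling thread left over
    except WholeCoreError as err:
        print("rejected:", err)
```

The error message names both the requested count and the required multiple, which is one way to satisfy the "meaningful error" part of the criterion.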

Dependencies (internal and external)

cpu manager
(topology manager as it shares some data with cpu manager)

Previous Work (Optional):

Open questions::

Upstream or downstream first?
(related to previous work to some extent) can the existing cpumanager static policy guarantee the desired behaviour?
Where does the test suite belong? It may not fit k8s (for the same reason as the policy: too narrow a use case?), and we (telco 5G) want to run it anyway. Perhaps submit it upstream first and take it into OCP/CNF if upstream rejects it?
Is rejection the only way if the pod is not requesting the whole core? Can the infrastructure "block" other threads from the rest of the system?

Risk assessment and work estimate

There is significant risk here if an upstream solution is expected. We have a design proposal, but the KEP process is lengthy and uncertain. A downstream-only solution depends on the willingness of the OCP team.

The proposed solution is mostly isolated from existing code at the node (kubelet) level. However, the impact of the policies on resource accounting can be relevant, which increases the risk that the proposal will not be accepted quickly.

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Assignee:: Francesco Romani

Reporter:: Martin Sivak

Need Info From:: None

Contributors:: None

QA Contact:: Sunil Choudhary

Doc Contact:: None

Votes:: 0

Watchers:: 3

Created:: 2021/04/08 6:46 AM

Updated:: 2025/07/16 1:18 PM

Resolved:: 2021/05/13 1:44 PM
