Background:
ZTE (Acct # 1329724) is deploying low-latency, business-critical 5G CNF applications. See case 03518427.
Proposed title of this feature request
Flexible Topology Manager policy to allow customized NUMA Node affinity
What is the nature and description of the request?
Currently we provide the restricted and single-numa-node Topology Manager policies to ensure that pods access resources from a single NUMA zone, in order to meet high-performance and/or low-latency requirements.
Customer requirement: CPUs must have strong affinity with memory (huge pages), that is, both must be allocated from the same NUMA zone, while NICs need only soft affinity (access from a remote NUMA zone is allowed).
For example, the customer has the following resources available per NUMA zone:
NUMA zone 1: 4 CPUs, 16 GB RAM (huge pages), 1 physical NIC.
NUMA zone 2: 8 CPUs, 32 GB RAM (huge pages), no physical NIC.
NUMA zone 3: 16 CPUs, 64 GB RAM (huge pages), no physical NIC.
Topology Manager Policy used: restricted
Pod requirement: 8 CPUs, 32 GB RAM, 1 VF; Guaranteed QoS class
Current Result:
Currently, the pod cannot be scheduled and fails with TopologyAffinityError. This is by design: the restricted policy requires all requested resources (CPU, RAM, NIC) to come from the same NUMA zone. None of the above NUMA zones meets the requirements: zone 1 has the NIC but only 4 CPUs and 16 GB RAM, while zones 2 and 3 have enough CPUs and RAM but no NIC.
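The rejection can be reproduced with a small model of the example. This is only a sketch: the names NumaZone, unmet_resources and single_zone_candidates are made up for illustration, the check is a simplification of what the kubelet does (it does not reproduce the real hint-merging logic), and it assumes that any physical NIC in a zone can supply the requested VF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaZone:
    zone_id: int
    cpus: int
    hugepages_gb: int
    nics: int


# Figures copied from the example above.
ZONES = [
    NumaZone(1, cpus=4, hugepages_gb=16, nics=1),
    NumaZone(2, cpus=8, hugepages_gb=32, nics=0),
    NumaZone(3, cpus=16, hugepages_gb=64, nics=0),
]
POD = {"cpus": 8, "hugepages_gb": 32, "vfs": 1}


def unmet_resources(zone, pod):
    """List the pod requirements that this zone cannot serve locally."""
    unmet = []
    if zone.cpus < pod["cpus"]:
        unmet.append("cpu")
    if zone.hugepages_gb < pod["hugepages_gb"]:
        unmet.append("hugepages")
    # Assumption: any physical NIC in the zone can supply the requested VFs.
    if pod["vfs"] > 0 and zone.nics == 0:
        unmet.append("nic")
    return unmet


def single_zone_candidates(zones, pod):
    """Zones that could serve CPU, memory and NIC all locally."""
    return [z.zone_id for z in zones if not unmet_resources(z, pod)]


for zone in ZONES:
    print(zone.zone_id, unmet_resources(zone, POD))
print("candidates:", single_zone_candidates(ZONES, POD))
```

With the figures above, zone 1 is short on CPU and huge pages, zones 2 and 3 lack a NIC, and the candidate list is empty, which matches the TopologyAffinityError outcome.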
Customer requirement:
Design a new Topology Manager policy that always enforces strong affinity between CPU and local RAM (NUMA zone 2 or 3 in this case) while still allowing the NIC to be accessed remotely from NUMA zone 1 (soft affinity).
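One possible reading of the requested behaviour, as a hedged sketch rather than a proposed implementation: CPU and huge pages are treated as a hard constraint that must be satisfied by one zone, and the NIC is a preference that may fall back to a remote zone. The names Zone, Placement and flexible_placements are hypothetical, and the rule "list local-NIC placements first" is an assumption, not something stated in this request.

```python
from typing import List, NamedTuple


class Zone(NamedTuple):
    zone_id: int
    cpus: int
    hugepages_gb: int
    nics: int


class Placement(NamedTuple):
    compute_zone: int  # zone providing both the CPUs and the huge pages
    nic_zone: int      # zone providing the NIC (may differ from compute_zone)
    nic_local: bool


ZONES = [
    Zone(1, cpus=4, hugepages_gb=16, nics=1),
    Zone(2, cpus=8, hugepages_gb=32, nics=0),
    Zone(3, cpus=16, hugepages_gb=64, nics=0),
]


def flexible_placements(zones, cpus, hugepages_gb) -> List[Placement]:
    """Enumerate placements with strong CPU-RAM and soft NIC affinity."""
    nic_zones = [z.zone_id for z in zones if z.nics > 0]
    if not nic_zones:
        return []  # no NIC anywhere on the node: nothing to attach
    placements = []
    for z in zones:
        # Strong affinity: CPU and huge pages must both fit in this zone.
        if z.cpus < cpus or z.hugepages_gb < hugepages_gb:
            continue
        # Soft affinity: use a local NIC if present, else a remote one.
        local = z.zone_id in nic_zones
        nic_zone = z.zone_id if local else nic_zones[0]
        placements.append(Placement(z.zone_id, nic_zone, local))
    # Assumed preference: placements with a local NIC are listed first.
    placements.sort(key=lambda p: not p.nic_local)
    return placements


print(flexible_placements(ZONES, cpus=8, hugepages_gb=32))
```

For the example node this yields zone 2 and zone 3 as compute zones, each with the NIC served remotely from zone 1. Zone 1 is excluded because it cannot satisfy the CPU and memory requirement.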
Workaround:
The best-effort policy works in this scenario: it can place the pod on NUMA zone 2 or 3 with remote access to the NIC in NUMA zone 1.
Why the workaround cannot be accepted:
The best-effort policy does not guarantee that CPU and RAM always come from the same NUMA zone. The CPUs may come from NUMA zone 2 while the RAM comes from NUMA zone 3, which potentially leads to worse performance degradation than remote NIC access. Alternatively, some CPUs may access local-zone RAM while others access remote-zone RAM. In other words, the best-effort policy may lose the strong CPU-RAM affinity.
The single-numa-node policy does not fit this scenario either, because two NUMA zones have to be involved.
In OCP 4.13 with the NUMA Resources Operator, we still do not have such flexibility.
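The two failure modes described for best-effort can be told apart with a small check. This is an illustrative sketch only: cpu_memory_affinity is a hypothetical helper, and the labels "strong", "partial" and "remote" are invented here to name the cases in the text.

```python
def cpu_memory_affinity(cpu_zones, memory_zones):
    """Classify an allocation by how CPUs and huge pages line up.

    cpu_zones:    NUMA zone of each allocated CPU
    memory_zones: NUMA zones backing the allocated huge pages
    """
    cpu_set, mem_set = set(cpu_zones), set(memory_zones)
    if len(cpu_set) == 1 and cpu_set == mem_set:
        return "strong"   # every CPU and all memory share one zone
    if cpu_set & mem_set:
        return "partial"  # some CPUs are local to memory, some are not
    return "remote"       # no CPU shares a zone with the memory


# Desired outcome: everything in zone 2.
print(cpu_memory_affinity([2] * 8, [2]))
# CPUs from zone 2, RAM from zone 3.
print(cpu_memory_affinity([2] * 8, [3]))
# CPUs split across zones 2 and 3, RAM in zone 2.
print(cpu_memory_affinity([2] * 4 + [3] * 4, [2]))
```

Only the first allocation is acceptable under the customer requirement. The other two are the outcomes that best-effort cannot rule out.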
Why does the customer need this? (List the business requirements here)
According to the end user's (telecom supplier's) requirement, there is only one NIC for the running business application pods. The customer understands that using a remote NIC will cause some performance degradation, but with the best-effort Topology Manager policy there is a chance that CPU and RAM lose strong affinity, leading to worse performance degradation. Comparing the side effects of the two options, they have decided to allow remote NUMA zone access for the NIC, which is the best solution as of now. They therefore need this RFE to ensure that CPU and RAM (huge pages) have strong affinity in the same NUMA zone while keeping the flexibility to access a NIC in another NUMA zone (soft affinity). Currently we don't have such a flexible policy.
List any affected packages or components.
Topology Manager, Topology Manager Policies, Secondary scheduler, CPU Manager, Device Manager