Type: Epic
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Labels:
- collab

Epic Name:
LoadRiskAwareCommitment
Work Type:
BU Product Work
Blocked:
False
Ready:
False
Epic Status:
Done
Feature Link:
OCPSTRAT-116 - Load Aware Scheduling with trimaran
Flagged:

Impediment
Parent Link:
OCPSTRAT-116 - Load Aware Scheduling with trimaran
Release Note Text:
Undefined
Target Version:

openshift-4.14

Sprint:
Workloads - 4.12, Workloads Sprint 225, Workloads Sprint 226, Workloads Sprint 227, Workloads Sprint 228, Workloads Sprint 229, Workloads Sprint 230, Workloads Sprint 231, Workloads Sprint 232, Workloads Sprint 233, Workloads Sprint 234, Workloads Sprint 235

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

What is it?

A load-aware scheduler plugin that scores nodes by their risk of resource overload. The administrator specifies the acceptable overload through two configuration parameters: SafeUtilization and SafePercentile. For example, settings of 0.95 and 0.10, respectively, for the CPU resource mean that CPU utilization above 95% is acceptable for up to 10% of the time. In this case, the LoadRiskAwareCommitment scheduler plugin evaluates the probability (risk) that a node's CPU utilization exceeds 95%. This risk is compared to the acceptable percentile and a score is calculated. A low score implies high risk and a high score implies low risk; a score of zero corresponds to a risk at or above the acceptable risk.

Why do we need it?

Typically, the requested (and limit) amounts of resources for a pod do not reflect its actual resource utilization over its lifetime. A desirable resource management criterion is to place pods based not only on requested resources but also on avoiding resource overload. A simple overload threshold is not enough, because it ignores the probability of exceeding that threshold. Thus, one needs to specify both a SafeUtilization and a SafePercentile. As a result, pods are less likely to be assigned to risky nodes, which helps keep nodes from saturating and degrading performance. It also lowers the probability of evictions. Further, the LoadRiskAwareCommitment scheduler plugin accommodates scenarios in which resources are overallocated.

How is it done?

The cluster administrator configures two parameters: SafeUtilization and SafePercentile. (Default values might be 0.90 and 0.10, respectively.) We consider CPU and memory as the resources of interest. The parameters might be resource-specific.
A load monitor (Prometheus or a load watcher) calculates the average utilization and its standard deviation (std) for each resource of interest over a period of time.
The scheduler plugin fits a Beta probability distribution to the observed average and standard deviation using the method of matching moments (two simple algebraic equations). This is done for all resources of interest.
Risk is calculated as the probability that the utilization exceeds the SafeUtilization value (the upper tail of the distribution).
Relative risk is the risk divided by the SafePercentile, capped at one.
Each resource has a score, calculated as: (1 - relativeRisk) * maxScore.
The node score is the minimum (worst case) score among the resources of interest.
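
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the plugin's actual implementation: the function names, the `MAX_SCORE` value of 100 (chosen to match the Kubernetes scheduler's maximum node score), the treatment of utilization as a fraction in [0, 1], and the handling of degenerate inputs (zero standard deviation, or a variance too large for a Beta fit) are all hypothetical choices. The Beta tail is evaluated with a standard continued-fraction expansion of the regularized incomplete beta function so that the sketch needs only the standard library.

```python
import math

MAX_SCORE = 100.0  # assumption: same scale as the Kubernetes maximum node score


def _beta_cf(a, b, x, max_iter=300, eps=3e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even and odd terms of the continued fraction for this iteration.
        for aa in (
            m * (b - m) * x / ((qam + m2) * (a + m2)),
            -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2)),
        ):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(a, b, x):
    """Regularized incomplete beta function I_x(a, b), i.e. the Beta CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def fit_beta(mean, std):
    """Match the first two moments: return (alpha, beta) of the Beta fit."""
    var = std * std
    if not 0.0 < mean < 1.0 or not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("mean/std are not attainable by a Beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


def risk(mean, std, safe_utilization):
    """Probability that utilization exceeds safe_utilization."""
    if std <= 0.0 or mean <= 0.0 or mean >= 1.0:
        # Assumption: with no spread, treat utilization as a point mass.
        return 1.0 if mean > safe_utilization else 0.0
    alpha, beta = fit_beta(mean, std)
    return 1.0 - beta_cdf(alpha, beta, safe_utilization)


def resource_score(mean, std, safe_utilization, safe_percentile):
    """(1 - relativeRisk) * maxScore, with relative risk capped at one."""
    relative_risk = min(risk(mean, std, safe_utilization) / safe_percentile, 1.0)
    return (1.0 - relative_risk) * MAX_SCORE


def node_score(resources):
    """Worst-case score; resources maps name -> (mean, std, safe_util, safe_pct)."""
    return min(resource_score(*params) for params in resources.values())


example = {
    "cpu": (0.5, math.sqrt(0.05), 0.90, 0.10),
    "memory": (0.5, math.sqrt(1.0 / 12.0), 0.90, 0.20),
}
print(node_score(example))
```

In this made-up example the CPU statistics correspond to a Beta(2, 2) fit and the memory statistics to a uniform Beta(1, 1) fit. The memory score is the lower of the two, so it determines the node score. A real plugin would also need to return an integer score and decide how to handle load statistics that a Beta distribution cannot represent; the `ValueError` here is only a placeholder for that decision.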

An example is provided in the attached example.pdf.


example.pdf
1.95 MB
2021/02/03 4:52 PM

1.	Docs Tracker	Closed	Unassigned
2.	QE Tracker	Closed	Unassigned
3.	TE Tracker	Closed	Dave Mulford
4.	PX Tracker	Closed	Dave Mulford

Assignee: Asser Tantawi (Inactive)

Reporter: Gaurav Singh

Votes: 0

Watchers: 10

Created: 2021/02/02 9:49 AM

Updated: 2024/11/28 9:12 PM

Resolved: 2023/06/22 5:27 PM
