Type: Story
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Labels:
- groomed

Activity Type: None
Blocked: False
Blocked Reason: None
Ready: False
Epic Link: None
Story Points: 8
Target Version: None
Release Blocker: None
Sprint: CLOUD Sprint 204, CLOUD Sprint 205, CLOUD Sprint 206, CLOUD Sprint 207, CLOUD Sprint 208

When the autoscaler adds new GPU nodes to a cluster, there is a period during which the new instance has registered as a node in Kubernetes but its GPU resource is not yet active. Kubernetes will not schedule pods with GPU requirements onto that node until the node reports the GPU as available. During this window the cluster autoscaler does not know that a node with a GPU has already been created, and in some cases it creates additional nodes until the GPU configuration is complete.

There are a couple of ways this could be fixed in OpenShift. The most direct is to introduce a node or machine label that indicates the node will have a GPU. This is similar to upstream guidance on how to mark nodes that will have GPUs. In essence we would be creating a solution for Cluster API, so this should be discussed upstream in the CAPI community.
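As a rough illustration of the label idea, the sketch below treats a node as "GPU pending" when it carries a GPU-indicating label but does not yet report any allocatable GPU. The label key `example.com/accelerator`, the `Node` struct, and the `gpuPending` helper are all hypothetical and chosen only for this example; the issue does not name a label, and real code would work with the Kubernetes node API types. `nvidia.com/gpu` is the extended resource name advertised by the NVIDIA device plugin.

```go
package main

import "fmt"

// gpuLabelKey is a hypothetical label key used only for this sketch.
const gpuLabelKey = "example.com/accelerator"

// gpuResourceName is the extended resource advertised by the NVIDIA device plugin.
const gpuResourceName = "nvidia.com/gpu"

// Node is a simplified stand-in for a Kubernetes node (assumption for illustration).
type Node struct {
	Name        string
	Labels      map[string]string
	Allocatable map[string]int64
}

// gpuPending reports whether the node is labeled as a GPU node
// but has not yet reported any allocatable GPU.
func gpuPending(n Node) bool {
	_, labeled := n.Labels[gpuLabelKey]
	return labeled && n.Allocatable[gpuResourceName] == 0
}

func main() {
	nodes := []Node{
		{Name: "gpu-new", Labels: map[string]string{gpuLabelKey: "nvidia"}},
		{Name: "gpu-ready", Labels: map[string]string{gpuLabelKey: "nvidia"},
			Allocatable: map[string]int64{gpuResourceName: 1}},
		{Name: "cpu-only"},
	}
	for _, n := range nodes {
		fmt.Printf("%s pending=%t\n", n.Name, gpuPending(n))
	}
}
```

With a check like this, the autoscaler could count a pending node as upcoming GPU capacity instead of requesting another node.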

Another way to fix this would be to add a new method to the autoscaler's cloud provider interface that lets the cloud provider report when a node is waiting for its GPU to become active. This might be technically challenging, because the check must happen before the autoscaler decides to scale up, while it is still simulating the scaling activity. Although more difficult, it is probably the better solution for the wider community. This has been discussed with upstream, and they are amenable to a solution if something appropriate can be found.

https://bugzilla.redhat.com/show_bug.cgi?id=1943194

Project DoD:

- The cluster autoscaler waits for new GPU nodes to become active before creating more
- Documentation is updated to explain how this mechanism works
- An end-to-end test is added to guard against regressions in GPU node creation

Spike DoD:

- Work out which approach we want to take here
- Work out how the preferred approach will be implemented
- Give a 10-15 minute demo to the team to explain the research
- Create an Epic and break down the work for implementing the project

links to

KCS solution

Assignee: Michael McCune

Reporter: Michael McCune

Need Info From: None

Contributors: None

QA Contact: None

Doc Contact: None

Votes: 0

Watchers: 3

Created: 2021/05/25 8:47 PM

Updated: 2025/09/13 1:55 AM

Resolved: 2021/10/18 2:36 PM
