Story
Resolution: Done
Normal
8
CLOUD Sprint 204, CLOUD Sprint 205, CLOUD Sprint 206, CLOUD Sprint 207, CLOUD Sprint 208
When the autoscaler adds new GPU nodes to a cluster, there is a period during which the new instance has joined Kubernetes as a node but its GPU resource is not yet active. Kubernetes will not schedule pods with GPU requirements onto that node until the node reports the GPU as available. During this window the cluster autoscaler does not know that a GPU node has already been created, and in some cases it will keep creating additional nodes until the GPU configuration is complete.
There are a couple of ways this could be fixed in OpenShift. The most direct is to introduce a node or machine label that indicates the node will have a GPU. This is similar to the upstream guidance on how to mark nodes that will have GPUs. In essence we would create a solution for Cluster API, so this should be discussed upstream in the CAPI community.
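The label-based idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Node` type is a simplified stand-in rather than the real Kubernetes API type, the label key `example.com/accelerator` is a hypothetical placeholder, and `nvidia.com/gpu` is used as the GPU resource name only as an example. The point is that a node carrying the label but reporting zero allocatable GPUs can be recognized as "GPU still coming up" instead of "node without a GPU".

```go
package main

import "fmt"

// Node is a simplified stand-in for a Kubernetes node (not the real API type).
type Node struct {
	Name        string
	Labels      map[string]string
	Allocatable map[string]int64
}

// gpuLabel is a hypothetical label key chosen only for this sketch.
const gpuLabel = "example.com/accelerator"

// gpuResource is the extended resource name assumed for this sketch.
const gpuResource = "nvidia.com/gpu"

// gpuPending reports whether a node is labeled as a GPU node
// but does not yet advertise any allocatable GPU.
func gpuPending(n Node) bool {
	if _, labeled := n.Labels[gpuLabel]; !labeled {
		return false
	}
	// Reading a missing key (or a nil map) yields zero.
	return n.Allocatable[gpuResource] == 0
}

// countPendingGPUNodes counts nodes whose GPU is still being configured.
func countPendingGPUNodes(nodes []Node) int {
	count := 0
	for _, n := range nodes {
		if gpuPending(n) {
			count++
		}
	}
	return count
}

func main() {
	nodes := []Node{
		{Name: "gpu-ready", Labels: map[string]string{gpuLabel: "a100"}, Allocatable: map[string]int64{gpuResource: 1}},
		{Name: "gpu-starting", Labels: map[string]string{gpuLabel: "a100"}},
		{Name: "cpu-only"},
	}
	fmt.Println("nodes still waiting for a GPU:", countPendingGPUNodes(nodes))
}
```

An autoscaler that treats such nodes as not yet ready, rather than as ready nodes lacking GPUs, has a basis for holding off on further scale-ups while the GPU configuration completes.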
Another way to fix this would be to implement a new method in the autoscaler provider interface that allows the cloud provider to report when a node is waiting for GPU activity. This might be technically challenging, because the check has to happen in the autoscaler before it decides to scale up, while it is still simulating the scaling activity. Although this would be difficult, it is probably a better solution for the wider community. This has been discussed with upstream, and they are amenable to a solution if an appropriate one can be found.
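A rough sketch of the provider-method idea follows. The `GPUWaiter` interface, its `PendingGPUs` method, and the `nodesToAdd` helper are all hypothetical names invented for this illustration; none of them exist in the autoscaler. The sketch also assumes that each pending pod requests exactly one GPU and that the pending pods are those that did not fit on nodes with ready GPUs. It shows how capacity from nodes still waiting for their GPU could be credited during the scale-up simulation.

```go
package main

import "fmt"

// Node is a simplified stand-in for a cluster node (not the real API type).
type Node struct {
	Name      string
	ReadyGPUs int // GPUs currently reported as available
}

// GPUWaiter is a hypothetical provider extension: it reports how many GPUs
// a node is expected to expose once its GPU configuration completes.
type GPUWaiter interface {
	PendingGPUs(n Node) int
}

// staticProvider is a toy provider backed by a fixed lookup table.
type staticProvider struct {
	pending map[string]int
}

func (p staticProvider) PendingGPUs(n Node) int { return p.pending[n.Name] }

// nodesToAdd estimates how many new GPU nodes are needed for pods that each
// request one GPU, crediting capacity from nodes still waiting for a GPU.
func nodesToAdd(pendingPods int, nodes []Node, w GPUWaiter, gpusPerNewNode int) int {
	upcoming := 0
	for _, n := range nodes {
		if n.ReadyGPUs == 0 {
			upcoming += w.PendingGPUs(n)
		}
	}
	unmet := pendingPods - upcoming
	if unmet <= 0 || gpusPerNewNode <= 0 {
		return 0
	}
	// Round up to whole nodes.
	return (unmet + gpusPerNewNode - 1) / gpusPerNewNode
}

func main() {
	provider := staticProvider{pending: map[string]int{"gpu-starting": 2}}
	nodes := []Node{{Name: "gpu-starting"}}
	fmt.Println("new nodes for 2 pending pods:", nodesToAdd(2, nodes, provider, 2))
	fmt.Println("new nodes for 5 pending pods:", nodesToAdd(5, nodes, provider, 2))
}
```

Without the credit for upcoming capacity, the first call would request another node even though the node that is still starting will be able to host both pods, which is the over-provisioning described above.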
Project DoD:
- cluster autoscaler properly waits for GPU nodes to become active before creating more
- documentation updated to provide details on how this mechanism works
- end-to-end test added to ensure no regression on GPU creation
Spike DoD:
- Work out which approach we want to take here
- Work out how the preferred approach will be implemented
- 10-15 minute demo to the team to explain the research
- Create Epic and break down work for implementing the project