- Type: Feature
- Resolution: Unresolved
- Priority: Major
Epic Goal
- Make the hibernation feature able to stop GCP virtual machines that have Local SSD(s) attached (e.g., a2-ultragpu-4g, which ships with NVIDIA A100 GPUs), as sketched below.
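For background, GCP refuses to stop an instance with Local SSDs unless the Local SSD data is explicitly discarded (or, where supported, preserved). The following is a minimal Go sketch of the underlying call, assuming the Go Compute client (cloud.google.com/go/compute/apiv1) exposes the instances.stop discardLocalSsd parameter as a DiscardLocalSsd field; project, zone, and instance names are placeholders.

```go
package main

import (
	"context"
	"fmt"

	compute "cloud.google.com/go/compute/apiv1"
	computepb "cloud.google.com/go/compute/apiv1/computepb"
	"google.golang.org/protobuf/proto"
)

func main() {
	ctx := context.Background()

	// Client for the GCP Compute "instances" API.
	instances, err := compute.NewInstancesRESTClient(ctx)
	if err != nil {
		panic(err)
	}
	defer instances.Close()

	// Stop the VM and discard Local SSD contents. Without an explicit
	// decision about the Local SSD data, GCP will not stop instances
	// that have Local SSDs attached.
	op, err := instances.Stop(ctx, &computepb.StopInstanceRequest{
		Project:         "my-project",    // placeholder
		Zone:            "us-central1-a", // placeholder
		Instance:        "gpu-worker-0",  // placeholder
		DiscardLocalSsd: proto.Bool(true), // assumed field name for the discardLocalSsd parameter
	})
	if err != nil {
		panic(err)
	}
	if err := op.Wait(ctx); err != nil {
		panic(err)
	}
	fmt.Println("instance stopped, Local SSD data discarded")
}
```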
Why is this important?
- Clusters with GPU nodes become only partially hibernated: every node except the GPU ones is stopped, so the GPU instances keep being billed.
- Users have no indication this is happening until they check the Google Cloud console or read the Hive ClusterDeployment status.
- It affects both managed (OSD) and self-managed clusters.
Scenarios
- Create a cluster with Hive on GCP
- Add a GPU worker node to the cluster, using the a2-ultragpu-4g machine type for example
- Trigger cluster hibernation via Hive (see the sketch after this list)
- Check the VM status in the GCP console
- Check the Hive ClusterDeployment conditions
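For the "trigger cluster hibernation via Hive" step, a minimal Go sketch is below, assuming the Hive v1 API module (github.com/openshift/hive/apis/hive/v1) and a controller-runtime client; hibernation is requested by setting the ClusterDeployment's spec.powerState to Hibernating. Namespace and name are placeholders.

```go
package main

import (
	"context"
	"fmt"

	hivev1 "github.com/openshift/hive/apis/hive/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Register the Hive types so the client can decode ClusterDeployments.
	scheme := runtime.NewScheme()
	_ = hivev1.AddToScheme(scheme)

	cfg := ctrl.GetConfigOrDie()
	c, err := client.New(cfg, client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	cd := &hivev1.ClusterDeployment{}
	// Placeholder namespace/name of the target ClusterDeployment.
	key := types.NamespacedName{Namespace: "my-cluster-ns", Name: "my-cluster"}
	if err := c.Get(ctx, key, cd); err != nil {
		panic(err)
	}

	// Setting spec.powerState to "Hibernating" asks Hive to stop the
	// cluster's VMs (constant name in the Hive API may differ).
	cd.Spec.PowerState = hivev1.ClusterPowerState("Hibernating")
	if err := c.Update(ctx, cd); err != nil {
		panic(err)
	}
	fmt.Println("hibernation requested for", key)
}
```

After the update, the remaining scenario steps amount to watching the VM states in the GCP console and the hibernation-related conditions in the ClusterDeployment status.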
Acceptance Criteria
- GPU nodes get hibernated just like the other worker/master nodes
- Hive exposes an option corresponding to gcloud's --discard-local-ssd flag (https://cloud.google.com/compute/docs/disks/local-ssd#stop_instance); one possible API shape is sketched after this list
- Others TBD
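Purely as an illustration of the second acceptance criterion (not the actual Hive API, which this epic will define), a hypothetical spec field could look like the following; the field name DiscardLocalSsdOnHibernate and the surrounding struct are invented for this sketch.

```go
// Hypothetical sketch only; the real Hive API change is what this epic decides.
package v1gcp

// Platform loosely mirrors Hive's GCP platform settings on the ClusterDeployment.
type Platform struct {
	// Region the cluster runs in (existing kind of field).
	Region string `json:"region"`

	// DiscardLocalSsdOnHibernate (hypothetical name) would map to the
	// Compute API's discardLocalSsd stop parameter / gcloud's
	// --discard-local-ssd flag, so instances with Local SSDs can be
	// stopped during hibernation. Defaulting to false would keep
	// today's behavior, where such instances are left running.
	// +optional
	DiscardLocalSsdOnHibernate *bool `json:"discardLocalSsdOnHibernate,omitempty"`
}
```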
Dependencies (internal and external)
- Does OCM have a dependency on Hive for exposing this option?
Previous Work (Optional):
- …
Open questions:
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
Issue links
- Clones HIVE-2693 "[GCP] Handle hibernation for VMs with Local SSDs" (Closed)