
List GCP's NVIDIA H100 instances as tested instance type

    • Type: Epic
    • Resolution: Done
    • Priority: Major
    • 0% To Do, 0% In Progress, 100% Done

      Epic Goal

      Why is this important?

      • This is a new GPU-enabled machine type from Google Cloud that customers plan to use, and they need assurance that we have validated it for compute nodes in OCP.

      Scenarios

      1. The A3 machine series (as of today, only a3-highgpu-8g is available) is listed in OpenShift Container Platform as a "Tested instance type".

      Previous Work (Optional):

      1. The instance type has already been validated in NVIDIA-82, where the GPU Operators were validated as well.

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

            [CORS-3287] List GCP's NVIDIA H100 instances as tested instance type


            Walid Abouhamad added a comment - rhn-support-jiwei, when I tested https://issues.redhat.com/browse/NVIDIA-82, I only created a new machineset by copying one of the existing ones and changing the instance type to a3-highgpu-8g, then ran a GPU test workload successfully once an H100 instance was available. I did not install an entire OCP cluster with H100 (a3) instances at install time; I only scaled the cluster by adding an H100 (a3) machineset to a pre-built SNO cluster. The testing I did in NVIDIA-82 is sufficient specifically for verifying that we can add an H100 machineset to an existing OCP 4.14 cluster.
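For readers who want to reproduce the "copy an existing machineset and change the instance type" step, here is a minimal Python sketch. It is not the procedure actually used in NVIDIA-82. It assumes the machineset has already been exported and parsed into a dict, and that it follows the usual GCP MachineSet layout (`spec.template.spec.providerSpec.value.machineType`, plus the `machine.openshift.io/cluster-api-machineset` label). The function name `clone_machineset` and its parameters are hypothetical, and any further GPU-specific provider settings are out of scope.

```python
import copy

# Label that ties a MachineSet's selector to the machines it creates.
# Assumed from the standard Machine API layout; verify against your cluster.
LABEL = "machine.openshift.io/cluster-api-machineset"


def clone_machineset(source, new_name, machine_type="a3-highgpu-8g", replicas=1):
    """Return a copy of `source` renamed to `new_name` with a new machine type.

    `source` is a MachineSet parsed into a plain dict. The original is not
    modified.
    """
    ms = copy.deepcopy(source)

    # Keep only the metadata that makes sense on a new object; server-set
    # fields such as uid or resourceVersion are dropped.
    ms["metadata"] = {
        "name": new_name,
        "namespace": source["metadata"].get("namespace", "openshift-machine-api"),
        "labels": dict(source["metadata"].get("labels", {})),
    }
    ms.pop("status", None)

    spec = ms["spec"]
    spec["replicas"] = replicas
    # The selector and the template label must both point at the new name.
    spec["selector"]["matchLabels"][LABEL] = new_name
    spec["template"]["metadata"]["labels"][LABEL] = new_name
    # The only functional change: the GCP machine type.
    spec["template"]["spec"]["providerSpec"]["value"]["machineType"] = machine_type
    return ms
```

The returned dict can be serialized back to YAML or JSON and applied to the cluster. Whether an a3-highgpu-8g instance is actually provisioned still depends on H100 capacity being available in the chosen zone, as noted in the comment above.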

              rhn-support-jiwei Jianli Wei
              mak.redhat.com Marcos Entenza Garcia