OpenShift Bugs / OCPBUGS-34638

[GCP NVIDIA H100] "destroy cluster" will hang at "VM has a Local SSD attached but an undefined value for 'discard-local-ssd'" when trying to stop the A3 instance


    • Important
    • No
    • Sprint 255, Installer Sprint 256
    • 2
    • Rejected
    • False
    • N/A
    • Release Note Not Required
    • Done

      Description of problem:

          For a cluster with one worker machine of the A3 instance type, "destroy cluster" kept reporting the failure below until I stopped the instance via "gcloud".
      
      WARNING failed to stop instance jiwei-0530b-q9t8w-worker-c-ck6s8 in zone us-central1-c: googleapi: Error 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command., badRequest

      Version-Release number of selected component (if applicable):

          4.16.0-0.nightly-multi-2024-05-29-143245

      How reproducible:

          Always

      Steps to Reproduce:

          1. "create install-config" and then "create manifests"
          2. edit a worker machineset YAML to specify "machineType: a3-highgpu-8g" along with "onHostMaintenance: Terminate"
          3. "create cluster", and make sure it succeeds
          4. "destroy cluster"     
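
For step 2, the two edited fields sit in the GCP provider spec of the worker machineset. The fragment below is a minimal sketch, assuming the usual Machine API layout (`spec.template.spec.providerSpec.value`); only the two values named in step 2 come from this report, and every other field in the generated manifest is left as generated.

```yaml
# Fragment of a worker MachineSet manifest produced by "create manifests".
# Only the two fields from step 2 are shown; the nesting is assumed to match
# the standard GCP provider spec layout.
spec:
  template:
    spec:
      providerSpec:
        value:
          machineType: a3-highgpu-8g    # A3 (NVIDIA H100) instance type, comes with Local SSDs
          onHostMaintenance: Terminate  # GPU instances cannot live-migrate
```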

      Actual results:

          Uninstalling the cluster keeps reporting the error when it tries to stop the instance.

      Expected results:

          "destroy cluster" should proceed without any warnings or errors and eventually delete all cluster resources.

      Additional info:

      FYI the .openshift-install.log is available at https://drive.google.com/file/d/15xIwzi0swDk84wqg32tC_4KfUahCalrL/view?usp=drive_link
      
      FYI, stopping the A3 instance via "gcloud" with "--discard-local-ssd=false" does succeed.
      
      $ gcloud  compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null
      CREATION_TIMESTAMP   ZONE           STATUS      NAME                              MACHINE_TYPE   ITEMS
      2024-05-29 20:55:52  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-master-0        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 20:55:52  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-master-1        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 20:55:52  us-central1-c  TERMINATED  jiwei-0530b-q9t8w-master-2        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 21:10:08  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-worker-a-rkxkk  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
      2024-05-29 21:10:19  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-worker-b-qg6jv  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
      2024-05-29 21:10:31  us-central1-c  RUNNING     jiwei-0530b-q9t8w-worker-c-ck6s8  a3-highgpu-8g  ['jiwei-0530b-q9t8w-worker']
      $ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c
      ERROR: (gcloud.compute.instances.stop) HTTPError 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command.
      $ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c --discard-local-ssd=false
      Stopping instance(s) jiwei-0530b-q9t8w-worker-c-ck6s8...done.                                                                                    
      Updated [https://compute.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8].
      $ gcloud  compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null
      CREATION_TIMESTAMP   ZONE           STATUS      NAME                              MACHINE_TYPE   ITEMS
      2024-05-29 20:55:52  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-master-0        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 20:55:52  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-master-1        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 20:55:52  us-central1-c  TERMINATED  jiwei-0530b-q9t8w-master-2        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 21:10:08  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-worker-a-rkxkk  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
      2024-05-29 21:10:19  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-worker-b-qg6jv  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
      2024-05-29 21:10:31  us-central1-c  TERMINATED  jiwei-0530b-q9t8w-worker-c-ck6s8  a3-highgpu-8g  ['jiwei-0530b-q9t8w-worker']
      $ gcloud compute instances delete -q jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c
      Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8].
      $ 
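
The manual workaround in the transcript above can be wrapped in a small helper. This is only a sketch, not part of the installer: the function names `stop_instance` and `stop_all` and the `DRY_RUN` variable are hypothetical, and the only gcloud invocation used is the `stop` command with `--discard-local-ssd=false` that the transcript shows succeeding. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Workaround sketch: stop instances with an explicit --discard-local-ssd value,
# so that instances with Local SSDs attached (such as A3) can be stopped.
# DRY_RUN=1 (the default) prints each command instead of running it.

stop_instance() {
  local name="$1" zone="$2"
  local cmd=(gcloud compute instances stop "$name" --zone "$zone" --discard-local-ssd=false)
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${cmd[*]}"
  else
    # Detach stdin so gcloud cannot consume the instance list being read by stop_all.
    "${cmd[@]}" </dev/null
  fi
}

# Reads whitespace-separated "NAME ZONE" pairs from stdin, one per line.
stop_all() {
  local name zone
  while read -r name zone; do
    if [ -n "$name" ] && [ -n "$zone" ]; then
      stop_instance "$name" "$zone"
    fi
  done
}
```

A possible way to feed it, assuming the cluster's instances share the name prefix seen in the listing above: `gcloud compute instances list --filter="name~jiwei-0530b-q9t8w AND status=RUNNING" --format="value(name,zone.basename())" | stop_all`. Review the dry-run output first, then rerun with `DRY_RUN=0` to actually stop the instances before retrying "destroy cluster".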

            sdasu@redhat.com Sandhya Dasu
            rhn-support-jiwei Jianli Wei
            Jianli Wei Jianli Wei