OCPBUGS-36965

[GCP NVIDIA H100] "destroy cluster" will hang at "VM has a Local SSD attached but an undefined value for 'discard-local-ssd'" when trying to stop the A3 instance


    • Important
    • No
    • Installer Sprint 256, Installer Sprint 257
    • 2
    • False
      * Previously, on a cluster with machines that use an A3 instance type, the {oc-first} command `openshift-install destroy cluster` stalled and repeatedly reported the following error message:
      +
      [source,terminal]
      ----
      VM has a Local SSD attached but an undefined value for `discard-local-ssd`
      ----
      +
      With this release, after you issue the command, local SSDs are removed so that the command no longer stalls. (link:https://issues.redhat.com/browse/OCPBUGS-36965[*OCPBUGS-36965*])
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-34638. The following is the description of the original issue:

      Description of problem:

          For a cluster that has one worker machine of the A3 instance type, "destroy cluster" kept reporting the following warning until I stopped the instance manually via "gcloud":
      
      WARNING failed to stop instance jiwei-0530b-q9t8w-worker-c-ck6s8 in zone us-central1-c: googleapi: Error 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command., badRequest

      Version-Release number of selected component (if applicable):

          4.16.0-0.nightly-multi-2024-05-29-143245

      How reproducible:

          Always

      Steps to Reproduce:

          1. Run "create install-config" and then "create manifests".
          2. Edit a worker machineset YAML to specify "machineType: a3-highgpu-8g" along with "onHostMaintenance: Terminate".
          3. Run "create cluster" and make sure it succeeds.
          4. Run "destroy cluster".
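
      The steps above can be sketched as a small shell helper. This is only an illustration: the function names, the `INSTALLER` override, and the asset directory argument are assumptions, not part of the original report. Step 2 (editing the worker machineset YAML) stays a manual step between the two functions.

```shell
# Illustrative sketch of the reproduction flow; names here are hypothetical.
# INSTALLER can be overridden, for example to point at a specific binary.
INSTALLER="${INSTALLER:-openshift-install}"

# Step 1: generate the install config, then the manifests, in the asset directory.
prepare_manifests() {
    dir="$1"
    "$INSTALLER" create install-config --dir "$dir" || return 1
    "$INSTALLER" create manifests --dir "$dir" || return 1
}

# Steps 3 and 4: create the cluster and, only if that succeeds, destroy it.
# Run this after editing a worker machineset manifest by hand (step 2).
create_then_destroy() {
    dir="$1"
    "$INSTALLER" create cluster --dir "$dir" || return 1
    "$INSTALLER" destroy cluster --dir "$dir"
}
```

      For example, run `prepare_manifests ./assets`, edit the machineset, and then run `create_then_destroy ./assets`.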

      Actual results:

          "destroy cluster" keeps reporting the "failed to stop instance" warning and does not finish until the instance is stopped manually.

      Expected results:

          "destroy cluster" should proceed without any warnings or errors and eventually delete everything.

      Additional info:

      The .openshift-install.log is available at https://drive.google.com/file/d/15xIwzi0swDk84wqg32tC_4KfUahCalrL/view?usp=drive_link
      
      Stopping the A3 instance via "gcloud" with "--discard-local-ssd=false" does succeed:
      
      $ gcloud  compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null
      CREATION_TIMESTAMP   ZONE           STATUS      NAME                              MACHINE_TYPE   ITEMS
      2024-05-29 20:55:52  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-master-0        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 20:55:52  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-master-1        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 20:55:52  us-central1-c  TERMINATED  jiwei-0530b-q9t8w-master-2        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 21:10:08  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-worker-a-rkxkk  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
      2024-05-29 21:10:19  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-worker-b-qg6jv  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
      2024-05-29 21:10:31  us-central1-c  RUNNING     jiwei-0530b-q9t8w-worker-c-ck6s8  a3-highgpu-8g  ['jiwei-0530b-q9t8w-worker']
      $ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c
      ERROR: (gcloud.compute.instances.stop) HTTPError 400: VM has a Local SSD attached but an undefined value for `discard-local-ssd`. If using gcloud, please add `--discard-local-ssd=false` or `--discard-local-ssd=true` to your command.
      $ gcloud compute instances stop jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c --discard-local-ssd=false
      Stopping instance(s) jiwei-0530b-q9t8w-worker-c-ck6s8...done.                                                                                    
      Updated [https://compute.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8].
      $ gcloud  compute instances list --format="table(creationTimestamp.date('%Y-%m-%d %H:%M:%S'):sort=1,zone,status,name,machineType,tags.items)" --filter="name~jiwei" 2>/dev/null
      CREATION_TIMESTAMP   ZONE           STATUS      NAME                              MACHINE_TYPE   ITEMS
      2024-05-29 20:55:52  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-master-0        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 20:55:52  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-master-1        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 20:55:52  us-central1-c  TERMINATED  jiwei-0530b-q9t8w-master-2        n2-standard-4  ['jiwei-0530b-q9t8w-master']
      2024-05-29 21:10:08  us-central1-a  TERMINATED  jiwei-0530b-q9t8w-worker-a-rkxkk  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
      2024-05-29 21:10:19  us-central1-b  TERMINATED  jiwei-0530b-q9t8w-worker-b-qg6jv  n2-standard-4  ['jiwei-0530b-q9t8w-worker']
      2024-05-29 21:10:31  us-central1-c  TERMINATED  jiwei-0530b-q9t8w-worker-c-ck6s8  a3-highgpu-8g  ['jiwei-0530b-q9t8w-worker']
      $ gcloud compute instances delete -q jiwei-0530b-q9t8w-worker-c-ck6s8 --zone us-central1-c
      Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe/zones/us-central1-c/instances/jiwei-0530b-q9t8w-worker-c-ck6s8].
      $ 
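
      The manual workaround shown in the transcript above could be generalized roughly as follows. This is a minimal sketch, not the installer's fix: the function name, the `GCLOUD` override, and the name prefix argument are assumptions made for illustration. It only uses the `gcloud compute instances list` and `gcloud compute instances stop` commands and the `--discard-local-ssd` flag that appear in the transcript.

```shell
# Illustrative workaround sketch; names here are hypothetical.
# GCLOUD can be overridden, for example with a wrapper for dry runs.
GCLOUD="${GCLOUD:-gcloud}"

# Stop every RUNNING instance whose name matches the given prefix,
# always passing an explicit --discard-local-ssd value (default: false)
# so that instances with a Local SSD attached accept the stop request.
stop_running_instances() {
    prefix="$1"
    discard="${2:-false}"
    "$GCLOUD" compute instances list \
        --filter="name~${prefix} AND status=RUNNING" \
        --format="value(name,zone.basename())" |
    while read -r name zone; do
        # Skip empty lines in the listing.
        [ -n "$name" ] || continue
        # Redirect stdin so the stop command cannot consume the listing.
        "$GCLOUD" compute instances stop "$name" --zone "$zone" \
            --discard-local-ssd="$discard" </dev/null
    done
}
```

      For example, `stop_running_instances jiwei-0530b-q9t8w` would stop the remaining A3 worker with "--discard-local-ssd=false", after which "destroy cluster" can continue.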

              sdasu@redhat.com Sandhya Dasu
              openshift-crt-jira-prow OpenShift Prow Bot
              Jianli Wei Jianli Wei