RHELAI-2713

Training fails on Azure Standard_ND96asr_v4 instance with x8 A100-SXM4-40GB

    • Critical

      A failure happened during training with ‘rhel-ai-nvidia-1.3-1732617023-x86_64-boot.iso’ on nvd-srv-30 (Dell R760xa), which has x4 NVIDIA L40S GPUs (see the system info below):

              "Traceback (most recent call last):",
              "  File \"/usr/lib64/python3.11/logging/__init__.py\", line 1110, in emit",
              "    msg = self.format(record)",
              "          ^^^^^^^^^^^^^^^^^^^",
              "  File \"/usr/lib64/python3.11/logging/__init__.py\", line 953, in format",
              "    return fmt.format(record)",
              "           ^^^^^^^^^^^^^^^^^^",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/log.py\", line 19, in format",
              "    return super().format(record)",
              "           ^^^^^^^^^^^^^^^^^^^^^^",
              "  File \"/usr/lib64/python3.11/logging/__init__.py\", line 687, in format",
              "    record.message = record.getMessage()",
              "                     ^^^^^^^^^^^^^^^^^^^",
              "  File \"/usr/lib64/python3.11/logging/__init__.py\", line 377, in getMessage",
              "    msg = msg % self.args",
              "          ~~~~^~~~~~~~~~~",
              "TypeError: not all arguments converted during string formatting",
              "Call stack:",
              "  File \"/opt/app-root/bin/ilab\", line 8, in <module>",
              "    sys.exit(ilab())",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1157, in __call__",
              "    return self.main(*args, **kwargs)",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1078, in main",
              "    rv = self.invoke(ctx)",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1688, in invoke",
              "    return _process_result(sub_ctx.command.invoke(sub_ctx))",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1688, in invoke",
              "    return _process_result(sub_ctx.command.invoke(sub_ctx))",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1434, in invoke",
              "    return ctx.invoke(self.callback, **ctx.params)",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 783, in invoke",
              "    return __callback(*args, **kwargs)",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/click/decorators.py\", line 33, in new_func",
              "    return f(get_current_context(), *args, **kwargs)",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py\", line 323, in wrapper",
              "    return f(*args, **kwargs)",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/train.py\", line 448, in train",
              "    accelerated_train.accelerated_train(",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py\", line 170, in accelerated_train",
              "    _run_phased_training(",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py\", line 292, in _run_phased_training",
              "    _run_phase(",
              "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py\", line 239, in _run_phase", 

      (Full logs are attached)
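
      Note on the traceback above: its shape (a TypeError raised inside logging's getMessage, followed by "Call stack:") matches what Python's logging module prints when a log call is given more arguments than its message has % placeholders. The logging handler catches and reports such errors rather than raising them, so this traceback may be a side effect and not the cause of the training failure; the full logs would need to confirm that. The sketch below reproduces the same error class in isolation. The logger name and message text are hypothetical and are not taken from accelerated_train.py.

```python
import contextlib
import io
import logging


def run_demo() -> tuple[str, str]:
    """Return (text the handler logged, text logging wrote to stderr)."""
    logger = logging.getLogger("ilab_format_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False

    logged = io.StringIO()
    handler = logging.StreamHandler(logged)
    logger.addHandler(handler)

    errors = io.StringIO()
    try:
        with contextlib.redirect_stderr(errors):
            # Bug pattern: one extra argument, but no %s placeholder in the message.
            # logging evaluates "Training finished" % ("phase1",) and hits TypeError.
            logger.info("Training finished", "phase1")
            # Fixed form: one placeholder per argument.
            logger.info("Training finished: %s", "phase1")
    finally:
        logger.removeHandler(handler)

    return logged.getvalue(), errors.getvalue()


if __name__ == "__main__":
    logged_text, error_text = run_demo()
    print("handler output:", logged_text)
    print("stderr output:", error_text)
```

      In this sketch the first call is reported on stderr and its message is dropped, while the program keeps running and the second call logs normally.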

      System info:

      [cloud-user@nvd-srv-30 ~]$ ilab system info
      Platform:
        sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.42.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 250.89 GB
        memory.available: 239.31 GB
        memory.used: 9.80 GB
      InstructLab:
        instructlab.version: 0.21.0
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.4.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.1
        instructlab-sdg.version: 0.6.0
        instructlab-training.version: 0.6.1
      Torch:
        torch.version: 2.4.1
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: 12.4
        torch.version.hip: None
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: NVIDIA L40S
        torch.cuda.0.free: 44.1 GB
        torch.cuda.0.total: 44.5 GB
        torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: NVIDIA L40S
        torch.cuda.1.free: 44.1 GB
        torch.cuda.1.total: 44.5 GB
        torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: NVIDIA L40S
        torch.cuda.2.free: 44.1 GB
        torch.cuda.2.total: 44.5 GB
        torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: NVIDIA L40S
        torch.cuda.3.free: 44.1 GB
        torch.cuda.3.total: 44.5 GB
        torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.2.79
        llama_cpp_python.supports_gpu_offload: True 

      Bootc status:

      [cloud-user@nvd-srv-30 ~]$ sudo bootc status
      apiVersion: org.containers.bootc/v1alpha1
      kind: BootcHost
      metadata:
        name: host
      spec:
        image:
          image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
          transport: registry
        bootOrder: default
      status:
        staged: null
        booted:
          image:
            image:
              image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
              transport: registry
            version: 9.20241104.0
            timestamp: null
            imageDigest: sha256:78063efc909972129f8f6759b10ee7de4cc249d9e7281f5c7d5d2a2e64634c60
          cachedUpdate: null
          incompatible: false
          pinned: false
          store: ostreeContainer
          ostree:
            checksum: 37c3172fa51ddb83578a1611e01158051e84264175ad4ff7f7d8e6042d700b90
            deploySerial: 0
        rollback: null
        rollbackQueued: false
        type: bootcHost 

      Podman images:

      [cloud-user@nvd-srv-30 ~]$ sudo podman images --format json
      [
          {
              "Id": "55e9902a1270478310c70de37eed73b01fabc62898e81596ea6c345247a5d11f",
              "ParentId": "",
              "RepoTags": null,
              "RepoDigests": [
                  "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:1b609e74e936c9cac1f4c9b0ffee62981cff6c976da1e419862a4f902be007ee"
              ],
              "Size": 18204216557,
              "SharedSize": 0,
              "VirtualSize": 18204216557,
              "Labels": {
                  "WHEEL_RELEASE": "v1.3.1059+rhelai-cuda-ubi9",
                  "architecture": "x86_64",
                  "build-date": "2024-11-26T03:02:02",
                  "com.redhat.component": "ubi9-container",
                  "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
                  "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
                  "distribution-scope": "public",
                  "io.buildah.version": "1.38.0-dev",
                  "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
                  "io.k8s.display-name": "Red Hat Universal Base Image 9",
                  "io.openshift.expose-services": "",
                  "io.openshift.tags": "base rhel9",
                  "maintainer": "Red Hat, Inc.",
                  "name": "ubi9",
                  "org.opencontainers.image.vendor": "Red Hat, Inc.",
                  "release": "1214.1729773476",
                  "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
                  "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
                  "vcs-ref": "268880ba0780f8158c5b6f7ecd25e96976e4736d",
                  "vcs-type": "git",
                  "vendor": "Red Hat, Inc.",
                  "version": "9.4"
              },
              "Containers": 0,
              "ReadOnly": true,
              "Names": [
                  "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732590122"
              ],
              "Digest": "sha256:1b609e74e936c9cac1f4c9b0ffee62981cff6c976da1e419862a4f902be007ee",
              "History": [
                  "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732590122"
              ],
              "Created": 1732591237,
              "CreatedAt": "2024-11-26T03:20:37Z"
          }
      ] 

              nweinber1 Nathan Weinberg
              aopincar Ariel Opincaru