- Bug
- Resolution: Unresolved
- Critical
- Known Issue
- Rejected
A failure happened during training with ‘rhel-ai-nvidia-1.3-1732617023-x86_64-boot.iso’ on nvd-srv-30 (Dell R760xa), which has 4x NVIDIA L40S GPUs:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "/opt/app-root/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 323, in wrapper
    return f(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/train.py", line 448, in train
    accelerated_train.accelerated_train(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 170, in accelerated_train
    _run_phased_training(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 292, in _run_phased_training
    _run_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 239, in _run_phase
(Full logs are attached)
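For triage, the TypeError in the traceback is the kind raised when a logging call passes more arguments than the message has % placeholders. The sketch below reproduces that class of error with the standard library only. The message strings and the "phase1" argument are hypothetical; the actual logging call at accelerated_train.py line 239 is not shown in this report. Note also that, with Python's default logging settings, a handler reports this kind of error to stderr and execution continues, so this traceback may be a symptom rather than the root cause of the training failure.

```python
import logging


def try_format(msg, args):
    """Format a log message the way logging.LogRecord.getMessage() does.

    Returns the formatted message, or a string describing the TypeError
    if the message and arguments do not match.
    """
    # Build the record by hand so nothing is emitted to any handler.
    record = logging.LogRecord("demo", logging.INFO, "demo.py", 1, msg, args, None)
    try:
        return record.getMessage()  # internally evaluates: msg % args
    except TypeError as exc:
        return f"TypeError: {exc}"


# Hypothetical message without a placeholder, but with one argument.
bad = try_format("Training phase finished", ("phase1",))

# Same hypothetical message with a matching %s placeholder.
good = try_format("Training phase finished: %s", ("phase1",))

print(bad)
print(good)
```

If the attached full logs show the message and arguments passed at that call site, comparing the number of % placeholders against the number of arguments should confirm or rule out this explanation.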
System info:
[cloud-user@nvd-srv-30 ~]$ ilab system info
Platform:
  sys.version: 3.11.7 (main, Oct 9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.42.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 250.89 GB
  memory.available: 239.31 GB
  memory.used: 9.80 GB
InstructLab:
  instructlab.version: 0.21.0
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.4.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.1
  instructlab-sdg.version: 0.6.0
  instructlab-training.version: 0.6.1
Torch:
  torch.version: 2.4.1
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA L40S
  torch.cuda.0.free: 44.1 GB
  torch.cuda.0.total: 44.5 GB
  torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA L40S
  torch.cuda.1.free: 44.1 GB
  torch.cuda.1.total: 44.5 GB
  torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA L40S
  torch.cuda.2.free: 44.1 GB
  torch.cuda.2.total: 44.5 GB
  torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA L40S
  torch.cuda.3.free: 44.1 GB
  torch.cuda.3.total: 44.5 GB
  torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
  llama_cpp_python.version: 0.2.79
  llama_cpp_python.supports_gpu_offload: True
Bootc status:
[cloud-user@nvd-srv-30 ~]$ sudo bootc status
apiVersion: org.containers.bootc/v1alpha1
kind: BootcHost
metadata:
  name: host
spec:
  image:
    image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
    transport: registry
  bootOrder: default
status:
  staged: null
  booted:
    image:
      image:
        image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
        transport: registry
      version: 9.20241104.0
      timestamp: null
      imageDigest: sha256:78063efc909972129f8f6759b10ee7de4cc249d9e7281f5c7d5d2a2e64634c60
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: 37c3172fa51ddb83578a1611e01158051e84264175ad4ff7f7d8e6042d700b90
      deploySerial: 0
  rollback: null
  rollbackQueued: false
  type: bootcHost
Podman images:
[cloud-user@nvd-srv-30 ~]$ sudo podman images --format json
[
  {
    "Id": "55e9902a1270478310c70de37eed73b01fabc62898e81596ea6c345247a5d11f",
    "ParentId": "",
    "RepoTags": null,
    "RepoDigests": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:1b609e74e936c9cac1f4c9b0ffee62981cff6c976da1e419862a4f902be007ee"
    ],
    "Size": 18204216557,
    "SharedSize": 0,
    "VirtualSize": 18204216557,
    "Labels": {
      "WHEEL_RELEASE": "v1.3.1059+rhelai-cuda-ubi9",
      "architecture": "x86_64",
      "build-date": "2024-11-26T03:02:02",
      "com.redhat.component": "ubi9-container",
      "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
      "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
      "distribution-scope": "public",
      "io.buildah.version": "1.38.0-dev",
      "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
      "io.k8s.display-name": "Red Hat Universal Base Image 9",
      "io.openshift.expose-services": "",
      "io.openshift.tags": "base rhel9",
      "maintainer": "Red Hat, Inc.",
      "name": "ubi9",
      "org.opencontainers.image.vendor": "Red Hat, Inc.",
      "release": "1214.1729773476",
      "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
      "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
      "vcs-ref": "268880ba0780f8158c5b6f7ecd25e96976e4736d",
      "vcs-type": "git",
      "vendor": "Red Hat, Inc.",
      "version": "9.4"
    },
    "Containers": 0,
    "ReadOnly": true,
    "Names": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732590122"
    ],
    "Digest": "sha256:1b609e74e936c9cac1f4c9b0ffee62981cff6c976da1e419862a4f902be007ee",
    "History": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732590122"
    ],
    "Created": 1732591237,
    "CreatedAt": "2024-11-26T03:20:37Z"
  }
]
- clones
  - RHELAI-2398 Training fails on R760xa with x4 L40S (Closed)
- relates to
  - RHELAI-2713 Training fails on Azure Standard_ND96asr_v4 instance with x8 A100-SXM4-40GB (Refinement)