Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: RHELAI Backlog
Affects Version/s: rhelai-1.3.1
Component/s: Engine/Runtime, InstructLab - Training
Labels:
None

Blocked:
False
Blocked Reason:
None
Ready:
False
Release Note Type:
Known Issue
Git Pull Request:
https://gitlab.com/redhat/rhel-ai/containers/instructlab-nvidia/-/merge_requests/441
Intelligence Requested:
Market:

Severity:
Critical

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

A failure occurred during training with ‘rhel-ai-nvidia-1.3-1732617023-x86_64-boot.iso’ on nvd-srv-30 (Dell R760xa), which has x4 NVIDIA L40S GPUs (see the system info below):

        "Traceback (most recent call last):",
        "  File \"/usr/lib64/python3.11/logging/__init__.py\", line 1110, in emit",
        "    msg = self.format(record)",
        "          ^^^^^^^^^^^^^^^^^^^",
        "  File \"/usr/lib64/python3.11/logging/__init__.py\", line 953, in format",
        "    return fmt.format(record)",
        "           ^^^^^^^^^^^^^^^^^^",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/log.py\", line 19, in format",
        "    return super().format(record)",
        "           ^^^^^^^^^^^^^^^^^^^^^^",
        "  File \"/usr/lib64/python3.11/logging/__init__.py\", line 687, in format",
        "    record.message = record.getMessage()",
        "                     ^^^^^^^^^^^^^^^^^^^",
        "  File \"/usr/lib64/python3.11/logging/__init__.py\", line 377, in getMessage",
        "    msg = msg % self.args",
        "          ~~~~^~~~~~~~~~~",
        "TypeError: not all arguments converted during string formatting",
        "Call stack:",
        "  File \"/opt/app-root/bin/ilab\", line 8, in <module>",
        "    sys.exit(ilab())",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1157, in __call__",
        "    return self.main(*args, **kwargs)",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1078, in main",
        "    rv = self.invoke(ctx)",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1688, in invoke",
        "    return _process_result(sub_ctx.command.invoke(sub_ctx))",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1688, in invoke",
        "    return _process_result(sub_ctx.command.invoke(sub_ctx))",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 1434, in invoke",
        "    return ctx.invoke(self.callback, **ctx.params)",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/click/core.py\", line 783, in invoke",
        "    return __callback(*args, **kwargs)",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/click/decorators.py\", line 33, in new_func",
        "    return f(get_current_context(), *args, **kwargs)",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py\", line 323, in wrapper",
        "    return f(*args, **kwargs)",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/train.py\", line 448, in train",
        "    accelerated_train.accelerated_train(",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py\", line 170, in accelerated_train",
        "    _run_phased_training(",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py\", line 292, in _run_phased_training",
        "    _run_phase(",
        "  File \"/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py\", line 239, in _run_phase",

(Full logs are attached)
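
For reference, the `TypeError` at the top of the traceback is what Python's `logging` module raises when a log call passes positional arguments but the message string has no matching `%` placeholders. The snippet below is a minimal sketch of that pattern using only the standard library. The message text, the `"phase1"` argument, and the helper name `format_error` are made up for illustration; the excerpt above is cut off before the offending call, so this does not claim to be the actual line in `accelerated_train.py`.

```python
import logging
from typing import Optional


def format_error(msg: str, args: tuple) -> Optional[str]:
    """Build a LogRecord by hand and return the formatting error text, if any.

    Hypothetical helper for illustration; it is not part of instructlab.
    """
    record = logging.LogRecord(
        name="repro",
        level=logging.INFO,
        pathname="repro.py",
        lineno=1,
        msg=msg,
        args=args,
        exc_info=None,
    )
    try:
        # getMessage() evaluates `msg % args`, the same step that fails
        # in the traceback above.
        record.getMessage()
    except TypeError as exc:
        return str(exc)
    return None


# An argument is passed but the message has no placeholder: formatting fails.
print(format_error("Training finished", ("phase1",)))

# With a matching %s placeholder the record formats cleanly.
print(format_error("Training finished: %s", ("phase1",)))
```

When this happens inside a handler's `emit()`, `logging` reports the traceback and call stack on stderr rather than raising to the caller, so the output above may be a secondary symptom and the attached full logs should be checked for the underlying training failure.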

System info:

[cloud-user@nvd-srv-30 ~]$ ilab system info
Platform:
  sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.42.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: nvd-srv-30.nvidia.eng.rdu2.dc.redhat.com
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 250.89 GB
  memory.available: 239.31 GB
  memory.used: 9.80 GB

InstructLab:
  instructlab.version: 0.21.0
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.4.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.1
  instructlab-sdg.version: 0.6.0
  instructlab-training.version: 0.6.1

Torch:
  torch.version: 2.4.1
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA L40S
  torch.cuda.0.free: 44.1 GB
  torch.cuda.0.total: 44.5 GB
  torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA L40S
  torch.cuda.1.free: 44.1 GB
  torch.cuda.1.total: 44.5 GB
  torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA L40S
  torch.cuda.2.free: 44.1 GB
  torch.cuda.2.total: 44.5 GB
  torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA L40S
  torch.cuda.3.free: 44.1 GB
  torch.cuda.3.total: 44.5 GB
  torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.2.79
  llama_cpp_python.supports_gpu_offload: True

Bootc status:

[cloud-user@nvd-srv-30 ~]$ sudo bootc status
apiVersion: org.containers.bootc/v1alpha1
kind: BootcHost
metadata:
  name: host
spec:
  image:
    image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
    transport: registry
  bootOrder: default
status:
  staged: null
  booted:
    image:
      image:
        image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.1
        transport: registry
      version: 9.20241104.0
      timestamp: null
      imageDigest: sha256:78063efc909972129f8f6759b10ee7de4cc249d9e7281f5c7d5d2a2e64634c60
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: 37c3172fa51ddb83578a1611e01158051e84264175ad4ff7f7d8e6042d700b90
      deploySerial: 0
  rollback: null
  rollbackQueued: false
  type: bootcHost

Podman images:

[cloud-user@nvd-srv-30 ~]$ sudo podman images --format json
[
    {
        "Id": "55e9902a1270478310c70de37eed73b01fabc62898e81596ea6c345247a5d11f",
        "ParentId": "",
        "RepoTags": null,
        "RepoDigests": [
            "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:1b609e74e936c9cac1f4c9b0ffee62981cff6c976da1e419862a4f902be007ee"
        ],
        "Size": 18204216557,
        "SharedSize": 0,
        "VirtualSize": 18204216557,
        "Labels": {
            "WHEEL_RELEASE": "v1.3.1059+rhelai-cuda-ubi9",
            "architecture": "x86_64",
            "build-date": "2024-11-26T03:02:02",
            "com.redhat.component": "ubi9-container",
            "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
            "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
            "distribution-scope": "public",
            "io.buildah.version": "1.38.0-dev",
            "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
            "io.k8s.display-name": "Red Hat Universal Base Image 9",
            "io.openshift.expose-services": "",
            "io.openshift.tags": "base rhel9",
            "maintainer": "Red Hat, Inc.",
            "name": "ubi9",
            "org.opencontainers.image.vendor": "Red Hat, Inc.",
            "release": "1214.1729773476",
            "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
            "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
            "vcs-ref": "268880ba0780f8158c5b6f7ecd25e96976e4736d",
            "vcs-type": "git",
            "vendor": "Red Hat, Inc.",
            "version": "9.4"
        },
        "Containers": 0,
        "ReadOnly": true,
        "Names": [
            "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732590122"
        ],
        "Digest": "sha256:1b609e74e936c9cac1f4c9b0ffee62981cff6c976da1e419862a4f902be007ee",
        "History": [
            "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3-1732590122"
        ],
        "Created": 1732591237,
        "CreatedAt": "2024-11-26T03:20:37Z"
    }
]