- Bug
- Resolution: Duplicate
- rhelai-1.3
The following Python tracebacks are printed when the VllmWorkerProcess workers exit at the end of training:
100% 80/80 [00:48<00:00, 1.64it/s]
INFO 12-12 23:15:24 launcher.py:57] Shutting down FastAPI HTTP server.
INFO 12-12 23:15:24 multiproc_worker_utils.py:137] Terminating local vLLM worker processes
INFO 12-12 23:15:24 multiproc_worker_utils.py:244] Worker exiting
INFO 12-12 23:15:24 multiproc_worker_utils.py:244] Worker exiting
INFO 12-12 23:15:24 multiproc_worker_utils.py:244] Worker exiting
Process VllmWorkerProcess:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
    raise KeyboardInterrupt("MQLLMEngine terminated")
KeyboardInterrupt: MQLLMEngine terminated
Process VllmWorkerProcess:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
    raise KeyboardInterrupt("MQLLMEngine terminated")
KeyboardInterrupt: MQLLMEngine terminated
: Shutting down
: Waiting for application shutdown.
: Application shutdown complete.
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
The training process itself completed successfully.
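The leaked shared_memory warning at the end of the log is a separate, cosmetic symptom: a `multiprocessing.shared_memory` segment that every process only `close()`s, without any process calling `unlink()`, is reported by the resource tracker at interpreter shutdown. A minimal standalone sketch of that mechanism (how vLLM itself uses shared memory is not shown in this report and is an assumption here):

```python
# Sketch of the resource_tracker mechanism behind the UserWarning in the
# log. A SharedMemory segment must be unlink()ed by exactly one process;
# if every process only close()s it, the tracker reports it as leaked
# at shutdown.
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=1024)
shm.buf[:4] = b"ping"
data = bytes(shm.buf[:4])
shm.close()    # drops this process's mapping only
# Commenting out the next line reproduces the warning:
#   UserWarning: resource_tracker: There appear to be 1 leaked
#   shared_memory objects to clean up at shutdown
shm.unlink()   # removes the segment from the OS
```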
Env info:
[cloud-user@nvd-srv-28 ~]$ sudo bootc status
apiVersion: org.containers.bootc/v1alpha1
kind: BootcHost
metadata:
  name: host
spec:
  image:
    image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952
    transport: registry
  bootOrder: default
status:
  staged: null
  booted:
    image:
      image:
        image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952
        transport: registry
      version: 9.20241104.0
      timestamp: null
      imageDigest: sha256:407e27fd481d4e329b2feefba4f56470fd043931f205a313f6427c3f80c3a890
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: aa4198c1e3fec5f70dbdad2cc7c35557e54a931397687fdd0fdeab9224797047
      deploySerial: 0
  rollback:
    image:
      image:
        image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
        transport: registry
      version: 9.20241104.0
      timestamp: null
      imageDigest: sha256:1300ac8b47dd05579b84254dcc369acd8e4f02d06a1372eb801aa71150250d91
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: 05cc1a174bc8745ecf84aa8d5ca23969890a3bfb0c4bb33549e263c1c273af62
      deploySerial: 0
  rollbackQueued: false
  type: bootcHost
[cloud-user@nvd-srv-28 ~]$ sudo podman images --format json
[
  {
    "Id": "5a5e2ed36334199b762eeffd293099d96b51df0f6ef65bb503be0ddbb0d45326",
    "ParentId": "",
    "RepoTags": null,
    "RepoDigests": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:525ab53de3829cac1a9aabb73194f49e22da8fdcf12a01c56ece961300cdab0d"
    ],
    "Size": 18138403565,
    "SharedSize": 0,
    "VirtualSize": 18138403565,
    "Labels": {
      "WHEEL_RELEASE": "v1.3.1183+rhelai-cuda-ubi9",
      "architecture": "x86_64",
      "build-date": "2024-12-11T21:09:57",
      "com.redhat.component": "ubi9-container",
      "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
      "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
      "distribution-scope": "public",
      "io.buildah.version": "1.38.0-dev",
      "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
      "io.k8s.display-name": "Red Hat Universal Base Image 9",
      "io.openshift.expose-services": "",
      "io.openshift.tags": "base rhel9",
      "maintainer": "Red Hat, Inc.",
      "name": "ubi9",
      "org.opencontainers.image.vendor": "Red Hat, Inc.",
      "release": "1214.1729773476",
      "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
      "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
      "vcs-ref": "e212c7252addad2e5342212f9ed3ac41bc462c87",
      "vcs-type": "git",
      "vendor": "Red Hat, Inc.",
      "version": "9.4"
    },
    "Containers": 0,
    "ReadOnly": true,
    "Names": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3.1-1733951397"
    ],
    "Digest": "sha256:525ab53de3829cac1a9aabb73194f49e22da8fdcf12a01c56ece961300cdab0d",
    "History": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3.1-1733951397"
    ],
    "Created": 1733953958,
    "CreatedAt": "2024-12-11T21:52:38Z"
  }
]
[cloud-user@nvd-srv-28 ~]$ ilab system info
Platform:
  sys.version: 3.11.7 (main, Oct 9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.42.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: nvd-srv-28.nvidia.eng.rdu2.dc.redhat.com
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 250.89 GB
  memory.available: 239.24 GB
  memory.used: 9.88 GB

InstructLab:
  instructlab.version: 0.21.2
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.4.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.1
  instructlab-sdg.version: 0.6.1
  instructlab-training.version: 0.6.1

Torch:
  torch.version: 2.4.1
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA L40S
  torch.cuda.0.free: 44.1 GB
  torch.cuda.0.total: 44.5 GB
  torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA L40S
  torch.cuda.1.free: 44.1 GB
  torch.cuda.1.total: 44.5 GB
  torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA L40S
  torch.cuda.2.free: 44.1 GB
  torch.cuda.2.total: 44.5 GB
  torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA L40S
  torch.cuda.3.free: 44.1 GB
  torch.cuda.3.total: 44.5 GB
  torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.2.79
  llama_cpp_python.supports_gpu_offload: True
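As an aside, the Platform fields above can be gathered with the Python standard library alone; a rough sketch follows (field names copied from the output above; the implementation inside `ilab` itself may differ):

```python
# Rough sketch of gathering the Platform fields shown by `ilab system info`
# using only the standard library (ilab's actual implementation may differ).
import os
import platform
import sys

info = {
    "sys.version": sys.version,
    "sys.platform": sys.platform,
    "os.name": os.name,
    "platform.release": platform.release(),
    "platform.machine": platform.machine(),
    "platform.node": platform.node(),
    "platform.python_version": platform.python_version(),
}
for key, value in info.items():
    print(f"{key}: {value}")
```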