- Bug
- Resolution: Duplicate
- rhelai-1.3
The following Python tracebacks are printed when the VllmWorkerProcess workers exit at the end of training:
100% 80/80 [00:48<00:00, 1.64it/s]
INFO 12-12 23:15:24 launcher.py:57] Shutting down FastAPI HTTP server.
INFO 12-12 23:15:24 multiproc_worker_utils.py:137] Terminating local vLLM worker processes
INFO 12-12 23:15:24 multiproc_worker_utils.py:244] Worker exiting
INFO 12-12 23:15:24 multiproc_worker_utils.py:244] Worker exiting
INFO 12-12 23:15:24 multiproc_worker_utils.py:244] Worker exiting
Process VllmWorkerProcess:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
    raise KeyboardInterrupt("MQLLMEngine terminated")
KeyboardInterrupt: MQLLMEngine terminated
Process VllmWorkerProcess:
Traceback (most recent call last):
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
    raise KeyboardInterrupt("MQLLMEngine terminated")
KeyboardInterrupt: MQLLMEngine terminated
: Shutting down
: Waiting for application shutdown.
: Application shutdown complete.
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
The training process itself completed successfully.
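The leaked shared_memory warning at the end of the log is a separate, cosmetic symptom: a `multiprocessing.shared_memory` segment that every process only `close()`s, without any process calling `unlink()`, is reported by the resource tracker at interpreter shutdown. A minimal standalone sketch of that mechanism (how vLLM itself uses shared memory is not shown in this report and is an assumption here):

```python
# Sketch of the resource_tracker mechanism behind the UserWarning in the
# log. A SharedMemory segment must be unlink()ed by exactly one process;
# if every process only close()s it, the tracker reports it as leaked
# at shutdown.
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=1024)
shm.buf[:4] = b"ping"
data = bytes(shm.buf[:4])
shm.close()    # drops this process's mapping only
# Commenting out the next line reproduces the warning:
#   UserWarning: resource_tracker: There appear to be 1 leaked
#   shared_memory objects to clean up at shutdown
shm.unlink()   # removes the segment from the OS
```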
Env info:
[cloud-user@nvd-srv-28 ~]$ sudo bootc status
apiVersion: org.containers.bootc/v1alpha1
kind: BootcHost
metadata:
  name: host
spec:
  image:
    image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952
    transport: registry
  bootOrder: default
status:
  staged: null
  booted:
    image:
      image:
        image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952
        transport: registry
      version: 9.20241104.0
      timestamp: null
      imageDigest: sha256:407e27fd481d4e329b2feefba4f56470fd043931f205a313f6427c3f80c3a890
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: aa4198c1e3fec5f70dbdad2cc7c35557e54a931397687fdd0fdeab9224797047
      deploySerial: 0
  rollback:
    image:
      image:
        image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
        transport: registry
      version: 9.20241104.0
      timestamp: null
      imageDigest: sha256:1300ac8b47dd05579b84254dcc369acd8e4f02d06a1372eb801aa71150250d91
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: 05cc1a174bc8745ecf84aa8d5ca23969890a3bfb0c4bb33549e263c1c273af62
      deploySerial: 0
  rollbackQueued: false
  type: bootcHost
[cloud-user@nvd-srv-28 ~]$ sudo podman images --format json
[
  {
    "Id": "5a5e2ed36334199b762eeffd293099d96b51df0f6ef65bb503be0ddbb0d45326",
    "ParentId": "",
    "RepoTags": null,
    "RepoDigests": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:525ab53de3829cac1a9aabb73194f49e22da8fdcf12a01c56ece961300cdab0d"
    ],
    "Size": 18138403565,
    "SharedSize": 0,
    "VirtualSize": 18138403565,
    "Labels": {
      "WHEEL_RELEASE": "v1.3.1183+rhelai-cuda-ubi9",
      "architecture": "x86_64",
      "build-date": "2024-12-11T21:09:57",
      "com.redhat.component": "ubi9-container",
      "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
      "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
      "distribution-scope": "public",
      "io.buildah.version": "1.38.0-dev",
      "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
      "io.k8s.display-name": "Red Hat Universal Base Image 9",
      "io.openshift.expose-services": "",
      "io.openshift.tags": "base rhel9",
      "maintainer": "Red Hat, Inc.",
      "name": "ubi9",
      "org.opencontainers.image.vendor": "Red Hat, Inc.",
      "release": "1214.1729773476",
      "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
      "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
      "vcs-ref": "e212c7252addad2e5342212f9ed3ac41bc462c87",
      "vcs-type": "git",
      "vendor": "Red Hat, Inc.",
      "version": "9.4"
    },
    "Containers": 0,
    "ReadOnly": true,
    "Names": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3.1-1733951397"
    ],
    "Digest": "sha256:525ab53de3829cac1a9aabb73194f49e22da8fdcf12a01c56ece961300cdab0d",
    "History": [
      "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3.1-1733951397"
    ],
    "Created": 1733953958,
    "CreatedAt": "2024-12-11T21:52:38Z"
  }
]
[cloud-user@nvd-srv-28 ~]$ ilab system info
Platform:
  sys.version: 3.11.7 (main, Oct 9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.42.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: nvd-srv-28.nvidia.eng.rdu2.dc.redhat.com
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 250.89 GB
  memory.available: 239.24 GB
  memory.used: 9.88 GB

InstructLab:
  instructlab.version: 0.21.2
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.4.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.1
  instructlab-sdg.version: 0.6.1
  instructlab-training.version: 0.6.1

Torch:
  torch.version: 2.4.1
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA L40S
  torch.cuda.0.free: 44.1 GB
  torch.cuda.0.total: 44.5 GB
  torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA L40S
  torch.cuda.1.free: 44.1 GB
  torch.cuda.1.total: 44.5 GB
  torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA L40S
  torch.cuda.2.free: 44.1 GB
  torch.cuda.2.total: 44.5 GB
  torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA L40S
  torch.cuda.3.free: 44.1 GB
  torch.cuda.3.total: 44.5 GB
  torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.2.79
  llama_cpp_python.supports_gpu_offload: True
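As an aside, the Platform fields above can be gathered with the Python standard library alone; a rough sketch follows (field names copied from the output above; the implementation inside `ilab` itself may differ):

```python
# Rough sketch of gathering the Platform fields shown by `ilab system info`
# using only the standard library (ilab's actual implementation may differ).
import os
import platform
import sys

info = {
    "sys.version": sys.version,
    "sys.platform": sys.platform,
    "os.name": os.name,
    "platform.release": platform.release(),
    "platform.machine": platform.machine(),
    "platform.node": platform.node(),
    "platform.python_version": platform.python_version(),
}
for key, value in info.items():
    print(f"{key}: {value}")
```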