Type: Bug
Resolution: Unresolved
Affects Version: rhelai-1.3.1
Priority: Important
To Reproduce
Steps to reproduce the behavior:
- Upgrade to RHEL AI 1.3.1 using this command:
- `sudo bootc switch registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952`
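  Note: `bootc switch` only stages the new image; it takes effect on the next boot. A minimal sketch of the follow-up (the reboot is an assumption here, not stated in the original steps):

  ```bash
  # Confirm the staged image, then reboot into RHEL AI 1.3.1.
  sudo bootc status
  sudo systemctl reboot
  ```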
- Run `ilab config init`
- Download the models (the commands run in the background; see the sketch after this list for waiting on them):
- `ilab model download --repository docker://registry.redhat.io/rhelai1/granite-8b-lab-v1 --release 1.3 &`
- `ilab model download --repository docker://registry.redhat.io/rhelai1/granite-8b-starter-v1 --release 1.3 &`
- `ilab model download --repository docker://registry.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.3 &`
- `ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-starter --release 1.3 &`
- `ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.3 &`
- `ilab model download --repository docker://registry.redhat.io/rhelai1/skills-adapter-v3 --release 1.3 &`
- `ilab model download --repository docker://registry.redhat.io/rhelai1/knowledge-adapter-v3 --release 1.3 &`
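The downloads above are backgrounded with `&`. A minimal sketch for making sure they have all finished before generating data, assuming everything runs in the same shell session:

```bash
# Block until every backgrounded `ilab model download` job has exited.
wait
```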
- Run `ilab data generate`
- After SDG has been running for a while, the following error appears:
INFO 12-12 21:36:52 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Creating json from Arrow format: 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 340/381 [00:13<00:03, 11.39ba/s]
INFO 12-12 21:37:02 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
Creating json from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 381/381 [00:16<00:00, 23.79ba/s]
INFO 2024-12-12 21:37:05,031 instructlab.sdg.datamixing:215: Mixed Dataset saved to /var/home/cloud-user/.local/share/instructlab/datasets/skills_train_msgs_2024-12-12T21_34_24.jsonl
INFO 2024-12-12 21:37:05,087 instructlab.sdg.generate_data:485: Generation took 160.78s
INFO 12-12 21:37:05 launcher.py:57] Shutting down FastAPI HTTP server.
INFO 12-12 21:37:05 multiproc_worker_utils.py:137] Terminating local vLLM worker processes
(VllmWorkerProcess pid=164) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=166) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=169) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=167) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=165) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=170) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=168) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
(VllmWorkerProcess pid=170) Process VllmWorkerProcess:
(VllmWorkerProcess pid=170) Traceback (most recent call last):
(VllmWorkerProcess pid=170)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=170)     self.run()
(VllmWorkerProcess pid=170)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=170)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=170)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
(VllmWorkerProcess pid=170)     raise KeyboardInterrupt("MQLLMEngine terminated")
(VllmWorkerProcess pid=170) KeyboardInterrupt: MQLLMEngine terminated
(VllmWorkerProcess pid=165) Process VllmWorkerProcess:
(VllmWorkerProcess pid=167) Process VllmWorkerProcess:
(VllmWorkerProcess pid=165) Traceback (most recent call last):
(VllmWorkerProcess pid=165)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=165)     self.run()
(VllmWorkerProcess pid=165)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=165)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=165)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
(VllmWorkerProcess pid=165)     raise KeyboardInterrupt("MQLLMEngine terminated")
(VllmWorkerProcess pid=165) KeyboardInterrupt: MQLLMEngine terminated
(VllmWorkerProcess pid=168) Process VllmWorkerProcess:
(VllmWorkerProcess pid=166) Process VllmWorkerProcess:
(VllmWorkerProcess pid=168) Traceback (most recent call last):
(VllmWorkerProcess pid=166) Traceback (most recent call last):
(VllmWorkerProcess pid=168)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=166)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=168)     self.run()
(VllmWorkerProcess pid=166)     self.run()
(VllmWorkerProcess pid=168)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=166)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=168)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=166)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=168)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
(VllmWorkerProcess pid=166)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
(VllmWorkerProcess pid=168)     raise KeyboardInterrupt("MQLLMEngine terminated")
(VllmWorkerProcess pid=166)     raise KeyboardInterrupt("MQLLMEngine terminated")
(VllmWorkerProcess pid=166) KeyboardInterrupt: MQLLMEngine terminated
(VllmWorkerProcess pid=168) KeyboardInterrupt: MQLLMEngine terminated
(VllmWorkerProcess pid=167) Traceback (most recent call last):
(VllmWorkerProcess pid=167)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=167)     self.run()
(VllmWorkerProcess pid=167)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=167)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=167)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
(VllmWorkerProcess pid=167)     raise KeyboardInterrupt("MQLLMEngine terminated")
(VllmWorkerProcess pid=167) KeyboardInterrupt: MQLLMEngine terminated
(VllmWorkerProcess pid=169) Process VllmWorkerProcess:
(VllmWorkerProcess pid=169) Traceback (most recent call last):
(VllmWorkerProcess pid=169)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=169)     self.run()
(VllmWorkerProcess pid=169)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=169)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=169)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
(VllmWorkerProcess pid=169)     raise KeyboardInterrupt("MQLLMEngine terminated")
(VllmWorkerProcess pid=169) KeyboardInterrupt: MQLLMEngine terminated
(VllmWorkerProcess pid=164) Process VllmWorkerProcess:
(VllmWorkerProcess pid=164) Traceback (most recent call last):
(VllmWorkerProcess pid=164)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=164)     self.run()
(VllmWorkerProcess pid=164)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=164)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=164)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
(VllmWorkerProcess pid=164)     raise KeyboardInterrupt("MQLLMEngine terminated")
(VllmWorkerProcess pid=164) KeyboardInterrupt: MQLLMEngine terminated
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
INFO 2024-12-12 21:37:19,577 instructlab.model.backends.vllm:475: Waiting for GPU VRAM reclamation...
[cloud-user@ip-10-0-20-165 ~]$
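Note that generation itself completed ("Generation took 160.78s") and the mixed dataset was saved; the tracebacks appear while vLLM tears down its worker processes. A quick, illustrative check that the SDG output survived the noisy shutdown (path taken from the log above):

```bash
# Verify the generated dataset exists and is non-empty.
ls -lh /var/home/cloud-user/.local/share/instructlab/datasets/skills_train_msgs_2024-12-12T21_34_24.jsonl
```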
Expected behavior
- No error; the vLLM workers shut down cleanly after generation completes.
Device Info (please complete the following information):
- Hardware Specs: AWS p5.48xlarge
- OS Version: Red Hat Enterprise Linux release 9.4 (Plow)
- InstructLab Version: 0.21.2
- Provide the output of these two commands:
- `sudo bootc status --format json | jq .status.booted.image.image.image`

  "registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952"

- `ilab system info`

  Platform:
    sys.version: 3.11.7 (main, Oct 9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
    sys.platform: linux
    os.name: posix
    platform.release: 5.14.0-427.42.1.el9_4.x86_64
    platform.machine: x86_64
    platform.node: ip-10-0-20-165.us-east-2.compute.internal
    platform.python_version: 3.11.7
    os-release.ID: rhel
    os-release.VERSION_ID: 9.4
    os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
    memory.total: 1999.96 GB
    memory.available: 1987.86 GB
    memory.used: 3.24 GB

  InstructLab:
    instructlab.version: 0.21.2
    instructlab-dolomite.version: 0.2.0
    instructlab-eval.version: 0.4.1
    instructlab-quantize.version: 0.1.0
    instructlab-schema.version: 0.4.1
    instructlab-sdg.version: 0.6.1
    instructlab-training.version: 0.6.1

  Torch:
    torch.version: 2.4.1
    torch.backends.cpu.capability: AVX2
    torch.version.cuda: 12.4
    torch.version.hip: None
    torch.cuda.available: True
    torch.backends.cuda.is_built: True
    torch.backends.mps.is_built: False
    torch.backends.mps.is_available: False
    torch.cuda.bf16: True
    torch.cuda.current.device: 0
    torch.cuda.0.name: NVIDIA H100 80GB HBM3
    torch.cuda.0.free: 78.6 GB
    torch.cuda.0.total: 79.1 GB
    torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.1.name: NVIDIA H100 80GB HBM3
    torch.cuda.1.free: 78.6 GB
    torch.cuda.1.total: 79.1 GB
    torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.2.name: NVIDIA H100 80GB HBM3
    torch.cuda.2.free: 78.6 GB
    torch.cuda.2.total: 79.1 GB
    torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.3.name: NVIDIA H100 80GB HBM3
    torch.cuda.3.free: 78.6 GB
    torch.cuda.3.total: 79.1 GB
    torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.4.name: NVIDIA H100 80GB HBM3
    torch.cuda.4.free: 78.6 GB
    torch.cuda.4.total: 79.1 GB
    torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.5.name: NVIDIA H100 80GB HBM3
    torch.cuda.5.free: 78.6 GB
    torch.cuda.5.total: 79.1 GB
    torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.6.name: NVIDIA H100 80GB HBM3
    torch.cuda.6.free: 78.6 GB
    torch.cuda.6.total: 79.1 GB
    torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
    torch.cuda.7.name: NVIDIA H100 80GB HBM3
    torch.cuda.7.free: 78.6 GB
    torch.cuda.7.total: 79.1 GB
    torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)

  llama_cpp_python:
    llama_cpp_python.version: 0.2.79
    llama_cpp_python.supports_gpu_offload: True