Red Hat Enterprise Linux AI
RHELAI-2709

SDG throws an error message

    • Bug
    • Resolution: Unresolved
    • rhelai-1.3.1
    • InstructLab - CLI
    • Important

      To Reproduce
      Steps to reproduce the behavior:

      1. Deploy the following image on an AWS p5.48xlarge instance:

         rhel-ai-nvidia-1.3-1733167100-x86_64.raw

      2. Upgrade to RHEL AI 1.3.1 with:

         sudo bootc switch registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952

      3. Run `ilab config init`.
      4. Download the models with:

         ilab model download --repository docker://registry.redhat.io/rhelai1/granite-8b-lab-v1 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/granite-8b-starter-v1 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-starter --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/skills-adapter-v3 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/knowledge-adapter-v3 --release 1.3 &

      5. Run `ilab data generate`.
      6. After SDG has been running for some time, the following error appears (a consolidated sketch of steps 3-5 follows the log):

       

      INFO 12-12 21:36:52 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%,
       CPU KV cache usage: 0.0%.                                                                                                                                                                    
      Creating json from Arrow format:  89%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▎            | 340/381 [00:13<00:03, 11.39ba/s]
      INFO 12-12 21:37:02 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%,
       CPU KV cache usage: 0.0%.
      Creating json from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 381/381 [00:16<00:00, 23.79ba/s]
      INFO 2024-12-12 21:37:05,031 instructlab.sdg.datamixing:215: Mixed Dataset saved to /var/home/cloud-user/.local/share/instructlab/datasets/skills_train_msgs_2024-12-12T21_34_24.jsonl
      INFO 2024-12-12 21:37:05,087 instructlab.sdg.generate_data:485: Generation took 160.78s
      INFO 12-12 21:37:05 launcher.py:57] Shutting down FastAPI HTTP server.
      INFO 12-12 21:37:05 multiproc_worker_utils.py:137] Terminating local vLLM worker processes
      (VllmWorkerProcess pid=164) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=166) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=169) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=167) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=165) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=170) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=168) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=170) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=170) Traceback (most recent call last):
      (VllmWorkerProcess pid=170)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=170)     self.run()
      (VllmWorkerProcess pid=170)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=170)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=170)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=170)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=170) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=165) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=167) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=165) Traceback (most recent call last):
      (VllmWorkerProcess pid=165)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=165)     self.run()
      (VllmWorkerProcess pid=165)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=165)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=165)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=165)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=165) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=168) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=166) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=168) Traceback (most recent call last):
      (VllmWorkerProcess pid=166) Traceback (most recent call last):
      (VllmWorkerProcess pid=168)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=166)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=168)     self.run()
      (VllmWorkerProcess pid=166)     self.run()
      (VllmWorkerProcess pid=168)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=166)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=168)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=166)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=168)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=166)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=168)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=166)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=166) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=168) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=167) Traceback (most recent call last):
      (VllmWorkerProcess pid=167)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=167)     self.run()
      (VllmWorkerProcess pid=167)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=167)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=167)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=167)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=167) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=169) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=169) Traceback (most recent call last):
      (VllmWorkerProcess pid=169)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=169)     self.run()
      (VllmWorkerProcess pid=169)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=169)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=169)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=169)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=169) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=164) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=164) Traceback (most recent call last):
      (VllmWorkerProcess pid=164)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=164)     self.run()
      (VllmWorkerProcess pid=164)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=164)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=164)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=164)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=164) KeyboardInterrupt: MQLLMEngine terminated
      INFO:     Shutting down
      INFO:     Waiting for application shutdown.
      INFO:     Application shutdown complete.
      /usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
        warnings.warn('resource_tracker: There appear to be %d '
      INFO 2024-12-12 21:37:19,577 instructlab.model.backends.vllm:475: Waiting for GPU VRAM reclamation...
      [cloud-user@ip-10-0-20-165 ~]$ 
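
      For convenience, steps 3-5 can be collected into a single shell script. The sketch below is not from the original report: it assumes registry.redhat.io credentials are already configured and that the `ilab config init` prompts are answered with acceptable values; the repositories and flags are copied verbatim from step 4, while the `wait` before generation is an added assumption so SDG only starts once every background download has finished.

        #!/usr/bin/env bash
        # Sketch of reproduction steps 3-5 (assumptions: registry credentials are
        # configured and the default ilab configuration is acceptable).
        set -euo pipefail

        # Step 3: initialize the ilab configuration (may prompt interactively).
        ilab config init

        # Step 4: download all models in parallel, exactly as listed in the report,
        # then wait for every background download to finish (the wait is an addition).
        repos=(
          granite-8b-lab-v1
          granite-8b-starter-v1
          prometheus-8x7b-v2-0
          granite-7b-starter
          mixtral-8x7b-instruct-v0-1
          skills-adapter-v3
          knowledge-adapter-v3
        )
        for repo in "${repos[@]}"; do
          ilab model download --repository "docker://registry.redhat.io/rhelai1/${repo}" --release 1.3 &
        done
        wait

        # Step 5: run synthetic data generation; the vLLM worker tracebacks shown
        # above appear during the shutdown phase after generation completes.
        ilab data generate
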
      Expected behavior

      • No error.

      Device Info (please complete the following information):

      • Hardware Specs: [AWS p5.48xlarge]
      • OS Version: [Red Hat Enterprise Linux release 9.4 (Plow)]
      • InstructLab Version: [0.21.2]
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image
          "registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952" 
        • ilab system info
          Platform:                                                                          
            sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux       
            os.name: posix             
            platform.release: 5.14.0-427.42.1.el9_4.x86_64                                   
            platform.machine: x86_64                
            platform.node: ip-10-0-20-165.us-east-2.compute.internal
            platform.python_version: 3.11.7
            os-release.ID: rhel                                                              
            os-release.VERSION_ID: 9.4              
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 1999.96 GB   
            memory.available: 1987.86 GB                                                     
            memory.used: 3.24 GB                    
                                      
          InstructLab:                 
            instructlab.version: 0.21.2                                                      
            instructlab-dolomite.version: 0.2.0     
            instructlab-eval.version: 0.4.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.1                                                
            instructlab-sdg.version: 0.6.1          
            instructlab-training.version: 0.6.1
                                       
          Torch:                                                                             
            torch.version: 2.4.1
            torch.backends.cpu.capability: AVX2
            torch.version.cuda: 12.4        
            torch.version.hip: None                    
            torch.cuda.available: True  
            torch.backends.cuda.is_built: True
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            torch.cuda.bf16: True
            torch.cuda.current.device: 0
            torch.cuda.0.name: NVIDIA H100 80GB HBM3
            torch.cuda.0.free: 78.6 GB
            torch.cuda.0.total: 79.1 GB
            torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.1.name: NVIDIA H100 80GB HBM3
            torch.cuda.1.free: 78.6 GB
            torch.cuda.1.total: 79.1 GB
            torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.2.name: NVIDIA H100 80GB HBM3
            torch.cuda.2.free: 78.6 GB
            torch.cuda.2.total: 79.1 GB
            torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.3.name: NVIDIA H100 80GB HBM3
            torch.cuda.3.free: 78.6 GB
            torch.cuda.3.total: 79.1 GB
            torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.4.name: NVIDIA H100 80GB HBM3
            torch.cuda.4.free: 78.6 GB
            torch.cuda.4.total: 79.1 GB
            torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.5.name: NVIDIA H100 80GB HBM3
            torch.cuda.5.free: 78.6 GB
            torch.cuda.5.total: 79.1 GB
            torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.6.name: NVIDIA H100 80GB HBM3
            torch.cuda.6.free: 78.6 GB
            torch.cuda.6.total: 79.1 GB
            torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.7.name: NVIDIA H100 80GB HBM3
            torch.cuda.7.free: 78.6 GB
            torch.cuda.7.total: 79.1 GB
            torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          llama_cpp_python:
            llama_cpp_python.version: 0.2.79
            llama_cpp_python.supports_gpu_offload: True
          

       

              Assignee: Unassigned
              Reporter: Alexander Chuzhoy (achuzhoy@redhat.com)