Red Hat Enterprise Linux AI
RHELAI-2709

SDG throws an error message

    • Bug
    • Resolution: Unresolved
    • rhelai-1.3.1
    • InstructLab - CLI
    • Important

      To Reproduce
      Steps to reproduce the behavior:

      1. Deploy the following image on an AWS p5.48xlarge instance:

         rhel-ai-nvidia-1.3-1733167100-x86_64.raw

      2. Upgrade to RHEL AI 1.3.1 with:

         sudo bootc switch registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952

      3. Run `ilab config init`.
      4. Download the models with:

         ilab model download --repository docker://registry.redhat.io/rhelai1/granite-8b-lab-v1 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/granite-8b-starter-v1 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-starter --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/skills-adapter-v3 --release 1.3 &
         ilab model download --repository docker://registry.redhat.io/rhelai1/knowledge-adapter-v3 --release 1.3 &

      5. Run `ilab data generate`.
      6. After SDG has been running for some time, the following error appears (a consolidated sketch of steps 3-5 follows the log):

       

      INFO 12-12 21:36:52 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%,
       CPU KV cache usage: 0.0%.                                                                                                                                                                    
      Creating json from Arrow format:  89%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▎            | 340/381 [00:13<00:03, 11.39ba/s]
      INFO 12-12 21:37:02 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%,
       CPU KV cache usage: 0.0%.
      Creating json from Arrow format: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 381/381 [00:16<00:00, 23.79ba/s]
      INFO 2024-12-12 21:37:05,031 instructlab.sdg.datamixing:215: Mixed Dataset saved to /var/home/cloud-user/.local/share/instructlab/datasets/skills_train_msgs_2024-12-12T21_34_24.jsonl
      INFO 2024-12-12 21:37:05,087 instructlab.sdg.generate_data:485: Generation took 160.78s
      INFO 12-12 21:37:05 launcher.py:57] Shutting down FastAPI HTTP server.
      INFO 12-12 21:37:05 multiproc_worker_utils.py:137] Terminating local vLLM worker processes
      (VllmWorkerProcess pid=164) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=166) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=169) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=167) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=165) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=170) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=168) INFO 12-12 21:37:05 multiproc_worker_utils.py:244] Worker exiting
      (VllmWorkerProcess pid=170) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=170) Traceback (most recent call last):
      (VllmWorkerProcess pid=170)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=170)     self.run()
      (VllmWorkerProcess pid=170)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=170)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=170)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=170)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=170) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=165) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=167) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=165) Traceback (most recent call last):
      (VllmWorkerProcess pid=165)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=165)     self.run()
      (VllmWorkerProcess pid=165)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=165)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=165)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=165)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=165) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=168) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=166) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=168) Traceback (most recent call last):
      (VllmWorkerProcess pid=166) Traceback (most recent call last):
      (VllmWorkerProcess pid=168)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=166)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=168)     self.run()
      (VllmWorkerProcess pid=166)     self.run()
      (VllmWorkerProcess pid=168)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=166)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=168)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=166)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=168)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=166)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=168)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=166)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=166) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=168) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=167) Traceback (most recent call last):
      (VllmWorkerProcess pid=167)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=167)     self.run()
      (VllmWorkerProcess pid=167)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=167)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=167)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=167)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=167) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=169) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=169) Traceback (most recent call last):
      (VllmWorkerProcess pid=169)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=169)     self.run()
      (VllmWorkerProcess pid=169)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=169)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=169)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=169)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=169) KeyboardInterrupt: MQLLMEngine terminated
      (VllmWorkerProcess pid=164) Process VllmWorkerProcess:
      (VllmWorkerProcess pid=164) Traceback (most recent call last):
      (VllmWorkerProcess pid=164)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 314, in _bootstrap
      (VllmWorkerProcess pid=164)     self.run()
      (VllmWorkerProcess pid=164)   File "/usr/lib64/python3.11/multiprocessing/process.py", line 108, in run
      (VllmWorkerProcess pid=164)     self._target(*self._args, **self._kwargs)
      (VllmWorkerProcess pid=164)   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 384, in signal_handler
      (VllmWorkerProcess pid=164)     raise KeyboardInterrupt("MQLLMEngine terminated")
      (VllmWorkerProcess pid=164) KeyboardInterrupt: MQLLMEngine terminated
      INFO:     Shutting down
      INFO:     Waiting for application shutdown.
      INFO:     Application shutdown complete.
      /usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
        warnings.warn('resource_tracker: There appear to be %d '
      INFO 2024-12-12 21:37:19,577 instructlab.model.backends.vllm:475: Waiting for GPU VRAM reclamation...
      [cloud-user@ip-10-0-20-165 ~]$ 
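
      For convenience, steps 3-5 can be collected into a single shell script. The sketch below is not from the original report: it assumes registry.redhat.io credentials are already configured and that the `ilab config init` prompts are answered with acceptable values; the repositories and flags are copied verbatim from step 4, while the `wait` before generation is an added assumption so SDG only starts once every background download has finished.

        #!/usr/bin/env bash
        # Sketch of reproduction steps 3-5 (assumptions: registry credentials are
        # configured and the default ilab configuration is acceptable).
        set -euo pipefail

        # Step 3: initialize the ilab configuration (may prompt interactively).
        ilab config init

        # Step 4: download all models in parallel, exactly as listed in the report,
        # then wait for every background download to finish (the wait is an addition).
        repos=(
          granite-8b-lab-v1
          granite-8b-starter-v1
          prometheus-8x7b-v2-0
          granite-7b-starter
          mixtral-8x7b-instruct-v0-1
          skills-adapter-v3
          knowledge-adapter-v3
        )
        for repo in "${repos[@]}"; do
          ilab model download --repository "docker://registry.redhat.io/rhelai1/${repo}" --release 1.3 &
        done
        wait

        # Step 5: run synthetic data generation; the vLLM worker tracebacks shown
        # above appear during the shutdown phase after generation completes.
        ilab data generate
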
      Expected behavior

      • No error.

      Device Info (please complete the following information):

      • Hardware Specs: [AWS p5.48xlarge]
      • OS Version: [Red Hat Enterprise Linux release 9.4 (Plow)]
      • InstructLab Version: [0.21.2]
      • Provide the output of these two commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image
          "registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952" 
        • ilab system info
          Platform:                                                                          
            sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux       
            os.name: posix             
            platform.release: 5.14.0-427.42.1.el9_4.x86_64                                   
            platform.machine: x86_64                
            platform.node: ip-10-0-20-165.us-east-2.compute.internal
            platform.python_version: 3.11.7
            os-release.ID: rhel                                                              
            os-release.VERSION_ID: 9.4              
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 1999.96 GB   
            memory.available: 1987.86 GB                                                     
            memory.used: 3.24 GB                    
                                      
          InstructLab:                 
            instructlab.version: 0.21.2                                                      
            instructlab-dolomite.version: 0.2.0     
            instructlab-eval.version: 0.4.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.1                                                
            instructlab-sdg.version: 0.6.1          
            instructlab-training.version: 0.6.1
                                       
          Torch:                                                                             
            torch.version: 2.4.1
            torch.backends.cpu.capability: AVX2
            torch.version.cuda: 12.4        
            torch.version.hip: None                    
            torch.cuda.available: True  
            torch.backends.cuda.is_built: True
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            torch.cuda.bf16: True
            torch.cuda.current.device: 0
            torch.cuda.0.name: NVIDIA H100 80GB HBM3
            torch.cuda.0.free: 78.6 GB
            torch.cuda.0.total: 79.1 GB
            torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.1.name: NVIDIA H100 80GB HBM3
            torch.cuda.1.free: 78.6 GB
            torch.cuda.1.total: 79.1 GB
            torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.2.name: NVIDIA H100 80GB HBM3
            torch.cuda.2.free: 78.6 GB
            torch.cuda.2.total: 79.1 GB
            torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.3.name: NVIDIA H100 80GB HBM3
            torch.cuda.3.free: 78.6 GB
            torch.cuda.3.total: 79.1 GB
            torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.4.name: NVIDIA H100 80GB HBM3
            torch.cuda.4.free: 78.6 GB
            torch.cuda.4.total: 79.1 GB
            torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.5.name: NVIDIA H100 80GB HBM3
            torch.cuda.5.free: 78.6 GB
            torch.cuda.5.total: 79.1 GB
            torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.6.name: NVIDIA H100 80GB HBM3
            torch.cuda.6.free: 78.6 GB
            torch.cuda.6.total: 79.1 GB
            torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.7.name: NVIDIA H100 80GB HBM3
            torch.cuda.7.free: 78.6 GB
            torch.cuda.7.total: 79.1 GB
            torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          llama_cpp_python:
            llama_cpp_python.version: 0.2.79
            llama_cpp_python.supports_gpu_offload: True
          

       

              Assignee: Unassigned
              Reporter: Alexander Chuzhoy (achuzhoy@redhat.com)