Red Hat Enterprise Linux AI / RHELAI-4055

SDG is broken on RHEL AI 1.5 AMD


      To Reproduce

      Steps to reproduce the behavior:

      ilab -v data generate

      Expected behavior

      • SDG completes successfully.


      Device Info:

      • Hardware Specs: Standard_ND96is_MI300X_v5 (MI300X x 8)
      • OS Version: rhel-ai-amd-azure-1.5-1746228359-x86_64
      • InstructLab Version: ilab, version 0.26.0

      Bug impact

      • SDG is not working, so training can't be verified.

      Known workaround

      • N/A

      Additional context

      Adding --enable-serving-output to the vLLM args only makes the run exit/fail earlier, without surfacing any additional details.
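
      For completeness, the retried invocation looked like this (the flag name comes from the error message and the parameter dump below):

      $ ilab -v data generate --enable-serving-output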

      vLLM starts and works just fine for chat (ilab model chat) and serve (ilab model serve).
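
      As a sanity check, serving the same teacher model directly succeeds on this host (a sketch; the model path is taken from the logs below):

      $ ilab model serve --model-path /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1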

      $ ilab -v data generate 
      Parameters:
                   model_path: '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1'     [type: str, src: default_map]
                     num_cpus: 2            [type: int, src: default_map]
             chunk_word_count: 1000         [type: int, src: default_map]
             num_instructions: -1           [type: int, src: default]
             sdg_scale_factor: 30           [type: int, src: default_map]
                taxonomy_path: '/var/home/azureuser/.local/share/instructlab/taxonomy'     [type: str, src: default_map]
                taxonomy_base: 'empty'      [type: str, src: default_map]
                   output_dir: '/var/home/azureuser/.local/share/instructlab/datasets'     [type: str, src: default_map]
                        quiet: False        [type: bool, src: default]
                 endpoint_url: None         [type: None, src: default]
                      api_key: 'no_api_key'     [type: str, src: default]
                   yaml_rules: None         [type: None, src: default]
              server_ctx_size: 4096         [type: int, src: default]
                 tls_insecure: False        [type: bool, src: default]
              tls_client_cert: ''           [type: str, src: default]
               tls_client_key: ''           [type: str, src: default]
            tls_client_passwd: ''           [type: str, src: default]
                 model_family: None         [type: None, src: default]
                     pipeline: '/usr/share/instructlab/sdg/pipelines/agentic'     [type: str, src: default_map]
                   batch_size: 256          [type: int, src: default_map]
        enable_serving_output: False        [type: bool, src: default]
                         gpus: 1            [type: int, src: default_map]
               max_num_tokens: 4096         [type: int, src: default_map]
                     detached: False        [type: bool, src: default]
             student_model_id: None         [type: None, src: default]
             teacher_model_id: None         [type: None, src: default]
      DEBUG 2025-05-03 13:22:03,062 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/granite-3-1-8b-starter-v2 is a directory
      INFO 2025-05-03 13:22:06,615 instructlab.process.process:300: Started subprocess with PID 1. Logs are being written to /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log.
      DEBUG 2025-05-03 13:22:06,616 instructlab:0: Checking for existing file handler for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log
      DEBUG 2025-05-03 13:22:06,616 instructlab:0: No file handler found for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log, adding one!
      DEBUG 2025-05-03 13:22:06,616 instructlab:0: Adding file handler for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log
      DEBUG 2025-05-03 13:22:06,616 instructlab:0: Added file handler for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log
      DEBUG 2025-05-03 13:22:07,317 instructlab.model.backends.backends:179: Selecting backend for model /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
      DEBUG 2025-05-03 13:22:07,320 instructlab.model.backends.backends:74: Auto-detecting backend for model /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
      DEBUG 2025-05-03 13:22:07,343 instructlab.model.backends.backends:32: Model is huggingface safetensors and system is Linux, using vllm backend.
      DEBUG 2025-05-03 13:22:07,343 instructlab.model.backends.backends:82: Auto-detected backend: vllm
      DEBUG 2025-05-03 13:22:07,343 instructlab.model.backends.backends:93: Validating 'vllm' backend
      INFO 2025-05-03 13:22:07,454 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
      DEBUG 2025-05-03 13:22:08,798 instructlab.model.backends.vllm:119: Using available port 56489 for temporary model serving.
      DEBUG 2025-05-03 13:22:08,799 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3' is missing {'config.json'}
      DEBUG 2025-05-03 13:22:08,799 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/skills-adapter-v3 is a directory
      DEBUG 2025-05-03 13:22:08,800 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3' is missing {'config.json'}
      DEBUG 2025-05-03 13:22:08,800 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3 is a directory
      DEBUG 2025-05-03 13:22:08,998 instructlab.model.backends.vllm:294: vLLM serving command is: ['/opt/app-root/bin/python3.11', '-m', 'vllm.entrypoints.openai.api_server', '--host', '127.0.0.1', '--port', '56489', '--model', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', '--distributed-executor-backend', 'mp', '--served-model-name', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', 'mixtral-8x7b-instruct-v0-1', 'models/granite-3-1-8b-lab-v2', 'models/granite-3-1-8b-starter-v2', 'models/mixtral-8x7b-instruct-v0-1', 'models/prometheus-8x7b-v2-0', '--max-num-seqs', '512', '--enable-lora', '--enable-prefix-caching', '--max-lora-rank', '64', '--dtype', 'bfloat16', '--lora-dtype', 'bfloat16', '--fully-sharded-loras', '--lora-modules', 'skill-classifier-v3-clm=/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3', 'text-classifier-knowledge-v3-clm=/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3', '--tensor-parallel-size', '1']
      INFO 2025-05-03 13:22:09,000 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:56489/v1
      INFO 2025-05-03 13:22:09,000 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:56489/v1
      INFO 2025-05-03 13:22:09,000 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:56489/v1, this might take a moment... Attempt: 1/120
      [... attempts 2-30 omitted: the same "Waiting for the vLLM server to start" INFO line repeats every ~3.3 s ...]
      INFO 2025-05-03 13:23:48,626 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:56489/v1, this might take a moment... Attempt: 31/120
      INFO 2025-05-03 13:23:51,896 instructlab.model.backends.vllm:180: vLLM startup failed.  Retrying (1/1)
      ERROR 2025-05-03 13:23:51,897 instructlab.model.backends.vllm:185: vLLM failed to start.  Retry with --enable-serving-output to learn more about the failure.
      INFO 2025-05-03 13:23:51,897 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
      DEBUG 2025-05-03 13:23:53,293 instructlab.model.backends.vllm:119: Using available port 33677 for temporary model serving.
      DEBUG 2025-05-03 13:23:53,295 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3' is missing {'config.json'}
      DEBUG 2025-05-03 13:23:53,295 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/skills-adapter-v3 is a directory
      DEBUG 2025-05-03 13:23:53,295 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3' is missing {'config.json'}
      DEBUG 2025-05-03 13:23:53,295 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3 is a directory
      DEBUG 2025-05-03 13:23:53,504 instructlab.model.backends.vllm:294: vLLM serving command is: ['/opt/app-root/bin/python3.11', '-m', 'vllm.entrypoints.openai.api_server', '--host', '127.0.0.1', '--port', '33677', '--model', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', '--distributed-executor-backend', 'mp', '--served-model-name', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', 'mixtral-8x7b-instruct-v0-1', 'models/granite-3-1-8b-lab-v2', 'models/granite-3-1-8b-starter-v2', 'models/mixtral-8x7b-instruct-v0-1', 'models/prometheus-8x7b-v2-0', '--max-num-seqs', '512', '--enable-lora', '--enable-prefix-caching', '--max-lora-rank', '64', '--dtype', 'bfloat16', '--lora-dtype', 'bfloat16', '--fully-sharded-loras', '--lora-modules', 'skill-classifier-v3-clm=/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3', 'text-classifier-knowledge-v3-clm=/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3', '--tensor-parallel-size', '1']
      INFO 2025-05-03 13:23:53,505 instructlab.model.backends.vllm:332: vLLM starting up on pid 109 at http://127.0.0.1:33677/v1
      INFO 2025-05-03 13:23:53,505 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:33677/v1
      INFO 2025-05-03 13:23:53,505 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:33677/v1, this might take a moment... Attempt: 1/120
      [... attempts 2-36 omitted: the same "Waiting for the vLLM server to start" INFO line repeats every ~3.3 s ...]
      INFO 2025-05-03 13:25:52,647 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:33677/v1, this might take a moment... Attempt: 37/120
      failed to generate data with exception: Failed to start server: vLLM failed to start.  Retry with --enable-serving-output to learn more about the failure.
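
      To surface the actual startup error, a trimmed version of the serving command logged above can be run by hand. This is a debugging sketch, not part of the original run; the flags are copied from the logged vLLM command, with the LoRA adapter options dropped for a minimal repro:

      $ /opt/app-root/bin/python3.11 -m vllm.entrypoints.openai.api_server \
          --host 127.0.0.1 --port 56489 \
          --model /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1 \
          --dtype bfloat16 --tensor-parallel-size 1

      Whatever traceback vLLM produces then prints directly to stderr instead of being swallowed by ilab.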

      People: Kamesh Akella (kakella@redhat.com), František Zatloukal (fzatlouk@redhat.com)