Red Hat Enterprise Linux AI / RHELAI-4055

SDG is broken on RHEL AI 1.5 AMD


      To Reproduce

      Steps to reproduce the behavior:

      ilab -v data generate

      Expected behavior

      • SDG completes successfully.


      Device Info:

      • Hardware Specs: Standard_ND96is_MI300X_v5 (MI300X x 8)
      • OS Version: rhel-ai-amd-azure-1.5-1746228359-x86_64
      • InstructLab Version: ilab, version 0.26.0

      Bug impact

      • SDG is not working, so training can't be verified.

      Known workaround

      • N/A

      Additional context

      Adding --enable-serving-output to the vLLM args only makes the run exit/fail earlier, without surfacing any additional details.
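
      For completeness, the retried invocation looked like this (the flag name comes from the error message and the parameter dump below):

      $ ilab -v data generate --enable-serving-output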

      vLLM starts and works just fine for chat (ilab model chat) and serve (ilab model serve).
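
      As a sanity check, serving the same teacher model directly succeeds on this host (a sketch; the model path is taken from the logs below):

      $ ilab model serve --model-path /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1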

      $ ilab -v data generate 
      Parameters:
                   model_path: '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1'     [type: str, src: default_map]
                     num_cpus: 2            [type: int, src: default_map]
             chunk_word_count: 1000         [type: int, src: default_map]
             num_instructions: -1           [type: int, src: default]
             sdg_scale_factor: 30           [type: int, src: default_map]
                taxonomy_path: '/var/home/azureuser/.local/share/instructlab/taxonomy'     [type: str, src: default_map]
                taxonomy_base: 'empty'      [type: str, src: default_map]
                   output_dir: '/var/home/azureuser/.local/share/instructlab/datasets'     [type: str, src: default_map]
                        quiet: False        [type: bool, src: default]
                 endpoint_url: None         [type: None, src: default]
                      api_key: 'no_api_key'     [type: str, src: default]
                   yaml_rules: None         [type: None, src: default]
              server_ctx_size: 4096         [type: int, src: default]
                 tls_insecure: False        [type: bool, src: default]
              tls_client_cert: ''           [type: str, src: default]
               tls_client_key: ''           [type: str, src: default]
            tls_client_passwd: ''           [type: str, src: default]
                 model_family: None         [type: None, src: default]
                     pipeline: '/usr/share/instructlab/sdg/pipelines/agentic'     [type: str, src: default_map]
                   batch_size: 256          [type: int, src: default_map]
        enable_serving_output: False        [type: bool, src: default]
                         gpus: 1            [type: int, src: default_map]
               max_num_tokens: 4096         [type: int, src: default_map]
                     detached: False        [type: bool, src: default]
             student_model_id: None         [type: None, src: default]
             teacher_model_id: None         [type: None, src: default]
      DEBUG 2025-05-03 13:22:03,062 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/granite-3-1-8b-starter-v2 is a directory
      INFO 2025-05-03 13:22:06,615 instructlab.process.process:300: Started subprocess with PID 1. Logs are being written to /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log.
      DEBUG 2025-05-03 13:22:06,616 instructlab:0: Checking for existing file handler for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log
      DEBUG 2025-05-03 13:22:06,616 instructlab:0: No file handler found for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log, adding one!
      DEBUG 2025-05-03 13:22:06,616 instructlab:0: Adding file handler for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log
      DEBUG 2025-05-03 13:22:06,616 instructlab:0: Added file handler for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log
      DEBUG 2025-05-03 13:22:07,317 instructlab.model.backends.backends:179: Selecting backend for model /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
      DEBUG 2025-05-03 13:22:07,320 instructlab.model.backends.backends:74: Auto-detecting backend for model /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
      DEBUG 2025-05-03 13:22:07,343 instructlab.model.backends.backends:32: Model is huggingface safetensors and system is Linux, using vllm backend.
      DEBUG 2025-05-03 13:22:07,343 instructlab.model.backends.backends:82: Auto-detected backend: vllm
      DEBUG 2025-05-03 13:22:07,343 instructlab.model.backends.backends:93: Validating 'vllm' backend
      INFO 2025-05-03 13:22:07,454 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
      DEBUG 2025-05-03 13:22:08,798 instructlab.model.backends.vllm:119: Using available port 56489 for temporary model serving.
      DEBUG 2025-05-03 13:22:08,799 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3' is missing {'config.json'}
      DEBUG 2025-05-03 13:22:08,799 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/skills-adapter-v3 is a directory
      DEBUG 2025-05-03 13:22:08,800 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3' is missing {'config.json'}
      DEBUG 2025-05-03 13:22:08,800 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3 is a directory
      DEBUG 2025-05-03 13:22:08,998 instructlab.model.backends.vllm:294: vLLM serving command is: ['/opt/app-root/bin/python3.11', '-m', 'vllm.entrypoints.openai.api_server', '--host', '127.0.0.1', '--port', '56489', '--model', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', '--distributed-executor-backend', 'mp', '--served-model-name', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', 'mixtral-8x7b-instruct-v0-1', 'models/granite-3-1-8b-lab-v2', 'models/granite-3-1-8b-starter-v2', 'models/mixtral-8x7b-instruct-v0-1', 'models/prometheus-8x7b-v2-0', '--max-num-seqs', '512', '--enable-lora', '--enable-prefix-caching', '--max-lora-rank', '64', '--dtype', 'bfloat16', '--lora-dtype', 'bfloat16', '--fully-sharded-loras', '--lora-modules', 'skill-classifier-v3-clm=/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3', 'text-classifier-knowledge-v3-clm=/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3', '--tensor-parallel-size', '1']
      INFO 2025-05-03 13:22:09,000 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:56489/v1
      INFO 2025-05-03 13:22:09,000 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:56489/v1
      INFO 2025-05-03 13:22:09,000 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:56489/v1, this might take a moment... Attempt: 1/120
      [... attempts 2-30 omitted: the same "Waiting for the vLLM server to start" INFO line repeats every ~3.3 s ...]
      INFO 2025-05-03 13:23:48,626 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:56489/v1, this might take a moment... Attempt: 31/120
      INFO 2025-05-03 13:23:51,896 instructlab.model.backends.vllm:180: vLLM startup failed.  Retrying (1/1)
      ERROR 2025-05-03 13:23:51,897 instructlab.model.backends.vllm:185: vLLM failed to start.  Retry with --enable-serving-output to learn more about the failure.
      INFO 2025-05-03 13:23:51,897 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
      DEBUG 2025-05-03 13:23:53,293 instructlab.model.backends.vllm:119: Using available port 33677 for temporary model serving.
      DEBUG 2025-05-03 13:23:53,295 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3' is missing {'config.json'}
      DEBUG 2025-05-03 13:23:53,295 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/skills-adapter-v3 is a directory
      DEBUG 2025-05-03 13:23:53,295 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3' is missing {'config.json'}
      DEBUG 2025-05-03 13:23:53,295 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3 is a directory
      DEBUG 2025-05-03 13:23:53,504 instructlab.model.backends.vllm:294: vLLM serving command is: ['/opt/app-root/bin/python3.11', '-m', 'vllm.entrypoints.openai.api_server', '--host', '127.0.0.1', '--port', '33677', '--model', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', '--distributed-executor-backend', 'mp', '--served-model-name', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', 'mixtral-8x7b-instruct-v0-1', 'models/granite-3-1-8b-lab-v2', 'models/granite-3-1-8b-starter-v2', 'models/mixtral-8x7b-instruct-v0-1', 'models/prometheus-8x7b-v2-0', '--max-num-seqs', '512', '--enable-lora', '--enable-prefix-caching', '--max-lora-rank', '64', '--dtype', 'bfloat16', '--lora-dtype', 'bfloat16', '--fully-sharded-loras', '--lora-modules', 'skill-classifier-v3-clm=/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3', 'text-classifier-knowledge-v3-clm=/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3', '--tensor-parallel-size', '1']
      INFO 2025-05-03 13:23:53,505 instructlab.model.backends.vllm:332: vLLM starting up on pid 109 at http://127.0.0.1:33677/v1
      INFO 2025-05-03 13:23:53,505 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:33677/v1
      INFO 2025-05-03 13:23:53,505 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:33677/v1, this might take a moment... Attempt: 1/120
      [... attempts 2-36 omitted: the same "Waiting for the vLLM server to start" INFO line repeats every ~3.3 s ...]
      INFO 2025-05-03 13:25:52,647 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:33677/v1, this might take a moment... Attempt: 37/120
      failed to generate data with exception: Failed to start server: vLLM failed to start.  Retry with --enable-serving-output to learn more about the failure.
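
      To surface the actual startup error, a trimmed version of the serving command logged above can be run by hand. This is a debugging sketch, not part of the original run; the flags are copied from the logged vLLM command, with the LoRA adapter options dropped for a minimal repro:

      $ /opt/app-root/bin/python3.11 -m vllm.entrypoints.openai.api_server \
          --host 127.0.0.1 --port 56489 \
          --model /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1 \
          --dtype bfloat16 --tensor-parallel-size 1

      Whatever traceback vLLM produces then prints directly to stderr instead of being swallowed by ilab.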

      People: Kamesh Akella (kakella@redhat.com), František Zatloukal (fzatlouk@redhat.com)