- Bug
- Resolution: Done
- Critical
- rhelai-1.5
- Approved
To Reproduce
Steps to reproduce the behavior:
ilab -v data generate
Expected behavior
- SDG runs successfully and produces a dataset.
Device Info (please complete the following information):
- Hardware Specs: Standard_ND96is_MI300X_v5 (MI300X x 8)
- OS Version: rhel-ai-amd-azure-1.5-1746228359-x86_64
- InstructLab Version: ilab, version 0.26.0
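The version line above is the output of ilab --version; a minimal sketch of how the environment details were gathered on the affected host (the ilab system info subcommand is assumed to be available in this build):

# Collect the environment details quoted above (run on the affected host).
ilab --version        # reported: ilab, version 0.26.0
cat /etc/os-release   # OS image: rhel-ai-amd-azure-1.5-1746228359-x86_64
ilab system info      # hardware/accelerator summary (assumed available here)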
Bug impact
- SDG does not work, so training cannot be verified.
Known workaround
- N/A
Additional context
Adding --enable-serving-output to the vLLM arguments only makes it exit and fail earlier, without any additional detail.
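Because generation runs in a subprocess, any extra detail ends up in the per-run log file announced at startup rather than on the console. A minimal sketch for capturing it, using the log path from the run below (the UUID differs per run):

# Re-run with server output enabled and follow the per-run generation log.
ilab -v data generate --enable-serving-output
tail -f /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log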
vLLM starts and works fine for both chat and serve.
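One way to surface the real startup error is to run the exact vLLM serving command recorded in the log below by hand, so its stdout/stderr are not swallowed by the retry loop. A sketch, abbreviated from the argv in the log (the full argument list, including the --served-model-name and --lora-modules entries, is in the 'vLLM serving command is:' line; the port is ephemeral):

# Launch vLLM directly with the arguments ilab generated (abbreviated).
/opt/app-root/bin/python3.11 -m vllm.entrypoints.openai.api_server \
    --host 127.0.0.1 --port 56489 \
    --model /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1 \
    --distributed-executor-backend mp \
    --max-num-seqs 512 --enable-lora --enable-prefix-caching --max-lora-rank 64 \
    --dtype bfloat16 --lora-dtype bfloat16 --fully-sharded-loras \
    --tensor-parallel-size 1 2>&1 | tee /tmp/vllm-startup.log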
$ ilab -v data generate
Parameters:
  model_path: '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1' [type: str, src: default_map]
  num_cpus: 2 [type: int, src: default_map]
  chunk_word_count: 1000 [type: int, src: default_map]
  num_instructions: -1 [type: int, src: default]
  sdg_scale_factor: 30 [type: int, src: default_map]
  taxonomy_path: '/var/home/azureuser/.local/share/instructlab/taxonomy' [type: str, src: default_map]
  taxonomy_base: 'empty' [type: str, src: default_map]
  output_dir: '/var/home/azureuser/.local/share/instructlab/datasets' [type: str, src: default_map]
  quiet: False [type: bool, src: default]
  endpoint_url: None [type: None, src: default]
  api_key: 'no_api_key' [type: str, src: default]
  yaml_rules: None [type: None, src: default]
  server_ctx_size: 4096 [type: int, src: default]
  tls_insecure: False [type: bool, src: default]
  tls_client_cert: '' [type: str, src: default]
  tls_client_key: '' [type: str, src: default]
  tls_client_passwd: '' [type: str, src: default]
  model_family: None [type: None, src: default]
  pipeline: '/usr/share/instructlab/sdg/pipelines/agentic' [type: str, src: default_map]
  batch_size: 256 [type: int, src: default_map]
  enable_serving_output: False [type: bool, src: default]
  gpus: 1 [type: int, src: default_map]
  max_num_tokens: 4096 [type: int, src: default_map]
  detached: False [type: bool, src: default]
  student_model_id: None [type: None, src: default]
  teacher_model_id: None [type: None, src: default]
DEBUG 2025-05-03 13:22:03,062 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/granite-3-1-8b-starter-v2 is a directory
INFO 2025-05-03 13:22:06,615 instructlab.process.process:300: Started subprocess with PID 1. Logs are being written to /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log.
DEBUG 2025-05-03 13:22:06,616 instructlab:0: Checking for existing file handler for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log
DEBUG 2025-05-03 13:22:06,616 instructlab:0: No file handler found for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log, adding one!
DEBUG 2025-05-03 13:22:06,616 instructlab:0: Adding file handler for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log
DEBUG 2025-05-03 13:22:06,616 instructlab:0: Added file handler for log file: /var/home/azureuser/.local/share/instructlab/logs/generation/generation-9c153560-2821-11f0-b54b-6045bd06fd56.log
DEBUG 2025-05-03 13:22:07,317 instructlab.model.backends.backends:179: Selecting backend for model /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
DEBUG 2025-05-03 13:22:07,320 instructlab.model.backends.backends:74: Auto-detecting backend for model /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1
DEBUG 2025-05-03 13:22:07,343 instructlab.model.backends.backends:32: Model is huggingface safetensors and system is Linux, using vllm backend.
DEBUG 2025-05-03 13:22:07,343 instructlab.model.backends.backends:82: Auto-detected backend: vllm
DEBUG 2025-05-03 13:22:07,343 instructlab.model.backends.backends:93: Validating 'vllm' backend
INFO 2025-05-03 13:22:07,454 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
DEBUG 2025-05-03 13:22:08,798 instructlab.model.backends.vllm:119: Using available port 56489 for temporary model serving.
DEBUG 2025-05-03 13:22:08,799 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3' is missing {'config.json'}
DEBUG 2025-05-03 13:22:08,799 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/skills-adapter-v3 is a directory
DEBUG 2025-05-03 13:22:08,800 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3' is missing {'config.json'}
DEBUG 2025-05-03 13:22:08,800 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3 is a directory
DEBUG 2025-05-03 13:22:08,998 instructlab.model.backends.vllm:294: vLLM serving command is: ['/opt/app-root/bin/python3.11', '-m', 'vllm.entrypoints.openai.api_server', '--host', '127.0.0.1', '--port', '56489', '--model', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', '--distributed-executor-backend', 'mp', '--served-model-name', '/var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1', 'mixtral-8x7b-instruct-v0-1', 'models/granite-3-1-8b-lab-v2', 'models/granite-3-1-8b-starter-v2', 'models/mixtral-8x7b-instruct-v0-1', 'models/prometheus-8x7b-v2-0', '--max-num-seqs', '512', '--enable-lora', '--enable-prefix-caching', '--max-lora-rank', '64', '--dtype', 'bfloat16', '--lora-dtype', 'bfloat16', '--fully-sharded-loras', '--lora-modules', 'skill-classifier-v3-clm=/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3', 'text-classifier-knowledge-v3-clm=/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3', '--tensor-parallel-size', '1']
INFO 2025-05-03 13:22:09,000 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:56489/v1
INFO 2025-05-03 13:22:09,000 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:56489/v1
INFO 2025-05-03 13:22:09,000 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:56489/v1, this might take a moment... Attempt: 1/120
[... attempts 2/120 through 30/120 omitted: the same "Waiting for the vLLM server to start" message repeats roughly every 3 s from 13:22:12 to 13:23:45 ...]
INFO 2025-05-03 13:23:48,626 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:56489/v1, this might take a moment... Attempt: 31/120
INFO 2025-05-03 13:23:51,896 instructlab.model.backends.vllm:180: vLLM startup failed. Retrying (1/1)
ERROR 2025-05-03 13:23:51,897 instructlab.model.backends.vllm:185: vLLM failed to start. Retry with --enable-serving-output to learn more about the failure.
INFO 2025-05-03 13:23:51,897 instructlab.model.backends.vllm:115: Trying to connect to model server at http://127.0.0.1:8000/v1
DEBUG 2025-05-03 13:23:53,293 instructlab.model.backends.vllm:119: Using available port 33677 for temporary model serving.
DEBUG 2025-05-03 13:23:53,295 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/skills-adapter-v3' is missing {'config.json'}
DEBUG 2025-05-03 13:23:53,295 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/skills-adapter-v3 is a directory
DEBUG 2025-05-03 13:23:53,295 instructlab.utils:780: '/var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3' is missing {'config.json'}
DEBUG 2025-05-03 13:23:53,295 instructlab.utils:804: GGUF Path /var/home/azureuser/.cache/instructlab/models/knowledge-adapter-v3 is a directory
DEBUG 2025-05-03 13:23:53,504 instructlab.model.backends.vllm:294: vLLM serving command is: [same argv as above, with '--port', '33677']
INFO 2025-05-03 13:23:53,505 instructlab.model.backends.vllm:332: vLLM starting up on pid 109 at http://127.0.0.1:33677/v1
INFO 2025-05-03 13:23:53,505 instructlab.model.backends.vllm:123: Starting a temporary vLLM server at http://127.0.0.1:33677/v1
INFO 2025-05-03 13:23:53,505 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:33677/v1, this might take a moment... Attempt: 1/120
[... attempts 2/120 through 36/120 omitted: the same "Waiting for the vLLM server to start" message repeats roughly every 3 s from 13:23:56 to 13:25:49 ...]
INFO 2025-05-03 13:25:52,647 instructlab.model.backends.vllm:138: Waiting for the vLLM server to start at http://127.0.0.1:33677/v1, this might take a moment... Attempt: 37/120
failed to generate data with exception: Failed to start server: vLLM failed to start. Retry with --enable-serving-output to learn more about the failure.
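Since the same image serves chat fine, the failure is specific to this SDG launch; even so, basic accelerator checks on the MI300X host help rule out a ROCm-level problem before retrying. A minimal diagnostic sketch, assuming the standard ROCm tooling is present on the image:

# Sanity-check GPU visibility on the MI300X host (assumes ROCm tools are installed).
rocm-smi    # GPUs, VRAM, and utilization
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
# ROCm builds of PyTorch report devices through the torch.cuda API.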
- is blocked by RHELAI-4086: Update support to vLLM 0.8.z to pull in LoRA fix (Closed)